
as for agent.default_docker.arguments:
add to the conf?
default_docker: {
    arguments: ["--shm-size=8G"]
}
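For reference, a minimal sketch of where that block would sit in the agent's clearml.conf, assuming the standard layout (the image value below is just a placeholder):
agent {
    default_docker: {
        # placeholder; keep whatever default image your agent already uses
        image: "<your-default-image>"
        # arguments appended to the `docker run` command for the default image
        arguments: ["--shm-size=8G"]
    }
}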
yes. the task's last update was at 03:21 on Feb 28.
here are some lines from the log:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
way after the task's last update I can see a couple of WARNINGs in the log. To be honest, I'm not sure if they relate to the same task or a new one; nevertheless I'll add them, maybe they will help (I replaced the company value with <xxx>):
...
@<1523701070390366208:profile|CostlyOstrich36>
it was the only task @<1523701087100473344:profile|SuccessfulKoala55>
did you encounter something like this?
just a recap: the task status was running, but it seemed to be stuck. nvidia-smi showed the GPU still had memory allocated, which rules out the web server disconnecting from the agent while the agent actually finished. If someone had used the GPU outside ClearML, I would expect some sort of CUDA crash in the agent's run.
well, if that's the case, this is the first out of many experiments on almost the same code. Let's hope I won't see this issue again.
@<1523701087100473344:profile|SuccessfulKoala55> @<1523701070390366208:profile|CostlyOstrich36> - thank you for your time and help
isn't this the worker output: /tmp/.clearml_agent_out.t3g81c0n.txt ?
I'm kinda new to ClearML, so forgive me for mixing up terms.
Hi,
I recently migrated my ClearML server to a different machine. I copied the whole data folder as recommended above. On the new ClearML server I can see all my old experiments and datasets. Unfortunately, when running a task with a dataset from the previous machine, the task fails and prints the old server IP:
2023-03-12 12:55:59,934 - clearml.storage - ERROR - Could not download None .............
I replaced the old IP with the new one everywhere I could find it.
i...
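For reference, the client-side pointers to the server live in the api section of clearml.conf; a minimal sketch assuming the default ports, with <new-ip> and the credentials as placeholders:
api {
    # point all three endpoints at the new machine (default ports shown)
    web_server: "http://<new-ip>:8080"
    api_server: "http://<new-ip>:8008"
    files_server: "http://<new-ip>:8081"
    credentials {
        # placeholder credentials generated from the new server's web UI
        access_key: "<access-key>"
        secret_key: "<secret-key>"
    }
}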
I was hoping there is a way to keep the artifacts but "clean" the reported metrics, plots, and debug samples.
thanks anyway
@<1523701070390366208:profile|CostlyOstrich36> @<1577830978284425216:profile|ContemplativeButterfly4>
yes, the lines above were from the task log. Let me add more info from it:
task yyy pulled from zzz by worker www # first line
Running Task xxx inside default docker: <my docker name> arguments: [] # second line on the task log
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '--shm-size', '8G', ...] # beginning of the third line
agent.extra_docker_arguments.0 = --shm-size # later on
agent.extra_docker_arguments.1 = 8G # later on
default_docker: {
    arguments: ["--shm-size", "8G"]
}
the above seems to do the trick.
second line of the web console output:
Running Task xxx inside default docker: <my docker name> arguments: ['--shm-size', '8G']
later on:
agent.default_docker.arguments.0 = --shm-size
agent.default_docker.arguments.1 = 8G
later on:
docker_cmd = <my docker name> --shm-size 8G
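So both routes shown in this thread inject the flag. A consolidated sketch of the two options in the agent section of clearml.conf (as I understand it, extra_docker_arguments is added to every container the agent launches, while default_docker.arguments only applies to the default image; pick one):
agent {
    # option 1: arguments only for the default docker image
    default_docker: {
        arguments: ["--shm-size", "8G"]
    }
    # option 2: extra arguments added to every docker run issued by the agent
    extra_docker_arguments: ["--shm-size", "8G"]
}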
thank you for your help @<1523701070390366208:profile|CostlyOstrich36> :)
I looked in clearml_server/logs/apiserver.log:
the last report was at 2023-02-28 08:39:27,981, nothing wrong there.
Looking for the last update message at 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_re...