Do you mean it will resolve by itself in the following days, or should I do something? Or is there nothing to do and it will stay this way?
AgitatedDove14 I have a machine with two gpus and one agent per GPU. I provide the same trains.conf to both agents, so they use the same directory for caching venvs. Can it be problematic?
(BTW: it will work with elevated credentials, but probably not recommended)
What does that mean? I'm not sure I understand
So I need to have this merging of small configuration files to build the bigger one
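One way such a merge could look is a recursive dictionary merge, sketched below. The fragment names and keys (`model_cfg`, `train_cfg`, `depth`, `lr`) are hypothetical and only illustrate the idea; this is not tied to any ClearML API.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where override wins and nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def build_config(fragments):
    """Fold a sequence of small config dicts into one bigger config."""
    config = {}
    for fragment in fragments:
        config = deep_merge(config, fragment)
    return config


# Hypothetical fragments, e.g. loaded from separate small files.
model_cfg = {"model": {"name": "resnet", "depth": 18}}
train_cfg = {"train": {"lr": 0.1}, "model": {"depth": 50}}

full_cfg = build_config([model_cfg, train_cfg])
```

Later fragments override earlier ones, and the inputs are left unmodified because every level is copied before merging.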
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that Elasticsearch does not return all the requested logs when it is queried from the Web UI to display them in the console?
Now that I think about it, I remember that the changelog of clearml-server 1.2.0 lists the following:
` Fix UI Workers & Queues and Experiment Table pages ...
Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
Whohoo! Thanks 👌
Ok I have a very different problem now: I did the following to restart the ES cluster: docker-compose down, then docker-compose up -d. And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which was always the case so far.
I carry this code from older versions of trains to be honest, I don't remember precisely why I did that
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit
So I can no longer ssh to the agent after starting clearml-session on it
I have a mental model of the clearml-agent as a module to spin up my code somewhere, and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
There it is: https://github.com/allegroai/clearml/issues/493
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false, and I need to call task.set_initial_iteration(0)
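To make the doubling effect easier to reason about, here is a toy model of it. It assumes the reporter adds an initial-iteration offset to every reported step, while the training code already reports absolute steps restored from its checkpoint. `ToyReporter` is a hypothetical stand-in, not the ClearML API, and `last_iteration = 30` is an assumed value chosen to match resuming at iteration 31.

```python
class ToyReporter:
    """Toy stand-in for a metrics reporter (NOT the ClearML API)."""

    def __init__(self, initial_iteration=0):
        # Offset added to every reported iteration (assumed behavior).
        self.initial_iteration = initial_iteration
        self.points = []

    def report(self, iteration, value):
        self.points.append((iteration + self.initial_iteration, value))


last_iteration = 30  # assumed: last step logged before the task was aborted

# The resumed run reports absolute step 31, but an offset is also applied.
doubled = ToyReporter(initial_iteration=last_iteration)
doubled.report(31, 0.5)

# With the offset reset to 0, the absolute step is stored unchanged.
fixed = ToyReporter(initial_iteration=0)
fixed.report(31, 0.5)
```

Under these assumptions the first reporter stores the point at step 61 instead of 31, which is the kind of shift that resetting the initial iteration to 0 would remove.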
mmh it looks like what I was looking for, I will give it a try 🙂
I also tried task.set_initial_iteration(-task.data.last_iteration) , hoping it would counteract the bug, but it didn’t work
AgitatedDove14 I do continue an aborted Task yes - So I shouldn’t even need to call the task.set_initial_iteration function, interesting! Do you have any idea what could be the reason for the behavior I am observing? I am trying to find ways to debug it
Now I'm curious, what did you end up doing?
In my repo I maintain a bash script to set up a separate python env. Then in my task I spawn a subprocess and I don't pass the env variables, so that the subprocess properly picks up the separate python env
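A minimal sketch of spawning a child process without the parent's environment variables follows. The helper name `run_without_parent_env` and the variable `PARENT_ONLY_VAR` are hypothetical, and `sys.executable` stands in for the interpreter of the separate python env, whose path is not given here.

```python
import os
import subprocess
import sys


def run_without_parent_env(cmd, keep=("PATH",)):
    """Run cmd with a near-empty environment, keeping only the named variables."""
    env = {name: os.environ[name] for name in keep if name in os.environ}
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Hypothetical variable that exists only in the parent process.
os.environ["PARENT_ONLY_VAR"] = "set-by-parent"

# In real use, replace sys.executable with the separate env's interpreter.
child_code = "import os; print(os.environ.get('PARENT_ONLY_VAR'))"
result = run_without_parent_env([sys.executable, "-c", child_code])
```

Passing an explicit `env` mapping replaces the inherited environment entirely, so the child only sees what is listed in `keep`.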
Yes, I would like to update all references to the old bucket unfortunately… I think I’ll simply delete the old s3 bucket, wait for its name to be available again, recreate it on the other AWS account and move the data there. This way I don’t have to mess with clearml data - I am afraid of doing something wrong and losing data
Ha nice, makes perfect sense thanks AgitatedDove14 !
So probably only the main process (rank=0) should attach the ClearMLLogger?
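If the answer is yes, the guard could look like the sketch below. It assumes the launcher exposes the process rank through a `RANK` environment variable (as PyTorch's distributed launchers do), and `attach` is a hypothetical callable that would create and attach the logger.

```python
import os


def is_main_process(environ=None):
    """Return True when this process has rank 0 (or no rank is set at all)."""
    environ = os.environ if environ is None else environ
    return int(environ.get("RANK", "0")) == 0


def maybe_attach_logger(attach, environ=None):
    """Call attach() only on the main process; other ranks get None."""
    if is_main_process(environ):
        return attach()
    return None


# Hypothetical usage: simulate rank 0 and rank 1 with explicit environments.
on_rank0 = maybe_attach_logger(lambda: "logger attached", {"RANK": "0"})
on_rank1 = maybe_attach_logger(lambda: "logger attached", {"RANK": "1"})
```

Treating a missing `RANK` as rank 0 keeps single-process runs working without any extra configuration.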
v0.17.5rc2
It could be yes but the difference between now and last_report_time doesn’t match the difference I observe
Why is it required in the case where boto3 can figure them out itself within the ec2 instance?