Hi CostlyOstrich36, this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
Well actually I do see many errors like that in the browser console:
So there are two possible cases for trains-agent-1:
It picks a new experiment -> the "workers" tab randomly shows one of the two experiments
There is no new experiment in the default queue to start -> it randomly shows either no experiment or the one that it is running
by mistake I have two agents started in one machine
Answering myself: Yes, Task.set_base_docker RTFM!!!
You mean it will resolve by itself in the following days or should I do something? Or there is nothing to do and it will stay this way?
AgitatedDove14 I have a machine with two gpus and one agent per GPU. I provide the same trains.conf to both agents, so they use the same directory for caching venvs. Can it be problematic?
(BTW: it will work with elevated credentials, but probably not recommended)
What does that mean? I'm not sure I understand
So I need to have this merging of small configuration files to build the bigger one
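A minimal sketch of that kind of merge, assuming the small configuration files have already been parsed into nested dictionaries. The helper names and the keys below are hypothetical, made up for illustration:

```python
from copy import deepcopy
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged recursively; any other value is replaced.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def merge_all(*configs: dict) -> dict:
    """Merge any number of fragments, later ones taking precedence."""
    return reduce(deep_merge, configs, {})


# Hypothetical fragments standing in for the small configuration files.
model_cfg = {"model": {"name": "resnet", "depth": 18}}
train_cfg = {"train": {"lr": 0.1}, "model": {"depth": 50}}

full_cfg = merge_all(model_cfg, train_cfg)
```

Later fragments override earlier ones key by key, and the input fragments are left untouched, so the same small files can be reused to build several bigger configurations.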
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that the elasticsearch does not return all the requested logs when it is queried from the WebUI to display it in the console?
Now that I think about it, I remember that on the changelog of the clearml-server 1.2.0 the following is listed:
Fix UI Workers & Queues and Experiment Table pages ...
Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
Whohoo! Thanks 👌
Ok I have a very different problem now: I did the following to restart the ES cluster:
docker-compose down
docker-compose up -d
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, as it had always done so far.
I carry this code from older versions of trains to be honest, I don't remember precisely why I did that
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
So I cannot ssh anymore to the agent after starting clearml-session on it
I have a mental model of the clearml-agent as a module to spin my code somewhere, and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
There it is: https://github.com/allegroai/clearml/issues/493
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false
I need to call task.set_initial_iteration(0)
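To illustrate why resetting the initial iteration to 0 can remove the doubling, here is a toy model of the arithmetic. It assumes the logger simply adds an initial-iteration offset to every step it receives; this is an illustration only, not ClearML's actual implementation, and `reported_iterations` and the value 30 are hypothetical:

```python
from typing import List


def reported_iterations(initial_iteration: int, steps: range) -> List[int]:
    """Toy logger: every reported step is shifted by `initial_iteration`."""
    return [initial_iteration + step for step in steps]


last_iteration = 30  # hypothetical: the run was aborted after iteration 30

# Case 1: the training loop restarts its own counter at 0 and relies on the offset.
offset_only = reported_iterations(last_iteration + 1, range(0, 3))

# Case 2: the training loop already resumes from the checkpoint step,
# and the offset is applied on top of it, so iterations are counted twice.
doubled = reported_iterations(last_iteration + 1, range(31, 34))

# Case 3: the loop resumes from the checkpoint step and the offset is reset to 0.
fixed = reported_iterations(0, range(31, 34))

print(offset_only, doubled, fixed)
```

In this model, cases 1 and 3 both continue at iteration 31, while case 2 jumps to 62, which matches the kind of doubling described above when the code restores its own step counter.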
mmh it looks like what I was looking for, I will give it a try 🙂
I also tried task.set_initial_iteration(-task.data.last_iteration) , hoping it would counteract the bug, didn’t work
AgitatedDove14 I do continue an aborted Task yes - So I shouldn’t even need to call the task.set_initial_iteration function, interesting! Do you have any idea what could be the reason for the behavior I am observing? I am trying to find ways to debug it
Now I'm curious, what did you end up doing ?
in my repo I maintain a bash script to setup a separate python env. then in my task I spawn a subprocess and I don't pass the env variables, so that the subprocess properly picks up the separate python env
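A rough sketch of that approach, assuming the goal is only to stop the parent's environment variables from leaking into the child process. `run_in_clean_env` and the list of kept variables are hypothetical, and `sys.executable` stands in for the interpreter of the separate python env:

```python
import os
import subprocess
import sys


def run_in_clean_env(cmd, keep=("PATH", "HOME", "LANG")):
    """Run `cmd` with only the variables named in `keep` inherited from the parent."""
    clean_env = {name: os.environ[name] for name in keep if name in os.environ}
    return subprocess.run(
        cmd,
        env=clean_env,        # replaces the inherited environment entirely
        capture_output=True,
        text=True,
        check=True,           # raise if the child exits with a non-zero code
    )


# Stand-in for the separate interpreter; in practice this would be the
# python binary created by the setup script.
separate_python = sys.executable

result = run_in_clean_env(
    [separate_python, "-c", "import os; print('PYTHONPATH' in os.environ)"]
)
print(result.stdout.strip())
```

Passing an explicit `env` mapping to `subprocess.run` replaces the inherited environment instead of extending it, so variables such as `PYTHONPATH` or `VIRTUAL_ENV` set for the parent process do not reach the child.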
Yes, I would like to update all references to the old bucket unfortunately… I think I’ll simply delete the old s3 bucket, wait for its name to be available again, recreate it on the other aws account and move the data there. This way I don’t have to mess with clearml data - I am afraid to do something wrong and lose data