Now I know which experiments have the most metrics. I want to downsample these metrics by a factor of 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents whose iteration is not a multiple of 10?
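To make it concrete, this is roughly the delete-by-query I have in mind (in Python, via requests; the index pattern and the field names are guesses on my side):
```python
import requests

# Sketch only: the index pattern and field names ("task", "iter") are my
# assumptions about how the events are stored -- please double-check them.
ES_URL = "http://localhost:9200"
INDEX = "events-training_stats_scalar-*"   # assumed index pattern
TASK_ID = "<task_id>"                      # the experiment to downsample

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"task": TASK_ID}},
                # keep multiples of 10 -> delete everything else
                {"script": {"script": {"source": "doc['iter'].value % 10 != 0"}}},
            ]
        }
    }
}

# I would first run the same query against _count to see how many documents
# would be removed before actually deleting anything.
resp = requests.post(f"{ES_URL}/{INDEX}/_delete_by_query", json=query)
print(resp.json())
```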
Hi AgitatedDove14, I upgraded to 1.3.1 and the bug of missing logs in the console is still there… 😞
I made another recording so that you can understand what it is about:
- I enqueue a task
- The task starts, but the logs shown in the console are very sparse
- I scroll up and down to try to fetch the missing logs, without success
- I download the logs, open the file, and there I see the full logs
Hi DeterminedCrab71, Version: 1.1.1-135 • 1.1.1 • 2.14
Thanks for the clarification SuccessfulKoala55! A follow-up question:
I would like to install several packages (opencv, numpy, torch) in the system site-packages so that they are available in every experiment (to reduce experiment setup time). Installing them globally via
(docker was installed with sudo snap install docker )
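On the system-site-packages question above, I was thinking of something like this in the agents' clearml.conf / trains.conf (assuming the system_site_packages option does what I think it does):
```
# agent-side configuration, sketch only
agent {
    package_manager {
        # let the per-experiment venv see the globally installed packages
        # (opencv, numpy, torch, ...) instead of reinstalling them every time
        system_site_packages: true
    }
}
```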
AgitatedDove14 Yes, I have xpack security disabled, as in the link you shared (note that it's xpack.security.enabled: "false", with quotes around false), but this command throws:
{"error":{"root_cause":[{"type":"parse_exception","reason":"request body is required"}],"type":"parse_exception","reason":"request body is required"},"status":400}
Ah, got it, I am on a self-hosted server, that’s why I don’t see it
Hi CostlyOstrich36, this weekend I took a look at the diff against the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
Well actually I do see many errors like that in the browser console:
So there are two possible cases for trains-agent-1, either:
- It picks a new experiment -> the "workers" tab randomly shows one of the two experiments
- There is no new experiment in the default queue to start -> the "workers" tab randomly shows either no experiment or the one it is currently running
By mistake I have two agents started on one machine
Answering myself: Yes, Task.set_base_docker RTFM!!!
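For anyone finding this later, this is roughly how I use it now (the image and the extra argument are just an example):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker base image")

# Tell the agent (running in docker mode) which image to use for this task.
# Depending on the clearml version, set_base_docker takes a single command
# string or separate docker_image / docker_arguments keyword arguments.
task.set_base_docker("nvidia/cuda:11.4.2-runtime-ubuntu20.04 --ipc=host")
```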
You mean it will resolve by itself in the coming days, or should I do something? Or is there nothing to do and it will stay this way?
AgitatedDove14 I have a machine with two GPUs and one agent per GPU. I provide the same trains.conf to both agents, so they use the same directory for caching venvs. Could that be problematic?
(BTW: it will work with elevated credentials, but that's probably not recommended)
What does that mean? I'm not sure I understand
So I need this merging of small configuration files to build the bigger one
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem, unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that elasticsearch does not return all the requested logs when it is queried by the WebUI to display them in the console?
Now that I think about it, I remember that in the changelog of clearml-server 1.2.0 the following is listed:
`Fix UI Workers & Queues and Experiment Table pages ...`
Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
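In case it helps to reproduce, this is a stripped-down version of what I mean by resuming (project/task names are simplified):
```python
from clearml import Task

# Resume the previous run of this task instead of starting a new one
task = Task.init(
    project_name="examples",
    task_name="resume test",
    continue_last_task=True,
)

start = task.data.last_iteration  # this value looks correct after the resume
logger = task.get_logger()

for i in range(start, start + 100):
    # the iterations shown in the UI end up roughly doubled relative to `i`
    logger.report_scalar(title="loss", series="train", value=0.0, iteration=i)
```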
Woohoo! Thanks 👌
Ok, I have a very different problem now: I did the following to restart the ES cluster:
```
docker-compose down
docker-compose up -d
```
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which had always been the case so far.
I carried this code over from older versions of trains; to be honest, I don't remember precisely why I did that
The cloning is done in another task, which has the argv parameters that I want the cloned task to inherit
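Concretely, the controller task does something along these lines (the names and the parameter mapping are just illustrative):
```python
import sys

from clearml import Task

# Controller task: it receives some argv values and clones a template task,
# forwarding those values as parameters of the clone before enqueueing it.
controller = Task.init(project_name="examples", task_name="controller")

template = Task.get_task(project_name="examples", task_name="template")
cloned = Task.clone(source_task=template, name="clone with inherited argv")

# e.g. forward the first CLI argument as an "Args" parameter of the clone
cloned.set_parameters({"Args/learning_rate": sys.argv[1]})

Task.enqueue(cloned, queue_name="default")
```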
So I cannot ssh to the agent machine anymore after starting clearml-session on it
I have a mental model of the clearml-agent as a module that spins up my code somewhere, and the Python version running my code should not depend on the Python version running the clearml-agent (especially for experiments running in containers)