What deployment are you using? Docker-compose?
SuccessfulKoala55 How should I pass this variable? Do I need to create a file apiserver.conf in the folder /opt/clearml/config and write just CLEARML_USE_GUNICORN=1 there? Do I need to restart the server after that?
Do we even have an option to assign an ID to each agent? https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_daemon
Did you change anything in the compose file or are you using the default settings?
SuccessfulKoala55 We are encountering a strange problem. We are spinning up N agents using a script, in a loop.
But not all agents are visible as workers (we check it both in the UI and by running workers_list = client.workers.get_all()).
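A minimal sketch of the check described above: compare the worker IDs the script expects with the ones the server reports. It assumes the client is an APIClient from clearml.backend_api.session.client and that each returned worker entry has an id attribute; the function names and the naming scheme in the usage note are hypothetical.

```python
from typing import Iterable, List


def missing_workers(expected_ids: Iterable[str], registered_ids: Iterable[str]) -> List[str]:
    """Return the expected worker IDs that the server did not report, sorted."""
    return sorted(set(expected_ids) - set(registered_ids))


def fetch_registered_ids() -> List[str]:
    """Ask the ClearML server for the IDs of all currently registered workers."""
    # Imported here so missing_workers() can be used without a configured clearml.conf.
    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    # Assumption: every entry returned by workers.get_all() exposes an `id` attribute.
    return [worker.id for worker in client.workers.get_all()]
```

For example, with a hypothetical naming scheme, missing_workers([f"hpo-agent-{i:03d}" for i in range(200)], fetch_registered_ids()) would list the agents that never showed up.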
Do you think it's possible that too many of them are connecting at once, and that we could solve this by adding a delay between starting subsequent agents?
You need to set that in the environment section of the apiserver service in the docker-compose.yaml file. And yes, you'll need to run docker-compose up again
Well, my first question would be what is the worker name/id assigned to each one? Using the same ID might hide some of them?
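To make the two ideas in this thread concrete (a distinct ID per agent, and a pause between launches), here is a rough sketch of a launcher loop. It assumes clearml-agent reads the worker ID from a CLEARML_WORKER_ID environment variable, which should be verified against the clearml-agent documentation; the ID prefix, queue name, and delay are hypothetical, and nothing in the thread confirms that a delay actually fixes the missing workers.

```python
import os
import subprocess
import time
from typing import Dict, List


def agent_launch_plan(count: int, queue: str, prefix: str = "hpo-agent") -> List[Dict]:
    """Build one launch spec per agent, each with its own worker ID."""
    plan = []
    for i in range(count):
        plan.append({
            "worker_id": f"{prefix}-{i:03d}",  # hypothetical naming scheme
            "cmd": ["clearml-agent", "daemon", "--queue", queue, "--detached"],
        })
    return plan


def launch_agents(plan: List[Dict], delay_sec: float = 2.0) -> None:
    """Start the agents one by one, pausing between launches."""
    for spec in plan:
        # Assumption: clearml-agent picks up its worker ID from CLEARML_WORKER_ID.
        env = dict(os.environ, CLEARML_WORKER_ID=spec["worker_id"])
        subprocess.run(spec["cmd"], env=env, check=True)
        time.sleep(delay_sec)
```

Calling launch_agents(agent_launch_plan(200, "default")) would start 200 detached agents on the "default" queue, two seconds apart.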
SuccessfulKoala55 we used the default docker-compose file.
Is there a way to give the server more resources to help it somehow?
SuccessfulKoala55 could we run a server with some verbose logging?
SuccessfulKoala55 hmm, we are trying to do something like that and we are encountering problems. We are running a big hyperparameter optimization on 200 workers and some tasks are failing (while with fewer workers they do not fail). The UI also has some problems with that. Maybe there are some settings that should be changed compared to the default configuration?
OK, I think what you need to do is scale up the number of apiserver worker processes - pass the CLEARML_USE_GUNICORN=1 environment variable to the apiserver service. This should start 8 processes (by default) instead of one; see if it helps. By the way, while the number of processes can be set even higher, I assume that at some point you'll start having issues with load on the elasticsearch service, which is not that easy to scale up.
That depends on what the workers are doing, but in general such a spec should definitely work