
I'm running the following agent: clearml-agent --config-file /clearml-cache/config/clearml-cpu.conf daemon --queue cpu default services --docker ubuntu:20.04 --cpu-only --services-mode 4 --detached
The goal is to have an agent that can run multiple CPU-only tasks at the same time. I noticed that when enqueueing multiple tasks, all except one stay pending
until the first one has finished downloading all packages and started code execution, and only then do the tasks switch, one by one, to "run...
The agents also share the clearml.conf
file, which causes some issues with the worker_id/worker_name: they all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
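For reference, a rough sketch of one way this could look: launching several agents from the shared config, each with its own CLEARML_WORKER_ID set through the process environment (the worker names are placeholders, and I am not sure this is the intended approach):

```python
# Rough sketch (not necessarily the intended approach): start several CPU-only
# agents from the same shared config file, each with a unique CLEARML_WORKER_ID
# set via its process environment. Worker names below are placeholders.
import os
import subprocess

for i in range(4):
    env = dict(os.environ, CLEARML_WORKER_ID="cpu-worker-{:02d}".format(i))
    subprocess.Popen(
        [
            "clearml-agent",
            "--config-file", "/clearml-cache/config/clearml-cpu.conf",
            "daemon",
            "--queue", "cpu", "default", "services",
            "--docker", "ubuntu:20.04",
            "--cpu-only",
            "--detached",
        ],
        env=env,
    )
```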
```
# pip cache folder mapped into docker, used for python package caching
docker_pip_cache = /clearml-cache/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = /clearml-cache/apt-cache
docker_internal_mounts {
    apt_cache: "/clearml-cache/apt-cache"
    pip_cache: "/clearml-cache/pip-cache"
    vcs_cache: "/clearml-cache/vcs-cache"
    venv_build: "/clearml-cache/venvs-builds"
    pip_download: "/cl...
```
Hi AgitatedDove14, one more question about efficient caching: is it possible to cache/share docker images between agents?
My code now produces an error inside one of the threads, but that should be an issue on my side. Still, this failure inside a child thread was not detected, and the training task ended up "completed". This error happens now with the Task.init inside the if __name__ == "__main__":
as seen above in the code snippet.
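For what it's worth, a minimal repro of that behaviour outside of ClearML (plain threading, nothing ClearML-specific): an exception in a child thread is printed but never propagates to the main thread, so the process exits with code 0 and the task looks completed instead of failed.

```python
# Minimal repro: the child thread raises, the traceback is printed by the
# threading machinery, but nothing reaches the main thread, so the process
# still exits with code 0.
import threading

def worker():
    raise RuntimeError("boom inside child thread")

t = threading.Thread(target=worker)
t.start()
t.join()  # join() does not re-raise the child's exception
print("main thread finished normally")
```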
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
This command made all my indices, apart from the broken one which is still red, turn green again. It comes from https://stackoverflow.com/questions/63403972/elasticsearch-index-in-red-health/63405623#63405623 .
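Same call as a quick Python sketch, in case it is useful (assumes the requests package and Elasticsearch reachable on localhost:9200):

```python
# Equivalent of the curl call above: set number_of_replicas to 0 on all indices.
import requests

resp = requests.put(
    "http://localhost:9200/_settings",
    json={"index": {"number_of_replicas": 0}},
)
print(resp.status_code, resp.json())
```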
Exactly, all agents should share the cache that is mounted via NFS. I think it is working now 🙂
We run a lot of pipelines that are CPU-only with some parallel steps. It's just about improving the execution time.
AgitatedDove14 one more thing regarding the initial question: apt-cache, pip-cache, pip-download-cache, vcs-cache and venvs-cache contain data on the shared clearml-cache, but venvs-build does not? What sort of data would be stored in the venvs-build folder? I do have venvs_dir = /clearml-cache/venvs-builds specified in the clearml.conf
But this would still be part of the clearml.conf, right? I would prefer a way around clearml.conf to avoid resolving the variables
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside of if __name__ == "__main__":
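I guess this is the standard multiprocessing behaviour, the start method can only be set once per process, which would explain why moving things under the main guard helps. A minimal illustration of the error itself:

```python
# The multiprocessing start method can only be set once per process; a second
# call raises the same "context has already been set" RuntimeError.
import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")
    try:
        mp.set_start_method("spawn")  # second call in the same process
    except RuntimeError as e:
        print(e)  # -> context has already been set
```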
I can figure out a way to resolve it, but is there any other way to get env vars / any value or secret from the host to the docker of a task?
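One possible workaround (not sure it is the intended way): attach the docker -e arguments to the task itself before execute_remotely(). The sketch below assumes a clearml version where Task.set_base_docker() accepts docker_arguments; MY_SECRET is just a placeholder, and since the value ends up stored with the task it is not suitable for real secrets.

```python
# Hedged sketch: forward a host value into the task's docker container by
# attaching a "-e" docker argument to the task before execute_remotely().
# Assumes Task.set_base_docker() takes docker_arguments (recent clearml).
# MY_SECRET is a placeholder; the value is stored with the task, so this is
# not suitable for real secrets.
import os
from clearml import Task

task = Task.init(project_name="dummy", task_name="env-forward-demo")
task.set_base_docker(
    docker_image="ubuntu:20.04",
    docker_arguments="-e MY_SECRET={}".format(os.environ.get("MY_SECRET", "")),
)
task.execute_remotely(queue_name="cpu")
```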
This happens inside the agent, I guess because I use task.execute_remotely(). The agent runs on Ubuntu 18.04 and not in docker mode
using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
```python
from clearml import Task  # import needed for the snippet below

if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point https://github.com/facebookresearch/fastMRI/bl...
Ok, it is more of a docker issue; I guess it is not feasible, reading the thread.
It is working now, it seemed like I pointed to a wrong entrypoint.sh in the docker-compose file. Still strange...
We do have a queue called office and another queue called default, so the agent is not listening for queues that are not defined. Or do I misunderstand something? The server has all queues defined that the agents are using
SuccessfulKoala55 do you have any example? I guess a lot of people face this issue
CostlyOstrich36 Thank you for your response, is there something like a public project roadmap?
One more thing: the dockerized version is still not working as I want it to. If I use any specific docker image like docker: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04
on a host machine with NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3
I always get a similar error to the one above, where a lib is missing. If I use the example from http://clear.ml (clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda)
I always get this error ` docker: Error...
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
hm... Now, with it commented out, I have the following problem: docker_pip_cache = /clearml-cache/pip-cache
On host: drwxrwxrwx 5 root root 5 Mar 10 17:17 pip-cache
In task logs: chown: changing ownership of '/root/.cache/pip': Operation not permitted
The cache on the host is mounted as NFS, and the NFS server was configured to not allow the clients to do root operations
I think Anna means that if artifacts and models are stored on the clearml fileserver, their paths will contain the IP or domain of the fileserver. If you then move the fileserver to a different host, all the URLs are broken since the host changed.
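A small sketch of what I mean (hypothetical project/task names, default fileserver setup assumed): the URL registered for an artifact embeds the fileserver host configured at upload time, so moving the fileserver later leaves stale URLs behind.

```python
# Illustration only: the registered artifact URL embeds the fileserver host/IP.
from clearml import Task

task = Task.init(project_name="demo", task_name="artifact-url-demo")
task.upload_artifact(name="results", artifact_object={"acc": 0.9})
task.flush(wait_for_uploads=True)
# Typically prints something like http://<fileserver-host>:8081/demo/..., i.e.
# the host is baked into the stored URL.
print(task.artifacts["results"].url)
```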
That's it? No apparent error?
After the logs at the top, there were only "info"-level logs from PluginsService