# pip cache folder mapped into docker, used for python package caching
docker_pip_cache = /clearml-cache/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = /clearml-cache/apt-cache
docker_internal_mounts {
    apt_cache: "/clearml-cache/apt-cache"
    pip_cache: "/clearml-cache/pip-cache"
    vcs_cache: "/clearml-cache/vcs-cache"
    venv_build: "/clearml-cache/venvs-builds"
    pip_download: "/cl...
Exactly, all agents should share the cache that is mounted via nfs. I think it is working now 🙂
Hey AgitatedDove14 , I fixed my code issue and am now able to train on multiple GPUs using the https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py script. Since I create the ClearML Task in the main thread, I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging. I really enjoy the automatic detection.
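In case it helps, one pattern that might be worth trying (untested here, and the project/task names plus the loop below are just placeholders) is to keep Task.init() in the main process and have only the rank-0 worker re-attach to that task for reporting. Whether the spawned child picks up the parent task automatically depends on the clearml version, hence the fallback:

from clearml import Task
import torch.multiprocessing as mp

def worker(rank, world_size):
    logger = None
    if rank == 0:
        # try to re-attach to the task created in the main process; if the spawned
        # child does not see it, Task.init() should (I think) reuse the existing task id
        task = Task.current_task() or Task.init(project_name="fastmri", task_name="banding-removal-ddp")
        logger = task.get_logger()
    for step in range(10):            # dummy loop, stands in for the real training
        loss = 1.0 / (step + 1)
        if logger is not None:
            logger.report_scalar(title="train", series="loss", value=loss, iteration=step)

if __name__ == "__main__":
    Task.init(project_name="fastmri", task_name="banding-removal-ddp")
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)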
Can you send a more comprehensive log? Perhaps there are other related messages.
Which logs would you like?
That's it? no apparent error?
After the logs at the top there were only "info"-level logs from PluginsService.
Since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
Yes, this happened when the disk got filled up to 100%
Using top inside the elasticsearch container shows
elastic+ 20  0  17.0g  8.7g 187584 S  2.3 27.2  1:09.18 java
i.e. the 8g are reserved, so setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.
Very good news!
This did not help, I still have the same issue
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
SuccessfulKoala55 so you say deleting other old indices that I don't need could help?
Hey Natan, good point! But I have actually set both
So now there is a user conflict between the host and the agent inside the container.
The strange thing was that my agents were running in the morning but just disappeared from the clearml server UI under workers-and-queues. So I did docker-compose down / up and then I got this error.
This happens inside the agent, since I use task.execute_remotely() I guess. The agent runs on Ubuntu 18.04 and not in docker mode.
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
I think Anna means that if artifacts and models are stored on the clearml fileserver, their paths will contain the IP or domain of the fileserver. If you then move the fileserver to a different host, all the URLs break because the host changed.
Executing: ['docker', 'run', ......]
chown: changing ownership of '/root/.cache/pip': Operation not permitted
Get:1 focal-security InRelease [114 kB]
Get:2 focal InRelease [265 kB]
Get:3 focal-updates InRelease [114 kB
It is at the top of the logs
SuccessfulKoala55 I'm currently inside the docker container to recover the ckpt files. But /root/.clearml/venvs-builds seems to be empty. Any idea where I could then find the ckpt files?
I'm running the following agent:
clearml-agent --config-file /clearml-cache/config/clearml-cpu.conf daemon --queue cpu default services --docker ubuntu:20.04 --cpu-only --services-mode 4 --detached
The goal is to have an agent that can run multiple CPU-only tasks at the same time. I noticed that when enqueueing multiple tasks, all except one stay pending until the first one has finished downloading all packages and started executing code. And then, task by task, they switch to "run...
SuccessfulKoala55 Hey, for us the artifact download URLs, model download URLs, images in plots and debug image URLs are broken. In the linked example I can see a solution for the debug images and potentially the plot images, but I can't find the artifact and model URLs inside ES. Are those URLs maybe stored inside the MongoDB? Any idea where to find them?
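For what it's worth, my understanding is that the task metadata (including artifact and model URIs) lives in MongoDB rather than in ES, so the host could in principle be rewritten there. Below is only a rough, untested sketch of that idea; the database/collection names ("backend", "task", "model") and the hosts are assumptions on my side, and I would back up MongoDB before touching anything:

from pymongo import MongoClient

OLD_HOST = "http://old-fileserver:8081"   # placeholder: the host currently baked into the URLs
NEW_HOST = "http://new-fileserver:8081"   # placeholder: the new fileserver host

def rewrite(value):
    # replace the host only inside string values, leaving ObjectIds, dates etc. untouched
    if isinstance(value, str):
        return value.replace(OLD_HOST, NEW_HOST)
    if isinstance(value, dict):
        return {k: rewrite(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rewrite(v) for v in value]
    return value

client = MongoClient("mongodb://localhost:27017")    # assumed mongo endpoint
for name in ("task", "model"):                       # assumed collection names
    coll = client["backend"][name]
    for doc in list(coll.find({})):
        fixed = rewrite(doc)
        if fixed != doc:
            coll.replace_one({"_id": doc["_id"]}, fixed)

Debug image and plot URLs would be a different story, since those events sit in ES, which I guess is what the linked example covers.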
I can see the following using docker ps:
d5330ec8c47d allegroai/clearml-agent "/usr/agent/entrypoi…" 3 weeks ago  Up 3 weeks  clearml
I execute the following to access the container:
docker exec -u root -t -i clearml /bin/bash
I went to /root/.clearml/venv-builds but it is empty.
thanks for the info, that's really bad 😬 I thought that the output_uri defaults to the fileserver 🙄
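For reference, the way I understand it you have to point output_uri at the fileserver explicitly, either per task or once in the config. A minimal sketch (the URL and names below are placeholders for your own deployment):

from clearml import Task

task = Task.init(
    project_name="fastmri",                      # placeholder
    task_name="banding-removal",                 # placeholder
    output_uri="http://my-clearml-server:8081",  # fileserver URL; s3://... or gs://... should also work
)

I think the same default can be set once via sdk.development.default_output_uri in clearml.conf, so every Task.init() uploads there without the extra argument.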
thanks a lot, yes it was the daemon :man-facepalming: I could already recover one checkpoint!
SuccessfulKoala55 do you have any example? I guess a lot of people face this issue
Hi AgitatedDove14 , I get an error when running a task on my worker. I have looked into /home/user/.clearml/venvs-builds but it is empty. Any idea why this happens? I actually don’t know what I changed to cause this issue… I’m running clearml-agent v1.0.0
clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.
Ok, it is more of a docker issue. Reading the thread, I guess it is not feasible.