So I don't need docker_internal_mounts at all? Or only not for apt and pip?
probably found the issue
Exactly, all agents should share the cache that is mounted via nfs. I think it is working now 🙂
The agents also share the clearml.conf file, which causes an issue with the worker_id/worker_name: they all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
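For reference, a minimal launcher sketch of the kind of per-agent randomization I mean (the script and the gpu index parameter are illustrative, not from my actual setup):
` #!/bin/bash
# give every agent a unique worker id so agents sharing clearml.conf
# over NFS don't all register as ubuntu:gpu0
GPU_INDEX=${1:-0}  # illustrative parameter: which GPU this agent owns
export CLEARML_WORKER_ID="$(hostname):gpu${GPU_INDEX}"
clearml-agent daemon --gpus "${GPU_INDEX}" --queue default --docker `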
I want to cache as much as possible, and /clearml-cache/venvs-cache (on the host) does contain cached venvs. But /clearml-cache/venvs-builds is empty. My question was how to also cache venvs_builds.
` # pip cache folder mapped into docker, used for python package caching
docker_pip_cache = /clearml-cache/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = /clearml-cache/apt-cache
docker_internal_mounts {
    apt_cache: "/clearml-cache/apt-cache"
    pip_cache: "/clearml-cache/pip-cache"
    vcs_cache: "/clearml-cache/vcs-cache"
    venv_build: "/clearml-cache/venvs-builds"
    pip_download: "/cl...
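For completeness, the venvs-builds side seems to be controlled by separate keys in the agent section; a sketch of what enabling venv caching might look like (paths and values are my guesses based on the mounts above, please double-check against the reference clearml.conf):
` agent {
    # where virtual environments are built
    venvs_dir: /clearml-cache/venvs-builds
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space to allow for a cache entry
        free_space_threshold_gb: 2.0
        # setting path enables virtual environment caching
        path: /clearml-cache/venvs-cache
    }
} `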
That's it? no apparent error?
After the logs at the top there were only "info"-level logs from PluginsService
` elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms8g -Xmx8g
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_no...
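Side note: for bootstrap.memory_lock to take effect the container usually also needs unlimited memlock ulimits; a sketch of the compose stanza (assuming it is not already in the part of the file cut off above):
` ulimits:
      memlock:
        soft: -1
        hard: -1 `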
Using top inside the elasticsearch container shows `elastic+ 20 0 17.0g 8.7g 187584 S 2.3 27.2 1:09.18 java`, i.e. that the 8g are reserved. So setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.
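One way to double-check the heap the JVM actually picked up (standard Elasticsearch API, not specific to this setup):
` # look for heap_max_in_bytes in the node JVM stats
curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_max `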
 so you say deleting other old indices that I don't need could help?
This did not help, I still have the same issue
Yes, this happened when the disk got filled up to 100%
I already increased the memory to 8GB after reading about similar issues here on the Slack
Just making sure, how exactly did you do that?
docker-compose down
` elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms8g -Xmx8g `
docker-compose up -d
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
Ok, so if I wanted a different behaviour I would need one agent per task, right?
W: chown to _apt:root of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
W: chmod 0700 of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
Collecting pip==20.1.1
hm... Now with commenting it out I have the following problem:
docker_pip_cache = /clearml-cache/pip-cache
On host: drwxrwxrwx 5 root root 5 Mar 10 17:17 pip-cache
In task logs: chown: changing ownership of '/root/.cache/pip': Operation not permitted
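A possible workaround sketch, assuming the NFS-exported cache may be made world-writable (my assumption; it trades strictness for convenience):
` # on the NFS host: let the non-root user inside the task container
# write to the shared pip cache instead of chown-ing it
sudo chmod -R a+rwX /clearml-cache/pip-cache `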
CostlyOstrich36 Thank you for your response, is there something like a public project roadmap?
I will try to recover it, but anyway the lesson is to fully separate the fileserver and any output location from mongo, redis and elastic. Also maybe it makes sense to improve the ES setup to have replicas
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
This command made all my indices, besides the broken one which is still red, turn green again. It comes from https://stackoverflow.com/questions/63403972/elasticsearch-index-in-red-health/63405623#63405623 .
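To verify, the standard cat/health endpoints show the per-index status:
` # list all indices with their health and status
curl -s 'localhost:9200/_cat/indices?v'
# overall cluster health, should go from red to yellow/green
curl -s 'localhost:9200/_cluster/health?pretty' `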
Hey AgitatedDove14 , I fixed my code issue and am now able to train on multiple GPUs using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main thread, I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging. I really enjoy the automatic detection
I'm now running the code shown above and will let you know if there is still an issue
Executing: ['docker', 'run',......]
chown: changing ownership of '/root/.cache/pip': Operation not permitted
Get:1 focal-security InRelease [114 kB]
Get:2 focal InRelease [265 kB]
Get:3 focal-updates InRelease [114 kB]
It is at the top of the logs
thanks a lot, yes it was the daemon :man-facepalming: I already could recover one checkpoint!
` root@ubuntu:/opt/clearml# sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-11-09T12:49:13,403Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (//some_ip/clearml-server-data)]], net usable_space [3.4tb]...
SuccessfulKoala55 Hey, for us the artifact download URLs, model download URLs, plot images and debug image URLs are broken. In the linked example I can see a solution for the debug images and potentially the plot images, but I can't find the artifact and model URLs inside ES. Are those URLs maybe stored inside the mongodb? Any idea where to find them?
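A hypothetical way to peek, assuming a default clearml-server deployment where mongo runs in the clearml-mongo container and models sit in the backend database's model collection with a uri field (all of these names are assumptions to verify first):
` # inspect a few model URIs stored in MongoDB (names are assumptions)
sudo docker exec -it clearml-mongo mongo backend \
  --eval 'db.model.find({}, {uri: 1}).limit(5).forEach(printjson)' `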
so now there is the user conflict between the host and the agent inside the container
clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.0 results in the GPUs not being used because of missing libs.
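A quick sanity check, independent of the agent, is whether the container runtime exposes the GPUs to that image at all (the ubuntu18.04 tag suffix below is my guess; the command above says ubuntu18.0):
` # verify the GPUs are visible inside the CUDA runtime image
docker run --rm --gpus all nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 nvidia-smi `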