I think Anna means that if artifacts and models are stored on the clearml fileserver, their path will contain the IP or domain of the fileserver. If you then move the fileserver to a different host, all the URLs are broken since the host changed.
SuccessfulKoala55 Hey, for us artifact download URLs, model download URLs, images in plots and debug image URLs are broken. In the linked example I can see a solution for the debug images and potentially the plot images, but I can't find the artifact and model URLs inside ES. Are those URLs maybe stored inside the MongoDB? Any idea where to find them?
thanks for the info, that's really bad 😬 I thought that the output_uri defaults to the fileserver 🙄
thanks a lot, yes it was the daemon :man-facepalming: I was already able to recover one checkpoint!
Also, how much memory is allocated for ES? (it's in the docker-compose file)
I already increased the memory to 8GB after reading about similar issues here on Slack
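For reference, the Elasticsearch heap is set through an environment variable on the elasticsearch service in the clearml-server docker-compose file; a minimal sketch of what that can look like (service name, keys and the 8GB value are illustrative and may differ between clearml-server versions):

```yaml
# excerpt from docker-compose.yml -- illustrative, not the full service definition
services:
  elasticsearch:
    environment:
      # Elasticsearch JVM heap; raised here to 8GB as described above
      ES_JAVA_OPTS: "-Xms8g -Xmx8g"
```

The heap should stay well below the total memory available to the container/host, so the OS page cache still has room.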
SuccessfulKoala55 do you have any examples? I guess a lot of people face this issue
Solving the replica issue now allowed me to get better insights into why the one index is red.
` {
"index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
"shard" : 0,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2021-11-09T22:30:47.018Z",
"last_allocation_status" : "no_valid_shard_copy"
},
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because a...
since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
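If a stale copy of the primary shard still exists on disk, Elasticsearch can sometimes be told to accept it via the cluster reroute API; a minimal sketch, assuming the ES container is reachable on localhost:9200 and accepting that recent events may be lost (the node name is a placeholder and can be read from `_cat/nodes`):

```bash
# why is the shard unassigned? (this is the output quoted above)
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# find the name of the single ES node
curl -s 'localhost:9200/_cat/nodes?v'

# force-allocate the stale primary copy on that node, accepting possible data loss
curl -s -X POST 'localhost:9200/_cluster/reroute?pretty' \
  -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_stale_primary": {
      "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
      "shard": 0,
      "node": "<node-name>",
      "accept_data_loss": true
    }
  }]
}'
```

This only helps if a shard copy is still physically present; if the data directory itself is gone, there is nothing left to allocate.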
I will try to recover it, but anyway the learning is to fully separate the fileserver and any output location from mongo, redis and elastic. Also maybe it makes sense to improve the ES setup to have replicas
CostlyOstrich36 Thank you for your response, is there something like a public project roadmap?
The agents also share the clearml.conf file, which causes some issues with the worker_id/worker_name. They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
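One way to give each agent a distinct identity despite the shared clearml.conf is to derive the worker id from the host (and GPU) in the environment that launches each daemon; a minimal sketch, assuming a queue named default and GPU 0 (if the variable seems to be ignored, it is worth checking that it is exported in the same shell that actually starts the daemon):

```bash
# give this daemon a unique worker id instead of the shared one from clearml.conf
export CLEARML_WORKER_ID="$(hostname):gpu0"

# start the agent as usual; it should pick up the id from the environment
clearml-agent daemon --queue default --gpus 0 --docker --detached
```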
Hi AgitatedDove14 one more question about efficient caching, is it possible to cache/share docker images between agents?
the cache on the host is mounted via NFS, and the NFS server was configured to not allow the clients to do root operations
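That restriction usually comes from root_squash on the NFS export; for comparison, a sketch of an /etc/exports line on the NFS server that lets root on the clients keep its privileges (path and client subnet are placeholders, and no_root_squash has obvious security implications):

```
# /etc/exports on the NFS server -- placeholder path and client subnet
/clearml-cache  10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
```

After editing the exports file, `exportfs -ra` re-reads it on the server.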
Exactly, all agents should share the cache that is mounted via nfs. I think it is working now 🙂
Hey Natan, good point! But I have actually set both; are they in conflict?
hm... Now with commenting it out I have the following problem:
docker_pip_cache = /clearml-cache/pip-cache
On host: drwxrwxrwx 5 root root 5 Mar 10 17:17 pip-cache
In task logs: chown: changing ownership of '/root/.cache/pip': Operation not permitted
Executing: ['docker', 'run', ......]
chown: changing ownership of '/root/.cache/pip': Operation not permitted
Get:1 ... focal-security InRelease [114 kB]
Get:2 ... focal InRelease [265 kB]
Get:3 ... focal-updates InRelease [114 kB]
It is at the top of the logs
It appears in multiple places. It seems like the mapping of the pip and apt caches does work, but the access rights are now an issue
` # pip cache folder mapped into docker, used for python package caching
docker_pip_cache = /clearml-cache/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = /clearml-cache/apt-cache
docker_internal_mounts {
apt_cache: "/clearml-cache/apt-cache"
pip_cache: "/clearml-cache/pip-cache"
vcs_cache: "/clearml-cache/vcs-cache"
venv_build: "/clearml-cache/venvs-builds"
pip_download: "/cl...
W: chown to _apt:root of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
W: chmod 0700 of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
Collecting pip==20.1.1
So I don't need docker_internal_mounts at all?
AgitatedDove14 one more thing regarding the initial question: apt-cache, pip-cache, pip-download-cache, vcs-cache and venvs-cache contain data on the shared clearml-cache, but venvs-build does not? What sort of data would be stored in the venvs-build folder? I do have venvs_dir = /clearml-cache/venvs-builds specified in the clearml.conf
or only not for apt and pip?
I do have this setting in my clearml.conf file: venvs_cache: { free_space_threshold_gb: 50.0 path: /clearml-cache/venvs-cache }
So it should cache the venvs, right? I also see content in the /clearml-cache/venvs-cache folder. Because I have venvs_cache configured there is nothing in venvs-build, since it uses the cache?
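For context, the two settings live next to each other in the agent section of clearml.conf; a sketch of how they relate, as I understand it (values are taken from the messages above, and the comments are my reading of the behaviour rather than official documentation):

```
agent {
    # per-task virtualenvs are created/restored here while a task runs
    venvs_dir: /clearml-cache/venvs-builds

    # finished virtualenvs are packed into this cache and reused by later
    # tasks with matching requirements, which is why venvs-builds can stay
    # almost empty when the cache is enabled
    venvs_cache: {
        free_space_threshold_gb: 50.0
        path: /clearml-cache/venvs-cache
    }
}
```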
` 2021-05-06 13:46:34.032391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:a1:00.0 name: NVIDIA Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-05-06 13:46:34.032496: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: ...