I can see the following using docker ps:
d5330ec8c47d allegroai/clearml-agent "/usr/agent/entrypoi…" 3 weeks ago Up 3 weeks clearml
I execute the following to access the container:
docker exec -u root -t -i clearml /bin/bash
I went to /root/.clearml/venv-builds but it is empty
thanks a lot, yes it was the daemon :man-facepalming: I could already recover one checkpoint!
SuccessfulKoala55 I'm currently inside the docker container to recover the ckpt files. But /root/.clearml/venvs-builds seems to be empty. Any idea where I could then find the ckpt files?
thanks for the info, that's really bad 😬 I thought that the output_uri defaults to the fileserver 🙄
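For reference, a minimal sketch of setting output_uri explicitly at task creation so checkpoints are uploaded instead of staying on the training machine's local disk; the project/task names are placeholders:
```python
from clearml import Task

# Upload output models/checkpoints instead of leaving them on local disk.
# output_uri=True uses the default files server from clearml.conf;
# an explicit s3://... or https://files... URL also works.
task = Task.init(project_name="dummy",
                 task_name="pretraining",
                 output_uri=True)
```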
Hey AgitatedDove14 , I fixed my code issue and am now able to train on multiple gpus using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main thread, I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging. I really enjoy the automatic detection.
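For context, a minimal sketch of the manual fallback mentioned above: re-attaching to the parent task from inside a spawned worker (here via Task.get_task by id, since the worker is a separate process) and reporting explicitly. The worker/rank handling is hypothetical and not taken from spawn_dist.py:
```python
from clearml import Task

def train_worker(rank, task_id):
    # Re-attach to the task that was created in the main process
    task = Task.get_task(task_id=task_id)
    logger = task.get_logger()
    for iteration in range(10):
        logger.report_scalar(title="loss", series="rank{}".format(rank),
                             value=1.0 / (iteration + 1), iteration=iteration)
    if rank == 0:
        # explicit upload of a checkpoint produced by this worker
        task.upload_artifact(name="checkpoint", artifact_object="model.ckpt")
```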
are they in conflict?
` # pip cache folder mapped into docker, used for python package caching
docker_pip_cache = /clearml-cache/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = /clearml-cache/apt-cache
docker_internal_mounts {
    apt_cache: "/clearml-cache/apt-cache"
    pip_cache: "/clearml-cache/pip-cache"
    vcs_cache: "/clearml-cache/vcs-cache"
    venv_build: "/clearml-cache/venvs-builds"
    pip_download: "/cl...
using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
` from clearml import Task  # import added for completeness; args comes from the script's argument parser

if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run...
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":
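To illustrate why: with the spawn start method each child process re-imports the main module, so anything outside the __main__ guard (including Task.init and any set_start_method call) runs again in every worker. A minimal sketch, not taken from the fastMRI code:
```python
import multiprocessing as mp
from clearml import Task

def worker(rank):
    print("worker {} running".format(rank))

if __name__ == "__main__":
    # created once, in the parent process only
    task = Task.init(project_name="dummy", task_name="spawn-example")
    mp.set_start_method("spawn")  # calling this a second time raises "context has already been set"
    procs = [mp.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```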
AgitatedDove14 one more thing regarding the initial question: apt-cache, pip-cache, pip-download-cache, vcs-cache and venvs-cache contain data on the shared clearml-cache, but venvs-build does not? What sort of data would be stored in the venvs-build folder? I do have venvs_dir = /clearml-cache/venvs-builds specified in the clearml.conf
Hey Natan, good point! But I have actually set both
Ok, if I wanted a different behaviour I would need one agent per task, right?
Hi AgitatedDove14 one more question about efficient caching, is it possible to cache/share docker images between agents?
I can figure out a way to resolve it, but is there any other way to get env vars / any value or secret from the host to the docker of a task?
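One possibility (a sketch only, not necessarily the route recommended in this thread): attach the value to the task's own container arguments when the task is created, so it never has to live in the agent's clearml.conf. The image name and variable are placeholders, and note the value then gets stored on the task itself:
```python
import os
from clearml import Task

task = Task.init(project_name="dummy", task_name="env-forwarding")
# forward MY_SECRET from the machine creating the task into the agent-side container
task.set_base_docker("my-base-image:latest -e MY_SECRET={}".format(os.environ["MY_SECRET"]))
task.execute_remotely(queue_name="office")
```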
docker-compose with an entrypoint.sh that runs python3 -m clearml_agent daemon --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --force-current-version ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS} --queue office
It appears in multiple places. Seems like the mapping of the pip and apt caches does work, but the access rights are now an issue
Ok, it is more of a docker issue; reading the thread, I guess it is not feasible.
or only not for apt and pip?
so now there is a user conflict between the host and the agent inside the container
The agents also share the clearml.conf file which causes some issue with the worker_id/worker_name. They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
but this would still be part of the clearml.conf, right? I would prefer a way around clearml.conf to avoid resolving the variables
I like this approach more but it still requires resolved environment variables inside the clearml.conf
I do have this setting in my clearml.conf file:
venvs_cache: {
    free_space_threshold_gb: 50.0
    path: /clearml-cache/venvs-cache
}
So it should cache the venvs, right? I also see content in the /clearml-cache/venvs-cache folder. Because I have venvs_cache configured there is nothing in venvs-build, since it uses the cache?
Executing: ['docker', 'run',......]
chown: changing ownership of '/root/.cache/pip': Operation not permitted
Get:1 focal-security InRelease [114 kB]
Get:2 focal InRelease [265 kB]
Get:3 focal-updates InRelease [114 kB
It is at the top of the logs