I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
not much info š
Can you manually run the docker ?
But I see in the agent logs:Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
That didnāt gave useful infos, was that docker was not installed in the agent machine x)
JitteryCoyote63 you mean "docker" was not installed and it did not throw an error ?
I got some progress TimelyPenguin76 , Now the task runs and I get the error from docker:docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Alright I have a followup question then: I used the param --user-folder ā~/projects/my-projectā, but any change I do is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to my the folder on the machine. Is it possible to do so?
When you start the ClearML agent, the last line is the file for the agentās output, for linux is should be something like:
Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to /tmp/.clearml_agent_daemon_***.txt
which ClearML agent version are you running?
š
So nvidia-container-toolkit
and systemctl restart dockerd
fixed it?
can you check the agentās logs? maybe we can find something there
Alright, how can I then mount a volume of the disk?
Ā you mean ādockerā was not installed and it did not throw an error ?
Yes docker was not installed in the machine
Yes you must make sure the docker can mount a persistent folder for you to work on.
Ok, it would be nice to have a --user-folder-mounted that do the linking automatically
Here are the logs of the agent :)
` (base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+-------+
| id | name | tags |
+----------------------------------+--------+-------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker | |
+----------------------------------+--------+-------+
Starting infinite task polling loop...
tail: /tmp/.clearml_agent_daemon_outjdups8t2.txt: file truncated
task 44abb7247b0da367b6da05e65592cafb pulled from 54e4a62a402d5135612ba7b12cfe4e57 by worker office:worker-0:docker
Running task '44abb7247b0da367b6da05e65592cafb'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.qds4cx.txt', '/tmp/.clearml_agent_out.qds4cx.txt'
Running Task 44abb7247b0da367b6da05e65592cafb inside docker: nvidia/cuda:10.1-runtime-ubuntu18.04 --network host
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.3sh28jd6.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.gowxn7rp:/root/.ssh', '-v', '/home/user/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/user/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/user/.clearml/cache:/clearml_agent_cache', '-v', '/home/user/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==20.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 44abb7247b0da367b6da05e65592cafb']
Running Docker:
Executing: ('docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.3sh28jd6.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.gowxn7rp:/root/.ssh', '-v', '/home/user/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/user/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/user/.clearml/cache:/clearml_agent_cache', '-v', '/home/user/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==20.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 44abb7247b0da367b6da05e65592cafb')
DONE: Running task '44abb7247b0da367b6da05e65592cafb', exit status -1 `
So that I donāt loose what I worked on when stopping the session, and if I need to, I can ssh to the machine and directly access the content inside the user folder
the first problem I had, that didnāt gave useful infos, was that docker was not installed in the agent machine x)
Alright I have a followup question then: I used the param --user-folder ā~/projects/my-projectā, but any change I do is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to my the folder on the machine. Is it possible to do so?
Yes you must make sure the docker can mount a persistent folder for you to work on.
Let me check what's the easiest way to do that
Can you try upgrade to the latest? pip install clearml-agent==0.17.2
?
Let's assume the host has a folder for all users for persistence storage, for example '/mnt/user_data/and you have a user named 'myuser' and a matching subfolder '/mnt/user_data/myuser
Then we can do:clearml-session ... --docker "my_docker_image -v /mnt/user_data/:/host_mount/" --user-folder "/host_mount/myuser"
BTW: The next time you call clearml-session
these will become the default parameters, so no need to change anything š
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/user/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/user/.clearml/cache:/clearml_agent_cache', '-v', '/home/user/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==20.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 25c4287a041503d09f2922264133b889']
Yes docker was not installed in the machine
Okay make sense, we should definitely check that you have docker before starting the daemon š
Ok, it would be nice to have a --user-folder-mounted that do the linking automatically
It might be misleading if you are running on k8s cluster, where one cannot just -v mount
volume...
What do you think?
So I installed docker, added user to group allowed to run docker (not to have to run with sudo, otherwise it fails), then ran these two commands and it worked