Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/user/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/user/.clearml/cache:/clearml_agent_cache', '-v', '/home/user/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==20.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 25c4287a041503d09f2922264133b889']
can you check the agent’s logs? maybe we can find something there
When you start the ClearML agent, the last line is the file for the agent’s output, for linux is should be something like:
Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to /tmp/.clearml_agent_daemon_***.txt
Here are the logs of the agent :)
` (base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+-------+
| id | name | tags |
+----------------------------------+--------+-------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker | |
+----------------------------------+--------+-------+
Starting infinite task polling loop...
tail: /tmp/.clearml_agent_daemon_outjdups8t2.txt: file truncated
task 44abb7247b0da367b6da05e65592cafb pulled from 54e4a62a402d5135612ba7b12cfe4e57 by worker office:worker-0:docker
Running task '44abb7247b0da367b6da05e65592cafb'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.qds4cx.txt', '/tmp/.clearml_agent_out.qds4cx.txt'
Running Task 44abb7247b0da367b6da05e65592cafb inside docker: nvidia/cuda:10.1-runtime-ubuntu18.04 --network host
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.3sh28jd6.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.gowxn7rp:/root/.ssh', '-v', '/home/user/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/user/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/user/.clearml/cache:/clearml_agent_cache', '-v', '/home/user/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==20.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 44abb7247b0da367b6da05e65592cafb']
Running Docker:
Executing: ('docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.3sh28jd6.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.gowxn7rp:/root/.ssh', '-v', '/home/user/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/user/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/user/.clearml/cache:/clearml_agent_cache', '-v', '/home/user/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==20.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 44abb7247b0da367b6da05e65592cafb')
DONE: Running task '44abb7247b0da367b6da05e65592cafb', exit status -1 `
not much info 😕
Can you manually run the docker ?
I got some progress TimelyPenguin76 , Now the task runs and I get the error from docker:docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
But I see in the agent logs:Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
which ClearML agent version are you running?
Can you try upgrade to the latest? pip install clearml-agent==0.17.2
?
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
👍
So nvidia-container-toolkit
and systemctl restart dockerd
fixed it?
the first problem I had, that didn’t gave useful infos, was that docker was not installed in the agent machine x)
So I installed docker, added user to group allowed to run docker (not to have to run with sudo, otherwise it fails), then ran these two commands and it worked
Alright I have a followup question then: I used the param --user-folder “~/projects/my-project”, but any change I do is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to my the folder on the machine. Is it possible to do so?
So that I don’t loose what I worked on when stopping the session, and if I need to, I can ssh to the machine and directly access the content inside the user folder
That didn’t gave useful infos, was that docker was not installed in the agent machine x)
JitteryCoyote63 you mean "docker" was not installed and it did not throw an error ?
Alright I have a followup question then: I used the param --user-folder “~/projects/my-project”, but any change I do is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to my the folder on the machine. Is it possible to do so?
Yes you must make sure the docker can mount a persistent folder for you to work on.
Let me check what's the easiest way to do that
you mean “docker” was not installed and it did not throw an error ?
Yes docker was not installed in the machine
Yes you must make sure the docker can mount a persistent folder for you to work on.
Ok, it would be nice to have a --user-folder-mounted that do the linking automatically
Yes docker was not installed in the machine
Okay make sense, we should definitely check that you have docker before starting the daemon 😉
Ok, it would be nice to have a --user-folder-mounted that do the linking automatically
It might be misleading if you are running on k8s cluster, where one cannot just -v mount
volume...
What do you think?
Alright, how can I then mount a volume of the disk?
Let's assume the host has a folder for all users for persistence storage, for example '/mnt/user_data/and you have a user named 'myuser' and a matching subfolder '/mnt/user_data/myuser
Then we can do:clearml-session ... --docker "my_docker_image -v /mnt/user_data/:/host_mount/" --user-folder "/host_mount/myuser"
BTW: The next time you call clearml-session
these will become the default parameters, so no need to change anything 🙂