Hi LazyTurkey38 , how exactly are you running the agent? Also, are.you using a custom base docker image, and if so, which one?
It’s not that I think because it works if I run the same command manually.
btw,
I launch the agent
daemon
outside docker (with
--docker
) , that’s the way it is supposed to work right?
Yep that should work
is it ?
So you're running the agent daemon itself inside docker?
ugh, sudo actually makes it fail explicitly because
` error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
- Make sure you pushed the requested commit:
(repository='git@github.com:salimmj/clearml-demo.git', branch='main', commit_id='f76f3affd28d5558928d7ffd9a6797890ffdd708', tag='', docker_cmd='nvidia/cuda:11.4.0-runtime-ubuntu20.04', entry_point='mnist.py', working_dir='.') - Check if remote-worker has valid credentials [see worker configuration file] `
It seems to try to p[ull with SSH credentials, add your user/pass(or better APIkey) to the clearml.conf
(look for git_user /git_pass)
Should solve the issue
LazyTurkey38 notice the assumption is that the docker entry-point ends with bash, and only then the agent take charge. I'm assuming this is not te case hence the agent spins the docker, then the docker just ends, could that be?
I tried with and without. I’m having the issue where if I run the task from the queue it will complete as soon as it goes into docker but if I run the same docker run it works.
might it be related to the docker socket not being mounted to the agent daemon running inside a docker container?
Oh yes, if the daemon is running Inside a docker container than you need both --privileged and mounting of the docker socket, to get it to work
AgitatedDove14 no I mean I can do:docker run -t --gpus "device=1" -dit -e APP_ENV=kprod -e CLEARML_WORKER_ID=ada:gpu1 -e CLEARML_DOCKER_IMAGE=922531023312.dkr.ecr.us-west-2.amazonaws.com/jym-coach:202108080511.7e8d6d1 -v /home/smjahad/.gitconfig:/root/.gitconfig -v /tmp/.clearml_agent.kjx6r9oo.cfg:/root/clearml.conf -v /tmp/clearml_agent.ssh.l8cguj81:/root/.ssh -v /home/smjahad/.clearml/apt-cache.1:/var/cache/apt/archives -v /home/smjahad/.clearml/pip-cache:/root/.cache/pip -v /home/smjahad/.clearml/pip-download-cache:/root/.clearml/pip-download-cache -v /home/smjahad/.clearml/cache:/clearml_agent_cache -v /home/smjahad/.clearml/vcs-cache:/root/.clearml/vcs-cache --rm <IMAGELINK> bash -c echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update && apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip==21.2.3" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 1be91a7331ca46b7ae4bc4024d93bb36
Which is the same command that shows up on the experiment log (in UI) before it ends as “complete”
AgitatedDove14 should I try running the above command with privileged user?
AgitatedDove14 might it be related to the docker socket not being mounted to the agent daemon running inside a docker container?
it works if I run the same command manually.
What do you mean?
Can you do:docker run -it <my container here> bash
Then immediately get an interactive bash ?
SuccessfulKoala55 I tried to make a docker image by combining one of our dockerfiles with this https://github.com/allegroai/clearml-agent/blob/master/docker/agent/Dockerfile . I modified the entrypoint
to also be a combination of both.
Right now I’m not seeing that error, but the the process seems to exit (as completed) after the docker run
. I’m wondering if my Dockerfile is not properly setup and it’s exiting before the deamon is started.
btw, AgitatedDove14 I launch the agent daemon
outside docker (with --docker
) , that’s the way it is supposed to work right?
$ clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
And then the worker itself will run the docker run
command for me and start another non-daemon agent inside.
I guess the failure happens when it tries to switch to docker because the same experiment works with agents not started with --docker
flag
clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
If the user running this command can run "docker run", then you should ne fine