LazyFox65 seems like a docker issue.
Can you manually run the docker?
Hmm, can you run: docker run -it allegroai/clearml-agent-services:latest
What is the solution then? What exactly has helped?
I regularly run into the same problem when I launch pipelines locally (for remote execution)
However, when I clone the pipeline from web UI and launch it once again, it works. Is there a way to bypass this?
Hi MelancholyElk85
However, when I clone the pipeline from web UI and launch it once again, it works. Is there a way to bypass this?
In both cases, are you seeing a different behavior on the same machine running the agent (i.e. cloning from the UI vs. launching from code)?
No, when I run the pipeline from the console on my local machine, it for some reason launches on the clearml-services hostname (despite the fact that I specified the queue with the desired agent with pipe.set_default_execution_queue in my code)
MelancholyElk85 note that there is the pipeline controller queue (i.e. which agent will run the logic of the pipeline), and the default queue for the pipeline steps (i.e. the actual steps of the pipeline).
The default queue for the pipeline logic itself is services. You can change it (pipeline.start(..., queue='another_q'))
Make sense?
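To make the two queues concrete, here is a minimal sketch of the distinction described above. The project, task, and queue names are placeholders made up for illustration (only `another_q` appears in the thread), and it assumes a working `clearml` installation with a configured server:

```python
# Placeholder queue names; substitute the queues your agents actually listen on.
CONTROLLER_QUEUE = "another_q"  # where the pipeline logic (controller) runs
STEP_QUEUE = "default"          # hypothetical default queue for the steps


def launch_pipeline():
    """Enqueue the controller on one queue and the steps on another."""
    # Imported lazily so the module can be loaded without clearml configured.
    from clearml import PipelineController

    pipe = PipelineController(
        name="example pipeline",  # hypothetical names
        project="examples",
        version="1.0.0",
    )
    # Queue used by every step that does not set its own execution queue.
    pipe.set_default_execution_queue(STEP_QUEUE)
    pipe.add_step(
        name="step_one",
        base_task_project="examples",
        base_task_name="step one task",  # hypothetical existing task
    )
    # Without the queue argument the controller goes to the "services" queue.
    pipe.start(queue=CONTROLLER_QUEUE)
```

Calling `launch_pipeline()` from a local console would then place the controller on `another_q` rather than on the services agent, while the steps go to the default step queue.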
Yes, it works, thank you! The question remains though: why won't docker containers launch on services?
the question remains though: why docker containers won't launch on services
Maybe something with the way it was launched by the docker-compose?
(I'm assuming it will fail on any docker container regardless, right?!)
Yes, looks like it fails on at least 2 different containers
Hmm, I'm assuming something is wrong here:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L119
What's the host machine OS ?
You mean the host where it works correctly? Ubuntu 20.04.3
Yeah.. that should have worked ...
What's the exact error you are getting ?
` 1633204284443 clearml-services INFO Executing: ['docker', 'run', '-t', '-l', 'clearml-worker-id=clearml-services:service:58186f9e975f484683a364cf9ce69583', '-l', 'clearml-parent-worker-id=clearml-services', '-e', 'NVIDIA_VISIBLE_DEVICES=none', '-e', 'CLEARML_WORKER_ID=clearml-services:service:58186f9e975f484683a364cf9ce69583', '-e', 'CLEARML_DOCKER_IMAGE=', '-v', '/tmp/.clearml_agent.pgsygoh2.cfg:/root/clearml.conf', '-v', '/root/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/root/.clearml/pip-cache:/root/.cache/pip', '-v', '/root/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/root/.clearml/cache:/clearml_agent_cache', '-v', '/root/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', '', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update && apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=none $LOCAL_PYTHON -u -m clearml_agent execute --full-monitoring --id 58186f9e975f484683a364cf9ce69583']
1633204289496 clearml-services DEBUG docker: invalid reference format.
See 'docker run --help'.
1633204289546 clearml-services DEBUG Process failed, exit code 125 `
on the machine I build images? Docker version 20.10.8, build 3967b7d
On the machine running the docker-compose (i.e. the clearml-server)
Oh, I need to ask the guy who deployed it
1633204289496 clearml-services DEBUG docker: invalid reference format.
This is the strange message, like the execution command is not valid...
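One quick way to inspect a logged command like the one above is to scan its argument list for empty or whitespace-only entries, since docker will try to interpret such an argument as the image reference. The helper below is a hypothetical debugging aid, not part of clearml-agent, and the argument list is an abbreviated stand-in for the logged one:

```python
def find_blank_args(argv):
    """Return the indices of arguments that are empty or only whitespace."""
    return [i for i, arg in enumerate(argv) if arg.strip() == ""]


# Abbreviated version of the logged command, keeping the suspicious part.
logged = ["docker", "run", "-t", "--rm", "", "bash", "-c", "echo hello"]

for index in find_blank_args(logged):
    print(f"blank argument at position {index}, after {logged[index - 1]!r}")
```

Here the blank entry sits exactly where the image name is expected, between `--rm` and `bash`.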
AgitatedDove14 I ran into this problem again. Are there any known issues about it? I don't remember what helped the last time
More specifically, there are 2 tasks with almost identical docker commands. The only difference is the image itself. The task with one image works, and with the other image it fails. Both images are valid images that launch nicely on my laptop. Both images exist in the registry. Maybe you have some ideas what could possibly be wrong here?
Solved. The problem was a stray space before the image name in the Image section in the web UI. I think you should probably strip the string before proceeding to the environment building step, to keep this annoying thing from happening. Of course, users could check twice before launching, but this will come up every once in a while regardless
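The suggested fix can be sketched as follows. This is only an illustration of why stripping helps: both functions are hypothetical, the image name is an example, and whether the agent really splits the field on single spaces is an assumption (it would be consistent with the empty `''` argument in the logged command):

```python
def naive_parse(image_field):
    """Assumed naive parsing: first space-separated token is the image."""
    tokens = image_field.split(" ")
    return tokens[0], tokens[1:]


def normalized_parse(image_field):
    """Same idea, but tolerant of stray whitespace around the image name."""
    tokens = image_field.split()  # splits on any whitespace, drops empty tokens
    if not tokens:
        return "", []
    return tokens[0], tokens[1:]


field = " python:3.9"  # example value with a stray space in front

print(naive_parse(field))       # image comes out as an empty string
print(normalized_parse(field))  # image comes out as the intended name
```

With the naive split, the image becomes an empty string and the real name is pushed into the extra arguments, which is the kind of input that would make `docker run` report an invalid reference format.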