I see, that should work, thank you! I guess I was hoping to find a solution with some clearml args rather than creating a new docker image
And this is my log: 1708442371374 0aa73e67e07c info ClearML Task: overwriting (reusing) task id=3d5d4e989c7a4fbcaceed1e6c92d1d40
ClearML results page: XXXXX/projects/c2187a1a5e654360a3d565a14d0dc1b0/experiments/3d5d4e989c7a4fbcaceed1e6c92d1d40/output/log
1708442371974 0aa73e67e07c info 1
2024-02-20 10:19:31,990 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
1708442373378 0aa73e67e07c info 2024-02-20 10:19:33,378 - clearml.Task - INFO - Finished repository detection and package analysis
1708442384226 YYYYY:gpu0 INFO task 3d5d4e989c7a4fbcaceed1e6c92d1d40 pulled from 08f659b9bda740c782176dd13001ac39 by worker YYYYY:gpu0
1708442384303 YYYYY:gpu0 INFO Running Task 3d5d4e989c7a4fbcaceed1e6c92d1d40 inside docker: danielbogdoll/spconv_1_ood_lidar:latest arguments: ['-e', 'NVIDIA_DRIVER_CAPABILITIES=all']
1708442384326 YYYYY:gpu0 INFO Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'NVIDIA_DRIVER_CAPABILITIES=all', '-v', '/home/clearml-agent/.ssh/known_hosts:/root/.ssh/known_hosts', '--memory-swap=28G', '--memory=28G', '--shm-size=28G', '-e NVIDIA_DRIVER_CAPABILITIES=all', '-l', 'clearml-worker-id=YYYYY:gpu0', '-l', 'clearml-parent-worker-id=YYYYY:gpu0', '-e', 'CLEARML_WORKER_ID=YYYYY:gpu0', '-e', 'CLEARML_DOCKER_IMAGE=danielbogdoll/spconv_1_ood_lidar:latest -e NVIDIA_DRIVER_CAPABILITIES=all', '-e', 'CLEARML_TASK_ID=3d5d4e989c7a4fbcaceed1e6c92d1d40', '-v', '/tmp/.clearml_agent.o8882m7z.cfg:/tmp/clearml.conf', '-e', 'CLEARML_CONFIG_FILE=/tmp/clearml.conf', '-v', '/tmp/clearml_agent.ssh.3yistmo7:/.ssh', '-v', '/home/clearml-agent/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/clearml-agent/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/clearml-agent/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/clearml-agent/.clearml/cache:/clearml_agent_cache', '-v', '/home/clearml-agent/.clearml/vcs-cache:/root/.clearml/vcs-cache', '-v', '/home/clearml-agent/.clearml/venvs-cache:/root/.clearml/venvs-cache', '--rm', 'danielbogdoll/spconv_1_ood_lidar:latest', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; cp -Rf /.ssh -T ~/.ssh ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; [ ! -z $LOCAL_PYTHON ] || for i in {15..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update -y ; apt-get install -y $CLEARML_APT_INSTALL) ; [ ! 
-z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pippip" ; $LOCAL_PYTHON -m pip install -U clearml-agent==1.5.2rc0 ; cp /tmp/clearml.conf ~/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 3d5d4e989c7a4fbcaceed1e6c92d1d40']
1708442389368 YYYYY:gpu0 DEBUG /usr/bin/bash: /usr/bin/bash: cannot execute binary file
1708442389388 YYYYY:gpu0 DEBUG Process failed, exit code 126
@<1523701087100473344:profile|SuccessfulKoala55> Is there any way for me to override this behavior? I don't have access to the original Dockerfile, but I need the docker image (or rather, it makes my life much easier) :D
Well, the obvious solution would be to build your own docker image from that docker image (using the FROM Dockerfile directive) and override only the Entrypoint
1708426202645 4a9490578787 info ClearML Task: created new task id=406a4d3f372347faa9b7ba02bf993d47
ClearML results page: XXXXX/projects/c2187a1a5e654360a3d565a14d0dc1b0/experiments/406a4d3f372347faa9b7ba02bf993d47/output/log
1708426203801 4a9490578787 info 2024-02-20 05:50:03,801 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2024-02-20 05:50:04,638 - clearml.Task - INFO - Finished repository detection and package analysis
1708426214556 YYYYY:gpu1 INFO task 406a4d3f372347faa9b7ba02bf993d47 pulled from 08f659b9bda740c782176dd13001ac39 by worker YYYYY:gpu1
1708426214642 YYYYY:gpu1 INFO Running Task 406a4d3f372347faa9b7ba02bf993d47 inside docker: scrin/dev-spconv:latest arguments: ['-e', 'NVIDIA_DRIVER_CAPABILITIES=all']
custom_setup_bash_script:
pip install open3d
pip install --no-index torch-scatter -f None
pip install strictyaml
sudo apt-get update
sudo apt-get install -y libx11-6
sudo apt-get install -y libgl1-mesa-glx
1708426214666 YYYYY:gpu1 INFO Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '-e', 'NVIDIA_DRIVER_CAPABILITIES=all', '-v', '/home/clearml-agent/.ssh/known_hosts:/root/.ssh/known_hosts', '--memory-swap=28G', '--memory=28G', '--shm-size=28G', '-e NVIDIA_DRIVER_CAPABILITIES=all', '-l', 'clearml-worker-id=YYYYY:gpu1', '-l', 'clearml-parent-worker-id=YYYYY:gpu1', '-e', 'CLEARML_WORKER_ID=YYYYY:gpu1', '-e', 'CLEARML_DOCKER_IMAGE=scrin/dev-spconv:latest -e NVIDIA_DRIVER_CAPABILITIES=all', '-e', 'CLEARML_TASK_ID=406a4d3f372347faa9b7ba02bf993d47', '-v', '/tmp/.clearml_agent.wpanxpf8.cfg:/tmp/clearml.conf', '-e', 'CLEARML_CONFIG_FILE=/tmp/clearml.conf', '-v', '/tmp/clearml_agent.ssh.x3n8s40k:/.ssh', '-v', '/home/clearml-agent/.clearml/apt-cache.1:/var/cache/apt/archives', '-v', '/home/clearml-agent/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/clearml-agent/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/clearml-agent/.clearml/cache:/clearml_agent_cache', '-v', '/home/clearml-agent/.clearml/vcs-cache:/root/.clearml/vcs-cache', '-v', '/home/clearml-agent/.clearml/venvs-cache:/root/.clearml/venvs-cache', '--rm', 'scrin/dev-spconv:latest', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; cp -Rf /.ssh -T ~/.ssh ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; [ ! -z $LOCAL_PYTHON ] || for i in {15..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update -y ; apt-get install -y $CLEARML_APT_INSTALL) ; [ ! 
-z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pippip" ; $LOCAL_PYTHON -m pip install -U clearml-agent==1.5.2rc0 ; pip install open3d ; pip install --no-index torch-scatter -f None ; pip install strictyaml ; sudo apt-get update ; sudo apt-get install -y libx11-6 ; sudo apt-get install -y libgl1-mesa-glx ; cp /tmp/clearml.conf ~/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 406a4d3f372347faa9b7ba02bf993d47']
1708426219710 YYYYY:gpu1 DEBUG [74G[ OK ]
]0;root@be3ea49471b7: ~ root@be3ea49471b7:~#
1708426304971 YYYYY:gpu1 ERROR User aborted: stopping task (1)
1708426305008 YYYYY:gpu1 DEBUG Process aborted by user
Try simply removing the entrypoint from the original image instead of setting it to bash; see here
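A minimal sketch of what "removing the entrypoint" could look like in a derived image. It only renders the Dockerfile text; the helper name `derived_dockerfile` is hypothetical, and the base image is the one from the log above. `ENTRYPOINT []` is the standard Dockerfile way to clear an entrypoint inherited from the base image.

```python
def derived_dockerfile(base_image: str) -> str:
    """Render a Dockerfile that inherits from base_image but clears its entrypoint.

    Hypothetical helper for illustration; it only builds text and does not call Docker.
    """
    lines = [
        f"FROM {base_image}",
        "# Clear the inherited entrypoint so the command passed to",
        "# `docker run` is executed as-is.",
        "ENTRYPOINT []",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Base image name taken from the agent log in this thread.
    print(derived_dockerfile("scrin/dev-spconv:latest"))
```

The printed text can be saved as a Dockerfile and built with `docker build`; the extra `pip` and `apt-get` steps from the thread could be appended as further `RUN` lines.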
Hi @<1670964687132430336:profile|SpicyFrog56> , I think this is because of the entrypoint of this docker image. Note the format of the docker run command used by the agent: it's basically passing a command and args, but I guess the entrypoint messes that up? You can easily check by trying a similar docker run command yourself and seeing how the container behaves
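One way to run that check by hand, sketched under a few assumptions: the helper names are hypothetical, the run command is a stripped-down imitation of the agent's invocation (image followed by `bash -c ...`), and nothing is executed here, the script only prints the two commands to paste into a terminal.

```python
import shlex
from typing import List


def inspect_entrypoint_cmd(image: str) -> List[str]:
    """Command that prints the entrypoint configured in the image (hypothetical helper)."""
    return ["docker", "inspect", "--format", "{{json .Config.Entrypoint}}", image]


def agent_like_run_cmd(image: str, shell_command: str) -> List[str]:
    """Simplified imitation of the agent's call: image, then `bash -c <command>`."""
    return ["docker", "run", "--rm", "-t", image, "bash", "-c", shell_command]


if __name__ == "__main__":
    image = "scrin/dev-spconv:latest"  # image name taken from the log in this thread
    for cmd in (
        inspect_entrypoint_cmd(image),
        agent_like_run_cmd(image, "echo hello from the container"),
    ):
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

If the second command does not simply print the message and exit, the image's entrypoint is probably interfering with the command the agent passes.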
If you ask Bash to run Bash you might get some issues 🙂
Hi @<1670964687132430336:profile|SpicyFrog56> , can you please add the full log?
That worked, interesting. Thanks! Not sure if I fully understand why...:D
The thing is that the agent is designed to provide you with maximum flexibility, meaning you can use a docker image that works differently and can set itself up in the entrypoint, so the agent never overrides the entrypoint. In your specific case, that's an issue 🙂
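A small model of why this bites here, assuming Docker's usual rule that the command given to `docker run` is appended to an exec-form entrypoint. The function name `effective_argv` and the shortened agent command are illustrative, not the agent's code.

```python
from typing import List, Optional


def effective_argv(entrypoint: Optional[List[str]], command: List[str]) -> List[str]:
    """Model of how Docker combines an exec-form ENTRYPOINT with the run command.

    Illustrative only: the command is appended to the entrypoint as arguments.
    """
    return list(entrypoint or []) + list(command)


# Shortened stand-in for the long `bash -c '...'` command the agent passes.
agent_command = ["bash", "-c", "echo setup ; echo run task"]

# Image with ENTRYPOINT ["/bin/bash"]: bash receives "bash" as its first
# argument and tries to run that file as a script.
with_bash_entrypoint = effective_argv(["/bin/bash"], agent_command)

# Image without an entrypoint: the agent's command runs unchanged.
without_entrypoint = effective_argv(None, agent_command)

if __name__ == "__main__":
    print(with_bash_entrypoint)
    print(without_entrypoint)
```

In the first case the container effectively runs `/bin/bash bash -c ...`, which matches the `cannot execute binary file` message and exit code 126 in the log: bash is being asked to interpret the bash binary as a script.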
Hey @<1523701087100473344:profile|SuccessfulKoala55> , I played with the Dockerfile a bit but can't get it working. Locally, I can access the docker image and everything runs as expected, but if I create the ClearML task, it fails, at least with a new error. The Dockerfile looks like this:
# Use the base image
FROM scrin/dev-spconv:latest
ENTRYPOINT ["/bin/bash"]
# Install required Python packages
RUN pip install open3d
RUN pip install --no-index torch-scatter -f None
RUN pip install strictyaml
RUN pip install clearml
RUN pip install "boto3>=1.9"
# Update package information (continue even if it fails)
RUN apt-get update || true
# Install required system libraries
RUN apt-get install -y libx11-6
RUN apt-get install -y libgl1-mesa-glx