Thanks, I am basing my Dockerfile on https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile
It is now getting stuck after:
2021-03-09 14:54:07 task 609a976a889748d6a6e4baf360ef93b4 pulled from 8e47f5b0694e426e814f0855186f560e by worker ov-01:gpu1
2021-03-09 14:54:08 running Task 609a976a889748d6a6e4baf360ef93b4 inside default docker image: MyDockerImage:v0
2021-03-09 14:54:08 Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '-e', 'CLEARML_WORKER_ID=ov-01:gpu1', '-e', 'CLEARML_DOCKER_IMAGE=MyDockerImage:v0', '-v', '/tmp/.clearml_agent.jvxowhq4.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.n9gr_ou9:/root/.ssh', '-v', '/home/ophir/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/ophir/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/ophir/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/ophir/.trains/cache:/clearml_agent_cache', '-v', '/home/ophir/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'MyDockerImage:v0', 'bash', '-c', 'sudo su root ; cp -f /root/*.conf ~/ ; echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 609a976a889748d6a6e4baf360ef93b4']
2021-03-09 14:54:13 root@edd13d234b4d:/home/appuser#
The issue itself is changing the default user.
USER appuser
WORKDIR /home/appuser
Any reason for it ?
I just need it to run the docker and run the command inside it, no?
Yes, this is definitely the issue: the agent assumes the docker user is "root".
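A quick way to see which user a process actually runs as inside the container is a small Python check like the one below. This is only a sketch for inspection: `describe_user` is a hypothetical helper name, and it assumes a Linux container where the standard `os` and `pwd` modules are available.

```python
import os
import pwd


def describe_user():
    """Report the effective uid, login name, and home directory of this process."""
    uid = os.geteuid()
    try:
        entry = pwd.getpwuid(uid)
        name, home = entry.pw_name, entry.pw_dir
    except KeyError:
        # The uid has no entry in /etc/passwd (possible with `docker run --user`).
        name, home = None, None
    return {"uid": uid, "name": name, "home": home, "is_root": uid == 0}


print(describe_user())
```

If `is_root` is false, or `home` is not `/root`, the files the agent mounts under `/root` will not be where the running user expects them.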
Let me check something
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser
OK, it looks like it is starting the training...
Thanks 💯
Hi CooperativeFox72
So based on the docker file you previously posted, I think all your python packages are actually installed on the "appuser" and not as system packages.
Basically, remove the "add user" part and the --user flag from the pip install.
For example:
```
FROM nvidia/cuda:10.1-cudnn7-devel

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
    python3-opencv ca-certificates python3-dev git wget sudo ninja-build
RUN ln -sv /usr/bin/python3 /usr/bin/python

WORKDIR /root/

RUN wget && \
    python3 get-pip.py && \
    rm get-pip.py

# install dependencies
# See
# for other options if you use a different version of CUDA
RUN pip install tensorboard cmake   # cmake from apt-get is too old
RUN pip install torch==1.8 torchvision==0.9 -f

RUN pip install 'git+ '
# install detectron2
RUN git clone detectron2_repo
# set FORCE_CUDA because during `docker build` cuda is not accessible
ENV FORCE_CUDA="1"
# This will by default build detectron2 for all common cuda architectures and take a lot more time,
# because inside `docker build`, there is no way to tell which architecture will be used.
ARG TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
ENV TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}"

RUN pip install -e detectron2_repo

# Set a fixed model cache directory.
ENV FVCORE_CACHE="/tmp"
```
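To check whether a package ended up in a per-user site directory (what `pip install --user` does) rather than somewhere every user can import it from, something like the following can be run inside the container. It is a rough sketch: `locate_package` is a hypothetical helper, and the package names in the loop are just the ones mentioned in this thread.

```python
import importlib.util
import site


def locate_package(name):
    """Return (origin, scope) for a top-level package, or (None, "missing")."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None, "missing"
    # Namespace packages have no origin; fall back to their search locations.
    origin = spec.origin or ", ".join(spec.submodule_search_locations or [])
    # Directory that `pip install --user` targets for the current user.
    user_site = site.getusersitepackages()
    scope = "user" if origin.startswith(user_site) else "system-or-other"
    return origin, scope


for pkg in ("torch", "detectron2"):
    origin, scope = locate_package(pkg)
    print(f"{pkg}: {scope} ({origin})")
```

Running this as root and getting "missing" for a package that imports fine as `appuser` would suggest the package lives only in that user's site directory. Editable installs (`pip install -e`) show up as "system-or-other" because they point at the source checkout.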
I have another question.
Now that I am using the root user it looks better,
but my docker image already has all my code and all the packages it needs. I don't understand why the agent needs to install all of those again?
But I think they did it for a reason, no?
Not a very good one, they just installed everything under the user and used --user with pip.
It really does not matter inside a docker, the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with "root" user, but with 1000 user id.
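To see the effect on a mounted drive, one can create a scratch file in the mounted directory from inside the container and look at which uid owns it. The sketch below assumes a Linux container; `uid_of_new_file` is a hypothetical helper, and the temp directory is only a stand-in for whatever mount point is actually used.

```python
import os
import tempfile


def uid_of_new_file(directory):
    """Create a scratch file in `directory` and return the uid that owns it."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        return os.stat(path).st_uid
    finally:
        os.remove(path)


# Stand-in path; replace with the mounted directory to inspect.
target = tempfile.gettempdir()
print("process uid:", os.geteuid())
print("new file uid:", uid_of_new_file(target))
```

When the container runs as root and no user-namespace remapping is configured, files written to a bind mount typically show up on the host as owned by uid 0, which is the situation a uid-1000 user is meant to avoid.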
Thanks, I will make sure that all the Python packages are installed as root.
And will let you know if it works
I am creating this user
Please explain, I think this is the culprit ...
Hi AgitatedDove14 ,
Sorry for the late response, it was late in my country 🙂 .
This is what I am getting:
appuser@219886f802f0:~$ sudo su root
root@219886f802f0:/home/appuser# whoami
root
So for now I am leaving this issue...
Thanks a lot 🙏 🙌
I did it just because FAIR did it in the detectron2 Dockerfile
Not a very good one, they just installed everything under the user and used --user with pip.
It really does not matter inside a docker, the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with "root" user, but with 1000 user id.
This sounds like a good reason haha 😄
Let me check if we can hack something...
Thanks 🙏
So I asked my boss and DevOps, and they said that for now we can use the root user inside the docker image...
CooperativeFox72
Could you try to run the docker and then, inside the docker, run:
su root
whoami
Are you inheriting from their docker file ?
Okay we have something 🙂
To your clearml.conf add:
agent.docker_preprocess_bash_script = [
    "su root",
    "cp -f /root/*.conf ~/",
]
Let's see if that works