Yes, this is definitely the issue, the agent assumes the docker user is "root".
Let me check something
Thanks, I am basing my docker on https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile
I did it just because FAIR did it in the detectron2 Dockerfile
So for now I am leaving this issue...
Thanks a lot 🙏 🙌
Ok, looks like it is starting the training...
Thanks 💯
So I asked my boss and DevOps and they said that for now we can use the root user inside the docker image...
The issue itself is changing the default user.
USER appuser
WORKDIR /home/appuser
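For reference, `USER` only sets the image's default user, and it can be overridden at run time; a quick sketch using the image name from this thread:
```
# The default user baked into the image:
docker run --rm MyDockerImage:v0 whoami          # -> appuser
# It can be overridden per run, without rebuilding the image:
docker run --rm -u root MyDockerImage:v0 whoami  # -> root
```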
Any reason for it ?
Thanks, I will make sure that all the python packages are installed as root...
And will let you know if it works
I just need it to run the docker and run the command inside it, no?
It is now stuck after:
2021-03-09 14:54:07 task 609a976a889748d6a6e4baf360ef93b4 pulled from 8e47f5b0694e426e814f0855186f560e by worker ov-01:gpu1
2021-03-09 14:54:08 running Task 609a976a889748d6a6e4baf360ef93b4 inside default docker image: MyDockerImage:v0
2021-03-09 14:54:08 Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '-e', 'CLEARML_WORKER_ID=ov-01:gpu1', '-e', 'CLEARML_DOCKER_IMAGE=MyDockerImage:v0', '-v', '/tmp/.clearml_agent.jvxowhq4.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.n9gr_ou9:/root/.ssh', '-v', '/home/ophir/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/ophir/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/ophir/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/ophir/.trains/cache:/clearml_agent_cache', '-v', '/home/ophir/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'MyDockerImage:v0', 'bash', '-c', 'sudo su root ; cp -f /root/*.conf ~/ ; echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 609a976a889748d6a6e4baf360ef93b4']
2021-03-09 14:54:13 root@edd13d234b4d:/home/appuser#
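The trailing prompt suggests the `sudo su root` at the start of the chained command opened an interactive root shell, so everything after the first `;` waits for that shell to exit; a minimal sketch of the difference, using the image name from this thread:
```
# `su` without -c starts an interactive shell; the rest of the chain waits for it to exit.
docker run -it --rm MyDockerImage:v0 bash -c 'sudo su root ; echo "only prints after the root shell exits"'
# `su -c` runs the command and returns, so the chain continues:
docker run -it --rm MyDockerImage:v0 bash -c 'sudo su root -c whoami ; echo "continues immediately"'
```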
I am creating this user
Please explain, I think this is the culprit ...
but I think they did it for a reason, no?
Hi CooperativeFox72
But my docker image has all my code and all the packages it needs, I don't understand why the agent needs to install all of those again?
So based on the Dockerfile you previously posted, I think all your python packages are actually installed under the "appuser" user and not as system packages.
Basically remove the "add user" part and the --user
from the pip install.
For example:
```
FROM nvidia/cuda:10.1-cudnn7-devel

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
    python3-opencv ca-certificates python3-dev git wget sudo ninja-build
RUN ln -sv /usr/bin/python3 /usr/bin/python

WORKDIR /root/
RUN wget  && \
    python3 get-pip.py && \
    rm get-pip.py

# install dependencies
# See  for other options if you use a different version of CUDA
RUN pip install tensorboard cmake   # cmake from apt-get is too old
RUN pip install torch==1.8 torchvision==0.9 -f 
RUN pip install 'git+ '

# install detectron2
RUN git clone  detectron2_repo

# set FORCE_CUDA because during docker build cuda is not accessible
ENV FORCE_CUDA="1"
# This will by default build detectron2 for all common cuda architectures and take a lot more time,
# because inside docker build, there is no way to tell which architecture will be used.
ARG TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
ENV TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}"

RUN pip install -e detectron2_repo

# Set a fixed model cache directory.
ENV FVCORE_CACHE="/tmp"
```
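One quick way to see where the packages actually ended up in a built image (a sketch; assumes `python3`/`pip` are on the PATH inside the image and that `torch` is one of the installed packages):
```
# A Location under /home/appuser/.local/... means a --user install tied to appuser;
# /usr/local/lib/... or /usr/lib/... means a system-wide install visible to root as well.
docker run --rm MyDockerImage:v0 python3 -m pip show torch | grep -i '^location'
```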
```
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser
```
CooperativeFox72
Could you try to run the docker and then inside the docker try to do:
su root
whoami
I have another question.
Now that I am using the root user it looks better,
But my docker image has all my code and all the packages it needs, I don't understand why the agent needs to install all of those again?
Okay we have something 🙂
To your clearml.conf add:
agent.docker_preprocess_bash_script = [
    "su root",
    "cp -f /root/*.conf ~/",
]
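A quick sanity check that the entry was picked up (sketch; `~/clearml.conf` is the default config location on the machine running the agent):
```
# Print the new section from the agent's configuration file
grep -A 3 'docker_preprocess_bash_script' ~/clearml.conf
```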
Let's see if that works
Hi AgitatedDove14 ,
Sorry for the late response, it was late in my country 🙂.
This is what I am getting:
appuser@219886f802f0:~$ sudo su root
root@219886f802f0:/home/appuser# whoami
root
Not a very good one, they just installed everything under the user and used --user for the pip.
It really does not matter inside a docker, the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with the "root" user, but with user id 1000.
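A small sketch of that uid point, with a hypothetical host directory bind-mounted into the container (image name from this thread; assumes /host/data exists and is writable by uid 1000):
```
# Files created on a bind mount keep the uid of the user inside the container.
docker run --rm -v /host/data:/data MyDockerImage:v0 bash -c 'touch /data/as_appuser'
docker run --rm -u root -v /host/data:/data MyDockerImage:v0 bash -c 'touch /data/as_root'
ls -ln /host/data   # as_appuser -> uid 1000, as_root -> uid 0
```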
This sounds like a good reason haha 😄
Let me check if we can hack something...
Thanks 🙏
Are you inheriting from their Dockerfile?