Is it possible to know in advance where the Agent will clone the code?
Or to run a link command just before the code is executed?
My docker image already has my project in it, so I know where to mount. Maybe the agent moves or creates a copy of my project somewhere else?
Hi AgitatedDove14 ,
Sorry for the late response. It was late in my country 🙂.
This is what I am getting:
appuser@219886f802f0:~$ sudo su root
root@219886f802f0:/home/appuser# whoami
root
OHH nice, I thought that it was just some kind of job queue on up-and-running machines.
Thanks, I will make sure that all the Python packages are installed as root.
And will let you know if it works
Thanks, I will upgrade the server for now and will let you know.
Thanks for the quick reply.
This will allow more time before the timeout, right?
Maybe there is a way to do something like:
task.freeze_monitor()
download()
task.defrost_monitor()
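The freeze/defrost idea above can be sketched as a context manager, so the monitor is always resumed even if the download fails. This is only an illustration of the pattern: `ToyMonitor`, `frozen`, and `download` are hypothetical names made up for this sketch, and nothing here is a ClearML API.

```python
import threading
from contextlib import contextmanager


class ToyMonitor:
    """Hypothetical stand-in for a background monitor; not a ClearML class."""

    def __init__(self):
        self._active = threading.Event()
        self._active.set()  # monitoring is on by default

    def freeze(self):
        """Pause monitoring."""
        self._active.clear()

    def defrost(self):
        """Resume monitoring."""
        self._active.set()

    @property
    def active(self):
        return self._active.is_set()


@contextmanager
def frozen(monitor):
    """Pause the monitor for the duration of the block, then always resume it."""
    monitor.freeze()
    try:
        yield monitor
    finally:
        monitor.defrost()


def download():
    """Placeholder for a long download that reports no iterations."""
    return "done"


monitor = ToyMonitor()
with frozen(monitor):
    state_during = monitor.active  # False while the download runs
    result = download()
state_after = monitor.active  # True again once the block exits
```

The `try`/`finally` is the point of the design: with separate freeze and defrost calls, an exception inside `download()` would leave the monitor paused.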
So I asked my boss and DevOps, and they say that for now we can use the root user inside the docker image...
Thanks, I am basing my docker image on https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile
It is now getting stuck after:
2021-03-09 14:54:07
task 609a976a889748d6a6e4baf360ef93b4 pulled from 8e47f5b0694e426e814f0855186f560e by worker ov-01:gpu1
2021-03-09 14:54:08
running Task 609a976a889748d6a6e4baf360ef93b4 inside default docker image: MyDockerImage:v0
2021-03-09 14:54:08
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '-e', 'CLEARML_WORKER_ID=ov-01:gpu1', '-e', 'CLEARML_DOCKER_IMAGE=MyDockerImage:v0', '-v', '/tmp/.clearml_agent.jvxowhq4.cfg:/root/clearml.conf', '-v', '/...
AgitatedDove14 Hi, sorry for the long delay.
I tried to use 0.16 instead of 0.13.1.
I didn't have time to debug it (I am overwhelmed with work right now).
But it doesn't work the same as 0.13.1. I am still getting some hanging in my eval process.
I don't know if it is just slower or really stuck, since I killed it and moved back to 0.13.1 until my busy period passes.
Thanks
Hey... Thanks for checking with me.
I didn't have time yet but will check it and let you know..
I reproduced the hang with this code.
But for now only with my env; when I tried to create a new env with only the packages that this code needs, it doesn't get stuck.
So maybe the problem is a conflict between packages?
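One way to narrow down a suspected package conflict like this is to diff the `pip freeze` output of the env that hangs against the minimal env that does not. The sketch below assumes both outputs were saved as text; the function names and the package names in the usage example are made up for illustration.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style lines ("name==version") into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and non-pinned entries
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def diff_envs(full_env, minimal_env):
    """Return packages only in the full env, and packages whose versions differ."""
    full, minimal = parse_freeze(full_env), parse_freeze(minimal_env)
    only_in_full = sorted(set(full) - set(minimal))
    version_mismatch = {
        name: (full[name], minimal[name])
        for name in sorted(set(full) & set(minimal))
        if full[name] != minimal[name]
    }
    return only_in_full, version_mismatch


# Illustrative input only; these are not real package names or versions.
full_env = "pkg-a==1.0\npkg-b==2.0\npkg-c==3.1\n"
minimal_env = "pkg-a==1.0\npkg-c==3.0\n"
only_in_full, version_mismatch = diff_envs(full_env, minimal_env)
```

Packages that appear only in the full env, or with a different version there, are the first candidates to add to the minimal env one at a time until the hang reappears.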
Not a very good one; they just installed everything under the user and used --user for pip.
It really does not matter inside a docker, the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with "root" user, but with 1000 user id.
This sounds like a good reason haha 😄
Let me check if we can hack something...
Thanks 🙏
I have another question.
Now that I am using the root user it looks better,
But my docker image has all my code and all the packages it needs. I don't understand why the agent needs to install all of those again?
OK, it looks like it is starting the training...
Thanks 💯
I tried without yaml.dump(my_params_dict); I will try with it.
So the file was not the same as the one connect_configuration uploaded.
Thanks
From the UI it will, since it is getting the temp file from there.
I mean from the code (let's say remotely).
OK, thanks for the answer. I will use task.set_resource_monitor_iteration_timeout(seconds_from_start=1800) as you suggested for now.
If you add something like what I suggested, can you notify me?
So for now I am leaving this issue...
Thanks a lot 🙏 🙌
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser
The hang is still happening in trains==0.15.2rc0
Sure, I would love to do it when I have more time 🙂