I see that I have several volumes:
$ docker volume ls
DRIVER VOLUME NAME
local 5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local 43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local 1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local 7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local 7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local a55cb71edff48c2138a5da9d8d1e26df3b...
OK, I have a very different problem now. I did the following to restart the ES cluster: docker-compose down, then docker-compose up -d
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, as it always had so far.
Hey FriendlySquid61,
I ended up asking for full control of EC2 so as not to be blocked, so unfortunately I cannot give you a more precise list
Now I am trying to restart the cluster with docker-compose and specifying the last volume, how can I do that?
So it can be that when restarting the docker-compose, it used another volume, hence the loss of data
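One way to find the most recently created volume, so that it can be referenced explicitly, is to sort the output of docker volume inspect by its CreatedAt field. A minimal Python sketch, assuming the inspect JSON has already been captured as text; the function name newest_volume and the sample volume names are made up for illustration and are not from the cluster above:

```python
import json
from datetime import datetime


def newest_volume(inspect_json: str) -> str:
    """Return the name of the most recently created volume.

    `inspect_json` is assumed to be the JSON array printed by
    `docker volume inspect <name> [<name> ...]`.
    """
    volumes = json.loads(inspect_json)

    def created(volume: dict) -> datetime:
        # Normalise a trailing "Z" so fromisoformat() also accepts it on older Pythons.
        return datetime.fromisoformat(volume["CreatedAt"].replace("Z", "+00:00"))

    return max(volumes, key=created)["Name"]


# Hypothetical sample data, for illustration only.
sample = json.dumps([
    {"Name": "es_data_old", "CreatedAt": "2021-03-01T10:00:00Z"},
    {"Name": "es_data_new", "CreatedAt": "2021-03-05T08:30:00Z"},
])
print(newest_volume(sample))
```

The returned name could then be declared as an external volume in the compose file, so that docker-compose reuses it instead of creating a fresh one.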
Sorry, I was actually able to fix it (using 1.1.3), not sure what the problem was
But I see in the agent logs: Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
I want to make sure that an agent did finish uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not available
I was asking to exclude this possibility from my debugging journey
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
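The "wait until the artifacts are really there" requirement can be sketched as a generic polling loop. This is only an illustration in plain Python, not the ClearML API: wait_until and artifacts_ready are hypothetical helpers, and how the set of available artifact names is obtained from the first task is left as an assumption.

```python
import time
from typing import Callable, Set


def wait_until(predicate: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def artifacts_ready(available: Set[str], expected: Set[str]) -> bool:
    """True once every expected artifact name is listed as available."""
    return expected <= available


# Hypothetical usage: `list_available` stands in for whatever call reports
# the artifacts already uploaded by the first step.
def list_available() -> Set[str]:
    return {"model", "metrics"}


ok = wait_until(lambda: artifacts_ready(list_available(), {"model"}),
                timeout=5.0, interval=0.1)
print(ok)
```

The second step would only be launched when the call returns True; a False result means the timeout was hit and the artifacts should not be assumed to exist.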
Yes, thanks!
Hi SuccessfulKoala55, super, that's what I was looking for
I made some progress, TimelyPenguin76. Now the task runs and I get this error from docker: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Thanks SuccessfulKoala55! Do alive workers send a ping to notify the server that they are alive, or does the server infer that they are alive based on the last communication?
You mean it will resolve by itself in the following days or should I do something? Or there is nothing to do and it will stay this way?
Are you planning to add a server-backup service task in the near future?
OK, by setting PyJWT==1.7.1 in the setup.py of the experiment, pip did not enforce the update
yes -> but I still don't understand why the post_packages didn't work, could be worth investigating
Which commit corresponds to RC version? So far we tested with latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)
Yes, it works now! Yay!
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
extra_configurations = {'SubnetId': "<subnet-id>"} with brackets, right?
Ha wait, I removed the http:// in the host and it worked
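For settings that expect a bare host rather than a URL, the fix above amounts to dropping the scheme prefix. A small sketch in Python; strip_scheme is a hypothetical helper, and which setting needed the bare host is not stated here:

```python
from urllib.parse import urlsplit


def strip_scheme(host: str) -> str:
    """Return `host` without a leading http:// or https:// prefix."""
    parts = urlsplit(host)
    if parts.scheme in ("http", "https") and parts.netloc:
        # Keep host[:port] and any path, minus a trailing slash.
        return parts.netloc + parts.path.rstrip("/")
    # No recognised scheme: leave the value untouched.
    return host


print(strip_scheme("http://10.0.0.1:8008"))
```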