It also happens without hitting F5 after some time (~hours)
The simple workaround I imagined (not tested) at the moment is to sleep for 2 minutes after closing the task, to keep the clearml-agent busy until the instance is shut down:
```
self.clearml_task.mark_stopped()
self.clearml_task.close()
time.sleep(120)  # prevent the agent from picking up new tasks
```
My use case is: on a spot instance marked by AWS for termination in 2 minutes, I want to close the task and prevent the clearml-agent from picking up a new task afterwards.
I want the clearml-agent/instance to stop right after the experiment/training is "paused" (experiment marked as stopped + artifacts saved)
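A minimal runnable sketch of that workaround (untested, and the project/task names are placeholders):
```python
import time
from clearml import Task

task = Task.init(project_name="examples", task_name="spot-training")

# ... training loop; checkpoints and artifacts are saved here ...

task.mark_stopped()   # experiment shows up as stopped in the UI
task.close()          # flush and upload any remaining artifacts/metrics
time.sleep(120)       # keep the agent busy so it does not pull a new task
                      # before AWS terminates the spot instance
```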
as it's also based on pytorch-ignite!
I am not sure I understand, what is the link with pytorch-ignite?
We're in the brainstorming phase of figuring out the best approaches to integrate; we might pick your brain later on
Awesome, I'd be happy to help!
So the problem comes when I do my_task.output_uri = "s3://my-bucket": trains in the background checks whether it has access to this bucket, and it is not able to find/read the creds
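Something like this is what I mean (bucket name and keys are placeholders; as far as I understand, the SDK can also read the key/secret from the sdk.aws.s3 section of clearml.conf):
```python
import os
from clearml import Task

# Assumption: if no credentials are set in clearml.conf, boto3's default
# credential chain is used, so the standard AWS environment variables work.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "<access-key>")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "<secret-key>")

task = Task.init(
    project_name="examples",
    task_name="s3-output",
    output_uri="s3://my-bucket",  # same destination as above
)
```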
Will it freeze/crash/break/stop the ongoing experiments?
Ooh, that's cool! I could place torch==1.3.1 there
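If "there" refers to the task's package requirements, a hedged sketch of pinning the version from code (this has to run before Task.init; names are placeholders):
```python
from clearml import Task

# Pin torch so the agent installs this exact version when it recreates the env
Task.add_requirements("torch", "1.3.1")
task = Task.init(project_name="examples", task_name="pinned-torch")
```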
So that I don't lose what I worked on when stopping the session, and if I need to, I can ssh to the machine and directly access the content inside the user folder
Oof now I cannot start the second controller in the services queue on the same second machine, it fails with
```
Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
```
I did that recently - what are you trying to do exactly?
no, I think I could reproduce with multiple queues
Notice the last line should not have
--docker
Did you mean --detached?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yeah, no problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR
If I don't start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the ssh config and prevents me from ssh-ing into the agent afterwards
AgitatedDove14 yes but I don't see in the docs how to attach it to the logger of the earlystopping handler
Nope, I'd like to wait and see how the different tools improve over this year before picking THE one
self.clearml_task.get_initial_iteration()
also gives me the correct number
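A small sketch of what I mean (values and names are illustrative):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="resume-training",
    continue_last_task=True,
)

# offset reached by the previous run, e.g. restored from a checkpoint
task.set_initial_iteration(1000)
print(task.get_initial_iteration())  # -> 1000, so reported scalars continue from there
```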
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
The parent task is a data_processing task, therefore I retrieve it so that I can then do data_processed = parent_task.artifacts["data_processed"]
GrumpyPenguin23 yes, it is the latest
AgitatedDove14 , what I was looking for was: parent_task = Task.get_task(task.parent)
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I retrieve one task ID (task A) as the parent ID and the other one (ID of task B) as a hyper-parameter, as you described
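Roughly like this (names are illustrative, and task B's ID is filled in via the UI or at enqueue time):
```python
from clearml import Task

task_c = Task.init(project_name="examples", task_name="task_c")

# task B's ID goes in as a hyper-parameter, since only one parent is possible
params = {"task_b_id": ""}
task_c.connect(params)

# task A (the data_processing task) is the actual parent
parent_task = Task.get_task(task_id=task_c.parent)
data_processed = parent_task.artifacts["data_processed"].get()

task_b = Task.get_task(task_id=params["task_b_id"])
```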
AgitatedDove14 In my case I'd rather have it under the "Artifacts" tab because it is a big json file
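E.g. something like this (names are placeholders), which puts the JSON under the Artifacts tab rather than in the configuration:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="big-json-report")
# artifact_object can be a file path or a dict; either way it lands under "Artifacts"
task.upload_artifact(name="big_report", artifact_object="report.json")
```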
No, they have different names - I will try to update both agents to the latest versions
Thanks! I will investigate further, I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
Hi, /opt/clearml is ~40 MB, /opt/clearml/data is about 50 GB
Could be also related to https://allegroai-trains.slack.com/archives/CTK20V944/p1597928652031300