You already fixed the problem with pyjwt in the newest version of clearml/clearml-agent, so all good
I was able to fix by applying for a license and registering it
(by console you mean in the dashboard right? or the terminal?)
Hi SuccessfulKoala55, how can I know if I'm logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
with my hack yes, without, no
I got some progress TimelyPenguin76. Now the task runs and I get this error from docker:
`docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`
Hi CumbersomeCormorant74, yes, this is almost the scenario: I have a dozen projects. In one of them, I have ~20 archived experiments in different states (draft, failed, aborted, completed). I went to this archive, selected all of them and deleted them using the bulk delete operation. I got several failed-delete popups. So I tried again with smaller batches (like 5 experiments at a time) to pinpoint the experiments causing the error. I could delete most of them. At some point, all ...
it also happens without hitting F5 after some time (~hours)
The simple workaround I imagined (not tested) at the moment is to sleep for 2 minutes after closing the task, to keep the clearml-agent busy until the instance is shut down:
self.clearml_task.mark_stopped()
self.clearml_task.close()
time.sleep(120)  # Prevent the agent from picking up new tasks
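A slightly fuller sketch of that workaround (untested, assuming the task handle is the one created earlier in the same process):
```python
import time
from clearml import Task

task = Task.current_task()  # assuming the task was created earlier in this process
task.mark_stopped()         # mark the experiment as stopped on the server
task.close()                # flush and upload any remaining metrics/artifacts
# Keep the process busy so the clearml-agent slot stays occupied until AWS
# terminates the spot instance, preventing it from picking up a new task.
time.sleep(120)
```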
My use case is: on a spot instance marked by AWS for termination in 2 minutes, I want to close a task and prevent the clearml-agent from picking up a new one afterwards.
I want the clearml-agent/instance to stop right after the experiment/training is "paused" (experiment marked as stopped + artifacts saved)
as it's also based on pytorch-ignite!
I am not sure to understand, what is the link with pytorch-ignite?
We're in the brainstorming phase of what are the best approaches to integrate, we might pick your brain later on
Awesome, I'd be happy to help!
So the problem comes when I do my_task.output_uri = "s3://my-bucket". Trains then checks in the background whether it has access to this bucket, and it is not able to find/read the creds
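For reference, a minimal sketch of how the destination can be set (assuming the bucket credentials are configured in clearml.conf under sdk.aws.s3, or via the standard AWS environment variables; the project/task names are placeholders):
```python
from clearml import Task

# Credentials for s3://my-bucket are assumed to be configured in clearml.conf
# (sdk.aws.s3 section) or via AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
task = Task.init(
    project_name="examples",      # placeholder
    task_name="s3-output",        # placeholder
    output_uri="s3://my-bucket",  # artifacts/models will be uploaded here
)

# The destination can also be changed after init:
task.output_uri = "s3://my-bucket"
```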
Will it freeze/crash/break/stop the ongoing experiments?
Ooh, that's cool! I could place torch==1.3.1 there
So that I don't lose what I worked on when stopping the session, and if I need to, I can ssh into the machine and directly access the content inside the user folder
Oof now I cannot start the second controller in the services queue on the same second machine, it fails with
` Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
I did that recently - what are you trying to do exactly?
no, I think I could reproduce with multiple queues
Notice the last line should not have
--docker
Did you mean --detached?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR
If I don't start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the ssh config and prevents me from ssh'ing into the agent afterwards
AgitatedDove14 yes but I don't see in the docs how to attach it to the logger of the earlystopping handler
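One possible way to surface those messages (just a sketch, assuming ignite's EarlyStopping exposes its internal Python logger as .logger, and that score_function and trainer are already defined):
```python
import logging
from ignite.handlers import EarlyStopping

# Hypothetical handler; score_function and trainer are assumed to exist elsewhere.
early_stopping = EarlyStopping(patience=10, score_function=score_function, trainer=trainer)

# EarlyStopping logs through a standard Python logger; attaching a StreamHandler sends
# its messages to stderr, which ClearML already captures into the console log.
early_stopping.logger.addHandler(logging.StreamHandler())
early_stopping.logger.setLevel(logging.INFO)
```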
Nope, I'd like to wait and see how the different tools improve over this year before picking THE one
self.clearml_task.get_initial_iteration()
also gives me the correct number
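For context, a small sketch of how the initial iteration offset can be set and read back (the 1000 offset is just an illustrative value):
```python
from clearml import Task

task = Task.current_task()        # assuming the task already exists in this process
task.set_initial_iteration(1000)  # illustrative offset, e.g. when continuing a run
print(task.get_initial_iteration())  # -> 1000
```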
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
The parent task is a data_processing task, therefore I retrieve it so that I can then do data_processed = parent_task.artifacts["data_processed"]
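A minimal sketch of that retrieval (the task id is a placeholder; the artifact's .get() call downloads and deserializes the stored object):
```python
from clearml import Task

# "<parent_task_id>" is a placeholder for the data_processing task's id
parent_task = Task.get_task(task_id="<parent_task_id>")
data_processed = parent_task.artifacts["data_processed"].get()
```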