The simple workaround I imagined (not tested) at the moment is to sleep 2 minutes after closing the task, to keep the clearml-agent busy until the instance is shut down:
self.clearml_task.mark_stopped()
self.clearml_task.close()
time.sleep(120)  # Prevent the agent from picking up new tasks
Thanks AgitatedDove14 !
What would be the exact content of NVIDIA_VISIBLE_DEVICES
if I run the following command?
trains-agent daemon --gpus 0,1 --queue default &
Unfortunately this is difficult to reproduce... Nevertheless it is important to me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they already ran (a mechanism similar to what https://luigi.readthedocs.io/en/stable/ offers). That would allow restarting the pipeline and skipping the completed tasks up to the point where it previously failed. A sketch of what I mean follows below.
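Something like this sketch is what I have in mind (the names, the project/task lookup and the completed-status filter are my assumptions, not an existing clearml feature):
```python
from clearml import Task

def run_or_skip(project_name, task_name, create_and_enqueue):
    # Look for a previously completed task with the same name; if one exists,
    # reuse it instead of re-running the step (sketch: name equality is assumed
    # to be enough to consider the step "done").
    existing = Task.get_tasks(
        project_name=project_name,
        task_name=task_name,
        task_filter={"status": ["completed"]},
    )
    if existing:
        print(f"Skipping '{task_name}', already completed as {existing[0].id}")
        return existing[0]
    # Otherwise create and enqueue the task as the pipeline normally would.
    return create_and_enqueue(project_name, task_name)
```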
Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task id is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to set the parent task manually for each one of them, probably with a similar hack, right?
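Concretely, the cloning approach looks roughly like this (a sketch; the template task id, names and queue are placeholders):
```python
from clearml import Task

controller = Task.init(project_name="my_project", task_name="controller")  # placeholder names

# Clone an existing template task so the parent is set to the controller at clone time
template = Task.get_task(task_id="<template_task_id>")  # placeholder id
training = Task.clone(
    source_task=template,
    name="training",
    parent=controller.id,  # no post-hoc hack needed to set the parent
)
Task.enqueue(training, queue_name="default")  # placeholder queue
```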
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO it could be related to the size of the repo. Might be a good idea to clone with --depth 1
in the agents?
Or more generally, try to catch this error and retry a few times?
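Just to illustrate what I mean by retrying (plain Python, not actual clearml-agent code; the command and retry counts are arbitrary):
```python
import subprocess
import time

def clone_with_retries(repo_url, dest, retries=3, delay=10):
    # Naive retry loop around a shallow clone, as a sketch of the behaviour
    # I'd like when the clone fails transiently.
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise
            time.sleep(delay)
```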
Ping CostlyOstrich36 AgitatedDove14 SuccessfulKoala55, just making sure this wasn't missed
Looks like it's a hurray then!
I want the clearml-agent/instance to stop right after the experiment/training is "paused" (experiment marked as stopped + artifacts saved)
sure, will be happy to debug that
That would be awesome, yes; it's just that from my side I have zero knowledge of the pip codebase
I am already trying with the latest version of pip
torch==1.7.1 git+
.
yes, the only thing I changed is:
install_requires=[
    ...
    "my-dep @ git+
]
to:
install_requires=[
    ...
    "git+
"]
Hey FriendlySquid61 ,
I ended up asking for full control of EC2 so as not to be blocked, so unfortunately I cannot give you a more precise list
and this works. However, without the trick from UnevenDolphin73, the following won't work (returns None):
if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()

from clearml import Task
Task.init()
UnevenDolphin73, task = clearml.Task.get_task(clearml.config.get_remote_task_id())
worked, thanks
AgitatedDove14 So I copy-pasted locally the https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py from the examples of pytorch-ignite. Then I added a requirements.txt and called clearml-task
to run it on one of my agents. I adapted the script a bit (removed python-fire since it's not yet supported by clearml).
So I guess the problem is that the following snippet:
from clearml import Task
Task.init()
should be added before the if __name__ == "__main__": ?
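i.e. the layout would become something like this (a sketch; the project/task names, config and run() are placeholders for what is in my script):
```python
from clearml import Task

# Initialize ClearML before the __main__ guard so that Task.current_task()
# returns the task instead of None when the script is executed by the agent.
Task.init(project_name="examples", task_name="cifar10-distributed")  # placeholder names

config = {"batch_size": 64}  # placeholder for the real config

def run():
    pass  # placeholder for the actual training entry point

if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()
```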
I am using 0.17.5; it could be either a bug in ignite or indeed a delay in the send. I will try to build a simple reproducible example to understand the cause
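The reproducible example would be something minimal like this (plain clearml scalar reporting, leaving ignite out, just to check whether scalars show up with a delay; names are placeholders):
```python
import time
from clearml import Task

task = Task.init(project_name="debug", task_name="scalar-delay-repro")  # placeholder names
logger = task.get_logger()

for iteration in range(100):
    # Report a dummy loss value and watch in the UI when the plot appears
    logger.report_scalar(title="loss", series="train", value=1.0 / (iteration + 1), iteration=iteration)
    time.sleep(1)
```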
And after the update, the loss graph appears
Hi SoggyFrog26, https://github.com/allegroai/clearml/blob/master/docs/datasets.md