effectively making us lose 24 hours of GPU compute
Oof, sorry about that, man 😞
FierceHamster54 As long as you are not forking, you need to use Task.init so that the libraries you are using get patched in the child process. You don't need to specify the project_name, task_name or output_uri. You could try locally as well with a minimal example to check that everything works after calling Task.init.
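A minimal sketch of that advice, assuming the clearml package is installed and configured on the machine running the script. The helper name init_clearml_in_child is made up for illustration; only Task.init comes from the discussion above.

```python
def init_clearml_in_child():
    """Call this first thing in the script launched as a separate process (e.g. train.py)."""
    # Imported inside the function so this sketch can be loaded without clearml installed.
    from clearml import Task

    # No project_name, task_name or output_uri, following the advice above.
    # Assumption: the call runs before the training framework starts saving models.
    return Task.init()
```

Calling init_clearml_in_child() at the top of the child script, before any model is created or saved, is the part worth checking locally with a small example.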
The train.py is the default YOLOv5 training file. I initiated the task outside the call; should I go edit their training command-line file?
But the task appeared with the correct name and outputs in the pipeline and the experiment manager
Hi FierceHamster54! Did you call Task.init() in train.py?
The worker Docker image was running Python 3.8 and we are on a PRO tier SaaS deployment. This failed run is from a few weeks ago and we have not run any pipeline since then.
The image OS and the runner OS were both Ubuntu 22, if I remember correctly.
FierceHamster54 I understand. I'm not sure why this happens then 😕 . We will need to investigate this properly. Thank you for reporting this and sorry for the time wasted training your model.
SmugDolphin23 But training.py already has a ClearML task created under the hood since its integration with ClearML. Besides, is initializing the task before the execution of the file, like in my snippet, not sufficient?
One more question FierceHamster54 : what Python/OS/clearml version are you using?
FierceHamster54 "initing the task before the execution of the file like in my snippet is not sufficient ?"
It is not, because os.system spawns a whole different process than the one you initialized your task in, so no patching is done on the framework you are using. Child processes need to call Task.init because of this, unless they were forked, in which case the patching is already done.
"But the training.py has already a ClearML task created under the hood since its integration with ClearML"
Does training.py call functions from the clearml library? If so, which functions and at which stages of the training? Having a task should be enough to save the models appropriately, so something could be bugged in our logging 🫤
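The point about os.system versus forking can be illustrated with the standard library alone. This is only a sketch of the general mechanism, not of ClearML's actual patching: the _patched_by_tracker attribute is a made-up marker standing in for whatever a tracker changes in an imported module, and subprocess.run stands in for a script launched through os.system.

```python
import json
import os
import subprocess
import sys

# Stand-in for what an experiment tracker does at init time: it modifies
# (monkey-patches) an already-imported module inside the current process.
# `_patched_by_tracker` is a hypothetical marker, not a real attribute.
json._patched_by_tracker = True

CHECK = "import json; print(hasattr(json, '_patched_by_tracker'))"


def patch_visible_in_new_process() -> bool:
    """Start a fresh interpreter, like a script launched via os.system, and look for the marker."""
    result = subprocess.run(
        [sys.executable, "-c", CHECK], capture_output=True, text=True, check=True
    )
    return result.stdout.strip() == "True"


def patch_visible_in_forked_child() -> bool:
    """Fork the current process (Unix only) and look for the marker in the child."""
    pid = os.fork()
    if pid == 0:
        # Child: its memory is a copy of the parent's, so the patch is still present.
        os._exit(0 if hasattr(json, "_patched_by_tracker") else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("new process sees patch:", patch_visible_in_new_process())
    if hasattr(os, "fork"):
        print("forked child sees patch:", patch_visible_in_forked_child())
```

The freshly started interpreter imports its own unmodified copy of the module, so the marker is missing there, while the forked child inherits the parent's already-patched state.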
I'm referring to https://clearml.slack.com/archives/CTK20V944/p1668070109678489?thread_ts=1667555788.111289&cid=CTK20V944 (mapping the project to a ClearML project) and https://github.com/ultralytics/yolov5/tree/master/utils/loggers/clearml , which, when calling training.py from my machine, successfully logged the training on ClearML and uploaded the artifact correctly