Answered
Hey all, hope you're having a great day, having an unexpected behavior with a training task of a YOLOv5 model on my pipeline, I specified a task in my training component like this:

Hey all, hope you're having a great day. I'm seeing unexpected behavior with a training task for a YOLOv5 model in my pipeline. I specified a task in my training component like this:
task = Task.init(
    project_name='XXXX',
    task_name=f'Model Retraining {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
    output_uri='<S3 bucket URI>',
)
task.set_progress(0)

print("Training model...")
os.system(train_cmd)
print("✔️ Model trained!")

task.set_progress(100)

The component was marked with a time-out status in the pipeline menu (thus incorrectly setting the pipeline outcome status to `failed`), but the training task itself succeeded (it ran for approx. 24h). However, the artifact was not stored in the specified S3 bucket but on the local machine (path: `file:///root/.clearml/venvs-builds/3.8/task_repository/datasets/default/runs/yolov5s6_results/weights/best.pt`). It cannot be retrieved since it was an auto-scaled compute node, effectively making us lose 24 hours of GPU compute :confused:

The training command was:
train_cmd = f"python train.py --img 1280 --batch 16 --epochs {epochs} --hyp hyp.vinz.yaml --data {os.path.join(training_data_path, 'dataset.yaml')} --weights yolov5s6.pt --cache ram --project {os.path.join(training_data_path, 'runs')} --name yolov5s6_results"

Does anyone have an explanation for this behavior?

  
  
Posted 2 years ago

Answers 13


The train.py is the default YOLOv5 training file. I initiated the task outside the call; should I go and edit their training command-line file?

  
  
Posted 2 years ago

Thank you

  
  
Posted 2 years ago

Hi FierceHamster54! Did you call Task.init() in train.py?

  
  
Posted 2 years ago

effectively making us lose 24 hours of GPU compute

Oof, sorry about that, man 😞

  
  
Posted 2 years ago

FierceHamster54
initing the task before the execution of the file like in my snippet is not sufficient?

It is not, because os.system spawns a whole different process than the one you initialized your task in, so no patching is done on the framework you are using. Child processes need to call Task.init because of this, unless they were forked, in which case the patching is already done.

But the train.py has already a ClearML task created under the hood since its integration with ClearML

Does train.py call functions from the clearml library? If so, what functions and at which stages of the training? Having a task should be enough to save the models appropriately, so something could be bugged in our logging 🫤
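
To illustrate the point about separate processes, here is a minimal sketch using only the standard library. It does not use ClearML at all: the `patched_by_parent` attribute is a made-up stand-in for the in-memory patching a logging library would do, and `subprocess.run` is used instead of `os.system` only so the child's output can be captured. Both start a fresh interpreter that shares none of the parent's in-memory changes.

```python
import json
import subprocess
import sys

# Parent process: simulate a library patching a framework in memory.
# "patched_by_parent" is a hypothetical marker, not a real attribute.
json.patched_by_parent = True

# Code executed by a brand-new Python interpreter (the "child").
CHILD_CODE = "import json; print(hasattr(json, 'patched_by_parent'))"


def child_sees_patch() -> bool:
    """Run a fresh interpreter and report whether it sees the parent's patch."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


parent_patched = hasattr(json, "patched_by_parent")
child_patched = child_sees_patch()
print(f"parent patched: {parent_patched}, child patched: {child_patched}")
# prints "parent patched: True, child patched: False"
```

The child imports its own copy of the module, so anything patched in the parent after startup is simply absent there.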

  
  
Posted 2 years ago

FierceHamster54 As long as you are not forking, you need to call Task.init so that the libraries you are using get patched in the child process. You don't need to specify the project_name, task_name or output_uri. You could also try locally with a minimal example to check that everything works after calling Task.init.
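
A rough sketch of what that could look like inside the script that `os.system` launches. The helper name `init_clearml_in_child` and the idea of calling it at the very top of the script are assumptions for illustration, not something confirmed in this thread; the only ClearML call used is `Task.init()` with no arguments, as suggested above.

```python
def init_clearml_in_child():
    """Call Task.init() in the spawned process so the patching happens here.

    Hypothetical helper: returns the task, or None if clearml is not installed.
    """
    try:
        from clearml import Task
    except ImportError:
        return None
    # No project_name / task_name / output_uri passed, per the advice above.
    return Task.init()


# Assumed usage, near the top of the child script, before training starts:
# task = init_clearml_in_child()
```

Running this locally with a tiny training run first is a cheap way to confirm the model file gets uploaded before committing to a long job.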

  
  
Posted 2 years ago

SmugDolphin23 But the train.py already has a ClearML task created under the hood since its integration with ClearML. Besides, is initing the task before the execution of the file, like in my snippet, not sufficient?

  
  
Posted 2 years ago

But the task appeared with the correct name and outputs in the pipeline and the experiment manager

  
  
Posted 2 years ago

The image OS and the runner OS were both Ubuntu 22, if I remember correctly

  
  
Posted 2 years ago

I'm referring to https://clearml.slack.com/archives/CTK20V944/p1668070109678489?thread_ts=1667555788.111289&cid=CTK20V944 for mapping the project to a ClearML project, and to https://github.com/ultralytics/yolov5/tree/master/utils/loggers/clearml which, when I called train.py from my machine, successfully logged the training on ClearML and uploaded the artifact correctly

  
  
Posted 2 years ago

One more question, FierceHamster54: what Python/OS/clearml version are you using?

  
  
Posted 2 years ago

The worker Docker image was running Python 3.8, and we are running on a PRO tier SaaS deployment. This failed run is from a few weeks ago, and we have not run any pipeline since then

  
  
Posted 2 years ago

FierceHamster54 I understand. I'm not sure why this happens then 😕. We will need to investigate this properly. Thank you for reporting this, and sorry for the time wasted training your model.

  
  
Posted 2 years ago