Hey all, hope you're having a great day, having an unexpected behavior with a training task of a YOLOv5 model on my pipeline, I specified a task in my training component like this:

Answered

Hey all, hope you're having a great day. I'm seeing unexpected behavior with a training task for a YOLOv5 model in my pipeline. I specified a task in my training component like this:
task = Task.init( project_name='XXXX', task_name=f'Model Retraining {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}', output_uri=' ` ',
)
task.set_progress(0)

print("Training model...")
os.system(train_cmd)
print("✔️ Model trained!")

task.set_progress(100)

The component was marked as timed out in the pipeline menu (which incorrectly set the pipeline outcome status to `failed`), but the training task itself succeeded (it ran for approx. 24h). However, the artifact was not stored in the specified S3 bucket but on the local machine (path: `file:///root/.clearml/venvs-builds/3.8/task_repository/datasets/default/runs/yolov5s6_results/weights/best.pt`), and it cannot be retrieved since it was an auto-scaled compute node, effectively making us lose 24 hours of GPU compute :confused:

The training command was:
train_cmd = f"python train.py --img 1280 --batch 16 --epochs ${epochs} --hyp hyp.vinz.yaml --data {os.path.join(training_data_path, 'dataset.yaml')} --weights yolov5s6.pt --cache ram --project {os.path.join(training_data_path, 'runs')} --name yolov5s6_results"

Does anyone have an explanation for this behavior?

  				
Posted 2 years ago by FierceHamster54


Answers 13

The worker Docker image was running Python 3.8, and we are on a PRO tier SaaS deployment. This failed run is from a few weeks ago, and we have not run any pipeline since then.

  				
Posted 2 years ago by FierceHamster54

FierceHamster54 As long as you are not forking, you need to call Task.init in the child process so that the libraries you are using get patched there. You don't need to specify the project_name, task_name or output_uri. You could also try locally with a minimal example to check that everything works after calling Task.init.
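A minimal sketch of what that call could look like inside the script started by os.system. The helper name `init_clearml_in_child` is hypothetical, and where exactly it would be called in train.py is an assumption; the only ClearML call used is Task.init with no arguments, as the answer above suggests.

```python
def init_clearml_in_child():
    """Attach ClearML to the current process, if the package is installed.

    Assumption: this would be called near the top of the script launched by
    os.system (for example train.py), before any framework code saves a model.
    """
    try:
        # Imported lazily so this sketch still loads where clearml is absent.
        from clearml import Task
    except ImportError:
        return None
    # Per the answer above, project_name, task_name and output_uri
    # do not need to be specified in the child process.
    return Task.init()
```

Trying this locally with a tiny script that saves a dummy model would be one way to check that the upload works before committing to a long training run.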

  				
Posted 2 years ago by SmugDolphin23

But the task appeared with the correct name and outputs in the pipeline and the experiment manager.

  				
Posted 2 years ago by FierceHamster54

The train.py is the default YOLOv5 training file, and I initiated the task outside the call. Should I go and edit their training command-line file?

  				
Posted 2 years ago by FierceHamster54

SmugDolphin23 But the training.py already has a ClearML task created under the hood since its integration with ClearML. Besides, is initing the task before the execution of the file, like in my snippet, not sufficient?

  				
Posted 2 years ago by FierceHamster54

"effectively making us lose 24 hours of GPU compute"

Oof, sorry about that, man 😞

  				
Posted 2 years ago by ExasperatedCrab78

FierceHamster54
"initing the task before the execution of the file like in my snippet is not sufficient ?"
It is not, because os.system spawns a whole different process than the one you initialized your task in, so no patching is done on the framework you are using. Child processes need to call Task.init because of this, unless they were forked, in which case the patching is already done.
"But the training.py has already a CLearML task created under the hood since its integration with ClearML"
Does training.py call functions from the clearml library? If so, which functions and at which stages of the training? Having a task should be enough to save the models appropriately, so something could be bugged in our logging 🫤
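The process-isolation point can be illustrated with the standard library alone. The sketch below is not ClearML's actual patching mechanism; it only assumes that a runtime patch is an in-memory change to a module, and it uses `json.dumps` as a stand-in for a framework function. A freshly started interpreter (which is what os.system launches) imports its own unpatched copy.

```python
import json
import os
import subprocess
import sys

# Patch a function in THIS process, loosely mimicking a library that
# hooks into a framework at initialization time.
_original_dumps = json.dumps
json.dumps = lambda obj, **kwargs: "patched"

# Start a brand-new interpreter, as os.system("python ...") would.
child_code = "import json, os; print(json.dumps([1])); print(os.getpid())"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,  # raise if the child exits with a non-zero status
)
child_output, child_pid = result.stdout.splitlines()

parent_output = json.dumps([1])  # still patched in the parent
json.dumps = _original_dumps     # restore the original function

print("parent json.dumps([1]):", parent_output)
print("child  json.dumps([1]):", child_output)
print("separate process:", int(child_pid) != os.getpid())
```

The parent sees the patched function while the child does not, which is why the spawned script has to perform its own initialization. A forked child would instead inherit the parent's memory, including the patch.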

  				
Posted 2 years ago by SmugDolphin23

One more question, FierceHamster54: what Python/OS/clearml version are you using?

  				
Posted 2 years ago by SmugDolphin23

FierceHamster54 I understand. I'm not sure why this happens then 😕. We will need to investigate this properly. Thank you for reporting this, and sorry for the time wasted training your model.

  				
Posted 2 years ago by SmugDolphin23

Hi FierceHamster54! Did you call Task.init() in train.py?

  				
Posted 2 years ago by SmugDolphin23

The image OS and the runner OS were both Ubuntu 22, if I remember correctly.

  				
Posted 2 years ago by FierceHamster54

Thank you

  				
Posted 2 years ago by SmugDolphin23

I'm referring to https://clearml.slack.com/archives/CTK20V944/p1668070109678489?thread_ts=1667555788.111289&cid=CTK20V944 (mapping the project to a ClearML project) and https://github.com/ultralytics/yolov5/tree/master/utils/loggers/clearml. When I called training.py from my machine, it successfully logged the training to ClearML and uploaded the artifact correctly.

  				
Posted 2 years ago by FierceHamster54
