Hey There, Since A Bit I Often Find Experiments Being Stuck While Training A Model. It Seems To Happen Randomly And I Could Not Find A Reproducible Scenario So Far, But It Happens Often Enough To Be Annoying (I'D Say 1 Out Of 5 Experiments). The Symptoms

Answered

Hey there, since a bit I often find experiments being stuck while training a model.
It seems to happen randomly and I could not find a reproducible scenario so far, but it happens often enough to be annoying (I'd say 1 out of 5 experiments).
The symptoms are:
The task is stuck: no more logging, no more CPU/GPU/disk/network activity The task is marked as running. It stays marked as running forever if I don't stop the task manually. The last logs before the experiments is stuck are always about saving a checkpoint at the end of an epoch. We do it with https://pytorch.org/ignite/generated/ignite.contrib.handlers.clearml_logger.html#ignite.contrib.handlers.clearml_logger.ClearMLSaver (pytorch-ignite). See logs below. The task/process/agent are responsive: If I manually abort the task from the WebUI, the task is stopped properly and I do see logged "User aborted: stopping task (1)" and "Process aborted by user". I can then start other tasks on the same agent without any problem.What could be the reason for the task being stuck like that? Inter-process communication? Thread locks?
2022-10-08 01:23:10,692 valid_dataset INFO: Epoch[1] Complete. Time taken: 00:03:36 2022-10-08 01:23:10,693 valid_dataset INFO: Engine run complete. Time taken: 00:03:37 2022-10-08 01:23:10,693 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_zcgp8dow.tmp => s3-artefacts/project/experiment_name.experiment_hash/models/best_checkpoint_0.pt 2022-10-08 01:23:11,041 train INFO: Epoch[28] Complete. Time taken: 00:21:05 2022-10-08 01:23:11,093 - clearml.storage - INFO - Uploading: 5.00MB / 37.65MB @ 12.55MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:11,183 - clearml.storage - INFO - Uploading: 10.00MB / 37.65MB @ 55.48MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:11,205 - clearml.storage - INFO - Uploading: 15.00MB / 37.65MB @ 230.58MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:11,227 - clearml.storage - INFO - Uploading: 20.15MB / 37.65MB @ 238.17MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:11,250 - clearml.storage - INFO - Uploading: 25.15MB / 37.65MB @ 212.19MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:11,282 - clearml.storage - INFO - Uploading: 30.15MB / 37.65MB @ 158.42MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:11,304 - clearml.storage - INFO - Uploading: 35.15MB / 37.65MB @ 220.15MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp 2022-10-08 01:23:12,222 - clearml.Task - INFO - Completed model upload to

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Votes Newest

Answers 12

You mean you "aborted the task" from the UI?

Yes exactly

I'm assuming from the leftover processes ?

Most likely yes, but I don't see how clearml would have an impact here, I am more inclined to think it would be a pytorch dataloader issue, although I don't see why

From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)

yes in venv mode, I'll try with the latest version as well

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

JitteryCoyote63 what's the clearml version ?
Are you always seeing the "model uploaded completed" message ?
What's the python version you are using?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

What is latest rc of clearml-agent? 1.5.2rc0?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Any chance this is reproducible ?

Unfortunately not at the moment, I could find a reproducible scenario. If I clone a task that was stuck and start it, it might not be stuck

How many processes do you see running (i.e. ps -Af | grep python) ?

I will check that when the next one will be blocked 👍

What is the training framework? is it multiprocess ? how are you launching the process itself? is it Linux OS? is it running inside a specific container ?

I train with pytorch (1.11) and ignite (0.4.8), using multiprocess (via the dataloader with n_workers=8) on linux, not running inside a docker container

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

There seems to be a problem with multiprocessing: Although I stopped the task,

You mean you "aborted the task" from the UI?

There is a memory leak somewhere, please see the screenshot of datadog memory consumptionI'm assuming from the leftover processes ?

Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1

From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Most likely yes, but I don't see how clearml would have an impact here, I am more inclined to think it would be a pytorch dataloader issue, although I don't see why

These are most certainly dataloader process. But clearml-agent when killing the process should also kill all subprocesses, and it might be there is something going on that prenets it from killing the subprocesses ...

Is this easily reproducible ? Can you verify it is still the case with the latest RC of clearml-agent ?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Any insight will help, if you can provide the log of the Task that did get stuck, that would be a good start

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Note: Could be related to https://github.com/allegroai/clearml/issues/790 , not sure

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

@<1523701205467926528:profile|AgitatedDove14> I see other rc in pypi but no corresponding tags in the clearml-agent repo? are these releases legit?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Hmm, #790 should be solved in 1.7.2
Yes, I always see the "model uploaded completed" for such stuck tasksAny chance this is reproducible ?
How many processes do you see running (i.e. ps -Af | grep python) ?
What is the training framework? is it multiprocess ? how are you launching the process itself? is it Linux OS? is it running inside a specific container ?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circule back on this issue. This is still relevant and I could collect the following on an ec2 instance running a clearml-agent running a stuck task:

There seems to be a problem with multiprocessing: Although I stopped the task, there are still so many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
There is a memory leak somewhere, please see the screenshot of datadog memory consumptionI don't know yet what to explore first. My first assumption is that this bug is due to a recent version of clearml-sdk/clearml-agent/python/pytorch (the training used to work smoothly couple of months ago). Now I get this problem on all my experiments.
Note: I think the memory is a consequence of the multiprocessing zombie bug, because on some experiments the memory grows but the experiment get stuck before reaching max of mem (see 2nd screenshot of datadog mem consumption). But that's just a hypothesis

Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1

Right now my next steps for debugging would be:

Train without clearml integration -> If works, check which version of clearml sdk/agent is responsible
Train with older version of the training code -> If works, look for guilty code changes
Train with different python/pytorch version

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Hi AgitatedDove14 , sorry somehow this message got lost 😄
clearml version is the latest at the time, 1.7.1 Yes, I always see the "model uploaded completed" for such stuck tasks I am using python 3.8.10

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Write your answer

2K Views

12 Answers

3 years ago

2 years ago