Answered
Hey There, Happy New Year To All Of You

Hey there, happy new year to all of you 🍾
I have several tasks that get stuck while training a model with pytorch/ignite, more precisely right after uploading a checkpoint to S3. These are the last lines of the logs:
```
1609780734331 office:machine DEBUG 2021-01-04 18:18:51,928 valid INFO: Engine run starting with max_epochs=1.
2021-01-04 18:18:51,929 - trains.storage - INFO - Starting upload: /tmp/.trains.upload_model_svfr7_s3.tmp => my-bucket/train-experiments/my_experiement.ae0a73381b53749c7c0b73ee2c3325d5/models/epoch_checkpoint_3.pt
1609780756290 office:machine DEBUG 2021-01-04 18:19:11,575 - trains.Task - INFO - Completed model upload to s3://my-bucket/trains-experiments/my_experiement.ae0a73381b53749c7c0b73ee2c3325d5/models/epoch_checkpoint_3.pt
```
There is no error before this line. The task is still in a Running state, but nothing more happens (no metrics/logs are collected anymore).

Note: If I reset the experiment via the dashboard, the logs and the scalars are successfully cleaned, but the agent keeps running the task (it stays in a Running state) and keeps logging hardware metrics (GPU/CPU/Disk). If I abort the experiment via the dashboard, the agent correctly shuts down the task; I can then reset it and it works as expected.

This has happened 3-5 times already, on different experiments, with trains==0.16.3, trains-agent==0.16.1 and trains-server==0.16.1.
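
For reference, the checkpointing is wired up roughly like this. This is a minimal, simplified sketch rather than the exact code: the model, data, bucket name and handler arguments below are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from ignite.engine import Events, create_supervised_trainer
from ignite.handlers import Checkpoint, DiskSaver
from trains import Task

# output_uri tells trains to upload checkpoints saved during training to S3.
task = Task.init(
    project_name='train-experiments',
    task_name='my_experiment',
    output_uri='s3://my-bucket/trains-experiments',
)

# Toy model and data, standing in for the real training code.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=32, num_workers=4)

trainer = create_supervised_trainer(model, optimizer, criterion)

# ignite saves the checkpoint locally; trains picks up the torch.save call and
# uploads it to S3, which is the "Starting upload ... Completed model upload"
# step visible in the logs above.
checkpoint_handler = Checkpoint(
    {'checkpoint': model},
    DiskSaver('/tmp/checkpoints', create_dir=True, require_empty=False),
    filename_prefix='epoch',
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler)

trainer.run(train_loader, max_epochs=3)
```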

  
  
Posted 3 years ago

Answers 4


Hi AgitatedDove14, thanks for the answer! I will try adding `multiprocessing_context='forkserver'` to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any performance drop using forkserver? If yes, did you test the variant suggested in the pytorch issue? If yes, did it solve the speed issue?

  
  
Posted 3 years ago

Did you experience any performance drop using forkserver?

No, seems to be working properly for me.

If yes, did you test the variant suggested in the pytorch issue? If yes, did it solve the speed issue?

I haven't tested it; that said, it seems like a generic optimization of the DataLoader.

  
  
Posted 3 years ago

Hi AgitatedDove14, so I ran 3 experiments:
- One with my current implementation (using "fork")
- One using "forkserver"
- One using "forkserver" + the DataLoader optimization
I sent you the results via PM; here are the outcomes:
- fork -> 101 mins, low RAM usage (5 GB, constant), almost no IO
- forkserver -> 123 mins, high RAM usage (16 GB, fluctuations), high IO
- forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments. Switching to forkserver doesn't appear to be ideal for my use case because of the high memory consumption. The ideal solution for me would be to keep fork, as it appears to be both fast and low on resource consumption. Would that be possible?

  
  
Posted 3 years ago

Hi JitteryCoyote63
If you want to stop the Task, click Abort (Reset will not stop or restart the task, it will just clear the outputs and let you edit the Task itself).
I think we witnessed something like that due to DataLoader multiprocessing issues, and I think the solution was to add `multiprocessing_context='forkserver'` to the DataLoader: https://github.com/allegroai/clearml/issues/207#issuecomment-702422291
Could you verify?
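For example, something along these lines (a minimal sketch; the dataset, batch size and worker count are placeholders, only the `multiprocessing_context` argument is the relevant part):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; only the multiprocessing_context argument matters here.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    # Start worker processes via forkserver instead of fork(), which can avoid
    # the kind of post-upload hang described above when fork interacts badly
    # with threads already running in the parent process.
    multiprocessing_context='forkserver',
)
```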

  
  
Posted 3 years ago