Hi, I have a problem that I am not really sure how to track down: I sometimes get the following message that kills my running process after a few hours

Answered

Hi, I have a problem that I am not sure how to track down. I sometimes get the following message, which kills my running process after a few hours: `clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###`. This time it happened while I was asleep, so I didn’t do anything. My server was up the whole time. I am running my training in a Docker container on a cluster that I do not manage, and I report to the ClearML Community Server. Has anyone experienced something similar?

  				
Posted 3 years ago
ShallowKitten67

Answers 28

Hi ShallowKitten67 .

Can you send the logs? Can you also share the machine monitoring (from the Scalars section)?

  				
Posted 3 years ago
TimelyPenguin76 (Administrator)

ShallowKitten67 this could happen if you're changing your task's status somewhere in your code - are you?

  				
Posted 3 years ago
SuccessfulKoala55

Here are the machine monitoring scalars. They seem fine to me. I am currently trying to reproduce results from a paper, so I do not tune batch_size etc. to use all available resources.

  				
Posted 3 years ago
ShallowKitten67

Since I do not manage the cluster, I do not have permission to access system logs. In the Docker logs, the last thing that gets printed is the `clearml.Task` WARNING.

  				
Posted 3 years ago
ShallowKitten67

Oh, and I do not change the task’s status in my code. I just create it at the beginning of my training.

```
import clearml

# parser and config_path come from my own training code
configuration = parser.parse(config_path)

task = clearml.Task.init(project_name='Foo',
                         task_name=configuration.name)
```

  				
Posted 3 years ago
ShallowKitten67

ShallowKitten67 are you relying on the automatic reporting (so just creating a task and doing nothing clearml-related afterwards), or are you explicitly calling any clearml methods in your code?

  				
Posted 3 years ago
SuccessfulKoala55

I use TensorBoard and rely on automatic logging for all of my scalar reporting. However, I periodically log some scatter plots using `clearml.Logger.report_plotly`, and I use `report_text` to log some information about training progress to the console.

  				
Posted 3 years ago
ShallowKitten67

OK, those can't cause any issue 🙂

  				
Posted 3 years ago
SuccessfulKoala55

What you're seeing is basically the SDK's response to the Task's status being changed mid-run, or to someone clicking "Stop" in the UI.

  				
Posted 3 years ago
SuccessfulKoala55

There are literally only two things that can cause that specific message to be printed 🙂

  				
Posted 3 years ago
SuccessfulKoala55

What server are you using?

  				
Posted 3 years ago
SuccessfulKoala55

I am using the community server at https://app.community.clear.ml. In my environment I use `clearml==1.0.2`, so I should probably update to the latest version.

  				
Posted 3 years ago
ShallowKitten67

That seems OK

  				
Posted 3 years ago
SuccessfulKoala55

But updating to the latest version is always a good idea 🙂

  				
Posted 3 years ago
SuccessfulKoala55

Is there a way to check how much storage I am using on the community server?

  				
Posted 3 years ago
ShallowKitten67

It's on the way, but not yet possible 🙂

  				
Posted 3 years ago
SuccessfulKoala55

How many experiments do you have?

  				
Posted 3 years ago
SuccessfulKoala55

Currently 38

  				
Posted 3 years ago
ShallowKitten67

Doesn't seem too large 🙂

  				
Posted 3 years ago
SuccessfulKoala55

Yes. Even on my machine, where I store the TensorBoard logs together with additional results (meshes and model checkpoints) of all experiments, I only use about 1 GB.

  				
Posted 3 years ago
ShallowKitten67

It seems like I lost connection during the run of my experiment. But this happened about 200 epochs before the process got terminated.

  				
Posted 3 years ago
ShallowKitten67

Well, there's a watchdog on the server that automatically stops tasks that haven't reported for a long time - I guess that's what happened...

  				
Posted 3 years ago
SuccessfulKoala55

I mean, assuming you lost connection to the server and stopped reporting
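To make the watchdog idea above concrete, here is a minimal sketch of the general pattern: flag any task whose last report is older than some threshold. This is only an illustration and not ClearML's actual server code; the function name `find_stale_tasks`, the task ids, and the two-hour threshold are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the real server's value is not stated in this thread.
NON_RESPONSIVE_AFTER = timedelta(hours=2)


def find_stale_tasks(last_report_by_task, now, threshold=NON_RESPONSIVE_AFTER):
    """Return the ids of tasks whose most recent report is older than `threshold`."""
    return sorted(
        task_id
        for task_id, last_report in last_report_by_task.items()
        if now - last_report > threshold
    )


# Made-up example data: one task that is still reporting, one that went silent.
now = datetime(2021, 6, 1, 12, 0)
last_reports = {
    "task-a": now - timedelta(minutes=5),  # reported 5 minutes ago
    "task-b": now - timedelta(hours=3),    # silent for 3 hours
}

stale = find_stale_tasks(last_reports, now)
print(stale)
```

In this toy model only `task-b` is flagged, because it is the only one that has been silent for longer than the assumed threshold. A task that reconnects and resumes reporting before the threshold passes would not be flagged.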

  				
Posted 3 years ago
SuccessfulKoala55

It seems like I regained connection. At least I can see all values until the task got terminated, and after the HTTPTimeOut warning in my logs the training runs for another 200 iterations (~1.5 hours).

  				
Posted 3 years ago
ShallowKitten67

So that doesn't explain why the task's status was changed...

  				
Posted 3 years ago
SuccessfulKoala55

After some investigation, this might be related to an issue in ClearML SDK 1.0.2 with the subprocesses support - I suggest upgrading to ClearML SDK 1.0.4 🙂

  				
Posted 3 years ago
SuccessfulKoala55

A little update here: it happened again after updating to ClearML SDK 1.0.4, but this time it happened immediately after I lost the HTTP connection. This fits your explanation. Can I suppress this by setting `sdk.development.support_stopping` to `false` in the config?

  				
Posted 3 years ago
ShallowKitten67

Yeah, it should disable this behavior
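As a rough mental model of what that setting changes, the sketch below shows a process deciding whether to stop itself based on the status the server reports for its task. It is a toy model under stated assumptions, not the SDK's implementation: the function `should_abort` is hypothetical, and the status strings are used here only for illustration. The one name taken from this thread is the setting `sdk.development.support_stopping`, which the `support_stopping` argument mirrors.

```python
def should_abort(remote_status, support_stopping=True):
    """Decide whether a running process should stop itself.

    remote_status: status string the server reports for the task
        (illustrative values, assumed for this sketch).
    support_stopping: mirrors the sdk.development.support_stopping setting.
    """
    if not support_stopping:
        # With the setting off, external status changes are ignored.
        return False
    # Any status other than "in_progress" means the status was changed
    # from outside the process (UI "Stop", a watchdog, or other code).
    return remote_status != "in_progress"


print(should_abort("stopped"))                          # setting on
print(should_abort("stopped", support_stopping=False))  # setting off
```

The trade-off in this model is that with the setting turned off, a deliberate "Stop" from the UI would also be ignored, so the process would have to be stopped by other means.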

  				
Posted 3 years ago
SuccessfulKoala55
