I Am Using ClearML Pro And Pretty Regularly I Will Restart An Experiment And Nothing Will Get Logged To ClearML. It Shows The Experiment Running (For Days) And It's Running Fine On The PC But No Scalars Or Debug Samples Are Shown. How Do We Troubleshoot T

Answered

I am using ClearML Pro, and pretty regularly I will restart an experiment and nothing gets logged to ClearML. It shows the experiment as running (for days), and it's running fine on the PC, but no scalars or debug samples are shown.
How do we troubleshoot this?

  				
Posted 8 months ago by ThankfulClams64


Answers 69

    task.connect(model_config)
    task.connect(DataAugConfig)

If these are separate dictionaries, you should probably use two sections:

    task.connect(model_config, name="model config")
    task.connect(DataAugConfig, name="data aug")
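A minimal sketch of the two-section suggestion, assuming each configuration is a plain dictionary. The helper name `connect_config_sections`, the example keys, and the project/task names are hypothetical; only `Task.init` and `Task.connect(..., name=...)` are ClearML calls.

```python
from typing import Any, Dict, Mapping


def connect_config_sections(task: Any, sections: Mapping[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Connect each dictionary to the task under its own section name."""
    connected = {}
    for section_name, config in sections.items():
        # One connect call per dictionary, each with a distinct section name,
        # so the two configurations do not share a single section.
        connected[section_name] = task.connect(config, name=section_name)
    return connected


def example_with_clearml() -> None:
    """Not called automatically: needs the clearml package and a configured server."""
    from clearml import Task

    # Hypothetical project/task names and config values, for illustration only.
    task = Task.init(project_name="debug", task_name="config sections")
    model_config = {"learning_rate": 0.001, "batch_size": 32}
    data_aug_config = {"flip": True, "rotate_degrees": 10}
    connect_config_sections(
        task, {"model config": model_config, "data aug": data_aug_config}
    )
```

Calling `example_with_clearml()` on a machine with ClearML configured should show the two dictionaries under separate sections of the task's configuration.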

It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch, while the remaining scalars seem to be logged by iteration, because the iteration value is 1355 instead of 26.

Wait, so you are seeing some scalars?

while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26

What are you seeing in your TensorBoard?

  				
Posted 8 months ago by AgitatedDove14

It was working for me. Anyway, I modified the callback. Attached is the script that has the issue for me: whenever I add random_image_logger to the callbacks, it only logs some of the scalars for 1 epoch. It then gets stuck and never recovers. When I remove random_image_logger, the scalars are logged correctly. Again, this happens on only 1 computer; on our other computers, logging works perfectly fine.

  				
Posted 7 months ago by ThankfulClams64

Thanks ThankfulClams64, having code that can reproduce it is exactly what we need.
One thing I might have missed, and it is very important: what is your tensorboard package version?
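One way to collect the requested version information is with the standard library alone. This is a sketch: the helper name `installed_versions` is hypothetical, and the package list is an assumption about what is relevant on the affected machine.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages; adjust it to the environment being debugged.
for package, version in installed_versions(["clearml", "tensorboard", "tensorflow"]).items():
    print(f"{package}: {version if version is not None else 'not installed'}")
```

Running this on both the failing machine and a working one makes it easy to compare the two environments side by side.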

  				
Posted 7 months ago by AgitatedDove14

It is not always reproducible. It seems that something we do not understand happens, and from then on the machine consistently has this issue. We believe it has something to do with stopping and starting experiments.

  				
Posted 7 months ago by ThankfulClams64

So I am only seeing values for the first epoch. It seems like it does not track all of them, so maybe something is happening when it tries to log scalars.
I have seen it log only iterations, but setting task.set_initial_iteration(0) seemed to fix that, so it now seems to be logging the correct epoch.
TensorBoard is correct and works. I have never seen an issue in the TensorBoard logs.
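To check whether scalars reach ClearML independently of the TensorBoard binding, one option is to report a few values explicitly with the epoch as the iteration. This is only a sketch under assumptions: the helper name `report_epoch_scalars`, the "manual check" title, the metric values, and the project/task names are hypothetical; `Task.set_initial_iteration`, `Task.get_logger`, `Logger.report_scalar`, and `Task.flush` are ClearML calls.

```python
from typing import Any, Mapping


def report_epoch_scalars(task: Any, epoch: int, metrics: Mapping[str, float],
                         title: str = "manual check") -> int:
    """Report each metric explicitly, using the epoch number as the iteration."""
    logger = task.get_logger()
    for series, value in metrics.items():
        logger.report_scalar(title=title, series=series, value=float(value), iteration=epoch)
    return len(metrics)


def example_with_clearml() -> None:
    """Not called automatically: needs the clearml package and a configured server."""
    from clearml import Task

    # Hypothetical project/task names, for illustration only.
    task = Task.init(project_name="debug", task_name="scalar logging check")
    task.set_initial_iteration(0)  # the workaround described above
    for epoch in range(3):
        # Made-up values; in a real run these would come from training.
        report_epoch_scalars(task, epoch, {"loss": 1.0 / (epoch + 1)})
    task.flush(wait_for_uploads=True)  # push pending reports before exiting
```

If these explicitly reported values appear on the task page while the TensorBoard-derived ones do not, that would suggest the problem lies in the automatic binding rather than in the connection to the server.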

  				
Posted 7 months ago by ThankfulClams64

Console output, and also what you get on the ClearML task page under the console section.

  				
Posted 8 months ago by SuccessfulKoala55

Hi, we are currently having the issue. There is nothing in the console regarding ClearML besides:

ClearML Task: created new task id=0174d5b9d7164f47bd10484fd268e3ff
======> WARNING! Git diff too large to store (3611kb), skipping uncommitted changes <======
ClearML results page:

The console logs continue to come in, but no scalars or debug images show up.

  				
Posted 8 months ago by ThankfulClams64

So even if you abort it at the start of the experiment, it will keep running and reporting logs?

  				
Posted 7 months ago by CostlyOstrich36

Console logs

  				
Posted 7 months ago by CostlyOstrich36
