Hi, I'm Experiencing A New Issue With Logging Scalars And Plots. It Seems That Something Has Changed On Your Side (ClearML) And Scalars And Plots Are Having Trouble Getting Logged (Using The Pro Plan Of ClearML). The Result Is A Full Training Run Without Sc

Answered

Hi, I'm experiencing a new issue with logging scalars and plots. It seems that something has changed on your side (ClearML), and scalars and plots are having trouble getting logged (using the Pro plan of ClearML). The result is a full training run without scalars logged, even though the same code worked perfectly a few days ago. The issue is intermittent: for example, this morning the logging succeeded, but a few hours later new tasks failed to log. Looking at the attached console log, you can see that the order of the log lines doesn't make sense (image fetching appears in the middle of the log, after the epoch results are reported...).
Another example is the attached image, where the logs are fine until epoch 227 but reporting stops afterwards (the run completed 400 epochs).
These issues leave tasks stuck in the "running" state even though training has finished, probably because pending log uploads never complete (we call "task.flush(wait_for_uploads=True)" at the end of our scripts).
Any help with these issues would be much appreciated!

  				
Posted 5 days ago
DangerousBee35


Answers 2

I don't have one; as I said, it is not very reproducible. The same code runs fine one time, and another time (running the exact same experiment) it trains the same but with the logging issues. As I mentioned, IMO it is not related to the code itself but to connectivity with the ClearML servers. I'm running on GCP machines, and this is not the first time I've experienced connectivity issues with ClearML when working on them (we migrated from AWS EC2 a few weeks ago). The first issue was very long task.connect executions (up to several hours to connect a few dictionaries, which should take seconds). Maybe the issue is related to requests from GCP being given low priority?

  				
Posted 4 days ago
DangerousBee35
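One way to gather evidence for the slow task.connect executions described in the answer above is to time each ClearML call and note when it happened. The following is only a minimal sketch using the Python standard library; the helper name `timed`, the `slow_after` threshold and the `sink` parameter are hypothetical names chosen for illustration and are not part of ClearML.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, slow_after=5.0, sink=print):
    """Time the enclosed block and report how long it took.

    `slow_after` (seconds) is an arbitrary threshold picked for this sketch;
    `sink` is any callable that accepts the message string.
    """
    start = time.perf_counter()
    record = {"label": label, "seconds": None, "slow": None}
    try:
        yield record
    finally:
        elapsed = time.perf_counter() - start
        record["seconds"] = elapsed
        record["slow"] = elapsed > slow_after
        suffix = " (SLOW)" if record["slow"] else ""
        sink(f"{label}: {elapsed:.2f}s{suffix}")


# Assumed usage inside a training script, where `task` comes from Task.init
# and `params` is one of the dictionaries being connected:
#
#     with timed("task.connect(params)"):
#         task.connect(params)
```

Collecting these timings on both a good run and a bad run, together with the wall-clock time of each, may make it easier to tell whether the slowdown lines up with specific hours or specific machines.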

Hi DangerousBee35, do you have a stand-alone code snippet that reproduces this behaviour?

  				
Posted 4 days ago
CostlyOstrich36
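A stand-alone snippet of the kind requested here could look roughly like the sketch below. It is not the original poster's code: it reports one synthetic scalar per epoch for 400 epochs and then calls task.flush(wait_for_uploads=True), as the question describes. The project name, task name, synthetic loss curve and the helper `report_epochs` are all assumptions made for illustration. The ClearML import sits inside `main()` so that the reporting loop can be exercised with a stand-in logger on a machine without ClearML.

```python
import math
import time


def report_epochs(logger, epochs, delay=0.0):
    """Report one synthetic 'loss' scalar per epoch.

    `logger` is anything exposing report_scalar(title, series, value, iteration),
    such as the object returned by task.get_logger().
    Returns the list of iterations that were reported.
    """
    reported = []
    for epoch in range(epochs):
        loss = math.exp(-epoch / 50.0)  # synthetic decaying curve, not real training
        logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)
        reported.append(epoch)
        if delay:
            time.sleep(delay)  # optionally mimic the duration of a real epoch
    return reported


def main():
    # Imported here so report_epochs can be tested without ClearML installed.
    from clearml import Task

    # Project and task names are placeholders.
    task = Task.init(project_name="debug", task_name="scalar logging repro")
    report_epochs(task.get_logger(), epochs=400, delay=1.0)
    task.flush(wait_for_uploads=True)  # same call the question uses at script end
    task.close()
```

Calling `main()` on one of the affected GCP machines and again from a different network, then comparing the last iteration that shows up in the web UI, would help separate a code problem from a connectivity problem.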
