Hi Everyone, I'M Using The

Answered

Hi everyone, I'm using the https://api.clear.ml/ server and ran a bunch of experiments using hydra multirun (sequential runs). Many of these experiments appear with status running on clearml even though they have finished running, and not all of the plots got uploaded. Is this because the server is a bit overloaded and is timing out when receiving the logs?

  				
Posted one year ago by AttractiveHawk17


Answers 11

Hi AttractiveCockroach17

"Many of these experiments appear with status running on clearml even though they have finished running"

Could it be that their processes were simply terminated (i.e. not shut down properly)?
How are you running these multiple experiments?
BTW: if the server does not see any change in a Task for a while (I think the default is 2 hours), it will automatically mark the Task as aborted.

  				
Posted one year ago by AgitatedDove14

I'm running them with python my_script.py -m my_parameter=value_1,value_2,value_3 (using hydra multirun)

  				
Posted one year ago by AttractiveHawk17

"So as you say, it seems hydra kills these"

Hmm, let me check in the code; maybe we can somehow hook into it.

  				
Posted one year ago by AgitatedDove14

Okay, so I can't figure out why it would "kill" the new experiments. I mean, it should run them, but is there any "smart stopping" that causes it to kill the process before it ends?
BTW: can this be reproduced with the clearml hydra example?

  				
Posted one year ago by AgitatedDove14
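
One way to check whether a run is being cut short, rather than ending on its own, is to have the process record how it exits. The following is only a minimal sketch using the Python standard library; it assumes each job is an ordinary Python process, and `install_exit_diagnostics` and `crash_log_path` are hypothetical names, not part of ClearML or Hydra.

```python
import atexit
import faulthandler
import logging

log = logging.getLogger(__name__)


def install_exit_diagnostics(crash_log_path):
    """Record how the process ends: clean interpreter exit or a native crash.

    Hypothetical helper for illustration; call it once at the start of the job.
    """
    # The file must stay open for the life of the process so that
    # faulthandler can still write to it when a fatal signal arrives.
    crash_file = open(crash_log_path, "w")
    # Dump Python tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=crash_file)
    # Runs only on a normal interpreter shutdown, not when the process
    # is killed outright (for example with SIGKILL).
    atexit.register(lambda: log.info("interpreter exiting normally"))
    return crash_file
```

If a job's log has no "exiting normally" line and the crash file is empty, the process was most likely killed from outside; a traceback in the crash file points to a native crash instead.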

Each of those runs finished producing 10 plots, but in clearml only one, a few, or none got uploaded.

  				
Posted one year ago by AttractiveHawk17

yes

  				
Posted one year ago by AttractiveHawk17

AttractiveCockroach17, can I assume you are working with the hydra local launcher?

  				
Posted one year ago by AgitatedDove14

It doesn't happen with all the tasks of the multirun, as you can see in the photo.

  				
Posted one year ago by AttractiveHawk17

AttractiveCockroach17, could it be that Hydra actually kills these processes?
(I'm trying to figure out if we can fix something in the hydra integration so that it marks them as aborted.)

  				
Posted one year ago by AgitatedDove14

Indeed, I'm looking at their corresponding multirun output folders: the logs terminate early without any error, and the only plots saved are those in clearml. So as you say, it seems hydra kills these.

  				
Posted one year ago by AttractiveHawk17
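
The manual check described above (looking through the multirun output folders for logs that stop early) could be scripted roughly as follows. This is a sketch under assumptions: it expects one sub-folder per job containing a `.log` file, and it relies on the script logging a final marker line when it completes. The marker text `RUN COMPLETE` and the function name `find_incomplete_jobs` are hypothetical and would need to match your own script.

```python
from pathlib import Path


def find_incomplete_jobs(sweep_dir, marker="RUN COMPLETE"):
    """Return the names of job sub-folders whose .log files never mention `marker`.

    `marker` is an assumed convention: the job script must log it as its last step.
    """
    incomplete = []
    job_dirs = sorted(p for p in Path(sweep_dir).iterdir() if p.is_dir())
    for job_dir in job_dirs:
        logs = list(job_dir.glob("*.log"))
        # A job counts as finished only if at least one of its logs has the marker.
        finished = any(marker in log.read_text(errors="replace") for log in logs)
        if not finished:
            incomplete.append(job_dir.name)
    return incomplete
```

Comparing the returned folder names with the tasks still shown as running on the server would indicate whether the two problems affect the same jobs.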

I don't think it will be reproducible with the hydra example. It was just that I launched about 50 jobs and some of them failed, maybe because of the parameters (strangely, with no error).
But it's OK for now, I guess. I will debug whether the experiments that failed would also fail if run independently.

  				
Posted one year ago by AttractiveHawk17
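
For the independent re-run mentioned above, a small driver can launch each parameter value in its own process and report the exit code, which helps when a job fails without printing an error. This is a minimal standard-library sketch, not something from ClearML or Hydra; `run_independently` is a hypothetical name, and `my_script.py` and `my_parameter` are the placeholders used earlier in the thread.

```python
import subprocess
import sys


def run_independently(script, overrides):
    """Run `script` once per override, each in a separate Python process.

    Returns a dict mapping each override string to the process return code.
    """
    results = {}
    for override in overrides:
        completed = subprocess.run(
            [sys.executable, script, override],
            capture_output=True,
            text=True,
        )
        results[override] = completed.returncode
        if completed.returncode != 0:
            # On POSIX, a negative return code means the process was
            # terminated by the signal with that number.
            print(f"{override}: exit code {completed.returncode}")
            print(completed.stderr[-2000:])  # tail of stderr, if any
    return results
```

For example, `run_independently("my_script.py", ["my_parameter=value_1", "my_parameter=value_2"])` runs the two values one after another without the `-m` multirun flag, so a failure in one cannot be attributed to the sweep itself.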


761 Views
