Hello, we have a self-hosted ClearML server connected to different queues and use it to launch remote experiments (clearml==1.9.3, clearml-agent==1.5.2rc0). It is working really well for us, except for one workflow :) We would like to abort an experiment and e

Answered

Hello,
We have a self-hosted ClearML server connected to different queues and use it to launch remote experiments (clearml==1.9.3, clearml-agent==1.5.2rc0). It is working really well for us, except for one workflow :)
We would like to abort an experiment and enqueue it in another queue with a lower priority using the interface (continuing the same task). We use TensorFlow and TensorBoard with the Keras compile/fit wrapper. The TensorBoard plots look fine after the re-enqueue (training restarts at the last complete epoch/iteration) thanks to initial_epoch, but the ClearML patch over these functions creates an offset like the one mentioned in this issue. We are able to set_initial_iteration to 0 but not get_last_iteration. As I understand it, we could pass continue_last_task=0 to the task init, but I do not see how I could set that when the enqueue is triggered from the interface. Or is there another solution?
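A toy model can make the offset concrete. The sketch below is not ClearML's implementation; it only assumes, as described above, that after a restart the logger adds the task's last reported iteration to every new step, while Keras already resumes its own counter from initial_epoch. The function name, its parameters, and the 1-based step numbering are all hypothetical.

```python
def reported_iterations(initial_epoch, epochs, steps_per_epoch, initial_iteration):
    """Iterations as they would show up in the scalar plots (toy model).

    Keras resumes its own step counter at initial_epoch * steps_per_epoch;
    the logger then adds initial_iteration on top of every reported step.
    """
    first_step = initial_epoch * steps_per_epoch
    last_step = epochs * steps_per_epoch
    return [initial_iteration + step for step in range(first_step + 1, last_step + 1)]


# First run: 2 epochs of 5 steps each, then the task is aborted.
first_run = reported_iterations(
    initial_epoch=0, epochs=2, steps_per_epoch=5, initial_iteration=0
)
last_iteration = first_run[-1]

# Resumed run with an automatic offset: Keras restarts at step 11 and the
# logger adds the previous last iteration on top, so the plot jumps.
with_offset = reported_iterations(2, 4, 5, initial_iteration=last_iteration)

# Resumed run with the offset forced to 0: the plot continues without a gap.
without_offset = reported_iterations(2, 4, 5, initial_iteration=0)

print(last_iteration, with_offset[0], without_offset[0])
```

In this model the resumed run only lines up with the first run when the initial iteration is 0, which is the effect we are after when re-enqueuing.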

  				
Posted 2 years ago by FunnyAlligator17


Answers 18

Done here: None, thanks in advance for your help.

  				
Posted 2 years ago by FunnyAlligator17

Yes, no problem. I will try to explain it correctly, but do not hesitate to add to it or ask for more information.

  				
Posted 2 years ago by FunnyAlligator17

the last iteration is not reset and I still have a gap in my scalars

Hmm, is this reproducible? Can you check with the latest clearml version (1.10.3)?
BTW: I'm assuming continue_last_task=0

I think I found the issue: the fact that the agent is launching the task causes it to ignore the "overridden" set_initial_iteration.

  				
Posted 2 years ago by AgitatedDove14

Yes, it is reproducible. Do you want a snippet? How do you patch the TensorBoard plots to decide the iterations, and where does it use last_iteration?

  				
Posted 2 years ago by FunnyAlligator17

It's available on PyPI: None

  				
Posted 2 years ago by SuccessfulKoala55

Hello,
I had the same problem again, but within a remote pipeline setup. The task launching the pipeline has continue_last_task=0, but I guess this argument is not passed on to the nodes/steps it launches, because when the retry_on_failure of add_function_step is triggered we again start to see the offset in the scalars of the node. Is there a way to inherit the arguments of the base task, or something like that?

  				
Posted 2 years ago by FunnyAlligator17

Hello, thanks. Do not hesitate to tag me on the PR; my GitHub username is MaximeChurin. Once I have tested it, I will let you know.

  				
Posted 2 years ago by FunnyAlligator17

Hi @FunnyAlligator17, apologies, I think v1.10.4rc0, which is already out, contains this fix...

  				
Posted 2 years ago by SuccessfulKoala55

Yes, it is reproducible. Do you want a snippet?

Already fixed 🙂 Please ping me tomorrow; I think an RC with the fix should be out soon.

  				
Posted 2 years ago by AgitatedDove14

I have the same offset (it appears in my scalars after each failure).

Hmm, I actually would think this is the "correct" behavior, but I see your point.
Any chance you can open a GH issue?

  				
Posted 2 years ago by AgitatedDove14

Thank you very much, it works perfectly with the rc!

  				
Posted 2 years ago by FunnyAlligator17

Hello @AgitatedDove14, do you have any update or ETA on this RC?

  				
Posted 2 years ago by FunnyAlligator17

Hi @FunnyAlligator17
What do you mean by:

We are able to set_initial_iteration to 0 but not get_last_iteration.

Are you saying that if your code looks like:

Task.set_initial_iteration(0)
task = Task.init(...)

and you abort and re-enqueue, you still have a gap in the scalars?

  				
Posted 2 years ago by AgitatedDove14

I had the same problem again, but within a remote pipeline setup.

Are you saying the issue is not fixed? Can you verify that the pipeline and the pipeline components are using at least version 1.10.4rc0?
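One way to check this from inside a pipeline component is to compare the installed package version against the release candidate mentioned in this thread. The following is a minimal sketch with a hand-rolled parser that only understands 'X.Y.Z' and 'X.Y.ZrcN' strings; the function names are hypothetical, and where the packaging library is available its Version class is the more robust choice.

```python
import re
from importlib import metadata


def parse_version(text):
    """Parse 'X.Y.Z' or 'X.Y.ZrcN' into a sortable tuple (toy parser)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?", text)
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, patch, rc = match.groups()
    # A final release sorts after every release candidate of the same version.
    is_final = 1 if rc is None else 0
    return (int(major), int(minor), int(patch), is_final, int(rc or 0))


def meets_minimum(installed, minimum="1.10.4rc0"):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def installed_clearml_meets_minimum(minimum="1.10.4rc0"):
    """Check the clearml package installed in the current environment."""
    return meets_minimum(metadata.version("clearml"), minimum)
```

Calling installed_clearml_meets_minimum() at the start of each step and logging the result would show whether the controller and its components run the same, fixed version.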

  				
Posted 2 years ago by AgitatedDove14

Hi!
I do not see it in the repo. Am I missing something?

  				
Posted 2 years ago by FunnyAlligator17

Woot woot, will do!

  				
Posted 2 years ago by AgitatedDove14

Yes, I am saying that no matter whether we call set_initial_iteration(0) and also pass continue_last_task=0 to the task init, if I requeue the task the last iteration is not reset and I still have a gap in my scalars.
Let me know if you need more information about my env/workflow or need a dedicated issue.
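For reference, the setup described here can be sketched roughly as follows. This is only an illustration under assumptions: the helper names and their arguments are hypothetical, and the comment on continue_last_task=0 reflects my reading of the ClearML documentation. According to this thread, the override only took effect for agent-launched tasks from v1.10.4rc0 onward.

```python
def init_task_without_offset(project_name, task_name):
    """Create or continue a ClearML task with the iteration offset forced to 0."""
    # Imported lazily so this sketch can be loaded without clearml installed.
    from clearml import Task

    # continue_last_task=0 asks ClearML to continue the task with an initial
    # iteration offset of 0 instead of the automatic last-iteration offset.
    task = Task.init(
        project_name=project_name,
        task_name=task_name,
        continue_last_task=0,
    )
    # Explicitly reset the offset as well, as tried in this thread.
    task.set_initial_iteration(0)
    return task


def keras_resume_kwargs(last_completed_epoch, total_epochs):
    """Arguments for model.fit so that Keras resumes at the right epoch."""
    return {"initial_epoch": last_completed_epoch, "epochs": total_epochs}
```

With these helpers, the training call would look like model.fit(x, y, **keras_resume_kwargs(3, 10)) after init_task_without_offset(...) has been called, so that Keras alone decides where the step counter resumes.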

  				
Posted 2 years ago by FunnyAlligator17

It is fixed for the single-task workflow (abort, then enqueue), but within a pipeline with retry_on_failure I have the same offset (it appears in my scalars after each failure). Yes, we have clearml==1.11.0.

  				
Posted 2 years ago by FunnyAlligator17
