
Hi, I have a question about task status.

I have a script that runs "forever": it loads (or creates, if it does not exist yet) a specific ClearML task, does some work (in my case, it checks if the database has changed and, if so, dumps it to a file and uploads it as a task artifact, while emitting logs to the CONSOLE tab), then goes to sleep for 24 hours and starts over again.
What happens is that the task initially starts out as "Running" and later becomes "Aborted"; in the INFO tab I see: "STATUS REASON: Forced stop (non-responsive)".
I do call task.mark_started(force=True) at the beginning of each iteration, but it still becomes "Aborted" each time.

As for the side effects: the logs do appear in the CONSOLE tab, but not all of the artifact files appear in the ARTIFACTS tab (or when inspecting task.artifacts). However, all of them can be found and downloaded manually from the fileserver, so maybe this is somehow related to the task being aborted.

Is there a task timeout somewhere I can set to more than 24 hours so the task does not become "Aborted", or some task.keep_alive() method?
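
In code, the loop looks roughly like this (the project/task names and the database helpers are placeholders, not my real ones, and I'm assuming reuse_last_task_id/continue_last_task is the right way to load or create the task):

    import time
    from clearml import Task

    def database_has_changed():
        # placeholder: the real check against the database goes here
        return True

    def dump_database():
        # placeholder: the real dump goes here; returns the path of the dump file
        return "/tmp/db_dump.sql"

    # Load the existing task if there is one, otherwise create it
    task = Task.init(
        project_name="db-backups",        # placeholder project name
        task_name="daily-db-dump",        # placeholder task name
        reuse_last_task_id=True,
        continue_last_task=True,
    )

    while True:
        task.mark_started(force=True)     # called at the start of each iteration
        if database_has_changed():
            dump_path = dump_database()
            task.upload_artifact(name="db-dump", artifact_object=dump_path)
            print("uploaded new dump")    # shows up in the CONSOLE tab
        time.sleep(24 * 60 * 60)          # sleep 24 hours, then start over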

  
  
Posted 11 months ago

Answers 9


After some experimenting, it seems that the situation improves when I call task.mark_started(force=True) before each task.upload_artifact() instead of just once at the beginning of the script.

It seems there are two approaches: either "revive" the task before each upload, or somehow keep it always "Running". Do you have an idea how the second approach can be achieved? (I did not call task.close() or task.mark_*() anywhere.)
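
In other words, the first approach amounts to something like this (continuing the sketch from my original post; the artifact name and path are just examples):

    # Force the task back to "Running" right before every upload,
    # so the upload is not attributed to an aborted task.
    task.mark_started(force=True)
    task.upload_artifact(name="db-dump", artifact_object=dump_path)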

  
  
Posted 11 months ago

Tasks usually time out by default after about 2 hours without any activity. I guess you could just keep the task alive as a process on your machine by printing something once every 30 minutes or every hour.
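
For example, instead of one long time.sleep(), the 24-hour wait could be split into short naps with a print in between (just a sketch, nothing ClearML-specific):

    import time

    def sleep_with_heartbeat(total_seconds=24 * 60 * 60, interval_seconds=30 * 60):
        # Sleep in short chunks and print a line after each chunk, so the
        # task keeps producing console output and is not flagged as non-responsive.
        slept = 0
        while slept < total_seconds:
            nap = min(interval_seconds, total_seconds - slept)
            time.sleep(nap)
            slept += nap
            print(f"keep-alive: {total_seconds - slept} seconds until the next run")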

  
  
Posted 11 months ago

No, it wouldn't, since something would actually be going on and the Python script hasn't finished.

  
  
Posted 11 months ago

Oh, so the task has an internal keep-alive mechanism, and my calling time.sleep() for more than 2 hours prevents it from working?

  
  
Posted 11 months ago

@<1558986867771183104:profile|ShakyKangaroo32> If you just want something to run at a regular interval, have you considered the TaskScheduler?
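
A minimal sketch (the task ID, queue name, and schedule values are placeholders; check the TaskScheduler docs for the exact scheduling semantics):

    from clearml.automation import TaskScheduler

    scheduler = TaskScheduler()
    scheduler.add_task(
        schedule_task_id="<existing-task-id>",  # placeholder: the task to re-launch
        queue="default",                        # placeholder: execution queue
        day=1,                                  # intended: roughly once a day
    )
    scheduler.start()  # or scheduler.start_remotely(queue="services")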

  
  
Posted 11 months ago

You need to separate the Task object itself from the code that is running. If you're manually 'reviving' a task, but then nothing happens and no code is running, the task will eventually get aborted. I'm not sure I entirely understand what you're doing, but I have a feeling you're doing something 'hacky'.

  
  
Posted 11 months ago

@<1576381444509405184:profile|ManiacalLizard2> , thanks, that was my initial solution, but I had some trouble reusing the previously created task for the scheduler when the process that made the call to TaskScheduler.add_task() was interrupted.

  
  
Posted 11 months ago

Hi @<1558986867771183104:profile|ShakyKangaroo32> , can you please elaborate more on what is happening? So you're taking an existing task that finished and forcing it to get 'started' again? Then you write some things to it sometimes and then later you 'revive' it again? And due to this it appears some artifacts are missing?

  
  
Posted 11 months ago

OK, thanks. Just out of curiosity then: suppose you use the task for normal experiment tracking, you call Task.init() at the beginning as usual and train your model, your epochs are longer than 2 hours, and you only print/report things at the end of each epoch. Would this cause the task to abort too?

  
  
Posted 11 months ago