
Hi, I have a question about task status.

I have a script that runs "forever": it loads (or creates, if it does not exist yet) a specific ClearML task, does some work (in my case, it checks if the database has changed and, if so, dumps it to a file and uploads it as a task artifact, while emitting logs to the CONSOLE tab), then goes to sleep for 24 hours and starts over again.
What happens is that the task initially starts out as "Running" and later becomes "Aborted"; in the INFO tab I see: "STATUS REASON: Forced stop (non-responsive)".
I do call task.mark_started(force=True) at the beginning of each iteration, but it still becomes "Aborted" each time.

As for the side effects: the logs do appear in the CONSOLE tab, but not all of the artifact files appear in the ARTIFACTS tab (or when inspecting task.artifacts). However, all of them can be found and downloaded manually from the fileserver, so maybe this is somehow related to the task being aborted.

Is there a task timeout somewhere I can set to more than 24 hours so the task does not become "Aborted", or some task.keep_alive() method?
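
In code, the loop looks roughly like this (the project/task names and the database helpers are placeholders, not my real ones, and I'm assuming reuse_last_task_id/continue_last_task is the right way to load or create the task):

    import time
    from clearml import Task

    def database_has_changed():
        # placeholder: the real check against the database goes here
        return True

    def dump_database():
        # placeholder: the real dump goes here; returns the path of the dump file
        return "/tmp/db_dump.sql"

    # Load the existing task if there is one, otherwise create it
    task = Task.init(
        project_name="db-backups",        # placeholder project name
        task_name="daily-db-dump",        # placeholder task name
        reuse_last_task_id=True,
        continue_last_task=True,
    )

    while True:
        task.mark_started(force=True)     # called at the start of each iteration
        if database_has_changed():
            dump_path = dump_database()
            task.upload_artifact(name="db-dump", artifact_object=dump_path)
            print("uploaded new dump")    # shows up in the CONSOLE tab
        time.sleep(24 * 60 * 60)          # sleep 24 hours, then start over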

  
  
Posted 11 months ago

Answers 9


After some experimenting, it seems that the situation improves when I call task.mark_started(force=True) before each task.upload_artifact() instead of just once at the beginning of the script.

It seems there are two approaches: either "revive" the task before each upload, or somehow keep it always "Running". Do you have an idea how the second approach can be achieved? (I did not call task.close() or task.mark_*() anywhere.)
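
In other words, the first approach amounts to something like this (continuing the sketch from my original post; the artifact name and path are just examples):

    # Force the task back to "Running" right before every upload,
    # so the upload is not attributed to an aborted task.
    task.mark_started(force=True)
    task.upload_artifact(name="db-dump", artifact_object=dump_path)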

  
  
Posted 11 months ago

Tasks usually time out by default after about 2 hours without any activity. I guess you could just keep the task alive as a process on your machine by printing something once every 30 minutes or every hour.
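
For example, instead of one long time.sleep(), the 24-hour wait could be split into short naps with a print in between (just a sketch, nothing ClearML-specific):

    import time

    def sleep_with_heartbeat(total_seconds=24 * 60 * 60, interval_seconds=30 * 60):
        # Sleep in short chunks and print a line after each chunk, so the
        # task keeps producing console output and is not flagged as non-responsive.
        slept = 0
        while slept < total_seconds:
            nap = min(interval_seconds, total_seconds - slept)
            time.sleep(nap)
            slept += nap
            print(f"keep-alive: {total_seconds - slept} seconds until the next run")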

  
  
Posted 11 months ago

No, it wouldn't, since something would actually be going on and the Python script hasn't finished.

  
  
Posted 11 months ago

Oh, so the task has an internal keep-alive mechanism, and my calling time.sleep() for more than 2 hours prevents it from working?

  
  
Posted 11 months ago

@<1558986867771183104:profile|ShakyKangaroo32> If you just want something to run at a regular interval, have you considered the TaskScheduler?
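
A minimal sketch (the task ID, queue name, and schedule values are placeholders; check the TaskScheduler docs for the exact scheduling semantics):

    from clearml.automation import TaskScheduler

    scheduler = TaskScheduler()
    scheduler.add_task(
        schedule_task_id="<existing-task-id>",  # placeholder: the task to re-launch
        queue="default",                        # placeholder: execution queue
        day=1,                                  # intended: roughly once a day
    )
    scheduler.start()  # or scheduler.start_remotely(queue="services")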

  
  
Posted 11 months ago

You need to separate the Task object itself from the code that is running. If you're manually 'reviving' a task, but then nothing happens and no code is running, the task will eventually get aborted. I'm not sure I entirely understand what you're doing, but I have a feeling you're doing something 'hacky'.

  
  
Posted 11 months ago

@<1576381444509405184:profile|ManiacalLizard2> , thanks, that was my initial solution, but I had some trouble reusing the previously created task for the scheduler when the process that made the call to TaskScheduler.add_task() was interrupted.

  
  
Posted 11 months ago

Hi @<1558986867771183104:profile|ShakyKangaroo32> , can you please elaborate more on what is happening? So you're taking an existing task that finished and forcing it to get 'started' again? Then you write some things to it sometimes and then later you 'revive' it again? And due to this it appears some artifacts are missing?

  
  
Posted 11 months ago

OK, thanks. Just out of curiosity then: suppose you use the task for normal experiment tracking, you call Task.init() at the beginning as usual and train your model, your epochs are longer than 2 hours, and you only print/report things at the end of each epoch. Would this cause the task to abort too?

  
  
Posted 11 months ago