After some experimenting, it seems that the situation improves when I call task.mark_started(force=True) before each task.upload_artifact() instead of just once at the beginning of the script.
It seems there are two approaches: either "revive" the task before each upload, or somehow keep it always "Running". Do you have an idea how the second approach can be achieved? (I did not call task.close() or task.mark_*() anywhere.)
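For reference, the first approach ("revive" before each upload) could be wrapped in a small helper. This is only a rough sketch: upload_with_revive is a made-up name, and it assumes task is a ClearML Task object with the two methods mentioned above.

```python
def upload_with_revive(task, name, artifact_object):
    """Force the task back to 'started', then upload one artifact."""
    # mark_started(force=True) is repeated before every upload,
    # rather than being called once at the start of the script.
    task.mark_started(force=True)
    return task.upload_artifact(name=name, artifact_object=artifact_object)


# Hypothetical usage with a real ClearML task (the task ID is a placeholder):
#   from clearml import Task
#   task = Task.get_task(task_id="<your task id>")
#   upload_with_revive(task, "results", {"key": "value"})
```

This does not explain why the task gets aborted in the first place; it only keeps the workaround in one place.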
Oh, so the task has an internal keepalive mechanism and me calling time.sleep() for more than 2 hours prevents it from working?
No, it wouldn't, since something would actually be going on and the Python script hasn't finished.
OK, thanks. Just curious then: suppose you use the task for normal experiment tracking. You call Task.init() at the beginning as usual and train your model, your epochs are longer than 2 hours, and you only print/report stuff at the end of each epoch. Would this cause the task to abort too?
By default, tasks usually time out after 2 hours without any activity. I guess you could just keep the task alive as a process on your machine by printing something once every 30 minutes or every hour.
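A rough sketch of that periodic-print idea, using only the standard library. start_heartbeat is a hypothetical helper name, and whether console output actually counts as activity for the timeout is the guess from the message above, not something confirmed here.

```python
import threading


def start_heartbeat(interval_seconds=1800, message="still alive", emit=print):
    """Call emit(message) every interval_seconds in a daemon thread.

    Returns a threading.Event; call .set() on it to stop the heartbeat.
    """
    stop = threading.Event()

    def _beat():
        # Event.wait() returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            emit(message)

    threading.Thread(target=_beat, daemon=True).start()
    return stop


# Hypothetical usage around a long idle period:
#   stop = start_heartbeat(interval_seconds=1800)
#   ...long-running work or waiting...
#   stop.set()
```

The thread is a daemon, so it will not keep the script running after the main code finishes.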
@<1558986867771183104:profile|ShakyKangaroo32> If you just want something to run at regular intervals, have you considered TaskScheduler?
@<1576381444509405184:profile|ManiacalLizard2> , thanks, that was my initial solution, but I had some trouble with reusing the previously created task for the scheduler when the process that made the call to TaskScheduler.add_task() was interrupted.
Hi @<1558986867771183104:profile|ShakyKangaroo32> , can you please elaborate more on what is happening? So you're taking an existing task that finished and forcing it to get 'started' again? Then you write some things to it sometimes and then later you 'revive' it again? And due to this it appears some artifacts are missing?
You need to separate the Task object itself from the code that is running. If you're manually 'reviving' a task but then nothing happens and no code is running then the task will get aborted eventually. I'm not sure I understand entirely what you're doing but I have a feeling you're doing something 'hacky'.