@ https://app.slack.com/team/UT8T0V3NE is there a paid (non-free) version that supports preempting lower priority tasks to allow a higher priority task to come in?
Thanks for the answer!
the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
Yes, i basically plan to use ClearML as user-friendly cluster manager
Regarding Task pollution, when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes.
Nice idea, i will try it out!
preempting lower priority tasks to allow a higher priority task to come in
Well this is usually outside of the scope of "single researcher" / "tiny team"...
This is typically a large-scale problem
That said, it will be fairly easy to write a service that aborts Tasks, tags them to be "continued", then later (at night?!) pushes them back into a queue... wdyt?
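A minimal sketch of such a service, under stated assumptions: the tag name `continue-later` and the helper names are hypothetical, and `mark_stopped(force=True)` is assumed to be enough to make the agent abort the running Task. The helpers only rely on `add_tags`, `mark_stopped`, `Task.get_tasks` and `Task.enqueue` from the ClearML SDK.

```python
from typing import Callable, Iterable, List

CONTINUE_TAG = "continue-later"  # hypothetical tag name, not a ClearML built-in


def preempt(tasks: Iterable, tag: str = CONTINUE_TAG) -> List[str]:
    """Tag each task for later continuation, then abort it. Returns the task IDs."""
    preempted = []
    for task in tasks:
        task.add_tags([tag])           # tag first, so the task can be found again later
        task.mark_stopped(force=True)  # assumption: this is enough to abort a running Task
        preempted.append(task.id)
    return preempted


def requeue(tasks: Iterable, queue_name: str, enqueue: Callable) -> int:
    """Push previously preempted tasks back into a queue. Returns how many were pushed."""
    count = 0
    for task in tasks:
        enqueue(task, queue_name=queue_name)
        count += 1
    return count


def nightly_requeue(queue_name: str = "default") -> int:
    """Example wiring against a real ClearML server (needs `pip install clearml`)."""
    from clearml import Task  # imported lazily so the helpers above work without clearml

    # Only pick up tagged tasks that are still in the aborted ("stopped") state.
    stopped = Task.get_tasks(tags=[CONTINUE_TAG], task_filter={"status": ["stopped"]})
    return requeue(stopped, queue_name, Task.enqueue)
```

`nightly_requeue()` could be run from a cron job or a ClearML services Task; which tasks get preempted in the first place is a policy decision this sketch leaves out.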
AgitatedDove14 let me reach out to my pocket there 😉
Yes, i basically plan to use ClearML as user-friendly cluster manager
and it is 🙂
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never used, and then have the DDP master process push into the high-priority queue, which will ensure these are the next Tasks to be executed (now the only thing that is missing is preemption of running Tasks, but this automation policy is unfortunately not part of the open-source)
wdyt?
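As a rough illustration of the high-priority queue idea: the sketch below clones the master Task once per extra node and enqueues the clones. The queue name `ddp_high_priority`, the function name, and the `task_api` parameter (there only so the helper can be exercised without a server) are all hypothetical. It also assumes every agent lists the high-priority queue first.

```python
HIGH_PRIORITY_QUEUE = "ddp_high_priority"  # hypothetical queue name; must already exist

# Assumption: each agent serves the high-priority queue before the regular one, e.g.
#   clearml-agent daemon --queue ddp_high_priority default


def enqueue_workers(master_task, num_workers, task_api=None,
                    queue_name=HIGH_PRIORITY_QUEUE):
    """Clone the master Task once per extra node and push the clones to the priority queue."""
    if task_api is None:
        from clearml import Task as task_api  # lazy import; needs `pip install clearml`

    workers = []
    for rank in range(1, num_workers + 1):
        worker = task_api.clone(
            source_task=master_task,
            name=f"{master_task.name} [rank {rank}]",
        )
        task_api.enqueue(worker, queue_name=queue_name)
        workers.append(worker)
    return workers
```

Because nothing else pushes into that queue, the clones become the next Tasks any free agent picks up. Already-running Tasks are not interrupted.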
This task is picked up by the first agent; it runs the DDP launch script for itself and then creates clones of itself with task.create_function_task() and passes its address as argument to the function
Hi UnevenHorse85
Interesting use case, just for my understanding, the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
passes its address as argument to the function
This seems like a great solution.
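A hedged sketch of passing the master address through `task.create_function_task()`. The function names, the port `29500`, and the use of the standard `MASTER_ADDR` / `MASTER_PORT` / `RANK` / `WORLD_SIZE` environment variables for the rendezvous are assumptions. The returned Tasks are assumed to still need enqueueing.

```python
import os
import socket


def ddp_env(master_addr, master_port, rank, world_size):
    """Environment variables PyTorch's env:// rendezvous reads."""
    return {
        "MASTER_ADDR": str(master_addr),
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def worker_entry(master_addr, master_port, rank, world_size):
    """Runs on a worker node: export the rendezvous settings, then start training."""
    os.environ.update(ddp_env(master_addr, master_port, rank, world_size))
    # Placeholder for the real work, e.g.:
    #   import torch.distributed as dist
    #   dist.init_process_group(backend="nccl")
    #   ... training code ...


def spawn_workers(task, world_size, master_addr=None, master_port=29500):
    """Called on the master (rank 0): create one function Task per remaining rank."""
    if master_addr is None:
        master_addr = socket.gethostbyname(socket.gethostname())
    workers = []
    for rank in range(1, world_size):
        workers.append(task.create_function_task(
            func=worker_entry,
            func_name=f"ddp_worker_{rank}",
            task_name=f"ddp worker {rank}",
            # everything below is forwarded to worker_entry as keyword arguments
            master_addr=master_addr,
            master_port=master_port,
            rank=rank,
            world_size=world_size,
        ))
    return workers
```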
the queue is polluted with lots of cloned tasks that have to be aborted manually, and the whole job requires only ...
I wouldn't say the queue pollution is the issue (or the multiple copies of the cloned Tasks), I think the main issue here is that the allocated nodes have to wait until all nodes are allocated, no?
Regarding Task pollution, when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes. This way if something goes wrong in one of the nodes, you have full visibility, but when everything works, you end up with a clean single copy.
wdyt?
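The cleanup step could be sketched like this, assuming the master keeps the list of worker Task objects it created. `cleanup_workers` is a hypothetical helper that relies only on `get_status()` and `delete()`.

```python
def cleanup_workers(workers, done_status="completed"):
    """Delete the worker Tasks only if every one of them finished cleanly.

    Returns the number of Tasks deleted (0 if anything failed or is still running,
    so the copies stay around for debugging).
    """
    workers = list(workers)
    if any(w.get_status() != done_status for w in workers):
        return 0
    deleted = 0
    for w in workers:
        if w.delete():  # delete() reports whether the Task was actually removed
            deleted += 1
    return deleted
```

Calling it as the last step of the master process leaves a single Task when everything worked, and the full set of copies when something went wrong.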
```
from clearml import Task

task = Task.init(...)
# assume model checkpoint
if task.models['output']:
    # get the latest checkpoint
    model_file_or_path = task.models['output'][-1].get_local_copy()
    # load the model checkpoint
# run training code
```
RoughTiger69 Would the above work for you?
RoughTiger69 yes I think "Scale" tier covers it 😉
AgitatedDove14 looks like service-writing-time for me!
PS can you point me to some official example/ doc for how to persist/restore state so that tasks are restartable?
looks like service-writing-time for me!
Nice!
persist/restore state so that tasks are restartable?
You mean if you write preemption-ready training code ?
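For illustration, a preemption-ready loop in that sense could look roughly like the sketch below. It uses `pickle` and a local file as a stand-in for `torch.save` / `torch.load` and a ClearML output model. The state layout, function names and the `stop_after` knob (which only simulates an abort) are hypothetical.

```python
import os
import pickle
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if path and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": None}


def save_state(state, path):
    """Write to a temp file and rename, so an abort mid-write cannot corrupt the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def train(path, total_epochs, stop_after=None):
    """Resume from the last checkpoint and save after every epoch."""
    state = load_state(path)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulates the Task being aborted
        state["weights"] = f"weights-after-epoch-{epoch}"  # placeholder for a real training step
        state["epoch"] = epoch + 1
        save_state(state, path)
    return state
```

The point is only that the loop starts from `state["epoch"]` rather than 0, so a Task that is aborted and later pushed back into a queue continues where it stopped.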