Hi SoreSparrow36
Of course, it's fully integrated. Here's a link to the docs: https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_deployment/#slurm
The main advantage is the ability to launch and control jobs from outside the SLURM cluster: from simple pipelines, to logging console output and performance metrics, to aborting jobs directly from ClearML, as well as storing outputs.
Wdyt?
I'm helping train my friend on ClearML to assist with his astrophysics research,
If that's the case, what you can do is use the agent inside your sbatch script (fully open source). This means the sbatch script essentially becomes "clearml-agent execute --id <task_id_here>". This will set up the environment and monitor the job, and still allow you to launch it from SLURM. Wdyt?
Ah, that's a shame it's under Enterprise only. No wonder I missed it.
I'm helping train my friend FlutteringSeahorse49 on ClearML to assist with his astrophysics research, and his university has a SLURM cluster. So we're trying to figure out if we can launch an agent process on the cluster to pull work from the ClearML queue (fwiw: containers are not supported on their cluster).
FlutteringSeahorse49 wants to start HPO though, so the desire is to deploy agents that listen to queues on the SLURM cluster (perhaps with the controller running on his laptop).
would that still make sense?
Sorry SmallTurkey79, just noticed your reply.
Hmm, so I know the Enterprise version has built-in support for SLURM, which would remove the need to deploy agents on the SLURM cluster.
What you can do is, on the SLURM login server (i.e. a machine that can run sbatch), write a simple script that pulls the next Task ID from the queue and calls sbatch with "clearml-agent execute --id <task_id_here>". Would this be a good solution?
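Just to make it concrete, here's a rough sketch of such a login-node script. The queue name "hpo_queue" is a placeholder, and the dequeue call (queues.get_next_task via the Python APIClient) should be double-checked against the ClearML API reference:
```bash
#!/bin/bash
# Rough sketch for the SLURM login node (a machine that can call sbatch).
# Assumptions: clearml + clearml-agent are installed and clearml.conf holds
# valid credentials; "hpo_queue" is a placeholder queue name.

QUEUE_NAME="hpo_queue"

# Ask the ClearML server for the next pending Task ID in the queue
# (queues.get_next_task removes the entry from the queue); verify the exact
# client call and response shape against the API reference.
TASK_ID=$(python3 - "$QUEUE_NAME" <<'EOF'
import sys
from clearml.backend_api.session.client import APIClient

client = APIClient()
queues = client.queues.get_all(name=sys.argv[1])   # resolve queue name -> queue object
if queues:
    resp = client.queues.get_next_task(queue=queues[0].id)
    entry = getattr(resp, "entry", None)
    if entry:
        print(entry.task)
EOF
)

if [ -n "$TASK_ID" ]; then
    # Submit a SLURM job that recreates the environment and runs the task via the agent
    sbatch --job-name="clearml_${TASK_ID}" --wrap="clearml-agent execute --id ${TASK_ID}"
else
    echo "No pending tasks in queue '${QUEUE_NAME}'"
fi
```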
But isn't that just the same as running the agent in daemon mode? That's what I was hoping James could do.
I think he's saying you'd want an intermediary layer that acts like the daemon.
Why not run the daemon directly, I'm not sure, but I suspect it's because the daemon doesn't have an "end time" for execution (it stays up).
Would this be equivalent to an automated job submission from ClearML to the cluster?
yes exactly
I am looking for a setup which allows me to essentially create the workers and start the tasks from a SLURM script
Hmm I see, basically the SLURM admins are afraid you will create a script that clogs the SLURM cluster, hence no automated job submission. So you want to use SLURM for the "time on cluster" allocation, and then, when your time is allocated, use ClearML for the job submission. Is that correct?
If so, then do exactly as SmallTurkey79 suggested: run the clearml-agent daemon as a SLURM batch job. Basically the daemon can run your jobs automatically, but from a SLURM perspective you are still limited to the time slot allocated for you. Also notice you can spin up multiple clearml-agent daemons, so that you can run multiple jobs on the same node.
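For example, a minimal sbatch script for that could look like the sketch below; the queue name and resource directives are placeholders, and --foreground just keeps the agent attached to the SLURM job so SLURM can account for it and kill it when the allocation ends:
```bash
#!/bin/bash
#SBATCH --job-name=clearml-daemon
#SBATCH --time=12:00:00
#SBATCH --gres=gpu:1

# Run the agent daemon itself as the SLURM job: for the duration of the
# allocation it keeps pulling Tasks from the queue and running them on this
# node; when the time slot ends, SLURM terminates it.
clearml-agent daemon --queue hpo_queue --foreground
```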
AgitatedDove14 Would this be equivalent to an automated job submission from ClearML to the cluster? My cluster security rules do not allow for automated job submission. I am looking for a setup which allows me to essentially create the workers and start the tasks from a SLURM script, with ClearML simply receiving the information about the workers and sending information to the cluster regarding allotment of the tasks, but without ClearML explicitly sending the work to the cluster. Let me know if this makes sense, or maybe I am misunderstanding what you're saying above.
The difference is that running the agent in daemon mode means the "daemon" itself is a job in SLURM.
What I was saying is: pull jobs from the ClearML queue and then push them as individual SLURM jobs. Does that make sense?