Oh, try adding --full-monitoring to the clearml-agent execute command.
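Here is a minimal sketch of the SBATCH script from the question with that flag added. It assumes `clearml-agent` is installed and configured (`clearml.conf`) on the compute node. With `--full-monitoring`, the agent is expected to report the Task's stdout/stderr back to the ClearML server, so the console appears in the web UI and not only in the `.out`/`.err` files. `run_task`, `TASK_ID` and `DRY_RUN` are hypothetical names added for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=test_worker
#SBATCH --output=./logs/test_worker_%j.out
#SBATCH --error=./logs/test_worker_%j.err

# Hypothetical helper: run one ClearML Task on this node with full monitoring,
# so the agent also forwards the console output to the ClearML server.
run_task() {
  local task_id="$1"
  if [[ -z "$task_id" ]]; then
    echo "usage: run_task <task-id>" >&2
    return 1
  fi
  local cmd=(clearml-agent execute --full-monitoring --id "$task_id")
  # DRY_RUN=1 (hypothetical) prints the command instead of running it.
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "${cmd[*]}"
    return 0
  fi
  "${cmd[@]}"
}

# TASK_ID is a hypothetical variable supplied at submission time.
if [[ -n "${TASK_ID:-}" ]]; then
  run_task "$TASK_ID"
fi
```

Submit it with `TASK_ID=<task-id> sbatch run_task.sh`. By default sbatch exports the submitting shell's environment, so `TASK_ID` is visible inside the job.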
Hey @<1523701205467926528:profile|AgitatedDove14>, I think the 'execute' function from the clearml-agent is great. I've been testing/using it for a few days, and, while it's a little more hands-on, it has been an amazing workaround for us uni students who have no budget 😂. I've been using clearml-agent execute <job_id> to run jobs on an HPC node. However, with this method I am not able to see the console output in the web UI. I've been defining this in my SBATCH script:
#!/bin/bash
#SBATCH --job-name=test_worker
#SBATCH --output=./logs/test_worker_%j.out
#SBATCH --error=./logs/test_worker_%j.err
The only way I can see the logs is by manually logging into our HPC and viewing them directly with cat, tail, etc. Do you know if there's some way to redirect this output back into the web UI? Is there some API call in the docs I'm overlooking? Once again, thanks for all your help!
Is the clearml-agent queue not available in the open source?
It is fully available in the open source. What is missing is the SLURM connection: in the open source version, the daemon is installed per machine (node) and spins up containers/venvs on that machine. The enterprise version adds support for using SLURM to provision the node. I hope it helps 🙂
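To illustrate the open-source model described above (one daemon per machine, spinning up venvs/containers on that machine), one possible sketch is to run the daemon itself inside a Slurm allocation, so the allocated node serves a queue until the job's time limit. This assumes `clearml-agent` is installed and configured on the compute nodes. It is not the enterprise SLURM integration: Slurm allocates one node for the whole window rather than one job per Task, and the allocation sits idle when the queue is empty. The queue name `hpc_queue`, the `start_daemon` helper and `DRY_RUN` are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=clearml_daemon
#SBATCH --time=04:00:00
#SBATCH --output=./logs/clearml_daemon_%j.out
#SBATCH --error=./logs/clearml_daemon_%j.err

# Hypothetical queue name; create the queue in the ClearML UI first.
QUEUE="${QUEUE:-hpc_queue}"

# Hypothetical helper: start the open-source agent daemon on this node.
# No --detached flag: the daemon stays in the foreground, so the batch job
# keeps running until Slurm ends it at the --time limit.
start_daemon() {
  local queue="$1"
  local cmd=(clearml-agent daemon --queue "$queue")
  # DRY_RUN=1 (hypothetical) prints the command instead of running it.
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "${cmd[*]}"
    return 0
  fi
  "${cmd[@]}"
}

# Only start the daemon inside a Slurm job (Slurm sets SLURM_JOB_ID).
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
  start_daemon "$QUEUE"
fi
```

Submit it with `sbatch clearml_daemon.sh`, then enqueue cloned Tasks to `hpc_queue` from the UI while the job is running.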
so do you think it would be possible to spin up another daemon, which listens to this daemon, which then runs a slurm job?
This is exactly what the enterprise version does. I think there is some built-in assumption that only enterprises use SLURM.
I want to emphasize that I do not mean to undermine your enterprise tier, but I am just trying to work within what my university provides, which means I have to use our HPC resources.
Yep, totally with you. SLURM is very university-HPC oriented 🙂 This is why I suggested the srun + clearml-agent execute approach, wdyt?
Hi @<1600661428556009472:profile|HighCoyote66>
However, we need to allocate resources to ourselves manually, using an srun command or sbatch
Long story short, there is a full SLURM integration: you push a job into the ClearML queue, and it produces a SLURM job that uses the agent to set up the venv/container and run your Task. Unfortunately, this is only part of the enterprise version 😞
You can, however, do the following (note this is pseudo code; I probably have a typo in the srun command):
- Clone your Task in the UI
- Copy the new Task ID
- Run: srun clearml-agent execute --id <task-id-here>
This uses SLURM to allocate the job and clearml-agent to set up the environment automatically and run your code (with the ability to override arguments from the UI, as you regularly would). The missing part is, of course, the integration with the queue system and the automation (which unfortunately is not part of the open source).
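The manual steps above could be wrapped in a small script that also passes resource requests to srun. This is a sketch under assumptions: the resource values are placeholders (`--gres=gpu:1` only works if your cluster defines a `gpu` GRES), `srun_task` and `DRY_RUN` are hypothetical names, and `--full-monitoring` is taken from the other suggestion in this thread so the console output also reaches the web UI.

```shell
#!/bin/bash
# Placeholder resource requests; adjust to your cluster's partitions and limits.
SRUN_OPTS=(--cpus-per-task=4 --mem=16G --time=02:00:00 --gres=gpu:1)

# Hypothetical helper: allocate resources with srun and let clearml-agent
# set up the environment and run the (already cloned) Task.
srun_task() {
  local task_id="$1"
  if [[ -z "$task_id" ]]; then
    echo "usage: srun_task <task-id>" >&2
    return 1
  fi
  local cmd=(srun "${SRUN_OPTS[@]}" clearml-agent execute --full-monitoring --id "$task_id")
  # DRY_RUN=1 (hypothetical) prints the command instead of running it.
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "${cmd[*]}"
    return 0
  fi
  "${cmd[@]}"
}

# Run only when the script is executed directly with a Task ID argument.
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  srun_task "$1"
fi
```

After cloning the Task in the UI and copying its ID, run `./srun_task.sh <task-id>`. With `DRY_RUN=1`, the script only prints the srun command it would execute.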
Oh I see, I was confused about what "Agent Orchestration" meant on the website. Is the clearml-agent queue not available in the open source?
I see that you can do clearml-agent daemon --queue, so do you think it would be possible to spin up another daemon that listens to this daemon and then runs a SLURM job?
I want to emphasize that I do not mean to undermine your enterprise tier, but I am just trying to work within what my university provides, which means I have to use our HPC resources.