I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single-node tasks (high priority + wait) or the multi-node tasks (wait for a time when there are enough available agents and only then allocate the resource).
I actually meant NCCL. nvcc is the CUDA compiler
NCCL communication can be both inter- and intra-node
I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived
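The barrier-with-a-counter idea can be sketched with Python's stdlib `threading.Barrier` (here threads stand in for agents, and `WORLD_SIZE` is an assumed gang size - not ClearML's API, just an illustration of the scheduling point):

```python
import threading

WORLD_SIZE = 4  # assumed number of agents the multi-node task needs

# Each agent blocks on the barrier; the job body runs only once all arrive.
barrier = threading.Barrier(WORLD_SIZE)
started = []

def agent(rank: int) -> None:
    # ...agent-local setup would happen here...
    barrier.wait()  # gang-scheduling point: nobody proceeds until all arrive
    started.append(rank)  # stands in for "start the distributed job"

threads = [threading.Thread(target=agent, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The counter is implicit in the barrier: it releases everyone at once only when the `WORLD_SIZE`-th waiter shows up, which avoids partially-allocated multi-node jobs.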
not really... what do you mean by "free" agent?
Let's start with a simple setup: multi-node DDP in PyTorch
another question - when running a non-dockerized agent and setting CLEARML_AGENT_SKIP_PIP_VENV_INSTALL, I still see things being installed when the experiment starts. Why does that happen?
just seems a bit cleaner and more DevOps/k8s-friendly to work with the container version of the agent
great!
Is there a way to add this for an existing task's draft via the web UI?
Thanks AgitatedDove14 . I'll try that
You mean running everything on a single machine (manually)?
Yes, but not limited to this.
I want to be able to install the venv on multiple servers and start the "simple" agents on each one of them. You can think of it as a kind of one-off agent for a specific (distributed) hyperparameter search task
I'm trying to achieve a workflow similar to the one in wandb for parameter sweeps, where there are no venvs involved other than the one created by the user
lol great hack. I'll check it out.
Although I'd be really happy if there was a solution in which I can just spawn an ad-hoc worker
It's a very convenient way of doing a parameter sweep with minimal setup effort
Of course conda needs to be installed, it is using a pre-existing conda env, no?! What am I missing?
it's not a conda env, just a regular venv (poetry in this specific case)
And the assumption is the code is also there ?
yes. The user is responsible for the entire setup. The agent just executes `python <path to script> <current hpo args>`
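The "no-setup" worker described here boils down to: pop one experiment's arguments and exec the user's script in the current environment. A minimal sketch with the stdlib (the queue is a hypothetical stand-in for a dedicated HPO queue, and `-c` stands in for the user's script path):

```python
import subprocess
import sys

# Hypothetical stand-in for a dedicated HPO queue: each entry is one
# experiment's argument list (in reality these would come from the server).
hpo_queue = [["--lr", "0.1"], ["--lr", "0.01"]]

def run_in_place(script_args):
    """Execute `python <script> <args>` in the current interpreter's
    environment, with no venv creation or repo cloning."""
    proc = subprocess.run(
        # "-c ..." replaces the user's script path for this self-contained sketch
        [sys.executable, "-c", "import sys; print(sys.argv[1:])", *script_args],
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip()

outputs = [run_in_place(args) for args in hpo_queue]
```

Because `sys.executable` is whatever interpreter the user activated (poetry venv included), nothing gets installed - the experiment simply runs in place.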
Can you elaborate on what you would do with it? Like an OS environment variable to disable the entire setup itself? Will it clone the code base?
It will not do any setup steps. Ideally it would just pull an experiment from a dedicated HPO queue and run it in place
How does this work with HPO? Are the tasks generated in advance?
the hack doesn't work if conda is not installed
I just don't fully understand the internals of an HPO process. If I create an Optimizer task with a simple grid search, how do the different tasks know which arguments were already dispatched if the arguments are generated at runtime?
AgitatedDove14 Just to check that I understood correctly - in an HPO task, all subtasks (each a specific parameter combination) are created and pushed to the relevant queue the moment the main (HPO) task is created?
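For a pure grid search that up-front scheme is easy to picture: every combination can be enumerated the moment the HPO task starts, so no dispatch bookkeeping is needed at runtime. A sketch with `itertools.product` (the search space and queue names are illustrative, not ClearML's API):

```python
from itertools import product

# Illustrative search space: two hyperparameters, a 2x2 grid.
space = {"lr": [0.1, 0.01], "batch_size": [32, 64]}

# A grid strategy can materialize every combination up front and push
# each one as a subtask to the execution queue when the HPO task starts.
keys = list(space)
queue = [dict(zip(keys, combo)) for combo in product(*space.values())]
```

Since the whole grid is enqueued at creation time, workers only ever pop pre-generated argument sets - there is no shared runtime state to coordinate.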
that was my next question
How does this design work with a stateful search algorithm?
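One common answer (my assumption, not necessarily what ClearML does) is to dispatch in rounds: push a batch of subtasks, wait for their results, update the optimizer's state, then generate the next batch from that state. A toy sketch with a trivial "sample around the best point" state:

```python
import random

random.seed(0)

def objective(x):
    # stand-in for the metric a completed subtask reports back
    return (x - 0.3) ** 2

def stateful_search(rounds=3, batch=4):
    """Round-based scheme: dispatch a batch, collect results,
    then let the algorithm's state shape the next batch."""
    best_x, best_y = 0.5, objective(0.5)  # the search's state
    for _ in range(rounds):
        # next batch is sampled around the current best (state-dependent)
        candidates = [best_x + random.uniform(-0.2, 0.2) for _ in range(batch)]
        results = [(x, objective(x)) for x in candidates]  # "run" the subtasks
        x, y = min(results, key=lambda r: r[1])
        if y < best_y:  # update state only on improvement
            best_x, best_y = x, y
    return best_x, best_y

best_x, best_y = stateful_search()
```

The trade-off versus the all-up-front grid is throughput: each round must (at least partially) finish before the next batch of tasks can be generated and queued.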
An easier fix for now would probably be some kind of warning to the user that a task is created but not connected
Oops, I used `create` instead of `init`
I think so. IMHO all API calls should maybe reside in a different module since they usually happen inside some control code
JitteryCoyote63 I still don't understand which CUDA version you are actually using on your machine
note that the CUDA version readout was only recently added to the nvidia-smi output