I'm trying to achieve a workflow similar to the one in wandb
for parameter sweep where there are no venvs involved other than the one created by the user 😅
Can you elaborate on what you would do with it? Like an OS environment variable that disables the entire setup itself? Will it clone the code base?
It will not do any setup steps. Ideally it would just pull an experiment from a dedicated HPO queue and run it in place
the hack doesn't work if conda is not installed 😞
how does this work with HPO?
the tasks are generated in advance?
that was my next question 🙂
How does this design work with a stateful search algorithm?
note that the cuda driver was only recently added to nvidia-smi
this is the cuda driver api. you need libcudart.so
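A quick, illustrative way to check whether `libcudart.so` is actually resolvable on a machine (the helper name below is made up for this sketch; it works even where CUDA isn't installed, returning nothing in that case):

```python
# Illustrative sketch: check whether the CUDA runtime library (libcudart)
# is visible on the dynamic loader's search path. On machines without
# CUDA, find_library() simply returns None -- no crash.
import ctypes.util

def find_cudart():
    # Searches the same locations the dynamic loader uses (ldconfig cache,
    # standard library directories, etc.)
    return ctypes.util.find_library("cudart")

if __name__ == "__main__":
    path = find_cudart()
    if path:
        print(f"libcudart found: {path}")
    else:
        print("libcudart not found on the library search path")
```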
AgitatedDove14 , I'm running an agent inside a docker container (using the image on dockerhub) and mounted the host's docker socket so the agent can start sibling containers. How do I set the config for this agent? Some options can be set through env vars but not all of them 😞
just to be clear, multiple CUDA runtime versions can coexist on a single machine, and the only thing that points to the one you are using when running an application is the library search path (which can be set either with LD_LIBRARY_PATH, or, preferably, by creating a file under /etc/ld.so.conf.d/ which contains the path to your cuda directory and then executing ldconfig)
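A hedged sketch of that ld.so.conf.d approach (the CUDA install path and file name below are assumptions, adjust to your version):

```shell
# Point the dynamic loader at a specific CUDA runtime directory.
# /usr/local/cuda-11.8/lib64 is an assumption -- use your actual path.
echo "/usr/local/cuda-11.8/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig               # rebuild the loader cache
ldconfig -p | grep cudart   # verify libcudart is now resolvable
```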
I think so. IMHO all API calls should maybe reside in a different module since they usually happen inside some control code
An easier fix for now will probably be some kind of warning to the user that a task is created but not connected
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single node tasks (high priority + wait) or multi-node tasks (wait for a time when there are enough available agents and only then allocate the resource).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra- node
not really... what do you mean by "free" agent?
so you don't have cuda installed 🙂
another question - when running a non-dockerized agent with CLEARML_AGENT_SKIP_PIP_VENV_INSTALL set, I still see things being installed when the experiment starts. Why does that happen?
conda sets up cuda itself, I think
Let's start with a simple setup: multi-node DDP in PyTorch
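For reference, a hedged sketch of launching that setup with torchrun (node count, rendezvous address, and script name below are all assumptions):

```shell
# Two-node DDP launch sketch. 10.0.0.1:29400 and train.py are placeholders.
# On node 0:
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
         --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29400 train.py
# On node 1 (identical except --node_rank):
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
         --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29400 train.py
```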
just seems a bit cleaner and more DevOps/k8s friendly to work with the container version of the agent 🙂
oops. I used create instead of init 😳
as a workaround I just stick the epoch number in the series argument of report_scatter2d, with the same title name
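A minimal sketch of that workaround (the helper name and title are made up for illustration; `logger` is expected to be a ClearML `Logger`, e.g. `Logger.current_logger()`):

```python
# Sketch of the workaround: encode the epoch in the `series` argument so
# every epoch's scatter lands under the same title in the plots tab.

def report_epoch_scatter(logger, title, epoch, points, iteration=0):
    """Report a 2D scatter, one series per epoch, all under one title."""
    logger.report_scatter2d(
        title=title,
        series=f"epoch_{epoch}",   # epoch number carried in the series name
        iteration=iteration,
        scatter=points,            # list of [x, y] pairs
    )
```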
You mean running everything on a single machine (manually)?
Yes, but not limited to this.
I want to be able to install the venv on multiple servers and start the "simple" agents on each one of them. You can think of it as some kind of one-off agent for a specific (distributed) hyperparameter search task
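The per-server setup described above might look roughly like this (a hedged sketch; the venv path, queue name, and dependency list are all assumptions):

```shell
# On each server: prepare a venv once, then run a "bare" agent in it.
python3 -m venv ~/sweep-venv
source ~/sweep-venv/bin/activate
pip install clearml-agent            # plus your training dependencies

# Reuse the active interpreter instead of letting the agent build a venv,
# then pull tasks from a dedicated sweep queue (name is an assumption).
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
clearml-agent daemon --queue my-sweep --foreground
```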