Of course conda needs to be installed, it is using a pre-existing conda env, no? What am I missing?
It's not a conda env, just a regular venv (Poetry in this specific case).
And the assumption is that the code is also there?
Yes. The user is responsible for the entire setup; the agent just executes python <path to script> <current hpo args>
Regardless, it would be very convenient to add a flag to the agent that points it to an existing virtual environment and bypasses the entire setup process. This would facilitate ramping up new ClearML users who don't want the bells and whistles and just want a simple HPO from an existing env (which may not even be part of a git repo).
Can you elaborate on what you would do with it? Like an OS environment variable that disables the entire setup itself? Will it clone the code base?
It will not do any setup steps. Ideally it would just pull an experiment from a dedicated HPO queue and run it in place.
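Something along these lines, conceptually (a sketch only, not ClearML agent code; "train.py" and the argument lists are made up):
```
# Conceptual sketch only -- NOT ClearML agent code. The idea: pull pending HPO
# trials from a dedicated queue and execute them in the current, pre-existing
# venv, with no setup steps. "train.py" and the argument lists are hypothetical.
import subprocess
import sys
from queue import Queue

def run_pending(trials: Queue) -> None:
    while True:
        hpo_args = trials.get()     # e.g. ["--lr", "0.01", "--batch-size", "64"]
        if hpo_args is None:        # sentinel: queue drained
            break
        # run the user's script in place, with this venv's own interpreter
        subprocess.run([sys.executable, "train.py", *hpo_args], check=True)
```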
Let's start with a simple setup: multi-node DDP in PyTorch.
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single-node tasks (high priority + wait) or the multi-node tasks (waiting until there are enough available agents and only then allocating the resource).
I actually meant NCCL. nvcc is the CUDA compiler.
NCCL communication can be both inter- and intra-node.
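For reference, a minimal sketch of the multi-node DDP setup being discussed, assuming the script is launched with torchrun (which sets the rank and address environment variables); the toy model is just a placeholder:
```
# Minimal multi-node DDP skeleton using the NCCL backend. A sketch only:
# assumes launch via torchrun, which sets RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")       # NCCL handles inter- and intra-node comms
local_rank = int(os.environ["LOCAL_RANK"])    # provided by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)   # toy model as a placeholder
model = DDP(model, device_ids=[local_rank])
# ... training loop would go here ...
dist.destroy_process_group()
```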
I'm not working with TensorFlow. I'm using SummaryWriter from torch.utils.tensorboard, specifically add_pr_curve: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_pr_curve
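For anyone following along, a minimal usage example of that call (synthetic labels and predictions, purely illustrative):
```
# Log a PR curve to TensorBoard from PyTorch (synthetic data for illustration).
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
labels = np.random.randint(2, size=100)    # binary ground truth
predictions = np.random.rand(100)          # predicted probabilities in [0, 1]
writer.add_pr_curve("pr_curve/val", labels, predictions, global_step=0)
writer.close()
```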
This is the CUDA driver API; you need libcudart.so, the CUDA runtime library.
Note that the CUDA version was only recently added to the nvidia-smi output.
Can you initialize a tensor on the GPU?
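e.g. a quick sanity check like this (minimal sketch):
```
# Quick sanity check that PyTorch can actually see and use the GPU.
import torch

print(torch.cuda.is_available())    # False means no usable CUDA runtime/driver
print(torch.version.cuda)           # CUDA version this PyTorch build was compiled against
x = torch.ones(3, device="cuda")    # raises here if CUDA isn't set up correctly
print(x)
```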
cuDNN isn't CUDA; it's a separate library.
Are you running in Docker or on bare metal? You should have CUDA installed at /usr/local/cuda-<>
JitteryCoyote63 I still don't understand which CUDA version you are actually using on your machine.
I think conda sets up CUDA.
Just to be clear, multiple CUDA runtime versions can coexist on a single machine, and the only thing that determines which one you are using when running an application is the library search path (which can be set either with LD_LIBRARY_PATH or, preferably, by creating a file under /etc/ld.so.conf.d/ which contains the path to your CUDA directory and executing ldconfig).
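A quick way to check which libcudart the dynamic linker currently resolves (a sketch assuming a Linux box; it just filters the ldconfig cache):
```
# Print every libcudart entry in the dynamic linker's cache (Linux only),
# i.e. which CUDA runtime an application would pick up right now.
import subprocess

cache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
for line in cache.splitlines():
    if "libcudart" in line:
        print(line.strip())
```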
Try:
sudo updatedb
locate libcudart
The hack doesn't work if conda is not installed.
I think so. IMHO all API calls should maybe reside in a different module since they usually happen inside some control code
That was my next question.
How does this design work with a stateful search algorithm?
So you don't have CUDA installed.
You mean running everything on a single machine (manually)?
Yes, but not limited to this.
I want to be able to install the venv on multiple servers and start the "simple" agents on each one of them. You can think of it as some kind of one-off agent for a specific (distributed) hyperparameter search task.