this is pretty weird. PL should only save from rank==0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L394
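For reference, a minimal stand-alone sketch of that rank-0 guard (hypothetical, not PL's actual code from the link above):
```python
import torch
import torch.distributed as dist

def save_checkpoint(model, path="checkpoint.pt"):
    # Only rank 0 writes to disk, so N DDP workers don't race on the same file
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    # Keep the other ranks from moving on before the file exists
    if dist.is_initialized():
        dist.barrier()
```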
AgitatedDove14 , I'm running an agent inside a docker container (using the image on dockerhub) and mounted the host's docker socket into it so the agent can start sibling containers. How do I set the config for this agent? Some options can be set through env vars but not all of them 😞
Can you elaborate on what you would do with it? Like an OS environment variable to disable the entire setup itself? Will it clone the code base?
It will not do any setup steps. Ideally it would just pull an experiment from a dedicated HPO queue and run it in place.
the hack doesn't work if conda is not installed 😞
JitteryCoyote63 I still don't understand which CUDA version you are actually using on your machine
just to be clear, multiple CUDA runtime versions can coexist on a single machine, and the only thing that determines which one an application uses is the library search path (which can be set either with LD_LIBRARY_PATH or, preferably, by creating a file under /etc/ld.so.conf.d/ that contains the path to your CUDA directory and running ldconfig)
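A quick way to check from Python what you're actually picking up (a sketch; note torch.version.cuda is the toolkit PyTorch was built against, not necessarily the one on your system path):
```python
import ctypes.util
import torch

print("PyTorch built against CUDA:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
# Which libcudart the dynamic loader would resolve on the current search path
print("libcudart on the search path:", ctypes.util.find_library("cudart"))
```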
cuDNN isn't CUDA, it's a separate library.
are you running in docker or on bare metal? you should have CUDA installed at /usr/local/cuda-<>
not really... what do you mean by "free" agent?
note that the CUDA version reported by the driver was only recently added to the nvidia-smi output
I just don't fully understand the internals of an HPO process. If I create an Optimizer task with a simple grid search, how do the different tasks know which argument combinations have already been dispatched, given that the arguments are generated at runtime?
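For example, a hypothetical controller-side sketch - the optimizer itself keeps a record of every combination it has already dispatched, so the worker tasks never need to coordinate with each other:
```python
import itertools

param_grid = {"lr": [0.1, 0.01, 0.001], "batch_size": [32, 64]}
dispatched = set()  # combinations already pushed to the queue

def next_point():
    # Deterministic enumeration plus a dispatched-set means a point is
    # handed out at most once, even though points are generated lazily
    for combo in itertools.product(*param_grid.values()):
        key = tuple(zip(param_grid.keys(), combo))
        if key not in dispatched:
            dispatched.add(key)
            return dict(key)
    return None  # grid exhausted
```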
sounds great.
BTW the code is now working out-of-the-box. Just 2 magic lines - import + Task.init
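i.e. (project/task names here are just placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="my experiment")
```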
try:
```
sudo updatedb
locate libcudart
```
this is the CUDA driver API (libcuda.so); you need the runtime library, libcudart.so
lol great hack. I'll check it out.
Although I'd be really happy if there was a solution in which I can just spawn an ad-hoc worker 🙂
I'm trying to achieve a workflow similar to wandb's parameter sweeps, where no venvs are involved other than the one created by the user 😅
I'm not working with tensorflow. I'm using SummaryWriter from torch.utils.tensorboard, specifically add_pr_curve:
https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_pr_curve
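For reference, the minimal usage from those docs:
```python
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
labels = np.random.randint(2, size=100)  # binary ground-truth labels
predictions = np.random.rand(100)        # predicted probabilities
writer.add_pr_curve("pr_curve", labels, predictions, global_step=0)
writer.close()
```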
Let's start with a simple setup: multi-node DDP in PyTorch.
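Roughly, each process would do something like this (a sketch assuming a launcher such as torchrun sets the usual rendezvous env vars; the model is a placeholder):
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp():
    # The launcher sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
    # and LOCAL_RANK for every process across all nodes
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    return DDP(model, device_ids=[local_rank])
```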
just seems a bit cleaner and more DevOps/k8s friendly to work with the container version of the agent 🙂
so you don't have CUDA installed 🙂
AgitatedDove14 Just to see that I understood correctly - in an HPO task, are all subtasks (each a specific parameter combination) created and pushed to the relevant queue the moment the main (HPO) task is created?