this is pretty weird. PL should only save from rank == 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L394
I'm not working with TensorFlow. I'm using SummaryWriter from torch.utils.tensorboard. Specifically add_pr_curve:
https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_pr_curve
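roughly what I'm doing (a minimal sketch with fake data; the tag name is just a placeholder):

```python
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

# fake binary ground truth and predicted probabilities, just for illustration
labels = np.random.randint(2, size=100)  # 0/1 ground truth
predictions = np.random.rand(100)        # predicted probability of class 1

# logs a full precision-recall curve for this step
writer.add_pr_curve("pr_curve/val", labels, predictions, global_step=0)
writer.close()
```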
cuDNN isn't CUDA; it's a separate library.
are you running in docker or on bare metal? you should have CUDA installed at /usr/local/cuda-<>
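a quick way to sanity-check what torch actually sees (just a sketch, not clearml-specific):

```python
import torch

# CUDA toolkit version torch was built against (not necessarily what's in /usr/local)
print("torch CUDA:", torch.version.cuda)
# cuDNN ships with the torch wheels, separately from the CUDA toolkit
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```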
not really... what do you mean by "free" agent?
oops. I used create instead of init 😳
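i.e. what I should have written (minimal sketch, assuming this is clearml's Task API; project/task names are made up):

```python
from clearml import Task

# Task.init registers the current process as the running experiment;
# Task.create only creates a task entry without attaching to this process
task = Task.init(project_name="examples", task_name="my experiment")
```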
I think so. IMHO all API calls should probably live in a separate module, since they usually happen inside some control code
The legacy version worked just before I mv'ed the folder, but now (after reverting to the old name) it doesn't work either 😢
as a workaround I just stick the epoch number in the series argument of report_scatter2d, with the same title
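i.e. roughly this (untested sketch; titles/series are placeholders):

```python
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="scatter workaround")
logger = task.get_logger()

for epoch in range(3):
    scatter = np.random.rand(50, 2)  # (x, y) pairs
    # same title every epoch, epoch number baked into the series name
    logger.report_scatter2d(
        title="embedding",
        series=f"epoch {epoch}",
        scatter=scatter,
        iteration=0,
        xaxis="x",
        yaxis="y",
        mode="markers",
    )
```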
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single node tasks (high priority + wait) or multi-node tasks (wait for a time when there are enough available agents and only then allocate the resource).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra-node
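e.g. the same init call works whether ranks share a node or not (sketch; assumes the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE env vars are set, e.g. by torchrun):

```python
import torch.distributed as dist

# NCCL routes over NVLink/PCIe between ranks on one node
# and over the network (e.g. InfiniBand) across nodes
dist.init_process_group(backend="nccl", init_method="env://")
print("rank", dist.get_rank(), "of", dist.get_world_size())
dist.destroy_process_group()
```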
so you don't have CUDA installed 🙂
just seems a bit cleaner and more DevOps/k8s friendly to work with the container version of the agent 🙂
I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived
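conceptually something like this (a toy single-process sketch with threads standing in for agents; a real version would need the counter in a shared store like redis):

```python
import threading

NUM_AGENTS = 4  # gang size: how many agents the multi-node task needs
barrier = threading.Barrier(NUM_AGENTS)

def agent(rank: int) -> None:
    print(f"agent {rank} arrived, waiting for the gang")
    barrier.wait()  # blocks until NUM_AGENTS agents have arrived
    print(f"agent {rank} released, starting the multi-node task")

threads = [threading.Thread(target=agent, args=(i,)) for i in range(NUM_AGENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```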
Can you elaborate on what you would do with it? Like an OS environment variable that disables the entire setup? will it clone the code base?
It will not do any setup steps. Ideally it would just pull an experiment from a dedicated HPO queue and run it in place
JitteryCoyote63 I still don't understand which CUDA version you are actually using on your machine
this is the CUDA driver API. you need libcudart.so
the hack doesn't work if conda is not installed 😞
I just don't fully understand the internals of an HPO process. If I create an Optimizer task with a simple grid search, how do different tasks know which arguments were already dispatched if the arguments are generated at runtime?
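for context, this is roughly what I mean by an Optimizer task (a sketch based on my reading of clearml.automation; the task id, queue, and parameter names are placeholders):

```python
from clearml import Task
from clearml.automation import DiscreteParameterRange, GridSearch, HyperParameterOptimizer

task = Task.init(project_name="examples", task_name="HPO controller")

optimizer = HyperParameterOptimizer(
    base_task_id="<base task id>",  # the template experiment to clone per trial
    hyper_parameters=[
        DiscreteParameterRange("General/lr", values=[1e-4, 1e-3, 1e-2]),
        DiscreteParameterRange("General/batch_size", values=[32, 64]),
    ],
    objective_metric_title="val",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=GridSearch,
    execution_queue="default",  # queue the cloned trials are pushed to
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```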
Regardless, it would be very convenient to add a flag to the agent that points it to an existing virtual environment and bypasses the entire setup process. This would make it easier to ramp up new users to clearml who don't want the bells and whistles and just want a simple HPO from an existing env (which may not even be part of a git repo)