
An easier fix for now will probably be some kind of warning to the user that a task is created but not connected
so you don't have cuda installed 🙂
note that the cuda driver was only recently added to nvidia-smi
sounds great.
BTW the code is working now out-of-the-box. Just 2 magic lines: import + Task.init
this is pretty weird. PL should only save from rank==0 :
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L394
try: sudo updatedb && locate libcudart
I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived
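A rough sketch of the barrier-with-a-counter idea, using plain Python threads as stand-ins for agents. The names run_gang, agent and work are made up for illustration and are not part of any agent API; a real setup would need a barrier shared across machines rather than threading.Barrier.

```python
import threading


def run_gang(num_agents, work):
    """Start num_agents workers; none runs work() until all have arrived."""
    barrier = threading.Barrier(num_agents)
    lock = threading.Lock()
    arrived = 0          # the counter: how many agents reached the barrier
    results = {}

    def agent(agent_id):
        nonlocal arrived
        with lock:
            arrived += 1
        # Blocks until num_agents threads have called wait()
        # (raises BrokenBarrierError if the timeout expires first).
        barrier.wait(timeout=30)
        with lock:
            seen = arrived
        value = work(agent_id, seen)
        with lock:
            results[agent_id] = value

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Each worker reports how many agents had arrived when it started working.
    print(run_gang(4, lambda agent_id, seen: seen))
```

Every worker should see the full count, since the barrier only releases once the last agent has checked in.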
the conda sets up cuda I think
not really... what do you mean by "free" agent?
The legacy version worked just before I mv'ed the folder, but now (after reverting to the old name) that doesn't work either 😢
as a workaround I just stick the epoch number in the series argument of report_scatter2d, with the same title name
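Roughly what that workaround looks like, as a sketch. report_epoch_scatter is a hypothetical helper, and RecordingLogger is a stand-in so the snippet runs without a server; with ClearML you would pass the real logger (e.g. Logger.current_logger()) instead. The fixed iteration=0 default is an assumption, not something stated above.

```python
def report_epoch_scatter(logger, title, epoch, points, iteration=0):
    """Report one scatter series per epoch under a shared title."""
    # The epoch number goes into the series name; the title stays the same,
    # so each epoch becomes its own series on the same plot.
    series = f"epoch {epoch}"
    logger.report_scatter2d(
        title=title,
        series=series,
        scatter=points,        # list of [x, y] pairs
        iteration=iteration,   # assumption: kept constant across epochs
        mode="markers",
    )
    return series


class RecordingLogger:
    """Hypothetical stand-in that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def report_scatter2d(self, **kwargs):
        self.calls.append(kwargs)


if __name__ == "__main__":
    fake = RecordingLogger()
    for epoch in range(3):
        report_epoch_scatter(fake, "embeddings", epoch, [[0, epoch], [1, epoch + 1]])
    print([call["series"] for call in fake.calls])
```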
just to be clear, multiple CUDA runtime versions can coexist on a single machine, and the only thing that determines which one you are using when running an application is the library search path (which can be set either with LD_LIBRARY_PATH or, preferably, by creating a file under /etc/ld.so.conf.d/ which contains the path to your cuda directory and executing ldconfig)
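A quick way to see what the search paths currently resolve to, as a sketch using only the Python standard library. The helper names are made up for illustration; ctypes.util.find_library may return None if libcudart is not on any search path, and its output differs between machines.

```python
import ctypes.util
import os


def ld_library_path_entries(env=None):
    """Return the non-empty entries of LD_LIBRARY_PATH, in search order."""
    env = os.environ if env is None else env
    raw = env.get("LD_LIBRARY_PATH", "")
    return [p for p in raw.split(":") if p]


def find_cudart():
    """Ask the system's library lookup for the CUDA runtime (libcudart)."""
    # On Linux this consults the ldconfig cache, among other places.
    return ctypes.util.find_library("cudart")


if __name__ == "__main__":
    print("LD_LIBRARY_PATH entries:", ld_library_path_entries())
    print("cudart resolved to:", find_cudart())
```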
oops. I used create instead of init 😳
JitteryCoyote63 I still don't understand what is the actual CUDA version you are using on your machine
I think so. IMHO all API calls should maybe reside in a different module since they usually happen inside some control code
this is the cuda driver api. you need libcudart.so
lol great hack. I'll check it out.
Although I'd be really happy if there was a solution in which I can just spawn an ad-hoc worker 🙂
that was my next question 🙂
How does this design work with a stateful search algorithm?
just seems a bit cleaner and more DevOps/k8s friendly to work with the container version of the agent 🙂