cuDNN isn't CUDA, it's a separate library.
are you running in docker or on bare metal? you should have cuda installed at /usr/local/cuda-<>
try: sudo updatedb, then locate libcudart
this is the cuda driver api. you need libcudart.so
can you initialize a tensor on the GPU?
so you don't have cuda installed 🙂
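a quick way to check, assuming PyTorch is installed (a minimal sketch, not specific to any setup):
```python
import torch

# False here means the CUDA runtime/driver combo isn't usable at all
print(torch.cuda.is_available())

# this raises immediately if no usable CUDA runtime is found
x = torch.ones(3, device="cuda")
print(x)
```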
just to be clear, multiple CUDA runtime versions can coexist on a single machine, and the only thing that points to which one you are using when running an application is the library search path (which can be set either with LD_LIBRARY_PATH or, preferably, by creating a file under /etc/ld.so.conf.d/ that contains the path to your cuda directory and then executing ldconfig)
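fwiw, a quick way to see what the search path actually resolves to from Python, as a sketch using ctypes (on Linux, find_library consults the same ldconfig cache):
```python
import ctypes
import ctypes.util

# reflects which libcudart an application would pick up via the
# ld cache / library search path
name = ctypes.util.find_library("cudart")
print(name)  # e.g. "libcudart.so.11.0", or None if nothing is found
if name:
    ctypes.CDLL(name)  # load it to confirm it resolves cleanly
```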
conda sets up cuda, I think
The legacy version worked just before I mv'ed the folder, but now (after reverting to the old name) it doesn't work either 😢
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single-node tasks (high priority + wait) or the multi-node tasks (wait for a time when there are enough available agents and only then allocate the resources).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra-node
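a minimal single-process sketch of selecting the NCCL backend via torch.distributed (the address, port, rank, and world_size here are placeholder values for illustration):
```python
import os
import torch.distributed as dist

# the "nccl" backend handles both intra-node (NVLink/PCIe) and
# inter-node (network) GPU-to-GPU communication
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)
print(dist.get_backend())  # "nccl"
dist.destroy_process_group()
```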
this is pretty weird. PL should only save from rank==0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L394
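the usual rank-zero guard looks roughly like this, as a sketch of the pattern rather than PL's actual code (save_checkpoint is a made-up helper):
```python
import torch
import torch.distributed as dist

def save_checkpoint(model, path):
    # only rank 0 writes, so concurrent ranks don't clobber the file
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    # keep the other ranks from racing ahead while rank 0 writes
    if dist.is_initialized():
        dist.barrier()
```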
I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived
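as a toy illustration of that counter idea (in-process threads standing in for agents; a real deployment would need a shared counter, e.g. in the backend DB, instead of an in-process Barrier):
```python
import threading

world_size = 4  # assumed number of agents the multi-node task needs
gate = threading.Barrier(world_size)

def agent(rank):
    print(f"agent {rank} waiting at the barrier")
    gate.wait()  # blocks until world_size agents have arrived
    print(f"agent {rank} starting the multi-node task")

threads = [threading.Thread(target=agent, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```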
oops. I used create instead of init 😳
I think so. IMHO all API calls should reside in a separate module, since they usually happen inside some control code
An easier fix for now would probably be some kind of warning to the user that a task was created but not connected
sounds great.
BTW the code is working out-of-the-box now. Just 2 magic lines: the import + Task.init
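i.e. something like this (assuming the clearml package; project/task names are placeholders):
```python
from clearml import Task

# the two "magic lines": the import plus Task.init are enough to
# hook automatic logging into the run
task = Task.init(project_name="examples", task_name="my experiment")
```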
not really... what do you mean by "free" agent?