The queue will be empty when I run the task
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to solve this issue after upgrading clearml to 1.12.2, but my training/val loss became NaN after the update
Thanks @<1523701205467926528:profile|AgitatedDove14> . I just ran into an issue running clearml-task remotely. It had been working fine before today, but now every time I run clearml-task the task shows as pending, and after waiting 3 hours the status is still pending. The autoscaler kept charging the hourly rate even though the task had been pending for 3 hours. In the console log of the ClearML GPU instance, I saw it was listening to the queue, but there was no log output even after 3 hours. There is not...
Hi @<1523701087100473344:profile|SuccessfulKoala55> , I just tried starting an EC2 instance manually and pulling the docker image. It was able to pull the image without hitting the "no space left" issue
@<1523701087100473344:profile|SuccessfulKoala55> Thanks, it is the AMI issue
but it is still not able to run any task after I abort and rerun another task
@<1523701205467926528:profile|AgitatedDove14> I'm trying to run ClearML GPU compute (RTX 3080) with pytorch-lightning but keep getting a CUDA error. Is there a specific CUDA/Ubuntu/torch/python version required? I tried several different versions but can't make it work
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 as telos_algorithms
File "/code/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1013, in _run_stage
with isolate_rng():
Fi...
the gpu argument is actually inside my example.yaml:
defaults:
- default.yaml
accelerator: gpu
devices: 1
I've added gpu:True to my hydra config file but the GPU is still not used
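For context on the config above: as far as I know, recent Lightning versions select the device through the Trainer's `accelerator` and `devices` arguments, so a config key only has an effect if the training code actually passes it to the Trainer. Below is a minimal sketch of that wiring, assuming the Hydra config has been converted to a plain dict. The helper name `trainer_kwargs_from_config` and the handling of a bare `gpu` key are hypothetical, not something from this thread or from ClearML.

```python
def trainer_kwargs_from_config(cfg):
    """Build the device-related Trainer kwargs from a dict-like config.

    Hypothetical helper: how the real project wires its Hydra config
    into the Trainer is not shown in the thread.
    """
    kwargs = {
        # "auto" lets Lightning choose when the key is missing.
        "accelerator": cfg.get("accelerator", "auto"),
        "devices": cfg.get("devices", "auto"),
    }
    # Assumption: a bare `gpu: True` key is not read by the Trainer on
    # its own, so it is translated explicitly here.
    if cfg.get("gpu") is True and "accelerator" not in cfg:
        kwargs["accelerator"] = "gpu"
    return kwargs


if __name__ == "__main__":
    print(trainer_kwargs_from_config({"accelerator": "gpu", "devices": 1}))
```

The resulting dict would then be unpacked into the Trainer constructor, e.g. `Trainer(**trainer_kwargs_from_config(cfg))`. If the GPU is still unused after that, the container probably cannot see the device at all.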
@<1523701070390366208:profile|CostlyOstrich36> sorry wrong log uploaded, here is the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
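An error like this one usually means the process cannot see a GPU at all, so it can help to check that from inside the container before changing library versions. The sketch below is one possible diagnostic, not an official ClearML tool. The function names `gpu_visibility_report` and `pick_map_location` are made up for illustration, and the torch import is guarded so the script also runs where torch is not installed.

```python
import os
import shutil


def gpu_visibility_report():
    """Collect basic facts about whether a GPU is visible to this process."""
    report = {
        # Environment variables commonly used to restrict visible GPUs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        # Path to nvidia-smi, or None if it is not on PATH in this environment.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "torch_version": None,
        "torch_cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch
    except ImportError:
        return report  # torch not installed; report what we have
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only torch builds.
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


def pick_map_location(cuda_available):
    """Choose a map_location value for torch.load, as the error message suggests."""
    return "cuda" if cuda_available else "cpu"


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

A checkpoint could then be loaded with `torch.load(path, map_location=pick_map_location(report["cuda_available"]))`. That only works around the error on a CPU-only machine. If a GPU is expected, a `None` for `nvidia_smi_path` or `torch_cuda_build` points at the container or the torch build rather than at the training code.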
@<1523701070390366208:profile|CostlyOstrich36> Don't the docker extra arguments only apply to the docker run command rather than to dockerd?
@<1523701205467926528:profile|AgitatedDove14> Is there any reason why you mentioned that the "correct" way to work with python and containers is to actually install everything on the system (not venv)?
screenshot of AWS Autoscaler setup, cpu mode is NOT enabled
