
It seems like the CPU is working on something; I saw the usage spiking periodically, but I didn't run any task this morning
There is nothing in the queue and nothing running on the worker
[screenshot of AWS Autoscaler setup; CPU mode is NOT enabled]
Hi @<1523701070390366208:profile|CostlyOstrich36> , here is the configuration. The GPU is sometimes detected when I clone a previous successful run, but the detection seems random. Also, I am unable to run multiple tasks at the same time, even when cloning the previous run
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to solve this issue after upgrading ClearML to 1.12.2, but my training/val loss became NaN after the update
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I am using 1.12.0
Hi @<1523701087100473344:profile|SuccessfulKoala55> , what preconfiguration is needed for the Docker service? I've tried running docker pull manually on an AWS EC2 instance with the same Docker image, and did not hit the space limit issue there.
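To check it wasn't a plain disk-space problem, I used a minimal sketch like the one below. It assumes a Linux host and Docker's default data root of /var/lib/docker; the autoscaler AMI may well configure a different path, so both are assumptions:

```python
# Minimal sketch: check free disk space where Docker stores images.
# Assumes Docker's default data root /var/lib/docker (an assumption;
# the autoscaler image may use a different data root).
import shutil

total, used, free = shutil.disk_usage("/var/lib/docker")
print(f"total={total / 1e9:.1f} GB, used={used / 1e9:.1f} GB, free={free / 1e9:.1f} GB")
```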
Hi @<1523701070390366208:profile|CostlyOstrich36> , here it is
Thanks @<1523701205467926528:profile|AgitatedDove14> . I just hit an issue running clearml-task remotely. It had been working fine before today, but now every time I run clearml-task the task shows pending, and after 3 hours the status is still pending. The autoscaler keeps charging the hourly rate even though the task is still pending. From the console log of the ClearML GPU instance, I can see it is listening to the queue, but there is no log output even after 3 hours. There is not...
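For reference, this is roughly how the task ends up in the queue. A minimal sketch using the ClearML Python SDK; the project, task, and queue names here are placeholders, not my actual values:

```python
# Minimal sketch of enqueueing a task for remote execution with the
# ClearML SDK. Project/task/queue names are placeholders.
from clearml import Task

task = Task.init(project_name="my_project", task_name="remote_run")
# Stops local execution and enqueues the task; it stays "pending"
# until an agent listening on this queue pulls it.
task.execute_remotely(queue_name="default")
```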
I got the same CUDA issue after previously being able to use the GPU
@<1808672071991955456:profile|CumbersomeCamel72> It can be mounted without Docker, but it can't be mounted when I run a Docker container on the instance
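For context, a minimal sketch of the kind of bind mount I'm attempting inside Docker, written with the docker-py SDK; the host path, container path, and image are placeholders:

```python
# Minimal sketch: bind-mount a host path into a container with docker-py.
# Host path, container path, and image are placeholders.
import docker

client = docker.from_env()
output = client.containers.run(
    "ubuntu:22.04",
    "ls /data",
    volumes={"/mnt/host_data": {"bind": "/data", "mode": "rw"}},
    remove=True,
)
print(output.decode())
```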
Hi @<1523701070390366208:profile|CostlyOstrich36> Any idea why this happens?
@<1523701070390366208:profile|CostlyOstrich36> sorry, I uploaded the wrong log; here is the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
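For anyone hitting this: a minimal sketch of the fix the error message itself suggests, remapping the checkpoint's storages to the CPU when loading on a CPU-only machine ("model.pt" is a placeholder path):

```python
# Minimal sketch: deserialize a CUDA-saved checkpoint on a CPU-only
# machine by remapping storages to the CPU, as the error suggests.
# "model.pt" is a placeholder path.
import torch

checkpoint = torch.load("model.pt", map_location=torch.device("cpu"))
```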