@<1523701087100473344:profile|SuccessfulKoala55> Thanks, it is the AMI issue
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I tried to use --output-uri in clearml-task but got the same error: clearml.storage - ERROR - Failed uploading: 'LazyEvalWrapper' object cannot be interpreted as an integer
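For reference, a minimal sketch (my own workaround idea, assuming the standard clearml SDK, not something confirmed in this thread) of setting the output destination from code instead of the --output-uri flag; the S3 path is a placeholder:

# Hypothetical alternative: set the output destination in code via Task.init.
from clearml import Task

task = Task.init(
    project_name="fluoro-motion-detection",
    task_name="uniformer-test",
    output_uri="s3://my-bucket/clearml-artifacts",  # placeholder destination, not from this thread
)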
but it is still not able to run any task after I abort and rerun another task
Hi @<1523701087100473344:profile|SuccessfulKoala55> , I just started an EC2 instance manually and pulled the Docker image; it was able to pull the image without hitting the "no space left" issue
Hi @<1523701087100473344:profile|SuccessfulKoala55> , what preconfiguration is needed for the Docker service? I've tried running docker pull manually on an AWS EC2 instance with the same Docker image and didn't hit the space limit issue.
Hi @<1523701070390366208:profile|CostlyOstrich36> , any suggestions for this error?
I got the same CUDA issue after previously being able to use the GPU
It seems like the CPU is working on something; I saw the usage spiking periodically, but I didn't run any task this morning
I did pass --args to the clearml-task command for this run, but it looks like the Docker container didn't pick it up
Here it is @<1523701205467926528:profile|AgitatedDove14>
Okay, when I run main.py on my local machine, I can use python main.py experiment=example.yaml
to override the accelerator to the GPU option. But it seems like --args experiment=example.yaml
in clearml-task didn't work, so I have to manually modify it in the UI?
clearml-task \
--project fluoro-motion-detection \
--name uniformer-test \
--repo git@github.com:imperative-care-campbell/algorithms-python.git \
--branch SW-956-Fluoro-Motion-Detection \
--script fluoro_motio...
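As a side note, a minimal sketch (my own, using only documented clearml SDK calls; the task ID is a placeholder) of how one might check whether the --args override was actually recorded on the created task:

# Hypothetical check: inspect the parameters ClearML recorded for the task.
from clearml import Task

task = Task.get_task(task_id="YOUR_TASK_ID")  # placeholder ID, not from this thread
# get_parameters() returns a flat dict of all recorded parameter sections,
# so the experiment=example.yaml override should show up here if it was captured.
for name, value in task.get_parameters().items():
    print(name, "=", value)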
@<1523701205467926528:profile|AgitatedDove14> I'm trying to run ClearML GPU compute (RTX 3080) with pytorch-lightning but keep getting a CUDA error. Is there any specific CUDA/Ubuntu/torch/python version required? I tried several different versions but can't make it work
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 as telos_algorithms
File "/code/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1013, in _run_stage
with isolate_rng():
Fi...
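For debugging, a minimal sketch (my own, not from the thread) that could be run inside the same container to confirm the torch build actually sees the GPU and matches the CUDA 11.8 base image:

# Quick sanity check of the torch/CUDA pairing inside the container.
import torch

print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)      # should line up with the 11.8 base image
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))   # expect the RTX 3080 here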
Actually never mind, it's working now!
I've added gpu:True to my hydra config file but the GPU is still not used
Thanks @<1523701205467926528:profile|AgitatedDove14> . I just hit an issue running clearml-task remotely. It had been working fine before today, but now every time I run clearml-task it shows pending, and after waiting 3 hours the status is still pending. The autoscaler was charging the hourly rate even though the task had been pending for 3 hours. From the console log of the ClearML GPU instance, I saw it is listening to the queue, but there is no log even after 3 hours. There is not...
the gpu argument is actually inside my example.yaml:
defaults:
- default.yaml
accelerator: gpu
devices: 1
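For context, a minimal sketch (assuming a standard Hydra + PyTorch Lightning setup; the config_path/config_name values are assumptions, not the actual project code) of how these fields typically reach the Trainer:

# Hypothetical main.py showing how the Hydra fields above usually flow into Lightning.
import hydra
from omegaconf import DictConfig
import lightning.pytorch as pl

@hydra.main(config_path="configs", config_name="default", version_base=None)
def main(cfg: DictConfig) -> None:
    # "accelerator" / "devices" come straight from the YAML (or a CLI override
    # such as `python main.py experiment=example.yaml`).
    print(f"accelerator={cfg.accelerator}, devices={cfg.devices}")
    trainer = pl.Trainer(accelerator=cfg.accelerator, devices=cfg.devices)

if __name__ == "__main__":
    main()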
it was pending the whole day yesterday, but today it's able to run the task
I see, it seems like the --args for the script didn't get passed to the Docker container:
--script fluoro_motion_detection/src/run/main.py \
--args experiment=example.yaml \
I was trying to run python main.py experiment=example.yaml
Hi @<1523701435869433856:profile|SmugDolphin23> I see, but is there any way to see the overridden config in OmegaConf so I can easily compare the difference between 2 experiments?
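One possible approach (a sketch of mine using only documented ClearML/OmegaConf calls, not a confirmed answer from the thread): log the fully resolved config as a named configuration object so two tasks can be compared side by side in the UI.

# Hypothetical helper: attach the resolved Hydra/OmegaConf config to the current task.
from clearml import Task
from omegaconf import OmegaConf

def log_resolved_config(cfg) -> None:
    resolved_yaml = OmegaConf.to_yaml(cfg, resolve=True)  # overrides already applied
    task = Task.current_task()
    if task is not None:
        # Appears as a configuration object named "resolved_config" on the task.
        task.set_configuration_object(name="resolved_config", config_text=resolved_yaml)
    print(resolved_yaml)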
The queue will be empty when I run the task
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I am using 1.12.0
Hi @<1523701070390366208:profile|CostlyOstrich36> , here it is
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to solve this issue after upgrading clearml to 1.12.2, but my training/val loss becomes NaN after the update