
It had been pending the whole day yesterday, but today it's able to run the task
I did use --args with the clearml-task command for this run, but it looks like the docker didn't take it
Hi @<1808672071991955456:profile|CumbersomeCamel72>, the failing instance is launched from the ClearML AWS Autoscaler on the webpage. The successfully mounted instance was launched manually from the AWS web console
I've added gpu:True to my hydra config file but the GPU is still not used
@<1808672071991955456:profile|CumbersomeCamel72> It can be mounted without docker, but it can't be mounted if I run a docker container on the instance
Hi @<1523701070390366208:profile|CostlyOstrich36> Any idea why this happens?
Screenshot of the AWS Autoscaler setup; CPU mode is NOT enabled
And this issue happens randomly: I was able to run it again last night, but it failed again this morning
@<1523701070390366208:profile|CostlyOstrich36> sorry wrong log uploaded, here is the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
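For reference, this is the workaround the error message itself points at: a minimal sketch (the checkpoint path is hypothetical, not from this thread) that maps the checkpoint to CPU whenever no CUDA device is visible, instead of assuming a GPU is present.

import torch

# Hypothetical checkpoint path; map tensors to CPU when CUDA is unavailable,
# otherwise keep the default device mapping.
checkpoint_path = "checkpoints/model.ckpt"
map_location = torch.device("cpu") if not torch.cuda.is_available() else None
state = torch.load(checkpoint_path, map_location=map_location)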
Hi @<1523701070390366208:profile|CostlyOstrich36>, here is the configuration. The GPU could sometimes be found when I cloned the previous successful run, but only randomly. Also, I am unable to run multiple tasks at the same time, even when cloning the previous run
@<1523701070390366208:profile|CostlyOstrich36> yes, at the end of the new file
One thing I've changed is the AMI for the autoscaler: I changed it from Amazon Linux to Ubuntu, since my docker image size exceeds the limit on the Amazon Linux AMI. Not sure if this has anything to do with the issue
@<1523701205467926528:profile|AgitatedDove14> I'm trying to run ClearML GPU compute (RTX 3080) with pytorch-lightning but keep getting a CUDA error. Is there any specific CUDA/Ubuntu/torch/Python version required? I tried several different versions but can't make it work
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 as telos_algorithms
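As an aside, a quick sanity check (a sketch, not from the original thread) that can be run inside that container to confirm the installed torch build actually matches the CUDA 11.8 runtime of the base image before the lightning Trainer starts:

import torch

print("torch:", torch.__version__)            # e.g. a +cu118 build for a CUDA 11.8 image
print("built for CUDA:", torch.version.cuda)  # CUDA version torch was compiled against
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))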
File "/code/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1013, in _run_stage
with isolate_rng():
Fi...
@<1523701205467926528:profile|AgitatedDove14> Yes, I can see the worker:
It seems like the CPU is working on something; I saw its usage spiking periodically, but I didn't run any task this morning
I was trying to run python main.py experiment=example.yaml
@<1523701205467926528:profile|AgitatedDove14> Is there any reason why you mentioned that the "correct" way to work with python and containers is to actually install everything on the system (not venv)?
I got the same CUDA issue after having been able to use the GPU
I see, it seems like the --args for the script didn't get passed to the docker:
--script fluoro_motion_detection/src/run/main.py \
--args experiment=example.yaml \
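For context, a sketch of what the full clearml-task invocation around that fragment might look like; only the --script and --args values come from this thread, while the project, task name, docker image, and queue below are placeholders.

clearml-task \
  --project MyProject \
  --name gpu-run \
  --script fluoro_motion_detection/src/run/main.py \
  --args experiment=example.yaml \
  --docker my-registry/my-image:latest \
  --queue aws-gpu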
the gpu argument is actually inside my example.yaml:
defaults:
  - default.yaml
accelerator: gpu
devices: 1
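An illustrative sketch (not the project's actual main.py; config_path and config_name are assumptions) of how the accelerator and devices keys from example.yaml typically reach the lightning Trainer through a hydra entry point:

import hydra
from lightning import Trainer
from omegaconf import DictConfig


# config_path/config_name below are hypothetical; the real project layout may differ
@hydra.main(version_base="1.3", config_path="configs", config_name="example")
def main(cfg: DictConfig) -> None:
    # accelerator: gpu / devices: 1 from example.yaml end up here
    trainer = Trainer(accelerator=cfg.accelerator, devices=cfg.devices)
    # trainer.fit(model, datamodule=datamodule) would follow


if __name__ == "__main__":
    main()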
Here it is @<1523701205467926528:profile|AgitatedDove14>
The queue will be empty when I run the task
#
from typing import List, Optional, Tuple
import pyrootutils
import lightning
import hydra
from clearml import Task
from omegaconf import DictConfig, OmegaConf
from lightning import LightningDataModule, LightningModule, Trainer, Callback
from lightning.pytorch.loggers import Logger
pyrootutils.setup_root(__file__, indicator="pyproject.toml", pythonpath=True)
# ------------------------------------------------------------------------------------ #
# the setup_root above is...
There is nothing in the queue and nothing on the worker
but it is still not able to run any task after I abort it and rerun another task
Actually never mind, it's working now!