
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I am using 1.12.0
Here it is @<1523701205467926528:profile|AgitatedDove14>
@<1523701205467926528:profile|AgitatedDove14> Is there any reason why you mentioned that the "correct" way to work with python and containers is to actually install everything on the system (not venv)?
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I tried to use --output-uri in clearml-task but got the same error: clearml.storage - ERROR - Failed uploading: 'LazyEvalWrapper' object cannot be interpreted as an integer
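For reference, the same default upload destination can also be set from code via Task.init; a minimal sketch, assuming an S3 bucket (the project/task names and the bucket URI are placeholders, not from the thread):

from clearml import Task

# Sketch: set the default artifact/model upload destination in code instead of
# the clearml-task --output-uri flag. All names and the URI are placeholders.
task = Task.init(
    project_name="fluoro_motion_detection",  # placeholder project
    task_name="debug-output-uri",            # placeholder task name
    output_uri="s3://my-bucket/clearml",     # placeholder destination
)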
Hi @<1523701435869433856:profile|SmugDolphin23> I see, but is there any way to see the overridden config in OmegaConf so I can easily compare the difference between 2 experiments?
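For reference, a rough sketch of pulling the stored config text of two tasks and diffing it locally; this assumes the Hydra config is saved as a configuration object named "OmegaConf" (the object name and the task IDs below are placeholders):

from clearml import Task
from omegaconf import OmegaConf

# Sketch: fetch the stored OmegaConf text of two experiments and compare them.
task_a = Task.get_task(task_id="<task-id-a>")
task_b = Task.get_task(task_id="<task-id-b>")
cfg_a = OmegaConf.create(task_a.get_configuration_object("OmegaConf") or "{}")
cfg_b = OmegaConf.create(task_b.get_configuration_object("OmegaConf") or "{}")

# Print top-level keys whose values differ between the two runs
for key in sorted(set(OmegaConf.to_container(cfg_a)) | set(OmegaConf.to_container(cfg_b))):
    if cfg_a.get(key) != cfg_b.get(key):
        print(f"{key}: {cfg_a.get(key)} -> {cfg_b.get(key)}")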
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to solve this issue after upgrading clearml to 1.12.2, but my training/val loss became NaN after the update
It had been pending the whole day yesterday, but today it's able to run the task
I was trying to run python main.py experiment=example.yaml
Actually never mind, it's working now!
but it is still not able to run any task after I abort and rerun another task
@<1523701070390366208:profile|CostlyOstrich36> Don't the docker extra arguments only take docker run options rather than dockerd options?
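As far as I understand, those extra arguments are appended to the agent's docker run command; a minimal sketch of setting them per task from code (the image and flags below are placeholders, not from the thread):

from clearml import Task

# Sketch: per-task docker settings; the arguments string is forwarded by the
# agent as `docker run` options (placeholder image and flags).
task = Task.init(project_name="fluoro_motion_detection", task_name="docker-args-example")
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04",
    docker_arguments="--shm-size=8g --ipc=host",
)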
I see, it seems like the --args for the script didn't get passed to the docker:
--script fluoro_motion_detection/src/run/main.py \
--args experiment=example.yaml \
I got the same CUDA issue again after previously being able to use the GPU
Hi @<1523701087100473344:profile|SuccessfulKoala55> , what preconfiguration is needed for the docker service? I tried running the docker pull manually on an AWS EC2 instance with the same docker image and did not hit the space limit issue.
It seems like the CPU is working on something; I saw the usage spiking periodically, but I didn't run any task this morning
the gpu argument is actually inside my example.yaml:
defaults:
- default.yaml
accelerator: gpu
devices: 1
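For reference, a minimal sketch of how those two fields would typically be consumed on the Trainer side (everything besides accelerator/devices is a placeholder, not from the thread):

from lightning import Trainer
from omegaconf import DictConfig

def build_trainer(cfg: DictConfig) -> Trainer:
    # `accelerator` and `devices` map straight onto the Trainer arguments
    return Trainer(accelerator=cfg.accelerator, devices=cfg.devices)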
@<1523701205467926528:profile|AgitatedDove14> I'm trying to run ClearML GPU compute (RTX 3080) with pytorch-lightning but keep getting a CUDA error. Is there any specific CUDA/Ubuntu/torch/python version required? I tried several different versions but can't make it work
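One quick sanity check that can be dropped into the remote task to see what the container actually resolves (nothing here is specific to the setup above):

import torch

# Print the torch/CUDA versions the remote container actually sees, to narrow
# down driver/toolkit/torch mismatches behind generic "CUDA error" failures.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))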
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 as telos_algorithms
File "/code/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1013, in _run_stage
with isolate_rng():
Fi...
@<1523701087100473344:profile|SuccessfulKoala55> Thanks, it is the AMI issue
And this issue happens randomly: I was able to run it again last night, but it failed again this morning
Hi @<1808672071991955456:profile|CumbersomeCamel72> , the failing instance is launched from the ClearML AWS Autoscaler on the web page. The successfully mounted instance is launched manually from the AWS web console
@<1808672071991955456:profile|CumbersomeCamel72> It can be mounted without docker, but can't be mounted if I run a docker container on the instance
#
from typing import List, Optional, Tuple
import pyrootutils
import lightning
import hydra
from clearml import Task
from omegaconf import DictConfig, OmegaConf
from lightning import LightningDataModule, LightningModule, Trainer, Callback
from lightning.pytorch.loggers import Logger
pyrootutils.setup_root(__file__, indicator="pyproject.toml", pythonpath=True)
# ------------------------------------------------------------------------------------ #
# the setup_root above is...
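The snippet above is cut off; purely as an illustrative sketch of how such a Hydra + ClearML entry point is commonly wired (none of the names, paths, or config keys below are from the original script):

import hydra
from hydra.utils import instantiate
from clearml import Task
from lightning import LightningDataModule, LightningModule, Trainer
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base="1.3", config_path="../configs", config_name="main.yaml")
def main(cfg: DictConfig) -> None:
    # Hypothetical wiring: project name, config path, and config keys are placeholders.
    task = Task.init(project_name="fluoro_motion_detection", task_name="train")
    task.connect_configuration(OmegaConf.to_container(cfg, resolve=True), name="OmegaConf")

    # Instantiate datamodule/model from _target_ entries in the config
    datamodule: LightningDataModule = instantiate(cfg.datamodule)
    model: LightningModule = instantiate(cfg.model)

    trainer = Trainer(accelerator=cfg.accelerator, devices=cfg.devices)
    trainer.fit(model=model, datamodule=datamodule)


if __name__ == "__main__":
    main()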
Thanks @<1523701205467926528:profile|AgitatedDove14> . I just got an issue running clearml-task remotely. It had been working fine before today, but now every time I run clearml-task it shows pending, and after waiting 3 hours the status is still pending. The autoscaler was charging the hourly rate even though the task was still pending for 3 hours. From the console log of the ClearML GPU instance, I saw it is listening to the queue, but there is no log even after 3 hours. There is not...