@<1523701205467926528:profile|AgitatedDove14> Is there any reason why you mentioned that the "correct" way to work with Python and containers is to actually install everything on the system (not in a venv)?
Hi @<1523701435869433856:profile|SmugDolphin23> I see, but is there any way to see the overridden config in OmegaConf so I can easily compare the difference between two experiments?
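For comparing two resolved configs, here is a minimal sketch, assuming each config has already been converted to a plain nested dict (with OmegaConf that would be `OmegaConf.to_container(cfg, resolve=True)`). The helper names `flatten` and `diff_configs` and the two sample configs are hypothetical, made up for illustration; this is not a ClearML or OmegaConf feature.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # placeholder for keys present in only one config


def flatten(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested dict into dotted keys, e.g. trainer.devices."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def diff_configs(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing key."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key, _MISSING), flat_b.get(key, _MISSING))
        for key in sorted(set(flat_a) | set(flat_b))
        if flat_a.get(key, _MISSING) != flat_b.get(key, _MISSING)
    }


# Hypothetical configs standing in for two experiments.
run_a = {"trainer": {"accelerator": "cpu", "devices": 1}, "seed": 42}
run_b = {"trainer": {"accelerator": "gpu", "devices": 1}, "seed": 42}
print(diff_configs(run_a, run_b))
```

Flattening to dotted keys keeps the diff readable for deeply nested Hydra configs, since each changed value shows up as a single line.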
Hi @<1523701070390366208:profile|CostlyOstrich36> , any suggestion for this error?
I've added gpu:True to my hydra config file but the GPU is still not used
Hi @<1523701070390366208:profile|CostlyOstrich36> , here it is
There is nothing on the queue and worker
It was pending the whole day yesterday, but today it was able to run the task,
but it is still not able to run any task after I abort and rerun another task
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I tried to use --output-uri in clearml-task but got the same error: clearml.storage - ERROR - Failed uploading: 'LazyEvalWrapper' object cannot be interpreted as an integer
@<1523701087100473344:profile|SuccessfulKoala55> Hi Jake, I am using 1.12.0
Thanks @<1523701205467926528:profile|AgitatedDove14> . I just ran into an issue running clearml-task remotely. It had been working fine before today, but now every time I run clearml-task, it shows pending, and after waiting for 3 hours the status is still pending. The autoscaler was charging the hourly rate even though the task had been pending for 3 hours. From the console log of the ClearML GPU instance, I saw it is listening to the queue, but there is no log even after 3 hours. There is not...
#
from typing import List, Optional, Tuple
import pyrootutils
import lightning
import hydra
from clearml import Task
from omegaconf import DictConfig, OmegaConf
from lightning import LightningDataModule, LightningModule, Trainer, Callback
from lightning.pytorch.loggers import Logger
pyrootutils.setup_root(__file__, indicator="pyproject.toml", pythonpath=True)
# ------------------------------------------------------------------------------------ #
# the setup_root above is...
I see, it seems like the --args for the script weren't passed to the docker:
--script fluoro_motion_detection/src/run/main.py \
--args experiment=example.yaml \
I was trying to run python main.py experiment=example.yaml
It seems like the CPU is working on something. I saw the usage spiking periodically, but I didn't run any task this morning
Okay, when I run main.py on my local machine, I can use python main.py experiment=example.yaml
to override the accelerator to the GPU option. But it seems like the --args experiment=example.yaml
in clearml-task didn't work, so do I have to manually modify it in the UI?
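One way to check whether the overrides actually reach the script is to log them at startup. This is a minimal sketch, assuming the overrides arrive as plain `key=value` tokens on the command line, the same form as the local `python main.py experiment=example.yaml` call. The helper `hydra_style_overrides` is hypothetical and deliberately simplified; it does not reproduce Hydra's full override grammar.

```python
import sys
from typing import Dict, List, Optional


def hydra_style_overrides(argv: Optional[List[str]] = None) -> Dict[str, str]:
    """Collect key=value tokens from a command line, skipping dash-prefixed flags."""
    args = sys.argv[1:] if argv is None else argv
    overrides: Dict[str, str] = {}
    for token in args:
        # Skip argparse-style flags and bare positional values.
        if token.startswith("-") or "=" not in token:
            continue
        key, value = token.split("=", 1)
        overrides[key] = value
    return overrides


# Print what this process was started with, e.g. at the top of main.py.
print("overrides received:", hydra_style_overrides())
```

If the remote console log shows an empty dict while the local run shows the `experiment` key, the override was dropped before the script started.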
clearml-task \
--project fluoro-motion-detection \
--name uniformer-test \
--repo git@github.com:imperative-care-campbell/algorithms-python.git \
--branch SW-956-Fluoro-Motion-Detection \
--script fluoro_motio...
Here it is @<1523701205467926528:profile|AgitatedDove14>
@<1523701205467926528:profile|AgitatedDove14> I'm trying to run ClearML GPU compute (RTX 3080) with pytorch-lightning but keep getting a CUDA error. Is there any specific CUDA/Ubuntu/torch/Python version required? I tried several different versions but can't make it work
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 as telos_algorithms
File "/code/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1013, in _run_stage
with isolate_rng():
Fi...
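When chasing a CUDA error like this, it can help to print what the container's torch build actually sees before the trainer starts. This is a minimal sketch: the function name `cuda_report` is hypothetical, and it only reports versions rather than diagnosing the error in the traceback above.

```python
from typing import Any, Dict


def cuda_report() -> Dict[str, Any]:
    """Summarize the torch build and visible GPUs; safe to call without torch."""
    report: Dict[str, Any] = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version torch was compiled against; None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    if report["cuda_available"]:
        report["device_0"] = torch.cuda.get_device_name(0)
    return report


print(cuda_report())
```

Comparing `built_for_cuda` with the CUDA version of the base image (11.8 in the Dockerfile line above) is a quick way to spot a mismatched torch wheel.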
@<1523701205467926528:profile|AgitatedDove14> Yes, I can see the worker:
I got the same cuda issue after being able to use GPU
Actually never mind, it's working now!
The queue will be empty when I run a task
The GPU argument is actually inside my example.yaml:
defaults:
- default.yaml
accelerator: gpu
devices: 1
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to solve this issue after upgrading clearml to 1.12.2, but my training/val loss became NaN after the update
I did pass --args to the clearml-task command for this run, but it looks like the docker didn't pick it up