I've added gpu:True to my hydra config file but the GPU is still not used
Thanks @<1523701205467926528:profile|AgitatedDove14> . I just got an issue running clearml-task remotely. It had been working fine before today, but now every time I run clearml-task it shows pending, and after waiting for 3 hours the status is still pending. The autoscaler was charging the hourly rate even though the task had been pending for 3 hours. From the console log of the ClearML GPU instance, I saw it is listening to the queue, but there is no log even after 3 hours. There is nothing else I am running besides this one task, and it seems like the worker never spins up again
2023-08-03 04:41:00,624 - clearml.Auto-Scaler - INFO - Spinning new instance resource='default', prefix='38ae71a80baf4a58893631d23c0c6e72_3090_1', queue='test-gpu'
2023-08-03 04:41:00,625 - clearml.Auto-Scaler - INFO - Creating instance for resource default
2023-08-03 04:41:01,027 - clearml.Auto-Scaler - INFO - New instance b97e702d-e2b3-4f28-adab-be59648601ea listening to test-gpu queue
but it's still not able to run any task after I abort and rerun another task
When you "run" a task you are pushing it to a queue, so how come the queue is empty? What happens after you push your newly cloned task to the queue?
It seems like the CPU is working on something; I saw the usage spiking periodically, but I didn't run any task this morning
I got the same CUDA issue after previously being able to use the GPU
is it displaying that it is running anything?
And how did you connect your example.yaml?
Thanks for the details @<1597762318140182528:profile|EnchantingPenguin77>
clearml.Auto-Scaler - INFO - New instance b97e702d-e2b3-4f28-adab-be59648601ea listening to test-gpu queue
This looks like a new agent was spun up on your EC2 account, can you see it in the "Workers" page?
Here it is @<1523701205467926528:profile|AgitatedDove14>
you should have a gpu argument there, set it to true
The queue will be empty when I run a task
Notice you should be able to override them in the UI (under the Args section)
There is nothing in the queue and no worker is showing
I see, it seems like the --args for the script didn't get passed to the docker:
--script fluoro_motion_detection/src/run/main.py \
--args experiment=example.yaml \
@<1523701205467926528:profile|AgitatedDove14> Is there any reason why you mentioned that the "correct" way to work with python and containers is to actually install everything on the system (not venv)?
Click on the Task it is running and abort it; it seems to be stuck, and I guess this is why the others are not pulled
well I do not think you set your PyTorch Lightning to use CUDA:
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/code/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/setup.py:176: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
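For reference, a minimal sketch of what that warning is asking for, with accelerator and devices set explicitly on the Trainer (the tiny model and random data below are just placeholders, not your code):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl

class TinyModel(pl.LightningModule):
    # Minimal LightningModule, only here to show where the GPU gets selected
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)

# Explicitly request the GPU, as the warning suggests
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=1)
trainer.fit(TinyModel(), data)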
okay, when I run main.py on my local machine, I can use python main.py experiment=example.yaml
to override the accelerator to the GPU option. But it seems like the --args experiment=example.yaml
in clearml-task didn't work, so I have to manually modify it in the UI?
clearml-task \
--project fluoro-motion-detection \
--name uniformer-test \
--repo git@github.com:imperative-care-campbell/algorithms-python.git \
--branch SW-956-Fluoro-Motion-Detection \
--script fluoro_motion_detection/src/run/main.py \
--args experiment=example.yaml \
--docker mzhengtelos/algorithm-ml:pyenv \
--docker_args "--env CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$PYTHON_ENV_DIR --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
--queue test-gpu
That's the right place, but pass it like you would a hydra override, which in your case I think should be "accelerator.gpu".
You can also change allow_omegaconf_edit in the UI to True, and then you could just edit the OmegaConf in the UI (if you do not change allow_omegaconf_edit, the edit in the UI is ignored)
Actually never mind, it's working now!
See: Add an experiment hyperparameter, and add gpu: True
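As a rough sketch of how such a flag could flow from the Hydra config into Lightning (the config path/name and the gpu field here are assumptions for illustration, not taken from your repo):

import hydra
from omegaconf import DictConfig
import lightning.pytorch as pl

@hydra.main(config_path="conf", config_name="example", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg.gpu is assumed to be the boolean flag added to the config / UI
    accelerator = "gpu" if cfg.get("gpu", False) else "cpu"
    trainer = pl.Trainer(accelerator=accelerator, devices=1)
    # trainer.fit(...) with your model and data would follow here

if __name__ == "__main__":
    main()

With something like this, gpu=True can be overridden from the command line the same way experiment=example.yaml is passed, or edited in the UI when allow_omegaconf_edit is enabled.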
Hi @<1597762318140182528:profile|EnchantingPenguin77>
but it seems like clearml always creates a virtual environment
Yes that's correct, but the new venv inside the container inherits from the system packages (so if nothing changes it does nothing)
Is there a way that I can have clearml-task automatically use the activated custom virtual environment in my docker and run the scripts?
You can, but the "correct" way to work with python and containers is to actually install everything on the system (not venv)
That said, just set this env variable to point to the python binary inside your venv in the container
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/root/venv/bin/python
It had been pending the whole day yesterday, but today it's able to run the task