not sure if related, but clearml 1.14 tends to not "show" the gpu_type
but then it is still missing a bunch of libraries in the Task (that succeeded) > Execution > INSTALLED PACKAGES
So when I clone that task and try to run the clone, the task fails because it is missing Python packages 😞
interesting, the issue happens with a mamba venv. Now I use a native Python venv and it is detecting packages correctly
is task.add_requirements("requirements.txt") redundant ?
Does ClearML always look for a requirements.txt in the repo root ?
the format is correct, as I can run pip install -r requirements.txt using the exact same file
and in train.py, I have task.add_requirements("requirements.txt")
Ok. Found the solution.
The important thing is to use this:
Task.add_requirements("requirements.txt")
task = Task.init(project_name='hieutest', task_name='foo',reuse_last_task_id=False)
And not:
task = Task.init(project_name='hieutest', task_name='foo',reuse_last_task_id=False)
task.add_requirements("requirements.txt")
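For reference, a minimal sketch of the working order (same project/task names as above; the only addition is the import):

from clearml import Task

# Register the requirements on the Task class BEFORE Task.init(),
# so they are attached to the task that is about to be created.
Task.add_requirements("requirements.txt")

task = Task.init(project_name='hieutest', task_name='foo', reuse_last_task_id=False)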
from the logs, it feels like after git clone, it spends minutes without outputting anything. @<1523701205467926528:profile|AgitatedDove14> Do you know what the agent is supposed to do after git clone ?
I guess it checks that all packages are installed ? But then with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1, what is the agent doing ??
in my case, using self-hosted and the agent inside a docker container:
47:45 : task foo pulled
[ git clone, pip install, check that all requirements are satisfied, and nothing is downloaded ]
48:16 : training starts
I know that git clone and pip verifying everything installed are normal steps. But for some reason, in Michael's screenshot, I don't see those steps ...
Oh, I was assuming you are passing the entire DB backups to the cloud.
Yes, that is what I want to do.
So I need to migrate both the MongoDB database and the Elasticsearch database from my local docker instance to the equivalent in the cloud ?
I really like how you made all this decoupled !! 🎉
Just keep in mind that your bottleneck will be the transfer rate. So mounting will not save you anything, as you still need to transfer the whole dataset to your GPU instance sooner or later.
One solution is what Jake suggests. The other is to pre-download the data to your instance using a cheap CPU-only instance type, then restart the instance with a GPU.
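If the data is stored as a ClearML Dataset, the pre-download step could roughly look like this (the dataset project and name are placeholders):

from clearml import Dataset

# Pull the whole dataset onto the cheap CPU instance first; the GPU instance
# can then reuse this local copy instead of downloading during training.
dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = dataset.get_local_copy()
print(local_path)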
what about having 2 agents, one on each GPU, on the same machine, serving the same queue ? So that when you enqueue, whichever agent (and thus GPU) is available will take the new task
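Roughly this, as a sketch (the queue name is a placeholder; in practice these are just two clearml-agent commands run in two terminals):

import subprocess

# One agent per GPU, both serving the same queue; whichever agent is free
# pulls the next enqueued task onto its GPU.
subprocess.Popen(["clearml-agent", "daemon", "--queue", "default", "--gpus", "0"])
subprocess.Popen(["clearml-agent", "daemon", "--queue", "default", "--gpus", "1"])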
I think a proper screenshot of the full log with some information redacted is the way to go. Otherwise we are just guessing in the dark
how does it work if I create my pipeline from code ? Will the task get the git repo state on the first run and use the commit hash and uncommitted changes as its "signature" ?
very hard to diagnose with this tiny bit of log ...
@<1523701087100473344:profile|SuccessfulKoala55> Actually it failed now: it failed to talk to our storage in Azure:
ClearML Task: created new task id=c47dd71dea2f421db05647a21d78ed26
2024-01-25 21:45:23,926 - clearml.storage - ERROR - Failed uploading: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2024-01-25 21:46:48,877 - clearml.storage - WARNING - Storage helper problem for .clearml.0149daec-7a03-4853-a0cd-a7e2b295...
not sure ... providing the Zscaler certificate seems to allow clearml to talk to our ClearML server hosted in Azure, and Task.init worked. But then it failed to connect to the storage account (also Azure) ...
@<1523701087100473344:profile|SuccessfulKoala55> I managed to make this work by:
concatenating the existing OS CA bundle and the Zscaler certificate, and setting REQUESTS_CA_BUNDLE to that bundle file
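A rough sketch of that workaround (all file paths here are assumptions; the OS bundle location depends on the distro, and the variable can also just be exported in the shell):

import os

# Build a combined CA bundle: OS certificates + the Zscaler certificate.
with open("combined-ca-bundle.pem", "wb") as out:
    for path in ("/etc/ssl/certs/ca-certificates.crt", "zscaler.pem"):
        with open(path, "rb") as f:
            out.write(f.read())

# Point HTTPS clients that honor REQUESTS_CA_BUNDLE at the combined bundle.
os.environ["REQUESTS_CA_BUNDLE"] = os.path.abspath("combined-ca-bundle.pem")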
Try setting CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true in the terminal that starts clearml-agent
See None
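Roughly what I mean, as a sketch (the queue name is a placeholder; normally you would just export the variable in the terminal before launching the agent):

import os
import subprocess

# Tell the agent to skip creating a venv / installing packages and to reuse
# the Python environment it was started from.
env = dict(os.environ, CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL="true")
subprocess.run(["clearml-agent", "daemon", "--queue", "default"], env=env)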
in my case, I set everything up inside the container, including the agent, and I don't use docker mode at all.
When my container starts, it starts the agent inside it in "normal" mode
depends on how the agent is launched ...
there is a tricky thing: clearml-agent should not be running from inside a venv itself ... I don't remember where I read that in the docs
one specifies the venv python, the other tells it to not do anything