You don't need an agent on your local machine.
You want an agent running on the GPU machine.
Local code will create an experiment in ClearML Server, then run up to the line execute_remotely()
then stop.
Once the local code stops, the ClearML Server will take over and enqueue the experiment to the prescribed queue.
The agent on the GPU machine sees there is an experiment in its queue, pulls it, and executes it. This time, clearml lib magic will make the code on the GPU machine, launched by the agent, run...
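Something like this minimal sketch (project, task and queue names here are just placeholders):

from clearml import Task

task = Task.init(project_name="demo", task_name="remote-train")

# everything above this line runs locally and registers the experiment;
# execute_remotely() enqueues the task on "gpu_queue" and exits the local process
task.execute_remotely(queue_name="gpu_queue")

# from here on, the code only executes on the agent that pulled the task
print("running on the GPU machine now")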
my code looks like this:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml',
                    help='train config file')
parser.add_argument('-t', '--train-times', type=int, default=1,
                    help='train the same model several times')
parser.add_argument('--dataset_dir', help='path to folder containing the prepped dataset.', required=True)
parser.add_argument('--backup', action='s...
once you manually install your package inside the docker container, check that your file module_b/templates/my_template.yml
is where it should be
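For example, a quick sanity check you could run inside the container (just a sketch, assuming the package imports as module_b):

import os
import module_b

# locate the installed package and check the template was shipped with it
template = os.path.join(os.path.dirname(module_b.__file__), "templates", "my_template.yml")
print(template, "exists:", os.path.exists(template))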
so it's not supposed to say "illegal output destination ..." ?
ok, so if the git commit or uncommitted changes differ from the previous run, then the cache is "invalidated" and the step will be run again ?
Are you talking about this: None
It seems to not be doing anything about the database data ...
what do you mean by a different script ?
so what was the solution/hack then ?
I mean, it depends on what you want to report ... if you want to stick to a table, as I suggested earlier, gather your stats in table format ...
Otherwise, matplotlib seems to be the most user-friendly way
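If you go the table route, roughly like this with ClearML's Logger (the stats DataFrame is made up):

import pandas as pd
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="demo", task_name="report-stats")
logger = task.get_logger()

# report the stats as a table ...
stats = pd.DataFrame({"epoch": [1, 2, 3], "accuracy": [0.71, 0.78, 0.83]})
logger.report_table(title="stats", series="accuracy", iteration=0, table_plot=stats)

# ... or as a matplotlib figure
fig = plt.figure()
plt.plot(stats["epoch"], stats["accuracy"])
logger.report_matplotlib_figure(title="accuracy curve", series="train", figure=fig, iteration=0)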
I don't use submodules so I don't really know how they behave with ClearML
So we have 3 python packages, stored on github.com.
On the dev machine, the data scientist (DS) will add their local ssh key to their github account as an authorized ssh key, at the account level.
With that, the DS can run git clone git@github.com:org/repo1
then install that python package via pip install -e .
Do that for all 3 python packages, each in its own repo1, repo2 and repo3. All 3 can be cloned using the same key that the DS added to their account.
The DS runs a tra...
@CostlyOstrich36
Yes. I am investigating that route now.
if you want to replace MLflow with ClearML: do it !! It's like asking "Should I wear sandals or running shoes for my next marathon ..."
Let your users try ClearML, and I am pretty sure all of them will want to swap over !!!
Nevermind: None
By default, the File Server is not secured even if Web Login Authentication has been configured. Using an object storage solution that has built-in security is recommended.
My bad
I will try it. But it's a bit random when this happens so ... We will see
@SuccessfulKoala55 I can confirm that v1.8.1rc2 fixed the issue in our case. I managed to reproduce it:
- Do a local commit without pushing
- Create a task and queue it
- The queued task fails as expected, as the commit is only local
- Push your local commit
- Requeue the task
- Expect the task to succeed as the commit is now available: but it fails, as the vcs seems to be in a weird state from the previous failure
- With v1.8.1rc2 the issue is solved
@SuccessfulKoala55 Actually it failed now: it failed to talk to our storage in Azure:
ClearML Task: created new task id=c47dd71dea2f421db05647a21d78ed26
2024-01-25 21:45:23,926 - clearml.storage - ERROR - Failed uploading: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2024-01-25 21:46:48,877 - clearml.storage - WARNING - Storage helper problem for .clearml.0149daec-7a03-4853-a0cd-a7e2b295...
Is it because Azure is "whitelisted" in our network, and thus needs a different certificate ?? And how do I provide 2 different certificates ? Is bundling them as simple as concatenating the 2 pem files ?
@SuccessfulKoala55 I managed to make this work by concatenating the existing OS CA bundle and the Zscaler certificate, and setting REQUESTS_CA_BUNDLE
to that bundle file
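Roughly like this, just a sketch (the paths are from my Ubuntu setup, adjust to yours):

import os

# concatenate the OS CA bundle and the Zscaler certificate into one file
os_bundle = "/etc/ssl/certs/ca-certificates.crt"
zscaler_cert = "/etc/ssl/certs/zscaler.crt"
combined = "combined-ca.pem"

with open(combined, "w") as out:
    for pem in (os_bundle, zscaler_cert):
        with open(pem) as f:
            out.write(f.read() + "\n")

# point requests (and therefore clearml) at the combined bundle
os.environ["REQUESTS_CA_BUNDLE"] = os.path.abspath(combined)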
@SuccessfulKoala55 Thanks. Managed to get it working now with
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/zscaler.crt
(Ubuntu system)
not sure ... providing the Zscaler certificate seems to allow clearml to talk to our clearml server, hosted in Azure; Task init worked. But then it failed to connect to the storage account (Azure too) ...
Onprem: User management is not "live", as you need to reboot, and passwords are hardcoded ... No permission distinction, as everyone is admin ...
can you make train1.py use clearml.conf.server1 and train2.py use clearml.conf2 ?? In which case I would be interested @SuccessfulKoala55
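One approach that might work is the CLEARML_CONFIG_FILE environment variable, set per process before clearml loads its configuration (a sketch, file names taken from the question):

# train1.py
import os
os.environ["CLEARML_CONFIG_FILE"] = "clearml.conf.server1"  # train2.py would use "clearml.conf2"

from clearml import Task
task = Task.init(project_name="demo", task_name="train1")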
Can you share the agent log, in the console tab, before the error?
Looks like your issue is not that ClearML isn't tracking your changes, but rather that your configuration is overwritten.
This often happens to me. The way I debug this is to put a lot of print statements along the code to track when the configuration is overwritten and narrow down why. The print statements will show up in the Console tab.
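For example, something like this (config keys are made up):

from clearml import Task

task = Task.init(project_name="demo", task_name="debug-config")

config = {"lr": 0.001, "batch_size": 32}
print("config before connect:", config)

# when the task runs under an agent, connect() may overwrite these values
# with whatever is stored for the task in the ClearML UI
config = task.connect(config)
print("config after connect:", config)  # shows up in the Console tab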
So I tried:
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/data/hieu/opt/python-venv/fastai/bin/python3.10 clearml-agent daemon --queue no_venv
Then I enqueued a cloned task to no_venv
It is still trying to create a venv (and failing):
[...]
tag =
docker_cmd =
entry_point = debug.py
working_dir = apple_ic
created virtual environment CPython3.10.10.final.0-64 in 140ms
creator CPython3Posix(dest=/data/hieu/deleteme/clearml-agent/venvs-builds/3.10, clear=False, no_vcs_ignore=False, gl...