For us, we use Azure; we use KeyVault to store secrets.
The VM/node that runs the agent has an Azure Identity with permission to read those secrets.
To pull a secret, we simply run az login --identity [--client-id foobar] prior to az keyvault secret ....
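If you would rather do it from Python than from the az CLI, here is a minimal sketch using the Azure SDK (assuming the azure-identity and azure-keyvault-secrets packages; the vault URL, client id and secret name are placeholders):

from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient

# authenticate as the VM's managed identity (client_id is only needed for a user-assigned identity)
credential = ManagedIdentityCredential(client_id="foobar")
client = SecretClient(vault_url="https://my-vault.vault.azure.net", credential=credential)

# read the secret the agent needs
secret = client.get_secret("my-secret-name")
print(secret.value)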
Hi.
How do you tell the server to use my Azure storage instead of the local drive on the host machine? Isn't it by setting azure.storage in /opt/clearml/config/clearml.conf ?
Nevermind, didn't read properly ...
I guess when the pods simply crash or disconnect, the clearml agent won't have a chance to report to the ClearML server: hey, the network is about to be cut ....
You will need some k8s logic to flow back to the DS that the node just died for xyz reason ...
got it working. I was using CLEARML_AGENT_SKIP_PIP_VENV_INSTALL .
now I just use agent.package_manager.system_site_packages=true
OK, so if the git commit or uncommitted changes differ from the previous run, then the cache is "invalidated" and the step will be run again?
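For reference, a minimal sketch of a pipeline step with caching enabled (the function, project and argument names are made up); as I understand it, the cached result is only reused when the step's code, repo state and inputs are all unchanged:

from clearml import PipelineController

def preprocess(dataset_id):
    # ... load and transform the data ...
    return dataset_id

pipe = PipelineController(name="demo pipeline", project="demo", version="1.0")

# cache_executed_step=True reuses the previous execution of this step when nothing changed
pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    function_kwargs=dict(dataset_id="abc"),
    function_return=["dataset_id"],
    cache_executed_step=True,
)

pipe.start_locally(run_pipeline_steps_locally=True)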
Feels like Docker or Kubernetes is a better fit for that purpose ...
@<1523701070390366208:profile|CostlyOstrich36>
Solved @<1533620191232004096:profile|NuttyLobster9>. In my case:
I need to from clearml import Task very early in the code (like the first line), before importing argparse
And not call task.connect(parser)
We are using mmsegmentation, by the way.
If the agent is the one running the experiment, it is very likely that your task will be killed.
And when the agent comes back, immediately or later, probably nothing will happen. It won't resume ...
I use an ssh public key to access our repo ... I have never tried to provide credentials to ClearML itself (via clearml.conf ), so I cannot help much here ...
Is task.add_requirements("requirements.txt") redundant?
Does ClearML always look for a requirements.txt in the repo root?
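For reference, my understanding is that Task.add_requirements can be pointed at a specific requirements file, as long as it is called before Task.init (the project and task names below are placeholders):

from clearml import Task

# must be called before Task.init()
Task.add_requirements("requirements.txt")

task = Task.init(project_name="demo", task_name="train")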
Is it because your training code downloads the pretrained model from pytorch or whatever to local disk in /tmp/xxx and then trains from there? So ClearML will just reference the local path.
I think you need to manually download the pre-trained model, then wrap it with a ClearML InputModel (e.g. here )
And then use that InputModel as the pre-trained model?
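Something along these lines, as a sketch only (the weights URL and the model/project names are placeholders, and I am assuming InputModel.import_model is the right registration call):

from clearml import Task, InputModel

task = Task.init(project_name="demo", task_name="train")

# register the externally hosted pre-trained weights as an InputModel
pretrained = InputModel.import_model(
    name="resnet50-pretrained",
    weights_url="https://example.com/models/resnet50.pth",
)
task.connect(pretrained)

# fetch a local copy of the weights and load them with your framework
weights_path = pretrained.get_local_copy()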
Maybe the ClearML staff have a better approach? @<152370107039036...
That --docker_args seems to be for clearml-task as described here , while you are using clearml-agent , which is a different thing
The agent inside the docker compose is just a handy one to serve a services queue, where you can queue all your "clean up" tasks that are not deep-learning related, using only a bit of CPU
So it's not supposed to say "illegal output destination ..." ?
Found the issue: my bad practice for imports 😛
You need to import clearml before setting up the argument parser. Bad way:
import argparse

def handleArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml',
                        help='train config file')
    parser.add_argument('--device', type=int, default=0,
                        help='cuda device index to run the training')
    args = parser.parse_args()
    return args
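Good way, as a minimal sketch of the same snippet with the import order fixed (the project and task names are placeholders):

from clearml import Task   # import clearml before argparse so the arguments are captured
import argparse

def handleArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml',
                        help='train config file')
    parser.add_argument('--device', type=int, default=0,
                        help='cuda device index to run the training')
    return parser.parse_args()

args = handleArgs()
task = Task.init(project_name="demo", task_name="train")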
What about having 2 agents, one on each GPU, on the same machine, serving the same queue? So that when you enqueue, whichever agent (thus GPU) is available will take the new task
So we have 3 python packages, stored on github.com
On the dev machine, the data scientist (DS) will add the local ssh key to his github account as authorized ssh keys, at the account level.
With that, the DS can run git clone git@github.com:org/repo1 and then install that python package via pip install -e .
Do that for all 3 python packages, each in its own repo1 , repo2 and repo3 . All 3 can be cloned using the same key that the DS added to his account.
The DS run a tra...
But then how does the agent know which venv it needs to use?
If you are using multiple storage locations, I don't see any other choice than putting multiple credentials in the conf file ... Free or paid ClearML Server ...
the underlying code has this assumption when writing it
That means that you want to make things work in a non-standard Python way ... in which case you need to do "non-standard" things to make it work.
You can do this, for example, at the beginning of your run.py :
import sys
import os
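# add the parent directory of this file to sys.path so modules there can be imported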
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
In this way, you are not relying on a non-standard feature to be implemented by your tool like pycharm or `cle...
You should know where your latest model is located, so just call task.upload_artifact on that file?
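Something like this, as a sketch (the artifact name and checkpoint path are placeholders):

from clearml import Task

task = Task.current_task()  # or the Task object you already hold
task.upload_artifact(name="latest_model", artifact_object="checkpoints/latest_model.pt")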
The weird thing is that GPU 0 seems to be in use, as reported by nvtop on the host. But it is 50% slower than when running directly instead of through the clearml-agent ...