Feels like Docker, Kubernetes is more fit for that purpose ...
@<1523701070390366208:profile|CostlyOstrich36>
Solved @<1533620191232004096:profile|NuttyLobster9>. In my case:
I need to do from clearml import Task very early in the code (like the first line), before importing argparse
And not call task.connect(parser)
We are using mmsegmentation, by the way
If the agent is the one running the experiment, it is very likely that your task will be killed.
And when the agent comes back, immediately or later, probably nothing will happen. It won't resume ...
I use an SSH public key to access our repo ... Never tried to provide credentials to ClearML itself (via clearml.conf), so I cannot help much here ...
Is task.add_requirements("requirements.txt") redundant?
Does ClearML always look for a requirements.txt in the repo root?
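If it helps, this is a minimal sketch of how I call it (project/task names are placeholders, and as far as I know it has to run before Task.init()):

from clearml import Task

# register the requirements file before the task is initialized
Task.add_requirements("requirements.txt")
task = Task.init(project_name="my_project", task_name="my_task")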
Is it because your training code downloads the pretrained model from PyTorch or wherever to local disk in /tmp/xxx, then trains from there? So ClearML will just reference the local path.
I think you need to manually download the pre-trained model, then wrap it with the ClearML InputModel (e.g. here)
And then use that InputModel as the pre-trained weights?
Maybe the ClearML staff have a better approach? @<152370107039036...
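Something like this sketch, going from memory on the API (the URL and names are placeholders):

from clearml import Task, InputModel

task = Task.init(project_name="my_project", task_name="finetune")

# wrap the externally hosted pre-trained weights as a ClearML InputModel
input_model = InputModel.import_model(
    weights_url="https://example.com/pretrained/model.pth",  # placeholder URL
    name="pretrained-backbone",
)

# get a local copy and point the training code at that path
local_weights = input_model.get_local_copy()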
That --docker_args seems to be for clearml-task as described here, while you are using clearml-agent, which is a different thing
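For reference, this is roughly how I understand --docker_args is meant to be used with clearml-task (project, queue and image names are placeholders):

clearml-task --project my_project --name my_task \
    --script train.py --queue default \
    --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
    --docker_args "--shm-size=8g"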
The agent inside the docker compose is just a handy one to serve a services queue, where you can queue all your "clean up" tasks that are not deep learning related, using only a bit of CPU
So it's not supposed to say "illegal output destination ..." ?
Found the issue: my bad practice for import 😛
You need to import clearml before doing the argument parsing. Bad way:
import argparse

def handleArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml',
                        help='train config file')
    parser.add_argument('--device', type=int, default=0,
                        help='cuda device index to run the training')
    args = parser.parse_args()
    ...
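Good way (a sketch of the fix described above): import clearml/Task before argparse, keep everything else the same:

from clearml import Task  # import clearml before argparse is imported

import argparse

def handleArgs():
    parser = argparse.ArgumentParser()
    # ... same arguments as above ...
    args = parser.parse_args()
    return args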
What about having 2 agents, one on each GPU, on the same machine, serving the same queue? So that when you enqueue, whichever agent (thus GPU) is available will take the new task
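Something like this, one daemon per GPU, both serving the same queue (the queue name is just an example):

clearml-agent daemon --queue default --gpus 0 --detached
clearml-agent daemon --queue default --gpus 1 --detached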
But then how does the agent know where the venv that it needs to use is located?
If you are using multiple storage places, I don't see any other choice than putting multiple credentials in the conf file ... free or paid ClearML Server ...
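Something like this in clearml.conf, going from memory on the exact layout (bucket names and keys are placeholders):

sdk {
    aws {
        s3 {
            credentials: [
                {
                    bucket: "bucket-for-project-a"
                    key: "ACCESS_KEY_A"
                    secret: "SECRET_KEY_A"
                },
                {
                    bucket: "bucket-for-project-b"
                    key: "ACCESS_KEY_B"
                    secret: "SECRET_KEY_B"
                }
            ]
        }
    }
}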
The underlying code was written with this assumption.
That means that you want to make things work in a non-standard Python way ... In which case you need to do "non-standard" things to make it work.
You can do this, for example, at the beginning of your run.py:
import sys
import os

# make the parent directory of run.py importable
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
In this way, you are not relying on a non-standard feature to be implemented by your tool like PyCharm or `cle...
You should know where your latest model is located, then just call task.upload_artifact on that file?
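A minimal sketch (the checkpoint path and artifact name are placeholders):

from clearml import Task

task = Task.current_task()
# upload the latest checkpoint file as an artifact of the running task
task.upload_artifact(name="latest_model", artifact_object="checkpoints/latest.pth")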
The weird thing is that GPU 0 seems to be in use, as reported by nvtop on the host. But it is 50% slower than when running directly instead of through the clearml-agent ...
What exactly are you trying to achieve ?
Let's assume that you have Task.init() in run.py
And run.py is inside /foo/bar/
If you run:
cd /foo
python bar/run.py
Then the Task will have working folder /foo
If you run:
cd /foo/bar
python run.py
Then your task will have the working folder /foo/bar
I think ES uses a greedy strategy where it allocates first, then uses it from there ...
because when I was running both agents on my local machine everything was working perfectly fine
This is probably because you (or someone) had set up an SSH public key with your git repo sometime in the past
How are you using the function update_output_model?
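For reference, I would expect something like this, but I'm going from memory on the signature, so double-check it against your clearml version (the path and name are placeholders):

task.update_output_model(model_path="checkpoints/final.pth", name="final-model")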
Can you share the agent log, in the console tab, before the error?
kind of ....
Now that I think about it, the best approach would be to:
- Clone a task
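Roughly like this sketch (the task ID and queue name are placeholders, and enqueueing is just an example of the next step):

from clearml import Task

# clone an existing task and, for example, enqueue the copy
template = Task.get_task(task_id="abc123")               # placeholder ID
cloned = Task.clone(source_task=template, name="cloned run")
Task.enqueue(cloned, queue_name="default")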
@<1523701087100473344:profile|SuccessfulKoala55> Is it even possible to have the server store files to a given blob storage?