OK, so I got into this mess with argparse because I was turning OFF the automatic detection of command-line arguments.
I was turning it off because I was calling, inside my script, the argument parser from torch.distributed.run (the best way I found to run a torchrun command in the clearml-agent).
Because of torch.distributed.run, clearml was automatically tracking nonexistent command-line arguments, which led to an error on the remote agent.
In case this happens to anyone else, my ...
just to be clear, this works on my local machine:
distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)
But not when clearml-agent runs it
So the args are patched on the "main" process, but only on the remote worker
Oh wait. Do I need the Task to exist in the subprocesses?
I re-create it on the subprocesses, because I thought my tensorboard stuff wouldn't get logged if the task wasn't initialized
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
did you check this example?
@<1523701205467926528:profile|AgitatedDove14> actually I did! I based my code adaptation around it, since originally I was running a shell script that called torchrun
But tbh I didn't want to mess too much with my existing code, so I just did a quick and dirty adaptation using the torch.distributed.run command.
To be honest, i don't think using this envvar is the best option. I think just getting the task as normal (from the task name using Task.init) is the better option
But for these edge cases like i described, CLEARML_TASK_ID is ok
Are you saying you "manually" parse args?
More or less! Maybe there's a simpler solution that I haven't found yet.
I'm using torch.distributed.run to run my training on multiple GPUs.
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the following workaround:
distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)
Wh...
My final solution was to manually detect if I needed to patch the original argparse on the training script (by using the CLEARML_TASK_ID envvar) and to turn off the automatic argparse connection
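For anyone landing here later, the detection step described above could look roughly like this. It is only a sketch under assumptions: `build_parser`, `parse_with_manual_patch`, the `--gpus`/`--config` options and the `task_values` dict are hypothetical names, and the ClearML side (presumably creating the task with `Task.init(..., auto_connect_arg_parser=False)` and reading its stored parameters) is left out so the snippet runs without clearml installed.

```python
import argparse
import os


def running_as_clearml_task(environ=None):
    """True when the CLEARML_TASK_ID envvar is set and non-empty."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("CLEARML_TASK_ID"))


def build_parser():
    # Hypothetical stand-in for the training script's real parser.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", type=int, default=1)
    parser.add_argument("--config", default="default.yaml")
    return parser


def parse_with_manual_patch(argv, task_values, environ=None):
    """Parse argv normally, then overwrite known options with task_values,
    but only when the process looks like it is running under a ClearML task.

    task_values is assumed to be a plain dict of option name -> value that
    the caller has already fetched from the task (not shown here).
    """
    args = build_parser().parse_args(argv)
    if running_as_clearml_task(environ):
        for name, value in task_values.items():
            if hasattr(args, name):  # ignore values the parser does not know
                setattr(args, name, value)
    return args
```

Passing `environ` explicitly is only there to make the helper easy to try out; in the training script it would default to `os.environ`.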
Thanks!
Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized?
Does it mess with sys.argv? Does it inject itself into argparse?
I had to do another workaround since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it
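To illustrate why a second parser in the same process can end up with the wrong arguments: if something replaces `argparse.ArgumentParser.parse_args` at the class level, every parser is affected, including one owned by another library. The sketch below is a generic demonstration of that effect and is not taken from ClearML's code; `install_hook`, `remove_hook`, `demo` and the `--nproc` option are made-up names.

```python
import argparse


def install_hook(forced_argv):
    """Replace ArgumentParser.parse_args process-wide so that every parser
    sees forced_argv instead of the argv its caller passed in."""
    original = argparse.ArgumentParser.parse_args

    def patched(self, args=None, namespace=None):
        # The caller's `args` is deliberately ignored here.
        return original(self, forced_argv, namespace)

    argparse.ArgumentParser.parse_args = patched
    return original


def remove_hook(original):
    """Restore the untouched parse_args."""
    argparse.ArgumentParser.parse_args = original


def demo():
    # Hypothetical stand-in for a launcher's own parser.
    launcher = argparse.ArgumentParser()
    launcher.add_argument("--nproc", type=int, default=0)

    original = install_hook(["--nproc", "1"])
    try:
        hooked = launcher.parse_args(["--nproc", "4"]).nproc
    finally:
        remove_hook(original)
    plain = launcher.parse_args(["--nproc", "4"]).nproc
    return hooked, plain
```

While the hook is installed the launcher parser gets the forced value rather than the one it was given; once the hook is removed it behaves normally again.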
Hi @<1523701205467926528:profile|AgitatedDove14> , made this mock test real quick, it reproduces the issue:
None
Actually, in addition to the parallel coordinates 😄
They're both good ways to visualize, but i think scatterplots are more intuitive for my use case
Had some people in my org bump it up too.
If i get some free time I'll consider contributing
is it possible to change an existing model's URL?
Would it be smart to try to do this straight on the database? Is it MongoDB?
Wow, didn't catch the docker_setup_bash_script argument. Thanks!
- Not sure about pushing our container to a public registry. But follow-up question: how do I configure secrets (like container registry credentials) for a clearml-agent to use for a task?
- Is it possible to do this on a task-by-task basis? I thought clearml-agent only installs pip requirements and such, is there a way to configure a setup script for my task environment?
From what i understand, what this does is build a container from an existing task. That's not really what i need
I'll describe my use case, maybe it makes it clearer:
I have a Dockerfile which builds an image with a bunch of system dependencies i need for training.
I want clearml-agent to use this image, but run docker build whenever necessary. This way I don't need to keep updating the base image that I want my tasks to run with
And here is the repo: None
What do I set it to so that my models upload to clearml? Just server_ip:8081?
Basically: locally, when I run pip install -r requirements.txt, the softgroup.ops package is installed correctly. But not on the remote worker
I install the softgroup.ops package via the last line in requirements.txt, i.e. pip install -e .
Not sure if i can because of some proprietary stuff on the code.
But i'll try writing a minimum working example on monday!
I attached three logs:
- local_console_output: how I set up my local task. Important commands: apt-install, which installs the same dependencies that are in the docker_setup_bash_script; and pip install -r requirements.txt
- local_task_output: clearml experiment console log. The error "the following arguments are required: config" is the expected behavior
- remote_task_output: clearml experiment console log obtained when i clone the local task and enqueue it for remote execution. No...
in what order does the agent do things?
I assumed it was
- Start the docker container
- Run the docker setup bash script
- Pull the repo, check out the commit, apply changes
- Install pip requirements
In this case, I wouldn't have the correct version of the repo at the time the setup bash script runs
I need to pip-install the package because I need to build some CUDA extensions
Is there any way I can do something equivalent to -e . in the agent context?
Also, if you check the logs my package is actually built at step 4:
2023-05-03 10:07:58
Building wheels for collected packages: softgroup
Building wheel for softgroup (setup.py) ...
Looks like the -e flag is ignored. But it should work either way 🤔
Looks like it was a Python thing, not a clearml thing!
Clearml correctly installs the . from requirements.txt, but the project from the working directory was conflicting with the installed package, so Python couldn't find the compiled extension.
With some small changes to my repo, everything works
Thanks for the help anyway!
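In case it helps someone debugging the same symptom, here is a small self-contained reproduction of that kind of conflict. It assumes the cause is plain sys.path ordering (a source checkout that lacks the built extension sitting ahead of the installed copy); the package name `shadow_demo`, the `ops` submodule and the helper names are all made up for the demo, and a pure-Python file stands in for the compiled extension.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def make_pkg(root, name, with_ops):
    """Create a tiny package under root; optionally include an `ops` submodule
    that stands in for the compiled extension."""
    pkg = Path(root) / name
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    if with_ops:
        (pkg / "ops.py").write_text("VALUE = 42\n")


def _forget(name):
    # Drop cached imports so each lookup starts from a clean state.
    for mod in [m for m in sys.modules if m == name or m.startswith(name + ".")]:
        del sys.modules[mod]


def ops_found(search_path, name="shadow_demo"):
    """Return True if `<name>.ops` resolves with search_path at the front of sys.path."""
    saved = list(sys.path)
    sys.path[:0] = search_path
    try:
        _forget(name)
        importlib.invalidate_caches()
        try:
            return importlib.util.find_spec(name + ".ops") is not None
        except ModuleNotFoundError:
            return False
    finally:
        sys.path[:] = saved
        _forget(name)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "checkout"  # working directory: sources only
        site = Path(tmp) / "site"          # installed copy: has the extension
        make_pkg(checkout, "shadow_demo", with_ops=False)
        make_pkg(site, "shadow_demo", with_ops=True)
        shadowed = ops_found([str(checkout), str(site)])
        clean = ops_found([str(site)])
    return shadowed, clean
```

With the checkout first on the path, the package resolves to the source tree and the `ops` lookup fails even though the installed copy has it; with only the installed copy on the path it is found. On a real setup, printing `softgroup.__file__` on the worker should show which copy is actually being imported.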