Hi @<1523701205467926528:profile|AgitatedDove14>, made this mock test real quick, it reproduces the issue: None
Follow-up question: how does ClearML "inject" the argparse arguments before the task is initialized?
it patches the actual parse_args call. To make sure it works, you just need to make sure clearml was imported before the actual call takes place
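Conceptually it's plain monkey-patching. A simplified illustration of the mechanism (this is NOT the actual clearml implementation, just the general idea):

import argparse

# keep a reference to the original method, then wrap it
_original_parse_args = argparse.ArgumentParser.parse_args

def _patched_parse_args(self, args=None, namespace=None):
    parsed = _original_parse_args(self, args=args, namespace=namespace)
    # a tracking library could record (or override) the parsed values here
    return parsed

argparse.ArgumentParser.parse_args = _patched_parse_args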
I had to do another workaround since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it
Are you saying you "manually" parse args?
Oh wait. Do I need the Task to exist in the subprocesses?
I re-create it on the subprocesses, because I thought my tensorboard stuff wouldn't get logged if the task wasn't initialized
Okay, I take it back. os.getenv("CLEARML_TASK_ID") does work. I forgot to rebuild my container after making the change. Thanks for bringing this option to my attention!
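In code, something along these lines (a sketch; assumes the agent sets CLEARML_TASK_ID in the worker environment):

import os
from clearml import Task

# inside a spawned worker: reattach to the task the agent created
task_id = os.getenv("CLEARML_TASK_ID")
if task_id:
    task = Task.get_task(task_id=task_id)
    logger = task.get_logger()  # use this for manual reporting from the worker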
- Yes, Task.init should be called on each subprocess (because torch forks them before they are patched) - see the sketch after this list
- I think the main issue is that we patch the argparse on the subprocess (this is assuming you did not manually parse non-argv arguments)
- If you can create a mock test I think we can work around the issue, as long as the way you spin it is the standard pytorch distributed way
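On the first point, a minimal sketch of Task.init per subprocess with the standard spawn entry point (project/task names are placeholders):

import torch.multiprocessing as mp
from clearml import Task

def worker(rank, world_size):
    # each subprocess calls Task.init again; clearml should detect the
    # main-process task and attach to it instead of creating a new one
    task = Task.init(project_name="my_project", task_name="my_task")  # placeholder names
    # ... torch.distributed.init_process_group(...) and the actual training would go here ...

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)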
OK, so I got into this mess with the argparse because I was turning OFF the automatic detection of command line arguments.
I was turning it off because I was calling, inside my script, the argparser from torch.distributed.run (the best way I found to run a torchrun command in the clearml-agent).
Because of torch.distributed.run, clearml was automatically tracking nonexistent command line arguments, which led to an error on the remote agent.
In case this happens to anyone else, my solution was the following:
valid_args = {action.dest: True for action in get_arg_parser()._actions}  # get_arg_parser() builds our script's own ArgumentParser
task = Task.init(
    project_name=args.project_name,
    task_name=args.task_name,
    auto_connect_arg_parser={**valid_args, "*": False},  # only consider OUR args (not torch.distributed.run's)
)
Thanks!
Follow-up question: how does ClearML "inject" the argparse arguments before the task is initialized? Does it mess with sys.argv? Does it inject itself into argparse?
I had to do another workaround since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it
Hi @<1556450111259676672:profile|PlainSeaurchin97>
Is there any simple way to use argparse to pass a clearml task name?
need to call args = task.connect(args).
noooo, there is no need to do that, the arguments are automatically detected
see for yourself:
args = parse_args()
task = Task.init(task_name=args.task_name)
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
@<1556450111259676672:profile|PlainSeaurchin97> did you check this example? None
@<1523701205467926528:profile|AgitatedDove14> actually I did! I based my code adaptation around it, since originally I was running a shell script that called torchrun.
But tbh I didn't want to mess too much with my existing code, so I just did a quick and dirty adaptation using the torch.distributed.run command.
Yes this is exactly the solution!
Nice!
I actually have a question about your original code snippet, @<1556450111259676672:profile|PlainSeaurchin97>. I have been trying to figure out a way to access the task object when running remotely so that I can instantiate the logger, but when I tried task_id = os.getenv("CLEARML_TASK_ID"), it's returning None. I also tried Task.current_task() and also got None back. What is the recommended way to access the Task object from within the remote agent?
Are you saying you "manually" parse args?
More or less! Maybe there's a simpler solution that I haven't found yet.
I'm using torch.distributed.run to run my training on multiple GPUs.
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the following workaround:
distributed_args = torch.distributed.run.parse_args(sys.argv)  # sys.argv[0] (this script) becomes the training script to relaunch
distributed_args.nproc_per_node = args.gpus  # one worker process per requested GPU
torch.distributed.run.run(distributed_args)  # spawn the distributed workers
Which would be the equivalent of calling torchrun train.py arg1 arg2 ...
Except since clearml patches the parse_args call inside of the torch.distributed.run.parse_args function, it generates the same arguments I passed to script.py and gives an error like "error: the following arguments are required: torchrun_arg_1, torchrun_arg_2 ..."
just to be clear, this works on my local machine:
distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)
But not when clearml-agent runs it
So the args are patched on the "main" process, but only on the remote worker
To be honest, I don't think using this envvar is the best option. I think just getting the task as normal (from the task name, using Task.init) is the better option.
But for these edge cases like I described, CLEARML_TASK_ID is ok.
My final solution was to manually detect if I needed to patch the original argparse on the training script (by using the CLEARML_TASK_ID envvar) and to turn off the automatic argparse connection, roughly as in the sketch below.
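A sketch of that idea (get_arg_parser() stands in for the script's real parser, and CLEARML_TASK_ID is assumed to be set only when the agent runs the task):

import os
import argparse
from clearml import Task

def get_arg_parser():
    # stand-in for the script's real parser
    parser = argparse.ArgumentParser()
    parser.add_argument("--project_name", default="my_project")
    parser.add_argument("--task_name", default="my_task")
    parser.add_argument("--gpus", type=int, default=1)
    return parser

# CLEARML_TASK_ID is only set when a clearml-agent is executing the task
running_on_agent = os.getenv("CLEARML_TASK_ID") is not None

args = get_arg_parser().parse_args()

if running_on_agent:
    # restrict auto-connection to our own arguments, so clearml does not
    # also capture torch.distributed.run's internal ArgumentParser
    valid_args = {action.dest: True for action in get_arg_parser()._actions}
    auto_connect = {**valid_args, "*": False}
else:
    auto_connect = True  # locally, the default auto-detection works fine

task = Task.init(
    project_name=args.project_name,
    task_name=args.task_name,
    auto_connect_arg_parser=auto_connect,
)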
Hmm yeah I can see why...
Now that I think about it, at least in theory the second process that torch creates should inherit from the main one, and as such Task.init is basically "ignored"
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?