I actually have a question about your original code snippet, @<1556450111259676672:profile|PlainSeaurchin97> . I have been trying to figure out a way to access the task object when running remotely so that I can instantiate the logger, but when I tried task_id = os.getenv("CLEARML_TASK_ID"), it's returning None. I also tried Task.current_task() and also got None back. What is the recommended way to access the Task object from within the remote agent?
did you check this example?
@<1523701205467926528:profile|AgitatedDove14> actually I did! I based my code adaptation around it, since originally I was running a shell script that called torchrun.
But tbh I didn't want to mess too much with my existing code, so I just did a quick and dirty adaptation using the torch.distributed.run command.
To be honest, I don't think using this envvar is the best option. I think just getting the task as normal (from the task name using Task.init) is the better option.
But for edge cases like the one I described, CLEARML_TASK_ID is ok.
Are you saying you "manually" parse args?
More or less! Maybe there's a simpler solution that I haven't found yet.
I'm using torch.distributed.run to run my training on multiple GPUs.
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the following workaround:
distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)
Which would be the equivalent of calling torchrun train.py arg1 arg2 ...
Except since clearml patches the parse_args call inside of the torch.distributed.run.parse_args function, it generates the same arguments I passed to script.py and gives an error like "error: the following arguments are required: torchrun_arg_1, torchrun_arg_2 ..."
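For anyone trying to picture the failure, here is a toy simulation using only the standard library. It is not ClearML's or torch's actual code: `RaisingParser`, `build_launcher_parser` and `parse_with_override` are hypothetical names, and the "override" simply models a patched parse_args that ignores the arguments the caller passed and uses the training script's arguments instead.

```python
import argparse


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of exiting, so the error can be inspected."""

    def error(self, message):
        raise ValueError(message)


def build_launcher_parser():
    # Hypothetical stand-in for the launcher's own parser (not torch's real one).
    parser = RaisingParser(prog="launcher")
    parser.add_argument("--nproc_per_node", default="1")
    parser.add_argument("training_script")
    return parser


def parse_with_override(parser, passed_args, override_args=None):
    # Toy model of a patched parse_args: when an override is present,
    # the arguments the caller passed are ignored.
    effective = passed_args if override_args is None else override_args
    return parser.parse_args(effective)


launcher_argv = ["--nproc_per_node=2", "train.py"]

# "Local" behaviour: the launcher parser sees the launcher arguments.
local = parse_with_override(build_launcher_parser(), launcher_argv)

# "Remote" behaviour in this toy model: the launcher parser is handed the
# training script's arguments, so its own required argument is missing.
try:
    parse_with_override(
        build_launcher_parser(), launcher_argv, override_args=["--epochs=3"]
    )
    remote_error = None
except ValueError as exc:
    remote_error = str(exc)
```

In this sketch the second call fails with a "the following arguments are required" message, which has the same shape as the error quoted above.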
Thanks!
Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized?
Does it mess with sys.argv? Does it inject itself into argparse?
I had to do another workaround since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it
It patches the actual parse_args call. To make sure it works, you just need to make sure clearml was imported before the actual call takes place.
just to be clear, this works on my local machine:
distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)
But not when clearml-agent runs it
So the args are patched on the "main" process, but only on the remote worker
- Yes, Task.init should be called on each subprocess (because torch forks them before they are patched)
- I think the main issue is that we patch the argparse on the subprocess (this is assuming you did not manually parse non-argv arguments)
- If you can create a mock test I think we can work around the issue, as long as the way you spin it is the standard pytorch distributed way
Oh wait. Do I need the Task to exist in the subprocesses?
I re-create it on the subprocesses, because I thought my tensorboard stuff wouldn't get logged if the task wasn't initialized
Hi @<1556450111259676672:profile|PlainSeaurchin97>
Is there any simple way to use argparse to pass a clearml task name?
need to call args = task.connect(args).
noooo, there is no need to do that, the arguments are automatically detected
see for yourself
args = parse_args()
task = Task.init(task_name=args.task_name)
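A slightly fuller sketch of the same idea, in case it helps. The argument names (`--project_name`, `--task_name`, `--epochs`) and the `init_task` helper are made up for illustration; the only clearml call used is Task.init, imported lazily so the parsing part can be tried without clearml installed.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="training script")
    # Hypothetical argument names; pick whatever fits your script.
    parser.add_argument("--project_name", default="examples")
    parser.add_argument("--task_name", default="training")
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


def init_task(args):
    # Lazy import: only needed when a task is actually created.
    from clearml import Task

    # No task.connect(args) here: per the reply above, the argparse
    # arguments are detected automatically.
    return Task.init(project_name=args.project_name, task_name=args.task_name)


# Parsing works on its own; init_task(args) would create the task.
args = parse_args(["--task_name", "my-experiment", "--epochs", "5"])
```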
Okay, I take it back. os.getenv("CLEARML_TASK_ID") does work. I forgot to rebuild my container after making the change. Thanks for bringing this option to my attention!
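For reference, a minimal sketch of that lookup. The helper names `get_agent_task_id` and `get_remote_task` are made up, and it assumes CLEARML_TASK_ID is set in the environment when the agent runs the code, as reported above. Task.get_task(task_id=...) is imported lazily so the environment check runs without clearml installed.

```python
import os


def get_agent_task_id(env=None):
    # Returns the task ID from the environment, or None when it is not set
    # (for example when running locally).
    env = os.environ if env is None else env
    return env.get("CLEARML_TASK_ID") or None


def get_remote_task(env=None):
    task_id = get_agent_task_id(env)
    if task_id is None:
        return None
    # Lazy import: only needed when a task ID is actually present.
    from clearml import Task

    return Task.get_task(task_id=task_id)
```

From there, something like get_remote_task().get_logger() should give access to the logger, if the task was found.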
Hmm yeah I can see why...
Now that I think about it, at least in theory the second process that torch creates, should inherit from the main one, and as such Task.init is basically "ignored"
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?
My final solution was to manually detect if I needed to patch the original argparse on the training script (by using the CLEARML_TASK_ID envvar) and to turn off the automatic argparse connection
Yes this is exactly the solution!
Nice!
Hi @<1523701205467926528:profile|AgitatedDove14> , made this mock test real quick, it reproduces the issue:
OK, so I got into this mess with the argparse because I was turning OFF the automatic detection of command line arguments.
I was turning it off because I was calling, inside my script, the argparser from torch.distributed.run (best way I found to run a torchrun command in the clearml-agent).
Because of torch.distributed.run, clearml was automatically tracking nonexistent command line arguments, which led to an error on the remote agent.
In case this happens to anyone else, my solution was the following:
valid_args = {action.dest: True for action in get_arg_parser()._actions}
task = Task.init(
    project_name=args.project_name,
    task_name=args.task_name,
    # only consider OUR args (not torch.distributed.run's)
    auto_connect_arg_parser={**valid_args, "*": False},
)
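If you want to check what ends up in that dictionary before handing it to Task.init, here is a standalone sketch. `get_arg_parser` below is a hypothetical stand-in for the training script's parser, and `build_arg_filter` is a made-up helper name. Unlike the snippet above, it skips argparse's built-in "help" action, which is only a tidy-up and an assumption on my part that it is not needed there.

```python
import argparse


def get_arg_parser():
    # Hypothetical stand-in for the training script's own parser.
    parser = argparse.ArgumentParser()
    parser.add_argument("--project_name", default="examples")
    parser.add_argument("--task_name", default="training")
    parser.add_argument("--gpus", type=int, default=1)
    return parser


def build_arg_filter(parser):
    # _actions is a private argparse attribute, so this may break in future
    # Python versions; it also contains the built-in "help" action.
    own_args = {
        action.dest: True for action in parser._actions if action.dest != "help"
    }
    # "*": False follows the solution above: ignore everything not listed.
    return {**own_args, "*": False}


arg_filter = build_arg_filter(get_arg_parser())
```

The resulting `arg_filter` would then be passed as auto_connect_arg_parser=arg_filter in the Task.init call.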