Hi All! Is There Any Simple Way To Use

Answered

Hi all!

Is there any simple way to use argparse to pass a clearml task name?
I was using an argument called --clearml_task for this, but I ran into an interesting issue: to track the args, I need to call Task.init(task_name=args.clearml_task), but to get the args object (and have it overwritten on the remote clearml-agent), I need to call args = task.connect(args).
So I have a chicken-and-egg situation.

My solution was a workaround like this:

import os

from clearml import Task

# get_arg_parser() and prep_clearml() are my own helpers (not shown here)


def get_task_if_remote():
    # CLEARML_TASK_ID is set by clearml-agent
    task_id = os.environ.get("CLEARML_TASK_ID")
    if task_id is not None:
        return Task.get_task(task_id=task_id)
    return None  # local run: no task exists yet


if __name__ == "__main__":
    task = get_task_if_remote()

    if task is None:  # First run
        args = get_arg_parser().parse_args()
        is_remote = False
    else:  # ClearML is running this remotely
        is_remote = True
        args = get_arg_parser().parse_args(["fake_config.yaml", "--clearml_id", task.task_id])

    task, cfg, args = prep_clearml(args)

    if not is_remote:
        task.execute_remotely(queue_name=args.remote)

  				
Posted 
	2 years ago

					PlainSeaurchin97
				


Answers 17

Are you saying you "manually" parse args?

More or less! Maybe there's a simpler solution that I haven't found yet.

I'm using torch.distributed.run to run my training on multiple GPUs.
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the following workaround:

import sys
import torch.distributed.run

distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)

Which would be the equivalent of calling torchrun train.py arg1 arg2 ...

Except that, since clearml patches the parse_args call inside the torch.distributed.run.parse_args function, that call ends up with the same arguments I passed to script.py and fails with an error like "error: the following arguments are required: torchrun_arg_1 , torchrun_arg_2 ..."

  				
PlainSeaurchin97

Yes, Task.init should be called on each subprocess (because torch forks them before they are patched).
I think the main issue is that we patch argparse on the subprocess (this is assuming you did not manually parse non-argv arguments).
If you can create a mock test, I think we can work around the issue, as long as the way you spin it up is the standard PyTorch distributed way.

  				
AgitatedDove14

Hmm yeah I can see why...
Now that I think about it, at least in theory, the second process that torch creates should inherit from the main one, and as such Task.init is basically "ignored".
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?

  				
AgitatedDove14

Thanks!

Follow-up question: how does ClearML "inject" the argparse arguments before the task is initialized?
Does it mess with sys.argv? Does it inject itself into argparse?

I had to do another workaround, since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it.

  				
PlainSeaurchin97

Hi @AgitatedDove14, I made this mock test real quick; it reproduces the issue:
[link not preserved]

  				
PlainSeaurchin97

Okay, I take it back. os.getenv("CLEARML_TASK_ID") does work. I forgot to rebuild my container after making the change. Thanks for bringing this option to my attention!

  				
NuttyLobster9

To be honest, I don't think using this envvar is the best option. I think just getting the task as normal (from the task name, using Task.init) is the better option.

But for edge cases like the one I described, CLEARML_TASK_ID is OK.

  				
PlainSeaurchin97

"Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized?"

It patches the actual parse_args call. To make sure it works, you just need to make sure clearml was imported before the actual call takes place.
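To see why patching parse_args can also affect a second, unrelated parser in the same process, here is a toy illustration using only the standard library. It is not ClearML's actual implementation; install_override, remove_override and both parser builders are hypothetical names made up for this sketch, and the real patch may behave differently.

```python
import argparse

# Illustration only: a toy patch, not how ClearML implements its own.
_original_parse_args = argparse.ArgumentParser.parse_args


def install_override(forced_argv):
    """Make every ArgumentParser in this process parse `forced_argv`."""
    def patched(self, args=None, namespace=None):
        # The caller's own `args` are ignored here, which is what surprises
        # a second parser that expected a different argument list.
        return _original_parse_args(self, forced_argv, namespace)

    argparse.ArgumentParser.parse_args = patched


def remove_override():
    """Restore the original, unpatched parse_args."""
    argparse.ArgumentParser.parse_args = _original_parse_args


def build_script_parser():
    # Hypothetical parser for the training script's own arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", type=int, default=1)
    return parser


def build_launcher_parser():
    # Stand-in for a second parser living in the same process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--nproc", type=int, default=1)
    return parser
```

Because the patch lives on the ArgumentParser class, the launcher parser receives the script's arguments too and rejects them, which resembles the error described earlier in the thread.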

"I had to do another workaround since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it"

Are you saying you "manually" parse args?

  				
AgitatedDove14

Yes this is exactly the solution!
Nice 🎊 !

  				
AgitatedDove14

Hi @PlainSeaurchin97

"Is there any simple way to use argparse to pass a clearml task name?"

"need to call args = task.connect(args)."

Noooo 🙂 there is no need to do that; the arguments are automatically detected.
See for yourself:

args = parse_args()
task = Task.init(task_name=args.task_name)

  				
AgitatedDove14

OK, so I got into this mess with argparse because I was turning OFF the automatic detection of command-line arguments.

I was turning it off because, inside my script, I was calling the argparser from torch.distributed.run (the best way I found to run a torchrun command in the clearml-agent).

Because of torch.distributed.run, clearml was automatically tracking nonexistent command-line arguments, which led to an error on the remote agent.

In case this happens to anyone else, my solution was the following:

valid_args = {action.dest: True for action in get_arg_parser()._actions}
task = Task.init(
    project_name=args.project_name,
    task_name=args.task_name,
    # only consider OUR args (not torch.distributed.run's)
    auto_connect_arg_parser={**valid_args, "*": False},
)
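The allow-list part of that solution can be tried on its own, without ClearML installed. A minimal sketch, assuming a hypothetical get_arg_parser() with made-up argument names; build_arg_filter is likewise a name invented for this example:

```python
import argparse


def build_arg_filter(parser):
    """Return {dest: True, ..., "*": False} for the parser's own arguments."""
    # `_actions` is a private argparse attribute, used as in the post above.
    allowed = {action.dest: True for action in parser._actions}
    allowed["*"] = False  # anything not listed explicitly is excluded
    return allowed


def get_arg_parser():
    # Hypothetical parser standing in for the training script's own.
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_name", default="my-task")
    parser.add_argument("--gpus", type=int, default=1)
    return parser
```

The resulting dictionary is what would be passed as auto_connect_arg_parser to Task.init, as in the snippet above. Note that it also contains the built-in "help" entry that argparse adds by default.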

  				
PlainSeaurchin97

Oh wait. Do I need the Task to exist in the subprocesses?
I re-create it on the subprocesses, because I thought my tensorboard stuff wouldn't get logged if the task wasn't initialized

  				
PlainSeaurchin97

My final solution was to manually detect whether I needed to patch the original argparse in the training script (by using the CLEARML_TASK_ID envvar) and to turn off the automatic argparse connection.
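The detection step could look roughly like the sketch below. It relies only on the CLEARML_TASK_ID variable mentioned in this thread as being set by clearml-agent; the function names and the optional environ parameter (there to make testing easy) are assumptions of this example.

```python
import os


def get_remote_task_id(environ=None):
    """Return the task ID set by the agent, or None for a local run."""
    env = os.environ if environ is None else environ
    task_id = env.get("CLEARML_TASK_ID")
    # Treat an empty string the same as an unset variable.
    return task_id or None


def is_remote_run(environ=None):
    """True when the script appears to be running under clearml-agent."""
    return get_remote_task_id(environ) is not None
```

A script could branch on is_remote_run() to decide whether to parse the real command line or to rebuild its arguments from the existing task.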

  				
PlainSeaurchin97

I actually have a question about your original code snippet, @PlainSeaurchin97. I have been trying to figure out a way to access the task object when running remotely so that I can instantiate the logger, but when I tried task_id = os.getenv("CLEARML_TASK_ID"), it returned None. I also tried Task.current_task() and also got None back. What is the recommended way to access the Task object from within the remote agent?

  				
NuttyLobster9

"Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the"

"did you check this example?"

@AgitatedDove14 actually I did! I based my code adaptation on it, since originally I was running a shell script that called torchrun.

But tbh I didn't want to mess too much with my existing code, so I just did a quick and dirty adaptation using the torch.distributed.run command.

  				
PlainSeaurchin97

Just to be clear, this works on my local machine:

distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)

But not when clearml-agent runs it.

So the args are patched on the "main" process, but only on the remote worker.

  				
PlainSeaurchin97

"Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the"

@PlainSeaurchin97 did you check this example?
[link not preserved]

  				
AgitatedDove14
