Hi @<1556450111259676672:profile|PlainSeaurchin97>
Is there any simple way to use argparse to pass a clearml task name? Do I need to call args = task.connect(args)?
noooo, there is no need to do that, the arguments are automatically detected
see for yourself:
from clearml import Task

args = parse_args()
task = Task.init(task_name=args.task_name)
OK, so I got into this mess with the argparse because I was turning OFF the automatic detection of command line arguments.
I was turning it off because I was calling, inside my script, the argparser from torch.distributed.run (the best way I found to run a torchrun command in the clearml-agent).
Because of torch.distributed.run, clearml was automatically tracking nonexistent command line arguments, which led to an error on the remote agent.
In case this happens to anyone else, my solution was the following:
valid_args = {action.dest: True for action in get_arg_parser()._actions}
task = Task.init(
    project_name=args.project_name,
    task_name=args.task_name,
    auto_connect_arg_parser={**valid_args, "*": False},  # only consider OUR args (not torch.distributed.run's)
)
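The trick is the "*" key: as far as I understand, it matches every argument not explicitly listed in the dict, so setting it to False tells clearml to ignore everything that isn't one of OUR arguments.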
Yes this is exactly the solution!
Nice!
Thanks!
Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized? Does it mess with sys.argv? Does it inject itself into argparse?
I had to do another workaround, since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it.
Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized?
it patches the actual parse_args call; to make sure it works, you just need to make sure clearml was imported before the actual call takes place
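Something like this (a minimal sketch, assuming the patch is installed when clearml is imported, as described above):

import argparse
from clearml import Task  # importing clearml is what installs the patched parse_args

parser = argparse.ArgumentParser()
parser.add_argument("--task_name", default="demo")
args = parser.parse_args()  # the patched call records the arguments

task = Task.init(task_name=args.task_name)  # the recorded arguments get attached to the task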
I had to do another workaround, since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it.
Are you saying you "manually" parse args?
Are you saying you "manually" parse args?
More or less! Maybe there's a simpler solution that I haven't found yet.
I'm using torch.distributed.run to run my training on multiple GPUs.
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the following workaround:
distributed_args = torch.distributed.run.parse_args(sys.argv)  # sys.argv[0] (this script's path) becomes the training script torchrun launches
distributed_args.nproc_per_node = args.gpus  # one worker process per GPU
torch.distributed.run.run(distributed_args)
Which would be the equivalent of calling torchrun train.py arg1 arg2 ...
Except, since clearml patches the parse_args call inside of the torch.distributed.run.parse_args function, it generates the same arguments I passed to script.py and gives an error like "error: the following arguments are required: torchrun_arg_1, torchrun_arg_2 ...".
My final solution was to manually detect if I needed to patch the original argparse on the training script (by using the CLEARML_TASK_ID envvar) and to turn off the automatic argparse connection.
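Something along these lines (a rough sketch, not my exact code; get_arg_parser() is my own helper from the snippet above):

import os
from clearml import Task

if os.getenv("CLEARML_TASK_ID"):
    # running under a clearml-agent: keep clearml away from torch.distributed.run's parser
    task = Task.init(auto_connect_arg_parser=False)
    args = get_arg_parser().parse_args()
    task.connect(args)  # connect only OUR arguments, manually
else:
    args = get_arg_parser().parse_args()
    task = Task.init(project_name=args.project_name, task_name=args.task_name)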
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
@<1556450111259676672:profile|PlainSeaurchin97> did you check this example?
None
I actually have a question about your original code snippet, @<1556450111259676672:profile|PlainSeaurchin97>. I have been trying to figure out a way to access the task object when running remotely so that I can instantiate the logger, but when I tried task_id = os.getenv("CLEARML_TASK_ID"), it's returning None. I also tried Task.current_task() and also got None back. What is the recommended way to access the Task object from within the remote agent?
Okay, I take it back. os.getenv("CLEARML_TASK_ID") does work. I forgot to rebuild my container after making the change. Thanks for bringing this option to my attention!
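For reference, this is roughly what worked for me (a minimal sketch; assumes the agent has already created the task and exported CLEARML_TASK_ID):

import os
from clearml import Task

task_id = os.getenv("CLEARML_TASK_ID")  # set by the clearml-agent on remote runs
task = Task.get_task(task_id=task_id)   # fetch the existing task object
logger = task.get_logger()              # instantiate the logger from it
logger.report_text("hello from the remote worker")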
To be honest, I don't think using this envvar is the best option. I think just getting the task as normal (from the task name, using Task.init) is the better option.
But for edge cases like the one I described, CLEARML_TASK_ID is OK.
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
did you check this example?
@<1523701205467926528:profile|AgitatedDove14> actually I did! I based my code adaptation around it, since originally I was running a shell script that called torchrun.
But tbh I didn't want to mess too much with my existing code, so I just did a quick and dirty adaptation using the torch.distributed.run command.
Hmm yeah I can see why...
Now that I think about it, at least in theory the second process that torch creates should inherit from the main one, and as such Task.init is basically "ignored"
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?
Oh wait. Do I need the Task to exist in the subprocesses?
I re-create it on the subprocesses, because I thought my tensorboard stuff wouldn't get logged if the task wasn't initialized
just to be clear, this works on my local machine:
distributed_args = torch.distributed.run.parse_args(sys.argv)
distributed_args.nproc_per_node = args.gpus
torch.distributed.run.run(distributed_args)
But not when clearml-agent runs it
So the args are patched on the "main" process, but only on the remote worker
- Yes, Task.init should be called on each subprocess (because torch forks them before they are patched); a sketch of what I mean follows this list
- I think the main issue is that we patch the argparse on the subprocess (this is assuming you did not manually parse non-argv arguments)
- If you can create a mock test I think we can work around the issue, as long as the way you spin it is the standard pytorch distributed way
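Something like this is what I have in mind (just a sketch, the names are illustrative):

from clearml import Task

def worker_main():
    # same project/task name as the main process; in a forked worker this
    # should return the already-created task instead of creating a new one
    task = Task.init(project_name="my_project", task_name="my_task")
    task.get_logger().report_text("worker is up")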
Hi @<1523701205467926528:profile|AgitatedDove14>, made this mock test real quick, it reproduces the issue:
None