Sounds doable, I will give it a try.
The `task.execute_remotely` thing is quite interesting, I didn't know about that!
Gave it a try; it seems our GPU queue doesn't have the S3 credentials set up correctly. I'm making a separate thread about that.
Hi SmallDeer34
Is the Dataset in clearml-data? If it is, then `Dataset.get().get_local_copy()` will get you a cached local copy of the entire dataset.
If it is not, then you can use `StorageManager.get_local_copy(url_here)` to download the dataset.
- Any argparse arguments are automatically logged (and can later be overridden from the UI). Specifically, `HfArgumentParser` will be automatically logged: https://github.com/huggingface/transformers/blob/e43e11260ff3c0a1b3cb0f4f39782d71a51c0191/examples/pytorch/language-modeling/run_mlm.py#L200
Basically I would do the following:
Clone the huggingface repo to your dev machine.
Edit `run_mlm.py` locally:
Add a `Task.init()` call, add the Dataset / StorageManager download, and add `task.execute_remotely(queue_name='my_gpu_queue')`. This will make sure that all the local changes are automatically restored on the remote machine, auto-populate the default arguments, and stop the local execution and relaunch the Task on the remote GPU.
wdyt?