Hi all!
I'm struggling with a specific scenario; maybe you could help. I have 2 machine types (for example Titan/A100) and 10 types of models with generic code (I pass the model name as an arg). I want to create a task for each of the 20 runs, so that I can later compare each machine's results per tested model.
I have implemented a loop over the model names. Inside it I create a task for each machine type and run the tasks remotely, each on a separate queue. I have an active agent on both machines. I want to enqueue the 10 experiments in each queue and let the agents do their job. But it doesn't work as expected: only 2 tasks were created. I've started to think this is not the right way to do it.

import timm
from pprint import pprint
import clearml
from clearml import Task
import subprocess 
from datetime import date, datetime

now = str(date.today()) + '_' + str(str(datetime.now().hour) + '_' + str(datetime.now().minute) + '_' + str(datetime.now().second))
project_name = 'timm_repos_runner'
gpu_queue_name = 'gpu_queue'
gpu2_queue_name = 'gpu2_queue'
model_names = timm.list_models(pretrained=True)
pprint(model_names[1:10])

# for loop over all models (pretrained = true)
for model_name in model_names[1:10]:
    pprint(f'running the model: {model_name}')
    # execute the experiment on gpu1
    task_gpu = Task.init(project_name=project_name, task_name=f'gpu_Remote_execution_{model_name},{now}')
    task_gpu.execute_remotely(queue_name=gpu_queue_name, clone=True, exit_process=False)
    pprint(f'running the model: {model_name} over gpu')
    subprocess.run(['bash', '/workdisk/ydagan/clearml_infra/run_val_script_gpu.sh', model_name])
    task_gpu.close()
    
    # execute the experiment on gpu2
    task_gpu2 = Task.init(project_name=project_name, task_name=f'gpu2_Remote_execution_{model_name},{now}')
    task_gpu2.execute_remotely(queue_name=gpu2_queue_name, clone=True, exit_process=False)
    
    pprint(f'running the model: {model_name} over gpu2')
    subprocess.run(['docker','exec', '--privileged', '-it', 'torch_docker', 'bash', '/workdisk/ydagan/clearml_infra/run_val_script_gpu2.sh', model_name])
    task_gpu2.close()
    
# TODO: get results of the two exps and compare

  				
Posted one year ago by ObliviousClams17

Answers

Hi @ObliviousClams17, I think for your specific use case it would be easiest to use the API: fetch a task, clone it as many times as needed, and enqueue the clones into the relevant queues.

Fetch a task
Clone a task
Enqueue a task (or many)

If you open developer tools (F12) in the ClearML UI, you will be able to see example usages for all these calls.
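
For illustration, here is a minimal sketch of that fetch / clone / enqueue flow. It uses the ClearML Python SDK (`Task.get_task`, `Task.clone`, `Task.enqueue`) rather than raw REST calls, so treat it as one possible way to do it, not the only one. Assumptions: a template task has already been run once so it exists on the server, its script reads the model name from an argparse argument named `--model_name` (stored by ClearML as `Args/model_name`), and the helper names `build_run_plan` and `enqueue_runs` are made up for this example. The queue names are the ones from the question.

```python
from itertools import product


def build_run_plan(model_names, queue_names):
    """Return one (task_name, queue_name, model_name) entry per model/queue pair."""
    return [
        (f"{queue}_{model}", queue, model)
        for model, queue in product(model_names, queue_names)
    ]


def enqueue_runs(template_task_id, plan, param_name="Args/model_name"):
    """Clone the template task once per plan entry and enqueue each clone."""
    # Imported here so the plan can be built and inspected without ClearML installed.
    from clearml import Task

    template = Task.get_task(task_id=template_task_id)
    for task_name, queue_name, model_name in plan:
        cloned = Task.clone(source_task=template, name=task_name)
        # Assumption: the template script takes --model_name via argparse,
        # which ClearML records under the "Args" section.
        cloned.set_parameter(param_name, model_name)
        Task.enqueue(cloned, queue_name=queue_name)


if __name__ == "__main__":
    models = [f"model_{i}" for i in range(10)]  # placeholder model names
    plan = build_run_plan(models, ["gpu_queue", "gpu2_queue"])
    print(len(plan))  # number of tasks that would be created
    # Needs a reachable ClearML server and agents listening on both queues:
    # enqueue_runs("<template-task-id>", plan)
```

With this layout the script only schedules work; each agent picks up its own clones from its queue, and the per-machine results can then be compared by task name.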

  				
Posted one year ago by CostlyOstrich36