Hi all!
I'm struggling with a specific scenario and hope you can help. I have two machine types (for example, Titan and A100) and 10 models that share generic code (I pass the model name as an argument). I want to create a task for each of the 20 runs so that I can later compare each machine's results per tested model.
I implemented a loop over the model names. Inside it, I create a task for each machine type and run each task remotely on that machine's own queue. I have an active agent on both machines. I want to enqueue the 10 experiments in each queue and let the agents do their job. But it doesn't work as expected: only 2 tasks were created. I'm starting to think this is not the right way to do it.
import timm
from pprint import pprint
import clearml
from clearml import Task
import subprocess
from datetime import datetime
# one timestamp shared by all task names, taken from a single now() call
now = datetime.now().strftime('%Y-%m-%d_%H_%M_%S')
project_name = 'timm_repos_runner'
gpu_queue_name = 'gpu_queue'
gpu2_queue_name = 'gpu2_queue'
model_names = timm.list_models(pretrained=True)
pprint(model_names[:10])
# loop over the first 10 pretrained models ([1:10] would skip index 0 and yield only 9)
for model_name in model_names[:10]:
    pprint(f'running the model: {model_name}')
    # execute the experiment on gpu1
    task_gpu = Task.init(project_name=project_name, task_name=f'gpu_Remote_execution_{model_name},{now}')
    task_gpu.execute_remotely(queue_name=gpu_queue_name, clone=True, exit_process=False)
    pprint(f'running the model: {model_name} over gpu')
    subprocess.run(['bash', '/workdisk/ydagan/clearml_infra/run_val_script_gpu.sh', model_name])
    task_gpu.close()
    # execute the experiment on gpu2
    task_gpu2 = Task.init(project_name=project_name, task_name=f'gpu2_Remote_execution_{model_name},{now}')
    task_gpu2.execute_remotely(queue_name=gpu2_queue_name, clone=True, exit_process=False)
    pprint(f'running the model: {model_name} over gpu2')
    subprocess.run(['docker','exec', '--privileged', '-it', 'torch_docker', 'bash', '/workdisk/ydagan/clearml_infra/run_val_script_gpu2.sh', model_name])
    task_gpu2.close()
# TODO: get results of the two exps and compare
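
To make the goal concrete, here is a minimal sketch of the run matrix I expect to end up with: one entry per (model, machine) pair, so 10 models x 2 queues = 20 runs. It is plain Python and does not touch ClearML. The queue names and the task-name pattern are taken from my script; `build_run_plan`, the `QUEUES` mapping, the placeholder model names, and the fixed timestamp are hypothetical and only for illustration. My assumption (unverified) is that each entry could then be turned into its own task and enqueued, for example with something like `Task.create` followed by `Task.enqueue`, instead of calling `Task.init` repeatedly in one process.

```python
from itertools import product

# Queue names come from the script above; the keys are the machine
# prefixes already used in the task names there.
QUEUES = {"gpu": "gpu_queue", "gpu2": "gpu2_queue"}


def build_run_plan(model_names, queues, timestamp):
    """Return one entry per (model, machine) pair."""
    plan = []
    for model_name, (machine, queue_name) in product(model_names, queues.items()):
        plan.append({
            "model_name": model_name,
            "queue_name": queue_name,
            # Same naming pattern as in the script above.
            "task_name": f"{machine}_Remote_execution_{model_name},{timestamp}",
        })
    return plan


if __name__ == "__main__":
    # Placeholder names (hypothetical); the real list comes from timm.list_models().
    models = [f"model_{i}" for i in range(10)]
    # Fixed example timestamp (hypothetical); the script uses the current time.
    plan = build_run_plan(models, QUEUES, "2024-01-01_12_00_00")
    print(len(plan))
    for entry in plan[:2]:
        print(entry)
```

What I'm after is that every entry in this plan shows up as a separate task in the project, with half of them in each queue.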