Answered

Hi all!
I'm struggling with a specific scenario, maybe you could help. I have 2 machine types (for example titan/a100) and 10 types of models with generic code (I pass the model name as an arg). I want to create a task for each of the 20 runs, so that I can later compare each machine's results per tested model.
I have implemented a loop that runs over the model names. Inside it I create a task for each machine type and run the tasks remotely, each on a separate queue. I have an active agent on both machines. I want to enqueue the 10 experiments in each queue and let the agents do their job. But it doesn't seem to work as expected: only 2 tasks were created. I'm starting to think this is not the right way to do it.

import timm
from pprint import pprint
import clearml
from clearml import Task
import subprocess 
from datetime import date, datetime

now = str(date.today()) + '_' + str(str(datetime.now().hour) + '_' + str(datetime.now().minute) + '_' + str(datetime.now().second))
project_name = 'timm_repos_runner'
gpu_queue_name = 'gpu_queue'
gpu2_queue_name = 'gpu2_queue'
model_names = timm.list_models(pretrained=True)
pprint(model_names[1:10])

# for loop over all models (pretrained = true)
for model_name in model_names[1:10]:
    pprint(f'running the model: {model_name}')
    # execute the experiment on gpu1
    task_gpu = Task.init(project_name=project_name, task_name=f'gpu_Remote_execution_{model_name},{now}')
    task_gpu.execute_remotely(queue_name=gpu_queue_name, clone=True, exit_process=False)
    pprint(f'running the model: {model_name} over gpu')
    subprocess.run(['bash', '/workdisk/ydagan/clearml_infra/run_val_script_gpu.sh', model_name])
    task_gpu.close()
    
    # execute the experiment on gpu2
    task_gpu2 = Task.init(project_name=project_name, task_name=f'gpu2_Remote_execution_{model_name},{now}')
    task_gpu2.execute_remotely(queue_name=gpu2_queue_name, clone=True, exit_process=False)
    
    pprint(f'running the model: {model_name} over gpu2')
    subprocess.run(['docker','exec', '--privileged', '-it', 'torch_docker', 'bash', '/workdisk/ydagan/clearml_infra/run_val_script_gpu2.sh', model_name])
    task_gpu2.close()
    
# TODO: get results of the two exps and compare
  
  
Posted 12 months ago

Answers


Hi @ObliviousClams17, I think for your specific use case it would be easiest to use the API: fetch a task, clone it as many times as needed, and enqueue the clones into the relevant queues.

Fetch a task
Clone a task
Enqueue a task (or many)

If you open developer tools (F12) in the ClearML UI, you will be able to see example usages of all these calls.
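
For illustration, here is a minimal sketch of the fetch / clone / enqueue flow using the ClearML Python SDK rather than raw REST calls. It assumes a template task already exists (for example, one earlier run of the generic script) and that the script reads the model name through argparse, so it shows up as the parameter `Args/model_name`. The template task ID, the parameter key, and the helper names `build_run_plan` and `submit_runs` are all hypothetical and would need adjusting to your setup.

```python
from itertools import product


def build_run_plan(model_names, queue_names):
    """Return one entry per (model, queue) pair, e.g. 10 models x 2 queues = 20 runs."""
    return [
        {"model": model, "queue": queue, "name": f"{queue}_{model}"}
        for model, queue in product(model_names, queue_names)
    ]


def submit_runs(template_task_id, plan, arg_key="Args/model_name"):
    """Clone the template task once per planned run and enqueue each clone.

    `arg_key` is an assumption: it is the key under which an argparse
    argument called `model_name` would normally be stored.
    """
    # Imported lazily so the planning helper can be used without clearml installed.
    from clearml import Task

    template = Task.get_task(task_id=template_task_id)  # fetch
    for run in plan:
        cloned = Task.clone(source_task=template, name=run["name"])  # clone
        cloned.set_parameter(arg_key, run["model"])  # override the model name
        Task.enqueue(cloned, queue_name=run["queue"])  # enqueue


if __name__ == "__main__":
    # Queue names are taken from the question; the model names are placeholders.
    plan = build_run_plan(["resnet18", "resnet34"], ["gpu_queue", "gpu2_queue"])
    for run in plan:
        print(run)
    # submit_runs("<template-task-id>", plan)  # requires a configured ClearML server
```

Because each clone is a separate draft task, nothing runs on the submitting machine; the agents listening on each queue pick the clones up one by one, which should give one task per model per machine type.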

  
  
Posted 12 months ago