We have some other parts, and in some cases the initialization time can be about 10 times the experiment time.
Before I dive into some agent-in-agent hacking, I would consider "caching" this preprocessing on an auxiliary Task as an artifact. Basically, add another argument that points to the auxiliary Task, and fetch the data from it (obviously you will need to run the auxiliary Task once before the optimizer launches the first experiment).
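A minimal sketch of the caching idea, assuming the preprocessing output can be pickled and that each experiment receives the auxiliary Task's ID through a hypothetical `aux_task_id` parameter. The names `publish_preprocessing`, `load_preprocessing` and `preprocessed_data` are made up for illustration; `upload_artifact`, `Task.get_task` and `artifacts[...].get()` are the ClearML calls it relies on.

```python
def publish_preprocessing(aux_task, data, artifact_name="preprocessed_data"):
    """Run once, inside the auxiliary Task: store the preprocessing result as an artifact."""
    aux_task.upload_artifact(name=artifact_name, artifact_object=data)
    return aux_task.id


def load_preprocessing(aux_task, artifact_name="preprocessed_data"):
    """Run in every experiment: fetch the cached result instead of recomputing it."""
    return aux_task.artifacts[artifact_name].get()


def main():
    # Imported here so the helpers above can be tried without a ClearML server.
    from clearml import Task

    task = Task.init(project_name="project", task_name="experiment")
    # `aux_task_id` is the extra (hypothetical) argument; replace the placeholder
    # with the ID of the auxiliary Task that already ran the preprocessing.
    params = {"aux_task_id": "<auxiliary task id>"}
    task.connect(params)

    aux_task = Task.get_task(task_id=params["aux_task_id"])
    data = load_preprocessing(aux_task)
    # ... run the experiment on `data` ...
    return data
```

The experiment script would call `main()`; the optimizer can leave `aux_task_id` fixed while it varies the other hyperparameters.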
Now that that is out of the way (it really would be the preferred engineering solution) 🙂
This sounds like it can work. We are talking about something like:
Exactly!
In order to do that we have a new "agent-Task" that we manually enqueue (this controls the number of machines that will be running the code). You can see below an "agent-Task" pulling Tasks from the "default" queue and spawning them as subprocesses (one process per agent-Task). Note that I have not been able to fully test the code, but you can run it manually and verify it actually works 🙂 (btw: no need for the LocalClearmlJob; from the optimizer's perspective it just launches jobs on the "default" queue)
Let me know if it works 🤞
```
import os
import subprocess
import sys
import time

from clearml import Task
from clearml.backend_api.session.client import APIClient


def spawn_sub_task(task):
    # Launch the Task's script as a subprocess of this agent-Task.
    # The entry point is stored under `script.entry_point` on the Task object.
    cmd = task.data.script.entry_point
    python = sys.executable
    env = dict(**os.environ)
    # Make the child process attach to the existing Task instead of creating a new one.
    env['CLEARML_TASK_ID'] = env['TRAINS_TASK_ID'] = task.id
    # Environment values must be strings, otherwise Popen raises a TypeError.
    env['CLEARML_LOG_TASK_TO_BACKEND'] = '1'
    env['CLEARML_SIMULATE_REMOTE_TASK'] = '1'
    p = subprocess.Popen(args=[python, cmd], cwd=os.getcwd(), env=env)
    # Block until the experiment finishes (one process per agent-Task).
    p.wait()
    return True


task = Task.init('project', 'agent task')
params = {'queue_name': 'default'}
task.connect(params)

c = APIClient()
queue_id = c.queues.get_all(name=params['queue_name'])[0].id

while True:
    # Pop the next Task from the queue; sleep and retry if the queue is empty.
    result = c.queues.get_next_task(queue=queue_id)
    if not result or not result.entry:
        time.sleep(5)
        continue
    run_task = Task.get_task(task_id=result.entry.task)
    spawn_sub_task(run_task)
```
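Since the loop has not been fully tested, here is a rough sketch of the same polling logic with the queue call and the subprocess launch passed in as plain callables, so it can be dry-run without a server. `drain_queue`, `max_idle_polls` and both callables are hypothetical names, and stopping after a few empty polls is an addition for testing; the original loop runs forever.

```python
import time


def drain_queue(get_next_task_id, run_task, poll_interval=5.0, max_idle_polls=3):
    """Pull task IDs until the queue stays empty for `max_idle_polls` polls in a row.

    get_next_task_id: callable returning the next task ID, or None if the queue is empty.
    run_task: callable that executes one task ID and blocks until it finishes.
    Returns the list of task IDs that were executed, in order.
    """
    executed = []
    idle_polls = 0
    while idle_polls < max_idle_polls:
        task_id = get_next_task_id()
        if task_id is None:
            # Nothing queued: count the empty poll, wait, and try again.
            idle_polls += 1
            time.sleep(poll_interval)
            continue
        idle_polls = 0
        run_task(task_id)
        executed.append(task_id)
    return executed


# Dry run with an in-memory list standing in for the "default" queue.
pending = ["task-a", "task-b"]
done = drain_queue(
    get_next_task_id=lambda: pending.pop(0) if pending else None,
    run_task=lambda task_id: None,
    poll_interval=0.0,
    max_idle_polls=2,
)
```

Against a real server, `get_next_task_id` would wrap `c.queues.get_next_task(queue=queue_id)` and return `result.entry.task` when an entry is present, and `run_task` would call `spawn_sub_task(Task.get_task(task_id=task_id))`.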