What do you mean by "pull and report multiple trials" ? Spawn multiple processes with different parameters ?
Let's say you are doing Bayesian sampling of some parameter with your optimizer. That means the next sample is a function of the previous samples, and all of this is contained in the optimizer state (for the Optuna optimizer, in the study object). So to run an optimization the way the example describes, whatever communicates with the optimizer task needs a synced view of the optimizer state.
Pull: accessing a sample from the optimizer (a point in the hyperparameter space) in an exclusive way (other machines won't run it again)
Report: push the result in such a way that it is registered with the optimizer, for example for the Bayesian sampling
Multiple trials: the same Python script runs more than one trial without restarting
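To make the three terms concrete, here is a minimal in-process sketch. `SharedOptimizer`, `pull`, `report` and `worker` are hypothetical names invented for this illustration; they are not ClearML or Optuna APIs. The sampler is a crude explore/exploit rule rather than real Bayesian sampling, and threads stand in for machines.

```python
import random
import threading


class SharedOptimizer:
    """Toy stand-in for a shared optimizer state (hypothetical, not a ClearML or Optuna class)."""

    def __init__(self, low, high, seed=0):
        self._lock = threading.Lock()
        self._rng = random.Random(seed)
        self._low, self._high = low, high
        self._next_id = 0
        self._pending = {}   # trial_id -> sampled value, claimed but not yet reported
        self._results = []   # (value, loss) pairs that later samples depend on

    def pull(self):
        """Hand out a new sample exclusively: each trial_id is issued exactly once."""
        with self._lock:
            if self._results and self._rng.random() < 0.5:
                # exploit: perturb the best point reported so far
                best, _ = min(self._results, key=lambda r: r[1])
                width = 0.1 * (self._high - self._low)
                value = best + self._rng.uniform(-width, width)
                value = min(self._high, max(self._low, value))
            else:
                # explore: uniform sample over the whole range
                value = self._rng.uniform(self._low, self._high)
            trial_id = self._next_id
            self._next_id += 1
            self._pending[trial_id] = value
            return trial_id, value

    def report(self, trial_id, loss):
        """Register a result so that the next samples are aware of it."""
        with self._lock:
            value = self._pending.pop(trial_id)
            self._results.append((value, loss))

    def best(self):
        with self._lock:
            return min(self._results, key=lambda r: r[1])


def worker(optimizer, n_trials):
    # the heavy one-time setup would go here, outside the loop
    for _ in range(n_trials):
        trial_id, lr = optimizer.pull()
        loss = (lr - 0.3) ** 2   # stand-in for the real experiment
        optimizer.report(trial_id, loss)


optimizer = SharedOptimizer(low=0.01, high=1.0)
threads = [threading.Thread(target=worker, args=(optimizer, 20)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
best_lr, best_loss = optimizer.best()
```

The lock is what makes a pull exclusive and keeps the state synced. Across real machines that role would have to be played by some shared backend, which is exactly the missing piece being asked about.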
In terms of bottleneck considerations, the ClearML agent setup is a relatively small portion of the run initialization. We have some other parts, and in some cases the initialization time can be about 10 times the experiment time,
so scaling this overhead cost, we are effectively losing (10 x #machines)X in performance for some HPO studies we are running
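A quick back-of-the-envelope check of that overhead for a single machine, using made-up numbers (1 time unit per trial, 10 units of setup, 100 trials). `total_time` is a helper written for this illustration only.

```python
def total_time(n_trials, setup, run, persistent):
    """Wall-clock time for one machine running n_trials one after another."""
    if persistent:
        # setup is paid once, then only the experiment itself per trial
        return setup + n_trials * run
    # setup is paid again for every trial
    return n_trials * (setup + run)


run = 1.0          # hypothetical: one trial takes 1 time unit
setup = 10 * run   # initialization is about 10x the experiment time
n_trials = 100

restarting = total_time(n_trials, setup, run, persistent=False)
persistent = total_time(n_trials, setup, run, persistent=True)
speedup = restarting / persistent
print(restarting, persistent, speedup)  # 1100.0 110.0 10.0
```

With these numbers a persistent script is 10x faster per machine, and the ratio approaches 11x as the number of trials grows.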
It does, I am familiar with it I used it many times
But it does make me think, if instead of changing the optimizer I launch a few workers that "pull" enqueued tasks, and then report values for them in such a way that the optimizer is triggered to collect the results? would it be possible?
Let's say I inherit from the Optimizer (you mean the HyperParameterOptimizer class? or SearchStrategy?) and implement custom logic for experiment creation.
What does it actually expose? Creating an experiment means defining a task, enqueuing it, and then? I am trying to understand what you meant I could put in that logic to get the desired effect
The solution you suggested works for the single machine case. The missing part is being able to access and "claim" spawned trials (samples in the HP plane) from multiple machines
So I can avoid running unnecessary common heavy setup, for a light weight experiment
The difference is that I want a single persistent machine, with a single persistent python script that can pull execute and report multiple tasks
This might work (I have to admit I haven't had the time to test, please let me know if it works, so we could push it as a cool new feature 🙂 )
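` import os
import subprocess
import sys

from clearml.automation import HyperParameterOptimizer
from clearml.automation.job import ClearmlJob


class LocalClearmlJob(ClearmlJob):
    def __init__(self, *args, **kwargs):
        super(LocalClearmlJob, self).__init__(*args, **kwargs)

    def launch(self, queue_name=None):
        # type: (str) -> bool
        if self._is_cached_task:
            return False
        # create the subprocess, running the Task's entry point with the current interpreter
        cmd = self.task.data.execution.script.entrypoint
        python = sys.executable
        env = dict(**os.environ)
        # tell the ClearML SDK in the child process which Task it is executing
        env['CLEARML_TASK_ID'] = env['TRAINS_TASK_ID'] = self.task.id
        # environment variable values must be strings, not ints
        env['CLEARML_LOG_TASK_TO_BACKEND'] = '1'
        env['CLEARML_SIMULATE_REMOTE_TASK'] = '1'
        p = subprocess.Popen(args=[python, cmd], cwd=os.getcwd(), env=env)
        return True


# the "..." stands for the remaining optimizer arguments, which are not shown here
an_optimizer = HyperParameterOptimizer(max_number_of_concurrent_tasks=1, ...)
an_optimizer.set_default_job_class(LocalClearmlJob)
an_optimizer.start()
an_optimizer.wait()
an_optimizer.stop() `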
This code will spin up a subprocess running the original Task as if it were executed by the agent, only locally.
This means the optimizer can control the parameters, and you are running all jobs locally.
wdyt?
Thanks Martin! I'll test it in the following days, I'll keep you updated!
But it does make me think, if instead of changing the optimizer I launch a few workers that "pull" enqueued tasks, and then report values for them in such a way that the optimizer is triggered to collect the results? would it be possible?
But this is exactly how the optimizer works.
Regardless of the optimizer (OptimizerOptuna or OptimizerBOHB) both set the next step based on the scalars reported by the tasks executed by agents (on remote machines), then decide on the next set of parameters in a Bayesian manner. What am I missing here ?
this way I can avoid the heavy computation I describe above for each individual trial
the unclear part is how do I sample another point in the optimization space from the optimizer
Just so I'm clear on the issue, you want multiple machines to access the internals of the optimizer class ? or Do you just want a way to understand what is the optimizer sampling space (i.e. the parameters and options per parameter) ?
if I can't "pull", execute, report tasks from the same persistent python script it doesn't solve the problem of avoiding rerunning some heavy setup for a lightweight trial
I was hoping for something that I can scale
I want a manual way to access a global optimizer from multiple machines, it can be an agent, however the critical part is that machine will be able to pull and report multiple trials without restarting
however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work
I have to admit, this is where I'm losing you.
I thought you wanted to avoid the agent, since you wanted to run everything locally, wasn't that the issue ?
Maybe there is some background missing here, let me see if I can explain how the optimizer works.
1. In your actual training code you have something like:
` params = {'lr': 0.3, 'key': 'option1'}
task.connect(params)
...
Logger.report_scalar(title='loss', series='l1', value=..., iteration=...) `
The values could also come from argparse, but the concept is the same. The same goes for TensorBoard reporting instead of report_scalar.
2. When running the optimizer you have to provide two things:
a. The scalar we are trying to optimize. In this example title='loss', series='l1'
b. The arguments we will change and the sampling range. For example General/lr [0.01, 1.0, 0.02]
3. The optimizer (assuming an active one and not random/grid), Optuna for example, will sample new General/lr values based on the reported title='loss', series='l1' of the training code for us.
This is done automagically! Meaning:
The optimizer clones a Task and changes the configuration/hyper-parameters (the effect is that task.connect, when executed by the agent, is no longer storing the dict but updating the dict from the backend). Then the optimizer launches the Task and actively, in real time, pulls the scalars your training code reports (via the logger or TB). Finally, the optimizer can shut down the training on the remote machine automatically and launch a new one.
Make sense ?
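A toy model of the task.connect behaviour described above may help. The `connect` function below is a hypothetical helper written for this illustration, not the ClearML API. It only mimics the two modes: a manual run keeps the defaults from the code, and a run launched from a cloned Task gets the values the optimizer stored in the backend.

```python
def connect(params, backend_overrides=None):
    """Toy model of the two modes (hypothetical helper, not the ClearML API).

    Manual run: backend_overrides is None and the dict is used as written in the code
    (the real library would also store it on the Task).
    Agent run: values stored on the cloned Task replace the defaults in place.
    """
    if backend_overrides:
        for key, value in backend_overrides.items():
            if key in params:
                params[key] = value
    return params


# manual run: the defaults from the code are used
local = connect({'lr': 0.3, 'key': 'option1'})

# cloned Task launched by the optimizer with a new sample for lr
remote = connect({'lr': 0.3, 'key': 'option1'}, backend_overrides={'lr': 0.05})

print(local)   # {'lr': 0.3, 'key': 'option1'}
print(remote)  # {'lr': 0.05, 'key': 'option1'}
```

The training code itself does not change between the two runs, which is why the optimizer can drive it without any knowledge of its internals.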
Another option is to pull Tasks from a dedicated queue and use the LocalClearMLJob to spawn them
This sounds like it can work. we are talking about something like:
` #<Machine 1>
#Init Optimizer with some dedicated queue

<Machine 2>
**heavy one time Common Initialization**
while True:
    # sample queue
    # enqueue with LocalClearMLJob
    # Execute Something
    # report results

<Machine i>
**heavy one time Common Initialization**
while True:
    # sample **same** queue
    # enqueue with LocalClearMLJob
    # Execute Something
    # report results `
?
if so can you share a small snippet of `# enqueue with LocalClearMLJob`
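To pin down the pattern in that pseudocode (this is not the ClearML snippet asked for, just an in-process illustration), here is a sketch where threads stand in for machines and a standard-library `queue.Queue` stands in for the dedicated ClearML queue. `heavy_setup`, `run_trial` and `worker` are hypothetical names.

```python
import queue
import threading


def heavy_setup():
    # stand-in for the expensive one-time common initialization
    return {'model': 'loaded'}


def run_trial(context, params):
    # stand-in for one lightweight experiment
    return (params['lr'] - 0.3) ** 2


def worker(name, task_queue, results, setup_calls):
    context = heavy_setup()   # paid once per worker, not once per trial
    setup_calls.append(name)
    while True:
        try:
            # popping from the queue is exclusive: no other worker gets this entry
            task_id, params = task_queue.get_nowait()
        except queue.Empty:
            return            # toy: stop when empty; a real worker would sleep and poll
        results[task_id] = (name, run_trial(context, params))


task_queue = queue.Queue()
for i in range(10):
    task_queue.put((i, {'lr': 0.1 * (i + 1)}))

results, setup_calls = {}, []
workers = [
    threading.Thread(target=worker, args=(f'machine-{n}', task_queue, results, setup_calls))
    for n in (2, 3)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Ten trials are executed exactly once each, while the heavy setup runs only twice, once per worker.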
thanks AgitatedDove14 , I will be happy to test it, however I didn't understand it fully.
I can see how it works in the single machine case, however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work
something like in the snippet I shared above
AgitatedDove14 , I want multiple machines to access the synced state of the optimizer. which is part of the internals of the optimizer... and then report the results back to the optimizer such that the study object of the optimizer keeps track of the results and the next sample will be aware of all previous studies
the optimizer such that the study object of the optimizer keeps track of the results and the next sample will be aware of all previous studies
This is done from the optimizer side, by sampling the scalars reported by any experiment the optimizer created.
I am looking for a way to manually sample and report from and to the optimizer...
.. I can avoid running unnecessary common heavy setup, for a light weight experiment
Maybe it makes sense to inherit from the Optimizer and add some logic into the creation of a new experiment ? wouldn't that be easier (not saying we cannot store the internal state of the optimizer on an artifact for example, just wondering what would be the best option here). wdyt ?
Okay Now I get it!
Let me think about it for an hour or two 😄
it doesn't even need to be a sub process at this point.. it can be serial execution
to put it a bit differently, I am looking for a way to manually sample and report from and to the optimizer
something like in the example I shared
` <Machine 1>
#Init Optimizer

<Machine 2>
**heavy one time Common Initialization**
while True:
    #sample Optimizer
    # init task
    # Execute Something
    # report results

<Machine i>
**heavy one time Common Initialization**
while True:
    #sample **same** Optimizer
    # init task
    # Execute Something
    # report results `
let me try to explain myself again
we have some other parts, and in some cases the initialization time can be about 10 times the experiment time
Before I dive into some agent in agent hacking, I would consider "caching" this preprocessing on an auxiliary Task as an artifact. Basically add another argument for the auxiliary Task, and fetch the data from it (obviously you will need to run it once before the optimizer launches the first experiment).
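As a rough illustration of the caching idea, here is a local sketch in which a pickle file stands in for the artifact stored on the auxiliary Task. `load_or_build` and `expensive_preprocessing` are hypothetical names; with ClearML the store and fetch steps would go through the auxiliary Task's artifacts rather than a local file.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if present, otherwise build it once and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = build()
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data


calls = []


def expensive_preprocessing():
    calls.append(1)   # count how often the heavy step really runs
    return {'features': [1, 2, 3]}


cache_path = os.path.join(tempfile.mkdtemp(), 'preprocessed.pkl')
first = load_or_build(cache_path, expensive_preprocessing)    # runs the heavy step
second = load_or_build(cache_path, expensive_preprocessing)   # served from the cache
```

This only helps when the expensive part can be serialized. Setup that has to live in process memory (a loaded model on a GPU, for example) still needs the persistent-script approach.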
Now that is out of the way (which really would be the preferred engineering solution) 🙂
This sounds like it can work. we are talking about something like:
Exactly!
In order to do that we have a new "agent-Task" that we manually enqueue (this controls the number of machines that will be running the code). You can see below an "agent-Task" pulling Tasks from the "default" queue and spawning them as subprocesses (one process per agent-task). Notice I have not been able to fully test the code, but you can run it manually and verify it actually works 🙂 (btw: no need for the LocalClearmlJob, from the optimizer perspective it just launches jobs on the "default" queue)
Let me know it works 🤞
` import os
import subprocess
import sys
import time

from clearml import Task
from clearml.backend_api.session.client import APIClient


def spawn_sub_task(task):
    # create the subprocess, running the Task's entry point with the current interpreter
    cmd = task.data.execution.script.entrypoint
    python = sys.executable
    env = dict(**os.environ)
    # tell the ClearML SDK in the child process which Task it is executing
    env['CLEARML_TASK_ID'] = env['TRAINS_TASK_ID'] = task.id
    # environment variable values must be strings, not ints
    env['CLEARML_LOG_TASK_TO_BACKEND'] = '1'
    env['CLEARML_SIMULATE_REMOTE_TASK'] = '1'
    p = subprocess.Popen(args=[python, cmd], cwd=os.getcwd(), env=env)
    # block until the trial finishes (one process per agent-task)
    p.wait()
    return True


task = Task.init('project', 'agent task')
params = {'queue_name': 'default'}
task.connect(params)
c = APIClient()
queue_id = c.queues.get_all(name=params['queue_name'])[0].id
while True:
    # pop the next Task from the queue; sleep and retry if the queue is empty
    result = c.queues.get_next_task(queue=queue_id)
    if not result or not result.entry:
        time.sleep(5)
        continue
    run_task = Task.get_task(task_id=result.entry.task)
    spawn_sub_task(run_task) `
that machine will be able to pull and report multiple trials without restarting
What do you mean by "pull and report multiple trials" ? Spawn multiple processes with different parameters ?
If this is the case: the internals of the optimizer could be synced to the Task so you can access them, but this is basically the internal representation, which is optimizer dependent, which one did you have in mind?
Another option is to pull Tasks from a dedicated queue and use the LocalClearMLJob to spawn them
(think another script in the same repository, just launching them, then the script is the Task we enqueue, this is actually an agent inside an agent).
Now going back to the initial problem we are trying to solve:
... without restarting
How long are those trials that restarting becomes a bottleneck?
(Notice that the git repo is cached, python packages are cached, and I would also recommend turning on full venv cache. This ends up at about 10 sec to spin up a Task, not very long, I think...)
https://github.com/allegroai/clearml-agent/blob/351f0657c3dcf707659875d7e0a52fa387709978/docs/clearml.conf#L104
The difference is that I want a single persistent machine, with a single persistent python script that can pull execute and report multiple tasks
So basically, instead of using the agent, you simply spin up a subprocess?