Answered

Hi all, I am trying to run a somewhat custom HPO scheme with ClearML.
I would like a single running Python script to be able to sample the optimizer, init a task, and report the result, multiple times. I didn't find anything similar in the docs or in the channel history. Any help will be appreciated, thanks!

  
  
Posted 3 years ago

Answers 30


The solution you suggested works for the single machine case. The missing part is being able to access and "claim" spawned trials (samples in the HP plane) from multiple machines.

  
  
Posted 3 years ago

that machine will be able to pull and report multiple trials without restarting

What do you mean by "pull and report multiple trials"? Spawn multiple processes with different parameters?
If this is the case: the internals of the optimizer could be synced to the Task so you can access them, but this is basically the internal representation, which is optimizer dependent. Which one did you have in mind?
Another option is to pull Tasks from a dedicated queue and use the LocalClearMLJob to spawn them
(think another script in the same repository, just launching them; that script is then the Task we enqueue. This is actually an agent inside an agent).
Now going back to the initial problem we are trying to solve:

... without restarting

How long are those trials, that restarting becomes a bottleneck?
(Notice that the git repo is cached, Python packages are cached, and I would also recommend turning on the full venv cache. This ends up at about 10 sec to spin a Task, not very long, I think...)
https://github.com/allegroai/clearml-agent/blob/351f0657c3dcf707659875d7e0a52fa387709978/docs/clearml.conf#L104

  
  
Posted 3 years ago

to put it a bit differently, I am looking for a way to manually sample and report from and to the optimizer

  
  
Posted 3 years ago

it doesn't even need to be a sub process at this point.. it can be serial execution

  
  
Posted 3 years ago

if I can't "pull", execute, report tasks from the same persistent python script it doesn't solve the problem of avoiding rerunning some heavy setup for a lightweight trial

  
  
Posted 3 years ago

this way I can avoid the heavy computation I describe above for each individual trial

  
  
Posted 3 years ago

however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work

I have to admit, this is where I'm losing you.
I thought you wanted to avoid the agent, since you wanted to run everything locally, wasn't that the issue ?
Maybe there is some background missing here, let me see if I can explain how the optimizer works.
1. In your actual training code you have something like:
` params = {'lr': 0.3, 'key': 'option1'}
task.connect(params)
...
Logger.report_scalar(title='loss', series='l1', value=...) `
The values could also be coming from argparser, but the concept is the same. Or TB reporting instead of using report_scalar.
2. When running the optimizer you have to provide two things:
a. The scalar we are trying to optimize. In this example title='loss', series='l1'
b. The arguments we will change and the sampling range. For example General/lr [0.01, 1.0, 0.02]
3. The optimizer (assuming an active one and not random/grid), Optuna for example, will sample new General/lr values for us based on the title='loss', series='l1' scalar reported by the training code.
This is done automagically! Meaning:
The optimizer clones a Task and changes the configuration/hyper-parameters (the effect is that task.connect, when executed by the agent, is now not storing the dict but updating the dict from the backend). Then the optimizer launches the Task and actively, in real time, pulls the scalars your training code reports (via the logger or TB). Finally the optimizer can shut down the training on the remote machine automatically and launch a new one.
Make sense ?
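
To make the cycle above concrete, here is a toy, self-contained sketch of the clone / override / launch / read-scalar loop. It does not use ClearML at all: `train`, `run_optimizer`, and the plain random sampling are hypothetical stand-ins for the training Task, the optimizer, and an active sampler such as Optuna, so treat it as an illustration of the flow rather than how the optimizer is actually implemented.

```python
import random


def train(params):
    """Stand-in for the training Task: returns the scalar it would report as loss/l1."""
    # Toy objective whose minimum is at lr = 0.3
    return (params["lr"] - 0.3) ** 2


def run_optimizer(base_params, lr_range, n_trials, seed=0):
    """Toy optimizer loop: copy the base config, override 'lr', run, read the scalar."""
    rng = random.Random(seed)
    low, high = lr_range
    history = []
    for _ in range(n_trials):
        trial_params = dict(base_params)              # "clone" the base Task config
        trial_params["lr"] = rng.uniform(low, high)   # override the hyper-parameter
        loss = train(trial_params)                    # "launch" and collect the scalar
        history.append((trial_params["lr"], loss))
    best_lr, best_loss = min(history, key=lambda item: item[1])
    return best_lr, best_loss, history


best_lr, best_loss, history = run_optimizer(
    {"lr": 0.3, "key": "option1"}, lr_range=(0.01, 1.0), n_trials=20
)
print(f"best lr={best_lr:.3f} loss={best_loss:.5f}")
```

A Bayesian sampler would replace the `rng.uniform` call with something that looks at `history` before proposing the next value; the rest of the loop stays the same.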

  
  
Posted 3 years ago

something like in the snippet I shared above

  
  
Posted 3 years ago

something like in the example I shared
<Machine 1>
# Init Optimizer

<Machine 2>
**heavy one time Common Initialization**
while True:
    # sample Optimizer
    # init task
    # Execute Something
    # report results

<Machine i>
**heavy one time Common Initialization**
while True:
    # sample **same** Optimizer
    # init task
    # Execute Something
    # report results

  
  
Posted 3 years ago

But it does make me think, if instead of changing the optimizer I launch a few workers that "pull" enqueued tasks, and then report values for them in such a way that the optimizer is triggered to collect the results? would it be possible?

  
  
Posted 3 years ago

What do you mean by "pull and report multiple trials" ? Spawn multiple processes with different parameters ?

Let's say you are doing Bayesian sampling of some parameter with your optimizer. That means the next sample will be a function of the previous samples, and all of this is contained in the optimizer state (in the Optuna optimizer case, in the study object). So to run an optimization in the way described in the example, whatever communicates with the optimizer task needs a synced state of the optimizer.
Pull : accessing a sample from the optimizer (a point in the hyperparameter plane) in an exclusive way (other machines won't run it again)
Report : push the result in such a way that it would be registered, for the Bayesian sampling for example
Multiple Trials : The same Python script runs more than one trial without restarting
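
A minimal sketch of these three terms, under loud assumptions: `TrialBoard` and `worker` are hypothetical names, threads stand in for machines, and a fixed list of samples stands in for the optimizer's sampler. Across real machines the shared state would have to live in a server or database rather than in process memory.

```python
import threading


class TrialBoard:
    """Hypothetical shared state: hands out each sample once and collects results."""

    def __init__(self, samples):
        self._lock = threading.Lock()
        self._pending = list(samples)
        self.results = {}

    def pull(self):
        """Claim the next sample exclusively; return None when nothing is left."""
        with self._lock:
            return self._pending.pop(0) if self._pending else None

    def report(self, sample, value):
        """Register a result so a sampler could condition on it later."""
        with self._lock:
            self.results[sample] = value


def worker(board, setup_counter):
    # Heavy one-time initialization: done once per worker, not once per trial
    with setup_counter["lock"]:
        setup_counter["count"] += 1
    # Multiple trials in the same process, without restarting
    while True:
        lr = board.pull()
        if lr is None:
            break
        board.report(lr, (lr - 0.3) ** 2)  # toy objective


board = TrialBoard([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
setup_counter = {"lock": threading.Lock(), "count": 0}
threads = [threading.Thread(target=worker, args=(board, setup_counter)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(board.results), "trials reported after", setup_counter["count"], "setups")
```

The lock is what makes `pull` exclusive: no two workers can claim the same sample, which is the property being asked for.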

In terms of the bottleneck considerations, the ClearML agent setup is a relatively small portion of the run initialization. We have some other parts, and in some cases the initialization time can be about 10 times the experiment time.

So when scaling this overhead cost, we are effectively losing (10 x #machines)X in performance for some HPO studies we are running.
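
To put rough numbers on this, here is a small back-of-the-envelope sketch. The function name `total_time` and all the figures (100 trials, 4 machines, 10 time units of initialization per 1 unit of experiment) are illustrative assumptions, not measurements from our setup.

```python
def total_time(n_trials, n_machines, init_time, trial_time, persistent):
    """Wall-clock time, assuming trials are split evenly across machines."""
    trials_per_machine = -(-n_trials // n_machines)  # ceiling division
    if persistent:
        # One initialization per machine, then trials run back to back
        return init_time + trials_per_machine * trial_time
    # Initialization is repeated before every single trial
    return trials_per_machine * (init_time + trial_time)


restart = total_time(100, 4, init_time=10.0, trial_time=1.0, persistent=False)
persist = total_time(100, 4, init_time=10.0, trial_time=1.0, persistent=True)
print(f"restart per trial: {restart}, persistent worker: {persist}")
```

With these assumed numbers, each machine spends 25 x 11 time units when it restarts per trial versus 10 + 25 when it stays alive, and the gap widens as the number of trials per machine grows.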

  
  
Posted 3 years ago

I was hoping for something that I can scale

  
  
Posted 3 years ago

But it does make me think, if instead of changing the optimizer I launch a few workers that "pull" enqueued tasks, and then report values for them in such a way that the optimizer is triggered to collect the results? would it be possible?

But this is exactly how the optimizer works.
Regardless of the optimizer (OptimizerOptuna or OptimizerBOHB), both set the next step based on the scalars reported by the tasks executed by agents (on remote machines), then decide on the next set of parameters in a Bayesian manner. What am I missing here?

  
  
Posted 3 years ago

Okay Now I get it!
Let me think about it for an hour or two 😄

  
  
Posted 3 years ago

the optimizer such that the study object of the optimizer keeps track of the results and the next sample will be aware of all previous studies

This is done from the optimizer side, by sampling the scalars reported by any experiment the optimizer created.

I am looking for a way to manually sample and report from and to the optimizer...
.. I can avoid running unnecessary common heavy setup, for a light weight experiment

Maybe it makes sense to inherit from the Optimizer and add some logic into the creation of a new experiment? Wouldn't that be easier? (Not saying we cannot store the internal state of the optimizer on an artifact, for example; just wondering what would be the best option here.) wdyt?

  
  
Posted 3 years ago

AgitatedDove14 , I want multiple machines to access the synced state of the optimizer. which is part of the internals of the optimizer... and then report the results back to the optimizer such that the study object of the optimizer keeps track of the results and the next sample will be aware of all previous studies

  
  
Posted 3 years ago

we have some other parts, and for some cases we get initialization time can be about 10 times the experiment time

Before I dive into some agent in agent hacking, I would consider "caching" this preprocessing on an auxiliary Task as an artifact. Basically add another argument for the auxiliary Task, and fetch the data from it (obviously you will need to run it once before the optimizer launches the first experiment).
Now that is out of the way (which really would be the preferred engineering solution) 🙂
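
A rough sketch of the caching idea, with assumptions stated up front: `cached_setup` and `heavy_preprocess` are hypothetical names, and a local directory of pickle files stands in for the artifact stored on the auxiliary Task. It only shows the compute-once, fetch-afterwards pattern.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def cached_setup(config, compute, cache_dir):
    """Return (result, cache_hit); run `compute` only if no cached copy exists."""
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f), True
    result = compute(config)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result, False


calls = []


def heavy_preprocess(config):
    """Stand-in for the expensive common initialization."""
    calls.append(config["dataset"])
    return {"vocab_size": 1000, "dataset": config["dataset"]}


with tempfile.TemporaryDirectory() as cache_dir:
    first, first_hit = cached_setup({"dataset": "demo"}, heavy_preprocess, cache_dir)
    second, second_hit = cached_setup({"dataset": "demo"}, heavy_preprocess, cache_dir)

print(f"heavy setup ran {len(calls)} time(s); second call was a cache hit: {second_hit}")
```

Keying the cache on the configuration means a trial that only changes hyper-parameters unrelated to the preprocessing reuses the stored result.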

This sounds like it can work. we are talking about something like:

Exactly!
In order to do that we have a new "agent-Task" that we manually enqueue (this controls the number of machines that will be running the code). You can see below an "agent-Task" pulling Tasks from the "default" queue and spawning them as subprocesses (one process per agent-task). Notice I have not been able to fully test the code, but you can run it manually and verify it actually works 🙂 (btw: no need for the LocalClearmlJob; from the optimizer's perspective it just launches jobs on the "default" queue)
Let me know it works 🤞
` import sys
import os
import subprocess
import time
from clearml.backend_api.session.client import APIClient
from clearml import Task


def spawn_sub_task(task):
    # run the Task's entry-point script in a subprocess, as if an agent launched it
    cmd = task.data.execution.script.entrypoint
    python = sys.executable
    env = dict(**os.environ)
    env['CLEARML_TASK_ID'] = env['TRAINS_TASK_ID'] = task.id
    # environment variable values must be strings
    env['CLEARML_LOG_TASK_TO_BACKEND'] = '1'
    env['CLEARML_SIMULATE_REMOTE_TASK'] = '1'
    p = subprocess.Popen(args=[python, cmd], cwd=os.getcwd(), env=env)
    # block until the trial finishes (one subprocess at a time per agent-task)
    p.wait()
    return True


# the "agent-Task" itself; the queue it listens on is a connected parameter
task = Task.init('project', 'agent task')
params = {'queue_name': 'default'}
task.connect(params)

c = APIClient()
queue_id = c.queues.get_all(name=params['queue_name'])[0].id

while True:
    # pop the next enqueued Task; sleep and retry if the queue is empty
    result = c.queues.get_next_task(queue=queue_id)
    if not result or not result.entry:
        time.sleep(5)
        continue
    run_task = Task.get_task(task_id=result.entry.task)
    spawn_sub_task(run_task) `

  
  
Posted 3 years ago

Thanks Martin! I'll test it in the following days, I'll keep you updated!

  
  
Posted 3 years ago

I want a manual way to access a global optimizer from multiple machines, it can be an agent, however the critical part is that machine will be able to pull and report multiple trials without restarting

  
  
Posted 3 years ago

let me try to explain myself again

  
  
Posted 3 years ago

thanks AgitatedDove14 , I will be happy to test it, however I didn't understand it fully.
I can see how it works in the single machine case, however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work

  
  
Posted 3 years ago

So I can avoid running unnecessary common heavy setup, for a light weight experiment

  
  
Posted 3 years ago

The difference is that I want a single persistent machine, with a single persistent python script that can pull execute and report multiple tasks

So basically, instead of using the agent, simply spin up a subprocess?

  
  
Posted 3 years ago

the unclear part is how do I sample another point in the optimization space from the optimizer

Just so I'm clear on the issue: do you want multiple machines to access the internals of the optimizer class? Or do you just want a way to understand what the optimizer's sampling space is (i.e. the parameters and the options per parameter)?

  
  
Posted 3 years ago

It does. I am familiar with it, I have used it many times.

  
  
Posted 3 years ago

Let's say I inherit from the Optimizer (you mean the HyperParameterOptimizer class? or SearchStrategy?) and implement custom logic for experiment creation.
What does it actually expose? Creating an experiment means defining a task, enqueuing it, and then? I am trying to think what you meant I can put in that logic such that I get the desired effect.

  
  
Posted 3 years ago

Checking that for you

  
  
Posted 3 years ago

Another option is to pull Tasks from a dedicated queue and use the LocalClearMLJob to spawn them

This sounds like it can work. Are we talking about something like the following?
<Machine 1>
# Init Optimizer with some dedicated queue

<Machine 2>
**heavy one time Common Initialization**
while True:
    # sample queue
    # enqueue with LocalClearMLJob
    # Execute Something
    # report results

<Machine i>
**heavy one time Common Initialization**
while True:
    # sample **same** queue
    # enqueue with LocalClearMLJob
    # Execute Something
    # report results

If so, can you share a small snippet of
# enqueue with LocalClearMLJob

  
  
Posted 3 years ago

The difference is that I want a single persistent machine, with a single persistent python script that can pull execute and report multiple tasks

  
  
Posted 3 years ago

This might work (I have to admit I haven't had the time to test, please let me know if it works, so we could push it as a cool new feature 🙂 )
` import os
import subprocess
import sys

from clearml.automation import HyperParameterOptimizer
from clearml.automation.job import ClearmlJob


class LocalClearmlJob(ClearmlJob):
    def __init__(self, *args, **kwargs):
        super(LocalClearmlJob, self).__init__(*args, **kwargs)

    def launch(self, queue_name=None):
        # type: (str) -> bool
        if self._is_cached_task:
            return False

        # create the subprocess instead of enqueuing the Task for an agent
        cmd = self.task.data.execution.script.entrypoint
        python = sys.executable
        env = dict(**os.environ)
        env['CLEARML_TASK_ID'] = env['TRAINS_TASK_ID'] = self.task.id
        # environment variable values must be strings
        env['CLEARML_LOG_TASK_TO_BACKEND'] = '1'
        env['CLEARML_SIMULATE_REMOTE_TASK'] = '1'
        p = subprocess.Popen(args=[python, cmd], cwd=os.getcwd(), env=env)
        return True


# "..." stands for the rest of the optimizer arguments, left out here
an_optimizer = HyperParameterOptimizer(max_number_of_concurrent_tasks=1, ...)
an_optimizer.set_default_job_class(LocalClearmlJob)
an_optimizer.start()
an_optimizer.wait()
an_optimizer.stop() `
This code will spin up a subprocess running the original Task as if it were run by the agent, only locally.
This means the optimizer can control the parameters, and you are running all jobs locally.
wdyt?
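
If you want to sanity check the subprocess part in isolation first, here is a tiny sketch with no ClearML involved. The task id is a made-up placeholder and the child is a one-line script that just echoes the variable back; the point is only that the child sees the overridden environment, and that every value placed in `env` has to be a string.

```python
import os
import subprocess
import sys

# Child process: prints the task id it received through the environment
child_code = "import os; print(os.environ['CLEARML_TASK_ID'])"

env = dict(os.environ)
env["CLEARML_TASK_ID"] = "dummy-task-id"      # placeholder, not a real Task id
env["CLEARML_SIMULATE_REMOTE_TASK"] = "1"     # values must be str, not int

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
received = completed.stdout.strip()
print("child saw task id:", received)
```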

  
  
Posted 3 years ago