Unfortunately, no, I can't paste the whole code. In a nutshell, the trigger spawns a new GCE instance with a clearml-agent
running to schedule the experiments in Cloud.
This is an excerpt:
# imports needed by this excerpt; the helper functions (extract_config, ensure_queue,
# name_generator, create_gpus, create_from_machine_type, delete_instance, delete_disk)
# and the GOOGLE_PROJECT / GOOGLE_ZONE constants live elsewhere in our code
from clearml import Task
from clearml.automation import TriggerScheduler


def gcp_start_trigger(task_id: str):
    curr_task = Task.get_task(task_id)
    # curr_task.reset(force=True)
    config = extract_config(curr_task)
    machine_type = config.get('machine-type')
    queue_name = f"gcp/{machine_type}"
    ensure_queue(queue_name)  # creates a new queue if it doesn't exist
    instance_name = name_generator(task_id)
    print(config)  # debug print
    gpus = create_gpus(config)  # define gpus
    create_from_machine_type(
        project_id=GOOGLE_PROJECT,
        zone=f"{GOOGLE_ZONE}",
        instance_name=instance_name,
        machine_type=machine_type,
        accelerators=gpus,
        queue_name=queue_name
    )
    Task.dequeue(curr_task)  # remove the task from the empty placeholder queue
    Task.enqueue(curr_task, queue_name=queue_name)  # put the task in the machine-specific queue
    return


def gcp_stop_trigger(task_id):
    instance_name = name_generator(task_id)
    delete_instance(
        project_id=GOOGLE_PROJECT,
        zone=f"{GOOGLE_ZONE}",
        machine_name=instance_name
    )
    delete_disk(
        project_id=GOOGLE_PROJECT,
        zone=f"{GOOGLE_ZONE}",
        machine_name=f"{instance_name}",
    )
    return


# poll every 10 seconds (the value is in minutes)
trigger = TriggerScheduler(pooling_frequency_minutes=10/60)
trigger.add_task_trigger(
    trigger_required_tags=['google'],
    schedule_function=gcp_start_trigger,
    trigger_on_status=['queued'],
    name="job_start",
)
trigger.add_task_trigger(
    trigger_required_tags=['google'],
    schedule_function=gcp_stop_trigger,
    trigger_on_status=['failed', 'completed', 'stopped', 'closed'],
    name="job_end",
)
trigger.start_remotely()
however, I don't think it's our code, since the trigger is not triggered at all, unless a new task is created :((
as for the clearml version, they differ:
- the clearml server we self-host shows this:
WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21
- the installed clearml in a trigger task shows
clearml==1.8.2
- the installed clearml in the experiment task that attempts to trigger is
1.9.0
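Since the three environments report different versions, one quick way to compare them is to print the installed package version from inside each one. This is only a minimal sketch using the standard library; the helper name `installed_version` is made up for illustration and is not part of clearml:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# run this inside the trigger task and inside the experiment task to compare
print("clearml:", installed_version("clearml"))
```

Returning None instead of raising keeps the check usable in an environment where the package is missing altogether.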
Is the trigger controller running on the services queue ?
Yes, yes it is
however, I don't think it's our code, since the trigger is not triggered at all, unless a new task is created :((
Yeah I think you are correct, I'm more interested in understanding how you use it ...
BTW can you test with the latest clearml python version (the trigger code is the important part)?
Yeah, you are right.
We use an empty queue to enqueue our tasks in, just to trigger the scheduler 😅 its only importance is that the experiment is not enqueued anywhere else, but the trigger then enqueues it
It's just that the trigger is never triggered
(Except when a new task is created - this was not the case)
This is odd... can you post the entire trigger code ?
also what's the clearml version?
We use an empty queue to enqueue our tasks in, just to trigger the scheduler
its only importance is that the experiment is not enqueued anywhere else, but the trigger then enqueues it
👍
It's just that the trigger is never triggered
(Except when a new task is created - this was not the case)
Is the trigger controller running on the services queue ?
Hi RotundHedgehog76
Notice that the "queued" is on the state of the Task, as well as the tag
We tried to enqueue the stopped task at the particular queue and we added the particular tag
What do you mean by specific queue? This will trigger on any Queued Task with the 'particular-tag'?
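To make the condition discussed above concrete, here is a rough model of when a trigger like "job_start" is expected to match: the Task status has to be one of the `trigger_on_status` values and the Task has to carry every tag in `trigger_required_tags`, regardless of which queue it sits in. This is an illustration only, not clearml's actual implementation, and `trigger_matches` is a hypothetical name:

```python
from typing import Iterable


def trigger_matches(
    status: str,
    tags: Iterable[str],
    on_status: Iterable[str],
    required_tags: Iterable[str],
) -> bool:
    """Illustrative check: status is watched AND all required tags are present."""
    return status in set(on_status) and set(required_tags).issubset(set(tags))


# values taken from the "job_start" trigger in the excerpt
on_status = ["queued"]
required_tags = ["google"]

# queued and tagged: expected to match, whatever the queue is
print(trigger_matches("queued", ["google"], on_status, required_tags))
# queued but missing the tag: no match
print(trigger_matches("queued", [], on_status, required_tags))
# tagged but not queued (e.g. still stopped): no match
print(trigger_matches("stopped", ["google"], on_status, required_tags))
```

Note that the queue name does not appear in the check at all, which is why the empty placeholder queue should be enough to satisfy it.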