Hey,
We've experienced some issues with the ClearML trigger schedulers we were playing with in the last few days. This is what happened:
Unfortunately, no, I can't paste the whole code. In a nutshell, the trigger spawns a new GCE instance running a clearml-agent, which schedules the experiments in the cloud.
This is an excerpt:
from clearml import Task
from clearml.automation import TriggerScheduler

# extract_config, ensure_queue, name_generator, create_gpus,
# create_from_machine_type, delete_instance, delete_disk and the
# GOOGLE_* constants are defined elsewhere (not shown).


def gcp_start_trigger(task_id: str):
    curr_task = Task.get_task(task_id)
    # curr_task.reset(force=True)
    config = extract_config(curr_task)
    machine_type = config.get('machine-type')
    queue_name = f"gcp/{machine_type}"
    ensure_queue(queue_name)  # creates a new queue if it doesn't exist
    instance_name = name_generator(task_id)
    print(config)  # debug print
    gpus = create_gpus(config)  # define gpus
    create_from_machine_type(
        project_id=GOOGLE_PROJECT,
        zone=f"{GOOGLE_ZONE}",
        instance_name=instance_name,
        machine_type=machine_type,
        accelerators=gpus,
        queue_name=queue_name
    )
    Task.dequeue(curr_task)  # remove from an empty queue
    Task.enqueue(curr_task, queue_name=queue_name)  # put the task in a particular queue
    return


def gcp_stop_trigger(task_id):
    instance_name = name_generator(task_id)
    delete_instance(
        project_id=GOOGLE_PROJECT,
        zone=f"{GOOGLE_ZONE}",
        machine_name=instance_name
    )
    delete_disk(
        project_id=GOOGLE_PROJECT,
        zone=f"{GOOGLE_ZONE}",
        machine_name=f"{instance_name}",
    )
    return


# poll every 10 seconds (the value is given in minutes)
trigger = TriggerScheduler(pooling_frequency_minutes=10/60)
trigger.add_task_trigger(
    trigger_required_tags=['google'],
    schedule_function=gcp_start_trigger,
    trigger_on_status=['queued'],
    name="job_start",
)
trigger.add_task_trigger(
    trigger_required_tags=['google'],
    schedule_function=gcp_stop_trigger,
    trigger_on_status=['failed', 'completed', 'stopped', 'closed'],
    name="job_end",
)
trigger.start_remotely()
However, I don't think it's our code, since the trigger doesn't fire at all unless a new task is created :((
As for the ClearML versions, they differ:
- the ClearML server we self-host shows: WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21
- the clearml installed in the trigger task is clearml==1.8.2
- the clearml installed in the experiment task that attempts to trigger is 1.9.0
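For anyone who wants to compare versions across environments in the same way, here is a minimal sketch of how the installed package versions could be read from inside each task. The helper name `installed_version` is made up for illustration; it only uses the standard library and reports the client-side packages, not the server version shown in the WebApp.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or a placeholder if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # "clearml" and "clearml-agent" are the PyPI distribution names.
    for name in ("clearml", "clearml-agent"):
        print(f"{name}=={installed_version(name)}")
```

Running this in the trigger task and in the experiment task should make it easy to see which client versions each one actually picked up.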