To be clearer, an example use case for me would be a pipeline that runs every time a new dataset/batch is published using clearml-data:
Get the data, train a model, save the model, and publish it.
I want to start this process with a trigger when a dataset is published to the server. Is there any example I can look at for accomplishing something like this?
VexedCat68, what do you mean by trigger? You want some indication that a dataset was published so you can move to the next step in your pipeline?
Hi VexedCat68
Check this example:
https://github.com/allegroai/clearml/blob/4f9aaa69ed2d5b8ea68ebee5508610d0b1935d5f/examples/scheduler/trigger_example.py#L44
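For context, a minimal sketch based on that example (the task id, queue, and project names below are placeholders, and it assumes a running ClearML server):

```python
from clearml.automation import TriggerScheduler

# Poll the server every few minutes for matching dataset events
trigger = TriggerScheduler(pooling_frequency_minutes=3.0)

trigger.add_dataset_trigger(
    name='dataset publish trigger',          # placeholder name
    schedule_task_id='<training-task-id>',   # placeholder: task to clone and enqueue
    schedule_queue='default',                # queue the cloned task is pushed to
    trigger_project='my_dataset_project',    # placeholder: watch datasets in this project
    trigger_on_publish=True,                 # fire when a dataset is published
)

# Run the watcher locally (use start_remotely() to run it on the services queue)
trigger.start()
```

An agent listening on the `default` queue then picks up and runs each enqueued task.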
So I took the dataset trigger from this and added it to my own test code, which needs to run a task every time the trigger is activated.
It works; however, it shows the task as enqueued and pending. Note that I am using .start() and not .start_remotely() for now.
So I just published a dataset once, but it keeps scheduling tasks.
Okay, so they run once I start a ClearML agent listening to that queue.
So it won't work without clearml-agent? Sorry for the barrage of questions. I'm just very confused right now.
Yes, for an enqueued task to run you need an agent to pick it up and run it 🙂
I do however have another problem. I have a dataset trigger that has a scheduled task.
So in my head, every time I publish a dataset, the trigger should fire and run that task.
But what's happening is that even though I published the dataset only once, the trigger fires and enqueues a task on every poll.
VexedCat68
But what's happening is, that I only publish a dataset once but every time it polls,
this seems wrong (i.e. a bug?!). How did you set up the trigger? Is the trigger task constantly running, or are you re-launching it?
This screenshot shows my situation: you can see the code on the left and the tasks called 'Cassava Training' on the right. They keep getting enqueued even though I only sent a trigger once, by which I mean I only published a dataset once.
Also, could you explain the difference between trigger.start() and trigger.start_remotely()?
I'd like to add an update to this: when I use a schedule function instead of a scheduled task with the dataset trigger scheduler, it works as intended. It runs the desired function when triggered, then goes back to sleep since no other trigger was fired.
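For reference, the function-based variant looks roughly like this (names are placeholders; it assumes the callback receives the triggering dataset id, as in the linked example):

```python
from clearml.automation import TriggerScheduler

def on_dataset_published(dataset_id):
    # Runs inside the trigger process itself -- no agent or queue needed
    print('Dataset {} was published'.format(dataset_id))

trigger = TriggerScheduler(pooling_frequency_minutes=3.0)
trigger.add_dataset_trigger(
    name='dataset publish function',       # placeholder name
    schedule_function=on_dataset_published,
    trigger_project='my_dataset_project',  # placeholder project
    trigger_on_publish=True,
)
trigger.start()
```

Because the function runs in the trigger process rather than being enqueued, no agent is involved on this path.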
This problem occurs when I'm scheduling a task. Copies of the task keep being put on the queue even though the trigger only fired once.
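To illustrate the behavior being described, here is a plain-Python sketch (not ClearML code, all names are illustrative) of why a poll-based trigger fires repeatedly if it doesn't remember which dataset ids it already acted on:

```python
def naive_poll(published_ids, fired):
    """Fires whenever any published dataset exists -> re-fires on every poll."""
    if published_ids:
        fired.append(published_ids[-1])

def deduped_poll(published_ids, fired, state):
    """Fires only for dataset ids not seen before -> one task per publish."""
    for ds_id in published_ids:
        if ds_id not in state:
            state.add(ds_id)
            fired.append(ds_id)

published = ["dataset-001"]          # a single publish event

naive_fired = []
for _ in range(3):                   # three polling cycles
    naive_poll(published, naive_fired)

deduped_fired, seen = [], set()
for _ in range(3):
    deduped_poll(published, deduped_fired, seen)

print(len(naive_fired))    # 3 -- one task enqueued per poll
print(len(deduped_fired))  # 1 -- one task for the single publish
```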
Also, the task just prints a small string on the console.
Okay so when I add trigger_on_tags, the repetition issue is resolved.
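For reference, the tag-based setup looks roughly like this (placeholder names; `trigger_on_tags` fires only when the listed tags appear on a dataset):

```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3.0)
trigger.add_dataset_trigger(
    name='tagged dataset trigger',          # placeholder name
    schedule_task_id='<training-task-id>',  # placeholder task id
    schedule_queue='default',
    trigger_project='my_dataset_project',   # placeholder project
    trigger_on_tags=['ready'],              # fire only on datasets carrying this tag
)
trigger.start()
```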
Also, could you explain the difference between trigger.start() and trigger.start_remotely()?
start() will run the trigger process (the one "watching" for changes) locally; this makes sense for debugging, etc.
start_remotely() will launch the trigger process on the "services" queue, where it should live forever 🙂
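Concretely, the two options look like this (the add_dataset_trigger setup is elided; `services` is the queue conventionally used for long-running controllers):

```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3.0)
# ... trigger.add_dataset_trigger(...) calls go here ...

# Debugging: run the watcher in the current process (blocks this script)
trigger.start()

# Production: launch the watcher on the services queue so it lives forever
# trigger.start_remotely(queue='services')
```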
Okay so when I add trigger_on_tags, the repetition issue is resolved.
Nice!
This problem occurs when I'm scheduling a task. Copies of the task keep being put on the queue even though the trigger only fired once.
Hmm, I think I'm a bit lost here (and I have a feeling there is a hidden bug somewhere that I'd like us to fix).
How exactly do I make it trigger twice on the same Dataset?