@<1523701070390366208:profile|CostlyOstrich36>
Let's say I have 2 datasets; each runs steps A-C with continue_on_fail=True. After that, I have a step D that does something at the end of the pipeline. D should run once C1 and C2 (step C for dataset 1 and for dataset 2) have completed, failed, been skipped, or been aborted (as long as neither of them is still queued or running).
currently, D runs only if both C1 and C2 completed successfully
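Roughly this layout (a minimal sketch; the project and step-task names are made up):

```python
from clearml import PipelineController

pipe = PipelineController(name="two-datasets", project="examples", version="1.0")

# steps A-C per dataset, each allowed to fail without killing the pipeline
for ds in ("1", "2"):
    parent = None
    for step in ("A", "B", "C"):
        pipe.add_step(
            name=f"{step}{ds}",
            parents=[parent] if parent else None,
            base_task_project="examples",   # made-up project
            base_task_name=f"step_{step}",  # made-up task names
            continue_on_fail=True,
        )
        parent = f"{step}{ds}"

# D waits on both C steps -- but it only triggers when both completed successfully
pipe.add_step(name="D", parents=["C1", "C2"],
              base_task_project="examples", base_task_name="step_D")

pipe.start()
```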
@<1523701070390366208:profile|CostlyOstrich36> I want the task to be queued and the pipeline to act like it's just a queued task and not fail
Hi Johan, the ClearML SDK used in the task is 1.11.1. Is ClearML doing something behind the scenes that you feel like sharing?
cool, where can I submit it?
I also noticed that the datetime strings for last_update and started are None
maybe relaunch is not the proper solution, but I'm not sure what is, so I'm open to suggestions
Hi @<1523702000586330112:profile|FierceHamster54> , @<1523701087100473344:profile|SuccessfulKoala55> and @<1523701070390366208:profile|CostlyOstrich36> ,
I have a machine (for simplicity, let's say just one machine with Ubuntu) with multiple repositories of my own packages (let's call them my-util and my-service), Python, and ClearML. The path to my-util is defined in the system PYTHONPATH. my-service imports utils from my-util.
On the same machine, running code from my-service using ...
I have multiple combinations of functions I want to call (some steps need some of them, others need all of them, and some need none of them). Writing a "do_x_then_y" function for every combination of x, y functions seems messy; wrapping it with support for an iterable of functions seems more Pythonic IMO, something like the sketch below.
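A toy sketch (the function and variable names are made up):

```python
from typing import Any, Callable, Iterable

def apply_steps(data: Any, funcs: Iterable[Callable[[Any], Any]]) -> Any:
    # apply each function in order, so any combination is just a different list
    for func in funcs:
        data = func(data)
    return data

# apply_steps(df, [clean, normalize])  # step that needs both
# apply_steps(df, [clean])             # step that needs one
# apply_steps(df, [])                  # step that needs none
```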
ok, how would you approach getting all the tasks in a pipeline and their data?
@<1523701087100473344:profile|SuccessfulKoala55> great! So that means it is possible to catch tasks with status aborted and reason non-responsive, and retry them so they go back to the queue? Also, how do I change the timeout on the ClearML server?
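Something like this, maybe? (a sketch; I'm assuming aborted maps to the backend status "stopped", and the project name, queue name, and exact reason string are made up):

```python
from clearml import Task

# find aborted tasks ("stopped" in the backend) and re-queue the non-responsive ones
aborted = Task.get_tasks(
    project_name="my_project",           # made-up project name
    task_filter={"status": ["stopped"]},
)
for task in aborted:
    reason = (task.data.status_reason or "").lower()
    if "non-responsive" in reason:       # assumption: the exact reason string may differ
        task.reset(set_started_on_success=False, force=True)
        Task.enqueue(task, queue_name="default")  # made-up queue name
```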
@<1523701070390366208:profile|CostlyOstrich36> do you have any tips regarding this?
the last two were found by debugging, but active_duration (which I'm using) is in the API and derived from them
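e.g. (a sketch; whether active_duration is exposed on the task object may depend on the server/SDK version, and the task id is a placeholder):

```python
from clearml import Task

t = Task.get_task(task_id="TASK_ID")  # placeholder id
d = t.data
# started / last_update are the raw timestamps; active_duration is derived from them
print(d.started, d.last_update, getattr(d, "active_duration", None))
```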
hi, the first one: some IP for each instance, for init.
this is a protected method and therefore should not be called from outside (meaning it's not good practice to do my_pipeline_controller._relaunch_node(failed_node))
I want to create a status_change_callback that checks whether a node failed due to connection loss, and if so re-adds the task to the queue
my current code looks like this:
def retry_on_connection_error(pipeline: PipelineController, node: PipelineController.Node, *_, **__) -> None:
    if node.job is not None:
        is_stopp...
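For context, the complete version I'm aiming at would look roughly like this (a sketch; the reason string, the reset step, and the queue name are assumptions):

```python
from clearml import PipelineController, Task

def retry_on_connection_error(pipeline: PipelineController,
                              node: PipelineController.Node, *_, **__) -> None:
    # only look at nodes whose underlying job has actually stopped
    if node.job is None or not node.job.is_stopped():
        return
    task = node.job.task
    reason = (task.data.status_reason or "").lower()
    if "non-responsive" in reason:  # assumption: the exact reason string may differ
        task.reset(set_started_on_success=False, force=True)
        Task.enqueue(task, queue_name="default")  # made-up queue name

# registered on the step, e.g.:
# pipe.add_step(..., status_change_callback=retry_on_connection_error)
```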
if I have to guess, it is. After all, the Autoscaler is not up, so it can't shut them down; this means you're waiting for Amazon to take them away instead.