Hi @<1523702000586330112:profile|FierceHamster54> , @<1523701087100473344:profile|SuccessfulKoala55> and @<1523701070390366208:profile|CostlyOstrich36> ,
I have a machine (for simplicity, let's say just one machine running Ubuntu) with multiple repositories of my own packages (let's call them my-util and my-service), Python and ClearML. The path to my-util is defined in the system PYTHONPATH. my-service imports utils from my-util.
On the same machine, running a code from my-service using ...
hi, the first one. some ip for each instance for init.
cool, where can I submit it?
@<1523701070390366208:profile|CostlyOstrich36> do you have any tips regarding this?
if I have to guess, it is. after all, the Autoscaler is not up, so it can't shut them down. this means you're waiting for amazon to take them away instead.
Hi, Johan, clearml sdk used in the task is on 1.11.1, is clearml doing something behind the scenes you feel like sharing?
currently, D runs only if both C1 and C2 completed successfully
Let's say I have 2 datasets, each runs steps A-C, with continue_on_fail=True. After that, I have a step D that does something at the end of the pipeline. D should run if steps C1 and C2 (C for set 1 and C for set 2) completed, failed, were skipped or were aborted (as long as neither of them is still in queue or running)
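A minimal sketch of the condition described above, independent of any ClearML API: D may run once every parent step has reached a terminal state, whatever that state is. The status strings and the `ready_for_final_step` helper are assumptions for illustration; the actual status names should be checked against what the pipeline nodes report.

```python
# Statuses after which a step will not change any more (assumed names).
TERMINAL_STATUSES = {"completed", "failed", "skipped", "aborted"}


def ready_for_final_step(parent_statuses):
    """Return True when every parent step has finished, successfully or not."""
    return all(status in TERMINAL_STATUSES for status in parent_statuses)


print(ready_for_final_step(["completed", "failed"]))  # both C steps are done
print(ready_for_final_step(["completed", "queued"]))  # C2 is still waiting
```

The check only looks at whether the parents are finished, so a failed C1 does not block D, while a queued or running C2 does.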
I have multiple combinations of functions I want to call (some steps need some of them, others need all of them, and some need none of them). Writing a "do_x_then_y" function for every combination of x and y functions seems messy. Wrapping it with support for an iterable of functions seems more pythonic IMO.
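One way the "iterable of functions" idea could look, as a plain-Python sketch. `apply_all`, `strip_text` and `lowercase` are hypothetical names made up for this example, not part of ClearML.

```python
from typing import Any, Callable, Iterable


def apply_all(value: Any, functions: Iterable[Callable[[Any], Any]]) -> Any:
    """Feed value through each function in order and return the final result."""
    for func in functions:
        value = func(value)
    return value


def strip_text(text: str) -> str:
    return text.strip()


def lowercase(text: str) -> str:
    return text.lower()


# A step that needs both helpers, and a step that needs none of them.
print(apply_all("  Hello ", [strip_text, lowercase]))
print(apply_all("  Hello ", []))
```

An empty iterable returns the input unchanged, so steps that need none of the functions can go through the same wrapper.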
@<1523701070390366208:profile|CostlyOstrich36>
this is a protected method and therefore should not be called from outside (meaning it's not good practice to do my_pipeline_controller._relaunch_node(failed_node))
I want to create a status_change_callback that checks if node failed due to connection loss, and if so re-adds the task to the queue
my current code looks like this:
def retry_on_connection_error(pipeline: PipelineController, node: PipelineController.Node, *_, **__) -> None:
    if node.job is not None:
        is_stopp...
maybe relaunch is not the proper solution, but I'm not sure what is, so I'm open to suggestions
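A rough sketch of the decision part only, kept free of ClearML calls so it can be tested on its own. It assumes the task's status and status reason are available as strings; the marker substrings, the retry limit and the function names are all hypothetical and should be adjusted to the messages actually seen. How the task gets re-queued is left open here.

```python
from typing import Optional

# Substrings that suggest the worker lost its connection (assumed markers).
CONNECTION_LOSS_MARKERS = ("non-responsive", "connection")
MAX_RETRIES = 3  # arbitrary limit for this example


def looks_like_connection_loss(status: str, status_reason: Optional[str]) -> bool:
    """True for a stopped or failed task whose reason mentions a connection problem."""
    if status not in ("aborted", "stopped", "failed"):
        return False
    reason = (status_reason or "").lower()
    return any(marker in reason for marker in CONNECTION_LOSS_MARKERS)


def should_retry(status: str, status_reason: Optional[str], retries: int) -> bool:
    """Allow a retry only for connection-like failures, up to MAX_RETRIES times."""
    return retries < MAX_RETRIES and looks_like_connection_loss(status, status_reason)


print(should_retry("aborted", "Forced stop (non-responsive)", retries=0))
print(should_retry("aborted", "user aborted", retries=0))
```

Keeping the check separate from the callback means the callback itself only has to read the status fields from the node and act on the boolean.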
I also noticed that the datetime strings for last_update and started are None
@<1523701087100473344:profile|SuccessfulKoala55> great! So that means it is possible to catch tasks with status aborted and reason non-responsive and retry them so they go back to the queue? Also, how do I change the timeout in the ClearML server?
the last two were found by debugging, but active_duration (which I'm using) is in the API and derived from them
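For illustration, a small sketch of deriving a duration from two timestamp strings while tolerating missing values. It assumes the duration is simply the difference between the two timestamps and that they are ISO 8601 strings; neither is confirmed here, and `duration_seconds` is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional


def duration_seconds(started: Optional[str], last_update: Optional[str]) -> Optional[float]:
    """Seconds between two ISO 8601 timestamps, or None if either one is missing."""
    if started is None or last_update is None:
        return None
    delta = datetime.fromisoformat(last_update) - datetime.fromisoformat(started)
    return delta.total_seconds()


print(duration_seconds("2023-05-01T10:00:00", "2023-05-01T10:01:30"))
print(duration_seconds(None, "2023-05-01T10:01:30"))
```

Returning None instead of raising makes it easy to skip tasks that never started.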
@<1523701070390366208:profile|CostlyOstrich36> I want the task to be queued and the pipeline to act like it's just a queued task and not fail
ok, how would you access all tasks in a pipeline and get their data?