Unanswered

Hello ClearML Community,

I'm using a self-hosted ClearML setup, and while it generally performs well, I encounter issues during disruptions, such as temporary network problems.

Here's the specific scenario: I have a pipeline with three tasks. The first task completes successfully, and the second task is queued on a worker node and begins processing. However, if the head node experiences issues such as a network problem or a reboot, communication between the head and compute nodes is lost. Consequently, the entire pipeline status changes to "Forced stop (non-responsive)," preventing the third task from starting. Despite this, the second task completes successfully, and once the head node is responsive again, all logs from the compute node are uploaded.

This situation is particularly problematic for long-running tasks. If the third task doesn't run, I have to restart the entire pipeline, which means waiting several days for the second task to finish again. I noticed there's a "Continue" option when the pipeline is in the "Forced stop (non-responsive)" state (accessible via right-click on the pipeline). Unfortunately, this option restarts the entire pipeline from the beginning.
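To make the resume behavior I would expect concrete, here is a minimal sketch in plain Python. It does not use the ClearML API: the step names, the status strings, and the `steps_to_run` helper are all hypothetical, and it assumes a strictly linear pipeline like mine.

```python
# Plain-Python illustration only; this is NOT the ClearML API.
# Step names and status strings are hypothetical.
from typing import Dict, List


def steps_to_run(order: List[str], status: Dict[str, str]) -> List[str]:
    """Return the steps a resumed linear pipeline would still need to execute.

    A step is skipped only if it completed and every step before it
    in the pipeline also completed.
    """
    for index, step in enumerate(order):
        if status.get(step) != "completed":
            # Resume from the first step that did not complete.
            return order[index:]
    return []  # everything finished; nothing to rerun


order = ["task_1", "task_2", "task_3"]
# State once the head node is back: tasks 1 and 2 finished, task 3 never started.
status = {"task_1": "completed", "task_2": "completed", "task_3": "not_started"}

print(steps_to_run(order, status))  # prints ['task_3']
```

In other words, I would expect "Continue" to pick up at the third task, whereas what I observe corresponds to rerunning the whole `order` list.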

I would like to ask:

  • Is the behavior I'm experiencing a bug in ClearML? Should the pipeline continue from the third task?
  • Is there an alternative way to resume the pipeline without starting from the beginning?
  
  
Posted 2 months ago

Answers