Hi! I have some agents on GCP. Lately I have been getting some experiments that simply stop running (no signs that the experiment crashed). Here is a plot that shows the resource monitoring. Any ideas on what could be causing this?

Posted 2 years ago

Answers 6

Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. "Stopped" is not the right word; "frozen" is closer: the experiment just froze at an epoch. The console on the agent shows the first batch of epoch 33, while the console on the server shows the last batch of epoch 32. The experiment had been running for ~6 hours.

Posted 2 years ago

GrievingTurkey78 Hi!

What versions of clearml and clearml-agent are you using? Also, how long had the experiments been running?

It seems the agent is still reporting iterations and usage for the experiment, so what do you mean by stopped?

Posted 2 years ago

You can check the run time by switching to 'wall time' axis 🙂

Posted 2 years ago

Yeah, I experienced the same issue. Training stops / freezes at the end of the 10th or 15th epoch. I am using pytorch_lightning as well.

Posted 2 years ago

I am using pytorch_lightning. I'll try to create a snippet I can share! Thanks 🙌

Posted 2 years ago

GrievingTurkey78, what framework are you working with? Can you provide some more information about your environment: Linux/Windows, pip/conda? Can you also share a snippet of your code that I can run to try to reproduce the issue?

Posted 2 years ago
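
For anyone hitting the same freeze, here is a minimal sketch for collecting the environment details asked for above (OS, Python, pip/conda, package versions) so they can be pasted into a reply. It uses only the Python standard library. The helper names `package_version` and `environment_report` are made up for illustration, the default package list is an assumption based on the packages mentioned in this thread, and the conda check is only a heuristic.

```python
import os
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(
    packages=("clearml", "clearml-agent", "pytorch-lightning", "torch"),
):
    """Collect OS, Python, and package versions into a plain dict.

    The default package names are assumptions taken from this thread;
    pass your own tuple to report on other packages.
    """
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        # Heuristic only: CONDA_PREFIX is normally set in an activated conda env.
        "conda": "CONDA_PREFIX" in os.environ,
    }
    for name in packages:
        report[name] = package_version(name) or "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this on the agent machine (inside the same environment the experiment uses) should give a short report to post alongside the reproduction snippet.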