Answered
Hi all, I'm training a model using AWS SageMaker and monitoring with a ClearML server on-prem. Works well enough when the training is split (Horovod - with a task on each rank). But when I try and spawn eval jobs to run on different AWS machines, it seems

Hi all, I'm training a model using AWS SageMaker and monitoring with a ClearML server on-prem. It works well enough when the training is split (Horovod, with a task on each rank). But when I try to spawn eval jobs to run on different AWS machines, it seems that Task.init kills the job.
Note that spawning the eval jobs works fine without ClearML.
I'm a little out of my depth figuring out what's wrong. Can anyone tell what I am missing?
Thanks!
Edit: in case it is relevant: the local training script creates the first task and the rest are in the cloud, while the eval jobs are spawned from the training jobs, so they all start in the cloud. That is the most significant difference I can think of between the training jobs that work fine and the eval jobs that fail.
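
To narrow down where the eval job dies, one option is to wrap the Task.init call so that whatever stops it gets logged before the process exits. The following is only a rough sketch, assuming the eval job is a plain Python entry point; `run_guarded`, the logger name, and the project/task names in the usage comment are hypothetical and not part of ClearML. It will not catch a process that is killed by an external signal.

```python
import logging
import traceback

# Hypothetical logger name; any logger the eval job already uses would do.
logger = logging.getLogger("eval_job")


def run_guarded(init_fn, *args, **kwargs):
    """Call init_fn and log whatever stops it, then re-raise.

    BaseException is caught on purpose so that SystemExit and
    KeyboardInterrupt are logged too, not only ordinary exceptions.
    """
    try:
        return init_fn(*args, **kwargs)
    except BaseException as exc:
        logger.error(
            "init call failed with %s: %s\n%s",
            type(exc).__name__,
            exc,
            traceback.format_exc(),
        )
        raise


# Hypothetical usage inside the eval entry point (names are placeholders):
# from clearml import Task
# task = run_guarded(Task.init, project_name="my-project", task_name="eval")
```

If nothing is logged and the job still disappears, that would suggest the process is being terminated from outside rather than failing inside the call.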

  
  
Posted 2 years ago

Answers 4


IrateDolphin19, can you explain a bit about how and what you're doing, and what seems to fail on the ClearML side? How do you create the tasks and manage them?

  
  
Posted 2 years ago

not sure if this makes it more or less clear 😕

  
  
Posted 2 years ago

I'm not familiar with pipelines; I don't believe I'm using them.

  
  
Posted 2 years ago

Hi IrateDolphin19,

Can you give a simple schema of what you're doing or trying to achieve? Are you using pipelines for this?

  
  
Posted 2 years ago