Answered
Hey All, We Have Clearml Running On Our K8S On Prem With 4 Worker Nodes For Clearml-Agents And One Node For Clearml-Server. We Would Like To Start Using Clearml Datasets And Running Pipelineg Training On It. The Datasets Might Be Very Large, About 20G. I

Hey all, we have ClearML running on our on-prem K8s cluster, with 4 worker nodes for ClearML Agents and one node for the ClearML Server.
We would like to start using ClearML Datasets and running pipeline training on them.
The datasets might be very large, about 20 GB.
I am worried that the training might create a high load on our network.
Would you recommend to:

  • Host the dataset on the ClearML Server?
  • Host it on S3 / R2?
  • Host it in our K8s cluster with MinIO, or on a specific NFS path?

Also, is it true that, unless streaming is explicitly enabled, the ClearML Agent downloads the entire dataset before training begins?
How important is it that the ClearML PVC is on SSD instead of HDD?

  
  
Posted 3 months ago

Answers


Hi @WorriedSwan6, to answer your questions:

Would you recommend to:

  • Host the dataset on the ClearML Server?
  • Host it on S3 / R2?
  • Host it in our K8s cluster with MinIO, or on a specific NFS path?

I personally like using AWS S3 if it is available, or MinIO if running locally. It really depends on your infrastructure, so I would suggest testing which setup works best for you.
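For reference, here is a minimal sketch of registering a dataset with its files stored on S3 or MinIO rather than on the ClearML file server. The bucket, prefix, host, project, and dataset names are placeholders (assumptions, not values from this thread), and the `storage_uri` helper is hypothetical. For MinIO, credentials for the host also need to be configured in `clearml.conf`.

```python
from typing import Optional


def storage_uri(bucket: str, prefix: str, minio_host: Optional[str] = None) -> str:
    """Build the output URI for a dataset (hypothetical helper).

    AWS S3 uses s3://bucket/prefix; for MinIO, ClearML expects the
    host and port before the bucket: s3://host:port/bucket/prefix.
    """
    prefix = prefix.strip("/")
    if minio_host:
        return f"s3://{minio_host}/{bucket}/{prefix}"
    return f"s3://{bucket}/{prefix}"


def register_dataset(local_dir: str, project: str, name: str, uri: str) -> str:
    """Create a dataset version, upload its files to `uri`, and finalize it."""
    # Imported lazily so the helper above can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_project=project, dataset_name=name, output_uri=uri)
    dataset.add_files(path=local_dir)  # register files from the local folder
    dataset.upload()                   # push the file contents to the output URI
    dataset.finalize()                 # close the version so it can be consumed
    return dataset.id


# Example call (placeholder names, assumes a reachable MinIO service):
# uri = storage_uri("datasets", "training", minio_host="minio.example.local:9000")
# register_dataset("./data", "my_project", "my_dataset", uri)
```

Switching between S3 and MinIO then only changes the URI, which makes it easy to test both setups against the same data.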

Also, is it true that, unless streaming is explicitly enabled, the ClearML Agent downloads the entire dataset before training begins?

What do you mean by streaming? Also, the agent only orchestrates the run; downloading the data is done by your own code (using the ClearML SDK, of course).
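As a rough illustration of what "your own code" looks like, the sketch below fetches a dataset with the ClearML SDK from inside a training script. The project and dataset names are placeholders, and `list_files` is a hypothetical helper added only for the example. `Dataset.get_local_copy()` returns a path to a locally cached, read-only copy of the whole dataset, so the download happens at the point your script calls it.

```python
from pathlib import Path
from typing import List


def fetch_dataset(dataset_project: str, dataset_name: str) -> Path:
    """Download the dataset (or reuse the local cache) and return its folder."""
    # Imported lazily so list_files() below works without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    return Path(dataset.get_local_copy())


def list_files(root: Path, suffix: str = "") -> List[Path]:
    """Return all files under `root` whose names end with `suffix`, sorted."""
    return sorted(p for p in root.rglob("*") if p.is_file() and p.name.endswith(suffix))


# Example use inside a training script (placeholder names):
# data_dir = fetch_dataset("my_project", "my_dataset")
# train_files = list_files(data_dir, ".csv")
```

Because the copy is cached on the worker, repeated runs on the same node should not need to pull the full 20 GB again, assuming the cache is kept between runs.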

How important is it that the ClearML PVC is on SSD instead of HDD?

That doesn't sound very critical to me.

  
  
Posted 3 months ago