Answered
Hi, I’m trying out ClearML Pipelines from decorators, and I’m encountering a few problems I don’t know how to solve.

Hi,
I’m trying out ClearML Pipelines from Decorators, and I’m encountering a few problems I don’t know how to solve.
I’d like to configure a requirements file, docker image, and docker command for my pipeline controller, but it seems I cannot set it up. Am I missing something?

I’d also like to set up uploading pipeline artifacts / outputs of pipeline steps to a GCP bucket. By default they are uploaded to the file server, which seems suboptimal, but it seems there is no option to make a GCP bucket the default. Am I missing something?

Posted one year ago

Answers 6


I’d definitely prefer the ability to set a docker image/docker args/requirements config for the pipeline controller too

That makes sense. Any chance you can open a GitHub issue with the feature request so that we do not forget?

The current implementation will upload the result of the first component, and then the first thing the next component will do is download it.

If they are on the same machine, it should be cached when accessed the second time.

Wouldn’t it be more performant for the first component to store its result in the local cache alongside uploading it to the file server? That way, the next component, if run on the same node, wouldn’t need to download it from the file server.

I think you are correct, since the first time it will not pass through the cache...
Not sure if there is an easy "path" to tell the cache "put this file in the cache"...

Posted one year ago

Hi DizzyPelican17
I’d like to configure a requirements file, docker image, and docker command for my pipeline controller, but it seems I cannot set it up. Am I missing something?

The decorator itself accepts those as arguments:
https://clear.ml/docs/latest/docs/references/sdk/automation_controller_pipelinecontroller#pipelinedecoratorcomponent
https://github.com/allegroai/clearml/blob/90f30e8d9a5ca9a1afa6b2e5ffccb96b0afe9c78/examples/pipeline/pipeline_from_decorator.py#L8
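For reference, a rough sketch of a step that sets its own image and requirements through the component decorator (the image name, docker args and package list below are just placeholders):

```python
from clearml.automation.controller import PipelineDecorator

# Per-step execution settings (docker image, docker arguments, package
# requirements) are passed directly to the component decorator.
@PipelineDecorator.component(
    return_values=["processed_path"],
    packages=["pandas>=1.3"],      # requirements installed for this step
    docker="python:3.9-bullseye",  # base docker image for this step
    docker_args="--memory=4g",     # extra arguments for the docker run
)
def preprocess(raw_path):
    import pandas as pd
    df = pd.read_csv(raw_path).dropna()
    out = "processed.csv"
    df.to_csv(out, index=False)
    return out
```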

I’d also like to set up uploading pipeline artifacts / outputs of pipeline steps to a GCP bucket. By default they are uploaded to the file server, which seems suboptimal, but it seems there is no option to make a GCP bucket the default. Am I missing something?

Sure, you can configure the file_server so every artifact is uploaded to GCP instead of the default file server:
https://github.com/allegroai/clearml/blob/90f30e8d9a5ca9a1afa6b2e5ffccb96b0afe9c78/docs/clearml.conf#L10
Just put gs://bucket/folder there, and do not forget to configure your credentials:
https://github.com/allegroai/clearml/blob/90f30e8d9a5ca9a1afa6b2e5ffccb96b0afe9c78/docs/clearml.conf#L126
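Roughly, the relevant parts of clearml.conf would look like this (bucket, folder, project name and credentials path are placeholders, assuming the standard sdk.google.storage section):

```
api {
    # artifacts / outputs are uploaded here instead of the default file server
    files_server: "gs://my-bucket/clearml"
}

sdk {
    google.storage {
        # GCP credentials used when uploading to the bucket
        credentials_json: "/path/to/service_account.json"
        project: "my-gcp-project"
    }
}
```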

Posted one year ago

I understand I can change the docker image for a component in the pipeline, but for the pipeline logic function itself it isn’t possible.

You can always call Task.current_task().connect() from the pipeline function itself to connect more configuration. Arguments you add via the function itself, i.e. all the pipeline logic function arguments, become pipeline arguments; it’s kind of neat 🙂
Regarding docker: the idea is that you use a very basic python docker (the default for the services queue) for all the pipeline logic. That said, inside the pipeline function you can call Task.current_task().set_base_docker() and set the base docker to be used. The only caveat is that you first have to run it locally.
It might be a good idea to add a docker option to the decorator itself regardless, like we have in the component, wdyt?
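Something along these lines, as a rough sketch of that workaround (the pipeline/project names, image and extra config are placeholders):

```python
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.0.1")
def pipeline_logic(dataset_id: str):
    # the controller's own Task is accessible from inside the pipeline function
    controller_task = Task.current_task()
    # override the docker image the controller uses when re-launched remotely
    controller_task.set_base_docker("python:3.9-bullseye")
    # connect extra configuration beyond the pipeline function arguments
    controller_task.connect({"notify_on_failure": True}, name="extra_config")
    # ... call the pipeline components from here ...
```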

Posted one year ago

Btw, when running pipelines, a common pattern is that two consecutive components end up on the same node. The current implementation will upload the result of the first component, and then the first thing the next component will do is download it. I assume the get_local_copy method is used and the output is stored in the local cache. Wouldn’t it be more performant for the first component to store its result in the local cache alongside uploading it to the file server? That way, the next component, if run on the same node, wouldn’t need to download it from the file server.

Posted one year ago

Thanks for the response!
I understand I can change the docker image for a component in the pipeline, but for the https://github.com/allegroai/clearml/blob/90f30e8d9a5ca9a1afa6b2e5ffccb96b0afe9c78/examples/pipeline/pipeline_from_decorator.py#L77 it isn’t possible. I see that you can only change the queue it runs on, not the docker image, its params, or the requirements.

Thanks for this! I’ll try it!

Posted one year ago

Thanks for the extensive response! As the solution seems a bit hackish, I’d definitely prefer the ability to set a docker image / docker args / requirements config for the pipeline controller too.

Posted one year ago
Tags
gcp