Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered

Hi.
I'm running this little pipeline:

```python
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(return_values=['run_dataset_path'], cache=False, task_type=TaskTypes.data_processing)
def make_dataset(datasets_path):
    from pathlib import Path

    from fastai.vision.all import untar_data, URLs

    datasets_path = Path(datasets_path)
    # The next line fetches files to /data/clearml_evaluation/fastai_image_classification/datasets/oxford-iiit-pet/
    # or skips the download if they are already there
    run_dataset_path = str(untar_data(URLs.PETS, data=datasets_path) / "images")
    return run_dataset_path


@PipelineDecorator.pipeline(name='fastai_image_classification_pipeline', project='lavi_evaluation', version='0.2')
def fastai_image_classification_pipeline(run_datasets_path):
    print("make dataset")
    run_dataset_path = make_dataset(datasets_path=run_datasets_path)
    print(f"run_dataset_path: {run_dataset_path}")
    print("pipeline complete")


if __name__ == '__main__':
    PipelineDecorator.run_locally()
    fastai_image_classification_pipeline("/data/clearml_evaluation/fastai_image_classification/datasets/")
```

I expected the pipeline component to (quickly) download files to a local path on my laptop and return a string with the path to where the files were downloaded. When I run this, I get the printout below and the code seems to hang. Looking at the component's "artifacts" tab, I think that it is actually uploading all the files to files.clear.ml.

Am I doing something wrong? Why is this taking so long?

  
  
Posted 2 years ago

Answers 27


I'm connecting to the hosted clear.ml
packages in use are:
```
# Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
clearml == 1.6.2
fastai == 2.7.5
```
In case it matters, I'm running this code in a jupyter notebook within a docker container (to keep things well isolated). The /data path is volume-mapped to my local filesystem (and, in fact, already contains the dataset files, so the fastai call to untar_data should see the data there and return immediately).
That same make_dataset function call works as expected when not decorated (i.e. outside a pipeline).

  
  
Posted 2 years ago

The pipeline eventually completed after ~20 minutes, and the log shows it has downloaded a 755 MB file.
I can also download the zip file from the artifacts tab for the component now.
Why is the data being uploaded and downloaded? Can I prevent that?
I get that clearml likes to take good care of my data, but I must be doing something wrong here, as it doesn't make sense for a dataset to be uploaded to files.clear.ml.

  
  
Posted 2 years ago

Note that if I change the component to return a regular meaningless string - "mock_path" , the pipeline completes rather quickly and the dataset is not uploaded.

  
  
Posted 2 years ago

Hi PanickyMoth78 ,

Note that if I change the component to return a regular meaningless string - "mock_path" - the pipeline completes rather quickly and the dataset is not uploaded.

I think it will use the cache from the second run; it should be much, much quicker (nothing to download).

The files server is the default destination for saving all the artifacts. You can change this default with an env var ( CLEARML_DEFAULT_OUTPUT_URI ), in the config file ( sdk.development.default_output_uri ), or per task in the Task.init call - you can find some examples in https://clear.ml/docs/latest/docs/faq#git-and-storage (the second issue)
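For reference, the config-file route mentioned above might look roughly like this sketch (the bucket URI is a placeholder, not from the thread; the section layout follows the sample clearml.conf shipped with the SDK):

```
# ~/clearml.conf
sdk {
    development {
        # default destination for task artifacts, instead of files.clear.ml
        default_output_uri: "gs://my-bucket/clearml"
    }
}
```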

  
  
Posted 2 years ago

Thanks TimelyPenguin76.
From your reply I understand that I have control over what the destination is but that all files generated in a task get transferred regardless of the return_values decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid auto-saving of dataset files and perhaps also of model files.
Can I have better control over what gets uploaded or, if that's not an option, turn it off and post manually?

  
  
Posted 2 years ago

Sure, all the auto magic can be configured too - https://clear.ml/docs/latest/docs/faq#experiments , search for "Can I control what ClearML automatically logs?" to view all the options 🙂

  
  
Posted 2 years ago

Hi again.
Thanks for the previous replies and links, but I haven't been able to find the answer to my question: how do I prevent the content of a uri returned by a task from being saved by clearml at all?

I'm using this simplified snippet (that avoids fastai and large data):
```python
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(
    return_values=["run_datasets_path"], cache=False, task_type=TaskTypes.data_processing
)
def make_dataset(datasets_path, run_id):
    from pathlib import Path

    run_datasets_path = Path(datasets_path) / run_id
    run_datasets_path.mkdir(parents=True, exist_ok=True)
    with open(run_datasets_path / 'very_large_data_file.txt', 'w') as fp:
        fp.write('large amount of data\n')
    return run_datasets_path


@PipelineDecorator.pipeline(
    name="test_pipeline",
    project="lavi_evaluation",
    version="0.2",
)
def fastai_image_classification_pipeline(datasets_path, run_id):
    print("make dataset")
    run_dataset_path = make_dataset(datasets_path=datasets_path, run_id=run_id)
    print(f"ret run_dataset_path: {run_dataset_path}")
    print("pipeline complete")


if __name__ == "__main__":
    PipelineDecorator.run_locally()
    fastai_image_classification_pipeline("/data/my_datasets_path", 'run_id_1')
```

Two things here puzzle me:

- The contents of run_datasets_path are zipped and saved to the clearml files server. I want them to go nowhere, not even to some alternative location.
- The return value of my task is changed from the path where my task writes its files to the cache path that clearml uses. I'd like to understand why this happens (and how to avoid it). Also, I'd like to know why caching is applied in spite of the decorator containing cache=False.

Help very much appreciated. I know that in real scenarios data generated within some node would need to go somewhere or it will be deleted, but I'd like to see how this can be controlled and done with/without clearml automation.

  
  
Posted 2 years ago

image

  
  
Posted 2 years ago

image

  
  
Posted 2 years ago

How do I prevent the content of a uri returned by a task from being saved by clearml at all?

I think the safest way of doing so is to change the clearml files server configuration in your ~/clearml.conf file. You can set https://github.com/allegroai/clearml/blob/master/docs/clearml.conf#L10 to some local mnt path, for example, or to some internal storage service (like minio), and the defaults - including artifacts, debug images and more - will be saved to this location. Can this solve the issue?

The contents of run_datasets_path are zipped and saved to the clearml files server. I want them to go nowhere, not even to some alternative location

The safest is not to return any value in this case (from the arg doc - Notice! If not provided no results will be stored as artifacts.)

The return value of my task is modified from the path where files are written by my task to the cache path that clearml uses. I'd like to understand why this happens (and how to avoid it). Also, I'd like to know why caching is applied in spite of the decorator containing cache=False

This is caching for the pipeline step (if the pipeline runs again, it will run the step again and won't reuse a previous step's result; caching is useful for uploading data, for example - you can upload the data only once instead of in every run of the pipeline). Makes sense?

  
  
Posted 2 years ago

I found that instead of returning some_returned_url (which triggers zipping and saving of the files under that url), I can wrap it in a dict: {"the url": some_returned_url}, which then lets me pass the url back to the pipeline, and only that dict gets uploaded (e.g. {'run_datasets_path': Path('/data/my_datasets_path/run_id_1')} ).
I can divert all files that I do want uploaded and tracked by clearml to gs:// by adding at the start of the task function: Logger.current_logger().set_default_upload_destination(" ")
Is there a way to set the default upload destination for all tasks in my ~/clearml.conf ?
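For anyone else hitting this, the dict-wrapping trick looks roughly like the sketch below (the component decorator is omitted so it runs standalone, and the paths are illustrative; the point is only the shape of the return value):

```python
from pathlib import Path


def make_dataset(datasets_path, run_id):
    # ... the task writes its large files under this directory ...
    run_datasets_path = Path(datasets_path) / run_id
    # Returning the bare path makes clearml zip and upload the directory
    # contents as the step's artifact; returning a dict that merely
    # contains the path uploads only the small dict itself.
    return {"run_datasets_path": str(run_datasets_path)}
```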

  
  
Posted 2 years ago

Is there a way to set the default upload destination for all tasks in my ~/clearml.conf

Yes, by setting files_server: gs://clearml-evaluation/
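In ~/clearml.conf that would look something like this sketch (bucket name taken from the answer above; the api section layout follows the sample clearml.conf):

```
# ~/clearml.conf
api {
    # artifacts, debug samples, etc. go here instead of files.clear.ml
    files_server: "gs://clearml-evaluation/"
}
```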

  
  
Posted 2 years ago

Hi there,

PanickyMoth78
I am having the same issue.
Some steps of the pipeline create huge datasets (some GBs) that I don't want to upload or save.
Wrapping the returns in a dict could be a solution, but honestly, I don't like it.

AgitatedDove14 Is there any better way to avoid the upload of some artifacts of pipeline steps?

The image above shows an example of the first step of a training pipeline, which queries data from a feature store.
It gets the DataFrame, zips it, and uploads it (this one is very small, but in practice they are really big).
How can I avoid this?

  
  
Posted 2 years ago

Is there any better way to avoid the upload of some artifacts of pipeline steps?

How would you pass "huge datasets (some GBs)" between different machines without storing it somewhere?
(btw, I would also turn on component caching so if this is the same code with the same arguments the pipeline step is reused instead of reexecuted all over again)

  
  
Posted 2 years ago

The pipelines run on the same machine.
We already have the feature-store to save all the data; that's why I don't need to save it again (just a reference to the dataset version).

I understand your point.
I can have different steps of the pipeline running on different machines. But this is not my use case.

  
  
Posted 2 years ago

We already have the feature-store to save all the data; that's why I don't need to save it again (just a reference to the dataset version).

That makes sense, so why don't you point to the feature store?

I can have different steps of the pipeline running on different machines. But this is not my use case.

If they are running on the same machine, you can basically return a path to local storage, or change the output_uri to local storage; this will cause them to get serialized to the local machine's file system, wdyt?

  
  
Posted 2 years ago

that makes sense, so why don't you point to the feature store?

I did - the first step of the pipeline queries the feature store. I mean, I set the data version as a parameter, then this step queries the data and returns it (to be used in the next step).

  
  
Posted 2 years ago

this will cause them to get serialized to the local machine’s file system, wdyt?

I am worried about the disk space usage that may increase over time.
I'd just prefer not to worry about that.

  
  
Posted 2 years ago

Well, you do somehow need to pass the data, no?

  
  
Posted 2 years ago

These are the steps of the pipeline

  
  
Posted 2 years ago

The transformation has some parameters that we change occasionally.
I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate.

  
  
Posted 2 years ago

I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate

Makes total sense. My only question (and sorry if I'm dwelling on it too much) is: how would you pass the data from step 2 to step 3, if each is a different process on the same machine?

  
  
Posted 2 years ago

I see now.
I didn't know that each step runs in a different process.
Thus, the return data from step 2 needs to be available somewhere to be used in step 3.
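That handoff constraint can be illustrated without clearml at all: when the producer and consumer are separate processes, the producer has to serialize its result somewhere the consumer can read it (this is what clearml's artifact upload does for you; the file name below is purely illustrative):

```python
import os
import pickle
import tempfile


def step2():
    # pretend this is the DataFrame queried from the feature store
    return {"rows": [1, 2, 3]}


def run_handoff():
    # step 2 serializes its return value to shared storage ...
    path = os.path.join(tempfile.mkdtemp(), "step2_output.pkl")
    with open(path, "wb") as fp:
        pickle.dump(step2(), fp)
    # ... and step 3, possibly a different process, loads it back
    with open(path, "rb") as fp:
        return pickle.load(fp)
```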

  
  
Posted 2 years ago

So, how could wrapping the returns in a dict be a solution?
Will it serialize the data in the dict? (leading to the same result - data stored somewhere)

  
  
Posted 2 years ago

Thus, the return data from step 2 needs to be available somewhere to be used in step 3.

Yep 🙂

It will serialize the data on the dict?

I thought it would just point to a local file location where you have the data 🙂

I didn't know that each step runs in a different process

Actually! You can run them as functions as well, try:

```python
if __name__ == '__main__':
    PipelineDecorator.debug_pipeline()
    # call the pipeline function here
```

It will just run them as functions (return values included).

  
  
Posted 2 years ago

Got it.
Thanks for explanation AgitatedDove14 ! 😀

  
  
Posted 2 years ago

Sure thing 🙂

  
  
Posted 2 years ago