Hi. I'M Running This Little Pipeline:

Answered

Hi.
I'm running this little pipeline:

```python
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(return_values=['run_dataset_path'], cache=False, task_type=TaskTypes.data_processing)
def make_dataset(datasets_path):
    from fastai.vision.all import untar_data, URLs
    from pathlib import Path
    datasets_path = Path(datasets_path)

    # The next line fetches files to /data/clearml_evaluation/fastai_image_classification/datasets/oxford-iiit-pet/
    # or skips them if they are already there
    run_dataset_path = str(untar_data(URLs.PETS, data=datasets_path) / "images")
    return run_dataset_path


@PipelineDecorator.pipeline(name='fastai_image_classification_pipeline', project='lavi_evaluation', version='0.2')
def fastai_image_classification_pipeline(run_datasets_path):
    print("make dataset")
    run_dataset_path = make_dataset(datasets_path=run_datasets_path)
    print(f"run_dataset_path: {run_dataset_path}")
    print("pipeline complete")


if __name__ == '__main__':
    PipelineDecorator.run_locally()
    fastai_image_classification_pipeline("/data/clearml_evaluation/fastai_image_classification/datasets/")
```

I expected the pipeline component to (quickly) download files to a local path on my laptop and return a string with the path to where the files were downloaded. When I run this I get the printout below and the code seems to hang. Looking at the component's "artifacts" tab, I think that it is actually uploading all the files to files.clear.ml

Am I doing something wrong? Why is this taking so long?

  				
Posted 2 years ago by PanickyMoth78


Answers 27

that makes sense, so why don’t you point to the feature store ?

I did. The first step of the pipeline queries the feature store. I mean, I set the data version as a parameter, then this step queries the data and returns it (to be used in the next step).

  				
Posted 2 years ago by IrritableGiraffe81

  				
Posted 2 years ago by PanickyMoth78

The pipeline runs on the same machine.
We already have the feature store to save all data, that's why I don't need to save it (just a reference to the dataset version).

I understand your point.
I can have different steps of the pipeline running on different machines. But this is not my use case.

  				
Posted 2 years ago by IrritableGiraffe81

  				
Posted 2 years ago by PanickyMoth78

I'm connecting to the hosted clear.ml
Packages in use are:
# Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
clearml == 1.6.2
fastai == 2.7.5
In case it matters, I'm running this code in a jupyter notebook within a docker container (to keep things well isolated). The /data path is volume mapped to my local filesystem (and, in fact, already contains the dataset files, so the fastai call to untar_data should see the data there and return immediately).
That same make_dataset function call works as expected when not decorated (i.e. outside a pipeline).

  				
Posted 2 years ago by PanickyMoth78

Thanks TimelyPenguin76 .
From your reply I understand that I have control over what the destination is but that all files generated in a task get transferred regardless of the return_values decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid auto-saving of dataset files and perhaps also of model files.
Can I have better control over what gets uploaded or, if that's not an option, turn it off and post manually?

  				
Posted 2 years ago by PanickyMoth78

Hi PanickyMoth78 ,

Note that if I change the component to return a regular meaningless string - "mock_path", the pipeline completes rather quickly and the dataset is not uploaded.

I think it will use the cache from the second run, it should be much much quicker (nothing to download).

The files server is the default destination for all artifacts. You can change this default with an env var (CLEARML_DEFAULT_OUTPUT_URI) or in the config file (sdk.development.default_output_uri), or set it for each task in the Task.init call. You can find some examples in https://clear.ml/docs/latest/docs/faq#git-and-storage (the second issue).
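
For reference, a minimal sketch of the environment-variable route mentioned above. The destination path is a hypothetical placeholder, and the sketch assumes the variable is set before the ClearML SDK loads its configuration (for instance at the very top of the notebook or script):

```python
import os

# Hypothetical destination: replace with a bucket or mount point you control.
ARTIFACT_DESTINATION = "/mnt/shared/clearml_artifacts"

# Assumption: the variable must be in place before clearml is imported,
# so that the SDK picks it up when it reads its configuration.
os.environ["CLEARML_DEFAULT_OUTPUT_URI"] = ARTIFACT_DESTINATION

print(os.environ["CLEARML_DEFAULT_OUTPUT_URI"])
```

This only changes where artifacts go. It does not stop them from being stored.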

  				
Posted 2 years ago by TimelyPenguin76 (Administrator)

The pipeline eventually completed after ~20 minutes and the log shows it has downloaded a 755 MB file.
I can also download the zip file from the artifacts tab for the component now.
Why is the data being uploaded and downloaded? Can I prevent that?
I get that clearml likes to take good care of my data but I must be doing something wrong here as it doesn't make sense for a dataset to be uploaded to files.clear.ml .

  				
Posted 2 years ago by PanickyMoth78

I found that instead of returning some_returned_url (which triggers zipping and saving of the files under that url), I can wrap it in a dict: {"the url": some_returned_url}. That lets me pass the url back to the pipeline, and only that dict gets uploaded (e.g. {'run_datasets_path': Path('/data/my_datasets_path/run_id_1')} ).

I can divert all files that I do want uploaded and tracked by clearml to gs:// by adding this at the start of the task function: Logger.current_logger().set_default_upload_destination(" ")

Is there a way to set the default upload destination for all tasks in my ~/clearml.conf ?
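
A minimal sketch of the dict workaround described above, based on the simplified make_dataset from this thread. The @PipelineDecorator.component decorator is left out here so the snippet runs without ClearML installed. Inside a pipeline the function would carry that decorator. The claim that only the small dict gets stored is the observation reported in this post, not something this sketch verifies:

```python
import tempfile
from pathlib import Path


def make_dataset(datasets_path, run_id):
    """Write the run's files locally and return only a reference to them."""
    run_datasets_path = Path(datasets_path) / run_id
    run_datasets_path.mkdir(parents=True, exist_ok=True)
    with open(run_datasets_path / "very_large_data_file.txt", "w") as fp:
        fp.write("large amount of data\n")
    # Return a small dict holding the location instead of the path itself.
    # Per the observation in this thread, the dict is what gets stored.
    return {"run_datasets_path": str(run_datasets_path)}


if __name__ == "__main__":
    # A temporary directory stands in for /data/my_datasets_path.
    with tempfile.TemporaryDirectory() as tmp:
        result = make_dataset(tmp, "run_id_1")
        print(result["run_datasets_path"])
```

The caller then reads the location back with result["run_datasets_path"].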

  				
Posted 2 years ago by PanickyMoth78

Sure, all the auto magic can be configured too - https://clear.ml/docs/latest/docs/faq#experiments , search for Can I control what ClearML automatically logs? to view all the options 🙂

  				
Posted 2 years ago by TimelyPenguin76 (Administrator)

Hi there,

PanickyMoth78
I am having the same issue.
Some steps of the pipeline create huge datasets (some GBs) that I don’t want to upload or save.
Wrapping the returns in a dict could be a solution, but honestly, I don't like it.

AgitatedDove14 Is there any better way to avoid the upload of some artifacts of pipeline steps?

The image above shows an example of the first step of a training pipeline, that queries data from a feature store.
It gets the DataFrame, zips it and uploads it (this one is very small, but in practice they are really big).
How can I avoid this?

  				
Posted 2 years ago by IrritableGiraffe81

this will cause them to get serialized to the local machine’s file system, wdyt?

I am worried about the disk space usage, which may increase over time.
I would just prefer not to worry about that.

  				
Posted 2 years ago by IrritableGiraffe81

Note that if I change the component to return a regular meaningless string - "mock_path" , the pipeline completes rather quickly and the dataset is not uploaded.

  				
Posted 2 years ago by PanickyMoth78

Is there a way to set the default upload destination for all tasks in my ~/clearml.conf

... yes, by setting files_server: gs://clearml-evaluation/

  				
Posted 2 years ago by PanickyMoth78

The transformation has some parameters that we eventually change.
I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate

  				
Posted 2 years ago by IrritableGiraffe81

I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate

Makes total sense. My only question (and sorry if I'm dwelling too much on it) is how would you pass the data from step 2 to step 3, if this is a different process on the same machine?

  				
Posted 2 years ago by AgitatedDove14

I see now.
I didn't know that each step runs in a different process.
Thus, the return data from step 2 needs to be available somewhere to be used in step 3.
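
To illustrate the point, here is a minimal sketch of handing data over by reference. The two plain functions stand in for pipeline steps that would run in separate processes on the same machine. The function names, the file name and the JSON format are hypothetical choices for illustration, not part of ClearML:

```python
import json
import tempfile
from pathlib import Path


def step_two(shared_dir):
    """Produce data, persist it to shared storage, and return only its location."""
    out_path = Path(shared_dir) / "features.json"
    out_path.write_text(json.dumps({"rows": [1, 2, 3]}))
    return str(out_path)


def step_three(features_path):
    """Load the data from the location handed over by the previous step."""
    rows = json.loads(Path(features_path).read_text())["rows"]
    return sum(rows)


if __name__ == "__main__":
    # A temporary directory stands in for storage that both steps can reach.
    with tempfile.TemporaryDirectory() as shared_dir:
        features_path = step_two(shared_dir)
        print(step_three(features_path))
```

Only the short path string crosses the step boundary. The data itself stays where step 2 wrote it, so both steps must be able to reach that location.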

  				
Posted 2 years ago by IrritableGiraffe81

We already have the feature-store to save all data, that’s why I don’t need to save it (just a reference of version of dataset).

that makes sense, so why don't you point to the feature store ?

I can have different steps of the pipeline running on different machines. But this is not my use case.

if they are running on the same machine you can basically return a path to the local storage or change the output_uri to the local storage, this will cause them to get serialized to the local machine's file system, wdyt?

  				
Posted 2 years ago by AgitatedDove14

So, how could wrapping the returns in a dict be a solution?
Won't it serialize the data in the dict (leading to the same result: the data stored somewhere)?

  				
Posted 2 years ago by IrritableGiraffe81

Hi again.
Thanks for the previous replies and links but I haven't been able to find the answer to my question: How do I prevent the content of a uri returned by a task from being saved by clearml at all?

I'm using this simplified snippet (that avoids fastai and large data):

```python
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(
    return_values=["run_datasets_path"], cache=False, task_type=TaskTypes.data_processing
)
def make_dataset(datasets_path, run_id):
    from pathlib import Path
    run_datasets_path = Path(datasets_path) / run_id
    run_datasets_path.mkdir(parents=True, exist_ok=True)
    with open(run_datasets_path / 'very_large_data_file.txt', 'w') as fp:
        fp.write('large amount of data\n')
    return run_datasets_path


@PipelineDecorator.pipeline(
    name="test_pipeline",
    project="lavi_evaluation",
    version="0.2",
)
def fastai_image_classification_pipeline(datasets_path, run_id):
    print("make dataset")
    run_dataset_path = make_dataset(datasets_path=datasets_path, run_id=run_id)
    print(f"ret run_dataset_path: {run_dataset_path}")
    print("pipeline complete")


if __name__ == "__main__":
    from pathlib import Path
    PipelineDecorator.run_locally()
    fastai_image_classification_pipeline("/data/my_datasets_path", 'run_id_1')
```

The contents of run_datasets_path are zipped and saved to the clearml files server. I want them to go nowhere, not even to some alternative location.

The return value of my task is modified from the path where files are written by my task to the cache path that clearml uses. I'd like to understand why this happens (and how to avoid it).

Also, I'd like to know why caching is applied in spite of the decorator containing cache=False.

Help very much appreciated. I know that in real scenarios data generated within some node would need to go somewhere or it will be deleted, but I'd like to see how this can be controlled and done with/without clearml automation.

  				
Posted 2 years ago by PanickyMoth78

Is there any better way to avoid the upload of some artifacts of pipeline steps?

How would you pass "huge datasets (some GBs)" between different machines without storing it somewhere?
(btw, I would also turn on component caching so if this is the same code with the same arguments the pipeline step is reused instead of reexecuted all over again)

  				
Posted 2 years ago by AgitatedDove14

How do I prevent the content of a uri returned by a task from being saved by clearml at all.

I think the safest way of doing so is to change the clearml files server configuration in your ~/clearml.conf file. You can set https://github.com/allegroai/clearml/blob/master/docs/clearml.conf#L10 to some local mount path, for example, or to some internal storage service (like minio), and everything, including artifacts, debug images and more, will be saved in this location by default. Can this solve the issue?

The contents of run_datasets_path are zipped and saved to the clearml files server. I want them to go nowhere, not even to some alternative location

The safest is not to return any value in this case (from the arg doc: "Notice! If not provided no results will be stored as artifacts.")

The return value of my task is modified from the path where files are written by my task to the cache path that clearml uses. I'd like to understand why this happens (and how to avoid it). Also, i'd like to know why caching is applied in spite of the decorator containing cache=False

This is caching for the pipeline step: with cache=False, if the pipeline runs again it will run the step again and won't reuse a previous execution of this step. Caching is useful for uploading data, for example: you can upload the data only once and not in every run of the pipeline. Make sense?

  				
Posted 2 years ago by TimelyPenguin76 (Administrator)

Thus, the return data from step 2 needs to be available somewhere to be used in step 3.

Yep 🙂

It will serialize the data on the dict?

I thought it will just point to a local file location where you have the data 🙂

I didn’t know that each steps runs in a different process

Actually ! you can run them as functions as well, try:
```python
if __name__ == '__main__':
    PipelineDecorator.debug_pipeline()
    # call pipeline function here
```

It will just run them as functions (return values included).

  				
Posted 2 years ago by AgitatedDove14

Sure thing 🙂

  				
Posted 2 years ago by AgitatedDove14

Got it.
Thanks for the explanation AgitatedDove14 ! 😀

  				
Posted 2 years ago by IrritableGiraffe81

Well you do somehow need to pass the data, no?

  				
Posted 2 years ago by AgitatedDove14

These are the steps of the pipeline

  				
Posted 2 years ago by IrritableGiraffe81
