Is there any better way to avoid the upload of some artifacts of pipeline steps?
How would you pass "huge datasets (some GBs)" between different machines without storing it somewhere?
(btw, I would also turn on component caching so if this is the same code with the same arguments the pipeline step is reused instead of reexecuted all over again)
I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate
Makes total sense, my only question (and sorry if I'm dwelling too much on it) is how would you pass the data from step 2 to step 3, if this is a different process on the same machine?
We already have the feature-store to save all data, that’s why I don’t need to save it (just a reference of version of dataset).
that makes sense, so why don't you point to the feature store ?
I can have different steps of the pipeline running on different machines. But this is not my use case.
if they are running on the same machine you can basically return a path to the local storage or change the output_uri to the local storage, this will cause them to get serialized to the local machine's file system, wdyt?
I see now.
I didn’t know that each step runs in a different process
Thus, the return data from step 2 needs to be available somewhere to be used in step 3.
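A minimal sketch of the "return a path, not the data" idea from this exchange, assuming both steps run on the same machine and share a local filesystem. It uses only the standard library, not ClearML; `step_two`, `step_three_in_new_process`, `PAYLOAD` and `STEP_THREE_SOURCE` are hypothetical names made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

PAYLOAD = "large amount of data\n"  # stand-in for a multi-GB dataset

# Source of the consumer step; it runs in a separate Python process.
STEP_THREE_SOURCE = (
    "import sys, pathlib; "
    "print(len(pathlib.Path(sys.argv[1]).read_text()))"
)


def step_two(out_dir):
    """Write the dataset to local storage and return only its path."""
    data_file = Path(out_dir) / "very_large_data_file.txt"
    data_file.write_text(PAYLOAD)
    return str(data_file)


def step_three_in_new_process(data_path):
    """Run the consumer in a child process, handing it just the path string."""
    result = subprocess.run(
        [sys.executable, "-c", STEP_THREE_SOURCE, data_path],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = step_two(tmp)
        print(step_three_in_new_process(path))
```

Only the short path string crosses the process boundary; the file contents stay on local disk, which is why this approach depends on the steps sharing a filesystem.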
How do I prevent the content of a uri returned by a task from being saved by clearml at all?
I think the safest way to do so is to change the clearml files server configuration in your ~/clearml.conf file. You can set https://github.com/allegroai/clearml/blob/master/docs/clearml.conf#L10 to some local mount path, for example, or to some internal storage service (like minio), and everything, including artifacts, debug images and more, will be saved in this location by default. Can this solve the issue?
The contents of run_datasets_path are zipped and saved to the clearml files server. I want them to go nowhere, not even to some alternative location
The safest option is not to return any value in this case (from the argument doc: "Notice! If not provided no results will be stored as artifacts.")
The return value of my task is modified from the path where files are written by my task to the cache path that clearml uses. I’d like to understand why this happens (and how to avoid it). Also, I’d like to know why caching is applied in spite of the decorator containing cache=False
The cache argument controls caching of the pipeline step. With cache=False, if the pipeline runs again it will run the step again and won't reuse a previous execution of that step. Caching is useful when uploading data, for example: you upload the data only once and not on every run of the pipeline. Make sense?
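To illustrate the step-caching idea described in this thread (same code and same arguments means the step is reused instead of re-executed), here is a rough standard-library sketch. It is not how ClearML implements caching; `cached_step`, `_step_cache` and `upload_data` are hypothetical names, and the in-memory dict only stands in for previously stored step results.

```python
import functools
import hashlib
import json

_step_cache = {}  # hypothetical stand-in for stored results of earlier step runs


def cached_step(func):
    """Reuse a step's result when its bytecode and keyword arguments are unchanged."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        # Key = hash of the function's compiled bytecode plus its arguments.
        key_material = func.__code__.co_code + json.dumps(kwargs, sort_keys=True).encode()
        key = hashlib.sha256(key_material).hexdigest()
        if key not in _step_cache:
            _step_cache[key] = func(**kwargs)  # first time: actually execute the step
        return _step_cache[key]
    return wrapper


calls = []  # records which executions really happened


@cached_step
def upload_data(version):
    calls.append(version)
    return f"dataset-{version}"
```

Calling `upload_data(version="v1")` twice executes the body once; changing the argument to `"v2"` produces a new key and a fresh execution.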
Is there a way to set the default upload destination for all tasks in my ~/clearml.conf?
... yes, by setting files_server: gs://clearml-evaluation/
Pipelines run on the same machine.
I understand your point.
Note that if I change the component to return a regular meaningless string - "mock_path", the pipeline completes rather quickly and the dataset is not uploaded.
Hi PanickyMoth78 ,
I think it will use the cache from the second run, it should be much much quicker (nothing to download).
The files server is the default for saving all the artifacts. You can change this default with an env var (CLEARML_DEFAULT_OUTPUT_URI) or in the config file (sdk.development.default_output_uri), or for each task in the Task.init call. You can find some examples in https://clear.ml/docs/latest/docs/faq#git-and-storage (the second issue)
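A short sketch of the options listed above, under a few assumptions: the bucket URI is simply the gs:// destination mentioned elsewhere in this thread, `init_task_with_destination` is a hypothetical helper name, and the environment variable is assumed to need setting before clearml is imported. The `Task.init(output_uri=...)` argument is the per-task route the reply refers to.

```python
import os

# Hypothetical destination, reusing the bucket mentioned in this thread.
DEFAULT_OUTPUT_URI = "gs://clearml-evaluation/"

# Option 1: environment variable. Assumption: set it before clearml is imported
# so the SDK sees it when its configuration is loaded.
os.environ.setdefault("CLEARML_DEFAULT_OUTPUT_URI", DEFAULT_OUTPUT_URI)

# Option 2: the equivalent entry in ~/clearml.conf (shown here as a comment):
#   sdk { development { default_output_uri: "gs://clearml-evaluation/" } }


def init_task_with_destination(project_name, task_name, output_uri=DEFAULT_OUTPUT_URI):
    """Option 3: set the destination for a single task via Task.init."""
    # Imported lazily so this snippet loads even where clearml is not installed.
    from clearml import Task
    return Task.init(project_name=project_name, task_name=task_name, output_uri=output_uri)
```

Note that all three only change where artifacts go; they do not stop the upload itself.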
The transformation has some parameters that we change occasionally
that makes sense, so why don’t you point to the feature store ?
I did, the first step of the pipeline queries the feature store. I mean, I set the data version as a parameter, then this step queries the data and returns it (to be used in the next step)
I found that instead of returning some_returned_url (which triggers zipping and saving of the files under that url), I can wrap it in a dict: {"the url": some_returned_url}, which then lets me pass back the url to the pipeline, and only that dict gets uploaded (e.g. {'run_datasets_path': Path('/data/my_datasets_path/run_id_1')}).
I can divert all files that I do want uploaded and tracked by clearml to gs:// by adding at the start of the task function: Logger.current_logger().set_default_upload_destination("")
Is there a way to set the default upload destination for all tasks in my ~/clearml.conf?
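A rough sketch of why the dict workaround keeps the upload small, assuming the pipeline serializes whatever the step returns. The function below mirrors `make_dataset` from this thread but is left undecorated so it runs without ClearML; `serialized_size` and `data_size` are hypothetical helpers, and pickle is only a stand-in for whatever serialization the pipeline actually applies.

```python
import pickle
import tempfile
from pathlib import Path


def make_dataset(datasets_path, run_id):
    """Write the data locally and return a dict that only references it."""
    run_datasets_path = Path(datasets_path) / run_id
    run_datasets_path.mkdir(parents=True, exist_ok=True)
    data_file = run_datasets_path / "very_large_data_file.txt"
    data_file.write_text("large amount of data\n" * 10_000)
    # Wrapping the path in a dict: the returned object carries a reference,
    # not the directory contents.
    return {"run_datasets_path": run_datasets_path}


def serialized_size(obj):
    """Bytes needed to serialize the returned object (pickle as a stand-in)."""
    return len(pickle.dumps(obj))


def data_size(directory):
    """Total bytes of the files written under the dataset directory."""
    return sum(f.stat().st_size for f in Path(directory).rglob("*") if f.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = make_dataset(tmp, "run_id_1")
        print(serialized_size(result), data_size(result["run_datasets_path"]))
```

The serialized dict stays a few hundred bytes no matter how large the dataset grows, which matches the observation that only the dict gets uploaded.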
Thus, the return data from step 2 needs to be available somewhere to be used in step 3.
Yep 🙂
It will serialize the data on the dict?
I thought it will just point to a local file location where you have the data 🙂
I didn’t know that each step runs in a different process
Actually! You can run them as functions as well, try:
```
if __name__ == '__main__':
    PipelineDecorator.debug_pipeline()
    # call pipeline function here
```
It will just run them as functions (return values included)
Got it.
Thanks for explanation AgitatedDove14 ! 😀
Well you do somehow need to pass the data, no?
So, how could wrapping the returns in a dict be a solution?
It will serialize the data on the dict? (leading to the same result, data storage somewhere)
Hi again.
Thanks for the previous replies and links but I haven't been able to find the answer to my question: How do I prevent the content of a uri returned by a task from being saved by clearml at all?
I'm using this simplified snippet (that avoids fastai and large data):
```
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(
    return_values=["run_datasets_path"], cache=False, task_type=TaskTypes.data_processing
)
def make_dataset(datasets_path, run_id):
    from pathlib import Path

    run_datasets_path = Path(datasets_path) / run_id
    run_datasets_path.mkdir(parents=True, exist_ok=True)
    with open(run_datasets_path / 'very_large_data_file.txt', 'w') as fp:
        fp.write('large amount of data\n')
    return run_datasets_path


@PipelineDecorator.pipeline(
    name="test_pipeline",
    project="lavi_evaluation",
    version="0.2",
)
def fastai_image_classification_pipeline(datasets_path, run_id):
    print("make dataset")
    run_dataset_path = make_dataset(datasets_path=datasets_path, run_id=run_id)
    print(f"ret run_dataset_path: {run_dataset_path}")
    print("pipeline complete")


if __name__ == "__main__":
    from pathlib import Path

    PipelineDecorator.run_locally()
    fastai_image_classification_pipeline("/data/my_datasets_path", 'run_id_1')
```
The contents of run_datasets_path are zipped and saved to the clearml files server. I want them to go nowhere, not even to some alternative location.
The return value of my task is modified from the path where files are written by my task to the cache path that clearml uses. I'd like to understand why this happens (and how to avoid it).
Also, I'd like to know why caching is applied in spite of the decorator containing cache=False.
Help very much appreciated. I know that in real scenarios data generated within some node would need to go somewhere or it will be deleted but I'd like to see how this can be controlled and done with/without clearml automation.
I'm connecting to the hosted clear.ml.
Packages in use are:
# Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
clearml == 1.6.2
fastai == 2.7.5
In case it matters, I'm running this code in a jupyter notebook within a docker container (to keep things well isolated). The /data path is volume mapped to my local filesystem (and, in fact, already contains the dataset files, so the fastai call to untar_data should see the data there and return immediately).
That same make_data function call works as expected when not decorated (i.e. outside a pipeline)
Sure, all the auto magic can be configured too - https://clear.ml/docs/latest/docs/faq#experiments , search for "Can I control what ClearML automatically logs?" to view all the options 🙂
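As a hedged illustration of the kind of control that FAQ entry covers, `Task.init` accepts an `auto_connect_frameworks` argument that can be a dict of per-framework switches. The selection below is only an assumed example (keep matplotlib and tensorboard capture, turn off automatic PyTorch model logging), `init_task` is a hypothetical helper, and how this interacts with tasks created by pipeline components is not confirmed in this thread.

```python
# Assumed example selection of per-framework auto-logging switches.
AUTO_CONNECT_FRAMEWORKS = {
    "matplotlib": True,    # keep uploading matplotlib figures
    "tensorboard": True,   # keep capturing tensorboard events
    "pytorch": False,      # skip automatic logging of PyTorch model files
}


def init_task(project_name, task_name):
    """Create a task with the framework switches above applied."""
    # Imported lazily so this snippet loads even where clearml is not installed.
    from clearml import Task
    return Task.init(
        project_name=project_name,
        task_name=task_name,
        auto_connect_frameworks=AUTO_CONNECT_FRAMEWORKS,
    )
```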
Hi there, PanickyMoth78
I am having the same issue.
Some steps of the pipeline create huge datasets (some GBs) that I don’t want to upload or save.
Wrapping the return values in a dict could be a solution, but honestly, I don’t like it.
AgitatedDove14 Is there any better way to avoid the upload of some artifacts of pipeline steps?
The image above shows an example of the first step of a training pipeline, which queries data from a feature store.
It gets the DataFrame, zips it and uploads it (this one is very small, but in practice they are really big).
How can I avoid this?
Thanks TimelyPenguin76 .
From your reply I understand that I have control over what the destination is but that all files generated in a task get transferred regardless of the return_values decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid auto-saving of dataset files and perhaps also of model files.
Can I have better control over what gets uploaded or, if that's not an option, turn it off and post manually?
The pipeline eventually completed after ~20 minutes and the log shows it has downloaded a 755 MB file.
I can also download the zip file from the artifacts tab for the component now.
Why is the data being up/down loaded? Can I prevent that?
I get that clearml likes to take good care of my data but I must be doing something wrong here as it doesn't make sense for a dataset to be uploaded to files.clear.ml.
this will cause them to get serialized to the local machine’s file system, wdyt?
I am concerned about the disk space usage, which may increase over time.
I'd just prefer not to have to worry about that