I'll try to shed some light on these modules and use cases.
StorageManager is, generally speaking, a low-level utility for accessing http/object-storage/files. In most cases there is no need to use it directly if the objects are already stored/managed by ClearML (for example artifacts/models/datasets). But it is quite handy to use with your own S3 buckets etc.
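A minimal sketch of that direct StorageManager usage, assuming the `clearml` package is installed and credentials for the bucket are configured. The helper names and any URL you pass in are illustrative, not something from this thread.

```python
def fetch_remote_file(remote_url: str) -> str:
    """Download (and cache) a remote object, returning the local path."""
    # Deferred import: lets the sketch load without a configured ClearML setup.
    from clearml import StorageManager

    return StorageManager.get_local_copy(remote_url=remote_url)


def push_local_file(local_file: str, remote_url: str) -> str:
    """Upload a local file to the given remote URL and return that URL."""
    from clearml import StorageManager

    return StorageManager.upload_file(local_file=local_file, remote_url=remote_url)
```

For example, `fetch_remote_file("s3://my-bucket/path/file.csv")` with a hypothetical bucket of your own.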
Artifacts: Passing an artifact between Tasks will usually be something like:
artifact_object = Task.get_task('task_id').artifacts['my_artifact'].get()
Which will download (and cache) the artifact and will also de-serialize it into a Python object.
Datasets are just a way to get a folder with files without worrying about where I'm running (i.e. accessing my dataset anywhere).
Usually it will be something like:
my_local_dataset_copy_directory = Dataset.get('dataset_id').get_local_copy()
Make sense?
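The two one-liners above, wrapped as small helpers. This is only a sketch: it assumes `clearml` is installed and a server is configured, and the helper names are made up for illustration.

```python
def load_artifact(task_id: str, artifact_name: str):
    """Download (and cache) an artifact of another task and de-serialize it."""
    # Deferred import: lets the sketch load without a configured ClearML setup.
    from clearml import Task

    producer = Task.get_task(task_id=task_id)
    return producer.artifacts[artifact_name].get()


def dataset_folder(dataset_id: str) -> str:
    """Return the path of a local (cached) copy of the dataset's files."""
    from clearml import Dataset

    return Dataset.get(dataset_id=dataset_id).get_local_copy()
```

The artifact route hands you a Python object, while the dataset route hands you a directory path to read files from.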
AgitatedDove14 as always, much obliged for your fast responses, this is actually incredible!
Yeah a bit clearer, something like this in the docs would be really helpful 😉 At least the last part as Storagemanager is actually quite clear.
Maybe I can sum up my understanding?
So am I right in assuming that I can manage data, and pass it between tasks, by either:
1. Managing it in a folder structure via Datasets, with the potential issue of syncing a lot of data between tasks and workers (obviously accounting for caching on workers between tasks)
2. Managing it as artefacts of Tasks and passing them, explicitly coupled to tasks, to another task? In that case losing some of the tracing of datasets and the nice graph?
3. Using both simultaneously as necessary 😄
Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.
JealousParrot68 Some usability comments - Since ClearML is opinionated, there are several pipeline workflow behaviors that make sense if you use Datasets and Artefacts interchangeably, e.g. the step caching AgitatedDove14 mentioned. Also for Datasets, if you combine them with a dedicated subproject like I did on my show, then you have the pattern where asking for the dataset of that subproject will always give you the most up-to-date dataset. Thus you can reuse your pipelines without having to know exactly which version should be used "right now". This embodies our ideal of "decoupling code from data".
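A rough sketch of the "ask the subproject for its dataset" pattern. It assumes `clearml` is installed and that looking a dataset up by project and name (rather than by ID) resolves to the most recent version, which is worth checking against the SDK version you run. The helper name and the project/name values are hypothetical.

```python
def latest_dataset_folder(dataset_project: str, dataset_name: str) -> str:
    """Fetch a dataset by project/name instead of a pinned ID."""
    # Deferred import: lets the sketch load without a configured ClearML setup.
    from clearml import Dataset

    # Assumption: without a dataset_id, the lookup returns the latest version.
    dataset = Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    return dataset.get_local_copy()
```

A pipeline step could then call `latest_dataset_folder("my_project/datasets", "training_data")` and never mention a dataset version in code.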
Re: boilerplate / fluidity in roles - I think this is what makes ClearML shine for R&D workflows. We can't hope to guess exactly how everyone's MLOps is taking shape, but we can help you get what you need with the fewest lines of code possible.
What my show is aiming to convey at that arc is that you can quickly build on top of our abstractions the functionality that suits you the best. As usually occurs, should you find that you are writing the same code over and over - you could always refactor that out (and maybe submit a nice PR? 😍 ).
Hope this helps a bit as well. If there is anything you'd like to see me going over in the show, let me know 😉
AgitatedDove14 Might be just an error on my side, but if I use a pandas DataFrame as an artefact and then use the .get() method in another task, I get a compression error. If I use .get_local_copy() I can use:
df = pd.read_csv(task.artifacts['bla'].get_local_copy(), compression=None)
and it works. But I need the compression=None, otherwise I'll get the same error as with .get().
I'll build a minimal example tomorrow for you.
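One possible workaround sketch for this, using only the standard library: decide whether to gunzip by looking at the file's first two bytes instead of trusting its extension. The helper name `open_csv_text` is made up here, and this assumes the downloaded file is either a real gzip stream or plain text.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_csv_text(path: str, encoding: str = "utf-8"):
    """Open a CSV as text, trusting the file content over its extension."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding=encoding, newline="")
    # Named like a gzip file perhaps, but the bytes are already plain text.
    return open(path, "r", encoding=encoding, newline="")
```

The returned handle can be given to `pd.read_csv(...)` in place of the path, so the same call should work whether or not the file was already decompressed on the way down.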
This is the same as:
There is something odd happening in the files-server: it replaces the header (i.e. guessing the content of the stream) and this breaks the download (what happens is that the client automatically un-gzips the csv).
We are working on a hot fix for the issue (BTW: if you are using object-storage / shared folders, this will not happen).
Ok, the caching part is nice. I think the tricky part (as always) is going to be all the edge cases. E.g. in my preprocessing pipeline I might have a lot of tasks so that I can parallelise nicely, but at the cost of quite a lot of boilerplate code for getting and writing artefacts, as well as having a lot of tasks in the UI. Let's see.
JealousParrot68 yes this seems like a correct description.
The main difference between 1 & 2 is what the actual data is: if it is training/testing data, then a Dataset makes sense; if it is part of a preprocessing pipeline, then artifacts make more sense. (Notice we added pipeline step caching for the artifacts, so that you can reuse steps if they have the same parameters/code, which means you are able to clone a pipeline and rerun it without repeating unnecessary data processing.)