Hmm Is There Any Clear (Pun Intended) Documentation On The Roles Of Storagemanager, Dataset And Artefacts? It Seems To Me There Are Various Overlapping Roles And I'M Not Sure I Fully Grasp The Best Way Of Using Them. Especially When Looking At The Way Da

Answered

Hmm, is there any clear (pun intended) documentation on the roles of StorageManager, Dataset and artefacts? It seems to me there are various overlapping roles and I'm not sure I fully grasp the best way of using them.

Especially when looking at the way datasets and artefacts are used in this series: https://www.youtube.com/watch?v=c2rBYMARCSc
as opposed to the pipeline tutorial in the github repo which works with registering parameters etc.

Also my impression is that I end up writing a lot of boilerplate code passing datasets/artefacts between different tasks when I am building a pipeline. Is that intended or am I missing something?

  				
Posted 4 years ago by JealousParrot68


Answers 18

The file itself is csv.gz compressed; it's actually the sending back from the file-server that messes things up
(you can test with output_uri=/tmp/folder)

  				
Posted 4 years ago by AgitatedDove14

maybe a pandas version issue?

  				
Posted 4 years ago by JealousParrot68

Working on it as we speak 🙂 Probably a day, worst case two. This is quite strange and we are not sure where the fault is, as nothing in the code itself changed...

  				
Posted 4 years ago by AgitatedDove14

any idea when that hot fix is coming?

  				
Posted 4 years ago by JealousParrot68

Ah I see, ok I'll have to wait then thanks

  				
Posted 4 years ago by JealousParrot68

Hi JealousParrot68
This is the same as:
https://clearml.slack.com/archives/CTK20V944/p1627819701055200
and,
https://github.com/allegroai/clearml/issues/411

There is something odd happening in the files-server: it replaces the header (i.e. it guesses the content of the stream), and this breaks the download (what happens is that the clients automatically ungzip the csv).
We are working on a hot fix for the issue (BTW: if you are using object-storage / shared folders, this will not happen)
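
To illustrate the failure mode described here, below is a minimal sketch using only the Python standard library. It does not touch ClearML or the files-server; it simply assumes the end result is a file that is named .csv.gz but already contains plain CSV bytes, and shows why trusting the extension fails. The helper name read_maybe_gzipped and the sample file are made up for the illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_maybe_gzipped(path):
    """Return the file's text, decompressing only if the content really is gzip."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as f:
        return f.read()


# Simulate the broken download: plain CSV bytes saved under a .csv.gz name.
tmp_dir = tempfile.mkdtemp()
broken_path = os.path.join(tmp_dir, "test.csv.gz")
with open(broken_path, "w", encoding="utf-8") as f:
    f.write("0,1,2\n1,2,3\n1,2,3\n")

# Trusting the extension fails, because the content is not gzip.
try:
    with gzip.open(broken_path, "rt", encoding="utf-8") as f:
        f.read()
    extension_trusted_ok = True
except OSError:  # gzip.BadGzipFile is a subclass of OSError
    extension_trusted_ok = False

# Sniffing the magic bytes reads the same file without an error.
text = read_maybe_gzipped(broken_path)
```

Checking the magic bytes instead of the file name is only a client-side stopgap for inspecting such a file; it is not the hot fix mentioned in this post.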

  				
Posted 4 years ago by AgitatedDove14

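Minimum example:
import pandas as pd
from clearml import Task
task = Task.init(project_name='examples', task_name='artifact-test')  # placeholder names
df = pd.DataFrame([[1,2,3], [1,2,3]])
task.upload_artifact('test', df)
task.artifacts['test'].get()  # this is the call that raises the error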

  				
Posted 4 years ago by JealousParrot68

AgitatedDove14 any idea on what that is?
If I have a task and I upload a dataframe with task.upload_artifact('test', dataframe)
and then on the same task do task.artifacts['test'].get(), I always get an error ...

Are you able to reproduce it?

  				
Posted 4 years ago by JealousParrot68

AgitatedDove14 Might be just an error on my side, but if I use a pandas DataFrame as an Artefact and then use the .get() method in another task, I get a compression error. If I use .get_local_copy() I can use: df = pd.read_csv(task.artifacts['bla'].get_local_copy(), compression=None) and it works. But I need the compression=None, otherwise I get the same error as with .get(). I'll build a minimal example for you tomorrow.

  				
Posted 4 years ago by JealousParrot68

JealousParrot68 Some usability comments - Since ClearML is opinionated, there are several pipeline workflow behaviors that make sense if you use Datasets and Artefacts interchangeably, e.g. the step caching AgitatedDove14 mentioned. Also for Datasets, if you combine them with a dedicated subproject like I did on my show, then you have the pattern where asking for the dataset of that subproject will always give you the most up-to-date dataset. Thus you can reuse your pipelines without having to know exactly which version should be used "right now". This embodies our ideal of "decoupling code from data".

Re: boilerplate / fluidity in roles - I think this is what makes ClearML shine for R&D workflows. We can't hope to guess exactly how everyone's MLOps is taking shape, but we can help you get what you need with the fewest lines of code possible.
What my show is aiming to convey in that arc is that you can quickly build the functionality that suits you best on top of our abstractions. As usually happens, should you find that you are writing the same code over and over, you can always refactor it out (and maybe submit a nice PR? 😍 ).
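
As one illustration of refactoring out repeated artefact-passing code, here is a minimal sketch. The helper name fetch_artifacts is hypothetical and not part of ClearML; it only assumes the task object exposes a dict-like .artifacts whose entries have a .get() method, which is how artifacts are accessed elsewhere in this thread.

```python
def fetch_artifacts(task, names):
    """Return {name: deserialized object} for the requested artifacts of `task`.

    `task` is any object with a dict-like `.artifacts` attribute whose values
    expose `.get()`; fetch_artifacts itself is a hypothetical helper.
    """
    missing = [name for name in names if name not in task.artifacts]
    if missing:
        raise KeyError(f"task has no artifacts named {missing}")
    return {name: task.artifacts[name].get() for name in names}


# Hypothetical usage inside a pipeline step (requires a configured ClearML setup):
# from clearml import Task
# inputs = fetch_artifacts(Task.get_task(task_id="<task id>"), ["train", "test"])
```

Keeping the helper independent of how the task object is obtained means it can be exercised with a stand-in object, without a server.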

Hope this helps a bit as well. If there is anything you'd like to see me going over in the show, let me know 😉

  				
Posted 4 years ago by GrumpyPenguin23

Ok, the caching part is nice. I think the tricky part (as always) is going to be all the edge cases. E.g. in my preprocessing pipeline I might have a lot of tasks so that I can parallelise nicely, but at the cost of quite a lot of boilerplate code for getting and writing artefacts, as well as having a lot of tasks in the UI. Let's see

  				
Posted 4 years ago by JealousParrot68

Hmm this is odd, is this a download issue? if this is reproducible maybe we should investigate further...

I'll keep you informed as I play around with it 🙂

  				
Posted 4 years ago by JealousParrot68

Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.

Hmm this is odd, is this a download issue? if this is reproducible maybe we should investigate further...

  				
Posted 4 years ago by AgitatedDove14

JealousParrot68 yes, this seems like a correct description.
The main difference between 1 & 2 is what the actual data is: if it is training/testing data, then Dataset makes sense; if it is part of a preprocessing pipeline, then artifacts make more sense (notice we added pipeline step caching in the artifacts, so that you can reuse steps if they have the same parameters/code, which means you can clone a pipeline and rerun it without repeating unnecessary data processing).

  				
Posted 4 years ago by AgitatedDove14

I'm going to answer soon 🙏

  				
Posted 4 years ago by GrumpyPenguin23

Point 3 being showcased in GrumpyPenguin23's video

  				
Posted 4 years ago by JealousParrot68

AgitatedDove14 as always, much obliged for your fast responses, this is actually incredible!

Yeah, a bit clearer; something like this in the docs would be really helpful 😉 At least the last part, as StorageManager is actually quite clear.

Maybe I can sum up my understanding?
So am I right in assuming that I can manage data, and the passing of it between tasks, either by:
1. Managing them in a folder structure via datasets, with the potential issue of syncing a lot of data between tasks and workers (obviously accounting for caching on workers between tasks)?
2. Managing them as artefacts of Tasks and passing them, explicitly coupled to tasks, to another task? In that case losing some of the tracing of datasets and the nice graph?
3. Using both as necessary, simultaneously 😄
Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.

  				
Posted 4 years ago by JealousParrot68

Hi JealousParrot68
I'll try to shed some light on these modules and use cases.
StorageManager is, generally speaking, a low-level utility for accessing http/object-storage/files. In most cases there is no need to use it directly if objects are already stored/managed on ClearML (for example artifacts/models/datasets). But it is quite handy to use with your S3 buckets etc.

Artifacts: Passing an artifact between Tasks will usually be something like:
artifact_object = Task.get_task('task_id').artifacts['my_artifact'].get()
Which will download (and cache) the artifact and will also de-serialize it into a Python object.

Datasets are just a way to get a folder with files without worrying about where I'm running (i.e. accessing my dataset anywhere).
Usually it will be something like:
my_local_dataset_copy_directory = Dataset.get('dataset_id').get_local_copy()
Make sense?
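
The three access patterns from this answer can be put side by side in a rough sketch. The wrapper function names are made up for illustration, and the ids and URL passed to them are up to the caller; the ClearML calls themselves (Task.get_task, Dataset.get, StorageManager.get_local_copy) are the ones named in this thread or in the ClearML SDK. Imports are done inside the functions so that defining them does not require a configured ClearML environment.

```python
def load_artifact(task_id, name):
    """Artifact: download (and cache) it, then de-serialize into a Python object."""
    from clearml import Task  # imported lazily; needs clearml installed and configured
    return Task.get_task(task_id=task_id).artifacts[name].get()


def dataset_folder(dataset_id):
    """Dataset: return the path of a local, cached copy of the dataset folder."""
    from clearml import Dataset
    return Dataset.get(dataset_id=dataset_id).get_local_copy()


def fetch_remote_file(url):
    """StorageManager: low-level download of an arbitrary http/object-storage URL."""
    from clearml import StorageManager
    return StorageManager.get_local_copy(remote_url=url)
```

Roughly: reach for the first when the data is an output of another Task, the second when it is a versioned folder of files, and the third only for files that ClearML is not already managing.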

  				
Posted 4 years ago by AgitatedDove14

