Is there an elegant way of accessing a specific file entry from a dataset without using IO operations to locate the file from the cache folder?

Answered

Is there an elegant way of accessing a specific file entry from a dataset without using IO operations to locate the file from the cache folder? The file is intended to be used to create a dataframe. At the moment I'm using the code below. The problem is that as files are added to the dataset, the index of the target file changes and the logic has to be adjusted. Perhaps this is possible by logging a dataframe artifact and using the storage manager to retrieve the artifact?

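import os
from clearml import Dataset

# Download (or reuse) a cached local copy of the dataset; returns the folder path.
dataset = Dataset.get(
    dataset_project="Project",
    dataset_name="Dataset name",
    alias="something",
).get_local_copy()

# Picks the third directory entry; this index shifts as files are added.
file_for_df = os.path.join(dataset, os.listdir(dataset)[2])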
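One way to avoid depending on the position in os.listdir() is to look the file up by its relative path inside the dataset. The following is a minimal sketch, assuming the file's relative path is known and does not change between dataset versions. The helper names resolve_dataset_file and get_file_from_dataset are hypothetical, and Dataset.list_files() is used here only to confirm that the entry is registered in the dataset.

```python
import os


def resolve_dataset_file(local_root, relative_path, known_files=None):
    """Return the full path of relative_path inside a local dataset copy.

    known_files is an optional iterable of relative paths (for example the
    result of Dataset.list_files()); when given, the entry must be listed.
    """
    # Normalise separators so "data\\train.csv" and "data/train.csv" match.
    wanted = relative_path.replace("\\", "/")
    if known_files is not None:
        listed = {p.replace("\\", "/") for p in known_files}
        if wanted not in listed:
            raise KeyError(f"{relative_path!r} is not listed in the dataset")
    full_path = os.path.join(local_root, *wanted.split("/"))
    if not os.path.isfile(full_path):
        raise FileNotFoundError(full_path)
    return full_path


def get_file_from_dataset(relative_path):
    # Imported lazily so the helper above works without ClearML installed.
    from clearml import Dataset

    # Project, dataset name and alias are the placeholders from the question.
    ds = Dataset.get(
        dataset_project="Project",
        dataset_name="Dataset name",
        alias="something",
    )
    return resolve_dataset_file(ds.get_local_copy(), relative_path, ds.list_files())
```

Because the lookup is keyed on the path rather than on a list index, adding more files to the dataset should not change which file is returned.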

  				
Posted 2 years ago by SkinnyBat30


Answers 6

I found this. It works as long as the data files originally uploaded (e.g., Excel, .sav, .spss) are converted to CSV files.

from clearml import Task

preprocess_task = Task.get_task(task_id='xxx123')
local_csv = preprocess_task.artifacts['data'].get_local_copy()  # local path to the stored file
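The same idea can be applied to the dataframe itself rather than to a CSV file. This is a rough sketch, assuming the preprocessing task uploads the dataframe with Task.upload_artifact() and that the artifact's get() method returns the stored object. The helper names upload_dataframe and load_dataframe are hypothetical, and 'data' is the artifact name used in the snippet above.

```python
def upload_dataframe(task, df, name="data"):
    """Store df as a named artifact on task (expected to be a clearml.Task)."""
    task.upload_artifact(name=name, artifact_object=df)


def load_dataframe(task, name="data"):
    """Return the object stored under the artifact called name on task."""
    return task.artifacts[name].get()


# Usage (needs a configured ClearML setup; 'xxx123' is the placeholder task ID):
#   from clearml import Task
#   df = load_dataframe(Task.get_task(task_id="xxx123"))
```

The artifact is addressed by name, so the lookup does not depend on how many files the dataset contains.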

  				
Posted 2 years ago by SkinnyBat30

Thanks!

  				
Posted 2 years ago by SkinnyBat30

It is deterministic. When you call Dataset.get(), ClearML downloads the state.json file, which lists the relative path and chunk number of every file.

  				
Posted 2 years ago by CornyHedgehog13

There is no natural way to expose single files in Datasets. However, it looks like you found an appropriate workaround 🙂

  				
Posted 2 years ago by CostlyOstrich36

Thanks, @CornyHedgehog13. I considered this. Is the chunk order deterministic? That is, can I rely on chunk [0] always referring to the same file object if additional files are added?

  				
Posted 2 years ago by SkinnyBat30

You can look up the number of the chunk that contains your file and download only that chunk.

  				
Posted 2 years ago by CornyHedgehog13
