Is there an elegant way of accessing a specific file entry from a dataset without using IO operations to locate the file from the cache folder?

Answered

Is there an elegant way of accessing a specific file entry from a dataset without using IO operations to locate the file from the cache folder? The file is intended to be used to create a dataframe. At the moment I'm using the code below. The problem is that as files are added to the dataset, the index of the target file changes and the logic has to be adjusted. Perhaps this is possible by logging a dataframe artifact and using the storage manager to retrieve the artifact?

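import os
from clearml import Dataset

# Download (or reuse) the cached local copy of the dataset
dataset = Dataset.get(
    dataset_project="Project",
    dataset_name="Dataset name",
    alias="something",
).get_local_copy()

# Fragile: the position of the target file in os.listdir() shifts as files are added
file_for_df = os.path.join(dataset, os.listdir(dataset)[2])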

  				
Posted one year ago by SkinnyBat30

Answers 6

It is deterministic. When you call Dataset.get(), ClearML downloads the state.json file, in which you can see the relative path of every file and the number of the chunk that contains it.
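Since the relative paths recorded for a dataset do not change when other files are added, one option is to select the target file by its relative path rather than by its position in os.listdir(). The following is only a minimal sketch, assuming the relative path of the target file is known in advance. The helper names `resolve_dataset_file` and `dataset_file_path` are hypothetical and not part of ClearML.

```python
from pathlib import Path


def resolve_dataset_file(local_copy, relative_path):
    """Return the path of `relative_path` inside a dataset's local copy.

    Hypothetical helper: selecting by relative path keeps working when
    other files are added, unlike indexing into os.listdir().
    """
    candidate = Path(local_copy) / relative_path
    if not candidate.is_file():
        raise FileNotFoundError(f"{relative_path!r} not found in {local_copy}")
    return candidate


def dataset_file_path(project, name, relative_path):
    """Fetch the dataset's local copy via ClearML and resolve one file in it."""
    # Imported lazily; requires clearml to be installed and configured.
    from clearml import Dataset

    local_copy = Dataset.get(
        dataset_project=project,
        dataset_name=name,
    ).get_local_copy()
    return resolve_dataset_file(local_copy, relative_path)
```

The returned path can then be passed to whatever reader builds the dataframe. The lookup fails loudly with FileNotFoundError if the file is renamed or removed, instead of silently picking a different file.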

  				
Posted one year ago by CornyHedgehog13

I found this. It works as long as the originally uploaded data files (e.g., Excel, .sav, .spss) are converted to CSV files.

from clearml import Task

preprocess_task = Task.get_task(task_id='xxx123')
local_csv = preprocess_task.artifacts['data'].get_local_copy()  # local path to the stored CSV
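For the dataframe-artifact idea raised in the question, a rough sketch of both sides might look like the following. It assumes the dataframe is uploaded with `Task.upload_artifact` under the artifact name 'data' and read back with the artifact's `get()` method, which returns the stored object rather than a file path. The function names `log_dataframe_artifact` and `load_dataframe_artifact` are hypothetical, and 'xxx123' remains the placeholder task ID from the snippet above.

```python
def log_dataframe_artifact(task, df, name="data"):
    """Attach a dataframe to an existing ClearML task as a named artifact.

    `task` is expected to be a clearml.Task (for example, from Task.init()).
    """
    task.upload_artifact(name=name, artifact_object=df)


def load_dataframe_artifact(task_id, name="data"):
    """Return the object stored under `name` on the task with `task_id`."""
    # Imported lazily; requires clearml to be installed and configured.
    from clearml import Task

    source_task = Task.get_task(task_id=task_id)
    # .get() returns the deserialized object; .get_local_copy() would
    # return the path of the downloaded file instead.
    return source_task.artifacts[name].get()
```

With this approach the consumer refers to the artifact by name, so adding more files or artifacts to the producing task does not change how the dataframe is retrieved.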

  				
Posted one year ago by SkinnyBat30

There is no natural way to expose single files in Datasets. However, it looks like you found an appropriate workaround 🙂

  				
Posted one year ago by CostlyOstrich36

Thanks!

  				
Posted one year ago by SkinnyBat30

Thanks, CornyHedgehog13. I considered this. Is the chunk order deterministic? As in, can I rely on chunk [0] always referring to the same file object if additional files are added?

  				
Posted one year ago by SkinnyBat30

You can look up the number of the chunk that contains your file and download only that chunk.

  				
Posted one year ago by CornyHedgehog13
