Hi, I Am Having Difficulties When Using The Dataset Functionality. I Am Trying To Create A Dataset With The Following Simple Code:

Answered

Hi, I am having difficulties when using the Dataset functionality.

I am trying to create a dataset with the following simple code:
```
from clearml import Task, Dataset

task = Task.init(project_name="myproject", task_name="mytask")

dataset = Dataset.create(
    dataset_name="training_split",
    dataset_project=task.get_project_name(),
    use_current_task=False,
)
dataset.add_files(
    path="/home/user/server_local_storage/data/splits/training/",
)
dataset.upload(output_url="/home/user/server_local_storage/clearml_training_dataset")
dataset.finalize()
```

When executing this code, the following exception is raised:

```
Traceback (most recent call last):
  File "/home/user/myproject/lab.py", line 16, in <module>
    dataset.finalize()
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 483, in finalize
    self._serialize(update_dependency_chunk_lookup=True)
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1182, in _serialize
    state['dependency_chunk_lookup'] = self._build_dependency_chunk_lookup()
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1516, in _build_dependency_chunk_lookup
    return dict(chunks_lookup)
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1514, in <lambda>
    lambda d: (d, Dataset.get(dataset_id=d).get_num_chunks()),
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 971, in get_num_chunks
    return sum(self._get_dependency_chunk_lookup().values())
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1786, in _get_dependency_chunk_lookup
    self._dependency_chunk_lookup = self._build_dependency_chunk_lookup()
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1516, in _build_dependency_chunk_lookup
    return dict(chunks_lookup)
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1514, in <lambda>
    lambda d: (d, Dataset.get(dataset_id=d).get_num_chunks()),
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 935, in get
    raise ValueError('Could not load Dataset id={} state'.format(task.id))
ValueError: Could not load Dataset id=390dc4ca338942aebc2c9ceca2a671d5 state
```

After debugging for a while, I figured out that the path I specify in output_url (in the upload method) is actually prefixed with 'file://', so on my machine that path points to nowhere and the dataset is never stored there. I think the error is due to this, am I right? What do you think?
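One way to check the hypothesis in the question is sketched below with the standard library only: resolve the local path to its 'file://' form and confirm that the directory exists and holds files on the machine running the script. The helper name inspect_output_url is hypothetical and is not part of ClearML.

```python
from pathlib import Path


def inspect_output_url(path_str):
    """Report the file:// form of a local path and what is stored under it.

    Hypothetical debugging helper; not a ClearML function.
    """
    path = Path(path_str).expanduser().resolve()
    is_dir = path.is_dir()
    # Only list entries when the directory is really there.
    files = sorted(p.name for p in path.iterdir()) if is_dir else []
    return {
        "uri": path.as_uri(),  # absolute path rendered with the file:// scheme
        "exists": is_dir,
        "files": files,
    }
```

Calling inspect_output_url("/home/user/server_local_storage/clearml_training_dataset") right after dataset.upload(...) would show whether anything was actually written to that location.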

  				
Posted 3 years ago by GiganticTurtle0


Answers 29

GiganticTurtle0, will do 🙂

  				
Posted 3 years ago by CostlyOstrich36

Thank you for noticing the issue!

  				
Posted 3 years ago by AgitatedDove14

too large to be stored in the .cache path? It will be stored there anyway?

Oh, that is exactly why the latest release supports chunks, so you can get a partial copy 🙂
Nonetheless, the assumption is that you will have to end up with the data locally, otherwise the network becomes a huge bottleneck.
Makes sense?
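The idea of a partial copy can be sketched as follows: the dataset is stored as several chunks, and each machine fetches only the chunks assigned to its part. The function name chunks_for_part and the round-robin assignment are assumptions made for illustration; ClearML's actual assignment may differ. Recent ClearML releases appear to expose this through part and num_parts arguments on get_local_copy(), but check the documentation for your version.

```python
def chunks_for_part(num_chunks, part, num_parts):
    """Return the chunk indices one part would fetch (round-robin split).

    Illustrative only; not ClearML's implementation.
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    if not 0 <= part < num_parts:
        raise ValueError("part must be in [0, num_parts)")
    # Part 0 takes chunks 0, num_parts, 2*num_parts, ... and so on.
    return list(range(part, num_chunks, num_parts))
```

With 10 chunks and 3 parts, each machine downloads three or four chunks instead of all ten, so no single cache folder has to hold the whole dataset.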

  				
Posted 3 years ago by AgitatedDove14

Well I tried several things but none of them have worked. I'm a bit lost

  				
Posted 3 years ago by GiganticTurtle0

Mmm, but what if the dataset is too large to be stored in the .cache path? Will it be stored there anyway?

  				
Posted 3 years ago by GiganticTurtle0

AgitatedDove14 Oops, something still seems to be wrong. When trying to retrieve the dataset using get_local_copy() I get the following error:
```
Traceback (most recent call last):
  File "/home/user/myproject/lab.py", line 27, in <module>
    print(dataset.get_local_copy())
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 554, in get_local_copy
    target_folder = self._merge_datasets(
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1342, in _merge_datasets
    target_base_folder = self._create_ds_target_folder(
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 1291, in _create_ds_target_folder
    cache.lock_cache_folder(local_folder)
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/storage/cache.py", line 248, in lock_cache_folder
    lock.acquire(timeout=0)
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/utilities/locks/utils.py", line 130, in acquire
    fh = self._get_fh()
  File "/home/user/.conda/envs/myenv/lib/python3.9/site-packages/clearml/utilities/locks/utils.py", line 200, in _get_fh
    return open(self.filename, self.mode, **self.file_open_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.clearml/cache/storage_manager/datasets/.lock.000.ds_38e9acc8d56441999e806815abddee82.clearml'
```

Main code is the same as above, I'm just adding dataset.get_local_copy() at the end. It seems it resolves the path with a .lock file. Weird...

  				
Posted 3 years ago by GiganticTurtle0

Found it
GiganticTurtle0 you are 🧨 ! thank you for stumbling across this one as well.
Fix will be pushed later today 🙂

  				
Posted 3 years ago by AgitatedDove14

But what I get with

get_local_copy()

is the following path: ...

get_local_copy() returns an immutable copy of the dataset; by definition, this will not be the "source" storing the data.
(Also notice that the dataset itself is stored in zip files, and when you get the "local copy" you get the extracted files.)
Makes sense?
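The distinction between the stored archives and the extracted local copy can be illustrated with a small standard-library sketch. The function names, the dataset.zip file name and the ds_example folder name are hypothetical; this is not ClearML's implementation, only the general pattern the answer describes.

```python
import zipfile
from pathlib import Path


def store_as_archive(source_dir, storage_dir):
    """Pack every file under source_dir into one zip inside storage_dir."""
    source_dir, storage_dir = Path(source_dir), Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    archive = storage_dir / "dataset.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for f in sorted(source_dir.rglob("*")):
            if f.is_file():
                # Store paths relative to the dataset root.
                zf.write(f, f.relative_to(source_dir).as_posix())
    return archive


def extract_local_copy(archive, cache_dir):
    """Extract the archive into a separate cache folder and return it."""
    target = Path(cache_dir) / "ds_example"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

In this sketch the storage folder (the analogue of output_url) only ever contains the zip, while the folder handed back to the caller lives under the cache directory, which matches the path reported in the thread.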

  				
Posted 3 years ago by AgitatedDove14

I can't figure out what might be going on

  				
Posted 3 years ago by GiganticTurtle0

GiganticTurtle0, I tried running the same script as before with dataset.get_local_copy() added at the end, and it worked fine. Do you have any other changes? Are you on the latest repo code?

  				
Posted 3 years ago by CostlyOstrich36

GiganticTurtle0 , I managed to reproduce and make it work once. Let me take a look

  				
Posted 3 years ago by CostlyOstrich36

GiganticTurtle0 fix was just pushed to GitHub 🙂
pip install git+

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 In the 'status.json' file I could see the 'is_dirty' flag is set to True

  				
Posted 3 years ago by GiganticTurtle0

Great! Thanks for the heads up!

  				
Posted 3 years ago by GiganticTurtle0

Indeed it does! But what still puzzles me is why I get the path below when running dataset.get_local_copy() on one of the machines of my cluster:
/home/user/.clearml/cache/storage_manager/datasets/.lock.000.ds_61ff8d4335dd4b74bd78c3576fa44131.clearml
Why is it pointing to a .lock file?

  				
Posted 3 years ago by GiganticTurtle0

Hi GiganticTurtle0
Let me check

  				
Posted 3 years ago by AgitatedDove14

Thanks to you for fixing it so quickly!

  				
Posted 3 years ago by GiganticTurtle0

FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.clearml/cache/storage_manager/datasets/.lock.000.ds_38e9acc8d56441999e806815abddee82.clearml'

Let me check this issue, it seems like the locking mechanism should have figured that there is no lock...
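The FileNotFoundError quoted above is what open() raises when the parent folder of the target file is missing, which fits the observation elsewhere in this thread that the datasets directory does not exist. A minimal sketch of that failure mode, and of one possible guard, follows. The helper open_lock_file is hypothetical, and this is not the fix that was pushed to ClearML.

```python
from pathlib import Path


def open_lock_file(lock_path, create_parent):
    """Open a lock file for appending, optionally creating its folder first.

    Hypothetical helper for illustration; not ClearML's locking code.
    """
    lock_path = Path(lock_path)
    if create_parent:
        # Without this, open() fails when the parent folder is missing.
        lock_path.parent.mkdir(parents=True, exist_ok=True)
    return open(lock_path, "a")
```

Calling it with create_parent=False on a path whose folder is missing reproduces the same error class as in the traceback; with create_parent=True the lock file is created normally.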

  				
Posted 3 years ago by AgitatedDove14

Fix pushed to github 🙂
pip install git+

  				
Posted 3 years ago by AgitatedDove14

Yes, I'm working with the latest commit. Anyway, I have tried to run dataset.get_local_copy() on another machine and it works. I have no idea why this happens. However, on the new machine get_local_copy() does not return the path I expect. If I have this code:
dataset.upload(output_url="/home/user/server_local_storage/mock_storage")
I would expect the dataset to be stored under the path specified in output_url. But what I get with get_local_copy() is the following path:
'/home/user/.clearml/cache/storage_manager/datasets/ds_98d1bfbbb7334f50a4113409b4d691be'
Is this usual?

  				
Posted 3 years ago by GiganticTurtle0

In fact, the datasets directory does not even exist

  				
Posted 3 years ago by GiganticTurtle0

Well the 'state.json' file is actually removed after the exception is raised

  				
Posted 3 years ago by GiganticTurtle0

GiganticTurtle0 what's the Dataset Task status?

  				
Posted 3 years ago by AgitatedDove14

dataset upload

  				
Posted 2 years ago by AlertTurkey7

GiganticTurtle0, it looks like an issue with the latest RC. We're working on a fix 🙂

  				
Posted 3 years ago by CostlyOstrich36

Thanks, I'd appreciate it if you let me know when it's fixed :D

  				
Posted 3 years ago by GiganticTurtle0

By adding the slash I have been able to see that the dataset is indeed stored in output_url. However, when calling finalize, I get the same error. And yes, I have installed the version corresponding to the last commit :/

  				
Posted 3 years ago by GiganticTurtle0

Also, can you try with dataset.upload(output_url="/home/user/server_local_storage/clearml_training_dataset/")?

(note the added '/' at the end of the path)

  				
Posted 3 years ago by CostlyOstrich36

GiganticTurtle0, are you using the latest release or the RC?

  				
Posted 3 years ago by CostlyOstrich36
