Answered
Hi, I'm trying to upload data from my S3 bucket to ClearML Dataset where I can start versioning it all for my ML project. I have connected successfully to my S3, correctly configured my clearml.conf file, but I am struggling with some task initialization

Hi, I'm trying to upload data from my S3 bucket to a ClearML Dataset where I can start versioning it all for my ML project. I have connected successfully to my S3 bucket and correctly configured my clearml.conf file, but I am struggling with task initialization when it comes to uploading subfolders of the S3 bucket directory.
I am receiving this error message in the log:

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Dataset 'VisionAI_data' found, creating a new version...
Adding files from: 

2024-07-01 17:04:24,711 - clearml.storage - INFO - Uploading: 5.00MB / 32.85MB @ 8.16MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,774 - clearml.storage - INFO - Uploading: 10.00MB / 32.85MB @ 79.69MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,829 - clearml.storage - INFO - Uploading: 15.00MB / 32.85MB @ 91.88MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,897 - clearml.storage - INFO - Uploading: 20.00MB / 32.85MB @ 73.79MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,995 - clearml.storage - INFO - Uploading: 25.00MB / 32.85MB @ 51.01MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:25,343 - clearml.storage - INFO - Uploading: 30.00MB / 32.85MB @ 14.38MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:28,106 - clearml.storage - INFO - Uploading: 32.85MB / 32.85MB @ 1.03MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:28,791 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=d685ecee84434b469bca416fafb8bc48, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
 to ClearML/.datasets/VisionAI_data/VisionAI_data.d685ecee84434b469bca416fafb8bc48/artifacts/state/state.json', 'content_size': 34450423, 'hash': 'a59aae25c98cc9a251ff989768e5c622b475516ce52ec4b030cb837a65d41a4f', 'timestamp': 1719878668, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 112716 - total size 70.64 GB\nCurrent dependency graph: {\n  "d685ecee84434b469bca416fafb8bc48": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '112716'), ('files removed', '0'), ('files modified', '2')]}], force=True)
Traceback (most recent call last):
  File "/Users/rishiarjun/Desktop/VisionAI/Vision-ML/DataEngineering/S3Connect.py", line 44, in <module>
    create_or_update_dataset_from_s3(bucket_name, dataset_name, dataset_project)
  File "/Users/rishiarjun/Desktop/VisionAI/Vision-ML/DataEngineering/S3Connect.py", line 37, in create_or_update_dataset_from_s3
    dataset.finalize()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/clearml/datasets/dataset.py", line 828, in finalize
    raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid 

Here's my script. It's fairly straightforward: establish the connection, create the task, check if the dataset exists, then upload 3 folders from the VisionAI1 bucket in S3.
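
The script itself wasn't captured in this thread. As a stand-in, here is a hypothetical sketch of the flow described above (the folder names, project name, and use of add_external_files are assumptions; only the helper name and the finalize() call come from the traceback):

from clearml import Dataset, Logger

bucket_name = "VisionAI1"
dataset_name = "VisionAI_data"
dataset_project = "VisionAI"  # assumed project name

def create_or_update_dataset_from_s3(bucket_name, dataset_name, dataset_project):
    try:
        # Reuse the existing dataset if one is found...
        dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
        Logger.current_logger().report_text(f"Dataset '{dataset_name}' found, creating a new version...")
    except ValueError:
        # ...otherwise create the first version.
        dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)

    # Register the three S3 subfolders (names assumed) as dataset files.
    for folder in ("train", "val", "test"):
        dataset.add_external_files(source_url=f"s3://{bucket_name}/{folder}/")

    dataset.upload()
    dataset.finalize()  # this is the call that raises the ValueError above

create_or_update_dataset_from_s3(bucket_name, dataset_name, dataset_project)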

  
  
Posted 4 months ago

Answers 5


Hi @<1719162259181146112:profile|ShakySnake40>! It looks like you are trying to update an already finalized dataset. Datasets that are finalized cannot be updated. In general, you should create a new dataset that inherits from the dataset you want to update (via the parent_datasets argument in Dataset.create) and operate on that dataset instead.

  
  
Posted 4 months ago

Something like:

dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project, parent_datasets=[dataset.id])
  
  
Posted 4 months ago

What if I had already uploaded the data and want to update it with new data? It wouldn't make sense to create a new dataset, right? And what does dataset.id represent?

  
  
Posted 4 months ago

try:
    dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
    Logger.current_logger().report_text(f"Dataset '{dataset_name}' found, creating a new version...")

What do I modify here so that a new version of the dataset is created every time I rerun the script?

  
  
Posted 4 months ago

@<1719162259181146112:profile|ShakySnake40> the data is still present in the parent and won't be uploaded again. Also, when you pull a child dataset, you are also pulling the dataset's parent data. dataset.id is a string that uniquely identifies each dataset in the system. In my example, you are using the ID to reference a dataset which would be a parent of the newly created dataset (that is, after getting the dataset via Dataset.get).
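
Putting it together, the try block from your script could look something like this (a sketch, reusing the same dataset_name and dataset_project variables):

try:
    # Fetch the latest (finalized) version to serve as the parent.
    parent = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
    Logger.current_logger().report_text(f"Dataset '{dataset_name}' found, creating a new version...")
    # Create a new, writable version that inherits the parent's files.
    dataset = Dataset.create(
        dataset_name=dataset_name,
        dataset_project=dataset_project,
        parent_datasets=[parent.id],
    )
except ValueError:
    # No existing dataset yet: start the first version from scratch.
    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)

Each rerun then adds files to a fresh child version, so finalize() succeeds and unchanged files inherited from the parent are not re-uploaded.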

  
  
Posted 4 months ago