
Hi, I'm trying to upload data from my S3 bucket to a ClearML Dataset, where I can start versioning it all for my ML project. I have connected to S3 successfully and configured my clearml.conf file correctly, but I am struggling with task initialization when it comes to uploading subfolders of the S3 bucket directory.
I am receiving this error in the log:

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Dataset 'VisionAI_data' found, creating a new version...
Adding files from: 

2024-07-01 17:04:24,711 - clearml.storage - INFO - Uploading: 5.00MB / 32.85MB @ 8.16MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,774 - clearml.storage - INFO - Uploading: 10.00MB / 32.85MB @ 79.69MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,829 - clearml.storage - INFO - Uploading: 15.00MB / 32.85MB @ 91.88MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,897 - clearml.storage - INFO - Uploading: 20.00MB / 32.85MB @ 73.79MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,995 - clearml.storage - INFO - Uploading: 25.00MB / 32.85MB @ 51.01MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:25,343 - clearml.storage - INFO - Uploading: 30.00MB / 32.85MB @ 14.38MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:28,106 - clearml.storage - INFO - Uploading: 32.85MB / 32.85MB @ 1.03MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:28,791 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=d685ecee84434b469bca416fafb8bc48, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
 to ClearML/.datasets/VisionAI_data/VisionAI_data.d685ecee84434b469bca416fafb8bc48/artifacts/state/state.json', 'content_size': 34450423, 'hash': 'a59aae25c98cc9a251ff989768e5c622b475516ce52ec4b030cb837a65d41a4f', 'timestamp': 1719878668, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 112716 - total size 70.64 GB\nCurrent dependency graph: {\n  "d685ecee84434b469bca416fafb8bc48": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '112716'), ('files removed', '0'), ('files modified', '2')]}], force=True)
Traceback (most recent call last):
  File "/Users/rishiarjun/Desktop/VisionAI/Vision-ML/DataEngineering/S3Connect.py", line 44, in <module>
    create_or_update_dataset_from_s3(bucket_name, dataset_name, dataset_project)
  File "/Users/rishiarjun/Desktop/VisionAI/Vision-ML/DataEngineering/S3Connect.py", line 37, in create_or_update_dataset_from_s3
    dataset.finalize()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/clearml/datasets/dataset.py", line 828, in finalize
    raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid 

Here's my script. It's fairly straightforward: establish a connection, create a task, check if the dataset exists, then upload 3 folders from the VisionAI1 bucket in S3.
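
(The script itself didn't come through in the archive. As a hedged reconstruction, based on the create_or_update_dataset_from_s3 call in the traceback above and the "status 'completed'" error, the failing pattern was likely something like the following; the folder names are placeholders.)

from clearml import Dataset

def create_or_update_dataset_from_s3(bucket_name, dataset_name, dataset_project):
    try:
        # If the dataset exists, this returns the already-FINALIZED version
        dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
    except ValueError:
        dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    # Register the S3 folders as external files (no local copy needed)
    for folder in ("folder1", "folder2", "folder3"):  # placeholder folder names
        dataset.add_external_files(source_url=f"s3://{bucket_name}/{folder}/")
    dataset.upload()
    dataset.finalize()  # on the Dataset.get branch this raises:
                        # ValueError: Cannot finalize dataset, status 'completed' is not valid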

  
  
Posted one year ago

Answers 5


@<1719162259181146112:profile|ShakySnake40> the data is still present in the parent and won't be uploaded again. Also, when you pull a child dataset you are also pulling the parent's data. dataset.id is a string that uniquely identifies each dataset in the system. In my example, you use the ID to reference a dataset that becomes a parent of the newly created dataset (that is, after fetching it via Dataset.get).
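
For illustration, a minimal sketch of both points; the project name is a placeholder (only the VisionAI_data dataset name comes from this thread):

from clearml import Dataset

# dataset.id is a unique string handle, e.g. 'd685ecee84434b469bca416fafb8bc48'
parent = Dataset.get(dataset_name="VisionAI_data", dataset_project="VisionAI")
print(parent.id)

# A new version listing the parent's ID inherits all of the parent's files;
# only files added on top of it are uploaded
child = Dataset.create(
    dataset_name="VisionAI_data",
    dataset_project="VisionAI",
    parent_datasets=[parent.id],
)

# Pulling a dataset by ID materializes it locally; once the child is finalized,
# the same call on child.id returns the parent's data plus the child's additions
local_path = Dataset.get(dataset_id=parent.id).get_local_copy()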

  
  
Posted one year ago

What if I had already uploaded the data and want to update it with new data? It wouldn't make sense to create a new dataset, right? And what does dataset.id represent?

  
  
Posted one year ago

Something like:

dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project, parent_datasets=[dataset.id])
  
  
Posted one year ago

try:
    dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
    Logger.current_logger().report_text(f"Dataset '{dataset_name}' found, creating a new version...")

What do I modify here so that a new version of the dataset is created every time I rerun the script?

  
  
Posted one year ago

Hi @<1719162259181146112:profile|ShakySnake40> ! It looks like you are trying to update an already-finalized dataset. Finalized datasets cannot be updated. In general, you should create a new dataset that inherits from the dataset you want to update (via the parent_datasets argument of Dataset.create) and operate on that new dataset instead.
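
A minimal end-to-end sketch of that pattern, using the names from this thread (the project name and S3 path are placeholders):

from clearml import Dataset

# The finalized dataset you want to "update"
parent = Dataset.get(dataset_name="VisionAI_data", dataset_project="VisionAI")

# A new, writable version that inherits the parent's content
dataset = Dataset.create(
    dataset_name="VisionAI_data",
    dataset_project="VisionAI",
    parent_datasets=[parent.id],
)

# Operate on the new version instead: add only the new/changed files
dataset.add_external_files(source_url="s3://VisionAI1/new-folder/")  # or add_files() for local data
dataset.upload()
dataset.finalize()  # valid now: this version was never finalized before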

  
  
Posted one year ago