Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Hi, I'M Trying To Upload Data From My S3 Bucket To Clearml Dataset Where I Can Start Versioning It All For My Ml Project. I Have Connected Successfully To My S3, Correctly Configured My Clearml.Conf File, But I Am Struggling With Some Task Initialization

Hi, I'm trying to upload data from my s3 bucket to clearml dataset where i can start versioning it all for my ML project. I have connected successfully to my s3, correctly configured my clearml.conf file, but I am struggling with some task initialization when it comes to uploading subfolders of s3 bucket directory.
I am receiving this error log message

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Dataset 'VisionAI_data' found, creating a new version...
Adding files from: 

2024-07-01 17:04:24,711 - clearml.storage - INFO - Uploading: 5.00MB / 32.85MB @ 8.16MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,774 - clearml.storage - INFO - Uploading: 10.00MB / 32.85MB @ 79.69MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,829 - clearml.storage - INFO - Uploading: 15.00MB / 32.85MB @ 91.88MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,897 - clearml.storage - INFO - Uploading: 20.00MB / 32.85MB @ 73.79MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:24,995 - clearml.storage - INFO - Uploading: 25.00MB / 32.85MB @ 51.01MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:25,343 - clearml.storage - INFO - Uploading: 30.00MB / 32.85MB @ 14.38MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:28,106 - clearml.storage - INFO - Uploading: 32.85MB / 32.85MB @ 1.03MBs to /var/folders/zm/vf43rrfs5y5f4tsfqhb0tgdc0000gn/T/state.2m6gxtp_.json
2024-07-01 17:04:28,791 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=d685ecee84434b469bca416fafb8bc48, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
 to ClearML/.datasets/VisionAI_data/VisionAI_data.d685ecee84434b469bca416fafb8bc48/artifacts/state/state.json', 'content_size': 34450423, 'hash': 'a59aae25c98cc9a251ff989768e5c622b475516ce52ec4b030cb837a65d41a4f', 'timestamp': 1719878668, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 112716 - total size 70.64 GB\nCurrent dependency graph: {\n  "d685ecee84434b469bca416fafb8bc48": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '112716'), ('files removed', '0'), ('files modified', '2')]}], force=True)
Traceback (most recent call last):
  File "/Users/rishiarjun/Desktop/VisionAI/Vision-ML/DataEngineering/S3Connect.py", line 44, in <module>
    create_or_update_dataset_from_s3(bucket_name, dataset_name, dataset_project)
  File "/Users/rishiarjun/Desktop/VisionAI/Vision-ML/DataEngineering/S3Connect.py", line 37, in create_or_update_dataset_from_s3
    dataset.finalize()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/clearml/datasets/dataset.py", line 828, in finalize
    raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid 

Heres my script. Its fairly straightforward- establish connection, create task, check if dataset exists, then upload 3 folders from the VisionAI1 bucket in s3

  
  
Posted 6 months ago
Votes Newest

Answers 5


@<1719162259181146112:profile|ShakySnake40> the data is still present in the parent and it won't be uploaded again. Also, when you pull a child dataset you are also pulling the dataset's parent data. dataset.id is a string that uniquely identifies each dataset in the system. In my example, you are using the ID to reference a dataset which would be a parent of the newly created dataset (that is, after getting the dataset via Dataset.get )

  
  
Posted 6 months ago

try:
        dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
        Logger.current_logger().report_text(f"Dataset '{dataset_name}' found, creating a new version...")

What do i modify here so that it allows for a new version to be created of the dataset everytime i rerun the script?

  
  
Posted 6 months ago

Hi @<1719162259181146112:profile|ShakySnake40> ! It looks like you are trying to update an already finalized dataset. Datasets that are finalized cannot be updated. In general, you should create a new dataset that inherits from the dataset you want to update (via the parent_datasets argument in Dataset.create ) and operate on that dataset instead

  
  
Posted 6 months ago

What if i already had uploaded the data, and want to update it with new data? Wouldn't make sense to create new dataset right? And what does dataset.id represent?

  
  
Posted 6 months ago

Something like:

dataset = Dataset.create(dataset_name=dataset_name, dataset_porject=dataset_project, parent_datasets=[dataset.id])
  
  
Posted 6 months ago