Dear Community, I Have Tried To Use

Dear community,
I have tried to use clearml-data sync to update a previously created ClearML dataset with a folder's contents. Nothing had changed since I created the dataset, and this was the output:

clearml-data - Dataset Management & Versioning CLI
Syncing dataset id <ID> to local folder .
Generating SHA2 hash for x files
100%|█████████████████████████████████████████| x/x [x<x,  xit/s]
Hash generation completed
Sync completed: 0 files removed, 0 added, 0 modified
Finalizing dataset
Pending uploads, starting dataset upload to 

File compression and upload completed: total size 0 B, 0 chunk(s) stored (average size 0 B)
clearml - INFO - No pending files, skipping upload.
clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=<ID>, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
<PROJECT_NAME>/.datasets/DATASET_NAME/DATASET_FOLDER/artifacts/state/state.json', 'content_size': xxx, 'hash': 'xxxx', 'timestamp': xxx, 'type_data': {'preview': 'Dataset state\nFiles added/modified: x - total size x MB\nCurrent dependency graph: {\n  "xxx": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', 'x'), ('files modified', '0'), ('files removed', '0')]}], force=True)

Is it normal that it crashes? Did I do something wrong, or is it related to the fact that there was no change?
Thanks in advance!

  
  
Posted 2 months ago

Answers 4


Hi @<1523701435869433856:profile|SmugDolphin23> , and thank you for your prompt response.

To make sure I understand: what is the intended workflow if I want to keep the same dataset (which should therefore have the same name as before, with everything else similar) but generate a new version of it? Is this what a child dataset is meant for, or does it mean that I should not have finalised my dataset in the first place? If the latter, how am I supposed to know when I can finalise a dataset?

I am particularly puzzled because the documentation of clearml-data sync says "This option is useful in case a user has a single point of truth (i.e. a folder) which updates from time to time", which suggests to me that I can run it regularly whenever I update my "truth folder". But the documentation also states "This command also uploads the data and finalizes the dataset automatically.", which implies that afterwards I can no longer use this command. Did I misunderstand something?

Thank you in advance for your support!

  
  
Posted 2 months ago

Thanks for your answers!

  
  
Posted 2 months ago

@<1668427963986612224:profile|GracefulCoral77> You can either create a child dataset or keep working with the same dataset, as long as it is not finalized.
You can skip the finalization using the --skip-close argument. Anyhow, I can see why the current workflow is confusing. I will discuss it with the team; maybe we should allow syncing unfinalized datasets as well.
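For example, a sync that keeps the dataset open for later updates might look roughly like this (a sketch only: the dataset ID is a placeholder, and exact flags can vary between clearml versions, so check clearml-data sync --help on yours):

```shell
# Sync the local folder into an existing, still-unfinalized dataset,
# skipping the automatic finalization so the dataset can be synced again later.
clearml-data sync --id <DATASET_ID> --folder . --skip-close

# ...repeat the sync above whenever the "truth folder" changes...

# Once the dataset is truly complete, upload and finalize it explicitly.
clearml-data close
```

With --skip-close the dataset stays in the "created" state, so subsequent syncs avoid the tasks.add_or_update_artifacts error shown above.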

  
  
Posted 2 months ago

Hi @<1668427963986612224:profile|GracefulCoral77>! The error is a bit misleading. What it actually means is that you shouldn't attempt to modify a finalized ClearML dataset (I suppose that is what you are trying to achieve). Instead, you should create a new dataset that inherits from the finalized one and sync that dataset, or leave the dataset in an unfinalized state.
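A minimal sketch of the child-dataset approach (project and dataset names here are placeholders, and flags may differ slightly across clearml versions):

```shell
# Create a new dataset version that inherits from the finalized one.
clearml-data create --project "MyProject" --name "MyDataset" --parents <FINALIZED_DATASET_ID>

# Sync the truth folder into the newly created (still-open) child dataset;
# only the changes relative to the parent are stored.
clearml-data sync --folder .
```

The child gets its own ID, so the finalized parent stays immutable while the new version accumulates the changes.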

  
  
Posted 2 months ago