I'M A Bit Confused. It Seems Like Something Has Changed With How Clearml Handles Recording Datasets In Tasks. It Used To Be The Case That When I Would Create A Dataset Under A Task, Clearml Would Record The Id Of The Dataset In The Hyperparameters/Datase

Answered

I'm a bit confused. It seems like something has changed with how ClearML handles recording datasets in tasks.

It used to be that when I created a dataset under a task, ClearML would automatically record the dataset's ID in the Hyperparameters/Datasets section (see first attached image). I'd also get all the relevant datasets under the info section. Now I get neither (see second image). Instead, I get a "General" section in hyperparameters that tells me how many files were changed but nothing about the dataset's ID. From what I can tell, the task no longer stores the dataset ID anywhere, so I have no way to track the dataset with the task.

I did update from 1.11.0 to 1.11.1, but I had the problem even after reverting to 1.11.0. I'm using ClearML through the web client; I'm not self-hosting or using a SaaS.

I'll reply with a snippet with the code I'm using to construct the dataset. It's a bit of a long process because I do a series of things. Namely: check to see if a dataset exists with that name already, compare the local data to the remote data, and if there's a change, I upload the new dataset as a child of the last version.

  				
Posted one year ago by EnthusiasticCow4


Answers 7

Alright, I'll try and put that together for Monday.

  				
Posted one year ago by EnthusiasticCow4

Hi EnthusiasticCow4! Note that the Datasets section is created only if you get the dataset with an alias. Are you sure that number_of_datasets_on_remote != 0?
If so, can you provide a short snippet that would help us reproduce? The code you posted looks fine to me, so I'm not sure what the problem could be.

  				
Posted one year ago by SmugDolphin23

I see. Thanks for the insight. That seems to be the case. I'm struggling a bit with datasets. For example, I'd like to trace the genealogy of a dataset that's used by both traditional tasks and pipelines. I'll try to write something up about the challenges around that when I get the chance. But your comment revealed another issue:

Partial name matching doesn't seem to be working, and I'm unclear why it wouldn't match. In the attached photo you can see that the input for partial_name is '[LTV] Dataset Test', and the unfiltered search shows many datasets with exactly that title. Yet with that search criterion I get 0 results. One would assume that a partial match would include exact matches?

  				
Posted one year ago by EnthusiasticCow4

PRed: None

  				
Posted one year ago by EnthusiasticCow4

EnthusiasticCow4, a PR would be greatly appreciated. If the problem lies in _query_tasks, then it should be addressed there.

  				
Posted one year ago by SmugDolphin23

The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I search just [LTV], it works. If I search just Dataset Test, it works. But if I put them together, the search breaks. Now that I think about it, there are other oddities in the web interface that might be explained by bugs around using brackets in names.
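This pattern is what one would expect if the search text is interpreted as a regular expression. The sketch below uses Python's `re` module as a stand-in; that is an assumption, since the matcher behind the search may be a different regex engine, although bracket character classes behave the same way in common engines.

```python
import re

name = "[LTV] Dataset Test"

# "[LTV]" is a character class: it matches ONE character that is L, T, or V.
only_brackets = re.search("[LTV]", name)          # matches the "L" inside the name
only_text = re.search("Dataset Test", name)       # plain text, matches literally
combined = re.search("[LTV] Dataset Test", name)  # needs L, T, or V right before " Dataset Test"

print(only_brackets is not None)  # True
print(only_text is not None)      # True
print(combined is not None)       # False: the character before the space is "]"

# Escaping the brackets makes the whole name match literally.
escaped = re.search(re.escape(name), name)
print(escaped is not None)        # True
```

Each half works on its own, but the combined pattern requires one of L, T, or V immediately before the space, and in the actual name that character is "]".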

  				
Posted one year ago by EnthusiasticCow4

Yes, it indeed appears to be a regex issue. If I run:

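import re
from clearml import Dataset

# Escape the name so "[LTV]" is matched literally, not as a regex character class
Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)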

It works as expected. I'm not sure how raw you want to leave the partial_name features. I could create a PR to fix this but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at Task._query_tasks() level?
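One way to frame that choice is a flag that decides whether the search text is literal or a raw pattern. The sketch below is purely illustrative: `match_names` and its `literal` parameter are hypothetical names, not part of ClearML, and the function filters an in-memory list rather than querying a server.

```python
import re
from typing import Iterable, List


def match_names(names: Iterable[str], partial_name: str, literal: bool = True) -> List[str]:
    """Return the names that contain partial_name.

    With literal=True the text is escaped first, so characters such as
    "[" and "]" are matched as-is. With literal=False the text is used
    as a raw regular expression.
    """
    pattern = re.escape(partial_name) if literal else partial_name
    return [n for n in names if re.search(pattern, n)]


# Hypothetical dataset names, for illustration only.
names = ["[LTV] Dataset Test", "[LTV] Other", "Dataset Test"]

literal_hits = match_names(names, "[LTV] Dataset Test")
raw_hits = match_names(names, "[LTV] Dataset Test", literal=False)

print(literal_hits)
print(raw_hits)
```

Escaping close to the caller keeps raw regex search available to anyone relying on it, while escaping at a deeper level would change behavior for every caller. Which trade-off is preferable is a call for the maintainers.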

  				
Posted one year ago by EnthusiasticCow4
