I'm a bit confused. It seems like something has changed with how ClearML handles recording datasets in tasks. It used to be the case that when I would create a dataset under a task, ClearML would record the ID of the dataset in the Hyperparameters/Datase

Answered

I'm a bit confused. It seems like something has changed with how ClearML handles recording datasets in tasks.

It used to be the case that when I created a dataset under a task, ClearML would automatically record the ID of the dataset in the Hyperparameters/Datasets section (see first attached image). I'd also get all the relevant datasets under the info section. Now I don't get either (see second image). Instead, I get a "General" section in hyperparameters that tells me how many files were changed but nothing about the ID of the dataset. From what I can tell, the task no longer stores the dataset ID anywhere, so I have no means to track the dataset with the task.

I did update from 1.11.0 to 1.11.1, but the problem persisted even after I reverted to 1.11.0. I'm using ClearML through the web client; I'm not self-hosting or using a SaaS.

I'll reply with a snippet with the code I'm using to construct the dataset. It's a bit of a long process because I do a series of things. Namely: check to see if a dataset exists with that name already, compare the local data to the remote data, and if there's a change, I upload the new dataset as a child of the last version.
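The "compare the local data to the remote data" step could be sketched roughly as below. This is only an illustration, not the code from this post: it assumes the remote dataset can be summarised as a mapping of relative file path to SHA-256 hash, and the helper names (sha256_of, local_manifest, diff_manifests) are hypothetical. How that remote mapping is obtained from ClearML is deliberately left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def local_manifest(root: Path) -> Dict[str, str]:
    """Map each file under `root` (as a POSIX-style relative path) to its hash."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(
    local: Dict[str, str], remote: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed, modified) relative paths, each sorted."""
    added = sorted(set(local) - set(remote))
    removed = sorted(set(remote) - set(local))
    modified = sorted(
        key for key in set(local) & set(remote) if local[key] != remote[key]
    )
    return added, removed, modified
```

If any of the three lists is non-empty, the data has changed and a new child version would be uploaded; otherwise the existing version can be reused.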

  				
Posted 2 years ago by EnthusiasticCow4


Answers 7

PRed: None

  				
Posted 2 years ago by EnthusiasticCow4

Alright, I'll try and put that together for Monday.

  				
Posted 2 years ago by EnthusiasticCow4

@EnthusiasticCow4 a PR would be greatly appreciated. If the problem lies in _query_tasks, then it should be addressed there.

  				
Posted 2 years ago by SmugDolphin23

Yes, it indeed appears to be a regex issue. If I run:

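import re
from clearml import Dataset

# Escape the brackets so partial_name matches the literal dataset name.
Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)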

It works as expected. I'm not sure how raw you want to leave the partial_name features. I could create a PR to fix this but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at Task._query_tasks() level?

  				
Posted 2 years ago by EnthusiasticCow4

The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I just search [LTV], it works. If I just search Dataset Test, it works. But if I put them together, the search breaks. Now that I think about it, there are other oddities in the web interface that might be explained by bugs around using brackets in names.
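One possible explanation, assuming partial_name is interpreted as a regular expression with semantics similar to Python's re module (an assumption; the server-side matcher may differ in details): [LTV] is read as a character class that matches a single L, T, or V, not the literal bracketed text. A minimal sketch of the three searches, where found is a hypothetical helper:

```python
import re

name = "[LTV] Dataset Test"  # the dataset name being searched for


def found(pattern: str, text: str = name) -> bool:
    """Return True if `pattern`, read as a regex, matches anywhere in `text`."""
    return re.search(pattern, text) is not None


# "[LTV]" is a character class: it matches the single "L" inside the name.
print(found("[LTV]"))  # True

# Plain text with no special characters behaves like a substring search.
print(found("Dataset Test"))  # True

# Combined, the pattern needs "L", "T", or "V" directly before " Dataset Test",
# but the name has "]" in that position, so nothing matches.
print(found("[LTV] Dataset Test"))  # False

# Escaping turns the brackets back into literal characters.
print(found(re.escape("[LTV] Dataset Test")))  # True
```

Under that assumption, each piece matches on its own for a different reason, and only the combination fails.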

  				
Posted 2 years ago by EnthusiasticCow4

I see. Thanks for the insight. That seems to be the case. I'm struggling a bit with datasets. For example, I'm not sure how I would trace the genealogy of a dataset that's used by traditional tasks and pipelines. I'll try to write something up about the challenges around that when I get the chance. But your comment revealed another issue:

It appears that the partial name matching isn't working. I'm unclear why this wouldn't match. In the attached photo you can see that the input for partial_name is '[LTV] Dataset Test', and you can see from the unfiltered search that there are many datasets titled identically. Yet with those search criteria I get 0 results. One would assume that a partial match would include perfect matches?

  				
Posted 2 years ago by EnthusiasticCow4

Hi @EnthusiasticCow4! Note that the Datasets section is created only if you get the dataset with an alias. Are you sure that number_of_datasets_on_remote != 0?
If so, can you provide a short snippet that would help us reproduce? The code you posted looks fine to me, so I'm not sure what the problem could be.
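For reference, getting a dataset with an alias might look like the sketch below. The wrapper name get_dataset_with_alias and the alias value in the usage comment are hypothetical; the alias argument to Dataset.get is the part this reply refers to. The link between the alias and the Datasets section is taken from the reply above, not verified here.

```python
def get_dataset_with_alias(dataset_id: str, alias: str):
    """Fetch a dataset so that the calling task can record it under `alias`."""
    # Imported inside the function so this sketch can be loaded without a
    # configured ClearML client.
    from clearml import Dataset

    # Assumption (from the reply above): the alias is what causes the dataset
    # ID to be recorded in the task's Hyperparameters/Datasets section.
    return Dataset.get(dataset_id=dataset_id, alias=alias)


# Hypothetical usage, inside a script that has already called Task.init():
# dataset = get_dataset_with_alias("<dataset id>", alias="training_data")
```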

  				
Posted 2 years ago by SmugDolphin23
