Hi PanickyMoth78
dataset name is ignored if
use_current_task=True
Kind of, it stores the Dataset on the Task itself (then dataset.name becomes the Task name), actually we should probably deprecate this feature, I think this is too confusing?!
What was the use case for using it ?
I have a task where I create a dataset but I also create a set of matplotlib figures, some numeric statistics and a pandas table that describe the data which I wish to have associated with the dataset and vieawable from the clearml web page for the dataset.
I don't mind assigning to the task the same name that I'd assign to the dataset. I just think that the create function should expect dataset_name
to be None in the case of use_current_task=True
(or allow the dataset name to differ from the task name)
I have a task where I create a dataset but I also create a set of matplotlib figures, some numeric statistics and a pandas table that describe the data which I wish to have associated with the dataset and vieawable from the clearml web page for the dataset.
Oh sure, use https://clear.ml/docs/latest/docs/references/sdk/dataset#get_logger they will be visible on the Dataset page on the version in question
I just think that the create function should expect
dataset_name
to be None in the case of
use_current_task=True
(or allow the dataset name to differ from the task name)
I think you are correct, at least we should output a warning that it is ignored ... I'll make sure we do 🙂
Oh sure, use
they will be visible on the Dataset page on the version in question
That sounds simple enough.
Though I imagine I'd need to explicitly report every figure. Correct?
Yep the automagic only kick in with Task.init... The main difference and the advantage of using a Dataset object is the underlying Task resides in a specific structure that is used when searching based on project/name/version, but other than that, it should just work
hmm.
this isn't supported though:dataset_args = dataset.connect(dataset_args)
Hmm interesting...
of course you can do:dataset._task.connect(...)
But maybe it should be public?!
How are you using that (I mean in the context of a Dataset)?
I was doing it with the task that I had been using. Mostly for logging arguments that control what the dataset will contain.
I think your use case is the original idea behind "use_current_task" option, it was basically designed to connect code that creates the Dataset together with the dataset itself.
I think the only caveat in the current implementation is that it should "move" the current Task into the dataset project / set the name. wdyt?
Yeah. I was only using the task for the process of creating the dataset.
My code does start out with a step that checks for the existence of the dataset, returning it if it exists (search by project name/dataset name/version) rather than recreating it.
I noticed the name mismatch when that check kept failing me...
I think that init-ing the encompassing task with the relevant dataset name still allows me to search for the dataset by dataset_name=task_name / project_name (shared by both dataset and task) / dataset_version.
So I guess I'll switch back to initiating a task (with the dataset name as the task name) and setting the use_current_task=True
in dataset create().
Does that alleviate the concern around:
The main difference and the advantage of using a Dataset object is the underlying Task resides in a specific structure that is used when searching based on project/name/version,
?
here is what I do:
` try:
dataset = Dataset.get(
dataset_project=bucket_name,
dataset_name=dataset_name,
dataset_version=dataset_version,
)
print(
f"dataset found {dataset.project}/{dataset.name} v{dataset.version}\n(id: {dataset.id})"
)
return dataset
except ValueError:
pass
task = Task.current_task()
if task is None:
task = Task.init(
project_name=bucket_name, task_name=dataset_name
)
dataset = Dataset.create(
dataset_name=dataset_name, # has no effect
dataset_project=bucket_name,
dataset_version=dataset_version,
output_uri=f"gs://{bucket_name}",
description=f"cropped_images",
use_current_task=True,
) `having run this once, the dataset.get will find the dataset the next time around
Just verified the with the code base, should work out of the box 🙂 nothing to worry about