I don't mind assigning to the task the same name that I'd assign to the dataset. I just think that the create function should expect dataset_name to be None in the case of use_current_task=True (or allow the dataset name to differ from the task name).
I think you are correct, at least we should output a warning that it is ignored ... I'll make sure we do 🙂
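A minimal sketch of the warning proposed above, for illustration only. `resolve_dataset_name` is a hypothetical helper, not part of the ClearML SDK, and the rule it encodes (the task name wins when `use_current_task=True`) is taken from this thread rather than from the library source.

```python
import warnings
from typing import Optional


def resolve_dataset_name(
    dataset_name: Optional[str],
    current_task_name: str,
    use_current_task: bool,
) -> Optional[str]:
    """Hypothetical helper: return the name the dataset would end up with."""
    if not use_current_task:
        # Normal path: the caller's dataset name is used as given.
        return dataset_name
    if dataset_name is not None and dataset_name != current_task_name:
        # The behavior discussed in the thread: the name is silently ignored,
        # so at least tell the user about it.
        warnings.warn(
            f"dataset_name={dataset_name!r} is ignored because "
            f"use_current_task=True; the dataset takes the task name "
            f"{current_task_name!r}",
            UserWarning,
        )
    return current_task_name
```

Passing `dataset_name=None` together with `use_current_task=True` stays silent, which matches the suggestion that the create function should expect `None` in that case.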
Yep, the automagic only kicks in with Task.init... The main difference and the advantage of using a Dataset object is that the underlying Task resides in a specific structure that is used when searching based on project/name/version, but other than that, it should just work
Hi PanickyMoth78
dataset name is ignored if use_current_task=True
Kind of: it stores the Dataset on the Task itself (so dataset.name becomes the Task name). Actually, we should probably deprecate this feature; I think it is too confusing?!
What was the use case for using it?
I have a task where I create a dataset, but I also create a set of matplotlib figures, some numeric statistics and a pandas table that describe the data. I'd like these to be associated with the dataset and viewable from the ClearML web page for the dataset.
Oh sure, use https://clear.ml/docs/latest/docs/references/sdk/dataset#get_logger they will be visible on the Dataset page on the version in question
That sounds simple enough.
Though I imagine I'd need to explicitly report every figure. Correct?
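Since the automagic capture only kicks in with Task.init, explicit reporting through the dataset's logger could look roughly like the sketch below. `report_dataset_summary`, the "data summary" title and the "table" series name are hypothetical and chosen for illustration. The three logger methods it calls (`report_matplotlib_figure`, `report_single_value`, `report_table`) are assumed to be available on the Logger returned by `dataset.get_logger()` in your installed clearml version.

```python
from typing import Any, Mapping


def report_dataset_summary(
    logger: Any,
    figures: Mapping[str, Any],
    stats: Mapping[str, float],
    table: Any,
    iteration: int = 0,
) -> None:
    """Hypothetical helper: explicitly report figures, statistics and a table.

    `logger` is expected to be what dataset.get_logger() returns.
    """
    # Each matplotlib figure has to be reported explicitly, one call per figure.
    for name, figure in figures.items():
        logger.report_matplotlib_figure(
            title="data summary", series=name, figure=figure, iteration=iteration
        )
    # Numeric statistics are reported as single (non time-series) values.
    for name, value in stats.items():
        logger.report_single_value(name=name, value=value)
    # The table is passed through as-is (for example a pandas DataFrame).
    logger.report_table(
        title="data summary", series="table", iteration=iteration, table_plot=table
    )


# Usage (assumes `dataset` is a clearml Dataset, `fig` a matplotlib figure,
# and `df` a pandas DataFrame):
#   report_dataset_summary(dataset.get_logger(), {"histogram": fig},
#                          {"num_images": 1200}, df)
```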
here is what I do:
```
try:
    dataset = Dataset.get(
        dataset_project=bucket_name,
        dataset_name=dataset_name,
        dataset_version=dataset_version,
    )
    print(
        f"dataset found {dataset.project}/{dataset.name} v{dataset.version}\n(id: {dataset.id})"
    )
    return dataset
except ValueError:
    pass
task = Task.current_task()
if task is None:
    task = Task.init(
        project_name=bucket_name, task_name=dataset_name
    )
dataset = Dataset.create(
    dataset_name=dataset_name,  # has no effect
    dataset_project=bucket_name,
    dataset_version=dataset_version,
    output_uri=f"gs://{bucket_name}",
    description="cropped_images",
    use_current_task=True,
)
```
Having run this once, Dataset.get will find the dataset the next time around.
Just verified this against the code base; it should work out of the box 🙂 nothing to worry about
Yeah. I was only using the task for the process of creating the dataset.
My code does start out with a step that checks for the existence of the dataset, returning it if it exists (search by project name/dataset name/version) rather than recreating it.
I noticed the name mismatch when that check kept failing me...
I think that init-ing the encompassing task with the relevant dataset name still allows me to search for the dataset by dataset_name=task_name / project_name (shared by both dataset and task) / dataset_version.
So I guess I'll switch back to initiating a task (with the dataset name as the task name) and setting use_current_task=True in Dataset.create().
Does that alleviate the concern around "The main difference and the advantage of using a Dataset object is the underlying Task resides in a specific structure that is used when searching based on project/name/version"?
I think your use case is the original idea behind "use_current_task" option, it was basically designed to connect code that creates the Dataset together with the dataset itself.
I think the only caveat in the current implementation is that it should "move" the current Task into the dataset project / set the name. wdyt?
Hmm interesting...
Of course you can do: dataset._task.connect(...)
But maybe it should be public?!
How are you using that (I mean in the context of a Dataset)?
I was doing it with the task that I had been using. Mostly for logging arguments that control what the dataset will contain.
hmm.
This isn't supported though: dataset_args = dataset.connect(dataset_args)
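Until a public method exists, the workaround mentioned earlier in the thread can be wrapped in a small helper. This is only a sketch: `connect_dataset_args` is a hypothetical name, and it relies on the private `_task` attribute, which may change between clearml releases.

```python
from typing import Any, Dict


def connect_dataset_args(dataset: Any, dataset_args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: connect arguments to the Task backing a Dataset."""
    # `_task` is a private attribute (see dataset._task.connect(...) above),
    # so fail with a clear message if it is missing.
    task = getattr(dataset, "_task", None)
    if task is None:
        raise AttributeError("dataset has no underlying _task to connect arguments to")
    # Task.connect returns the connected object, which should be used from here on.
    return task.connect(dataset_args)


# Usage (assumes `dataset` is a clearml Dataset):
#   dataset_args = connect_dataset_args(dataset, {"crop_size": 224})
```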