We are working hard on release 1.7; once that is out we will push an RC for review (I hope) 🙂
I'll give it a shot. Honestly, the SDK documentation for both InputModel and OutputModel is (sorry) horrible ...
I have to agree, we are changing this interface, I do not think it is good 😞
I'll give it a shot. Honestly, the SDK documentation for both InputModel and OutputModel is (sorry) horrible ...
Can't wait for the documentation revamping.
What if I have multiple files that are not in the same folder? (That is the current use-case)
I think you can do `weights_filenames=['a_folder/firstfile.bin', 'b_folder/secondfile.bin']`
(it will look for a common file path for both, so it retains the folder structure)
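To illustrate the "common file path" idea, here is a minimal standard-library sketch. It is not ClearML's implementation: `package_files`, `unpack` and `demo` are hypothetical helpers that only mimic the behavior described above, where files from different folders are zipped relative to their shared parent so the folder structure survives the round trip.

```python
import os
import tempfile
import zipfile


def package_files(filenames, zip_path):
    """Zip files relative to their common parent folder (hypothetical helper)."""
    absolute = [os.path.abspath(f) for f in filenames]
    # The deepest folder shared by all files becomes the archive root.
    common_root = os.path.commonpath([os.path.dirname(f) for f in absolute])
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in absolute:
            archive.write(path, arcname=os.path.relpath(path, common_root))
    return zip_path


def unpack(zip_path, target_dir):
    """Extract the archive and return the stored relative paths, sorted."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


def demo():
    """Round-trip two files that live in different folders."""
    with tempfile.TemporaryDirectory() as work:
        files = []
        for folder, name in [("a_folder", "firstfile.bin"),
                             ("b_folder", "secondfile.bin")]:
            os.makedirs(os.path.join(work, folder))
            path = os.path.join(work, folder, name)
            with open(path, "wb") as handle:
                handle.write(b"\x00")
            files.append(path)
        archive = package_files(files, os.path.join(work, "package.zip"))
        return unpack(archive, os.path.join(work, "extracted"))
```

Calling `demo()` returns the two relative paths with their folder names intact, which is the property the reply above relies on.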
Our workaround now for using a Dataset as we do, is to store the dataset ID as a configuration parameter, so it's always included too
Exactly, so with InputModel it's the same, only kind of built in 🙂
packages an entire folder as zip
What if I have multiple files that are not in the same folder? (That is the current use-case)
It otherwise makes sense I think 🙂
Our workaround now for using a Dataset as we do, is to store the dataset ID as a configuration parameter, so it's always included too 😉
This seems to only work for a single file (weights_path implies a single file, not multiple ones). Is that the case?
See `update_weights_package`, it actually packages an entire folder as a zip and will do the extraction when you get it back (check the function docstring; I think you can also specify wildcards etc. if needed)
Why do you see this as preferred to the dataset method we have now?
So it answers a few requirements that you raised
- It is fully visible as part of the project, as a separate entity
- When you clone a Task it will go with it (and will let you change it in the UI if needed)
- It is not actually data but additional required inputs to execute (closer to an input model than to a standalone dataset)
- It has a simple interface that does not require differential storage but allows multiple "versions" nonetheless
- It is coupled with the Task & Project and is not a "standalone" dataset
wdyt?
I'm not entirely sure I understand the flow but I'll give it a go. I have two final questions:
This seems to only work for a single file (weights_path implies a single file, not multiple ones). Is that the case?
Why do you see this as preferred to the dataset method we have now? 🤔
Why is it using an OutputModel and an InputModel?
So calling OutputModel will create the new Model entity and upload the data; InputModel will store it as a required input Model.
Basically on the Task you have an input & output section. When you clone the Task you are copying the input section into the newly created Task, and the assumption is that when you execute it, your code will create the output section.
Here, when you clone the Task you will be cloning the reference to the InputModel (i.e. your data), and it will always go with you.
wdyt?
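A toy model may make the input/output split easier to picture. This is only an illustration of the flow described above, not how ClearML is implemented: `ToyTask` and `ToyModel` are made-up classes, and the only behavior they capture is that cloning copies the input section (as references) and leaves the output section empty.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Stands in for a stored model/data package; not a ClearML class."""
    model_id: str


@dataclass
class ToyTask:
    """Toy task with an input and an output section; not a ClearML class."""
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

    def clone(self, name):
        # Only the input section is copied. The copied entries reference
        # the same model objects, so no data is duplicated.
        return ToyTask(name=name, inputs=dict(self.inputs))


original = ToyTask("original")
aux = ToyModel("aux-data-1")
original.outputs["my_auxiliary"] = aux  # created and uploaded once
original.inputs["my_auxiliary"] = aux   # registered as a required input

cloned = original.clone("cloned")
```

In this sketch `cloned.inputs["my_auxiliary"]` is the very same object as `aux`, while `cloned.outputs` starts out empty and would be filled by whatever the cloned task produces when it runs.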
Looks good! Why is it using an OutputModel and an InputModel?
LOL love that approach.
Basically here is what I'm thinking,
```
from clearml import Task, InputModel, OutputModel

task = Task.init(...)

# run this part once
if task.running_locally():
    my_auxiliary_stuff = OutputModel()
    my_auxiliary_stuff.system_tags = ["DATA"]
    my_auxiliary_stuff.update_weights_package(weights_path="/path/to/additional/files")
    input_my_auxiliary = InputModel(model_id=my_auxiliary_stuff.id)
    task.connect(input_my_auxiliary, "my_auxiliary")

task.execute_remotely()
my_auxiliary_path = task.models["input"]["my_auxiliary"].get_weights_package(return_path=True)
```
I might have some typos but it should do the trick.
You will have a "Model" with all your auxiliary data, and when you clone the Task it will copy the reference to the data. But when you delete a Task, it will not by default delete the Model (aka data)
WDYT?
I'm not sure what you mean by "entity", but honestly anything works. We're already monkey-patching our way 😄
Hmm, maybe the right way to do so is to abuse "models", which have an entity: you can specify a system_tag on them, they can store a folder (and extract it if you need), they are in projects, and they are cloned and can be changed.
wdyt?
I commented on your suggestion to this on GH. Uploading the artifacts would happen via some SDK before switching to remote execution.
When cloning a task (via WebUI or SDK), a user should have an option to also clone these input artifacts or simply link to the original. If linking to the original, then if the original task is deleted - it is the user's mistake.
Alternatively, this potentially suggests "Input Datasets" (as we're imitating now), such that they are not tied to the original task. These can also hold references to all tasks that use them, so deleting them would be made harder
A definite maybe, they may or may not be used, but we'd like to keep that option
The precursor to the question is the idea of storing local files as "input artifacts" on the Task, which means that if the Task is cloned the links go with it. Let's assume for a second this is the case, how would you upload these artifacts in the first place?
can I assume these files are reused
A definite maybe, they may or may not be used, but we'd like to keep that option 🙃
Maybe the "old" way Datasets were shown is better suited?
It was, but then it's gone now 😞
I see your point, this actually might be a "bug"?!
I would say so myself, but could be also by design..?
Awesome, I'll ask Product to reach out
LMK, happy to help out!
I know our use case is maybe a very different one, but generalizing from it would surely be beneficial 🙂
Yes. Because my old https://github.com/allegroai/clearml/issues/395 has never been resolved (though closed), we use the dataset object to upload e.g. local files needed for remote execution.
Ohh, now I remember... Following this line, can I assume these files are reused, i.e. this is not "per instance"? I have to admit that I have a feeling this is a very unique use case, and maybe the "old" way Datasets were shown is better suited?
No, I mean why does it show up in the task view (see attached image), forcing me to click twice on the same project name.
I see your point, this actually might be a "bug"?!
I'm a bit at a loss for words in describing this. Would be happy to show quickly via e.g. a Slack call/huddle.
Awesome, I'll ask Product to reach out 🙂
Basically you have the details from the Dataset page, why should it be mixed with the others ?
Because maybe it contains code and logs on how to prepare the dataset. Or maybe the user just wants increased visibility for the dataset itself in the tasks view.
why would you need the Dataset Task itself is the main question?
For the same reason as above. Visibility and ease of access. Coupling relevant tasks and dataset in the same project makes it easier to understand that they're linked together.
Not sure I can imagine one, can you provide an example?
Yes. Because my old https://github.com/allegroai/clearml/issues/395 has never been resolved (though closed), we use the dataset object to upload e.g. local files needed for remote execution. These are not the same as actual datasets, but can be reused and can be useful for introspection.
What you mean here is, if the dataset ".dataset" project is already hidden, why do we also "hide" the Tasks inside ?
No, I mean why does it show up in the task view (see attached image), forcing me to click twice on the same project name.
I'm a bit at a loss for words in describing this. Would be happy to show quickly via e.g. a Slack call/huddle.
Why does ClearML hide the dataset task from the main WebUI?
Basically you have the details from the Dataset page, why should it be mixed with the others ?
If I specified a project for the dataset, I specifically want it there, in that project, not hidden away in some .datasets hidden sub-project.
This may be a request for a "Dataset" tab under the project; why would you need the Dataset Task itself is the main question?
Not all dataset objects are equal, and perhaps not all of them should appear in the Datasets panel.
Not sure I can imagine one, can you provide an example?
If a dataset is already hidden - its project should not appear anywhere in the project view. Users anyway can't access it from the UI (since it's hidden), but now have additional clutter and require additional clicks to get to where they wanted.
What you mean here is, if the dataset ".dataset" project is already hidden, why do we also "hide" the Tasks inside?
Those are cool and very welcome additions (hopefully the additional info in the Info tab will be a link?) 😁
The main issue is the clutter that the forced renaming creates, as shown in the pictures I attached in the other thread.
Why does ClearML hide the dataset task from the main WebUI? Users should have some control over that. If I specified a project for the dataset, I specifically want it there, in that project, not hidden away in some .datasets hidden sub-project.
Not all dataset objects are equal, and perhaps not all of them should appear in the Datasets panel.
If a dataset is already hidden - its project should not appear anywhere in the project view. Users anyway can't access it from the UI (since it's hidden), but now have additional clutter and require additional clicks to get to where they wanted.
For now we've monkey-patched it to our usecase:
LOL, that's a cool hack
That gives us the benefit of creating "local datasets" (confined to the scope of the project, do not appear in Datasets tabs, but appear as normal tasks within the project)
So what would be a "perfect" solution here?
I think I'm missing the point on why it became an issue in the first place.
Notice that in new versions Datasets will be registered on the Tasks that use them (they are already there in the Info tab, and will be part of the configuration as well, so that you can override them if you wish when running remotely).
The second point is to better highlight the "creating Task" of a dataset, so that the preprocessing code is more visible in the Dataset UI.
What else am I missing ?
That gives us the benefit of creating "local datasets" (confined to the scope of the project, do not appear in Datasets tabs, but appear as normal tasks within the project)
For now we've monkey-patched it to our usecase:
```
Dataset._Dataset__hidden_tag = "active"

def foo(cls, dataset_project, dataset_name):
    dataset_project = dataset_project or "Datasets"
    return dataset_project, dataset_project.rpartition("/")[0]

Dataset._build_hidden_project_name = foo
```
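For readers wondering what the replacement function returns, the same logic can be tried without ClearML installed. This is only a standalone copy of the function body above, minus the `cls` argument; the name `build_project_names` and the sample project names are made up here, and how ClearML consumes the two returned values is not shown in this thread.

```python
def build_project_names(dataset_project, dataset_name):
    """Standalone copy of the patched logic, without the cls argument."""
    # Fall back to a top-level "Datasets" project when none is given.
    dataset_project = dataset_project or "Datasets"
    # Return the project unchanged, plus everything before its last "/"
    # (an empty string when the project has no parent).
    return dataset_project, dataset_project.rpartition("/")[0]


nested = build_project_names("Team/Experiments", "aux-files")
top_level = build_project_names("Solo", "aux-files")
default = build_project_names(None, "aux-files")
```

Here `nested` is `("Team/Experiments", "Team")`, while `top_level` and `default` both come back with an empty second element, since neither "Solo" nor "Datasets" contains a "/". In other words, the dataset keeps the project name it was given instead of being moved under a `.datasets` sub-project.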
AgitatedDove14 Basically the fact that this happens without user control is very frustrating - https://github.com/allegroai/clearml/blob/447714eaa4ac09b4d44a41bfa31da3b1a23c52fe/clearml/datasets/dataset.py#L191
Well, -ish. Ideally what we're after is one of the following:
- Couple a task with a dataset. Keep it visible in its destined location.
- Create a dataset separately from the task. Have control over its visibility and location. If it's hidden, it should not affect normal UI interaction (most annoying is having to click twice on the same project name when there are hidden datasets, which do not appear in the project view)
The current implementation (since 1.6.3 I think) creates the issues in the linked comment (with images to visualize).
Understood, basically the moment we add nested project view to the dataset (and pipelines for that matter, and both are already being worked on), it should solve everything. Is that correct?
Hi AgitatedDove14 !
Ah, thanks! I'll use the artifacts for linking.
We've forgone the "use current task" already because it indeed made things even more difficult (the task that was used is then automatically hidden by this automatic renaming of dataset tasks).
The current implementation (since 1.6.3 I think) creates the issues in the linked comment (with images to visualize).
Hi UnevenDolphin73
Is there an easy way to add a link to one of the tasks panels? (as an artifact, configuration, info, etc)?
You can add a link as an artifact, that is probably the easiest: `task.upload_artifact(name="just link", artifact_object="...")`
EDIT: And follow up regarding the dataset. As discussed somewhere previously, the datasets are now automatically moved to a hidden "sub-project" prefixed with .datasets. This creates several annoyances that I believe should be treated: ...
Yes, Datasets from the UI should be accessed from the Datasets tab (the .datasets etc. we can think about as implementation details)
That said, I think the main issue is what happens if you do "use current Task" for the dataset: then things become more complicated and less intuitive. Is this the correct context?