Eureka! Then I can use ClearML-Data with it properly.
Thanks for the help. I'll try to continue working on the VM for now.
Yeah I think I did. I followed the tutorial on the repo.
'dataset' is the name of my Dataset object.
Thank you for the help.
I'm not sure what the dataset task is. I mainly just created the dataset using ClearML's Dataset.create().
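Roughly like this (the names are placeholders):

```python
from clearml import Dataset

# Create a new dataset version under a project
dataset = Dataset.create(
    dataset_name="raw_data",        # placeholder name
    dataset_project="my_project",   # placeholder project
)

# Add the local files and push them to the configured storage
dataset.add_files(path="./raw_data_folder")
dataset.upload()
dataset.finalize()   # close the version so it can be used or published
```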
My current approach is to watch a folder; when there are sufficient data points, move N of them into another folder, create a raw dataset from them, and call the pipeline with this dataset.
It gets downloaded, preprocessed, and then uploaded again.
In the final step, the preprocessed dataset is downloaded and used to train the model.
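A rough sketch of that preprocessing step, with placeholder names for the project, dataset, and the preprocessing itself:

```python
from clearml import Dataset

def preprocess(folder: str) -> str:
    # placeholder: the real preprocessing would transform the files here
    return folder

def preprocess_step(raw_dataset_id: str) -> str:
    # Download the raw dataset (cached, read-only copy)
    raw = Dataset.get(dataset_id=raw_dataset_id)
    local_path = raw.get_local_copy()

    processed_path = preprocess(local_path)

    # Upload the result as a new child dataset version
    processed = Dataset.create(
        dataset_name="preprocessed_data",
        dataset_project="my_project",
        parent_datasets=[raw_dataset_id],
    )
    processed.add_files(path=processed_path)
    processed.upload()
    processed.finalize()
    return processed.id
```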
AgitatedDove14 Just wanted to confirm: what kind of file is the string artifact stored in? A txt file or a pkl file?
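For context, this is roughly the usage I mean (project and artifact names are placeholders):

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="artifact demo")

# Upload a plain Python string as an artifact
task.upload_artifact(name="run_summary", artifact_object="all good")

# Later (e.g. from another script), fetch it back from the stored task
stored = Task.get_task(task_id=task.id).artifacts["run_summary"].get()
print(stored)
```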
Basically, when I have to re-run the experiment with different hyperparameters, I should clone the previous experiment and change the hyperparameters before putting it in the queue?
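Something like this is what I have in mind, if I understand correctly (the task id, parameter name, and queue are placeholders):

```python
from clearml import Task

template = Task.get_task(task_id="<previous-experiment-id>")
cloned = Task.clone(source_task=template, name="rerun with new lr")

# Override only the hyperparameters I want to change
cloned.set_parameters({"General/learning_rate": 0.001})

# Put the clone in an execution queue for an agent to pick up
Task.enqueue(cloned, queue_name="default")
```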
You mean I should set it to this?
To me it still looks like the only difference is that the non-mutable copy is downloaded to the cache folder, while the mutable copy is downloaded to the directory I want. I could delete files from both, so it seems like it's up to the user to make sure not to mutate the non-mutable copy in the cache folder.
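For reference, this is the comparison I mean (the dataset id and target folder are placeholders):

```python
from clearml import Dataset

ds = Dataset.get(dataset_id="<dataset-id>")

# Cached copy, meant to be treated as read-only; ClearML manages its location
cached_path = ds.get_local_copy()

# Writable copy placed where I choose; safe to modify or delete
working_path = ds.get_mutable_local_copy(target_folder="./working_data")

print(cached_path, working_path)
```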
Also, the steps say that I should run the serving process on the default queue, but I've run it on a queue I created called the serving queue and have an agent listening to it.
I understand your problem. I think you can normally specify where you want the data to be stored in a conf file somewhere; people here can guide you better. However, in my experience it uploads the data and stores it in its own format.
Sorry for the late response. Agreed, that can work, although I would prefer a way to access the data by the number of batches added (the last M batches) instead of by a certain range, since these cases aren't interchangeable. Also, a simple thing that can be done is to create an empty Dataset at the start and then make it the parent of every dataset you add.
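Something like this is what I mean by the empty-parent idea (names are placeholders, and I'm assuming ClearML is fine with finalizing a dataset that has no files):

```python
from clearml import Dataset

# Created once, with no files, just to act as a common ancestor
root = Dataset.create(dataset_name="root", dataset_project="my_project")
root.upload()
root.finalize()

# Every new batch is added as a child of that root
batch = Dataset.create(
    dataset_name="batch_001",
    dataset_project="my_project",
    parent_datasets=[root.id],
)
batch.add_files(path="./new_batch")
batch.upload()
batch.finalize()
```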
Basically, when I load the model with InputModel, it loads fine, but I can't seem to get a local copy.
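Roughly what I'm doing (the model id is a placeholder):

```python
from clearml import InputModel

model = InputModel(model_id="<model-id>")

# I expect this to return a local path to the downloaded weights file
weights_path = model.get_local_copy()
print(weights_path)
```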
Yeah exact same usage.
Here's the screenshot TimelyPenguin76
Here they are. I've created and published the dataset. Then when I try to get a local copy, the code works, but I'm not sure how to proceed in order to actually use that data.
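Roughly what I have so far (dataset and project names are placeholders):

```python
import os

from clearml import Dataset

dataset = Dataset.get(dataset_name="raw_data", dataset_project="my_project")

# Read-only cached folder containing the dataset files
local_path = dataset.get_local_copy()

# From here it is just files on disk, e.g. list them and feed them to a loader
for file_name in sorted(os.listdir(local_path)):
    print(os.path.join(local_path, file_name))
```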
I get what you're saying. I was considering training on just the new data to see how it works; to me that felt like the fastest way to deal with data drift, although I understand it may introduce instability. I'm curious how other developers who have successfully set up continuous training deal with it: 100% new data, or a ratio between new and old data? And if it's the latter, which should be the majority, old data or new data?
Honestly, anything. I tried looking it up on YouTube, but there's very little material there, especially anything up to date. That's understandable given that ClearML is still in beta. I can look at courses and docs; I just want to be pointed in the right direction as to what I should look up and study.
It works; however, it shows the task as enqueued and pending. Note that I am using .start() and not .start_remotely() for now.
Anyway, I restarted the Triton serving engine.
We want to get a clearer picture here to compare versioning with ClearML Data vs our own custom versioning
I'd like to add an update to this: when I use schedule_function instead of schedule_task with the dataset trigger scheduler, it works as intended. It runs the desired function when triggered, and then sleeps again until the next time, since no other trigger was fired.
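For reference, this is roughly the schedule_function setup that works for me (the project name, trigger name, and retrain function are placeholders):

```python
from clearml.automation import TriggerScheduler

def retrain(dataset_task_id: str):
    # placeholder: this is where I kick off the training pipeline;
    # the callback receives the id of the dataset task that fired the trigger
    print(f"triggered by dataset task {dataset_task_id}")

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_dataset_trigger(
    schedule_function=retrain,
    trigger_project="my_project",   # project to watch for new dataset versions
    name="retrain-on-new-data",
)
scheduler.start()   # blocks and polls for new dataset versions
```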
I actually just asked about this in another thread; here's the link. It's asking about the usage of upload_artifact.
AgitatedDove14 Can you help me with this? Maybe something like storing the returned values in a variable outside the pipeline?