VexedCat68, correct. But not only argparse: the entire configuration section 🙂
VexedCat68, actually a few users have already suggested that we auto-log the dataset ID used as an additional configuration section, wdyt?
That is true. If I'm understanding correctly, by configuration parameters you mean using argparse, right?
Basically, I'm trying to figure out how much of the tracking and record keeping ClearML does for me, and what I need to keep track of manually in a database.
VexedCat68, do you mean whether it tracks which version was fetched, or whether it tracks every time a version is fetched?
I'm not sure about auto-logging, since you might be using different datasets, or you might get a dataset but not use it depending on specific conditions. However, as a developer who chooses ClearML and sees it as an ecosystem rather than just a continuous training pipeline, I would want as many aspects of the MLOps process and the information around the experiment as possible to be logged within ClearML, without having to use external databases or libraries.
Yes, I was referring to logging the "clearml-data" Dataset ID on the Task itself, not an external database.
Does that make sense?
Let me try to be a bit more clear.
Say I have a training task in which I'm getting multiple ClearML Datasets by their dataset IDs. In that script I get local copies, train the model, save the model, and delete the local copies.
Does ClearML keep track of which dataset versions were fetched from ClearML Data and used?
It does to me. However, I'm proposing a situation where a user gets N datasets using Dataset.get but uses only M of them for training, where M < N. Would it make sense to log only the M datasets that were used for training? How would that be done?
VexedCat68, that's a good question! I'm not sure that ClearML keeps track of that; I need to check.
However, I think a neat solution could be using the datasets as task configuration parameters. This way you can track which datasets were used and you can set up new runs with different datasets.
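A rough sketch of that idea, not an official recipe: connect the candidate dataset IDs to the Task as a configuration section, then connect a second section holding only the ones that were actually used. The aliases, placeholder IDs, project/task names, section names, and the `pick_used_datasets` helper are all made up for illustration; only `Task.init`, `task.connect`, `Dataset.get`, and `get_local_copy` are ClearML calls.

```python
from typing import Dict, List


def pick_used_datasets(candidates: Dict[str, str], wanted: List[str]) -> Dict[str, str]:
    """Keep only the candidate datasets (alias -> dataset ID) that training will use."""
    return {alias: ds_id for alias, ds_id in candidates.items() if alias in wanted}


def train_with_tracked_datasets() -> None:
    # Imported lazily so the helper above works without ClearML installed.
    from clearml import Dataset, Task

    # Hypothetical project and task names.
    task = Task.init(project_name="examples", task_name="train with tracked datasets")

    # Hypothetical aliases and placeholder IDs; replace with real dataset IDs.
    candidates = {
        "images": "<dataset-id-1>",
        "labels": "<dataset-id-2>",
        "extra": "<dataset-id-3>",
    }
    # Connecting the dict logs it on the Task as a configuration section,
    # so a cloned run can be pointed at different dataset IDs.
    candidates = task.connect(candidates, name="Datasets")

    # Log only the M datasets that are actually used out of the N candidates.
    used = pick_used_datasets(candidates, wanted=["images", "labels"])
    task.connect(used, name="DatasetsUsed")

    for alias, ds_id in used.items():
        local_path = Dataset.get(dataset_id=ds_id).get_local_copy()
        print(f"{alias}: using local copy at {local_path}")
        # ... train on local_path here ...
```

Calling `train_with_tracked_datasets()` from the training script should leave both sections visible on the Task, so the experiment itself records which dataset IDs were available and which were used.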