Great discussion, I agree with you both. For me, we are not using clearml-data, so I am a bit curious how a "published experiment" locks everything, including the inputs? (I assume someone could still just go into the S3 bucket and delete a file without ClearML noticing.)
From my experience, absolute reproducibility is code + data + parameters + execution sequence. For example, random seeds or parallelism can produce different results and can be tricky to deal with sometimes. We did build an internal system to ensure reproducibility: ClearML is the experiment tracking component, and we integrate it with Kedro for pipelines + parameters + data, so everything is tracked automatically.
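To illustrate the seed part, this is roughly the pattern I mean (just a sketch; the project/task names are made up, and the seed is logged via task.connect so it gets tracked with the experiment):

import random
import numpy as np
import torch
from clearml import Task

task = Task.init(project_name="demo", task_name="seed-tracking")  # made-up names

# Log the seed as a tracked parameter so the run can be reproduced later
params = task.connect({"seed": 42})

# Fix every source of randomness we know about
random.seed(params["seed"])
np.random.seed(params["seed"])
torch.manual_seed(params["seed"])
torch.cuda.manual_seed_all(params["seed"])

# Parallelism can still introduce non-determinism; this trades speed for determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False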
I have been thinking about replacing the data tracking component; our solution works fine but it is not the most efficient one. With gigabytes of artifacts generated in every experiment, we have an increasing need to do housekeeping regularly, so I am studying the best way to do that. "Tag" and "publish experiment" are what we are considering.
And as for clearml-data, I would love to have more examples, but I'm not 100% sure what to focus on, as using clearml-data is a bit... simple? In my completely biased eyes. I assume you're looking for workflow examples, and would love to get some inspiration 🙂
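Just so we're picturing the same thing, the basic flow I have in mind is roughly this (a sketch; the project/dataset names and paths are made up):

from clearml import Dataset

# Create a new dataset version (optionally parented on a previous version)
ds = Dataset.create(dataset_project="demo", dataset_name="my_dataset")

# Add a local folder, upload the compressed content, and lock the version
ds.add_files(path="./data/raw")
ds.upload()
ds.finalize()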
AnxiousSeal95
I think I can definitely see value in that.
I found that once you go beyond the easy examples, where you are largely using datasets that are curated as part of a Python package, it took a bit of effort to get my head around the dataset tools.
Likewise with the deployment side of things and the Triton inference engine: there are certain aspects of that which I am relatively new to, so going from the simple Keras example to feeling confident that the tool will cover the use cases we may encounter is quite a lot of work.
To some extent there is no substitute for self-learning and spending time working out how to use a given system, so I fully realize that to become knowledgeable enough to draw comparisons, and to conclude whether the platform would add value to our process if we adopted it, you have to get your hands reasonably dirty.
Having said all that, I definitely see how spending time with an expert user of the system would help shortcut some of that learning, and likely give a more representative impression of how the system would fit our potential use cases and what it is capable of doing.
BTW, I suggest for new questions, just ask in the clearml-community. I'm really happy to help but I almost missed this message 😄
VivaciousPenguin66 What are your thoughts on Prefect? There are so many pipeline libraries and I am not sure how they differ. I have experience with Airflow. With Kedro, we were hoping that data scientists would write the pipelines themselves, with minimal effort needed to hand them over to another engineer. For serious production (where we need to scale), we are considering converting Kedro pipelines to Airflow; there are plugins to do that, though I am not sure how mature they are.
EnviousStarfish54 we are at the beginning phases of exploring potential solutions to MLOps, so I have only been playing with the tools, including the dataset side of things. However, I think that an integral part of capturing a model in its entirety is being able to make sure you know what went into making it. So I see being able to version and diff datasets as just as important as the code, or the environment in which it is run.
Yup, I am only really familiar with the experiment tracking part, so I don't think I'll have a good understanding until I have reasonable knowledge of the entire ClearML system.
VivaciousPenguin66 How are you using the dataset tool? I'd love to hear more about that.
Or even better dataset_v1.2.34_alpha_with_that_thingy_change_-2_copy_copy.zip
I think I tend to view it as "some is better than none"!
Oh, I did not realize I asked this in an old thread, sorry about that.
AnxiousSeal95 At first sight, the pipeline logic of ClearML seems quite tightly bound to ClearML. Back then I figured I needed something that could easily be converted into a production pipeline (e.g. Airflow DAGs); since we need pipelines not just for experiments, Airflow seemed to be the default choice.
Also, clearml-data was not available when we started developing the internal framework. As for clearml-agent, from my previous experience it sometimes does not work great on Windows, and the logic that captures the Python environment often fails.
To use clearml-agent, data versioning needs to come first, since the agent needs to know where to get the data. It also needs to cache the environment/data so that the overhead of running experiments is as low as possible. Maybe using ClearML pipelines would make it easier (a sketch of the data-fetch pattern I mean is below).
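For context, the fetch-with-cache pattern I would expect on the agent side is something like this (just a sketch of how I understand the clearml Dataset API; the names are made up):

from clearml import Dataset

# The running code only needs the dataset name/project; the local copy is cached,
# so repeated runs on the same machine should not re-download the data
dataset = Dataset.get(dataset_project="demo", dataset_name="my_dataset")
data_dir = dataset.get_local_copy()
print(data_dir)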
At the time I tried the ClearML pipelines, they did not look mature yet, so I didn't spend too much time on them. And since we already have a working solution, we don't have much motivation to switch yet.
For me, a real case study/tutorial that puts all the ClearML components together would provide some context on how ClearML thinks things are best structured. If you know a good source of documentation, please let me know! When I look it up, most of the time the documentation is a snippet that achieves one specific thing.
Cool, versioning the diffs is useful. It also depends on the kind of data. For example, for tabular data a database might be a natural choice, although integrating it and keeping track of the metadata can be tricky, while images are probably better suited to blob storage on a per-file basis.
AnxiousSeal95, I would also warmly second what EnviousStarfish54 says regarding end-to-end, real-world case studies, with a dataset that is more realistic than, say, MNIST or the like, so it is easier to see how to structure things.
I understand one of the drivers has been flexibility, with robustness when you need it; however, as a reference point from the people who made it, examples of how you, the creators, would structure things would help our thinking about how we might use it. We may of course decide to use it in different ways, but it at least gives a reference.
I think the best model name is person_detector_lr0.001_batchsz32_accuracy0.63.pkl 😄
VivaciousPenguin66 This is very true! We are trying to explain the benefits of this method. Some people like it, and some people prefer the flexibility. We do have our philosophy in mind when we create "best practices" and, obviously, features for ClearML, but ultimately people should do what makes them the most productive!
If we are getting philosophical, I think it's the state of the industry and as it progresses, these standard methods would become more prominent.
Also, to add to what you wrote, the difference between ML and SW engineering is that a model isn't just the code: once you decouple the parameters, code and data (the way we do), you have these 3 components to track, so the model is the result of a combination of all 3.
I have been using this line to prevent experiments from accidentally being sent to the public server (I have my own custom self-hosted server): Task.set_credentials("PLACEHOLDER", "PLACEHOLDER", "PLACEHOLDER")
However, when I upgraded from 0.17.5 to >1.0.0, weird stuff started to happen.
Since upgrading from v0.17.5 to >1.0.0, it has had an issue replacing the credentials.
Expected Behavior:
The conf should replace the "PLACEHOLDER" values if the conf file exists; otherwise it should fail the experiment.
What happened:
The experiment fails when the conf file exists. I can guarantee the conf file is valid, since if I remove Task.set_credentials("PLACEHOLDER", "PLACEHOLDER", "PLACEHOLDER"),
the experiment is initialized successfully and sent to my custom domain. This indicates the conf file is read, but the credentials are not being replaced.
EnviousStarfish54 BTW, as for absolute reproducibility, you are obviously right. If you use S3 to store the data and you change the data in S3, then we can't catch it.
Our design compresses (zips) the files and stores them as a version somewhere. If that gets modified then you are trying really hard to break stuff 🙂 (although you can). This is not the most space-efficient approach when it comes to images / videos; for those you can save links instead, but I think that's only in the enterprise version. Then again, you usually don't modify images / videos (you rather delete / add new ones).
lol...... mine is best_model_20210611_v1.pkl
and better_model_20210611_v2.pkl
or best_baseline_model_with_more_features.pkl
EnviousStarfish54 interesting thoughts, thank you for sharing.
We are looking at a hybrid platform like you, but have chosen Prefect for the pipeline orchestration, and we are considering what system to adopt for experiment and model tracking, and ease of deployment.
Hi EnviousStarfish54, if you don't want to send info to the server, I suggest you set an environment variable; that way, as long as the machine has this env var set, it won't send anything to the server.
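For example, something along these lines (a sketch with placeholder values; setting the standard ClearML env vars before importing clearml points the SDK at your own self-hosted server instead of the public one):

import os

# Point the ClearML SDK at the self-hosted server (placeholder values)
os.environ["CLEARML_API_HOST"] = "https://api.my-clearml.example.com"
os.environ["CLEARML_WEB_HOST"] = "https://app.my-clearml.example.com"
os.environ["CLEARML_FILES_HOST"] = "https://files.my-clearml.example.com"
os.environ["CLEARML_API_ACCESS_KEY"] = "PLACEHOLDER"
os.environ["CLEARML_API_SECRET_KEY"] = "PLACEHOLDER"

from clearml import Task

task = Task.init(project_name="demo", task_name="guarded-run")  # made-up names

In practice you'd usually export these at the machine or CI level rather than in code, so the guard applies to every script that runs there.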
EnviousStarfish54 VivaciousPenguin66 Another question, if we're in a sharing mood 😉 What do you think of a video / audio session with one of our experts, where you present a problem you're having (let's say large artifacts) and they try to help you, or even give some example code / code skeleton? Would something like that be of interest? Would you spend time on such a monthly session?
We all remember the days of dataset_v1.2.34_alpha_with_that_thingy_change_-2.zip
EnviousStarfish54 VivaciousPenguin66 So for random seeds, we have a way to save them, so this should be possible and reproducible.
As for the execution sequence, I totally agree. We do have our own pipelining solution, but I see it's very common to use us only for experiment tracking and other tools for pipelining.
Not trying to convert anyone, but may I ask why you chose to use another tool and not the built-in pipelining feature in ClearML? Anything missing? Or did you just already have the infra built and didn't want to convert? Or something else?
A bit of advertisement here (I don't feel bad, as it IS the ClearML slack 😄 ): we tried to design pipelines so that data scientists could write them themselves and then execute them with agents (which should abstract away the DevOps setup). I'd like to know if and where we failed in that mission 😮 .
As for cleanup, doesn't a stage in the pipeline that removes unnecessary artifacts at the end of the run make sense? Or some service that runs once a week and, for anything older than X days, removes the associated data from storage? A rough sketch of what I mean is below.
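Something like this, roughly (a sketch only, not tested; the project name and retention window are placeholders, and the filtering details may vary between versions):

from datetime import datetime, timedelta
from clearml import Task

RETENTION_DAYS = 30  # placeholder retention window
cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

# Fetch completed tasks in a project (filter semantics may differ per version)
tasks = Task.get_tasks(project_name="demo", task_filter={"status": ["completed"]})

for task in tasks:
    last_update = task.data.last_update  # server-side timestamp of the last change
    if last_update and last_update.replace(tzinfo=None) < cutoff:
        # Remove the task together with its stored artifacts and models
        task.delete(delete_artifacts_and_models=True)

You could run it as a weekly service, or as a final pipeline step that only cleans up its own run.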
It's good that you version your dataset by name; I have seen many trained models where people just replace the dataset directly.
i.e. some files on a shared drive, then someone silently updates the files, all the experiments become invalid, and no one knows when that happened.
AnxiousSeal95 absolutely agree with you!
When you are put in a situation where a production model has failed, or is not performing as expected, and you as a company derive revenue from that service, you very quickly have to diagnose how severe the problem is and what is potentially causing it. As you clearly point out, the degrees of freedom behind why a given model may behave differently include the code itself, the data, the pre-processing steps, the training parameters and the deployment environment itself.
The ability to lock all of this down by publishing, as well as being able to diff an entire experiment against another, is very powerful, and immediately helps to narrow down the potential avenues of investigation for the cause of a model failure.
I have been there before: a version of numpy was bumped and the setup.py file didn't specify the exact version, so when a deployment environment was recreated it didn't install the same version of numpy, which led to diverging answers. And you won't be surprised to hear that during the development of a model, checking package versions is not the first thing you do; you assume it's the code, or the data!
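i.e. the fix is just pinning the dependency in setup.py, something like this (the package name and version numbers are purely illustrative):

from setuptools import setup, find_packages

setup(
    name="my_model_package",  # made-up package name
    version="0.1.0",
    packages=find_packages(),
    # Pinning the exact version keeps a recreated deployment environment consistent;
    # an open-ended "numpy>=1.19" is the kind of requirement that let the rebuild drift
    install_requires=["numpy==1.19.5"],
)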
LOL Love this Thread and sorry I didn't answer earlier!
VivaciousPenguin66 EnviousStarfish54 I totally agree with you. We do have answers to "how do you do X or Y" but we don't have workflows really.
What would be a logical place to start? Would something like "training a YOLOv3 person detector on the COCO dataset and enabling continuous training (let's say adding the PASCAL dataset afterwards)" be interesting?
The only problem is the friction between atomic and big-picture. In atomic, I'm giving you a step-by-step guide on how to do things. That's very easy when I use a 100-line PyTorch MNIST script (which also auto-downloads a dataset that weighs like 10MB).
It's harder when it's a GIANT repository with lots of built-in preprocessing and a 20GB dataset.
We do have big-picture blog posts like https://clear.ml/blog/how-theator-built-a-continuous-training-framework-to-scale-up-its-surgical-intelligence-platform/ https://clear.ml/blog/good-testing-data-is-all-you-need-guest-post/ and https://clear.ml/blog/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail/ but I feel they are more philosophical thought pieces than stuff you can actually start working on tomorrow morning.
So my question to you is, what kind of examples would be helpful for you guys?