Hey AgitatedDove14 , thanks for the feedback, I appreciate you taking the time to explain your position on all these points. Will do my best to address your feedback:
Open Source License: I agree about the license and that others are using it. Note that we didn’t write that you are not open source (as opposed to, say, wandb or comet). The purpose of the table and post is to help people decide which tools to use, and an SSPL license may be a significant consideration for some, so we thought it was important to point this out clearly.
Platform & Language Agnostic: I agree with your statement about TB and MLflow, and that’s why we also didn’t give them a green checkmark here. All the tools have a theoretical API, but if they don’t make it easy to use from any programming language out of the box, like DVC does, they don’t meet the criterion we intended.
Experiment Data Storage: I re-read this, and I think we chose the name of this criterion poorly. If we treat storage as “where is the raw experiment data saved”, then all cloud solutions are local as well (they all have some folder where raw data is saved, but it’s not really parsable in a meaningful way) and the criterion kind of loses its point. What we meant is: where can I view the data in a way that lets me make sense of it? We will probably rename it to something like “Simple accessible data”.
Easy-to-setup: Let me start by saying I agree with the point about KF being significantly harder to install, and we indeed gave it the worst score on this front. What we had in mind when comparing this is, for example, MLflow, where installing it amounts to pip install mlflow and running it is mlflow ui. I agree that it is really important for data scientists to work with docker (I even wrote https://dagshub.com/blog/setting-up-data-science-workspace-with-docker/ ), but it is still a more advanced process for many people. Again, the thought is “Can I guarantee someone with no DevOps experience can use this tool?”
Scalability: I agree with your entire point, and we also gave you the checkmark here. If I missed something, please let me know.
To summarize, I honestly feel like you don’t come off less favorably than anyone else mentioned in the article. I’m open to continuing the discussion, as it advances our understanding of the field, and we’re also open to being convinced otherwise. I really appreciate the feedback.
Always great to have people joining the conversation, especially if they are the decision makers, a.k.a. the ones who can amend mistakes 🙂
If I can summarize a few points here (and feel free to fill in / edit any mistake or leftovers)
Open-Source license: This is basically the MongoDB license, which is as open as possible while still offering some protection against giants like Amazon stealing APIs (like they did for both MongoDB and Elasticsearch).
Platform & language agnostic: ClearML is definitely Python oriented; that said, it is totally platform agnostic: Linux, Windows, and macOS are supported. For that matter I'm pretty sure TB/MLflow etc. are the same (or I might have misunderstood the definition of "platform agnostic").
Experiment Data storage: local storage (i.e. shared folders), on-prem object storage (e.g. minio / ceph) and cloud object storage (s3 / gs / azure) are all supported.
Easy-to-setup: Considering that docker-compose is the de facto standard for spinning up a server (with both Windows and macOS supported, on top of the standard Linux distros) and that we also provide pre-built AMI and GCP images, I'd say that it is very easy to install (for comparison, just try to set up Kubeflow in a usable way, now that is challenging).
Scalable for large number of users: The plain vanilla install scales to a few million experiments, and the cloud-ready helm chart is unlimited (basically clusters of all the databases). We have experience with millions of experiments, hundreds of running agents and a few dozen UI users, all running smoothly.
wdyt? could we fix the comparison chart?
Hi! Dean from DAGsHub here 🙂
Saw this thread and thought it makes sense to respond. With the post you linked, we wanted to do an objective comparison based on people who have used the product. We realize all products evolve continuously so we’d be happy to update based on feedback to make it more accurate.
It’s also important to note that we were critical of all tools, including our own solution for experiment tracking and visualization.
We also honestly think these comparisons are helpful to the community at large, and we are not trying to knock anyone, just to give a clear (ml) picture 😉.
Hey AgitatedDove14 , thanks for the detailed response. I wanted to make time to respond to your points appropriately:
Apologies, I made a mistake, and the license name is SSPL, not SPSS (which is a tool for statistics by IBM). In any case, SSPL is a new license, which is considered problematic by many. There are a lot of examples of people complaining about the change in Elastic’s licensing, which they see as a move from Open Source to Source-Available (in some people’s minds this is closed source, but I don’t have a horse in that race). The definition of SSPL ( https://en.wikipedia.org/wiki/Server_Side_Public_License ) is Source-Available, but to be fair I think we’ll leave the checkmark and the warning and just write SSPL below, so people can make their own judgement. If no one cares about the license, then that info won’t deter anyone.
So since we write Platform and language agnostic, we agree? I explained above regarding language agnosticism, so hopefully that settles it.
A DB might be as good as JSON, or might not be. The criterion (or job to be done) is: I want to crack open a notebook with pandas to do smart analysis on my logged data easily, without resorting to reading API docs or using custom libraries. In DVC, that requires I parse CSVs. That’s easy. With MLflow, it’s a bit harder since each param and metric is saved to a separate file, with a format that is intuitive to understand at a glance. Admittedly, I couldn’t find any information on doing something equivalent with ClearML in your docs, so if you can point me to where it explains these things, we can evaluate whether it meets the criterion. Otherwise, I feel that the fact that I couldn’t find this is a good indicator that ClearML doesn’t meet this criterion.
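To make the “crack open a notebook” criterion concrete, here is a minimal sketch of the two cases described above, using only the Python standard library (pandas.read_csv would be the natural notebook equivalent for the first one). The function names and file names are hypothetical, and the per-metric file layout (one file per metric, each line being "timestamp value step") is an assumption modeled on MLflow’s local file store; check your own version’s files before relying on it.

```python
import csv
from pathlib import Path


def read_csv_metrics(path):
    """Read a metrics CSV (header row + numeric columns) into a list of dicts."""
    with open(path, newline="") as f:
        return [{key: float(val) for key, val in row.items()}
                for row in csv.DictReader(f)]


def read_per_metric_files(metrics_dir):
    """Read a directory that holds one plain-text file per metric.

    Assumed layout: the file name is the metric name, and every line is
    "timestamp value step" (step defaults to 0 if the column is missing).
    """
    rows = []
    for metric_file in sorted(Path(metrics_dir).iterdir()):
        for line in metric_file.read_text().splitlines():
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            rows.append({
                "metric": metric_file.name,
                "timestamp": int(parts[0]),
                "value": float(parts[1]),
                "step": int(parts[2]) if len(parts) > 2 else 0,
            })
    return rows
```

Either function returns a flat list of records that can be handed straight to pandas.DataFrame for analysis; the difference is that the per-metric layout needs a small loop before the data is in one table.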
ClearML SaaS is not open source, so that would also not be an apples-to-apples comparison. You raise a good point, which is team support out of the box. It might make sense to add it as a criterion as well, since I think this is an important consideration for, well, teams. To be sure (I couldn’t find it in your documentation): does the Open Source ClearML come with RBAC? In any case it seems that we agree that TB and MLflow are easier to set up in the absolute sense, but might offer fewer capabilities compared to ClearML/DAGsHub.
I’m attaching a screen capture of your column. Not sure where you see question marks for ClearML?
Sorry, I missed the reply.
"I think we’ll leave the checkmark and the warning and just write SSPL below," Sounds like a good solution 👍
2. I have to admit, I would just write "language agnostic", but I will not insist further, so if you feel "platform" helps in explaining the reasoning, I'm with you.
3. "... to do smart analysis on my logged data easily, ..."
If this is the criterion, none of the options is very easy, but they all have an interface. I'm not sure how to compare them, but the title of the row is very, very confusing.
I would say, direct access to the logged metrics?! Again I have to admit, I can't see any reason why one would care, other than your point on ease of use, which is never as simple as one would like, but with all the options there, there is always an interface for accessing the data.
(clearml api functions below, but to your point, this is definitely worth adding an example to the docs, thanks!)
Bottom line, maybe change the row description, and highlight that your solution is easy to work with?
4. Basically you either pull wandb/comet/neptune etc. from this comparison or you mark everything with a check-mark.
The ClearML free hosting is equivalent to the other platforms in terms of RBAC, i.e. login, private/shared workspaces etc. (The free hosting is not the clearml demoapp public testing server; the free tier is a dedicated hosted solution for users.)
The open-source server is as easy as possible with docker-compose, that said, this is out of scope for this row comparison as half the columns are not installable.
If you want to allow details you have to add (SaaS)/(standalone) next to each cell, then you can support this mix
5. My bad, I was looking at the image in the post on your website, where it is still double question marks
Anyhow, I appreciate the willingness to fix the table; this is unfortunately truly unique 🙂
Thanks CynicalBee90, I appreciate the discussion! Since I'm assuming you will actually amend the misrepresentation in your table, let me follow up here.
SPSS license may be a significant consideration for some, and so we thought it was important to point this out clearly.
SPSS is fully open-source compliant unless you intend to sell it as a service. I hardly think this is a consideration for any user, just like anyone would use MongoDB or Elasticsearch without thinking twice. Basically, you should fix it to a green check mark.
2. I'm with TrickySheep9: it seems like all solutions are as platform agnostic as possible, and I can't actually see any difference in any of the columns. How is DVC more agnostic than MLflow, for example? I guess what you should have there is "programming language agnostic", in which case I think the table holds.
What we meant is where can I view it in a way that I can make sense of my data. We will probably rename it to something like “Simple accessible data”.
How is that different? Is storing JSON files better than API access and DB access? I do not get it. What am I missing here? Are you saying that TB local files are somehow more readable than a DB with an API? What's the value for a user?
Is a standalone, server-less installation what you are actually after?
4. pip install mlflow will not give multiple users access, so this is not apples-to-apples in terms of baseline requirements, i.e. more than a single user (basically a single user could just use TB, not ideal but usable; it all breaks when you start collaborating).
Again, the thought is “Can I guarantee someone with no DevOps experience can use this tool?”
Well, the ClearML free SaaS offering solves that as well (as is the case for wandb or comet, for that matter).
Basically all options are easy to set up: either they have pip install or they offer free SaaS...
5. Well, in the table I see two question marks, not a green check mark. Could you fix that?