If I understand you correctly, I think the purpose is the same whichever FS design is chosen. For example, I have seen both approaches: a data store, and declarative feature computation on demand.
The important part for a product is basically supporting two capabilities:
feature retrieval for a batch of entities (useful for generating training sets, together with the ability to time travel, and for offline batch prediction); and feature serving in near real time (online).
This is the minimum functionality I would expect a feature store to provide at the lower level. However, it serves further purposes too: feature reuse, governance and declarative feature engineering, so data scientists can focus on what data they need instead of how to get it.
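To make the two capabilities above concrete, here is a minimal sketch in plain Python. `TinyFeatureStore` and its methods are hypothetical names made up for illustration, not the API of Feast, ClearML or any other product. It assumes features arrive as timestamped rows per entity, and that "time travel" means returning the latest value known at or before a requested timestamp.

```python
from bisect import bisect_right
from collections import defaultdict


class TinyFeatureStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self):
        # entity_id -> list of (timestamp, features dict), kept sorted by timestamp
        self._rows = defaultdict(list)

    def ingest(self, entity_id, timestamp, features):
        rows = self._rows[entity_id]
        rows.append((timestamp, features))
        rows.sort(key=lambda row: row[0])

    def get_online(self, entity_id):
        """Online serving: the latest known features for one entity."""
        rows = self._rows.get(entity_id)
        return rows[-1][1] if rows else None

    def get_historical(self, requests):
        """Batch retrieval with time travel.

        requests is an iterable of (entity_id, as_of_timestamp) pairs.
        Each result is the newest row at or before as_of_timestamp,
        so a training set never sees values from the future.
        """
        results = []
        for entity_id, as_of in requests:
            rows = self._rows.get(entity_id, [])
            timestamps = [ts for ts, _ in rows]
            idx = bisect_right(timestamps, as_of)
            results.append(rows[idx - 1][1] if idx else None)
        return results


store = TinyFeatureStore()
store.ingest("user_1", 10, {"clicks_7d": 3})
store.ingest("user_1", 20, {"clicks_7d": 8})

training_rows = store.get_historical([("user_1", 15), ("user_1", 5)])
latest = store.get_online("user_1")
```

Here `training_rows` is `[{"clicks_7d": 3}, None]` (nothing was known at time 5) and `latest` is `{"clicks_7d": 8}`. A real store would back the two paths with different storage, but the contract is roughly this.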
There are plenty of good materials about it on http://featurestore.org .
My favourite is this one: https://www.youtube.com/watch?v=E8839ENL-WY , because it covers the higher-level picture of how a feature store helps data scientists progress towards deploying a model to production and continuously monitoring its performance, rather than diving too deep into the technical challenges of having one place to avoid training-serving feature skew.
Hey Slava, I don't mean to be "that guy" but I am interested in what you think a feature store means/implies/should do. The term is still (to my mind) very open to interpretation... so I would honestly love to hear from you (and others)
The enterprise feature store we have should probably be named something more like "data store but with advanced search/update capabilities" but... that's not as nice sounding.
If you mean feature store as 'data ingestion via a DSL with type checking' then this is not where we are.
I think part of the confusion centers around data engineers from a maths/statistics background vs computer scientists with types. Computer science people usually reply that data ingestion and splicing/mapping etc should be done by (say) python or R scripts. Data engineers (from what I can see) want this to be abstracted away from them - they don't want to deal with python or R itself.
I believe that's a fair, high level assessment of the situation, but if you think differently, I am all ears 🙂
It would be helpful to have pointers to the front-end part as well: if we would like to maintain a single UI for feature governance while using our own feature store back end, that would make it easier. There must already be some logging capability within experiment tracking, so it should not be too hard to log the features and probably the dataset that was used, and tags in the model metadata can hold the required schema, so the only questionable part is monitoring feature skew.
Anyway, thank you for confirming that ClearML is extensible enough on this part. It would probably be even better if you guys could write some docs on third-party feature store integration. 🙏
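For the skew-monitoring part, one common approach (swapped in here as an example, not something the thread or ClearML prescribes) is to compare the distribution of a feature at training time against what is seen at serving time, for instance with the population stability index (PSI). The sketch below is plain Python; the function name, the bin count and the floor value are all assumptions chosen for illustration.

```python
import math


def population_stability_index(train_values, serving_values, bins=10):
    """Hypothetical helper: PSI of one numeric feature, train vs serving."""
    lo = min(train_values)
    hi = max(train_values)
    # Equal-width buckets over the training range; fall back to 1.0 if constant
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range serving values
            counts[idx] += 1
        # A small floor avoids log(0) for empty buckets
        return [max(c / len(values), 1e-6) for c in counts]

    p = bucket_shares(train_values)
    q = bucket_shares(serving_values)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


train = [float(v) for v in range(100)]
same = population_stability_index(train, train)
shifted = population_stability_index(train, [v + 50 for v in train])
```

`same` comes out as zero and `shifted` is clearly positive. The resulting number could then be logged per feature next to the experiment, and what threshold counts as "too much skew" is a judgment call for the team.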
That's... a very good question. When I was using Feast, more than one person was interested in using the ingested data, so it became that 'single source of truth'. From then on, ClearML was used to do the actual pipeline flow and training/testing/serving runs and, since it's an all-Python shop, it worked pretty well. We used it offline, since we didn't care about having features online at inference time. I should probably write something up about this when I have the time, come to think of it... hrrrmm...
honestly, I don't think the feature store we have would suit your needs. It is much closer to a data store in functionality with some nice to haves, rather than a feature store that is missing some bits.
Personally, I have used Feast before with a client, but only because it's a "pip install" to get it into place. It's a much lower barrier to entry than most of the others (again, bear in mind, I am a pythonista)