

AgitatedDove14,

At the beginning of a data science project, questions are often asked such as "how long will that take?" or "what are the chances it will work to this accuracy?".

To the uninitiated, these would seem like relatively innocent and easy-to-answer questions. If a person has a project management background in a field with more clearly defined technical tasks, like software development or mechanical engineering, then work packages are often more clearly defined and the uncertainties relating to outcomes are much smaller.

However, anyone who has attempted to get information out of complex and imperfect data knows that the power of any model, and the success of any given project, is largely dependent on the data. Many aspects of the data are generally unknown prior to undertaking a project, so the risk at the beginning of any data science project is large. It is large from both a time-versus-reward point of view and a final-result point of view, both of which are highly uncertain. The key to successful projects at this stage is to rapidly understand the data to a point where you can start to reduce these uncertainties.

At the beginning of the project, you are focused solely on this, and less on the quality of the code, how easy it is to deploy, and so on. Because of this, you cannot be too rigid in how you define the process for doing the work (that is, writing code) and providing results, as the possible range of outcomes from these processes can be large. It's no surprise that applications like Jupyter Notebooks are so popular: they provide the ability to code fast and visualize results quickly and inline with the code, which reduces the lead time to data understanding.

As data scientists, we spend a lot of time at that end of the spectrum, looking at data and visualising it in ad hoc ways to determine the value and the power of the data. The main focus here is understanding, not production-ready code. And because fewer projects make it to deployable models, we as a group are not as experienced at deployment as we are at the early stage I describe above. This is likely a key factor in why it takes organisations a lot of work to take development models into production: the person developing those models isn't really thinking about deployment, or may not have enough experience of it to put things into context during the development phase.

So, what I am referring to is a system that allows some rigor and robustness in the tracking of experiments, and that enforces some thought about how things might be deployed, early on in the development process, whilst not being so prescriptive and cumbersome that it takes away from the effort to understand the data. That is a very valuable thing indeed to have. It balances the need for quick answers at the beginning with, hopefully, a considerably easier journey to deployment should a project make it to fruition and add value to the particular problem being solved.

  
  
Posted 3 years ago
161 Views
0 Answers