Nicola Bortignon - ML projects: things to know before getting started

Machine learning projects are complex and come with typical mistakes that can lead to a lot of effort and no outcome. The good news is that everything that is true for Software Engineering also applies to this newer area of Computer Science.

Below are 40 points that I have formulated over the last 10 years of working on ML projects at Google, SoundCloud and Spotify.

Pre-requirements:

Launch your product without ML. (Trivia: as of 2016, the number of products shipped at Google with ML in the first iteration was 0.)

Make clear what the goal of your product feature is. If there is more than one goal, stack rank them.

Design and implement metrics (including negative ones) that can track the goal of your product (step 2).
Understand how certain interventions/treatments affect user behavior.

Product Managers, Product Analysts and Data Scientists are your best friends here.

Define your product thresholds. Usually, a product has requirements that you are not allowed to violate. Have clear metrics to measure those thresholds, and alerting for when you cross them.
Include measurement of differential performance (i.e. different performance for different subpopulations, such as by gender, race, or age).
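As a rough illustration of differential performance, here is a minimal sketch that computes one metric per subpopulation and the largest gap between groups. The function names, the age groups and the tiny dataset are all made up for the example, and plain accuracy is only a stand-in for whatever product metric you actually track.

```python
from collections import defaultdict

def accuracy_by_group(labels, predictions, groups):
    """Return {group: accuracy} so each subpopulation is measured separately."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, y_hat, g in zip(labels, predictions, groups):
        total[g] += 1
        correct[g] += int(y == y_hat)
    return {g: correct[g] / total[g] for g in total}

def max_accuracy_gap(per_group):
    """Largest accuracy difference between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)

# Hypothetical data: two age bands, four examples each.
labels      = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 0, 1, 0]
groups      = ["18-24"] * 4 + ["25-34"] * 4

per_group = accuracy_by_group(labels, predictions, groups)
gap = max_accuracy_gap(per_group)
print(per_group, gap)
```

A gap like this can be tracked and alerted on in the same way as any other product threshold.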

Assess the performance of the existing solution. Understand what is good enough and what is not. Try to estimate the improvement you expect from introducing ML. ML systems are expensive to develop (and to maintain), so make sure the cost/opportunity trade-off is sound.

Start your ML Project by focusing on the setup:

Start by identifying a limited number of simple signals you want to use, and make sure they are available and of high quality.

Make sure your data collection infrastructure works, both for training and inference.
a. If you cannot collect the same feature at inference, don’t use that feature for training.
b. Be mindful of real-time signals. They are often hard to access, and most likely there are simpler signals you can use instead.
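The training/inference rule above can be enforced with a very small check. This is only a sketch: the feature names are invented, and in a real system the two lists would come from your training pipeline and your serving path rather than being hard-coded.

```python
def features_missing_at_inference(training_features, inference_features):
    """Features used for training that the serving path cannot provide."""
    return sorted(set(training_features) - set(inference_features))

# Hypothetical feature lists, for illustration only.
training_features = ["total_plays", "hearted_artists", "session_length_so_far"]
inference_features = ["total_plays", "hearted_artists"]

missing = features_missing_at_inference(training_features, inference_features)
if missing:
    print("Drop from training or fix collection:", missing)
```

Running a check like this before every training job is cheap and catches the mismatch before a model depends on it.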

You don’t need a model to test your infrastructure. Test your infrastructure now.

Decide the requirements of your system based on the features you want to use
a. Freshness of the data
b. Issues with the data
c. System implication of using that data
d. GDPR compliance
e. Plan for the worst-case scenario of your upstream dependencies

Write your documentation. Each feature should have a clear profile, including:

a. Source of the feature
b. SLA of the system providing such feature
c. How we plan to monitor the quality of this feature
d. Post-processing needed on the feature (e.g. category to int)
e. Be mindful of feature availability; it can create unforeseen sampling bias.
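One way to keep such a profile honest is to make it a small data structure that can be checked automatically. The sketch below is an assumption about how you might do that; the class, field names and example values are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureProfile:
    name: str
    source: str                  # a. where the feature comes from
    sla: str                     # b. SLA of the system providing it
    monitoring: str              # c. how its quality is monitored
    post_processing: list = field(default_factory=list)  # d. e.g. category to int
    availability_notes: str = "" # e. known gaps that could bias sampling

    def missing_fields(self):
        """Names of required documentation fields that were left empty."""
        required = {"source": self.source, "sla": self.sla, "monitoring": self.monitoring}
        return [key for key, value in required.items() if not value.strip()]

# Hypothetical profile for an invented feature.
profile = FeatureProfile(
    name="total_plays",
    source="daily plays aggregation job",
    sla="available by 06:00 UTC",
    monitoring="daily null-rate and range check",
    post_processing=["log-scale"],
    availability_notes="missing for users who signed up in the last 24 hours",
)
print(profile.missing_fields())
```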

Always separate pre-processing and post-processing from the model.
Re-ranking and similar steps should always be treated as an independent post-inference module.
If they are part of your model, you are most likely doing something wrong.
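A minimal sketch of that separation, with three independent stages composed at serving time. Everything here is illustrative: `predict` is a toy stand-in for a real model, and the feature name, candidate list and blocked set are invented.

```python
def preprocess(raw_request):
    """Turn a raw request into model features (no model logic here)."""
    return {"total_plays": float(raw_request.get("total_plays", 0))}

def predict(features, candidates):
    """Toy stand-in for the model: returns (candidate, score) pairs."""
    base = features["total_plays"]
    return [(c, base / (i + 1)) for i, c in enumerate(candidates)]

def postprocess(scored, blocked):
    """Filtering and re-ranking applied after inference, outside the model."""
    allowed = [(c, s) for c, s in scored if c not in blocked]
    ranked = sorted(allowed, key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked]

def serve(raw_request, candidates, blocked=frozenset()):
    """Compose the three stages; each can be tested and replaced on its own."""
    return postprocess(predict(preprocess(raw_request), candidates), blocked)

result = serve({"total_plays": 12}, ["a", "b", "c"], blocked={"b"})
print(result)
```

Because the stages only meet in `serve`, you can swap the model without touching the filtering rules, and the other way around.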

Backend engineering, Data Engineering and ML engineering will always be your bottleneck. Don’t over-index your team on researchers/data scientists. Product success in an ML project is more often about engineering than science. “Science is necessary but not sufficient.”

Understand and agree with your team whether you want to prototype iteratively or work in a throwaway fashion. Make clear what the relationship between exploration and production is, especially from an engineering standpoint.

You need to understand the ins and outs of your A/B test setup. Design it before moving forward with modeling. Doing so helps you avoid coupling or introducing statistical bias down the line.

Implement your objective function:

Make clear which of the existing metrics (point 3) you are trying to optimize. Multi-objective optimization is an option, but first make sure you have a clear understanding of how those metrics interact with each other.

Translate that metric into an objective function.

Your ML objective function should be something that is easy to measure and is a proxy for the “true” goal of your product improvement. The best practice here is to start from user research and work back to an optimizable metric. Avoid the lazy practice of just looking at things like “accuracy”, “likelihood”, etc.
a. If you cannot prove that your objective function has a clear impact on the true objective, rethink it.
b. If you cannot come up with an objective function that is a proxy for your true objective, you need to rethink your product objective.

Ask if you can make your objective function even simpler.

Translate your objective function into an offline evaluation procedure.
At any time, you should be able to assess whether your model output is improving your product objective or not. If you cannot tell, you are either missing data (go back and collect more data) or your objective function is not a good proxy for the product objective.
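A sketch of what such an offline evaluation procedure could look like: a frozen evaluation set, one metric, and a yes/no answer when comparing two models. All names are hypothetical, and `hit_rate` is only a placeholder; the metric you plug in should be the proxy you derived from your product objective.

```python
def offline_score(predict, eval_set, metric):
    """Score a model on a frozen evaluation set of (features, label) pairs."""
    predictions = [predict(features) for features, _ in eval_set]
    labels = [label for _, label in eval_set]
    return metric(labels, predictions)

def hit_rate(labels, predictions):
    """Placeholder metric: share of predictions that match the label."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)

def is_candidate_better(baseline, candidate, eval_set, metric, min_gain=0.0):
    """Yes/no answer: does the candidate beat the baseline by more than min_gain?"""
    gain = offline_score(candidate, eval_set, metric) - offline_score(baseline, eval_set, metric)
    return gain > min_gain

# Hypothetical evaluation data and two toy models.
eval_set = [({"plays": 10}, 1), ({"plays": 0}, 0), ({"plays": 7}, 1), ({"plays": 1}, 0)]
baseline = lambda features: 1
candidate = lambda features: int(features["plays"] > 5)

print(is_candidate_better(baseline, candidate, eval_set, hit_rate))
```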

Your first model

At this point you should have a clear end-to-end toy backend outputting a static model. If you haven’t, go back to 8.
Most likely your team has a heterogeneous set of skills, so it’s hard to keep a strict sequence of what should happen first. Still, there is a risk in not getting the backend done first: it can impact the type of model you can support, the features you will be able to use, and the overall end-user experience (latency, etc). When possible, try to reduce these risks.
Some potential ways to simplify the first end-to-end iteration:
– If you are building a recommendation system for dog breeds, create a model that always outputs Golden Retrievers.
– If you are working on a decision tree, always output YES.
– If your model is outputting a likelihood score, try to always return a uniform distribution.
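The static models above take only a few lines. Here is a minimal sketch; the class names and the `predict` / `predict_proba` method names are assumptions for the example, not a required interface.

```python
class ConstantRecommender:
    """Always returns the same answer, whatever the input."""
    def __init__(self, answer):
        self.answer = answer

    def predict(self, features):
        return self.answer

class UniformScorer:
    """Returns the same probability for every class."""
    def __init__(self, classes):
        self.classes = list(classes)

    def predict_proba(self, features):
        p = 1.0 / len(self.classes)
        return {c: p for c in self.classes}

dog_model = ConstantRecommender("Golden Retriever")
score_model = UniformScorer(["rock", "pop", "jazz", "hip-hop"])

print(dog_model.predict({"user_id": 42}))
print(score_model.predict_proba({"user_id": 42}))
```

Models like these are useless as predictors, but they let you exercise serving, logging and the A/B setup end to end before any real modeling starts.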

Get ready to ship your model as soon as possible. If you don’t know how to do it, don’t start modeling; go back and think about your infra first.

When you sample data for training purposes, make sure your sampling procedure is fair.

Start with an interpretable model; debugging will be easier. Model interpretability is negatively correlated with flexibility. Neural nets can learn more complicated patterns than tree-based approaches, which in turn are more powerful than logistic regression. Interpretability goes the other way. Always start with the simplest, most interpretable models. Go more complex only if the model is clearly underfitting the data.
Avoid starting with a *NN model. You lose interpretability and, despite the early-stage wins, you will be slowed down on iterations.
(Trivia: did you know that at FB interns get to work with the majority of *NN, while FTE teams almost always prefer interpretability?)

Pick existing features. Avoid using features coming out of other ML systems, especially from unsupervised or deep models.
The primary issue with factored models and deep models is that they are nonconvex. Thus, there is no guarantee that an optimal solution can be approximated or found, and the local minima found on each iteration can be different. This variation makes it hard to judge whether the impact of a change to your system is meaningful or random. By creating a model without deep features, you can get an excellent baseline performance. After this baseline is achieved, you can try more esoteric approaches.

Separate Spam Filtering and Quality Ranking in a Policy Layer (similar to point 11). ML is no excuse for bad engineering practices. Modularize your data processing and system logic.

Make sure you have online instrumentation (tracking) to evaluate the performance of your model before you start working on your model.

Have a clear process to understand the difference between your online and offline evaluation, and your online and offline performance.
Understand the possible scenarios and have an opinion on whether each one is a good thing or not (before you see the actual results).

Iterate

Clean up features that you are no longer using. Try to stack rank the features you are using from the most important to the least.

Consider adding more features, while keeping the following constraints:
a. Prefer features related to content that can generalize to different contexts (e.g. hearted artists, or total plays)
b. Use features only when you can explain why you are using them (interpretability)
c. When you discretize a feature, don’t overthink it (e.g. age: common-sense bands are good enough, without spending too much time discussing whether a 25-year-old belongs to the ‘young adult’ or ‘adult’ band)
d. Talk to people with business expertise, if available; they can be useful
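For the discretization point, a common-sense banding can be a few lines with the standard-library `bisect` module. The band edges and names below are arbitrary choices made up for this sketch, which is rather the point: pick something reasonable and move on.

```python
from bisect import bisect_right

# Arbitrary, common-sense band edges (assumed for illustration).
AGE_BAND_EDGES = [18, 25, 35, 50]
AGE_BAND_NAMES = ["under_18", "18_24", "25_34", "35_49", "50_plus"]

def age_band(age):
    """Map an age to its band; an age equal to an edge goes to the upper band."""
    return AGE_BAND_NAMES[bisect_right(AGE_BAND_EDGES, age)]

print([age_band(a) for a in (17, 18, 25, 49, 50)])
```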

Look for errors in your predictions, and make sure that the affected part of the population is well enough represented in the data.

Look for errors in your predictions (through negative metrics monitoring, point 3) and try to come up with a feature that can help improve those predictions. The main thing is to look at prediction/reality mismatches, but observing points that stick out in the feature space is often useful too.
Outlier analysis is often very useful.

Introduce importance weights as a countermeasure to sampling (iteration over 17). Importance weighting means that if you decide to sample example X with a 30% probability, you then give it a weight of 10/3 (that is, 1/0.3).
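A minimal sketch of that rule: every kept example carries a weight equal to the inverse of its sampling probability. The function name, the `keep_probability` callback and the toy data are assumptions for the example.

```python
import random

def downsample_with_weights(examples, keep_probability, seed=0):
    """Keep each example with probability p and attach weight 1/p to it."""
    rng = random.Random(seed)
    kept = []
    for example in examples:
        p = keep_probability(example)
        if not 0.0 < p <= 1.0:
            raise ValueError("sampling probability must be in (0, 1]")
        if rng.random() < p:
            kept.append((example, 1.0 / p))
    return kept

# Hypothetical data: keep every positive, sample negatives at 30%.
examples = [{"label": 1}] * 50 + [{"label": 0}] * 1000
kept = downsample_with_weights(examples, lambda e: 1.0 if e["label"] == 1 else 0.3)

negative_weights = {round(w, 4) for e, w in kept if e["label"] == 0}
print(negative_weights)
```

The weights are then passed to training (most libraries accept per-example weights) so that the sampled negatives count for the full population they represent.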

Always be clear about what your fallback strategy is. The fallback strategy should optimize for the majority of the existing user base. You should also know under what conditions your inference component will fail (e.g. missing input data) and what your approach to mitigating the failure is (both from an engineering and from a user experience standpoint).
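Here is one way a fallback could be wired around inference, as a sketch only. The required feature names, the fallback list and the reason strings are invented; the idea is that both known failure conditions (missing input, inference error) end in a defined, loggable outcome.

```python
# Hypothetical configuration for the example.
REQUIRED_FEATURES = ("total_plays", "hearted_artists")
POPULAR_FALLBACK = ["popular_item_1", "popular_item_2"]

def recommend(features, model):
    """Return (recommendations, reason); never let a model failure reach the user."""
    missing = [name for name in REQUIRED_FEATURES if features.get(name) is None]
    if missing:
        return POPULAR_FALLBACK, "fallback:missing_features"
    try:
        return model(features), "model"
    except Exception:
        # Broad catch on purpose: any inference error falls back to the default.
        return POPULAR_FALLBACK, "fallback:inference_error"

items, reason = recommend({"total_plays": 3, "hearted_artists": None}, lambda f: ["x"])
print(items, reason)
```

Returning the reason alongside the result makes it easy to count how often each fallback path is taken.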

Human Analysis of the system.

You are not the typical end user. You came a long way in trusting ML to be better than humans; don’t give up now. Stop trying to debug the model based on your own assumptions about how it should work (e.g. a genre classification system predicting Eminem to be a Pop artist: no matter if this sounds like profanity to you, trust the system to have recognized that for over 30M listeners in the US, Eminem IS a pop artist). As it is true for standard product development, so it is for ML.
User Researchers are your best friends here.

Measure the delta of your metrics between your models.
Make sure that a model, when compared with itself, has a low (ideally zero) symmetric difference.
Example: For ranking systems, run the two models in parallel and compare the size of the symmetric difference of the results weighted by ranking position. If the difference is very small, then you can tell without running an experiment that there will be little change. If the difference is very large, then you want to make sure that the change is good.
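A sketch of a position-weighted symmetric difference between two ranked lists. The weighting scheme (1/rank, normalized so the result lies between 0 and 1) is my assumption for the example; the text above does not prescribe a specific one.

```python
def weighted_symmetric_difference(ranking_a, ranking_b):
    """Weight of items that appear in only one ranking, using 1/rank weights.

    Returns 0.0 for rankings with the same items and 1.0 for disjoint ones.
    """
    set_a, set_b = set(ranking_a), set(ranking_b)

    def weight_only_in(ranking, other):
        return sum(1.0 / pos for pos, item in enumerate(ranking, start=1) if item not in other)

    def total_weight(ranking):
        return sum(1.0 / pos for pos in range(1, len(ranking) + 1))

    total = total_weight(ranking_a) + total_weight(ranking_b)
    if total == 0:
        return 0.0
    return (weight_only_in(ranking_a, set_b) + weight_only_in(ranking_b, set_a)) / total

same = weighted_symmetric_difference(["a", "b", "c"], ["a", "b", "c"])
tail_change = weighted_symmetric_difference(["a", "b", "c"], ["a", "b", "d"])
head_change = weighted_symmetric_difference(["a", "b", "c"], ["x", "b", "c"])
print(same, tail_change, head_change)
```

Note that this only measures which items differ, weighted by where they appear; it does not penalize two models that return the same items in a different order.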

If there is a property of your system that you don’t like, make it part of the loss function.

Automatic Analysis of the system

The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
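A minimal sketch of that idea: log exactly the features the model saw at serving time, one JSON record per line, and read the same records back when building training data. The function names and record fields are assumptions, and an in-memory buffer stands in for a real log sink.

```python
import io
import json

def predict_and_log(model, features, log_file, request_id):
    """Run inference and log exactly the features the model saw."""
    prediction = model(features)
    record = {"request_id": request_id, "features": features, "prediction": prediction}
    log_file.write(json.dumps(record, sort_keys=True) + "\n")
    return prediction

def load_training_rows(log_file):
    """Read the logged serving features back for use at training time."""
    return [json.loads(line)["features"] for line in log_file if line.strip()]

# In-memory buffer standing in for a real log file or stream.
log = io.StringIO()
predict_and_log(lambda f: 1, {"total_plays": 12.0}, log, request_id="r1")
predict_and_log(lambda f: 0, {"total_plays": 0.0}, log, request_id="r2")

log.seek(0)
rows = load_training_rows(log)
print(rows)
```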

Your model should be instrumented too. How many failing inferences? What is the computational inference cost? Missing values for features? Latency? Etc.
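A sketch of a thin wrapper that answers some of those questions with plain counters. The class and attribute names are made up; in production these counters would feed whatever metrics system you already use.

```python
import time

class InstrumentedModel:
    """Wraps a model callable and counts calls, failures, missing values and latency."""
    def __init__(self, model, feature_names):
        self.model = model
        self.feature_names = list(feature_names)
        self.calls = 0
        self.failures = 0
        self.missing_values = 0
        self.total_latency_s = 0.0

    def predict(self, features):
        self.calls += 1
        self.missing_values += sum(1 for n in self.feature_names if features.get(n) is None)
        start = time.perf_counter()
        try:
            return self.model(features)
        except Exception:
            self.failures += 1
            raise  # still surface the error; the counter only records it
        finally:
            self.total_latency_s += time.perf_counter() - start

wrapped = InstrumentedModel(lambda f: 1, ["total_plays", "hearted_artists"])
wrapped.predict({"total_plays": 3})
print(wrapped.calls, wrapped.failures, wrapped.missing_values)
```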

At any time, you should understand if your system is drifting (virtual or real concept drift). If you don’t have metrics to measure that, you most likely picked the wrong metrics to begin with, or you have not instrumented your product correctly.
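As one possible drift metric, here is a sketch of the Population Stability Index (PSI) computed over binned counts of a feature or of the model output. PSI is my choice for illustration, not something the points above prescribe, and the bin counts are invented.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a reference distribution and a current one, given counts per bin."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        p = max(e / expected_total, eps)  # reference share, floored to avoid log(0)
        q = max(a / actual_total, eps)    # current share
        psi += (q - p) * math.log(q / p)
    return psi

# Hypothetical bin counts: training-time reference vs. last week's traffic.
reference = [100, 300, 400, 200]
current = [150, 350, 300, 200]

print(population_stability_index(reference, reference))
print(population_stability_index(reference, current))
```

Identical distributions give 0, and the value grows as the two diverge; what counts as "too much" is a threshold you have to pick and alert on for your own product.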

The aging of your model should be expressed with a clear, computable, easy-to-interpret formula. If you cannot do so, you most likely do not have enough control over your model.

BONUS

Questions that you should keep asking yourself while involved in an ML project:
– Have we shipped something in production in the last 2 weeks?
– Is there anything that we are doing in this iteration to simplify our model?
– Can we answer with a Yes or No if our model is performing better/worse than the previous one?
– Is our product better than before? If yes, under what conditions?

Additional Reading material:
The UX of AI (Google)
An Introduction to human-centered machine learning (Google)
FBLearner: An introduction to ML at Facebook (Facebook)