Nicola Bortignon - 4 Books for Data Scientists

Data Science is a complex topic to master!

Not only is there a plethora of resources available, but they also age very fast. Couple this with a lot of technical jargon and you can see why people get lost even before starting.
However, this is only part of the story. You cannot master Data Science without going through the grind yourself. You have to spend hours understanding the nuances of a mathematical formula or an engineering implementation, its importance, and its impact on your models. It’s hard work.

Before you get there, it’s important that you fully grasp the principles of data science and the reasons why we practise it the way we do today.

The following 4 books provide a thoughtful journey from high-level reasoning to tips and tricks from masters in the field.

During 2016, I re-read these books in this very same sequence.
I hope you’ll find them at least as interesting as I did.

The Signal and the Noise

(and the summary version)

Stats guru and political forecaster Nate Silver reveals why most predictions fail, and shows how we can isolate a true “signal” from a universe of increasingly big and noisy data.

Making decisions based on an assessment of future outcomes is a natural and inescapable part of the human condition. Indeed, as Nate Silver points out, “prediction is indispensable to our lives. Every time we choose a route to work, decide whether to go on a second date, or set money aside for a rainy day, we are making a forecast about how the future will proceed–and how our plans will affect the odds for a favorable outcome”. And over and above these private decisions, prognosticating does, of course, bleed over into the public realm; as indeed whole industries from weather forecasting, to sports betting, to financial investing are built on the premise that predictions of future outcomes are not only possible, but can be made reliable. As Silver points out, though, there is a wide discrepancy across industries and also between individuals regarding just how accurate these predictions are. In The Signal and the Noise, Silver attempts to get to the bottom of all of this prediction-making to uncover what separates the accurate from the misguided.

In doing so, the author first takes us on a journey through financial crashes, political elections, baseball games, weather reports, earthquakes, disease epidemics, sports bets, chess matches, poker tables, and the good ol’ American economy, as we explore what goes into a well-made prediction and its opposite. The key teaching of this journey is that wise predictions come out of self-awareness, humility, and attention to detail: lack of self-awareness causes us to make predictions that tell us what we’d like to hear, rather than what is true (or most likely the case); lack of humility causes us to feel more certain than is warranted, leading us to rash decisions; and lack of attention to detail (in conjunction with self-serving bias and rashness) leads us to miss the key variables that make all the difference. Attention to detail is what we need to capture the signal in the noise (the key variable[s] in the sea of data and information that are integral in determining future outcomes), but without self-awareness and humility, we don’t even stand a chance.

While self-awareness requires us to make an honest assessment of our particular biases, humility requires us to take a probabilistic approach to our predictions. Specifically, Silver advises a Bayesian approach. Bayes’ theorem has it that when it comes to making a prediction, the most prudent way to proceed is to first come up with an initial probability of a particular event occurring (rather than a black and white prediction of the form ‘I believe x will occur’). Next, we must continually adjust this initial probability as new information filters in.
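The update loop described above can be sketched in a few lines of Python. This is only an illustration, not an example taken from the book: the function name `bayes_update`, the starting belief of 10%, and the likelihood values are all made-up assumptions chosen to show the mechanics.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior probability of a hypothesis after one observation.

    prior               -- current belief that the hypothesis is true (0..1)
    p_evidence_if_true  -- chance of seeing this evidence if the hypothesis is true
    p_evidence_if_false -- chance of seeing this evidence if the hypothesis is false
    """
    numerator = p_evidence_if_true * prior
    # Total probability of the evidence under both possibilities.
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: start with a 10% belief, then observe three
# pieces of evidence, each more likely if the hypothesis is true.
belief = 0.10
observations = [(0.8, 0.3), (0.7, 0.4), (0.9, 0.2)]

history = [belief]
for p_if_true, p_if_false in observations:
    belief = bayes_update(belief, p_if_true, p_if_false)
    history.append(belief)

print([round(b, 3) for b in history])
```

With these assumed inputs, the belief climbs from 10% to roughly 70% after three observations. No single piece of evidence settles the question; the estimate moves in steps as information filters in, which is the habit Silver recommends.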

The level of certainty that we can place on our initial estimate of the probability of a particular event (and the degree to which we can accurately refine it moving forward) is limited by the complexity of the field in which we are making our prediction, and also the amount and quality of the information that we have access to.
It is also important to recognise that while additional information can help us no matter what field we are trying to make our prediction in, we must be careful not to think that information can stand on its own. Indeed, additional information (when it is not met with insightful analysis) often does nothing more than draw our attention away from the key variables that truly make a difference. In other words, it creates more noise, which can make it more difficult to identify the signal. It is for this reason that predictive models that rely on statistics and statistics alone are often not very effective (though they do often help a seasoned expert who is able to apply insightful analysis to them).

Each of the fields that Silver analyses is a lot noisier than many of us would like to think (thus making them very difficult to predict precisely). Nevertheless, the author argues, within each there are certain signals that can help us make better predictions regarding them, and which should help make the world a safer and more livable place.

Silver makes a very strong argument that by applying a few simple principles (and putting in a lot of hard work in identifying key variables) our predictive powers should take a great boost indeed.

This book gives a first imprint of the Data Scientist’s mindset. It’s also a go-to guide for more seasoned scientists who want to remind themselves of the principles underlying their day-to-day work.

Data Science for Business: What you need to know about data mining and data-analytic thinking

Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for extracting useful knowledge and business value from the data you collect.
This guide also helps you understand the many data-mining techniques in use today. Based on an MBA course Provost has taught at New York University over the past ten years, Data Science for Business provides examples of real-world business problems to illustrate these principles.
You’ll not only learn how to improve communication between business stakeholders and data scientists, but also how to participate intelligently in your company’s data science projects.

You’ll also discover how to think data-analytically, and fully appreciate how data science methods can support business decision-making.
Understand how data science fits in your organisation and how you can use it for competitive advantage.
Treat data as a business asset that requires careful investment if you’re to gain real value. Approach business problems data-analytically, using the data-mining process to gather good data in the most appropriate way.
Learn general concepts for actually extracting knowledge from data.

If you are a data scientist working on data science day to day, this book is a must-read. It will help you develop the business acumen to make your work effective and impactful.

The Data Science Handbook: Advice and Insight from 25 Amazing Data Scientists

The Data Science Handbook is a collection of in-depth interviews with 25 Data Scientists who are well known in the field.
It vaguely reminds me of another book, “Founders at Work” by Jessica Livingston.

The book focuses more on the ‘life story’ of some reputed data scientists working in different companies, either as employees or as founders. The authors asked them about their early life, what motivated them to enter the industry, how they started working in the data science domain, which undergraduate or graduate courses helped them get into the industry and beyond, and how they think data science will impact the future.

What I liked most about the book is that it includes stories from data scientists who transitioned from employment to entrepreneurship. I don’t think many people have done that yet. The interviewees are also very diverse: 25 scientists currently focusing on completely different domains.

They range from Riley Newman (Airbnb, a relatively new company) to Drew Conway, famous for his data science Venn diagram.
Sean Gourley is well known for applying mathematical modelling to war in the Middle East. Jace Kohlmeier switched from high-frequency trading to helping Khan Academy revolutionise education and self-learning. There are also interviews with data scientists from Facebook and Palantir Technologies, as well as from more niche, less consumer-focused companies.

Key topics include the importance of effective communication for a data scientist and the ability to ask good questions, as well as the need for an engineering background and strong coding, visualisation and experimental design skills.

I personally enjoyed the interview with John Foreman, the chief data scientist of MailChimp, which focuses on bridging the gap between social science and data science.
This book, in contrast to the previous ones in this article, is more practical.
It will give you great insight into the career evolution of a Data Scientist and the set of problems that keep lead data scientists awake at night.

Doing Data Science: Straight Talk from the Frontline

“Doing Data Science” is a compendium of chapters that deal with data science as it is practised in the real world. Each chapter is written by a different author, all of whom have significant practical experience and are acknowledged authorities on data science. Most of the contributors work in industry, but data science is still so fresh and new that there is a lot of crossing over between academia and the corporate world.

A few of the chapters include exercises, but these tend to be too advanced and assume too much background material for an introductory book. The exercises still give you a good idea of what kinds of problems data scientists tend to grapple with. However, this book is definitely not a textbook and cannot be effectively used as such. The book doesn’t provide any background on statistics, data scrubbing, machine learning, or the various other techniques used by data scientists.

There are two groups of people who would benefit from this book. The first are people who have absolutely no background in data science or any of its related fields, but would like to get a flavour of what data science is all about and are interested in exploring it for career purposes. The second group are people with a significant technical background in one of the fields related to data science (programming, statistics, machine learning, etc.) who are interested in broadening their skills and would like to see how their particular strengths would fit within the broader data science field.

At Spotify we are constantly expanding our Data organization.
Want to join us? Let me know!