
Why the ‘boring’ part of Data Science is actually the most interesting

For the last 5 years, data science has been one of the world’s hottest professions, but it is also one of the most poorly defined. This can be seen on any career website, where advertisements for ‘Data Scientist’ positions describe everything from what used to be a simple data analyst role, to technical, PhD-only, research positions working on artificial intelligence or autonomous cars.

However, despite the diversity of roles being labelled ‘data scientist’, there is a common thread that runs through any job involving data and building models: only around 20% of the time will be spent building models, with the other 80% spent understanding, cleaning and transforming data to get it to the point where it can be used for modelling (for an overview of all the steps a Data Scientist goes through, see this series).

For most people working in the profession, the time spent cleaning and transforming is seen simply as a price to be paid to get to the interesting part – the modelling. If they could, many would happily hand off this ‘grunt work’ to someone else. At first glance, it is easy to see why – it is the modelling that gets all the headlines. Very few people hear about a model predicting cancer in hospital patients and think “they must have had some awesome clean data to build that with”.

However, plaudits aside, I am going to make the case that this is backwards – that from a creativity and challenge standpoint, the cleaning and transforming parts of the job are often the most interesting parts of data science.

The creativity of cleaning

Over the past 12 years of working with data, one thing that has become painfully obvious is the unbridled creativity of people when it comes to introducing errors and inconsistencies into data. Typos, missing values, numbers in text fields, text in numerical fields, inconsistent spellings of the same item, and changing number formats (e.g. ever notice how most of continental Europe uses “,” as the decimal point instead of “.”?) are just some of the most common issues one will encounter.

To be fair, it is not only the fault of the person doing the data entry (e.g. an end user of an application). Often, the root of the problem is a poorly designed interface and a lack of data validation. For example, why is a user able to submit text in a field that should only ever contain numbers? Why do I have to guess how everyone else types in “the United States” (US, U.S., USA, U.S.A., United States of America, America, Murica) instead of choosing from a standardized list of countries?

However, even with the most carefully validated forms and data entry interface, data quality issues will continue to exist. People fudge their age, lie about their income, enter fake emails, addresses and names, and some, I assume, make honest typos and mistakes.

So why is dealing with these issues a good thing? Because the unlimited creativity of the people creating the data quality issues has to be exceeded by the creativity of the person cleaning the data. For every possible type of error that can be found in the data, the data scientist has to develop a method to address it. And assuming the dataset is more than a few hundred rows, it will have to be a systematic method, as manually correcting the issues becomes impractical.

As a result, the data scientist has to find a way to address the universe of potential errors, and to do so in an automated, systematic way. How do I go through a column of countries that have all been spelt in different ways and standardize the names? Someone got decimal happy and now a lot of the numbers in a column have two decimal points instead of one – how can I systematically work out which decimal point is the correct one and remove the other? A bunch of users put their birthday as 1 January 1900 – should I remove those values, and if so, what should I replace them with?

All of these scenarios are real examples of interesting, challenging problems to solve, and ones that require a high-level of creativity to address.
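
To make this concrete, here is a minimal sketch in Python (using pandas) of how two of these fixes might be automated – standardizing country names against a mapping table and treating the 1 January 1900 placeholder birthdays as missing. The column names and the mapping are invented for illustration; a real project would need far more complete rules.

import pandas as pd

# Hypothetical raw data with the kinds of issues described above
df = pd.DataFrame({
    "country": ["US", "U.S.A.", "United States of America", "Murica", "Germany"],
    "birthday": ["1985-04-12", "1900-01-01", "1990-07-30", "1900-01-01", "1978-11-02"],
})

# 1. Standardize country names with an explicit mapping table;
#    anything not in the table gets flagged for manual review
country_map = {
    "us": "United States", "u.s.": "United States", "usa": "United States",
    "u.s.a.": "United States", "united states of america": "United States",
    "america": "United States", "murica": "United States", "germany": "Germany",
}
mapped = df["country"].str.strip().str.lower().map(country_map)
df["country_clean"] = mapped.fillna("UNMAPPED: " + df["country"])

# 2. Treat the placeholder birthday of 1 January 1900 as missing rather than real
df["birthday"] = pd.to_datetime(df["birthday"], errors="coerce")
df.loc[df["birthday"] == pd.Timestamp("1900-01-01"), "birthday"] = pd.NaT

print(df)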

The creativity of transformation/feature extraction

Once cleaning has been undertaken, typically the next step is to perform transformation and/or feature extraction. These steps are necessary because data is rarely collected in the form required by the model, and because there is often additional information that can be added to, or extracted from, the data to make the model more effective.

If this sounds like a very open-ended task, that’s because it is. Often, the ability to enhance a dataset is limited only by time, and the creativity and knowledge of the data scientist doing the work. Of course, there are diminishing returns, and at some point, it becomes uneconomic to invest additional effort to improve a dataset, but in many cases there are a huge range of options.

Due to the open-ended nature of this step, there are actually two types of creativity required. The first is the creativity to come up with potential new features that can be extracted from the existing dataset (and developing the methods to create those features). The second is identifying other data that could be used to enhance the dataset (and then developing the methods to import and combine it). Again, both of these are challenging and interesting problems to solve.
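
As a rough illustration of both types of creativity, the sketch below derives new features from an existing timestamp column and then joins in an external reference table. The tables and column names are made up for the example.

import pandas as pd

# Hypothetical transactions table
transactions = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "amount": [120.0, 35.5, 60.0],
    "timestamp": pd.to_datetime(["2017-05-01 09:30", "2017-05-06 22:10", "2017-05-07 14:45"]),
})

# Creativity type 1: extract new features from columns we already have
transactions["hour_of_day"] = transactions["timestamp"].dt.hour
transactions["is_weekend"] = transactions["timestamp"].dt.dayofweek >= 5

# Creativity type 2: enhance the dataset with external data – here, a
# made-up customer demographics table joined on customer_id
demographics = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["North", "South"],
})
enriched = transactions.merge(demographics, on="customer_id", how="left")

print(enriched)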

Making a model is often a mechanical process

Unlike the above, creating the model is a relatively mechanical process. Of course, there are still challenges to overcome, but in most cases it boils down to choosing an algorithm (or combination of algorithms), then tuning the parameters to improve the results. The issue is that neither of these steps typically involves much creative thinking; instead, they involve cycling through a lot of options to see what works.

Even the selection of the algorithm, or combination of algorithms, which might seem relatively open ended, is, in the real world, limited by a range of factors. For a given problem, these factors include:

  • The task at hand – whether it be two-class or multi-class classification, cluster analysis, prediction of a continuous variable, or something else – will reduce the algorithm options. Some algorithms will typically perform better in certain scenarios, while others may simply not be able to handle the task at all.
  • The characteristics of the data often also reduce the options. Larger datasets mean some algorithms will take too long to train to be practical. Datasets with large numbers of features suit some algorithms more than others, while sparse datasets (those with lots of 0 values) will suit yet other algorithms.
  • An often-overlooked factor is the ability to explain to clients and/or bosses how and why a model is making a prediction. Being able to do this typically puts a significant limit on the complexity of the model (particularly ensembles), and makes simpler (and often less accurate) models more appealing.

After all these factors are taken into account, how many algorithms are left to choose from in a given scenario? Probably not too many.
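
In code, that mechanical cycle might look something like the sketch below, which loops over a couple of candidate algorithms and parameter grids using scikit-learn on a synthetic stand-in dataset.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic two-class dataset standing in for a real (cleaned) dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cycle through a small set of candidate algorithms and parameter grids
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
]

for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=5)
    search.fit(X_train, y_train)
    print(type(model).__name__, search.best_params_, search.score(X_test, y_test))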

[Image: machine learning algorithm cheat sheet]

An excellent graphic from SAS summarizing how the algorithm choices in data science are often limited by the problem.

Wrapping Up

Taking all the above into account, the picture that starts to form is one where significant creativity is required to clean and create a good dataset for modelling, followed by a relatively mechanical process to create and tune a model. But if this is the case, why doesn’t everyone think the same way I do?

One of the primary reasons is that in most real-world data science scenarios, the above steps (cleaning, transformation, feature extraction and modelling) are not typically conducted in a strictly linear fashion. Often, building the model and assessing which features were the most predictive will lead to additional work transforming and extracting features. Feature extraction and testing a model will often reveal data quality issues that were missed earlier and cause the data scientist to revisit that step to address those issues.

In other words, in practice everything is interlinked and many data scientists view the various steps in the process of constructing a model (including cleaning and transforming) as one holistic process that they enjoy completing. However, because the cleaning and transforming aspects are the most time consuming, these aspects (data cleaning in particular) are often seen as being the major impediment to a completed project.

This is true – almost all projects could be completed significantly faster if the data were of higher quality at the outset. The quick turnaround for most Kaggle competition entries (where relatively clean and standardized data are provided to everyone) can attest to this. But to my fellow data scientists, I would say the following. Data science will always involve working with dirty and underdeveloped data – no matter how good we get at data validation, how clean and intuitive the interface, or how much planning is done on what data points to collect. Embrace the dirt, celebrate the grind, and take pride in crafting creative solutions to often complex and challenging problems. If you don’t, no one else will.

The Surprising Complexity of Randomness

Previously, in a walkthrough on building a simple application without a database, I touched on randomness. Randomness and generating random numbers is a surprisingly deep and important area of computer science, and also one that few outside of computer science know much about. As such, for my own benefit as much as yours, I thought I would take a deeper look at the surprising complexity of randomness.

Why do we need randomness?

There are a number of uses for randomness. But first, one thing to note is that when it comes to computers and computer science, randomness is typically represented by random numbers – seemingly random sequences of numbers that can then be used for different purposes. These purposes can range from randomly generating words in a flashcard app or shuffling songs in a playlist, to significantly more high-stakes uses, such as generating random keys for secure logins, data encryption, or randomly shuffling a deck of cards in an online game where large amounts of money are at stake.

How are random numbers created at the moment?

Random numbers come in two types: pseudorandom numbers and true random numbers.

Pseudorandom numbers are numbers that are generated to appear random, but are not truly random. Typically, pseudorandom numbers will be generated using a seed value provided by a user or programmer, which is then passed to an algorithm that uses that value to generate a new number. These algorithms often work by taking the remainder of an equation which includes the seed value and several large constants.

For example, let’s say we use the following very simple equation to generate a series of random numbers:

R = (387 x S + 217) mod 953

Where:
R is the random number to be produced
S is the seed value for R
mod represents the modulo operation, where the result is the remainder of the division

Starting with a seed value (S) of 43, the first random number produced by the equation will be:

R = (387 x 43 + 217) mod 953

R = 657

To produce the second random number, we then insert 657 as S, back into the equation:

R = (387 x 657 + 217) mod 953

R = 25

This process can be repeated as many times as needed, generating an apparently random series of numbers.
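
Expressed as a short Python sketch, the toy generator above looks like this (with the modulo written as %):

def next_random(seed):
    """One step of the toy pseudorandom generator from the example above."""
    return (387 * seed + 217) % 953

seed = 43
for _ in range(5):
    seed = next_random(seed)
    print(seed)   # 657, 25, ... each value feeds back in as the next seed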

While this example is a very simple one, this process of feeding the last random number into the same equation to generate a new random number is common to almost all pseudorandom number generators, and will result in two common attributes, regardless of the complexity.

The first is that if the seed value (S) is the same, the sequence of ‘random’ numbers produced by the algorithm will be exactly the same every time. This means that if you know the equation and the seed value, you can predict the entire sequence of ‘random’ numbers.

The second issue is that, eventually, the pattern will repeat. That is, eventually the formula will generate the same number twice, meaning the whole sequence will start again. And depending on the equation and the constants chosen, this can happen surprisingly soon.
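
Because the toy generator above can only ever produce 953 distinct values (the possible remainders), a short loop shows exactly how soon its sequence starts repeating:

def next_random(seed):
    return (387 * seed + 217) % 953

# Count how many distinct values appear before one comes around again
seen = set()
value = 43
steps = 0
while value not in seen:
    seen.add(value)
    value = next_random(value)
    steps += 1

print("Distinct values before the sequence repeats:", steps)
# The period can never exceed the modulus (953 here); real generators use
# far larger state to push the repeat point out.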

Creating true random numbers

The reason we have pseudorandom numbers is because generating true random numbers using a computer is difficult. Computers, by design, are excellent at taking a set of instructions and carrying them out in the exact same way, every single time. It is this predictability which makes them so powerful. However, this predictability also makes it complicated to generate true random numbers.

As such, for a computer to create a truly random number, it has to take in some external input from something that is truly random. This external input can be something like key presses and movements of the mouse by a human operator, or network activity on a busy network in an office setting. But it can also be something far more complex such as the effect of atmospheric turbulence on a laser, or measuring the decay of a radioactive isotope.
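
In Python, this distinction shows up in the standard library: the random module is a seedable pseudorandom generator, while the secrets module draws on randomness provided by the operating system, which gathers it from external sources like those described above.

import random
import secrets

# Pseudorandom: the same seed always reproduces the same sequence
random.seed(42)
print([random.randint(0, 9) for _ in range(5)])
random.seed(42)
print([random.randint(0, 9) for _ in range(5)])   # identical to the line above

# OS-provided randomness: no seed, not reproducible, suitable for
# security-sensitive uses such as tokens and keys
print(secrets.token_hex(16))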

[Image: Generating random numbers using mouse and keyboard inputs]

Why does it matter?

This difference between pseudorandom and true random numbers is important, but only in certain settings.

For uses like selecting a random sample when working with data, shuffling a playlist, or triggering events in a video game, it is less important if pseudorandom or true random numbers are used. How true the randomness is, in these cases, will not impact the quality of the outcomes.

In some cases, using pseudorandom numbers may be advantageous. Take for example the process of selecting a random sample for a scientific study. In this case, using pseudorandom numbers allows others to replicate your results by using the same seed value. In video games, being able to trigger the same ‘random’ events is very useful when the game is being tested.

In other cases, using true random numbers is much more important. In applications such as encryption, using true random numbers is particularly important as it helps to ensure that data remains protected. Similarly, for online gambling, gaming companies need to have a very high level of confidence that the way results are being produced in everything from blackjack (how the cards are shuffled), to roulette (where the ball lands) and poker machines (which position the reels stop in) is a truly random process, or they risk someone reverse engineering the algorithm and making a significant profit as a result.

True randomness is not what most people expect

When it comes to true randomness, one of its stranger aspects is that it often behaves differently to people’s expectations. Take the two diagrams below – which one do you think is a random distribution, and which has been deliberately created/adjusted?

[Image: two panels of randomized dots]

Only one of these panels shows a random distribution of dots | Source: Bully for Brontosaurus – Stephen Jay Gould

If you said the right panel, you are in good company, as this is most people’s expectation of what randomness looks like. However, this relatively uniform distribution has been adjusted to ensure the dots are evenly spread. In fact, it is the left panel, with its clumps and voids, that reflects a true random distribution. It is also this tendency for randomness to produce clumps and voids that leads to some unintuitive outcomes.

Take Spotify, the digital music service, for example. For years, Spotify listeners have complained about the quality of the playlist shuffle. In fact, the quality of Spotify’s shuffle has been such a topic of discussion that if you type “Spotify shuffle” into Google, one of the first autocomplete options that will come up is “sucks”. When Spotify looked into these complaints, the most common theme centered on songs from the same artist frequently playing one after the other. In short, people’s expectations of randomness were not matching reality. As Spotify explain in this interesting article, their shuffle was actually random, but they have now adjusted it to better align with what people think of as random – by reducing the randomness and ensuring that songs from a given artist will be spread throughout the playlist.

The gambler’s fallacy

As is also covered in the Spotify article, a great example of this misalignment of people’s expectations with the true nature of randomness is the so-called gambler’s fallacy. What the gambler’s fallacy boils down to is two things:

  1. A belief that independent random events (a flip of a coin, a roll of a die) have some sort of inherent tendency to revert to the mean. For example, when flipping a coin, a streak of heads is believed to increase the likelihood that the next flip will be tails, so that the eventual distribution moves back towards 50-50.
  2. As a result of belief 1, people tend to underestimate the likelihood of streaks (or clumps) of outcomes. The classic example of this is the person at the roulette table who looks at the list of previous results and sees a run of five black numbers, and believes that the likelihood of the next number being red is now higher as a result. By the way, this is exactly why casinos show the history, to tempt people into betting when they think the odds are in their favor.

To test your own beliefs on the likelihood of streaks, consider a roulette wheel in a casino. Let’s say the casino is open 12 hours a day, and that on average, it gets spun once per minute, giving us 720 spins in a day. Assuming there is a 50% chance of a red number and a 50% chance of a black number (i.e. we are ignoring the green 0 and 00 tiles for simplicity), what do you think the probability is of a streak of 8 or more black or red numbers in a row on a given day?

The answer is over 75%. In other words, on three out of four days, you should expect to see at least one streak of 8 or more black or red numbers during the day. Extending this, there is a 30% chance of a streak of 10 or more and around an 8% chance of a streak of twelve. You can test this and other scenarios using this handy calculator.
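
If you want to check these numbers yourself, the short Monte Carlo sketch below estimates the probability of a run of 8 or more of the same color somewhere in 720 fair 50/50 spins.

import random

def has_streak(n_spins=720, streak_len=8):
    """Return True if a run of streak_len identical outcomes occurs."""
    run, last = 0, None
    for _ in range(n_spins):
        outcome = random.random() < 0.5   # True = red, False = black
        run = run + 1 if outcome == last else 1
        last = outcome
        if run >= streak_len:
            return True
    return False

trials = 10000
hits = sum(has_streak() for _ in range(trials))
print("Estimated probability:", hits / trials)   # roughly in line with the ~75% figure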

What does any of this mean?

In the course of your daily life, not too much. If you are a gambler, you should probably stop, but I am sure I am not the first person to tell you that. If you follow stock pickers, hopefully you will reconsider how much of their ‘skill’ is pure chance, especially when you factor in survivorship bias[1]. Perhaps something here will help you impress your friends at a trivia night.

If none of the above apply however, hopefully this article has introduced you to an interesting and little known area of knowledge with some important and fascinating applications.

 

[1] Survivorship bias in this context exists because the stock pickers that were not picking the right stocks did not keep writing articles. Over time, this leaves only the people who have been picking the winners (the ‘survivors’) to continue writing, even if their picks were correct purely by chance.

 
