Why the ‘boring’ part of Data Science is actually the most interesting

For the last 5 years, data science has been one of the world’s hottest professions, but it is also one of the most poorly defined. This can be seen on any career website, where advertisements for ‘Data Scientist’ positions describe everything from what used to be a simple data analyst role, to technical, PhD-only, research positions working on artificial intelligence or autonomous cars.

However, despite the diversity of roles being labelled ‘data scientist’, there is a common thread that runs through any job involving data and building models: only around 20% of the time will be spent building models, with the other 80% spent understanding, cleaning and transforming data to get it to the point where it can be used for modelling (for an overview of all the steps a Data Scientist goes through, see this series).

For many, if not most, people working in the profession, the time spent cleaning and transforming data is seen simply as a price to be paid to get to the interesting part – the modelling. If they could, many would happily hand off this ‘grunt work’ to someone else. At first glance, it is easy to see why – it is the modelling that gets all the headlines. Very few people hear about a model predicting cancer in hospital patients and think “they must have had some awesome clean data to build that with”.

However, plaudits aside, I am going to make the case that this is backwards: from a creativity and challenge standpoint, it is often the cleaning and transforming parts of the job that are the most interesting parts of data science.

The creativity of cleaning

Over the past 12 years of working with data, one thing that has become painfully obvious is the unbridled creativity of people when it comes to introducing errors and inconsistencies into data. Typos, missing values, numbers in text fields, text in numerical fields, inconsistent spellings of the same item, and changing number formats (e.g. ever notice how most of continental Europe uses “,” as the decimal point instead of “.”?) are just some of the most common issues one will encounter.

To be fair, it is not only the fault of the person doing the data entry (e.g. an end user of an application). Often, the root of the problem is a poorly designed interface and a lack of data validation. For example, why is a user able to submit text in a field that should only ever contain numbers? Why do I have to guess how everyone else types in “the United States” (US, U.S., USA, U.S.A., United States of America, America, Murica) instead of choosing from a standardized list of countries?

However, even with the most carefully validated forms and data entry interface, data quality issues will continue to exist. People fudge their age, lie about their income, enter fake emails, addresses and names, and some, I assume, make honest typos and mistakes.

So why is dealing with these issues a good thing? Because the unlimited creativity of the people creating the data quality issues has to be exceeded by the creativity of the person cleaning the data. For every possible type of error that can be found in the data, the data scientist has to develop a method to address it. And assuming the dataset is more than a few hundred rows, it will have to be a systematic method, as manually correcting the issues becomes impractical.

As a result, the data scientist has to find a way to address the universe of potential errors, and to do so in an automated, systematic way. How do I go through a column of countries that have all been spelt in different ways in order to standardize the country names? Someone got decimal happy and now I have a column where a lot of the numbers have two decimal points instead of one – how can I systematically work out which decimal point is the correct one, and then remove the other decimal point? A bunch of users put their birthday as 1 January 1900, how can I remove those, should I remove them, and if yes, what values should I put there instead?

All of these scenarios are real examples of interesting, challenging problems to solve, and ones that require a high level of creativity to address.
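
To make this concrete, below is a minimal pandas sketch of systematic fixes for the three scenarios above. The column names, country mapping and placeholder date are all made up for illustration – a real dataset will need its own version of each rule.

import pandas as pd

# Hypothetical raw data exhibiting the three issues described above
df = pd.DataFrame({
    "country": ["US", "U.S.A.", "United States of America", "Murica"],
    "amount": ["1.234.56", "99.95", "1.000.000.00", "42"],
    "birthday": ["1900-01-01", "1985-06-12", "1900-01-01", "1990-03-30"],
})

# 1. Standardize country names with an explicit mapping table
country_map = {"US": "United States", "U.S.": "United States",
               "U.S.A.": "United States", "USA": "United States",
               "United States of America": "United States",
               "America": "United States", "Murica": "United States"}
df["country"] = df["country"].replace(country_map)

# 2. Fix numbers with two decimal points: treat everything before the
#    last "." as the integer part and strip the extra points
def fix_decimals(value):
    head, sep, tail = value.rpartition(".")
    return float(head.replace(".", "") + sep + tail) if head else float(value)

df["amount"] = df["amount"].apply(fix_decimals)

# 3. Treat the placeholder birthday as missing rather than guessing a value
df["birthday"] = pd.to_datetime(df["birthday"])
df.loc[df["birthday"] == pd.Timestamp("1900-01-01"), "birthday"] = pd.NaT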

The creativity of transformation/feature extraction

Once cleaning has been undertaken, typically the next step is to perform transformation and/or feature extraction. These steps are necessary because the data is rarely collected in the form required by the model, and/or there is additional information that can be added to and/or extracted from the data to make the model more effective.

If this sounds like a very open-ended task, that’s because it is. Often, the ability to enhance a dataset is limited only by time, and by the creativity and knowledge of the data scientist doing the work. Of course, there are diminishing returns, and at some point it becomes uneconomic to invest additional effort in improving a dataset, but in many cases there is a huge range of options.

Due to the open-ended nature of this step, there are actually two types of creativity required. The first is the creativity to come up with potential new features that can be extracted from the existing dataset (and developing the methods to create those features). The second is identifying other data that could be used to enhance the dataset (and then developing the methods to import and combine it). Again, both of these are challenging and interesting problems to solve.

Making a model is often a mechanical process

Unlike the above, creating the model is a relatively mechanical process. Of course, there are still challenges to overcome, but in most cases it boils down to choosing an algorithm (or combination of algorithms), then tuning the parameters to improve the results. The issue is that neither of these steps typically involves much creative thinking; instead, they involve cycling through a lot of options to see what works.
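
To illustrate just how mechanical this cycling can be, here is a minimal sketch using scikit-learn’s GridSearchCV. The algorithm, the parameter grid and the X_train/y_train variables are placeholders, not a recommendation:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define a grid of candidate parameter values and let the search
# exhaustively cycle through every combination with cross-validation
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# search.fit(X_train, y_train)  # assumes a prepared training set exists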

Even the selection of the algorithm, or combination of algorithms, which might seem relatively open ended, is, in the real world, limited by a range of factors. For a given problem, these factors include:

  • The task at hand – whether it be two-class or multi-class classification, cluster analysis, prediction of a continuous variable, or something else – will reduce the algorithm options. Some algorithms will typically perform better in certain scenarios, while others may simply not be able to handle the task at all.
  • The characteristics of the data often also reduce the options. Larger datasets mean some algorithms will take too long to train to be practical. Datasets with large numbers of features suit some algorithms more than others, while sparse datasets (those with lots of 0 values) will suit others.
  • An often-overlooked factor is the ability to explain to clients and/or bosses how and why a model is making a prediction. Being able to do this typically puts a significant limit on the complexity of the model (particularly ensembles), and makes simpler (and often less accurate) models more appealing.

After all these factors are taken into account, how many algorithms are left to choose from in a given scenario? Probably not too many.

[Image: machine learning algorithm cheat sheet]

An excellent graphic from SAS summarizing how the algorithm choices in data science are often limited by the problem.

Wrapping Up

Taking all the above into account, the picture that starts to form is one where significant creativity is required to clean and create a good dataset for modelling, followed by a relatively mechanical process to create and tune a model. But if this is the case, why doesn’t everyone think the same way I do?

One of the primary reasons is that in most real-world data science scenarios, the above steps (cleaning, transformation, feature extraction and modelling) are not typically conducted in a strictly linear fashion. Often, building the model and assessing which features were the most predictive will lead to additional work transforming and extracting features. Feature extraction and testing a model will often reveal data quality issues that were missed earlier and cause the data scientist to revisit that step to address those issues.

In other words, in practice everything is interlinked and many data scientists view the various steps in the process of constructing a model (including cleaning and transforming) as one holistic process that they enjoy completing. However, because the cleaning and transforming aspects are the most time consuming, these aspects (data cleaning in particular) are often seen as being the major impediment to a completed project.

This is true – almost all projects could be completed significantly more quickly if the data was of a higher quality at the outset. The quick turnaround for most Kaggle competition entries (where relatively clean and standardized data are provided to everyone) attests to this. But to my fellow data scientists, I would say the following. Data science will always involve working with dirty and underdeveloped data – no matter how good we get at data validation, how clean and intuitive the interfaces, or how much planning is done on what data points to collect. Embrace the dirt, celebrate the grind, and take pride in finding creative solutions to often complex and challenging problems. If you don’t, no one else will.

Data Science: A Kaggle Walkthrough – Data Transformation and Feature Extraction

This article on data transformation and feature extraction is Part IV in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read Part I, Part II and Part III.

Continuing on the walkthrough, in this part we focus on getting the data we cleaned in Part III ready for use in the classification algorithm. These steps are often referred to as data transformation and feature extraction.

Data Transformation and Feature Extraction as a Concept

The main purpose of data transformation and feature extraction is to enhance the data in such a way that it increases the likelihood that the classification algorithm will be able to make meaningful predictions. Unlike the steps taken during cleaning, which are designed to address problems with the raw data (missing and erroneous values, formatting issues etc.), these steps change the values and/or structure of the data (data transformation) and add additional features (feature extraction).

As you might imagine, this is quite an open-ended process, and hence a lot of the value that data scientists provide comes in these steps. There is no textbook or walkthrough that can tell you exactly what steps to take for a given dataset; that knowledge can come only from experience, curiosity, and trial and error. However, we can take a look at some common methods to provide a sense of what is possible. Please keep in mind this is not an exhaustive list of options.

Data Transformation

Data transformation covers the steps taken to modify the data, with the intention of enhancing the ability of the classification algorithm to extract information from it. Below are a few common data transformation methods.


Binning/Bucketing

A common method for manipulating numeric data, binning (or bucketing) converts the numerical values in a particular column from a continuous series into fixed ranges. For example, instead of using the age value of all our users, we could place them into buckets such as 15-20 years old, 21-25 years old and so on.

Typically this technique is used to manage ‘noisy data’. To understand what this means, think of the movements of the stock market over time: it goes up and down on an almost daily basis. However, if you are trying to predict the overall direction of the stock market over the next 6 months, these daily movements become kind of irrelevant – what you really want your model to focus on are the movements over longer periods of time. What is more, the essentially random daily movements in stock prices may actually confuse your prediction model – causing less accurate predictions. In this example, the daily movements are the noise and what you want to extract (the longer term direction of the market) is ‘the signal’.

The same logic can be applied to any numerical field in your dataset. If you are concerned that small changes in a given value may simply be representing random ‘noise’, you may want to consider bucketing/binning to remove that noise.
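
As a quick illustration, here is how bucketing an age column might look in pandas using pd.cut (the ages and bucket boundaries below are made up):

import pandas as pd

# Hypothetical ages converted from a continuous series into fixed buckets
ages = pd.Series([16, 19, 23, 31, 45])
age_buckets = pd.cut(ages, bins=[15, 20, 25, 35, 50],
                     labels=["15-20", "21-25", "26-35", "36-50"])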


Normalization

Although normalization can take on a large number of meanings depending on the context, the type of normalization being referred to here is the statistical type – converting the values of a column into a ‘normalized’ range. This could be translating heights from centimeter values anywhere from 100cm to 220cm to a scale where 0 represents the average (mean) height for your dataset and -1/+1 represent one standard deviation from that average. Or it could be translating those heights into a range of values from 0 to 1, where 0 is the lowest value in your dataset and 1 is the maximum value. There are a number of other methods that can be used here as well.

This type of transformation is more important for some types of algorithms than others. For some algorithms – like the one we will be using – this type of transformation is not typically necessary. But for other algorithms, the magnitude of the values in each column will impact the calculations. In these cases, it is optimal to convert (‘normalize’) the values in each column onto the same scale to ensure each column is treated equally. For a more detailed explanation on this subject, this answer from Quora is a good place to start.
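
Here is a minimal sketch of the two normalization approaches described above, using made-up height data:

import pandas as pd

heights = pd.Series([152.0, 168.0, 175.0, 181.0, 203.0])  # heights in cm

# Z-score normalization: 0 is the mean, -1/+1 is one standard deviation
z_scores = (heights - heights.mean()) / heights.std()

# Min-max normalization: rescale values into the range 0 to 1
min_max = (heights - heights.min()) / (heights.max() - heights.min())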

Other Mathematical Transformations

In a similar manner to normalization, there is an almost unlimited number of ways that the numerical values of a given column can be transformed such that they are more suitable for the algorithm being used.

To provide one example, arguably the most common transformation (other than normalization) is to apply a logarithm function. This is a commonly used method of dealing with exponential data series (i.e. a column where there are a lot of low values and relatively few high values). For those wanting to understand this transformation better, the Wikipedia page on this topic has a great illustrated example.
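
As a quick sketch, using a hypothetical skewed column with many low values and a few very high ones:

import numpy as np
import pandas as pd

incomes = pd.Series([20000, 35000, 40000, 60000, 2500000])

# log1p computes log(1 + x), compressing the long right tail
# (and handling zero values safely)
log_incomes = np.log1p(incomes)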

As I am hemorrhaging readers at this point, I won’t go into detail on the various other transformations possible – the key point is to be aware that there is a large range of possibilities here depending on your needs.

One Hot Encoding

Looking at one more example, and the most relevant one for our Kaggle competition, this transformation is one used for categorical data. What this transformation does is take one column with x categories (x must be greater than 2 for this to make sense) and convert it into x columns where each column represents one category in the original column. An illustrated example is shown below:

[Image: one hot encoding – a single categorical column expanded into one binary column per category]

For those familiar with regression modeling, you may recognize this as the same process of creating dummy variables.

Again, there are a few reasons for doing this type of transformation. Some algorithms are structured in such a way that they do not handle categorical data very well – particularly when the categories do not have an inherent order (this answer on Stack Overflow does a good job of explaining why). Other types of algorithms require numerical data to function. The only way to work out whether this transformation will be beneficial is either to read through the documentation for the algorithm you are using or to test it yourself.
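
For illustration, pandas can perform this transformation in one line with pd.get_dummies (the toy column below is made up; later in this walkthrough we will write our own version of this function instead):

import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One column per category, with a 1 (or True, in recent pandas versions)
# marking the rows that belong to that category
encoded = pd.get_dummies(df, columns=["colour"])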

Feature Extraction

Feature extraction is often broken down into the sub-steps of feature construction and feature selection; here we will focus on feature construction. Below are a couple of ways additional features can be constructed and added to your dataset.

Using Hierarchical Information

It will sometimes be the case that data in your dataset represents one level of a particular hierarchy, and that extracting the other implied levels of that hierarchy will provide the model with useful information.

For example, imagine a dataset with a column containing countries. This column allows the algorithm to look for patterns (in combination with all other columns) at the country level. However, by adding a new ‘region’ column based on the country column (Europe, South Asia, North Africa etc.), you may be providing information that allows the algorithm to look for patterns across countries.
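
A minimal sketch of what this might look like, assuming a hand-built country-to-region mapping:

import pandas as pd

region_map = {"France": "Europe", "Germany": "Europe",
              "India": "South Asia", "Egypt": "North Africa"}

df = pd.DataFrame({"country": ["France", "India", "Egypt"]})

# Add the implied higher level of the hierarchy as a new feature
df["region"] = df["country"].map(region_map)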

One of the most common ways to do this is with date fields. Take the date fields in the dataset we are working with as an example. By extracting the day of the week, the month of the year or the hour of the day, we could add important information for the algorithm to use. Maybe people who create their accounts in summer months are more likely to make a booking in a warmer country. Maybe people who were first active late at night are more disorganized travelers and are therefore more likely to make a domestic first booking. Additionally, it could be any combination of these factors that makes the difference (e.g. users first active late at night, in the summer months, on a weekday are more likely to travel to Portugal). The point is not to be able to explain why a factor may be important, but to think of as many factors as possible to test, and allow the algorithm to determine what is important and not important.

Adding External Data

One of the aspects of feature extraction that often gets overlooked is how data can be enriched through the addition of new external data. Using techniques such as record linkage, existing datasets can be greatly expanded by adding new data points for a given record. This new data often provides valuable new information that the algorithm can use to make more accurate predictions.

For example, a training dataset that contains a column with countries could be enriched with demographic data about the country such as population, income per capita or land area – all factors that may allow the algorithm to draw conclusions across similar groups of countries on any of those measures.
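
As a hypothetical sketch of this kind of enrichment (all the values below are made up), a simple join on the country column does the record linkage:

import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "country": ["Portugal", "Japan"]})

# Hypothetical external demographic data keyed by country
demographics = pd.DataFrame({
    "country": ["Portugal", "Japan"],
    "population_m": [10.3, 125.7],
    "income_per_capita": [24000, 40000],
})

# Left join keeps every user and appends the matching demographic columns
enriched = users.merge(demographics, on="country", how="left")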

Relating this concept to the competition we are working through, consider how much more accurately we could predict a first booking country of a user if we could link the data from their Airbnb profile to data from one of their social media profiles (Facebook, Twitter etc.) or even better, from a Tripadvisor or Expedia account.

The key point here is that it is worth investing time looking for ways to add new and useful data to your existing dataset before moving on to the modeling step. Expanding your dataset in this manner will often produce far bigger improvements in prediction accuracy than the choice of algorithm or the tuning of the algorithm parameters.

The Importance of Domain Knowledge

One of the things that may have occurred to you as you read through the various ways to modify and expand a dataset is: how are you supposed to know what will help and what won’t?

This is where knowledge about the data you are using and what it represents becomes so important. This knowledge – referred to as domain knowledge – helps guide this entire process, including what was covered in Part III, cleaning the data.

Understanding how the data was collected helps to provide insight into potential errors in the data that might need to be addressed or shortcomings in the way the data was sampled (sample selection bias/errors). Understanding the relevant industry or market can also provide a range of insights including:

  • what additional information is available to expand your dataset
  • what information may help to increase prediction accuracy and what is likely to be irrelevant
  • if the model makes intuitive sense (e.g. can you predict the likelihood of waking up with a headache based on whether someone slept with their shoes on?[1]), and
  • if the industry or market is changing in such a way that it is likely to make the model redundant in the near future.

In practical terms, where does this leave aspiring data scientists?

The first thing is to realize that, obviously, it is not possible to be a domain expert for every domain. Acknowledging this limitation is important as it forces a second realization – you will almost always need to seek out this expertise. For most of us, that means involving and utilizing people who are domain experts when constructing our datasets and models. Having access to that expertise is likely to be the difference between a model that gets thrown out in 6 months and one that fundamentally improves a business and/or fulfills a customer need.

Step by Step

After all the theory, let’s put some of these techniques into practice.

Transforming Categorical Data

The first step we are going to undertake is some One Hot Encoding – replacing each categorical field in the dataset with multiple columns, one for each value the original column can take.

To do this, the scikit-learn library comes with a OneHotEncoder class that we could use for these transformations, but it is often instructive to write your own function, particularly if it is a relatively simple one like this. The code snippet below creates a simple function to do the encoding for a specified column, and then uses that function in a loop to convert all the categorical columns (and then delete the original columns).

# Home-made One Hot Encoding function
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        # Build a clean, abbreviated column name for the category
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        # Create the binary column: 1 where the row matches this category
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)

Creating New Features

From Part II of this series, one of the things we observed about the training (and test) datasets is that there is not a huge number of columns to work with. This limits what new features we can add based on the existing data. However, two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in. The code for extracting a range of different data points from these two date columns (and then deleting the original date columns) is shown below:

# Add new date related fields
print("Adding new fields...")
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)

Wrapping Up

In two relatively simple steps, we have changed our training dataset from 14 columns to 163 columns. Although this seems like a lot more information, most of this expansion was caused by the One Hot Encoding, which is not adding more information, but simply expanding out the existing information. We have not added any external data, and I didn’t even really investigate what information we could have extracted from the other non-date columns.

Again, this process is open ended, so there is an almost unlimited range of possibilities that we have not even really begun to explore. As such, if you see an additional transformation or have an idea for the addition of a new feature, please feel free to let me know in a comment!

Next Time

In the next piece, we will look at the data in sessions.csv that we left aside initially and see how we can add that data to our training dataset.


[1] This is an example of the existence of a confounding factor. A model predicting whether someone will wake up with a headache based on whether they slept with their shoes on ignores that there is a more logical explanation for the headaches – in this case, that both the headaches and sleeping with shoes on are caused by a third factor: going to bed drunk.

