Data Inspired Insights

# Month: March 2016

This article on data transformation and feature extraction is Part IV in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read Part I, Part II and Part III.

Continuing on the walkthrough, in this part we focus on getting the data we cleaned in Part III ready for use in the classification algorithm. These steps are often referred to as data transformation and feature extraction.

## Data Transformation and Feature Extraction as a Concept

The main purpose of data transformation and feature extraction is to enhance the data in such a way that it increases the likelihood that the classification algorithm will be able to make meaningful predictions. Unlike the steps taken during cleaning, which are designed to address problems with the raw data (missing and erroneous values, formatting issues etc.), these steps change the values and/or structure of the data (data transformation) and add additional features (feature extraction).

As you might imagine, this is quite an open-ended process, and hence a lot of the value that data scientists provide comes in these steps. There is no textbook or walkthrough that can tell you exactly what steps you should take for a given dataset, that knowledge can come only from experience, curiosity and trial and error. However, we can take a look at some common methods to provide a sense of what is possible. Please keep in mind this is not an exhaustive list of options.

### Data Transformation

Covering steps taken to modify the data, data transformation is undertaken with the intention to enhance the ability of the classification algorithm to extract information from the data. Below are a few common data transformation methods used.

#### Bucketing/Binning

A common method for manipulating numeric data, binning or bucketing is when the numerical values in a particular column are converted from a continuous series into fixed ranges. For example, instead of using the age value of all our users, we could place them into buckets such as 15-20 years old, 21-25 years old and so on.

Typically this technique is used to manage ‘noisy data’. To understand what this means, think of the movements of the stock market over time: it goes up and down on an almost daily basis. However, if you are trying to predict the overall direction of the stock market over the next 6 months, these daily movements become kind of irrelevant – what you really want your model to focus on are the movements over longer periods of time. What is more, the essentially random daily movements in stock prices may actually confuse your prediction model – causing less accurate predictions. In this example, the daily movements are the noise and what you want to extract (the longer term direction of the market) is ‘the signal’.

The same logic can be applied to any numerical field in your dataset. If you are concerned that small changes in a given value may simply be representing random ‘noise’, you may want to consider bucketing/binning to remove that noise.

#### Normalization

Although normalization can take on a large number of meanings depending on the context, the type of normalization being referred to here is the statistical type – converting the values of a column into a ‘normalized’ range. This could be translating heights from centimeter values anywhere from 100cm to 220cm to a scale where 0 represents the average (mean) height for your dataset and -1/+1 represent one standard deviation from that average. It could be translating those heights into a range of values from 0 to 1, where 0 is the lowest value in your dataset and 1 is the maximum value. There is a number of other methods that can be used here as well.

This type of transformation is more important for certain types of algorithms than others. For some algorithms – like the one we will be using – this type of transformation is not typically necessary. But for other algorithms, the magnitude of the values in each column will impact the calculations. In these cases, it is optimal to convert (‘normalize’) the values in each column onto the same scale to ensure each column is treated the equally. For a more detailed explanation on this subject, this answer from Quora is a good place to start.

#### Other Mathematical Transformations

In a similar manner to normalization, there is an almost unlimited number of ways that the numerical values of a given column can be transformed such that they are more suitable for the algorithm being used.

To provide one example, arguably the most common transformation (other than normalization) is to use a logarithm function. This transformation is a commonly used method of dealing with exponential data series (i.e. a column where there a lot of low values and relatively few high values). For those wanting to understand this transformation better, the Wikipedia page on this topic has a great illustrated example.

As I am hemorrhaging readers at this point, I won’t go into detail on the various other transformations possible – the key point is to be aware that there is a large range of possibilities here depending on your needs.

#### One Hot Encoding

Looking at one more example, and the most relevant one for our Kaggle competition, this transformation is one used for categorical data. What this transformation does is take one column with x categories (x must be greater than 2 for this to make sense) and convert it into x columns where each column represents one category in the original column. An illustrated example is shown below:

For those familiar with regression modeling, you may recognize this as the same process of creating dummy variables.

Again there are a few reasons for doing this type of transformation. Some algorithms are structured in such a way that they do not handle categorical data very well – particularly when the categories do not have an inherent order (this answer on Stack Overflow does a good job of explaining why). Some other types of algorithms require numerical data to function. The only way to work out whether this transformation will be beneficial is to either read through the documentation for the algorithm you are using or to test it yourself.

### Feature Extraction

Often broken down into sub steps of feature construction and feature selection, here we will focus on feature construction. Below are a couple of ways additional features can be constructed and added to your dataset.

#### Using Hierarchical Information

It will sometimes be the case that data in your dataset represents one level of a particular hierarchy, and that extracting the other implied levels of that hierarchy will provide the model with useful information.

For example, imagine a dataset with a column containing countries. This column allows the algorithm to look for patterns (in combination with all other columns) at the country level. However, by adding a new ‘region’ column based on the country column (Europe, South Asia, North Africa etc.), you may be providing information to the algorithm that allows it look for patterns across countries.

One of the most common ways to do this is with date fields. Take the date fields in the dataset we are working with as an example. By extracting the day of the week, the month of the year or the hour of the day, we could add important information for the algorithm to use. Maybe people who create their accounts in summer months are more likely to make a booking in a warmer country. Maybe people who were first active late at night are more disorganized travelers and are therefore more likely to make a domestic first booking. Additionally, it could be any combination of these factors that makes the difference (e.g. users first active late at night, in the summer months, on a weekday are more likely to travel to Portugal). The point is not to be able to explain why a factor may be important, but to think of as many factors as possible to test, and allow the algorithm to determine what is important and not important.

One of the aspects of feature extraction that often gets overlooked is how data can be enriched through the addition of new external data. Using techniques such as record linkage, existing datasets can be greatly expanded by adding new data points for a given record. This new data often provides valuable new information that the algorithm can use to make more accurate predictions.

For example, a training dataset that contains a column with countries could be enriched with demographic data about the country such as population, income per capita or land area – all factors that may allow the algorithm to draw conclusions across similar groups of countries on any of those measures.

Relating this concept to the competition we are working through, consider how much more accurately we could predict a first booking country of a user if we could link the data from their Airbnb profile to data from one of their social media profiles (Facebook, Twitter etc.) or even better, from a Tripadvisor or Expedia account.

The key point here is that it is worth investing time looking for ways to add new and useful data to your existing dataset before moving onto the modeling step. Expending your dataset in this manner will often produce far bigger improvements in prediction accuracy than the choice of algorithm or the tuning of the algorithm parameters.

## The Importance of Domain Knowledge

One of the things that may have occurred to you as you read through the various ways to modify and expand a dataset is how are you supposed to know what will help or not?

This is where knowledge about the data you are using and what it represents becomes so important. This knowledge – referred to as domain knowledge – helps guide this entire process, including what was covered in Part III, cleaning the data.

Understanding how the data was collected helps to provide insight into potential errors in the data that might need to be addressed or shortcomings in the way the data was sampled (sample selection bias/errors). Understanding the relevant industry or market can also provide a range of insights including:

• what information may help to increase prediction accuracy and what is likely to be irrelevant
• if the model makes intuitive sense (e.g. can you predict the likelihood of a waking up with a headache based on whether someone slept with their shoes on?[1]), and
• if the industry or market is changing in such a way that it is likely to make the model redundant in the near future.

In practical terms, where does this leave aspiring data scientists?

The first thing is to realize that, obviously, it is not possible to be a domain expert for every domain. Acknowledging this limitation is important as it forces a second realization – you will almost always need to seek out this expertise. For most of us that means involving and utilizing people who are domain experts when constructing your dataset and model. Having access to that expertise is likely to be the difference between a model that gets thrown out in 6 months and one that fundamentally improves a business and/or fulfills a customer need.

## Step by Step

After all the theory, let’s put some of these techniques into practice.

### Transforming Categorical Data

The first step we are going to undertake is some One Hot Encoding – replacing the categorical fields in the dataset with multiple columns representing one value from each column.

To do this, the Scikit Learn library comes with a One Hot Encoder method that we could use to do these transformations, but it is often instructive to write your own function, particularly if it is a relative simple one like this. The code snippet below creates a simple function to do the encoding for a specified column, and then uses that function in a loop to convert all the categorical columns (and then delete the original columns).

``````# Home made One Hot Encoding function
def convert_to_binary(df, column_to_convert):
categories = list(df[column_to_convert].drop_duplicates())

for category in categories:
cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
col_name = column_to_convert[:5] + '_' + cat_name[:10]
df[col_name] = 0
df.loc[(df[column_to_convert] == category), col_name] = 1

return df

# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
df_all = convert_to_binary(df=df_all, column_to_convert=column)
df_all.drop(column, axis=1, inplace=True)``````

### Creating New Features

From Part II of this series, one of the things we observed about the training (and test) datasets is that there is not a huge number of columns to work with. This limits what new features we can add based on the existing data. However, two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in. The code for extracting a range of different data points from these two date columns (and then deleting the original date columns) is shown below:

``````# Add new date related fields
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
if column in df_all.columns:
df_all.drop(column, axis=1, inplace=True)``````

## Wrapping Up

In two relatively simple steps, we have changed our training dataset from 14 columns to 163 columns. Although this seems like a lot more information, most of this expansion was caused by the One Hot Encoding, which is not adding more information, but simply expanding out the existing information. We have not added any external data, and I didn’t even really investigate what information we could have extracted from the other non-date columns.

Again, this process is open ended, so there is an almost unlimited range of possibilities that we have not even really begun to explore. As such, if you see an additional transformation or have an idea for the addition of a new feature, please feel free to let me know in a comment!

## Next Time

In the next piece, we will look at the data in sessions.csv that we left aside initially and see how we can add that data to our training dataset.

[1] This is an example of the existence of a confounding factor. A model predicting whether someone will wakeup with a headache based on whether they slept with their shoes on ignores that there is a more logical explanation for the headaches – in this case that both the headaches and sleeping with shoes on are caused by a third factor – going to bed drunk.

This article on cleaning data is Part III in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, it is recommended that you go back and read Part I and Part II.

In this part we will focus on cleaning the data provided for the Airbnb Kaggle competition.

## Cleaning Data

When we talk about cleaning data, what exactly are we talking about? Generally when people talk about cleaning data, there are a few specific things they are referring to:

1. Fixing up formats – Often when data is saved or translated from one format to another (for example in our case from CSV to Python), some data may not be translated correctly. We saw a good example of this in the last article in csv. The timestamp_first_active column contained numbers like 20090609231247 instead of timestamps in the expected format: 2009-06-09 23:12:47. A typical job when it comes to cleaning data is correcting these types of issues.
2. Filling in missing values – As we also saw in Part II, it is quite common for some values to be missing from datasets. This typically means that a piece of information was simply not collected. There are several options for handling missing data that will be covered below.
3. Correcting erroneous values – For some columns, there are values that can be identified as obviously incorrect. This may be a ‘gender’ column where someone has entered a number, or an ‘age’ column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or assumed to be missing.
4. Standardizing categories – More of a subcategory of ‘correcting erroneous values’, this type of data cleansing is so common it is worth special mention. In many (all?) cases where data is collected from users directly – particularly using free text fields – spelling mistakes, language differences or other factors will result in a given answer being provided in multiple ways. For example, when collecting data on country of birth, if users are not provided with a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values to ensure that there is only one version of each value.

## Options for Dealing with Missing Data

Missing data in general is one of the trickier issues that is dealt with when cleaning data. Broadly there are two solutions:

### 1. Deleting/Ignoring rows with missing values

The simplest solution available when faced with missing values is to not use the records with missing values when training your model. However, there are some issues to be aware of before you starting deleting masses of rows from your dataset.

The first is that this approach only makes sense if the number of rows with missing data is relatively small compared to the dataset. If you are finding that you will be deleting more than around 10% of your dataset due to rows having missing values, you may need to reconsider.

The second issue is that in order to delete the rows containing missing data, you have to be confident that the rows you are deleting do not contain information that is not contained in other rows. For example, in the current Airbnb dataset we have seen that many users have not provided their age. Can we assume that the people who chose not to provide their age are the same as the users who did? Or are they likely to represent a different type of user, perhaps an older and more privacy conscious user, and therefore a user that is likely to make different choices on which countries to visit? If the answer is the latter, we probably do not want to just delete the records.

### 2. Filling in the Values

The second broad option for dealing with missing data is to fill the missing values with a value. But what value to use? This depends on a range of factors, including the type of data you are trying to fill.

If the data is categorical (i.e. countries, device types, etc.), it may make sense to simply create a new category that will represent ‘unknown’. Another option may be to fill the values with the most common value for that column (the mode). However, because these are broad methods for filling the missing values, this may oversimplify your data and/or make your final model less accurate.

For numerical values (for example the age column) there are some other options. Given that in this case using the mode to fill values makes less sense, we could instead use the mean or median. We could even take an average based on some other criteria – for example filling the missing age values based on an average age for users that selected the same country_destination.

For both types of data (categorical and numerical), we can also use far more complicated methods to impute the missing values. Effectively, we can use a similar methodology that we are planning to use to predict the country_destination to predict the values in any of the other columns, based on the columns that do have data. And just like with modeling in general, there are an almost endless number of ways this can be done, which won’t be detailed here. For more information on this topic, the orange Python library provides some excellent documentation.

## Step by Step

With that general overview out of the way, let’s start cleaning the Airbnb data. In relation to the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files – train_users_2.csv and test_users.csv and leave aside sessions.csv.

The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from two files train_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once[1].

``````import pandas as pd

# Import data
tr_filepath = "./train_users_2.csv"
te_filepath = "./test_users.csv"

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)``````

### Clean the Timestamps

Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?

The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.

Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see here). For the timestamp_first_active column, the date format provided is different so we adjust our expression accordingly.

Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because the date_account_created column is sometimes empty, we replace the empty values with the value in the date_account_created column using the fillna function. The code for this step is provided below:

``````# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)``````

### Remove booking date field

Those following along and/or paying attention may have noticed that in the original dataset, there are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in training_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.

This means that this column is not going to be useful for predicting which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chances that the model predicts NDF as those are the records without dates in the training dataset. The code for removing the column is provided below:

``````# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)``````

### Clean the Age column

As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or too low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, but implies we do not know the age value. In other words we are changing the incorrect values into missing values. To do this, we create a simple function that intakes a dataframe (table), a column name, a maximum acceptable value (90) and a minimum acceptable value (15). This function will then replace the values in the specified column that are outside the acceptable range with NaN.

Again from Part II we know there were also a significant number of users who did not provide their age at all – so they also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.

The code for these steps is shown below:

``````import numpy as np

# Remove outliers function
def remove_outliers(df, column, min_val, max_val):
col_values = df[column].values
df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.NaN, col_values)
return df

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)``````

As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column. We are selecting this simple method for two main reasons:

1. Clarity – this series of articles is going to be long enough without adding the complication of a complex methodology for imputing missing ages.
2. Questionable results – in my testing during the actual competition, I did test several more complex imputation methodologies. However, none of the methods I tested actually produced a better end result than the methodology outlined above.

### Identify and fill additional columns with missing values

From more detailed analysis of the data, you may have also realized there is one more column that has missing values – the first_affiliate_tracked column. In the same way we have been filling in the missing values in other columns, we now fill in the values in this column.

``````# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)``````

### Sample Output

So what does the data look like after all these changes? Here is a sample of some rows from our cleaned dataset:

idaffiliate_channelaffiliate_provideragecountry_destinationdate_account_createdfirst_affiliate_trackedfirst_browserfirst_device_typegenderlanguagesignup_appsignup_flowsignup_methodtimestamp_first_active
4ft3gnwmtxdirectdirect56.0US2010-09-28 00:00:00untrackedIEWindows DesktopFEMALEenWeb3basic2009-06-09 23:12:47
87mebub9p4directdirect41.0US2010-09-14 00:00:00untrackedChromeMac Desktop-unknown-enWeb0basic2009-12-08 06:11:05
osr2jwljorotherother-1.0US2010-01-01 00:00:00omgChromeMac Desktop-unknown-enWeb0basic2010-01-01 21:56:19
lsw9q7uk0jothercraigslist46.0US2010-01-02 00:00:00untrackedSafariMac DesktopFEMALEenWeb0basic2010-01-02 01:25:58
0d01nltbrsdirectdirect47.0US2010-01-03 00:00:00omgSafariMac DesktopFEMALEenWeb0basic2010-01-03 19:19:05
a1vcnhxeijothercraigslist50.0US2010-01-04 00:00:00untrackedSafariMac DesktopFEMALEenWeb0basic2010-01-04 00:42:11
6uh8zyj2gnothercraigslist46.0US2010-01-04 00:00:00omgFirefoxMac Desktop-unknown-enWeb0basic2010-01-04 02:37:58
yuuqmid2rpothercraigslist36.0US2010-01-04 00:00:00untrackedFirefoxMac DesktopFEMALEenWeb0basic2010-01-04 19:42:51
om1ss59ys8othercraigslist47.0NDF2010-01-05 00:00:00untracked-unknown-iPhoneFEMALEenWeb0basic2010-01-05 05:18:12
k6np330cm1directdirect-1.0FR2010-01-05 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-05 06:08:59
ju3h98ch3wothercraigslist36.0NDF2010-01-07 00:00:00untrackedMobile SafariiPhoneFEMALEenWeb0basic2010-01-07 05:58:20
v4d5rl22pxdirectdirect33.0CA2010-01-07 00:00:00untrackedChromeWindows DesktopFEMALEenWeb0basic2010-01-07 20:45:55
2dwbwkx056othercraigslist-1.0NDF2010-01-07 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-07 21:51:25
frhre329auothercraigslist31.0US2010-01-07 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-07 22:46:25
gdka1q5ktddirectdirect29.0FR2010-01-10 00:00:00untrackedChromeMac DesktopFEMALEenWeb0basic2010-01-10 01:08:17

### Is that all?

Those more experienced with working with data may be thinking that we have not done all that much cleaning with this data – and you would be right. One of the nice things about Kaggle competitions is that the data provided does not require all that much cleaning as that is not what the providers of the data want participants to focus on. Many of the problems that would be found in real world data (as covered earlier) do not exist in this dataset, saving us significant time.

However, what this relatively easy cleaning process also tells us is that even when datasets are provided with the intention of needing no or minimal cleaning, there is always something that needs to be done.

## Next Time

In the next piece, we will focus on transforming the data and feature extraction, allowing us to create a training dataset that will hopefully allow the model to make better predictions. To make sure you don’t miss out, use the subscription feature below.

[1] For those with more data mining experience you may realize that combining the test and training data at this stage is not best practice. The best practice would be to avoid using the test dataset in any of the data preprocessing or model tuning/validation steps to avoid over fitting. However, in the context of this competition, because we are only trying to create the model to classify one unchanging dataset, simply maximizing the accuracy of the model for that dataset is the primary concern.