
# Category: Data Science (Page 1 of 2)

Once we have our data in a pandas DataFrame (the basic table structure in pandas), the next step is to assess what we have. If you are coming from Excel or RStudio, you are probably used to being able to look at the data any time you want. In Python/pandas we don’t have a spreadsheet to work with, and we don’t have an exact equivalent of RStudio either (although Jupyter notebooks are a similar concept), but we do have several tools available that can help you get a handle on what your data looks like.

## DataFrame Dimensions

Perhaps the most basic question is how much data do I actually have? Did I successfully load in all the rows and columns I expected or are some missing? These questions can be answered with the `shape` attribute:

``````import pandas as pd

print(df.shape)

(1599, 12)``````

`shape` returns a tuple (think of it as a list that you can’t alter) which tells you the number of rows and columns, 1599 and 12 respectively in this example. You can also use `len` to get the number of rows:

``````print(len(df))

1599``````

Using `len` is also slightly quicker than using `shape`, so if it is just the number of rows you are interested in, go with `len`.

Another dimension we might be interested in is the size of the table in terms of memory. For this we can use the `memory_usage` method:

``````print(df.memory_usage())

Index                     128
fixed acidity           12792
volatile acidity        12792
citric acid             12792
residual sugar          12792
chlorides               12792
free sulfur dioxide     12792
total sulfur dioxide    12792
density                 12792
pH                      12792
sulphates               12792
alcohol                 12792
quality                 12792
dtype: int64``````

This tells us the space, in bytes, each column is taking up in memory. If we want to know the total size of the DataFrame, we can take a sum, and then to get the number into a more readable unit like kilobytes (kB) or megabytes (MB), we can divide by 1024 as many times as needed.

``````print(df.memory_usage().sum() / 1024)  # Size in kB

150.03125``````
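If we want megabytes, we simply divide by 1024 twice. One caveat worth knowing: for `object` (string) columns, the default numbers can understate the true usage, and passing `deep=True` makes pandas inspect the actual Python objects and report a more accurate figure. A minimal sketch, assuming the same DataFrame `df`:

```
# Total size in megabytes (MB): divide by 1024 twice
print(df.memory_usage().sum() / 1024 / 1024)

# deep=True gives a more accurate figure when object (string) columns are present
print(df.memory_usage(deep=True).sum() / 1024)
```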

Lastly, for these basic dimension assessments, we can generate a list of the data types of each column. This can be a very useful early indicator that your data has been read in correctly. If you have a column that you believe should be numeric (i.e. a `float64` or an `int64`) but it is listed as `object` (the dtype pandas uses for strings and other non-numeric data), it may be a sign that something has not been interpreted correctly:

``````print(df.dtypes)

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object``````

## Viewing some sample rows

After we have satisfied ourselves that we have the expected volume of data in our DataFrame, we might want to look at some actual rows of data. Particularly for large datasets, this is where the `head` and `tail` methods come in handy. As the names suggest, `head` will return the first n rows of the DataFrame and `tail` will return the last n rows of the DataFrame (n is set to 5 by default for both).

``print(df.head())``

Aside from this basic use, we can use `head` and `tail` in some very useful ways with some alterations/additions. First off, we can set n to whatever value we want, showing as many or as few rows as desired:

``print(df.head(10))``

We can also combine it with `sort_values` to see the top (or bottom) n rows of data sorted by a column, or selection of columns:

``print(df.sort_values('fixed acidity', ascending=False).head(10))``

Finally, if we have a lot of columns, too many to display all of them in Jupyter or the console, we can combine `head`/`tail` with `transpose` to inspect all the columns for a few rows:

``print(df.head().transpose())``

## Summary Statistics

Moving on, the next step is typically some exploratory data analysis (EDA). EDA is a very open ended process so no one can give you an explicit set of instructions on how to do it. Each dataset is different and to a large extent, you just have to allow your curiosity to run wild. However, there are some tools we can take advantage of in this process.

### Describe

The most basic way to summarize the data in your DataFrame is the `describe` method. This method, by default, gives us a summary of all the numeric fields in the DataFrame, including counts of values (which exclude null values), the mean, standard deviation, min, max and some percentiles.

``df.describe()``

This is nice, but let’s talk about what isn’t being shown. Firstly, by default, any non-numeric and date fields are excluded. In the dataset we are using in this example we don’t have any non-numeric fields, so let’s add a couple of categorical fields (they will just have random letters in this case), and a date field:

``````import string
import random

df['categorical'] = [random.choice(string.ascii_letters) for i in range(len(df))]
df['categorical_2'] = [random.choice(string.ascii_letters) for i in range(len(df))]
df['date_col'] = pd.date_range(start='2020-11-01', periods=len(df))

df.describe()``````

As we can see, the output didn’t change. But we can use some of the parameters for `describe` to address that. First, we can set `include='all'` to include all datatypes in the summary:

``df.describe(include='all')``

Now, for the categorical columns, it tells us some useful numbers such as the number of unique values and which value is the most frequent. However, the date column is still being handled like a categorical value. We can change that so it is treated as a numeric value by setting the `datetime_is_numeric` parameter to `True`:

``df[['date_col']].describe(datetime_is_numeric=True)``

### `pandas_summary` Library

Building on top of the kind of summaries that are produced by `describe`, some very talented people have developed a library called pandas_summary. This library is purely designed to generate informative summaries of pandas DataFrames. First though we need to do some quick setup (you may need to install the library using pip):

``````from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)``````

Now let’s take a look at two ways we can use this new `DataFrameSummary` object. The first one is `columns_stats`. This is similar to what we saw previously with `describe`, but with one useful addition: the number and percent of missing values in each column:

``dfs.columns_stats``

Secondly, my personal favorite: by selecting a column from the summary object, we can get some really detailed statistics for that individual column, plus a histogram thrown in for numeric fields:

``dfs['fixed acidity']``

## Seaborn

Seaborn is a statistical data visualization library for Python with a full suite of charts that you should definitely look into if you have time, but for today we are going to look at just one very nice feature – `pairplot`. This function will generate pairwise plots for all the columns in your DataFrame with literally one line:

``````import seaborn as sns

sns.pairplot(df, hue="quality")``````

The colors of the plots are determined by the column you select for the `hue` parameter. This lets you see how the values in that column relate to the two features in each pairwise plot. Along the diagonal, where a feature would otherwise be plotted against itself, we instead get the distribution of that feature for each value of the `hue` column.

Note, if you have a lot of columns, be aware that this type of chart will become less useful, and will also likely take a lot of time to render.
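If you do have a lot of columns, one option is to plot only a subset of them using the `vars` parameter. A minimal sketch, assuming the same wine-quality DataFrame and column names used above:

```
import seaborn as sns

# Restrict the grid to a handful of columns to keep it readable
sns.pairplot(df, vars=['fixed acidity', 'pH', 'alcohol'], hue='quality')
```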

## Wrapping Up

Exploratory data analysis (EDA) should be an open ended and flexible process that never really ends. However, when we are first trying to understand the basic dimensions of a new dataset and what it contains, there are some common methods we can employ such as `shape`, `describe` and `dtypes`, and some very useful third party libraries such as `pandas_summary` and `seaborn`. While this explainer does not provide a comprehensive list of methods and techniques, hopefully it has provided you with somewhere to get started.

When we are working with data in software development or when the data comes from APIs, it is often not provided in a tabular form. Instead it is provided in some combination of key-value stores and arrays broadly denoted as JavaScript Object Notation (JSON). So how do we read this type of non-tabular data into a tabular format like a pandas DataFrame?

## Understanding Structures

The first thing to understand about data stored in this form is that there are effectively infinite ways to represent a single dataset. For example, take a simple dataset, shown here in tabular form:

| id | department | first_name | last_name |
|----|------------|------------|-----------|
| 1  | Sales      | John       | Johnson   |
| 2  | Sales      | Peter      | Peterson  |
| 3  | Sales      | Paula      | Paulson   |
| 4  | HR         | James      | Jameson   |
| 5  | HR         | Jennifer   | Jensen    |
| 6  | Accounting | Susan      | Susanson  |
| 7  | Accounting | Clare      | Clareson  |

Let’s look at some of the more common ways this data might be represented using JSON:

### 1. “Records”

A list of objects, with each object representing a row of data. The column names are the keys of each object.

``````[
{
"department": "Sales",
"first_name": "John",
"id": 1,
"last_name": "Johnson"
},
{
"department": "Sales",
"first_name": "Peter",
"id": 2,
"last_name": "Peterson"
},
{
"department": "Sales",
"first_name": "Paula",
"id": 3,
"last_name": "Paulson"
},
...
]``````

### 2. “List”

An object where each key is a column, with the values for that column stored in a list.

``````{
"id": [
1,
2,
3,
4,
5,
6,
7
],
"department": [
"Sales",
"Sales",
"Sales",
"HR",
"HR",
"Accounting",
"Accounting"
],
"first_name": [
"John",
"Peter",
"Paula",
"James",
"Jennifer",
"Susan",
"Clare"
],
"last_name": [
"Johnson",
"Peterson",
"Paulson",
"Jameson",
"Jensen",
"Susanson",
"Clareson"
]
}``````

### 3. “Split”

An object with two keys, one for the column names, the other for the data which is a list of lists representing rows of data.

``````{
"columns": [
"id",
"department",
"first_name",
"last_name"
],
"data": [
[
1,
"Sales",
"John",
"Johnson"
],
[
2,
"Sales",
"Peter",
"Peterson"
],
...
]
}``````

## Creating a DataFrame

So how do we get this data into a pandas DataFrame given that it could come in different forms? The key is knowing what structures pandas understands. In the first two cases above (#1 Records and #2 List), pandas understands the structure and will automatically convert it to a DataFrame for you. All you have to do is pass the structure to the `DataFrame` constructor:

``````list_data = {
"id": [
1,
2,
3,
4,
5,
6,
7
],
"department": [
"Sales",
"Sales",
"Sales",
"HR",
"HR",
"Accounting",
"Accounting"
],
"first_name": [
"John",
"Peter",
"Paula",
"James",
"Jennifer",
"Susan",
"Clare"
],
"last_name": [
"Johnson",
"Peterson",
"Paulson",
"Jameson",
"Jensen",
"Susanson",
"Clareson"
]
}
df = pd.DataFrame(list_data)
print(df)

id 	department     first_name 	last_name
0 	1 	Sales 	       John 	        Johnson
1 	2 	Sales 	       Peter 	        Peterson
2 	3 	Sales 	       Paula 	        Paulson
3 	4 	HR 	       James 	        Jameson
4 	5 	HR 	       Jennifer 	Jensen
5 	6 	Accounting     Susan 	        Susanson
6 	7 	Accounting     Clare 	        Clareson``````

Initializing a DataFrame this way also gives us a couple of options. We can load in only a selection of columns:

``````df = pd.DataFrame(list_data, columns=['id', 'department'])
print(df)

id 	department
0 	1 	Sales
1 	2 	Sales
2 	3 	Sales
3 	4 	HR
4 	5 	HR
5 	6 	Accounting
6	7 	Accounting``````

We can also define an index. However, if one of the fields in your dataset is what you want to set as the index, it is simpler to do so after you load the data into a DataFrame:

``````df = pd.DataFrame(list_data).set_index('id')
print(df)

department first_name last_name
id
1        Sales       John   Johnson
2        Sales      Peter  Peterson
3        Sales      Paula   Paulson
4           HR      James   Jameson
5           HR   Jennifer    Jensen
6   Accounting      Susan  Susanson
7   Accounting      Clare  Clareson``````

## Acceptable Structures

What are the acceptable structures that pandas recognizes? Here are the ones I have found so far:

### Records

``````[
{
"department": "Sales",
"first_name": "John",
"id": 1,
"last_name": "Johnson"
},
{
"department": "Sales",
"first_name": "Peter",
"id": 2,
"last_name": "Peterson"
},
{
"department": "Sales",
"first_name": "Paula",
"id": 3,
"last_name": "Paulson"
},
...
]``````

### List

``````{
"id": [
1,
2,
3,
4,
5,
6,
7
],
"department": [
"Sales",
"Sales",
"Sales",
"HR",
"HR",
"Accounting",
"Accounting"
],
"first_name": [
"John",
"Peter",
"Paula",
"James",
"Jennifer",
"Susan",
"Clare"
],
"last_name": [
"Johnson",
"Peterson",
"Paulson",
"Jameson",
"Jensen",
"Susanson",
"Clareson"
]
}``````

### Dict

``````{
"id": {
0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7
},
"department": {
0: "Sales",
1: "Sales",
2: "Sales",
3: "HR",
4: "HR",
5: "Accounting",
6: "Accounting"
},
"first_name": {
0: "John",
1: "Peter",
2: "Paula",
3: "James",
4: "Jennifer",
5: "Susan",
6: "Clare"
},
"last_name": {
0: "Johnson",
1: "Peterson",
2: "Paulson",
3: "Jameson",
4: "Jensen",
5: "Susanson",
6: "Clareson"
}
}``````

### Matrix

For this one you will have to pass the column names separately.

``````[
[
1,
"Sales",
"John",
"Johnson"
],
[
2,
"Sales",
"Peter",
"Peterson"
],
[
3,
"Sales",
"Paula",
"Paulson"
],
...
]``````
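As a minimal sketch of how this might look (the variable name `matrix_data` and the truncation to three rows are mine):

```
matrix_data = [
    [1, "Sales", "John", "Johnson"],
    [2, "Sales", "Peter", "Peterson"],
    [3, "Sales", "Paula", "Paulson"],
]

# The rows carry no keys, so the column names are passed separately
df = pd.DataFrame(matrix_data, columns=['id', 'department', 'first_name', 'last_name'])
```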

## Inconsistencies

What happens if the data is in one of these formats, but has some inconsistencies? For example, what if we have something that looks like this?

``````[
{
"department": "Sales",
"first_name": "John",
"id": 1,
"last_name": "Johnson",
"extra_field": "woops!"
},
{
"department": "Sales",
"first_name": "Peter",
"id": 2,
"last_name": "Peterson"
},
{
"department": "Sales",
"first_name": "Paula",
"id": 3,
"last_name": "Paulson"
},
...
]``````

Fortunately, pandas is fairly robust to these types of inconsistencies, in this case creating an extra column and filling the remaining rows with NaN (null values):

``````    id  department  first_name  last_name  extra_field
0   1   Sales 	    John 	Johnson    woops!
1   2   Sales 	    Peter 	Peterson   NaN
2   3   Sales 	    Paula 	Paulson    NaN
3   4   HR 	    James 	Jameson    NaN
4   5   HR 	    Jennifer 	Jensen     NaN
5   6   Accounting  Susan       Susanson   NaN
6   7   Accounting  Clare       Clareson   NaN``````

Something important to note is that, depending on the structure and where the inconsistency occurs in the structure, the inconsistency can be handled differently. It could be an additional column, an additional row, or in some cases it may be ignored completely. The key is, as always, to check your data has loaded as expected.

## Explicit Methods

There are two more methods for reading JSON data into a DataFrame: `DataFrame.from_records` and `DataFrame.from_dict`. `DataFrame.from_records` expects data in the ‘Records’ or ‘Matrix’ formats shown above, while `DataFrame.from_dict` will accept data in either the Dict or List structures. These methods are more explicit in what they do and have several potential advantages:

### Clarity

When working with a team or in a situation where other people are going to review your code, being explicit can help them to understand what you are trying to do. Passing some unknown structure to `DataFrame` and knowing/hoping it will interpret it correctly is using a little too much ‘magic’ for some people. For the sanity of others, and yourself in 6 months when you are trying to work out what you did, you might want to consider the more explicit methods.

### Strictness

When writing code that is going to be reused, maintained and/or run automatically, we want to write that code in a very strict way. That is, it should not keep working if the inputs change. Using `DataFrame` could lead to situations where the input data format changes, but is read in anyway and instead breaks something else further down the line. In a situation like this, someone will likely have the unenviable task of following the trail through the code to work out what changed.

Using the more explicit methods is more likely to cause the error to be raised where the problem actually occurred: reading in data which is no longer in the expected format.

### Options

The more explicit methods give you more options for reading in the data. For example `DataFrame.from_records` gives you options to limit the number of rows to read in. `DataFrame.from_dict` allows you to specify the orientation of the data. That is, are the lists of values representative of columns or rows?
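As a rough sketch of these options (`records_list` is a hypothetical list of row objects like the ‘Records’ example above, and `list_data` is the column-oriented structure defined earlier):

```
# nrows caps the number of rows read when the data is passed as an iterator
df_sample = pd.DataFrame.from_records(iter(records_list), nrows=3)

# orient='index' treats the outer keys as row labels rather than column names
df_by_row = pd.DataFrame.from_dict(list_data, orient='index')
```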

### Coercion

In some cases, your data will not play nice and the generic DataFrame method will not correctly interpret your data. Using the more explicit method can help to resolve this. For example, if your objects are in a column of a DataFrame (i.e. a pandas Series) instead of a list, using DataFrame will give you a DataFrame with one column:

``````records_data = pd.Series([
{
"department": "Sales",
"first_name": "John",
"id": 1,
"last_name": "Johnson",
"test": 0
},
{
"department": "Sales",
"first": "Peter",
"id": 2,
"name": "Peterson"
},
{
"dept": "Sales",
"firstname": "Paula",
"sid": 3,
"lastname": "Paulson"
},
{
"dept": "HR",
"name": "James",
"pid": 4,
"last": "Jameson"
}
])
print(pd.DataFrame(records_data))``````

Using the more explicit `DataFrame.from_records` gives you the expected results:

``````records_data = pd.Series([
{
"department": "Sales",
"first_name": "John",
"id": 1,
"last_name": "Johnson",
"test": 0
},
{
"department": "Sales",
"first": "Peter",
"id": 2,
"name": "Peterson"
},
{
"dept": "Sales",
"firstname": "Paula",
"sid": 3,
"lastname": "Paulson"
},
{
"dept": "HR",
"name": "James",
"pid": 4,
"last": "Jameson"
}
])
print(pd.DataFrame.from_records(records_data))``````

## Wrapping Up

We’ve looked at how we can quickly and easily convert JSON format data into tabular data using the `DataFrame` class and the more explicit `DataFrame.from_records` and `DataFrame.from_dict` methods. The downside is this only works if the data is in one of a few structures. The upside is most of the data you will encounter will be in one of these formats, or something that is easily converted into these formats.

If you want to play around with converting data between delimited format and various JSON formats, I can recommend trying an app I built a while back: JSONifyit.

To get started with pandas, the first thing you are going to need to understand is how to get data into pandas. For this guide we are going to focus on reading in tabular data (i.e. data stored in a table with rows and columns). If you don’t have some data available but want to try some things out, a great place to get some data to play with is the UCI Machine Learning Repository.

## Delimited Files

One of the most common ways you will encounter tabular data, particularly data from an external source or publicly available data, is in the form of a delimited file such as comma separated values (CSV), tab separated values (TSV), or separated by some other character. To import this data so you can start playing with it, pandas gives you the `read_csv` function with a lot of options to help manage different cases. But let’s start with the very basic case:

```
import pandas as pd

# The path can be a local file path or a URL
df = pd.read_csv('path/to/data.csv')

# Show the top 5 rows to make sure it read in correctly
print(df.head())
```
Running this code imports the pandas library (as `pd`), uses the `read_csv` function to read in the data and stores it as a pandas DataFrame called `df`, then prints the top 5 rows using the head method. Note that the path to the file that you want to import can be a path to a file on your computer, or it can be a URL (web address) for a file on the internet. As long as you have internet access (and permission to access the file) it will work like you have the file downloaded and saved already.

When reading the data, unless specified, `read_csv` will attempt to automatically detect what the delimiting character is (e.g. “,” for CSV). In most cases this works fine, but in cases where it doesn’t, you can use the `sep` parameter to specify what char to use. For example, if your file is separated with “;” you might do something like:

```
import pandas as pd

# Specify the separator for a ';' delimited file
df = pd.read_csv('path/to/data.csv', sep=';')

# Show the top 5 rows to make sure it is correct
print(df.head())
```

OK, what if your file has some other junk above and/or below the actual data, for example a few rows of notes or metadata sitting above the header row?

We have two options for working around this, the `header` parameter and the `skiprows` parameter:

```
import pandas as pd

# Both DataFrames produce the same result
df_1 = pd.read_csv('path/to/data.csv', header=7)
df_2 = pd.read_csv('path/to/data.csv', skiprows=7)
```

These are equivalent because setting `header=7` tells `read_csv` to look in row 7 (remember the row numbers are 0 indexed) to find the header row, then assume the data starts from the next row. On the other hand, setting `skiprows=7` tells `read_csv` to ignore the first 7 rows (so rows 0 to 6), then it assumes the header row is the first row after the ignored rows.

### Other Useful `read_csv` Parameters

There are dozens of other parameters to help you read in your data and handle a range of strange cases, but here is a selection of the options I have found most useful to date, sketched in the example below.
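For illustration, a hedged sketch using a few options that tend to come up often (the file path and column names are placeholders, and this selection is mine rather than a definitive list):

```
import pandas as pd

df = pd.read_csv(
    'path/to/data.csv',
    usecols=['col_a', 'col_b', 'col_c'],   # read only the columns you need
    nrows=1000,                            # read only the first 1,000 data rows
    dtype={'col_a': str},                  # force a column to a particular type
    parse_dates=['col_c'],                 # interpret a column as dates
    na_values=['', 'N/A', 'missing'],      # extra values to treat as null
)
```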

For the full list of available parameters, check out the official documentation. One thing to note is that although there are a lot of parameters available for `read_csv`, many are focused on helping correctly format and interpret data as it is being read in – for example, interpretation of dates, interpretation of boolean values, and so on. In many (if not most) cases these are things that can be addressed after the data is in a pandas DataFrame, and in some cases, handling these formatting and standardization steps explicitly after reading in the data can make your code easier to understand for the next person who reads it.

## Excel Data

Pandas also has a handy wrapper for reading in Excel data: `read_excel`. Instead of writing your data to a CSV and then reading it in, you can read directly from the Excel file itself. This function has many of the same parameters as `read_csv`, with options to skip rows, read in a sample of rows and/or columns, specify a header row and so on.
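A minimal sketch (the file name and sheet name are placeholders, and `read_excel` needs an Excel engine such as openpyxl installed):

```
import pandas as pd

# Read a specific sheet, skipping a couple of junk rows at the top
df = pd.read_excel('path/to/workbook.xlsx', sheet_name='Sheet1', skiprows=2)
print(df.head())
```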

## Databases

If your data is in a tabular/SQL database, like PostgreSQL, MySQL, BigQuery or something similar, your job gets a little more complicated to set up. But once that setup is done, it becomes really simple to repeatedly query data (using SQL) from that database directly into a DataFrame where you can do what you want with it.

The first step is to create a connection to the database holding the data. It is beyond the scope of this particular guide, but the library you will almost certainly need is SQLAlchemy, or in some cases a library created by the maker of the database (for example, Google has a BigQuery API library called google-cloud-bigquery).
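As a minimal SQLAlchemy sketch (the connection string is a placeholder for your own database credentials):

```
from sqlalchemy import create_engine

# Connection string format: dialect://user:password@host:port/database
engine = create_engine('postgresql://user:password@localhost:5432/db_name')
```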

Once you have connected to your database, pandas provides three functions for you to extract data into a DataFrame: `read_sql_table`, `read_sql_query` and `read_sql`. The last of these, `read_sql`, is what’s called a ‘convenience wrapper’ around `read_sql_table` and `read_sql_query` – the functionality of both the underlying functions can be accessed from `read_sql`. But let’s look at the two underlying functions individually to see what the differences are and what options we have.

`read_sql_table` is a function we can use to extract data from a table in a SQL database. The function requires two parameters `table_name` – the name of the table you want to get the data from; and `con` – the location of the database the table is in. With these two parameters, all data from the specified table (i.e. `SELECT * FROM table_name`) will be returned as a DataFrame:

``df = pd.read_sql_table(table_name='table_name', con='postgres:///db_name')  ``

`read_sql_table` does also give you the option to specify a list of columns to be extracted using the `columns` parameter.
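For example, a sketch reusing the connection string from above and pulling just two (hypothetical) columns:

```
df = pd.read_sql_table(
    table_name='table_name',
    con='postgres:///db_name',
    columns=['column_1', 'column_2'],  # only extract the columns we need
)
```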

`read_sql_query` on the other hand allows you to specify the query you want to run against the database.

```
query = """
SELECT column_1
, column_2
, column_3
FROM table_name
WHERE column_4 > 10
"""

df = pd.read_sql_query(query, con='postgres:///db_name')
```

Obviously writing your own query gives you a lot more flexibility to extract exactly what you need. However, also consider the potential upside in terms of processing efficiency. Doing aggregations and transformations in a database, in almost all cases, will be much faster than doing it in pandas after it is extracted. As a result, some careful query planning can save a lot of time and effort later on.

## Other Data Sources

Pandas also has functions for reading in data from a range of other sources, including HTML tables, SPSS, Stata, SAS and HDF files. We won’t go into them here, but being aware that these options exist is often all you really need to know. If a case arises where you need to read data from these sources, you can always refer to the documentation.

## Wrapping Up

We’ve looked at how we can use pandas to read in data from various sources of tabular data, from delimited files and Excel, to databases, to some other more uncommon sources. While these functions often have many parameters available, remember most of them will be unnecessary for any given dataset. These functions are designed to work with the minimum parameters provided (e.g. just a file location) in most cases. Also remember that once you have the data in a DataFrame, you will have a tonne of options to fix and change the data as needed – you don’t need to do everything in one function.

For the last 5 years, data science has been one of the world’s hottest professions, but it is also one of the most poorly defined. This can be seen on any career website, where advertisements for ‘Data Scientist’ positions describe everything from what used to be a simple data analyst role, to technical, PhD-only, research positions working on artificial intelligence or autonomous cars.

However, despite the diversity of roles being labelled ‘data scientist’, there is a common thread that runs through any job involving data and building models. And this is that only around 20% of the time will be spent building models, with the other 80% of the time spent understanding, cleaning and transforming data to get it to the point where it can be used for modelling (for an overview of all the steps a Data Scientist goes through, see this series).

For many/most people working in the profession, the time spent cleaning and transforming is seen simply as a price to be paid to get to the interesting part – the modelling. If they could, many people would happily hand off this ‘grunt work’ to someone else. At first glance, it is easy to see why this would be the case – it is the modelling that gets all the headlines. There are very few people that hear about a model predicting cancer in hospital patients and think “they must have had some awesome clean data to build that with”.

However, plaudits aside, I am going to make the case that this is backwards: from a creativity and challenge standpoint, the cleaning and transforming work is often the most interesting part of data science.

## The creativity of cleaning

Over the past 12 years of working with data, one thing that has become painfully obvious is the unbridled creativity of people when it comes to introducing errors and inconsistencies into data. Typos, missing values, numbers in text fields, text in numerical fields, inconsistent spellings of the same item, and changing number formats (e.g. ever notice how most of continental Europe uses “,” as the decimal point instead of “.”?) are just some of the most common issues one will encounter.

To be fair, it is not only the fault of the person doing the data entry (e.g. an end user of an application). Often, the root of the problem is a poorly designed interface and a lack of data validation. For example, why is a user able to submit text in a field that should only ever contain numbers? Why do I have to guess how everyone else types in “the United States” (US, U.S., USA, U.S.A., United States of America, America, Murica) instead of choosing from a standardized list of countries?

However, even with the most carefully validated forms and data entry interface, data quality issues will continue to exist. People fudge their age, lie about their income, enter fake emails, addresses and names, and some, I assume, make honest typos and mistakes.

So why is dealing with these issues a good thing? Because the unlimited creativity on the part of the people creating the data quality issues has to be exceeded by the creativity of the person cleaning the data. For every possible type of error that can be found in the data, the data scientist has to develop a method to address that error. And assuming the dataset is more than a few hundred rows, it will have to be a systematic method, as manually correcting the issues becomes impractical.

As a result, the data scientist has to find a way to address the universe of potential errors, and to do so in an automated, systematic way. How do I go through a column of countries that have all been spelt in different ways in order to standardize the country names? Someone got decimal happy and now I have a column where a lot of the numbers have two decimal points instead of one – how can I systematically work out which decimal point is the correct one, and then remove the other decimal point? A bunch of users put their birthday as 1 January 1900, how can I remove those, should I remove them, and if yes, what values should I put there instead?

All of these scenarios are real examples of interesting, challenging problems to solve, and ones that require a high-level of creativity to address.

## The creativity of transformation/feature extraction

Once cleaning has been undertaken, typically the next step is to perform transformation and/or feature extraction. These steps are necessary because the data is rarely collected in the form required by the model, and/or there is additional information that can be added to and/or extracted from the data to make the model more effective.

If this sounds like a very open ended task, that’s because it is. Often, the ability to enhance a dataset is limited only by time, and the creativity and knowledge of the data scientist doing the work. Of course, there are diminishing returns, and at some point, it becomes uneconomic to invest additional effort to improve a dataset, but in many cases there are a huge range of options.

Due to the open-ended nature of this step, there are actually two types of creativity required. The first is the creativity to come up with potential new features that can be extracted from the existing dataset (and developing the methods to create those features). The second is identifying other data that could be used to enhance the dataset (and then developing the methods to import and combine it). Again, both of these are challenging and interesting problems to solve.

## Making a model is often a mechanical process

Unlike the above, the process of creating the model is relatively mechanical. Of course, there are still challenges to overcome, but in most cases, it boils down to choosing an algorithm (or combination of algorithms), then tuning the parameters to improve the results. The issue is that neither of these steps typically involves a lot of creative thinking; instead, they involve cycling through a lot of options to see what works.

Even the selection of the algorithm, or combination of algorithms, which might seem relatively open ended, is, in the real world, limited by a range of factors. For a given problem, these factors include:

- The task at hand – whether it be two-class or multi-class classification, cluster analysis, prediction of a continuous variable, or something else – will reduce the algorithm options. Some algorithms will typically perform better in certain scenarios, while others may simply not be able to handle the task at all.
- The characteristics of the data often also reduce the options. Larger datasets mean some algorithms will take too long to train to be practical. Datasets with large numbers of features suit some algorithms more than others, while sparse datasets (those with lots of 0 values) will suit other algorithms.
- An often-overlooked factor is the ability to explain to clients and/or bosses how and why a model is making a prediction. Being able to do this typically puts a significant limit on the complexity of the model (particularly ensembles), and makes simpler (and often less accurate) models more appealing.

After all these factors are taken into account, how many algorithms are left to choose from in a given scenario? Probably not too many.

An excellent graphic from SAS summarizing how the algorithm choices in data science are often limited by the problem.

## Wrapping Up

Taking all the above into account, the picture that starts to form is one where significant creativity is required to clean and create a good dataset for modelling, followed by a relatively mechanical process to create and tune a model. But if this is the case, why doesn’t everyone think the same way I do?

One of the primary reasons is that in most real-world data science scenarios, the above steps (cleaning, transformation, feature extraction and modelling) are not typically conducted in a strictly linear fashion. Often, building the model and assessing which features were the most predictive will lead to additional work transforming and extracting features. Feature extraction and testing a model will often reveal data quality issues that were missed earlier and cause the data scientist to revisit that step to address those issues.

In other words, in practice everything is interlinked and many data scientists view the various steps in the process of constructing a model (including cleaning and transforming) as one holistic process that they enjoy completing. However, because the cleaning and transforming aspects are the most time consuming, these aspects (data cleaning in particular) are often seen as being the major impediment to a completed project.

This is true – almost all projects could be completed significantly quicker if the data was of a higher quality at the outset. The quick turnaround for most Kaggle competition entries (where relatively clean and standardized data are provided to everyone) can attest to this. But to my fellow data scientists, I would say the following. Data science will always involve working with dirty and underdeveloped data – no matter how good we get at data validation, how clean and intuitive the interface, or how much planning is done on what data points to collect. Embrace the dirt, celebrate the grind, and take pride in creating creative solutions to often complex and challenging problems. If you don’t, no one else will.

A while back, I attended a hackathon in Belgrade as a mentor. This hackathon was the first ‘open data’ hackathon in Serbia and focused on making applications using data that had recently been released by various ministries, government agencies, and independent bodies in Serbia. As we walked around talking to the various teams, one of the things I noticed was that almost all teams were using databases to manage their data. In most cases, the database being used was something very lightweight like SQLite3, but in some cases more serious databases (MySQL, PostgreSQL, MongoDB) were also being used.

What I have come to realize is that in many cases this was probably completely unnecessary, particularly given the tight timeframe the teams were working towards – a functional prototype within 48 hours. However, even if you have more time to build an application, there are several good reasons that you may not need to worry about using a formal database. These are outlined below.

## 1. The data is small

Firstly, let’s clarify what I mean when I say ‘small data’. For me, small data is any dataset under 10,000 records (assuming a reasonable number of data points for each record). For many non-data people, 10,000 records may seem quite big, but when using programming languages such as Python or JavaScript, this amount of data is usually very quick and easy to work with. In fact, as Josh Zeigler found, even loading 100,000 records or 15MB of data into a page was possible, completing in as little as 463ms (Safari FTW).

Leaving aside the numbers for a second, the key point here is that in many cases, the data being displayed in an application has far fewer than 10,000 records. If your data has fewer than 10,000 records, you should probably ask yourself: do you need a database? It is often far simpler, and requires significantly less overhead, to simply keep your data in a JSON file and load it into the page directly. Alternatively, CSV and Excel files can also be converted to JSON and dumped to a file very quickly and easily using a Python/pandas script.
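As a rough sketch of that kind of conversion (the file names are placeholders):

```
import pandas as pd

# Convert a CSV (or Excel) file into a JSON file the page can load directly
df = pd.read_csv('data.csv')
df.to_json('data.json', orient='records')
```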

The ECIS Development Tracker uses data from six Worldwide Governance Indicators and two other series over 20 years and 18 countries – a total of almost 3,000 data points and a perfect example of small data.

## 2. The data is static

Another reason you may not need a database is if you have a reasonable expectation that the data you are using is not going to change. This is often the case where the data is going to be used for read only purposes – for example visualizations, dashboards and other apps where you are presenting information to users. In these cases, again it may make sense to avoid a database, and rely on a flat file instead.

The important point here is that if the data is not changing or being altered, then static files are probably all that is needed. Even if the data is larger, you can use a script to handle any data processing and load the (assumedly) aggregated or filtered results into the page. If your needs are more dynamic (i.e. you want to show different data to different users and do not want to load everything), you may need a backend (something you would need for a database anyway) that extracts the required data from the flat file, but again, a database may be overkill.

The Kosovo Mosaic visualizer – based on data from a survey conducted once every three years – is an example of a case where the data is not expected to change any time soon.

## 3. The data is simple

One of the big advantages of databases is their ability to store and provide access to complex data. For example, think about representing data from a chain of retail stores on the sale of various products by different sales people. In this case, because there are three related concepts (products, sales people and stores), representing this data without using a database becomes very difficult without a large amount of repetition[1]. In this case, even if the data is small and static, it may simply be better to use a relational database to store the data.

However, in cases where the data can be represented in a table, or multiple unrelated tables, subject to points 1 and 2 above, it may make sense to avoid the overhead of a database.

If you need a schema diagram like this to describe your data, you can probably skip the rest of this article.

## 4. The data is available from a good API

I have recently been working on a project to develop an application that is making extensive use of the Google API. While still under development, the app is already quite complex, making heavy use of data to generate charts and tables on almost every page. However, despite this complexity, so far, I have not had to use a database.

One of the primary reasons I have not needed to implement a database is that the Google API is flexible enough for me to effectively use that as a database. Every time I need data to generate a chart or table, the app makes a call to the API (using Python), passes the results to the front end where, because the data is small (the Google API returns a maximum of 10,000 rows in a query), most of the data manipulation is handled using JavaScript on the client side. For the cases where more heavy data manipulation is required, I make use of Python libraries like Pandas to handle the data processing before sending the data to the front end. What this boils down to is a data intensive application that, as yet, still does not need a database.

Of course, this isn’t to say I will not need a database in the future. If I plan to store user settings and preferences, track usage of the application, or collect other meta data, I will need to implement a database to store that information. However, if you are developing an application that will make use of a flexible and reliable API, you may not need to implement your own database.

Google has APIs available for almost all of its products – most of them with a lot of flexibility and quick response times.

## 5. The app is being built for a short-term need

While it might seem unusual to build a web app with the expectation that it will not be used six months later, this is a surprisingly common use case. In fact, this is often the expectation for visualizations and other informative pages, or pages built for a specific event.

In these particular use cases, keeping down overhead should be a big consideration, in addition to potential hosting options. Developing these short-term applications without a backend and database means free and easy hosting solutions like that provided by GitHub can be used. Adding a backend or database immediately means a more complex hosting setup is required.

## Wrapping up, this is not an argument against databases…

… it is simply an argument to use the best and simplest tools for a given job. As someone who has worked with a number of different databases throughout their career, I am actually a big user of databases and find most of them intuitive and easy to use. There are also a number of advantages that only a database can provide: from ensuring data consistency, to facilitating large numbers of users simultaneously making updates, to managing large and complex datasets, there are a number of very good reasons to use a database (SQL or NoSQL, whichever flavor you happen to prefer).

But, as we have covered above, there may be some cases where you do not need these features and can avoid adding an unnecessary complication to your app.

Next week we’ll take a look at a simple app that uses an Excel spreadsheet to generate the data required for the application.

[1] With repetition comes an increased risk of data quality issues

This article is Part VI in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part I, Part II, Part III, Part IV and Part V).

Continuing on the walkthrough, in this part we build the model that will predict the first booking destination country for each user based on the dataset created in the earlier parts.

## Choosing an Algorithm

The first step to building a model is to decide what type of algorithm to use. Below we look at some of the options.

### Decision Tree

Arguably the most well known algorithm, and one of the simplest conceptually. A decision tree model works in a similar manner to the decision tree you might sketch by hand when trying to work out which decision to make based on a range of variables.

The goal of the decision tree algorithm used for classification problems (like the one we are looking at) is to create one of these decision trees to classify records into a set number of categories. To do this, it starts with all the records in the training dataset and looks through all the features until it finds the one that allows it to most ‘cleanly’ split the records according to their categories. For example, if you are using daily weather data to try and determine whether it will rain the following day (i.e. there are two categories, ‘it does rain’ and ‘it does not rain’), the algorithm will look for a feature that best splits the records (in this case representing days) into those two categories. When it finds that feature, and the value to split on, it creates one point (‘decision node’) on the decision tree. It then takes each subpopulation and does the same thing again, building up a tree until either all the records are correctly classified, or the number in each subpopulation becomes too small to split. Below is an example decision tree using the described weather data to predict if it will rain tomorrow or not (thanks to Graham Williams’ excellent Rattle package for R):

The way to interpret the above tree is to start at the top. The first criterion the algorithm splits on is the humidity at 3pm. Starting with 100% of the records, if the humidity at 3pm is less than 71, as is the case for 93% of the records, we move to the left and find the next decision node. If the humidity at 3pm is greater than or equal to 71, we move to the right, which takes us to a leaf node where the model predicts that there will be rain tomorrow (‘yes’). We can see from the numbers in the node that this represents 7% of all records, and that 74% of the records that reach this node are correctly classified.

The first thing to note is that the model does not accurately predict whether it will rain tomorrow for all records, and in some leaf nodes, it is only slightly better than a coin toss. This is not necessarily a bad thing. The biggest problem that data scientists have with decision trees is the classic problem of overfitting. In the example above, parameters have been set to stop the model splitting once the population of records at a given node gets too small (‘minimum split’) and once a certain number of splits have occurred (‘maximum depth’). These values have been chosen to prevent the tree from growing too large. The reason for this is that if the tree gets too large, it will start modelling random noise and hence will not work for data not in the training dataset (it will not ‘generalize’ well).

To picture what this means, imagine extending the example decision tree above further until the model starts splitting out single records using criteria like ‘Humidity3pm = 54’ and ‘Humidity3pm = 31’. That type of decision node may work for this particular training data because there is a specific record that meets those criteria, but it is highly unlikely that it represents any predictive ability and so is unlikely to be accurate if applied to other data.

All this discussion of overfitting with decision trees does, however, raise an important question: how do you know how large to grow the tree? How do you set the parameters to avoid overfitting but still have an accurate model? The truth is that it is extremely difficult to know how to set the parameters. Set them too conservatively and the model will lose too much predictive power. Set them too aggressively and the model will start overfitting the data.

### Seeing the Forest for the Trees

Given the limitations of decision trees and the risk of overfitting, it may be tempting to think “why bother?” Fortunately, methods have been found to reduce the risk of overfitting and increase the predictive power of decision trees, and the two most popular methods share the same basic premise – to train multiple trees.

One of the most well known algorithms that utilizes decision trees is the ‘random forest’ algorithm. As the name suggests, the algorithm constructs a large number of different trees (as defined by the user) by randomly selecting the features that can be used to build each tree (as opposed to using all the features for each tree). Typically, the trees in a random forest also have the parameters set to ensure each tree will also be relatively shallow, meaning that the algorithm creates a large number of shallow decision trees (decision bonsai?). Once the trees are constructed, each tree is used to predict the outcome for a new record, with these multiple predictions then serving as votes, with a majority rules approach applied.

Another algorithm which has become almost the default algorithm of choice for Kagglers, and is the type of the model we will use, uses a method called ‘boosting’, which means it builds trees iteratively such that each tree ‘learns’ from earlier trees. To do this the algorithm builds a first tree – again typically a shallower tree than if you were going to use a one tree approach – and makes predictions using that tree. Then the algorithm finds the records that are misclassified by that tree, and assigns a higher weight of importance to those records than the records that were correctly classified. The algorithm then builds a new tree with these new weightings. This whole process is repeated as many times as specified by the user. Once the specified number of trees have been built, all the trees built during this process are used to classify the records, with a majority rules approach used to determine the final prediction.

It should be noted that this methodology (‘boosting’) can actually be applied to many classification algorithms, but has really grown popular with the decision tree based implementation. It should also be noted there are different implementations of this algorithm even just using trees. In this case, we will be using the very popular XGBoost algorithm.

### Alternative Models

So far we have only covered decision trees and decision tree-based algorithms. However, there are a range of different algorithms that can be used for classification problems. Given this is supposed to be a short blog series, I will not go into too much detail on each algorithm here. But if you want more information on these algorithms, or other algorithms that I haven’t covered here, there is a growing amount of information online. I also strongly recommend the Data Science specialization offered by Johns Hopkins University, for free, on Coursera.

#### K-Nearest Neighbors

The K-nearest neighbors algorithm is arguably one of the simplest algorithms in concept. The algorithm classifies a given object by looking at the k most similar records[1] and seeing how those records are classified. This type of algorithm is called a lazy learner because during the training phase, it essentially just stores the data provided. Only when a new object needs to be classified does the algorithm start looking through the data to try to find the closest matches.

#### Neural Networks

As the name suggests, these algorithms simulate biological neural networks by creating a series of nodes and connecting them together. A neural network typically consists of three types of layer: an input layer, a hidden layer (although there can be multiple hidden layers) and an output layer.

A model is trained by passing records through the network, with the weights at each node continually adjusted to ensure that each record ends up at the right ‘output node’.

#### Support Vector Machines

This type of algorithm, commonly used for text classification problems, is arguably the most difficult to visualize. At the simplest level, the algorithm tries to draw straight lines (or planes for classifications with more than 2 features) that best separate the classes provided. Although this sounds like a fairly simplistic approach to classifying objects, it becomes far more powerful due to the transformations (sometimes called a ‘kernel trick’) the algorithm can apply to the data before drawing these lines/planes. The mathematics behind this are far too complex to go into here, but the Wikipedia page has some nice visuals to help picture how this is working. In addition, this video provides a nice example of how a Support Vector Machine can separate classes using this kernel trick:

## Creating the Model

Back to the modelling – now that we know what algorithm we are using (the XGBoost algorithm, for those skipping ahead), let’s talk about the approach.

### Cross Validation

As mentioned in regard to decision trees, one of the key risks when creating models of any type is the risk of overfitting. One of the primary ways data scientists guard against overfitting is to estimate the accuracy of their models on data that was not used to train the model. To do this they typically use a method called cross validation. There are different methods for doing cross validation, but the method we will employ is called k-fold cross validation.

k-fold cross validation involves splitting the training data into k subsets (where k is greater than or equal to 2), training the model using k – 1 of those subsets, then running the model on the subset that was not used in the training process. Because all of the data used in the cross validation process is training data, the correct classification for each record is known and so the predicted category can be compared to the actual category. Once all folds have been completed, the average score across all folds is taken as an estimate of how the model will perform on other data. An example of a 3-fold cross validation is shown below:
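As a code-level sketch of the same idea (using the modern scikit-learn `model_selection` module rather than the older `cross_validation`/`grid_search` modules used in the code later in this post, and assuming the feature matrix `X` and labels `y` prepared in the ‘Training the Model’ section):

```
from sklearn.model_selection import KFold, cross_val_score
import xgboost as xgb

model = xgb.XGBClassifier()

# Each of the 3 folds takes a turn as the held-out set while the
# other two folds are used for training; the scores are then averaged
scores = cross_val_score(model, X, y, cv=KFold(n_splits=3, shuffle=True), scoring='accuracy')
print(scores.mean())
```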

### Parameter Tuning

As you may have realized from the earlier description of the XGBoost algorithm – there are quite a few options (parameters) that we need to define to build the model. How many trees to build? How deep should each tree be? How much extra weight will be attached to each misclassified record? Tuning these parameters to get the best results from the model is often one of the most time consuming things that data scientists do. Fortunately, the process can be automated to a large degree so that we do not have to sit there rerunning the model repeatedly and noting down the results. Even better, using the Scikit-Learn package, we can merge the parameter tuning and cross validation steps into one, allowing us to search for the best combination of parameters while using k-fold cross validation to verify the results.

### Training the Model

In order to train the model (using cross validation and parameter tuning as outlined above), we first need to define our training dataset – remembering that we previously combined the training and test data to simplify the cleaning and transforming process. To feed these into the model, we also need to split the training data into the three main components – the user IDs (we don’t want to use these for training as they are randomly generated), the features to use for training (X), and the categories we are trying to predict (y).

```# Prepare training data for modelling
df_train.set_index('id', inplace=True)
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)
```

Now that we have our training data ready, we can use GridSearchCV to run the algorithm with a range of parameters, then select the model that has the highest cross validated score based on the chosen measure of performance (in this case accuracy, but there are a range of metrics we could use based on our needs).

```# Grid Search - Used to find best combination of parameters
XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4, 5], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = grid_search.GridSearchCV(estimator=XGB_model, param_grid=param_grid, scoring='accuracy', verbose=10, n_jobs=1, iid=True, refit=True, cv=3)

model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
```

Please note that running this step can take a significant amount of time. Running the algorithm with 25 trees takes around 2.5 minutes for each cross validation on my Macbook Pro. Running the script above with all the options specified will likely take well over an hour.

## Making the Predictions

Now that we have trained a model based on the best parameters, the next step is to use the model to make predictions for the records in the testing dataset. Again we need to extract the testing data out of the combined dataset we created for the cleaning and transformation steps, and again we need to separate the main components for the model. After these steps, we use the model created in the previous step to make the predictions.

```# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test[['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values

# Make predictions
y_pred = model.predict_proba(X_test)```

As you may have noted from the code above, we have used the predict_proba method instead of the usual predict method. This is done because of the way Kaggle will assess the results for this particular competition. Rather than just assessing one prediction for each user, Kaggle will assess up to 5 predictions for each user. In order to maximize the score, we will use the predicted probabilities that predict_proba produces to select the 5 best predictions. Finally, we will write these results to a file that will be created in the same folder as the script.

```#Taking the 5 classes with highest probabilities
ids = [] #list of ids
cts = [] #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

#Generate submission
print("Outputting final results...")
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('./submission.csv', index=False)
```

If you wish to, you should be able to submit the file produced by this script on Kaggle. The competition has now finished, so you will not receive an official position on the leaderboard, but your results will still be processed and you will be told where you would have finished.

## Wrapping Up

Those who are more experienced with data science may realize that this series, as lengthy as it is, does not even scratch the surface of a lot of topics related to data science. Unsupervised learning, association rules mining, text analytics and deep learning are all topics that have not been covered at all. Unfortunately, the full scope of data science and machine learning is not something that can be covered in a blog. That said, I did have two goals for those reading these blog articles.

Firstly, I hope that this series demystifies some aspects of data science for those that currently see it as a black box. Although one can spend their career working in data science and still not master all aspects, even a cursory understanding of how machine learning algorithms work can help provide understanding as to what sort of questions machine learning can help to answer, and what sort of questions are problematic.

Secondly, I hope this series encourages some of you to dig deeper, to learn more about this topic. Machine learning is a rapidly growing field that is expanding into every aspect of life. This includes recommendation engines on websites, astronomy – where it helps to identify stars and planets, the pharmaceutical industry – where it is being used to predict which molecular structures are likely to produce useful drugs, and maybe most famously, training self-driving cars to drive in the real world. Whatever your primary interest, there are likely to be machine learning applications being developed or already in use.

[1] There are a range of metrics that can be used to do this. For available metrics in the Scikit Learn package, see here.

Full script:

```import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn import cross_validation, decomposition, grid_search
from sklearn.preprocessing import LabelEncoder

####################################################
# Functions #
####################################################
# Remove outliers
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.NaN, col_values)

    return df

# Home made One Hot Encoder
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    id_list = df[id_col].drop_duplicates()

    df_counts = df[[id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns={category: col_name}, inplace=True)

    return new_df

####################################################
# Cleaning #
####################################################
# Import data
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)

####################################################
# Data Transformation #
####################################################
# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)

####################################################
# Feature Extraction #
####################################################
# Add new date related fields
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)

####################################################
# Add data from sessions.csv #
####################################################
# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath)

# Determine primary device
print("Determining primary device...")
sessions_device = sessions[['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).aggregate(np.sum)
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = aggregated_lvl1.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determining secondary device...")
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = remaining.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions[['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')

# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')

####################################################
# Building Model #
####################################################
# Prepare training data for modelling
df_train.set_index('id', inplace=True)
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)

# Training model
print("Training model...")

# Grid Search - Used to find best combination of parameters
XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = grid_search.GridSearchCV(estimator=XGB_model, param_grid=param_grid, scoring='accuracy', verbose=10, n_jobs=1, iid=True, refit=True, cv=3)

model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

####################################################
# Make predictions #
####################################################
print("Making predictions...")

# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test[['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values

# Make predictions
y_pred = model.predict_proba(X_test)

#Taking the 5 classes with highest probabilities
ids = [] #list of ids
cts = [] #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

#Generate submission
print("Outputting final results...")
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('./submission.csv',index=False)
```

This article is Part V in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part I, Part II, Part III and Part IV).

Continuing on the walkthrough, in this part we take the data from sessions.csv that we left aside initially and add it to the transformed and expanded data from Part IV.  This part will cover, in brief, all the steps in Parts II – IV.

## Understanding the Data

As we did for the user data in train_users_2.csv, the first step here is to understand what the data in sessions.csv looks like. Although this file, with over 10 million rows, is too large to display in its entirety in Excel[1], we can still open the file using Excel to get an understanding of what columns we have and what at least the first million rows of data look like. Some sample rows are provided below:

| user_id | action | action_type | action_detail | device_type | secs_elapsed |
| --- | --- | --- | --- | --- | --- |
| 4rvqpxoh3h | campaigns | -unknown- | -unknown- | iPhone | 375 |
| 4rvqpxoh3h | active | -unknown- | -unknown- | iPhone | 728 |
| 4rvqpxoh3h | create | -unknown- | -unknown- | iPhone |  |
| 4rvqpxoh3h | listings | -unknown- | -unknown- | iPhone | 154 |
| 4rvqpxoh3h | unavailabilities | -unknown- | -unknown- | iPhone | 204 |
| 4rvqpxoh3h | index | -unknown- | -unknown- | iPhone | 21 |
| 4rvqpxoh3h | index | -unknown- | -unknown- | iPhone | 886 |
| xwxei6hdk4 | dashboard | view | dashboard | iPhone | 1355 |

As can be seen, the dataset contains records of user actions, with each row representing one action a user took. Every time a user reviewed search results, updated a wish list or updated their account information, a new row was created in this dataset. Although this data is likely to be very useful for our goal of predicting which country a user will make their first booking in, it also complicates the process of combining this data with the data from train_users_2.csv, as it will have to be aggregated so that there is one row per user (as opposed to many rows for each user, currently).

Aside from details of the actions taken, there are a couple of interesting fields in this data. The first is device_type – this field contains the type of device used for the specified action. The second is the secs_elapsed field, which shows us how long (in seconds) the user spent on a particular action.

Both of these fields provide us with potentially important information that could help to more accurately predict which country a user will make a first booking in. For example, it is not difficult to imagine that people spending relatively little time to make a booking on a phone are likely to be making bookings in locations closer to home (i.e. the US) than someone spending more time to make a booking on a desktop computer. Of course this is just a theory that needs to be proven, but it is a good reason to ensure we are capturing this information in our final training dataset.

## Cleaning and Transforming the Data

Now that we have a basic understanding of the data, we need to undertake the cleaning and transformation steps. Because of the structure of this data (and for the sake of brevity), we are going to do both of these things at the same time.

The first step is to import the data:

``````# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath)``````

### Extract the primary and secondary devices for each user

Remembering that we need to get the final data into a format that can be merged with the data created in Part IV (i.e. a dataset where one row equals one user), the first piece of information we are going to extract is the primary and secondary device for each user. How do we determine what a user’s primary and secondary devices are? We look at how much time they spent on each device. In short, we total the seconds elapsed for each user and device combination, take the device with the most time as that user’s primary device, and take the device with the second most time as their secondary device.

One thing to note as we make these transformations is that by aggregating the data this way, we are also implicitly removing the missing values. The code to do this transformation is shown below:

``````# Determine primary device
print("Determining primary device...")
sessions_device = sessions[['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).sum()
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = aggregated_lvl1.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determining secondary device...")
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = remaining.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)``````

### Determine Counts of Actions

The next thing we are going to do is take counts of how many times each action was taken by each user. This is a two-step process.

#### Step 1

The first step is to determine the count of each action type for each user – that is, grouping the data by user and action and counting how many times each combination occurs.
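To make this first step concrete, here is a minimal, hypothetical sketch – the tiny DataFrame and its values are invented for illustration, and in the full code further below this logic sits inside the convert_to_counts function:

``````
import pandas as pd

# Hypothetical sessions data: one row per action taken by a user
sessions_sample = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2'],
    'action': ['search', 'search', 'book', 'search'],
})

# Step 1: group by user and action, then count how often each combination occurs
counts = sessions_sample.groupby(['user_id', 'action']).size().reset_index(name='count')
print(counts)  # one row per (user_id, action) pair with its count
``````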

#### Step 2

For you Excel buffs out there, the second step might strike you as something that could be achieved using a pivot table – and you would be right. In fact, the custom function that we use to make this transformation uses a pandas method called ‘pivot’. This is important to note for a couple of reasons. The first is that, with all the talk about new data, people who have worked with data mostly (or entirely) using ‘old technology’ like Excel and SQL are often given the impression that their skills are redundant or not useful in modern data science. As this example shows, the ways of thinking about data that you develop working with Excel and SQL are not only relevant, but often extremely useful.

The second reason is that for people (like me) who do not know all the methods available for pandas dataframes off by heart, being able to identify techniques you have used in other programs and languages provides you with a way to find corresponding methods in new languages. I discovered this method by searching for “pandas pivot”, knowing that this way of manipulating data was likely to have some equivalent in pandas.

### Looping Through the Actions Columns

Looking at the examples above, you may have realized that the transformation as shown only works for one action column at a time, but in the data we have three action columns: action, action_type and action_detail.

To handle the multiple action columns, we repeat these steps for each column individually, effectively creating three separate tables. Because we have now created tables where each row represents one user, we can now join (another concept SQL users will be very familiar with) these three tables together on the basis of the user id. The full code for these steps is shown below:

``````# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    id_list = df[id_col].drop_duplicates()

    df_counts = df[[id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns={category: col_name}, inplace=True)

    return new_df

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions[['user_id', 'action', 'action_type', 'action_detail']].copy()
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')``````

## Combine Data Sets

The final steps are to combine the various datasets we have created into one large dataset. First we combine the two device dataframes (df_primary and df_secondary) to create a device dataframe. Then we combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from sessions.csv. Finally, we combine the sessions dataframe with the user data dataframe from Part IV. The code for the various combinations is shown below:

``````# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')``````

### A Note on Joins

For those that can read a little bit of code and are familiar with joins in SQL, you may be asking why I am using (full) outer joins for the first two combinations, but an inner join for the final step[2].

The first step requires an outer join because not all users have a secondary device. That is, some users only logged onto Airbnb using one device (or at least one type of device). Doing an outer join here ensures that our dataset includes all users regardless of this fact.

The second step could use an inner or an outer join, as both the device and actions datasets should contain all users. In this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever reason), we will still capture them. You may also notice that after the second step we fill any missing values with 0s to ensure we do not have any NULL values that may have been generated by these outer joins.

For the third step we use an inner join for a key reason – we want our final training dataset to only include users that also have sessions data. Using an inner join here is an easy way to join the datasets and filter for the users with sessions data in one step.
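To illustrate the difference between these joins, here is a small, hypothetical sketch using pd.concat – the frames and values below are invented purely for illustration:

``````
import pandas as pd

# Two tiny, hypothetical datasets indexed by user id
device_data = pd.DataFrame({'primary_secs': [100, 200]}, index=['u1', 'u2'])
actions_data = pd.DataFrame({'action_search': [3]}, index=['u1'])

# Outer join: keeps every user found in either dataset (gaps become NaN)
outer = pd.concat([device_data, actions_data], axis=1, join='outer')

# Inner join: keeps only users present in both datasets
inner = pd.concat([device_data, actions_data], axis=1, join='inner')

print(outer)  # u1 and u2, with NaN for u2's action_search
print(inner)  # u1 only
``````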

## Wrapping Up

In the first four parts of this series, we looked in detail at some of the various steps in the process of building a model. Although these steps should be distinct thought processes that occur for each model building process, hopefully what this article provides is an insight into how some of these steps can be combined if planned out carefully. In relatively few steps, we have taken a dataset containing 10 million rows of user actions data, cleaned it, extracted a bunch of important information, and added it to our user data, ready for training a model.

The other important thing to take away from this article is how useful ‘old school’ ways of thinking about data still are. For all the talk about unstructured data and NoSQL databases, the fact is that knowing how to work with and manipulate old fashioned columns and rows is still as important as ever. Whether it is joins and aggregation in SQL, pivot tables and VLOOKUPS in Excel, or just the general concept of relational data, not only is that knowledge relevant, but it is often extremely useful.

## Next Time

In the next piece, we will finally get to the good stuff and train the algorithm to make the final predictions.

[1] Nope, still doesn’t qualify as ‘Big Data’…

[2] For those that do not understand what I mean by inner and outer joins (and are interested in knowing) – stackoverflow comes to the rescue again with this great illustrated answer.

This article on data transformation and feature extraction is Part IV in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read Part I, Part II and Part III.

Continuing on the walkthrough, in this part we focus on getting the data we cleaned in Part III ready for use in the classification algorithm. These steps are often referred to as data transformation and feature extraction.

## Data Transformation and Feature Extraction as a Concept

The main purpose of data transformation and feature extraction is to enhance the data in such a way that it increases the likelihood that the classification algorithm will be able to make meaningful predictions. Unlike the steps taken during cleaning, which are designed to address problems with the raw data (missing and erroneous values, formatting issues etc.), these steps change the values and/or structure of the data (data transformation) and add additional features (feature extraction).

As you might imagine, this is quite an open-ended process, and hence a lot of the value that data scientists provide comes from these steps. There is no textbook or walkthrough that can tell you exactly what steps you should take for a given dataset – that knowledge can come only from experience, curiosity and trial and error. However, we can take a look at some common methods to provide a sense of what is possible. Please keep in mind this is not an exhaustive list of options.

### Data Transformation

Covering the steps taken to modify the data, data transformation is undertaken with the intention of enhancing the ability of the classification algorithm to extract information from the data. Below are a few common data transformation methods.

#### Bucketing/Binning

A common method for manipulating numeric data, binning or bucketing is when the numerical values in a particular column are converted from a continuous series into fixed ranges. For example, instead of using the age value of all our users, we could place them into buckets such as 15-20 years old, 21-25 years old and so on.

Typically this technique is used to manage ‘noisy data’. To understand what this means, think of the movements of the stock market over time: it goes up and down on an almost daily basis. However, if you are trying to predict the overall direction of the stock market over the next 6 months, these daily movements become kind of irrelevant – what you really want your model to focus on are the movements over longer periods of time. What is more, the essentially random daily movements in stock prices may actually confuse your prediction model – causing less accurate predictions. In this example, the daily movements are the noise and what you want to extract (the longer term direction of the market) is ‘the signal’.

The same logic can be applied to any numerical field in your dataset. If you are concerned that small changes in a given value may simply be representing random ‘noise’, you may want to consider bucketing/binning to remove that noise.
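As a quick, hypothetical sketch of what bucketing looks like in pandas – the ages and bucket boundaries below are invented for illustration:

``````
import pandas as pd

# Hypothetical ages bucketed into fixed ranges
ages = pd.Series([18, 23, 31, 44, 67])
buckets = pd.cut(ages, bins=[15, 20, 25, 35, 50, 90],
                 labels=['15-20', '21-25', '26-35', '36-50', '51-90'])
print(buckets.tolist())  # ['15-20', '21-25', '26-35', '36-50', '51-90']
``````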

#### Normalization

Although normalization can take on a large number of meanings depending on the context, the type of normalization being referred to here is the statistical type – converting the values of a column into a ‘normalized’ range. This could be translating heights from centimeter values anywhere from 100cm to 220cm to a scale where 0 represents the average (mean) height for your dataset and -1/+1 represent one standard deviation from that average. It could also be translating those heights into a range of values from 0 to 1, where 0 is the lowest value in your dataset and 1 is the maximum value. There are a number of other methods that can be used here as well.

This type of transformation is more important for certain types of algorithms than others. For some algorithms – like the one we will be using – this type of transformation is not typically necessary. But for other algorithms, the magnitude of the values in each column will impact the calculations. In these cases, it is optimal to convert (‘normalize’) the values in each column onto the same scale to ensure each column is treated equally. For a more detailed explanation on this subject, this answer from Quora is a good place to start.
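As a rough sketch of the two variants described above – the heights are invented for illustration:

``````
import pandas as pd

# Hypothetical heights in centimeters
heights = pd.Series([150, 160, 170, 180, 190])

# Z-score normalization: 0 is the mean, +/-1 is one standard deviation from the mean
z_scores = (heights - heights.mean()) / heights.std()

# Min-max scaling: 0 is the smallest value, 1 is the largest value
min_max = (heights - heights.min()) / (heights.max() - heights.min())
``````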

#### Other Mathematical Transformations

In a similar manner to normalization, there is an almost unlimited number of ways that the numerical values of a given column can be transformed such that they are more suitable for the algorithm being used.

To provide one example, arguably the most common transformation (other than normalization) is to use a logarithm function. This transformation is a commonly used method of dealing with exponential data series (i.e. a column where there are a lot of low values and relatively few high values). For those wanting to understand this transformation better, the Wikipedia page on this topic has a great illustrated example.
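A minimal sketch of this transformation using NumPy – the values are invented to mimic a skewed series:

``````
import numpy as np

# Hypothetical skewed data: many small values and a few very large ones
values = np.array([1, 2, 3, 10, 10000])

# log1p (log(1 + x)) compresses the large values while preserving their order
log_values = np.log1p(values)
``````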

As I am hemorrhaging readers at this point, I won’t go into detail on the various other transformations possible – the key point is to be aware that there is a large range of possibilities here depending on your needs.

#### One Hot Encoding

Looking at one more example, and the most relevant one for our Kaggle competition, this transformation is one used for categorical data. What this transformation does is take one column with x categories (x must be greater than 2 for this to make sense) and convert it into x columns where each column represents one category in the original column. An illustrated example is shown below:
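As a quick sketch of the idea, using a hypothetical ‘gender’ column with three categories – the column and values here are just an example:

``````
import pandas as pd

# Hypothetical column with three categories
df = pd.DataFrame({'gender': ['FEMALE', 'MALE', '-unknown-', 'FEMALE']})

# One hot encoding: one new 0/1 column per category
encoded = pd.get_dummies(df, columns=['gender'])
print(encoded.columns.tolist())
# ['gender_-unknown-', 'gender_FEMALE', 'gender_MALE']
``````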

For those familiar with regression modeling, you may recognize this as the same process of creating dummy variables.

Again there are a few reasons for doing this type of transformation. Some algorithms are structured in such a way that they do not handle categorical data very well – particularly when the categories do not have an inherent order (this answer on Stack Overflow does a good job of explaining why). Some other types of algorithms require numerical data to function. The only way to work out whether this transformation will be beneficial is to either read through the documentation for the algorithm you are using or to test it yourself.

### Feature Extraction

Often broken down into sub steps of feature construction and feature selection, here we will focus on feature construction. Below are a couple of ways additional features can be constructed and added to your dataset.

#### Using Hierarchical Information

It will sometimes be the case that data in your dataset represents one level of a particular hierarchy, and that extracting the other implied levels of that hierarchy will provide the model with useful information.

For example, imagine a dataset with a column containing countries. This column allows the algorithm to look for patterns (in combination with all other columns) at the country level. However, by adding a new ‘region’ column based on the country column (Europe, South Asia, North Africa etc.), you may be providing information to the algorithm that allows it to look for patterns across countries.
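A rough sketch of this idea – the country-to-region mapping below is purely illustrative:

``````
import pandas as pd

# Hypothetical country column enriched with a higher level of the hierarchy
df = pd.DataFrame({'country': ['FR', 'DE', 'US', 'CA']})
region_map = {'FR': 'Europe', 'DE': 'Europe', 'US': 'North America', 'CA': 'North America'}
df['region'] = df['country'].map(region_map)
``````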

One of the most common ways to do this is with date fields. Take the date fields in the dataset we are working with as an example. By extracting the day of the week, the month of the year or the hour of the day, we could add important information for the algorithm to use. Maybe people who create their accounts in summer months are more likely to make a booking in a warmer country. Maybe people who were first active late at night are more disorganized travelers and are therefore more likely to make a domestic first booking. Additionally, it could be any combination of these factors that makes the difference (e.g. users first active late at night, in the summer months, on a weekday are more likely to travel to Portugal). The point is not to be able to explain why a factor may be important, but to think of as many factors as possible to test, and allow the algorithm to determine what is important and not important.

#### Adding External Data

One of the aspects of feature extraction that often gets overlooked is how data can be enriched through the addition of new external data. Using techniques such as record linkage, existing datasets can be greatly expanded by adding new data points for a given record. This new data often provides valuable new information that the algorithm can use to make more accurate predictions.

For example, a training dataset that contains a column with countries could be enriched with demographic data about the country such as population, income per capita or land area – all factors that may allow the algorithm to draw conclusions across similar groups of countries on any of those measures.
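As a sketch of how this kind of enrichment might look in pandas – the demographic figures below are invented for illustration:

``````
import pandas as pd

# Hypothetical user records and a country-level demographics lookup table
users = pd.DataFrame({'id': ['u1', 'u2'], 'country': ['FR', 'US']})
demographics = pd.DataFrame({'country': ['FR', 'US'],
                             'population_m': [67, 331],
                             'income_per_capita': [43000, 65000]})

# Enrich each user record with the demographic data for their country
enriched = users.merge(demographics, on='country', how='left')
``````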

Relating this concept to the competition we are working through, consider how much more accurately we could predict a first booking country of a user if we could link the data from their Airbnb profile to data from one of their social media profiles (Facebook, Twitter etc.) or even better, from a Tripadvisor or Expedia account.

The key point here is that it is worth investing time looking for ways to add new and useful data to your existing dataset before moving onto the modeling step. Expanding your dataset in this manner will often produce far bigger improvements in prediction accuracy than the choice of algorithm or the tuning of the algorithm parameters.

## The Importance of Domain Knowledge

One of the things that may have occurred to you as you read through the various ways to modify and expand a dataset is: how are you supposed to know what will help and what will not?

This is where knowledge about the data you are using and what it represents becomes so important. This knowledge – referred to as domain knowledge – helps guide this entire process, including what was covered in Part III, cleaning the data.

Understanding how the data was collected helps to provide insight into potential errors in the data that might need to be addressed or shortcomings in the way the data was sampled (sample selection bias/errors). Understanding the relevant industry or market can also provide a range of insights including:

• what information may help to increase prediction accuracy and what is likely to be irrelevant
• if the model makes intuitive sense (e.g. can you predict the likelihood of waking up with a headache based on whether someone slept with their shoes on?[1]), and
• if the industry or market is changing in such a way that it is likely to make the model redundant in the near future.

In practical terms, where does this leave aspiring data scientists?

The first thing is to realize that, obviously, it is not possible to be a domain expert for every domain. Acknowledging this limitation is important as it forces a second realization – you will almost always need to seek out this expertise. For most of us that means involving and utilizing people who are domain experts when constructing your dataset and model. Having access to that expertise is likely to be the difference between a model that gets thrown out in 6 months and one that fundamentally improves a business and/or fulfills a customer need.

## Step by Step

After all the theory, let’s put some of these techniques into practice.

### Transforming Categorical Data

The first step we are going to undertake is some One Hot Encoding – replacing the categorical fields in the dataset with multiple columns representing one value from each column.

To do this, the Scikit Learn library comes with a One Hot Encoder that we could use for these transformations, but it is often instructive to write your own function, particularly if it is a relatively simple one like this. The code snippet below creates a simple function to do the encoding for a specified column, and then uses that function in a loop to convert all the categorical columns (and then delete the original columns).

``````# Home made One Hot Encoding function
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)``````

### Creating New Features

From Part II of this series, one of the things we observed about the training (and test) datasets is that there is not a huge number of columns to work with. This limits what new features we can add based on the existing data. However, two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in. The code for extracting a range of different data points from these two date columns (and then deleting the original date columns) is shown below:

``````# Add new date related fields
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)``````

## Wrapping Up

In two relatively simple steps, we have changed our training dataset from 14 columns to 163 columns. Although this seems like a lot more information, most of this expansion was caused by the One Hot Encoding, which is not adding more information, but simply expanding out the existing information. We have not added any external data, and I didn’t even really investigate what information we could have extracted from the other non-date columns.

Again, this process is open ended, so there is an almost unlimited range of possibilities that we have not even really begun to explore. As such, if you see an additional transformation or have an idea for the addition of a new feature, please feel free to let me know in a comment!

## Next Time

In the next piece, we will look at the data in sessions.csv that we left aside initially and see how we can add that data to our training dataset.

[1] This is an example of the existence of a confounding factor. A model predicting whether someone will wakeup with a headache based on whether they slept with their shoes on ignores that there is a more logical explanation for the headaches – in this case that both the headaches and sleeping with shoes on are caused by a third factor – going to bed drunk.

This article on cleaning data is Part III in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, it is recommended that you go back and read Part I and Part II.

In this part we will focus on cleaning the data provided for the Airbnb Kaggle competition.

## Cleaning Data

When we talk about cleaning data, what exactly are we talking about? Generally when people talk about cleaning data, there are a few specific things they are referring to:

1. Fixing up formats – Often when data is saved or translated from one format to another (for example in our case from CSV to Python), some data may not be translated correctly. We saw a good example of this in the last article: the timestamp_first_active column contained numbers like 20090609231247 instead of timestamps in the expected format: 2009-06-09 23:12:47. A typical job when it comes to cleaning data is correcting these types of issues.
2. Filling in missing values – As we also saw in Part II, it is quite common for some values to be missing from datasets. This typically means that a piece of information was simply not collected. There are several options for handling missing data that will be covered below.
3. Correcting erroneous values – For some columns, there are values that can be identified as obviously incorrect. This may be a ‘gender’ column where someone has entered a number, or an ‘age’ column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or assumed to be missing.
4. Standardizing categories – More of a subcategory of ‘correcting erroneous values’, this type of data cleansing is so common it is worth special mention. In many (all?) cases where data is collected from users directly – particularly using free text fields – spelling mistakes, language differences or other factors will result in a given answer being provided in multiple ways. For example, when collecting data on country of birth, if users are not provided with a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values to ensure that there is only one version of each value.
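To illustrate that last point, here is a minimal sketch of standardizing category values in pandas – the spellings and the mapping are just an example:

``````
import pandas as pd

# Hypothetical free text country values with multiple spellings of the same country
df = pd.DataFrame({'country_of_birth': ['USA', 'United States', 'U.S.', 'Canada']})

# Map every variant to a single standardized value
mapping = {'USA': 'US', 'United States': 'US', 'U.S.': 'US', 'Canada': 'CA'}
df['country_of_birth'] = df['country_of_birth'].replace(mapping)
``````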

## Options for Dealing with Missing Data

Missing data in general is one of the trickier issues that is dealt with when cleaning data. Broadly there are two solutions:

### 1. Deleting/Ignoring rows with missing values

The simplest solution available when faced with missing values is to not use the records with missing values when training your model. However, there are some issues to be aware of before you start deleting masses of rows from your dataset.

The first is that this approach only makes sense if the number of rows with missing data is relatively small compared to the dataset. If you are finding that you will be deleting more than around 10% of your dataset due to rows having missing values, you may need to reconsider.

The second issue is that in order to delete the rows containing missing data, you have to be confident that the rows you are deleting do not contain information that is not contained in other rows. For example, in the current Airbnb dataset we have seen that many users have not provided their age. Can we assume that the people who chose not to provide their age are the same as the users who did? Or are they likely to represent a different type of user, perhaps an older and more privacy conscious user, and therefore a user that is likely to make different choices on which countries to visit? If the answer is the latter, we probably do not want to just delete the records.
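If, after considering these issues, deleting is still the right choice, a quick sketch of how it might look in pandas – the columns and values here are hypothetical:

``````
import pandas as pd

# Hypothetical data with a missing age value
df = pd.DataFrame({'age': [34.0, None, 29.0],
                   'country_destination': ['US', 'NDF', 'FR']})

# Keep only the rows where 'age' is present
df_complete = df.dropna(subset=['age'])
``````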

### 2. Filling in the Values

The second broad option for dealing with missing data is to fill the missing values with a value. But what value to use? This depends on a range of factors, including the type of data you are trying to fill.

If the data is categorical (i.e. countries, device types, etc.), it may make sense to simply create a new category that will represent ‘unknown’. Another option may be to fill the values with the most common value for that column (the mode). However, because these are broad methods for filling the missing values, this may oversimplify your data and/or make your final model less accurate.

For numerical values (for example the age column) there are some other options. Given that in this case using the mode to fill values makes less sense, we could instead use the mean or median. We could even take an average based on some other criteria – for example filling the missing age values based on an average age for users that selected the same country_destination.
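A short sketch of some of these simple fill strategies – the values are invented for illustration:

``````
import pandas as pd

# Hypothetical data with missing categorical and numerical values
df = pd.DataFrame({'device': ['iPhone', None, 'Mac Desktop'],
                   'age': [34.0, None, 29.0],
                   'country_destination': ['US', 'FR', 'US']})

# Categorical: fill with a new 'unknown' category (or the mode)
df['device'] = df['device'].fillna('unknown')

# Numerical: fill with the overall median...
df['age'] = df['age'].fillna(df['age'].median())

# ...or fill based on a group average, e.g. the mean age per destination country
# df['age'] = df.groupby('country_destination')['age'].transform(lambda s: s.fillna(s.mean()))
``````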

For both types of data (categorical and numerical), we can also use far more complicated methods to impute the missing values. Effectively, we can use a similar methodology to the one we are planning to use to predict the country_destination to predict the values in any of the other columns, based on the columns that do have data. And just like with modeling in general, there is an almost endless number of ways this can be done, which won’t be detailed here. For more information on this topic, the Orange Python library provides some excellent documentation.

## Step by Step

With that general overview out of the way, let’s start cleaning the Airbnb data. In relation to the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files – train_users_2.csv and test_users.csv – and leave aside sessions.csv.

The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from two files train_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once[1].

``````import pandas as pd

# Import data
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)``````

### Clean the Timestamps

Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?

The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.

Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see here). For the timestamp_first_active column, the date format provided is different so we adjust our expression accordingly.

Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because the date_account_created column is sometimes empty, we replace the empty values with the value in the timestamp_first_active column using the fillna function. The code for this step is provided below:

``````# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)``````

### Remove booking date field

Those following along and/or paying attention may have noticed that in the original dataset, there are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in train_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.

This means that this column is not going to be useful for predicting which country a booking will be made in. What is more, if we leave it in the training dataset when building the model, it will likely increase the chances that the model predicts NDF, as those are the records without dates in the training dataset. The code for removing the column is provided below:

``````# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)``````

### Clean the Age column

As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or too low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, but implies we do not know the age value. In other words, we are changing the incorrect values into missing values. To do this, we create a simple function that takes a dataframe (table), a column name, a maximum acceptable value (90) and a minimum acceptable value (15). This function will then replace the values in the specified column that are outside the acceptable range with NaN.

Again from Part II we know there were also a significant number of users who did not provide their age at all – so they also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.

The code for these steps is shown below:

``````import numpy as np

# Remove outliers function
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.NaN, col_values)
    return df

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)``````

As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column. We are selecting this simple method for two main reasons:

1. Clarity – this series of articles is going to be long enough without adding the complication of a complex methodology for imputing missing ages.
2. Questionable results – in my testing during the actual competition, I did test several more complex imputation methodologies. However, none of the methods I tested actually produced a better end result than the methodology outlined above.

### Identify and fill additional columns with missing values

From more detailed analysis of the data, you may have also realized there is one more column that has missing values – the first_affiliate_tracked column. In the same way we have been filling in the missing values in other columns, we now fill in the values in this column.

``````# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)``````

### Sample Output

So what does the data look like after all these changes? Here is a sample of some rows from our cleaned dataset:

| id | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | first_affiliate_tracked | first_browser | first_device_type | gender | language | signup_app | signup_flow | signup_method | timestamp_first_active |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4ft3gnwmtx | direct | direct | 56.0 | US | 2010-09-28 00:00:00 | untracked | IE | Windows Desktop | FEMALE | en | Web | 3 | basic | 2009-06-09 23:12:47 |
| 87mebub9p4 | direct | direct | 41.0 | US | 2010-09-14 00:00:00 | untracked | Chrome | Mac Desktop | -unknown- | en | Web | 0 | basic | 2009-12-08 06:11:05 |
| osr2jwljor | other | other | -1.0 | US | 2010-01-01 00:00:00 | omg | Chrome | Mac Desktop | -unknown- | en | Web | 0 | basic | 2010-01-01 21:56:19 |
| lsw9q7uk0j | other | craigslist | 46.0 | US | 2010-01-02 00:00:00 | untracked | Safari | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-02 01:25:58 |
| 0d01nltbrs | direct | direct | 47.0 | US | 2010-01-03 00:00:00 | omg | Safari | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-03 19:19:05 |
| a1vcnhxeij | other | craigslist | 50.0 | US | 2010-01-04 00:00:00 | untracked | Safari | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-04 00:42:11 |
| 6uh8zyj2gn | other | craigslist | 46.0 | US | 2010-01-04 00:00:00 | omg | Firefox | Mac Desktop | -unknown- | en | Web | 0 | basic | 2010-01-04 02:37:58 |
| yuuqmid2rp | other | craigslist | 36.0 | US | 2010-01-04 00:00:00 | untracked | Firefox | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-04 19:42:51 |
| om1ss59ys8 | other | craigslist | 47.0 | NDF | 2010-01-05 00:00:00 | untracked | -unknown- | iPhone | FEMALE | en | Web | 0 | basic | 2010-01-05 05:18:12 |
| k6np330cm1 | direct | direct | -1.0 | FR | 2010-01-05 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-05 06:08:59 |
| ju3h98ch3w | other | craigslist | 36.0 | NDF | 2010-01-07 00:00:00 | untracked | Mobile Safari | iPhone | FEMALE | en | Web | 0 | basic | 2010-01-07 05:58:20 |
| v4d5rl22px | direct | direct | 33.0 | CA | 2010-01-07 00:00:00 | untracked | Chrome | Windows Desktop | FEMALE | en | Web | 0 | basic | 2010-01-07 20:45:55 |
| 2dwbwkx056 | other | craigslist | -1.0 | NDF | 2010-01-07 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-07 21:51:25 |
| frhre329au | other | craigslist | 31.0 | US | 2010-01-07 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-07 22:46:25 |
| gdka1q5ktd | direct | direct | 29.0 | FR | 2010-01-10 00:00:00 | untracked | Chrome | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-10 01:08:17 |

### Is that all?

Those more experienced with working with data may be thinking that we have not done all that much cleaning with this data – and you would be right. One of the nice things about Kaggle competitions is that the data provided does not require all that much cleaning as that is not what the providers of the data want participants to focus on. Many of the problems that would be found in real world data (as covered earlier) do not exist in this dataset, saving us significant time.

However, what this relatively easy cleaning process also tells us is that even when datasets are provided with the intention of needing no or minimal cleaning, there is always something that needs to be done.

## Next Time

In the next piece, we will focus on transforming the data and feature extraction, allowing us to create a training dataset that will hopefully allow the model to make better predictions. To make sure you don’t miss out, use the subscription feature below.

[1] For those with more data mining experience you may realize that combining the test and training data at this stage is not best practice. The best practice would be to avoid using the test dataset in any of the data preprocessing or model tuning/validation steps to avoid over fitting. However, in the context of this competition, because we are only trying to create the model to classify one unchanging dataset, simply maximizing the accuracy of the model for that dataset is the primary concern.

This article on understanding the data is Part II in a series looking at data science and machine learning by walking through a Kaggle competition. Part I can be found here.

Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we focus on understanding the data provided for the Airbnb Kaggle competition.

## Reviewing the Data

In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):

1. What features (columns) does the dataset contain?
2. How many records (rows) have been provided?
3. What format is the data in (e.g. how are the dates provided, which values are numerical, what do the different categorical values look like)?
4. Are there missing values?
5. How do the different features relate to each other?

For this competition, Airbnb have provided 6 different files. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:

1. train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries. Each row represents one user, with the columns containing various information such as the users’ ages and when they signed up. This is the primary dataset that we will use to train the model.
2. test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
3. sessions.csv – This file contains supplementary data that can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a wish list, ran a search, etc.) taken by the users in both the testing and training datasets above.

With this information in mind, an easy first step in understanding the data is reviewing the information provided by the data provider – Airbnb. For this competition, the information can be found here. The main points (aside from the descriptions of the columns) are as follows:

• All the users in the data provided are from the USA.
• There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’.
• ‘other’ means there was a booking, but in a country not included in the list, while ‘NDF’ means there was not a booking.
• The training and test sets are split by dates. In the test set, you will predict the destination country for all the new users with first activities after 7/1/2014.
• In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset dates back to 2010.

After absorbing this information, we can start looking at the actual data. For now we will focus on the train_users_2.csv file only.
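As a rough sketch of how this first look might be done in pandas (assuming the competition files have been downloaded into the working directory, with the file name as provided by Kaggle):

```python
import pandas as pd

# Load the primary training data provided by Kaggle
df = pd.read_csv("train_users_2.csv")

# Check the dimensions and column types first
print(df.shape)
print(df.dtypes)

# Transpose a few rows to get a view like Table 1 below
print(df.head(3).T)
```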

#### Table 1 – Three rows (transposed) from train_users_2.csv

| Column Name | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| id | 4ft3gnwmtx | v5lq9bj8gv | msucfwmlzc |
| date_account_created | 28/9/10 | 30/6/14 | 30/6/14 |
| timestamp_first_active | 20090609231247 | 20140630234429 | 20140630234729 |
| date_first_booking | 2/8/10 | | 16/3/15 |
| gender | FEMALE | -unknown- | MALE |
| age | 56 | | 43 |
| signup_method | basic | basic | basic |
| signup_flow | 3 | 25 | 0 |
| language | en | en | en |
| affiliate_channel | direct | direct | direct |
| affiliate_provider | direct | direct | direct |
| first_affiliate_tracked | untracked | untracked | untracked |
| signup_app | Web | iOS | Web |
| first_device_type | Windows Desktop | iPhone | Windows Desktop |
| first_browser | IE | -unknown- | Firefox |
| country_destination | US | NDF | US |

Looking at the sample of three records above provides us with a few key pieces of information about this dataset. The first is that at least two columns have missing values – the age column and date_first_booking column. This tells us that before we use this data for training a model, these missing values need to be filled or the rows excluded altogether. These options will be discussed in more detail in the next part of this series.
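To quantify this, a quick check with `isnull` does the job (a sketch, using the `df` loaded above):

```python
# Count missing values in the two columns identified above
print(df[['age', 'date_first_booking']].isnull().sum())

# Or check every column at once
print(df.isnull().sum())
```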

Secondly, most of the columns provided contain categorical data (i.e. the values represent one of a fixed number of categories). In fact, 11 of the 16 columns provided appear to be categorical. Most of the algorithms used for classification do not handle categorical data like this very well, so when it comes to the data transformation step, we will need to find a way to change this data into a form that is better suited to classification.
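One quick way to see which columns pandas has read in as categorical (`object`) data, sketched using the `df` from above:

```python
# Columns pandas has read in as strings/categories (dtype 'object')
categorical_cols = df.select_dtypes(include=['object']).columns
print(len(categorical_cols))
print(list(categorical_cols))
```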

Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47. This formatting will need to be corrected if we are to use the date values.
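One possible fix is to read the values as strings and parse them with an explicit format (a sketch using the `df` from above):

```python
# Parse the numeric timestamp (e.g. 20090609231247) into a proper datetime
df['timestamp_first_active'] = pd.to_datetime(
    df['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S'
)
print(df['timestamp_first_active'].head())
```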

## Diving Deeper

Now that we have gained a basic understanding of the data by looking at a few example records, the next step is to start looking at the structure of the data.

### Country Destination Values

Arguably, the most important column in the dataset is the one the model will try to predict – country_destination. Looking at the number of records that fall into each category can help provide some insights into how the model should be constructed as well as pitfalls to avoid.
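A breakdown like the one in Table 2 below can be generated with `value_counts` (a sketch using the `df` from above):

```python
# Absolute counts and percentage share of each destination
counts = df['country_destination'].value_counts()
shares = df['country_destination'].value_counts(normalize=True) * 100
print(pd.concat([counts, shares.round(1)], axis=1, keys=['Records', '% of Total']))
```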

#### Table 2 – Users by Destination

| Destination | Records | % of Total |
|---|---|---|
| NDF | 124,543 | 58.3% |
| US | 62,376 | 29.2% |
| other | 10,094 | 4.7% |
| FR | 5,023 | 2.4% |
| IT | 2,835 | 1.3% |
| GB | 2,324 | 1.1% |
| ES | 2,249 | 1.1% |
| CA | 1,428 | 0.7% |
| DE | 1,061 | 0.5% |
| NL | 762 | 0.4% |
| AU | 539 | 0.3% |
| PT | 217 | 0.1% |
| Grand Total | 213,451 | 100.0% |

Looking at the breakdown of the data, one thing that immediately stands out is that almost 90% of users fall into two categories, that is, they are either yet to make a booking (NDF) or they made their first booking in the US. What’s more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.
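The year-by-year split shown in Table 3 below can be reproduced with a crosstab against the account creation year (a sketch using the `df` from above):

```python
# Share of each destination within each account creation year
acct_year = pd.to_datetime(df['date_account_created']).dt.year
by_year = pd.crosstab(df['country_destination'], acct_year, normalize='columns') * 100
print(by_year.round(1))
```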

#### Table 3 – Users by Destination and Year

| Destination | 2010 | 2011 | 2012 | 2013 | 2014 | Overall |
|---|---|---|---|---|---|---|
| NDF | 42.5% | 45.4% | 55.0% | 59.2% | 61.8% | 58.3% |
| US | 44.0% | 38.1% | 31.1% | 28.9% | 26.7% | 29.2% |
| other | 2.8% | 4.7% | 4.9% | 4.6% | 4.8% | 4.7% |
| FR | 4.3% | 4.0% | 2.8% | 2.2% | 1.9% | 2.4% |
| IT | 1.1% | 1.7% | 1.5% | 1.2% | 1.3% | 1.3% |
| GB | 1.0% | 1.5% | 1.3% | 1.0% | 1.0% | 1.1% |
| ES | 1.5% | 1.7% | 1.2% | 1.0% | 0.9% | 1.1% |
| CA | 1.5% | 1.1% | 0.7% | 0.6% | 0.6% | 0.7% |
| DE | 0.6% | 0.8% | 0.7% | 0.5% | 0.3% | 0.5% |
| NL | 0.4% | 0.6% | 0.4% | 0.3% | 0.3% | 0.4% |
| AU | 0.3% | 0.3% | 0.3% | 0.3% | 0.2% | 0.3% |
| PT | 0.0% | 0.2% | 0.1% | 0.1% | 0.1% | 0.1% |
| Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

For modeling purposes, this type of split means a couple of things. Firstly, the spread of categories has changed over time. Considering that our final predictions will be made against user data from July 2014 onwards, this change provides us with an incentive to focus on more recent data for training purposes, as it is more likely to resemble the test data.

Secondly, because the vast majority of users fall into two categories, there is a risk that if the model is too generalized, or in other words not sensitive enough, it will select one of those two categories for every prediction. A key step will be making sure the training data contains enough information for the model to predict the other categories as well.

### Account Creation Dates

Let’s now move onto the date_account_created column to see how the values have changed over time.

#### Chart 1 – Accounts Created Over Time

Chart 1 provides excellent evidence of the explosive growth of Airbnb, averaging over 10% growth in new accounts created per month. In the year to June 2014, 125,884 new accounts were created – a 132% increase on the year before.

But aside from showing how quickly Airbnb has grown, this data also provides another important insight: the majority of the training data comes from the most recent two years. In fact, if we limited the training data to accounts created from January 2013 onwards, we would still be including over 70% of all the data. This matters because, referring back to the notes provided by Airbnb, if we want to use the data in sessions.csv we are limited to data from January 2014 onwards. Again looking at the numbers, this means that even though the sessions.csv data only covers 11% of the time period (6 out of 54 months), it still covers over 30% of the training data – or 76,466 users.
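These proportions are easy to check directly (a sketch using the `df` from above):

```python
# New accounts created per month (the data behind Chart 1)
created = pd.to_datetime(df['date_account_created'])
monthly = created.dt.to_period('M').value_counts().sort_index()
print(monthly.tail())

# Share of users whose accounts were created from Jan 2013 / Jan 2014 onwards
print((created >= '2013-01-01').mean())
print((created >= '2014-01-01').mean())
```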

### Age Breakdown

Looking at the breakdown by age, we can see a good example of another issue that anyone working with data (whether a Data Scientist or not) faces regularly – data quality issues. As can be seen from Chart 2, a significant number of users have reported ages well over 100, and a significant number have even reported ages over 1000.

#### Chart 2 – Reported Ages of Users

So what is going on here? Firstly, it appears that a number of users have reported their birth year instead of their age. This would help to explain why there are a lot of users with ‘ages’ between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age as 105 and 110. This is harder to explain but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Either way, these values would appear to be errors that will need to be addressed.
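A quick way to inspect these implausible values (a sketch using the `df` from above):

```python
# How many users reported an age over 100, and which values are most common
over_100 = df.loc[df['age'] > 100, 'age']
print(len(over_100))
print(over_100.value_counts().head(10))
```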

Additionally, as we saw in the example data provided above, another issue with the age column is that sometimes age has not been reported at all. In fact, if we look across all the training data provided, we can see a large number of missing values in all years.
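The figures in Table 4 below can be reproduced by grouping the missing-age flag by the year the account was created (a sketch using the `df` from above):

```python
# Missing ages by year of account creation
acct_year = pd.to_datetime(df['date_account_created']).dt.year
missing = df['age'].isnull().groupby(acct_year).agg(['sum', 'size'])
missing.columns = ['Missing Values', 'Total Records']
missing['% Missing'] = (missing['Missing Values'] / missing['Total Records'] * 100).round(1)
print(missing)
```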

#### Table 4 – Missing Ages

| Year | Missing Values | Total Records | % Missing |
|---|---|---|---|
| 2010 | 1,082 | 2,788 | 38.8% |
| 2011 | 4,090 | 11,775 | 34.7% |
| 2012 | 13,740 | 39,462 | 34.8% |
| 2013 | 34,950 | 82,960 | 42.1% |
| 2014 | 34,128 | 76,466 | 44.6% |
| Total | 87,990 | 213,451 | 41.2% |

When we clean the data, we will have to decide what to do with these missing values.

### First Device Type

Finally, the last column that we will look at is the first_device_type column.

#### Table 5 – First Device Used

| Device | 2010 | 2011 | 2012 | 2013 | 2014 | All Years |
|---|---|---|---|---|---|---|
| Mac Desktop | 37.2% | 40.4% | 47.2% | 44.2% | 37.3% | 42.0% |
| Windows Desktop | 21.6% | 25.2% | 37.7% | 36.9% | 31.0% | 34.1% |
| iPhone | 5.8% | 6.3% | 3.8% | 7.5% | 15.9% | 9.7% |
| Other/Unknown | 28.8% | 21.3% | 3.8% | 2.8% | 4.6% | 5.0% |
| Android Phone | 1.1% | 1.2% | 0.7% | 0.4% | 2.6% | 1.3% |
| Android Tablet | 0.4% | 0.4% | 0.3% | 0.5% | 0.9% | 0.6% |
| Desktop (Other) | 0.4% | 0.4% | 0.4% | 0.6% | 0.7% | 0.6% |
| SmartPhone (Other) | 0.0% | 0.1% | 0.1% | 0.0% | 0.0% | 0.0% |
| Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

The interesting thing about the data in this column is how the types of devices used have changed over time. Windows Desktop users have grown from around 22% to over 30% of all users, and iPhone users have tripled their share, while the ‘Other/Unknown’ device group has gone from the second largest category to less than 5% of users. Further, the majority of these changes occurred between 2011 and 2012, suggesting that there may have been a change in the way devices were classified.

Like with the other columns we have reviewed above, this change over time reinforces the presumption that recent data is likely to be the most useful for building our model.

### Other Columns

It should be noted that although we have not covered all of them here, having some understanding of all the data provided in a dataset is important for building an accurate classification model. In some cases, this may not be possible due to the presence of a very large number of columns, or due to the fact that the data has been abstracted (that is, the data has been converted into a different form). However, in this particular case, the number of columns is relatively small and the information is easily understandable.

## Next Time

Now that we have taken the first step – understanding the data – in the next piece, we will start cleaning the data to get it into a form that will help to optimize the model’s performance.