Data Inspired Insights


Pandas: Filtering and segmenting

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

One of the most common ways you will interact with a pandas DataFrame is by selecting different combinations of columns and rows. This can be done using the numerical positions of columns and rows in the DataFrame, column names and row indices, or by filtering the rows by applying some criteria to the data in the DataFrame. All of these options (and combinations of them) are available, so let’s dig in!

Reading in a dataset

If you don’t have a dataset you want to play around with, the University of California, Irvine maintains an excellent online repository of datasets. For this explainer we are going to be using the Wine Quality dataset. If you want to follow along, you can import the dataset as follows:

import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

Selecting columns

Selecting columns in pandas is about as straightforward as it gets, but there are a few options worth covering. First, let’s keep it simple:

df['quality']

0       5
1       5
2       5
3       6
4       5
       ..
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, Length: 1599, dtype: int64

By putting the name of the column we want to select inside square brackets and quotes (‘ and ” both work), we return a pandas Series. There is an alternative which returns the same results:

df.quality

0       5
1       5
2       5
3       6
4       5
       ..
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, Length: 1599, dtype: int64

However, I recommend you do not use dot notation for several reasons:

  1. It will only work if the column name does not have any special characters. If, for example, your column name has a space in it (e.g. 'fixed acidity'), you won’t be able to use dot notation.
  2. If your column name is the same as a DataFrame method name (e.g. you have a column called ‘sum’), using the dot notation will call the method instead of selecting the column.
  3. In more complicated scenarios you will often need to select columns dynamically. That is, you will store a column name in a variable, then use that variable to access the column, as in the sketch below. This only works with the bracket notation.
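
To illustrate that last point, here is a minimal sketch of dynamic column selection (the variable name is just for illustration):

column_name = 'fixed acidity'

df[column_name]    # works: selects the 'fixed acidity' column
# df.column_name   # fails: pandas looks for a column literally named 'column_name'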

What if we want to select multiple columns? Instead of passing one column name, we are going to pass a list of column names:

df[['fixed acidity', 'pH', 'quality']]

      fixed acidity    pH  quality
0               7.4  3.51        5
1               7.8  3.20        5
2               7.8  3.26        5
3              11.2  3.16        6
4               7.4  3.51        5
...             ...   ...      ...
1594            6.2  3.45        5
1595            5.9  3.52        6
1596            6.3  3.42        6
1597            5.9  3.57        5
1598            6.0  3.39        6

[1599 rows x 3 columns]

You might notice that we now have double square brackets. To understand why we need them, let’s reorganize our code a little:

columns_to_extract = ['fixed acidity', 'pH', 'quality']
df[columns_to_extract]

As this hopefully makes clearer, one set of square brackets is the syntax for selecting columns from a DataFrame in pandas, the other set is needed to create the list of column names.

What if we don’t know the column names or just want to select columns based on their position in the DataFrame rather than their name? Here we can use iloc or “integer-location”:

df.iloc[:, 0]

0        7.4
1        7.8
2        7.8
3       11.2
4        7.4
        ... 
1594     6.2
1595     5.9
1596     6.3
1597     5.9
1598     6.0
Name: fixed acidity, Length: 1599, dtype: float64

To explain this syntax, first let’s understand what iloc does. iloc is a method for selecting rows and columns in a DataFrame, based on their zero-indexed integer location in the DataFrame. That is, the first column (counting from left to right) will be column 0, the second column will be 1, and so on. The same applies to the rows (counting from top to bottom), the first row is row 0, second is row 1 and so on. The full syntax for iloc is:

df.iloc[<row numbers>, <column numbers>]

When we pass ":" to iloc before the "," as we did in the example above, we tell iloc to return all rows. If we passed ":" after the "," we would return all columns.
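
Slices also work the same way they do for Python lists. As a quick sketch, all of the following are valid:

df.iloc[0:5, :]    # first five rows, all columns
df.iloc[:, 0:3]    # all rows, first three columns
df.iloc[0:5]       # the column part can be omitted if we want all columns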

Now let’s combine a couple of these techniques. We can also pass lists of numbers for both the rows and columns to iloc. Before scrolling down, see if you can guess what the following will return:

df.iloc[[0, 1], [0, 1, 2]]

If you guessed the first two rows for the first three columns, well done!

   fixed acidity  volatile acidity  citric acid
0            7.4              0.70          0.0
1            7.8              0.88          0.0

Selecting rows

When it comes to selecting or filtering rows in a DataFrame, there are typically two scenarios:

  1. We have a list of rows we want to select based on the index; or
  2. We want to filter based on the values in one or more columns.

By name

Let’s start with the less common use case as it is the simpler one to understand. Every DataFrame by default has an index. This index works like column names for rows: we can use it to select a row or a selection of rows. But first let’s look at the index for the DataFrame from earlier:

df.index

RangeIndex(start=0, stop=1599, step=1)

This output tells us that our index is just a list of sequential numbers from 0 to 1598. This aligns exactly with the zero-indexed row numbers we were using with iloc earlier. However, to show they are different things, let’s create a new column called id that will be the current index plus 10, and then set that column as the index:

df['id'] = df.index + 10
df.set_index('id', inplace=True)

df.index

Int64Index([  10,   11,   12,   13,   14,   15,   16,   17,   18,   19,
            ...
            1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608], dtype='int64', name='id', length=1599)

Now we can use this new index to select the first row of the DataFrame, which now has the index 10:

df.loc[10]

fixed acidity            7.4000
volatile acidity         0.7000
citric acid              0.0000
residual sugar           1.9000
chlorides                0.0760
free sulfur dioxide     11.0000
total sulfur dioxide    34.0000
density                  0.9978
pH                       3.5100
sulphates                0.5600
alcohol                  9.4000
quality                  5.0000
Name: 10, dtype: float64

We are now selecting the first row by index. You can also test by trying to select row 0 to confirm that there is no longer a row with that index.
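
For example, the following should now raise a KeyError, because no row has the index 0 anymore:

df.loc[0]    # KeyError: 0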

Now let’s talk a little about loc. loc is the named equivalent to iloc, meaning that instead of passing it lists of zero-indexed row and column numbers, we pass it the names of columns and the indices of rows:

df.loc[<row indices>, <column names>]

In practice, this ends up being much more useful than iloc. In fact, most experienced pandas users wouldn’t even have to take their socks off to count the number of times they’ve used iloc.

Something that is common to loc and iloc is that if we just want to select some rows for all columns, we don’t actually need to pass the columns at all. In other words df.loc[[10, 11, 12]] is the same as df.loc[[10, 11, 12], :].

By value

A very common use case is that we need to select a subset of rows based on the values in one or more columns. The way we do that is syntactically simple, but also very powerful once you understand what it is actually doing. Let’s start by looking at the syntax:

df[df['fixed acidity'] > 12]

This line of code will return all rows in the DataFrame where the value in the “fixed acidity” column is greater than 12. But why do we need to repeat df? Let’s reorganize our code again for a bit of clarity:

filter = df['fixed acidity'] > 12
df[filter]

The key to understanding what the code is doing is understanding what filter in the above code is. When we run the first line of code above, filter becomes a Series (basically a list with some metadata) that has the same length as the DataFrame (i.e. one value for every row), and each value in that Series is either True or False. For our example, the value will be True where fixed acidity is greater than 12, and False otherwise. When we pass that list of True and False values to the DataFrame (or to loc), it will return the rows with a True value.

Why is this important to understand? Because it means you aren’t limited to generating a list of True and False values using the columns of the DataFrame you are working with. For example, I can generate a list of True and False values based on arbitrary things like whether the row number (not the index) is divisible by 3 (or in other words, selecting every third row):

every_3rd_row = [i % 3 == 0 for i in range(len(df))]
df[every_3rd_row]

Obviously this is not something that you are likely to see used in practice, but the point is simply to show that once you understand that filters are just lists of True and False values, you are free to generate that list any way you want/need to.

Filtering a DataFrame for multiple conditions

What if we want to filter the DataFrame for multiple conditions? To do that, we are going to need the following characters:

Character    Meaning
&            AND condition
|            OR condition
~            negation

Let’s look at an example:

filter = ((df["fixed acidity"] > 12) & (df["volatile acidity"] < 0.3)) | (df["quality"] == 3)
df[filter]

Using the same structure as before, separating out the filter from the line where we apply the filter to the DataFrame, we are now filtering for three conditions – “fixed acidity” has to be greater than 12 and “volatile acidity” has to be less than 0.3; or “quality” has to be equal to 3. Note that you can keep adding more and more conditions in the same way, you just need to ensure you wrap each individual condition in parentheses “()”, and also use parentheses to specify how you want to group the conditions when you use an or condition.
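
The negation operator works the same way. As a quick sketch, it flips a condition (or a whole combined filter) so we get the rows that do not match:

df[~(df["quality"] == 3)]    # all rows where quality is NOT equal to 3
df[~filter]                  # all rows that do NOT meet the combined conditions above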

There are lots of more advanced methods for creating these True/False series for filtering, but that is a subject for a separate explainer.

Selecting rows and columns

The above describes all the essential building blocks for how to filter and segment a DataFrame. We can select specific columns and we can filter the rows using multiple criteria. Now we just have to put it all together. This is where loc really shows its value; loc doesn’t just work with row indices, it also works with those lists of True and False values. So now we can combine those filters and a selection of column names:

filter = ((df["fixed acidity"] > 12) & (df["volatile acidity"] < 0.3)) | (df["quality"] == 3)
df.loc[filter, ["density", "pH"]]

We also aren’t limited to just selecting these values, we can also use these selections to overwrite values. For example, let’s imagine I really like sweet red wines. Before I show these ratings to someone who is going to buy wines for me, I want to assign a quality of 9 to the wines with residual sugar in the top 1%. Let’s see how this could be done:

cutoff = df["residual sugar"].quantile(0.99)
df.loc[df["residual sugar"] >= cutoff, "quality"] = 9
df.sort_values(["residual sugar", "quality"], ascending=False).head(20)

In the first line, we work out the cutoff point in terms of residual sugar and assign that value to a variable called cutoff. We use the loc method to filter the rows and select the column we want to update, then assign the new higher rating to everything that meets the criteria. Finally, we use sort_values to confirm it worked.

Wrapping up

In this explainer we have looked at a range of ways to filter the rows and columns of a DataFrame. This includes the loc and iloc methods, and how a list of True and False values, however it is created, can be used to filter the rows in a DataFrame. We also looked at how we can filter a DataFrame based on very complex criteria using combinations of simple building blocks and the and (&), or (|) and negation (~) operators. With these tools, you will be able to filter and segment a DataFrame in practically any way you are likely to need.

Pandas: Basic data interrogation

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

Once we have our data in a pandas DataFrame (the basic table structure in pandas), the next step is to work out what we actually have. If you are coming from Excel or RStudio, you are probably used to being able to look at the data any time you want. In Python/pandas, we don’t have a spreadsheet to work with, and we don’t even have an equivalent of RStudio (although Jupyter notebooks are a similar concept), but we do have several tools available that can help you get a handle on what your data looks like.

DataFrame Dimensions

Perhaps the most basic question is how much data do I actually have? Did I successfully load in all the rows and columns I expected or are some missing? These questions can be answered with the shape attribute:

import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

print(df.shape)

(1599, 12)

shape returns a tuple (think of it as a list that you can’t alter) which tells you the number of rows and columns, 1599 and 12 respectively in this example. You can also use len to get the number of rows:

print(len(df))

1599

Using len is also slightly quicker than using shape, so if it is just the number of rows you are interested in, go with len.

Another dimension we might be interested in is the size of the DataFrame in terms of memory usage. For this we can use the memory_usage method:

print(df.memory_usage())

Index                     128
fixed acidity           12792
volatile acidity        12792
citric acid             12792
residual sugar          12792
chlorides               12792
free sulfur dioxide     12792
total sulfur dioxide    12792
density                 12792
pH                      12792
sulphates               12792
alcohol                 12792
quality                 12792
dtype: int64

This tells us the space, in bytes, that each column is taking up in memory. If we want to know the total size of the DataFrame, we can take a sum, and then to get the number into a more usable metric like kilobytes (kB) or megabytes (MB), we can divide by 1024 as many times as needed.

print(df.memory_usage().sum() / 1024)  # Size in kB

150.03125

Lastly, for these basic dimension assessments, we can generate a list of the data types of each column. This can be a very useful early indicator that your data has been read in correctly. If you have a column that you believe should be numeric (i.e. a float64 or an int64) but it is listed as object (the dtype pandas uses for strings and other non-numeric data), it may be a sign that something has not been interpreted correctly:

print(df.dtypes)

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

Viewing some sample rows

After we have satisfied ourselves that we have the expected volume of data in our DataFrame, we might want to look at some actual rows of data. Particularly for large datasets, this is where the head and tail methods come in handy. As the names suggest, head will return the first n rows of the DataFrame and tail will return the last n rows of the DataFrame (n is set to 5 by default for both).

print(df.head())

Aside from this basic use, we can use head and tail in some very useful ways with some alterations/additions. First off, we can set n to whatever value we want, showing as many or as few rows as desired:

print(df.head(10))

We can also combine it with sort_values to see the top (or bottom) n rows of data sorted by a column, or selection of columns:

print(df.sort_values('fixed acidity', ascending=False).head(10))

Finally, if we have a lot of columns, too many to display all of them in Jupyter or the console, we can combine head/tail with transpose to inspect all the columns for a few rows:

print(df.head().transpose())

Summary Statistics

Moving on, the next step is typically some exploratory data analysis (EDA). EDA is a very open ended process so no one can give you an explicit set of instructions on how to do it. Each dataset is different and to a large extent, you just have to allow your curiosity to run wild. However, there are some tools we can take advantage of in this process.

Describe

The most basic way to summarize the data in your DataFrame is the describe method. This method, by default, gives us a summary of all the numeric fields in the DataFrame, including counts of values (which exclude null values), the mean, standard deviation, min, max and some percentiles.

df.describe()

This is nice, but let’s talk about what isn’t being shown. Firstly, by default, any non-numeric and date fields are excluded. In the dataset we are using in this example we don’t have any non-numeric fields, so let’s add a couple of categorical fields (they will just have random letters in this case), and a date field:

import string
import random


df['categorical'] = [random.choice(string.ascii_letters) for i in range(len(df))]
df['categorical_2'] = [random.choice(string.ascii_letters) for i in range(len(df))]
df['date_col'] = pd.date_range(start='2020-11-01', periods=len(df))

df.describe()

As we can see, the output didn’t change. But we can use some of the parameters for describe to address that. First, we can set include='all' to include all datatypes in the summary:

df.describe(include='all')

Now for the categorical columns it tells us some useful numbers about the number of unique values and which value is the most frequent. However, it is still handling the date column like a categorical value. We can change that so it is treated as a numeric value by setting the datetime_is_numeric parameter to True:

df[['date_col']].describe(datetime_is_numeric=True)

pandas_summary Library

Building on top of the kind of summaries that are produced by describe, some very talented people have developed a library called pandas_summary. This library is purely designed to generate informative summaries of pandas DataFrames. First though we need to do some quick setup (you may need to install the library using pip):

from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)

Now let’s take a look at two ways we can use this new DataFrameSummary object. The first one is columns_stats. This is similar to what we saw previously with describe, but with one useful addition: the number and percent of missing values in each column:

dfs.columns_stats

Secondly, my personal favorite: by selecting a column from the DataFrameSummary object, we can get some really detailed statistics for an individual column, plus a histogram thrown in for numeric fields:

dfs['fixed acidity']

Seaborn

Seaborn is a statistical data visualization library for Python with a full suite of charts that you should definitely look into if you have time, but for today we are going to look at just one very nice feature – pairplot. This function will generate pairwise correlation plots for all the columns in your DataFrame with literally one line:

import seaborn as sns

sns.pairplot(df, hue="quality")

The colors of the plot will be determined by the column you select for the hue parameter. This allows you to see how the values in that column relate to the two features in each pairwise plot. Along the diagonal, where a feature would otherwise be plotted against itself, we instead get the distribution of that feature for each value in the hue column.

Note, if you have a lot of columns, be aware that this type of chart will become less useful, and will also likely take a lot of time to render.

Wrapping Up

Exploratory data analysis (EDA) should be an open ended and flexible process that never really ends. However, when we are first trying to understand the basic dimensions of a new dataset and what it contains, there are some common tools we can employ such as shape, describe and dtypes, and some very useful third party libraries such as pandas_summary and seaborn. While this explainer does not provide a comprehensive list of methods and techniques, hopefully it has provided you with somewhere to get started.

Pandas: Reading in JSON data

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

When we are working with data in software development or when the data comes from APIs, it is often not provided in a tabular form. Instead it is provided in some combination of key-value stores and arrays broadly denoted as JavaScript Object Notation (JSON). So how do we read this type of non-tabular data into a tabular format like a pandas DataFrame?

Understanding Structures

The first thing to understand about data stored in this form is that there are effectively infinite ways to represent a single dataset. For example, take a simple dataset, shown here in tabular form:

id   department   first_name   last_name
1    Sales        John         Johnson
2    Sales        Peter        Peterson
3    Sales        Paula        Paulson
4    HR           James        Jameson
5    HR           Jennifer     Jensen
6    Accounting   Susan        Susanson
7    Accounting   Clare        Clareson

Let’s look at some of the more common ways this data might be represented using JSON:

1. “Records”

A list of objects, with each object representing a row of data. The column names are the keys of each object.

[
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson"
  },
  {
    "department": "Sales",
    "first_name": "Peter",
    "id": 2,
    "last_name": "Peterson"
  },
  {
    "department": "Sales",
    "first_name": "Paula",
    "id": 3,
    "last_name": "Paulson"
  },
  ...
]

2. “List”

An object where each key is a column, with the values for that column stored in a list.

{
  "id": [
    1,
    2,
    3,
    4,
    5,
    6,
    7
  ],
  "department": [
    "Sales",
    "Sales",
    "Sales",
    "HR",
    "HR",
    "Accounting",
    "Accounting"
  ],
  "first_name": [
    "John",
    "Peter",
    "Paula",
    "James",
    "Jennifer",
    "Susan",
    "Clare"
  ],
  "last_name": [
    "Johnson",
    "Peterson",
    "Paulson",
    "Jameson",
    "Jensen",
    "Susanson",
    "Clareson"
  ]
}

3. “Split”

An object with two keys, one for the column names, the other for the data which is a list of lists representing rows of data.

{
  "columns": [
    "id",
    "department",
    "first_name",
    "last_name"
  ],
  "data": [
    [
      1,
      "Sales",
      "John",
      "Johnson"
    ],
    [
      2,
      "Sales",
      "Peter",
      "Peterson"
    ],
  ...
  ]
}

Creating a DataFrame

So how do we get this data into a pandas DataFrame given that it could come in different forms? The key is knowing what structures pandas understands. In the first two cases above (#1 Records and #2 List), pandas understands the structure and will automatically convert it to a DataFrame for you. All you have to do is pass the structure to the DataFrame class:

list_data = {
  "id": [
    1,
    2,
    3,
    4,
    5,
    6,
    7
  ],
  "department": [
    "Sales",
    "Sales",
    "Sales",
    "HR",
    "HR",
    "Accounting",
    "Accounting"
  ],
  "first_name": [
    "John",
    "Peter",
    "Paula",
    "James",
    "Jennifer",
    "Susan",
    "Clare"
  ],
  "last_name": [
    "Johnson",
    "Peterson",
    "Paulson",
    "Jameson",
    "Jensen",
    "Susanson",
    "Clareson"
  ]
}
df = pd.DataFrame(list_data)
print(df)

   id  department first_name last_name
0   1       Sales       John   Johnson
1   2       Sales      Peter  Peterson
2   3       Sales      Paula   Paulson
3   4          HR      James   Jameson
4   5          HR   Jennifer    Jensen
5   6  Accounting      Susan  Susanson
6   7  Accounting      Clare  Clareson

Initializing a DataFrame this way also gives us a couple of options. We can load in only a selection of columns:

df = pd.DataFrame(list_data, columns=['id', 'department'])
print(df)

   id  department
0   1       Sales
1   2       Sales
2   3       Sales
3   4          HR
4   5          HR
5   6  Accounting
6   7  Accounting

We can also define an index. However, if one of the fields in your dataset is what you want to set as the index, it is simpler to do so after you load the data into a DataFrame:

df = pd.DataFrame(list_data).set_index('id')
print(df)

    department first_name last_name
id                                 
1        Sales       John   Johnson
2        Sales      Peter  Peterson
3        Sales      Paula   Paulson
4           HR      James   Jameson
5           HR   Jennifer    Jensen
6   Accounting      Susan  Susanson
7   Accounting      Clare  Clareson

Acceptable Structures

What are the acceptable structures that pandas recognizes? Here are the ones I have found so far:

Records

[
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson"
  },
  {
    "department": "Sales",
    "first_name": "Peter",
    "id": 2,
    "last_name": "Peterson"
  },
  {
    "department": "Sales",
    "first_name": "Paula",
    "id": 3,
    "last_name": "Paulson"
  },
  ...
]

List

{
  "id": [
    1,
    2,
    3,
    4,
    5,
    6,
    7
  ],
  "department": [
    "Sales",
    "Sales",
    "Sales",
    "HR",
    "HR",
    "Accounting",
    "Accounting"
  ],
  "first_name": [
    "John",
    "Peter",
    "Paula",
    "James",
    "Jennifer",
    "Susan",
    "Clare"
  ],
  "last_name": [
    "Johnson",
    "Peterson",
    "Paulson",
    "Jameson",
    "Jensen",
    "Susanson",
    "Clareson"
  ]
}

Dict

{
  "id": {
      0: 1,
      1: 2,
      2: 3,
      3: 4,
      4: 5,
      5: 6,
      6: 7
  },
  "department": {
    0: "Sales",
    1: "Sales",
    2: "Sales",
    3: "HR",
    4: "HR",
    5: "Accounting",
    6: "Accounting"
  },
  "first_name": {
    0: "John",
    1: "Peter",
    2: "Paula",
    3: "James",
    4: "Jennifer",
    5: "Susan",
    6: "Clare"
  },
  "last_name": {
    0: "Johnson",
    1: "Peterson",
    2: "Paulson",
    3: "Jameson",
    4: "Jensen",
    5: "Susanson",
    6: "Clareson"
  }
}

Matrix

For this one you will have to pass the column names separately, as shown in the sketch after the example below.

[
    [
        1,
        "Sales",
        "John",
        "Johnson"
    ],
    [
        2,
        "Sales",
        "Peter",
        "Peterson"
    ],
    [
        3,
        "Sales",
        "Paula",
        "Paulson"
    ],
    ...
]
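
As a minimal sketch of the Matrix case, using the same column names as the example dataset above:

matrix_data = [
    [1, "Sales", "John", "Johnson"],
    [2, "Sales", "Peter", "Peterson"],
    [3, "Sales", "Paula", "Paulson"]
]
df = pd.DataFrame(matrix_data, columns=["id", "department", "first_name", "last_name"])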

Inconsistencies

What happens if the data is in one of these formats, but has some inconsistencies? For example, what if we have something that looks like this?

[ 
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson",
    "extra_field": "woops!"
  },
  {
    "department": "Sales",
    "first_name": "Peter",
    "id": 2,
    "last_name": "Peterson"
  },
  {
    "department": "Sales",
    "first_name": "Paula",
    "id": 3,
    "last_name": "Paulson"
  },
  ...
]

Fortunately, pandas is fairly robust to these types of inconsistencies, in this case creating an extra column and filling the remaining rows with NaN (null values):

   id  department first_name last_name extra_field
0   1       Sales       John   Johnson      woops!
1   2       Sales      Peter  Peterson         NaN
2   3       Sales      Paula   Paulson         NaN
3   4          HR      James   Jameson         NaN
4   5          HR   Jennifer    Jensen         NaN
5   6  Accounting      Susan  Susanson         NaN
6   7  Accounting      Clare  Clareson         NaN

Something important to note is that, depending on the structure and where the inconsistency occurs in the structure, the inconsistency can be handled differently. It could be an additional column, an additional row, or in some cases it may be ignored completely. The key is, as always, to check your data has loaded as expected.

Explicit Methods

There are two more methods for reading JSON data into a DataFrame: DataFrame.from_records and DataFrame.from_dict. DataFrame.from_records expects data in the ‘Records’ or ‘Matrix’ formats shown above, while DataFrame.from_dict will accept data in either the Dict or List structures. These methods are more explicit in what they do and have several potential advantages:

Clarity

When working with a team or in a situation where other people are going to review your code, being explicit can help them to understand what you are trying to do. Passing some unknown structure to DataFrame and knowing/hoping it will interpret it correctly is using a little too much ‘magic’ for some people. For the sanity of others, and yourself in 6 months when you are trying to work out what you did, you might want to consider the more explicit methods.

Strictness

When writing code that is going to be reused, maintained and/or run automatically, we want to write that code in a very strict way. That is, it should not keep working if the inputs change. Using DataFrame could lead to situations where the input data format changes, but is read in anyway and instead breaks something else further down the line. In a situation like this, someone will likely have the unenviable task of following the trail through the code to work out what changed.

Using the more explicit methods is more likely to cause the error to be raised where the problem actually occurred: reading in data which is no longer in the expected format.

Options

The more explicit methods also give you more options for reading in the data. For example, DataFrame.from_records gives you the option to limit the number of rows to read in, while DataFrame.from_dict allows you to specify the orientation of the data – that is, whether the lists of values represent columns or rows.
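
Here is a rough sketch of both options, reusing the structures shown earlier (records_data and dict_data stand in for the Records and Dict examples above):

# Limit the number of rows read in (nrows applies when the data is an iterator)
df = pd.DataFrame.from_records(iter(records_data), nrows=2)

# Treat each key of the dict as a row label rather than a column name
df = pd.DataFrame.from_dict(dict_data, orient='index')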

Coercion

In some cases, your data will not play nice and the generic DataFrame class will not correctly interpret your data. Using the more explicit method can help to resolve this. For example, if your objects are in a column of a DataFrame (i.e. a pandas Series) instead of a list, using DataFrame will give you a DataFrame with one column:

records_data = pd.Series([
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson",
    "test": 0
  },
  {
    "department": "Sales",
    "first": "Peter",
    "id": 2,
    "name": "Peterson"
  },
  {
    "dept": "Sales",
    "firstname": "Paula",
    "sid": 3,
    "lastname": "Paulson"
  },
  {
    "dept": "HR",
    "name": "James",
    "pid": 4,
    "last": "Jameson"
  }
])
print(pd.DataFrame(records_data))

Using the more explicit DataFrame.from_records gives you the expected results:

records_data = pd.Series([
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson",
    "test": 0
  },
  {
    "department": "Sales",
    "first": "Peter",
    "id": 2,
    "name": "Peterson"
  },
  {
    "dept": "Sales",
    "firstname": "Paula",
    "sid": 3,
    "lastname": "Paulson"
  },
  {
    "dept": "HR",
    "name": "James",
    "pid": 4,
    "last": "Jameson"
  }
])
print(pd.DataFrame.from_records(records_data))

Wrapping Up

We’ve looked at how we can quickly and easily convert JSON format data into tabular data using the DataFrame class and the more explicit DataFrame.from_records and DataFrame.from_dict methods. The downside is this only works if the data is in one of a few structures. The upside is most of the data you will encounter will be in one of these formats, or something that is easily converted into these formats.

If you want to play around with converting data between delimited format and various JSON formats, I can recommend trying an app I built a while back: JSONifyit.

Pandas: Reading in tabular data

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

To get started with pandas, the first thing you are going to need to understand is how to get data into pandas. For this guide we are going to focus on reading in tabular data (i.e. data stored in a table with rows and columns). If you don’t have some data available but want to try some things out, a great place to get some data to play with is the UCI Machine Learning Repository.

Delimited Files

One of the most common ways you will encounter tabular data, particularly data from an external source or publicly available data, is in the form of a delimited file such as comma separated values (CSV), tab separated values (TSV), or separated by some other character. To import this data so you can start playing with it, pandas gives you the read_csv function with a lot of options to help manage different cases. But let’s start with the very basic case:

import pandas as pd

df = pd.read_csv('path/to/file.csv')

# Show the top 5 rows to make sure it read in correctly
print(df.head())

Running this code imports the pandas library (as pd), uses the read_csv function to read in the data and stores it as a pandas DataFrame called df, then prints the top 5 rows using the head method. Note that the path to the file that you want to import can be a path to a file on your computer, or it can be a URL (web address) for a file on the internet. As long as you have internet access (and permission to access the file) it will work like you have the file downloaded and saved already.

By default, read_csv assumes the delimiting character is a comma (","). In most cases this is exactly what you want, but where your file uses a different delimiter, you can use the sep parameter to specify which character to use. For example, if your file is separated with ";" you might do something like:

import pandas as pd

df = pd.read_csv('path/to/file.csv', sep=';')

# Show the top 5 rows to make sure it is correct
print(df.head())

OK, what if your file has some other junk above the actual data – for example, a few rows of notes or metadata before the header row?

We have two options for working around this, the header parameter and the skiprows parameter:

import pandas as pd

df_1 = pd.read_csv('path/to/file.csv', header=7)
df_2 = pd.read_csv('path/to/file.csv', skiprows=7)

# Both DataFrames produce the same result
print(df_1.head())
print(df_2.head())

These are equivalent because setting header=7 tells read_csv to look in row 7 (remember the row numbers are 0 indexed) to find the header row, then assume the data starts from the next row. On the other hand, setting skiprows=7 tells read_csv to ignore the first 7 rows (so rows 0 to 6), then it assumes the header row is the first row after the ignored rows.

Other Useful read_csv Parameters

There are dozens of other parameters to help you read in your data to handle a range of strange cases, but here are a selection of parameters I have found most useful to date:

Parameter          Description
skipfooter         Skip rows at the end of the file (note: you will need to set engine='python' to use this)
index_col          Column to set as the index (the values in this column will become the row labels)
nrows              Number of rows to read in (useful for reading in a sample of rows from a large file)
usecols            A list of columns to read (can use the column names or the 0 indexed column numbers)
skip_blank_lines   Skip empty rows instead of reading in as NaN (empty values)

For the full list of available parameters, check out the official documentation. One thing to note is that although there are a lot of parameters available for read_csv, many are focused on helping correctly format and interpret data as it is being read in – for example, interpretation of dates, interpretation of boolean values, and so on. In many cases these are things that can be addressed after the data is in a pandas DataFrame, and handling these types of formatting and standardization steps explicitly after reading in the data can make it easier to understand for the next person that reads your code.
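
As a quick, hedged sketch of a few of these parameters in action (the file path and column names here are placeholders):

# Read a 1,000 row sample of two columns, using the first of them as the index
df_sample = pd.read_csv('path/to/file.csv', usecols=['column_1', 'column_2'],
                        index_col='column_1', nrows=1000)

# Skip the last 3 rows of the file (skipfooter requires the python engine)
df_no_footer = pd.read_csv('path/to/file.csv', skipfooter=3, engine='python')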

Excel Data

Pandas also has a handy wrapper for reading in Excel data: read_excel. Instead of writing your data to a CSV, then reading it in, you can read directly from the Excel file itself. This function has many of the same parameters as read_csv, with options to skip rows, read in a sample of rows and/or columns, specify a header row and so on.
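
A minimal sketch (the file path and sheet name are placeholders, and depending on your pandas version you may need the openpyxl package installed to read .xlsx files):

df = pd.read_excel('path/to/file.xlsx', sheet_name='Sheet1', skiprows=2, nrows=100)

# Show the top 5 rows to make sure it read in correctly
print(df.head())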

Databases

If your data is in a tabular/SQL database, like PostgreSQL, MySQL, Bigquery or something similar, your job gets a little bit more complicated to setup, but once that setup is done, it becomes really simple to repeatedly query data (using SQL) from that database directly into a DataFrame where you can do what you want with it.

The first step is to create a connection to the database holding the data. It is beyond the scope of this particular guide, but the library you will almost certainly need to use is SQLAlchemy, or in some cases a library created by the creator of the database (for example, Google has a BigQuery API library called google-cloud-bigquery).
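
As a rough sketch of that setup using SQLAlchemy (the connection details here are placeholders):

from sqlalchemy import create_engine

# Connection string format: dialect://username:password@host:port/database
engine = create_engine('postgresql://user:password@localhost:5432/db_name')

# The engine can then be passed as the con argument to the pandas functions below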

Once you have connected to your database, pandas provides three functions for you to extract data into a DataFrame: read_sql_table, read_sql_query and read_sql. The last of these, read_sql, is what’s called a ‘convenience wrapper’ around read_sql_table and read_sql_query – the functionality of both the underlying functions can be accessed from read_sql. But let’s look at the two underlying functions individually to see what the differences are and what options we have.

read_sql_table is a function we can use to extract data from a table in a SQL database. The function requires two parameters: table_name – the name of the table you want to get the data from – and con – the connection to the database the table is in. With these two parameters, all data from the specified table (i.e. SELECT * FROM table_name) will be returned as a DataFrame:

df = pd.read_sql_table(table_name='table_name', con='postgres:///db_name')  

read_sql_table does also give you the option to specify a list of columns to be extracted using the columns parameter.

read_sql_query on the other hand allows you to specify the query you want to run against the database.

query = """
    SELECT column_1
        , column_2
        , column_3
    FROM table_name
    WHERE column_4 > 10
"""
df = pd.read_sql_query(query, 'postgres:///db_name')  

Obviously writing your own query gives you a lot more flexibility to extract exactly what you need. However, also consider the potential upside in terms of processing efficiency. Doing aggregations and transformations in a database, in almost all cases, will be much faster than doing it in pandas after it is extracted. As a result, some careful query planning can save a lot of time and effort later on.
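
For example, here is a hedged sketch of pushing an aggregation into the query itself rather than pulling raw rows and aggregating in pandas afterwards (table and column names are the same placeholders as above):

query = """
    SELECT column_4
        , COUNT(*) AS row_count
        , AVG(column_1) AS avg_column_1
    FROM table_name
    GROUP BY column_4
"""
df = pd.read_sql_query(query, 'postgres:///db_name')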

Other Data Sources

Pandas also has functions for reading in data from a range of other sources, including HTML tables, SPSS, Stata, SAS and HDF files. We won’t go into them here, but being aware that these options exist is often all you really need to know. If a case arises where you need to read data from these sources, you can always refer to the documentation.

Wrapping Up

We’ve looked at how we can use pandas to read in data from various sources of tabular data, from delimited files and Excel, to databases, to some other more uncommon sources. While these functions often have many parameters available, remember most of them will be unnecessary for any given dataset. These functions are designed to work with the minimum parameters provided (e.g. just a file location) in most cases. Also remember that once you have the data in a DataFrame, you will have a tonne of options to fix and change the data as needed – you don’t need to do everything in one function.

Official Release: Visual Analytics

I am proud to announce the release of an application I’ve been working on for the last few months – Visual Analytics. This application is designed to give you a new way to view your Google Analytics data using a range of interactive visualizations, allowing you to get a better understanding of who your users are, how they are getting to your site, and what they are doing when they get there.

For those worried about privacy and personal security, the application has a couple of features that will hopefully ease your mind. Firstly, there is no separate account or login details needed for Visual Analytics, everything is based on your existing Google account, and the login process is completed using Google authentication.

Secondly, the application does not currently store any user data. In fact, the application has no database at all (sensing a theme here?). That means not only that I cannot sell your data to third parties, but also that even if someone does manage to hack into the application, there is nothing to steal except my hacky code base.

For those interested in the technical specs, the backend of the application was built using Python and the Flask web framework. To access the data, once you are logged in using your Google credentials, the application makes calls to the Google Analytics API and then uses Pandas to handle the data manipulation (where needed). On the front end, the visualizations are created using D3.js and Highcharts (a big shout out to the Highcharts team and Mike Bostock for their excellent work on these libraries).

Anyway, if you have a Google Analytics account and want some new insights into your data, take a look and let me know what you think. And please, if you find an issue or a bug, let me know!

How to create a flashcard app without a database

Last week, I covered how setting up a database may not be necessary when creating an app or visualization, even one that relies on data. This week we are going to walk through an example application that runs off data, but does not need a formal database.

First some background. For the last couple of months, I have been attending some basic Arabic classes to help get around Jordan easier. During a recent class, several of the students were discussing the time they had spent putting together physical cardboard flashcards to help them memorize words. Hearing this, and having played around with creating simple applications and visualizations for a couple of years now, it occurred to me that generating flashcards using a simple application would probably be significantly quicker and easier to do. In addition, if done the right way, it could work on your phone, making it available to you anytime you had your phone and an internet connection.

Perhaps as you are reading this, you are thinking that Arabic is a widely spoken language, surely there is already an app for that? And you would be correct, there are multiple apps for learning Arabic. However, the complication with Arabic is that each country/region uses a significantly different version of the language. In addition to these regional dialects, there is Modern Standard Arabic, which is the formal written version of the language. When it comes to the apps currently available, the version of Arabic being presented is almost always Modern Standard Arabic, as opposed to the Levantine Arabic which is spoken throughout Palestine, Jordan, Lebanon and Syria. Additionally, the apps are quite expensive (up to $10), of questionable accuracy/quality, or both.

To address this problem, and for the challenge and learning opportunity, a few weeks back I sat down over a weekend and put together a simple application that would generate Arabic flashcards (you can see the current version here). The application is based on an Excel spreadsheet with translations that I continue to enter over time based on my notes and the class textbook. Using an Excel spreadsheet in this case is advantageous for two key reasons:

  • It is simple to edit and update with new translations
  • It is a format that almost everyone is familiar with so I can recruit other students and teachers to add translations

With that out of the way, let’s take a look at the high-level process for creating a flashcards app.

1. Collecting the Data

The first step is creating the Excel spreadsheet for our translations. In this case, it is fairly simple and looks something like this:

[Image: arabic flashcards spreadsheet]

In this spreadsheet, each row represents one word and, through the application, will represent one flashcard. In one column, we have the Arabic script for each word, then in the next column the English translation. In addition, we also have a third version of the word with a column header of ‘transcribed’. This column represents how the Arabic word would look/sound if written in Latin script, something that is commonly used in beginner classes when students cannot yet read Arabic script.[1] Finally, in the last column we have a category column. This will be used to provide a feature where the user can filter the words in the application, allowing them to focus their study on particular sets of words.

A quick note, we also have an ID column, which is not used in the application. It is only included to provide a unique key in our datasets, as good practice.

2. Processing the Data

The next step is to take the data from our spreadsheet, convert it to a format that we can use to generate flashcards in the application, then save it. To do this we will use my favorite Python library, pandas, and the short script shown below.

# -*- coding: utf-8 -*-
import pandas as pd

# Read In Data
df = pd.read_excel("./data.xlsx", header=0)

# Create JSON String
json_string = df.to_json(orient="records", force_ascii=False)
json_string = "var data = " + json_string + ";"

# Write to file
text_file = open("data.js", "w")
text_file.write(json_string)
text_file.close()

What this script does is read in the file (in this case, data.xlsx) to a pandas DataFrame (line 5). After that (line 8), we use the to_json method to output the contents of the DataFrame to a JSON string. In line 9 we add some JavaScript to the beginning and end of that JSON string, then in lines 12-14 we save the string as a JavaScript file, data.js.

There are a couple of important things to note here. The first is that when dealing with non-Latin text characters (like Arabic characters), we need to specify that force_ascii=False (the default value is True). If we don’t do this, the script will return an error and/or convert the Arabic letters into a combination of Latin characters representing the Unicode character (i.e. it will look like gibberish).

The second thing to note for those that have not worked with JSON, or key-value stores more generally, is that this is the format that most data comes in when used in programs and applications. It is a highly flexible structure and, as a result, there are many ways we could represent the data shown above. In this case, we are using the ‘records’ format (as specified by pandas), which will look like this:

[
  {
    "id":1,
    "arabic":"كتير",
    "english":"A lot\/Many\/Very",
    "transcribed":"kteer",
    "category":"Basics"
  },
  {
    "id":2,
    "arabic":"عَن",
    "english":"About",
    "transcribed":"3an",
    "category":"Basics"
  },
  ...
]

If this isn’t making any sense, or you would like to see some of the other possibilities, copy and paste some spreadsheet data into this CSV to JSON converter. Toggling a few options, it should quickly become obvious how many different ways a given dataset can be represented in JSON format.

3. Building the App

Now that the data is ready, we create the files needed for the flashcards application. In this case, it is only three files: an HTML document (index.html) for the page, a CSS file for the styling, and an additional JavaScript file that will use the data in data.js to create the flashcards and generate the various features of the application. For those that are interested in the full code or want to create your own version, please feel free to check out/fork the GitHub repo. For those that do not want to get too far into the weeds, there are just a few things I want to highlight about what the code is doing.

Firstly, the filtering and language options in the application are being generated directly from the data. What this means is that as more categories are added to the Excel spreadsheet, or if the languages change (i.e. the headings in the spreadsheet change), as soon as I update the underlying Excel and run the script shown above, all the options in the application will also update accordingly.

Secondly, I added a feature that allows the user to keep score. It is a simple honesty-based system, but I found it does provide some motivation to keep improving, as well as removing an element of self-deception as to how well you are actually doing. Often I would find myself thinking that I was getting almost all of them correct, only to find my correct percentage hovering around 70%.

Finally, a note on randomness. Whether the user is going through the cards unfiltered, or filtering for some category, the application is displaying the flashcards in a random[2] order. This random selection algorithm went through several iterations:

  1. In version 1, the algorithm would simply select four (the number of flashcards presented to the user at one time) random selections from the pool of eligible words.
  2. Upon testing version 1, it was found that, with surprising regularity, the same word would be selected more than once in a group of four flashcards. To address this, in version 2 a condition was added that when randomly selecting a word, it would only be accepted if that word had not already been selected in the given pool of four words.
  3. On further testing, I noticed another annoying issue. As I continually refreshed the four flashcards being displayed, some words would show up repeatedly, while others would take forever to show up, or not show up at all. To avoid this, for version 3, I changed the algorithm again. Now, instead of selecting four words at random, the algorithm instead took the whole list of words, shuffled them in a random order, and ran through the list in the new shuffled order. When the list ran out of words, it took the full list, shuffled it again, and continued.
  4. This was a big improvement. As I refreshed, I got different words, and was able to see all the words before they started repeating. But then I found another issue. In cases where the number of eligible words was not divisible by four, the old shuffled list and the new shuffled list would overlap in a selection of four words. In these cases, there was a possibility that the same word would be repeated. This is a little difficult to visualize, so the illustration below tries to present what was happening using an example list of ten words:

[Image: arabic flashcards - shuffle overlap illustration]

To address this, in version 4, a new condition was added. In cases like the example shown above, the algorithm will check the words from the new shuffled list to ensure they are not already selected from the old list. If a word is already selected, it will move that word to the end of the list and instead take the next word on the list. Here is another diagram to show what is happening:

[Image: arabic flashcards - updated shuffle illustration]

4. Finishing Up

Ok, for those stepping through this and creating your own flashcards app, at this point you have copied the code available from the repo, made any changes to the spreadsheet, and rerun the script to refresh the data. For the final step, there are a couple of things that can be done.

If you are only planning to use the app on the same computer as you are using to create the flashcards app, you are done! Just open the index.html file using Chrome, Firefox or Safari (you can try Internet Explorer, but you know…) and you can test and use the app as you would use any website.

If you want to publish your flashcards app online to share with others, by far the easiest way is to use a service such as GitHub Pages. I don’t want to turn this into a beginner’s guide to using git and GitHub, but there is excellent documentation available to help get you started if you would like to do this. You can see my version at the following address: https://vladimiriii.github.io/arabic-flashcards/, but there is even an option to redirect it to a domain of your choosing should you have one.

[Image: arabic flashcards]

I hope this was a helpful guide to how a simple application can be created without a database, even if the application runs on some underlying form of data. Let me know what you think in the comments below!

[1] Because Arabic has many sounds that are difficult to convey in Latin script, this is also why when Arabic is transcribed, you will often find multiple spellings of the same word (e.g. Al-Qaeda vs Al-Qaida).

[2] As will be discussed in a new piece to be written, it is not truly random, and the reasons why are pretty interesting.

Forget SQL or NoSQL – 5 scenarios where you may not need a database at all

A while back, I attended a hackathon in Belgrade as a mentor. This hackathon was the first ‘open data’ hackathon in Serbia and focused on making applications using data that had recently been released by various ministries, government agencies, and independent bodies in Serbia. As we walked around talking to the various teams, one of the things I noticed at the time was that almost all teams were using databases to manage their data. In most cases, the database being used was something very lightweight like SQLite3, but in some cases more serious databases (MySQL, PostgreSQL, MongoDB) were also being used.

What I have come to realize is that in many cases this was probably completely unnecessary, particularly given the tight timeframe the teams were working towards – a functional prototype within 48 hours. However, even if you have more time to build an application, there are several good reasons that you may not need to worry about using a formal database. These are outlined below.

1. The data is small

Firstly, let’s clarify what I mean when I say ‘small data’. For me, small data is any dataset under 10,000 records (assuming a reasonable number of data points for each record). For many non-data people, 10,000 records may seem quite big, but when using programming languages such as Python or JavaScript, this amount of data is usually very quick and easy to work with. In fact, as Josh Zeigler found, even loading 100,000 records or 15MB of data into a page was possible, completing in as little as 463ms (Safari FTW).

Leaving aside the numbers for a second, the key point here is that in many cases, the data being displayed in an application has far fewer than 10,000 records. If your data is less than 10,000 records, you should probably ask yourself, do you need a database? It is often far simpler, and requires significantly less overhead, to simply have your data in a JSON file and load it into the page directly. Alternatively, CSV and Excel files can also be converted to JSON and dumped to a file very quickly and easily using a Python/pandas script, as in the sketch below.
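
As a minimal sketch of that kind of conversion (file names here are placeholders):

import pandas as pd

df = pd.read_csv('data.csv')               # or pd.read_excel('data.xlsx')
df.to_json('data.json', orient='records')  # dump the data to a JSON file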

[Image: ECIS Development Tracker visualization]

The ECIS Development Tracker uses data from six Worldwide Governance Indicators and two other series over 20 years and 18 countries – a total of almost 3,000 data points and a perfect example of small data.

2. The data is static

Another reason you may not need a database is if you have a reasonable expectation that the data you are using is not going to change. This is often the case where the data is going to be used for read only purposes – for example visualizations, dashboards and other apps where you are presenting information to users. In these cases, again it may make sense to avoid a database, and rely on a flat file instead.

The important point here is that if the data is not changing or being altered, then static files are probably all that is needed. Even if the data is larger, you can use a script to handle any data processing and load the (assumedly) aggregated or filtered results into the page. If your needs are more dynamic (i.e. you want to show different data to different users and do not want to load everything), you may need a backend (something you would need for a database anyway) that extracts the required data from the flat file, but again, a database may be overkill.

[Image: Kosovo Mosaic visualizer]

The Kosovo Mosaic visualizer – based on data from a survey conducted once every three years – is an example of a case where the data is not expected to change any time soon.

3. The data is simple

One of the big advantages of databases is their ability to store and provide access to complex data. For example, think about representing data from a chain of retail stores on the sale of various products by different sales people. Because there are three related concepts (products, sales people and stores), representing this data outside a database becomes very difficult without a large amount of repetition[1]. In such cases, even if the data is small and static, it may simply be better to use a relational database to store it.

However, in cases where the data can be represented in a table, or multiple unrelated tables, subject to points 1 and 2 above, it may make sense to avoid the overhead of a database.

[Image: database schema]

If you need a schema diagram like this to describe your data, you can probably skip the rest of this article.

4. The data is available from a good API

I have recently been working on a project to develop an application that is making extensive use of the Google API. While still under development, the app is already quite complex, making heavy use of data to generate charts and tables on almost every page. However, despite this complexity, so far, I have not had to use a database.

One of the primary reasons I have not needed to implement a database is that the Google API is flexible enough for me to effectively use it as a database. Every time I need data to generate a chart or table, the app makes a call to the API (using Python) and passes the results to the front end where, because the data is small (the Google API returns a maximum of 10,000 rows per query), most of the data manipulation is handled using JavaScript on the client side. For the cases where heavier data manipulation is required, I make use of Python libraries like Pandas to handle the processing before sending the data to the front end. What this boils down to is a data-intensive application that, as yet, still does not need a database.
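
In very rough terms, the pattern looks something like the sketch below. The endpoint, helper function, and column names are invented for illustration and are not the actual app's code; fetch_report stands in for whatever API client you are using.

import pandas as pd
from flask import Flask, Response

app = Flask(__name__)

def fetch_report():
    # Stand-in for the real API call - in the actual app this would use
    # the relevant API client library to pull the raw rows
    return [
        {"campaign": "spring", "clicks": 120},
        {"campaign": "spring", "clicks": 80},
        {"campaign": "autumn", "clicks": 45},
    ]

@app.route("/api/chart-data")
def chart_data():
    # The external API plays the role of the database
    rows = fetch_report()

    # pandas handles any heavier aggregation before the data is sent on
    df = pd.DataFrame(rows)
    summary = df.groupby("campaign", as_index=False)["clicks"].sum()
    return Response(summary.to_json(orient="records"), mimetype="application/json")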

Of course, this isn’t to say I will not need a database in the future. If I plan to store user settings and preferences, track usage of the application, or collect other meta data, I will need to implement a database to store that information. However, if you are developing an application that will make use of a flexible and reliable API, you may not need to implement your own database.

[Image: Google APIs]

Google has APIs available for almost all of its products – most of them with a lot of flexibility and quick response times.

5. The app is being built for a short-term need

While it might seem unusual to build a web app with the expectation that it will not be used six months later, this is a surprisingly common use case. In fact, this is often the expectation for visualizations and other informative pages, or pages built for a specific event.

In these use cases, keeping overhead down should be a major consideration, as should the available hosting options. Developing these short-term applications without a backend or database means free and easy hosting solutions like those provided by GitHub can be used. Adding a backend or database immediately means a more complex hosting setup is required.

Wrapping up, this is not an argument against databases…

… it is simply an argument to use the best and simplest tools for a given job. As someone who has worked with a number of different databases throughout my career, I am actually a big user of databases and find most of them intuitive and easy to use. There are also advantages that only a database can provide: ensuring data consistency, allowing large numbers of users to make updates simultaneously, and managing large and complex datasets. In short, there are a number of very good reasons to use a database (SQL or NoSQL, whichever flavor you happen to prefer).

But, as we have covered above, there may be some cases where you do not need these features and can avoid adding an unnecessary complication to your app.

 

Next week we’ll take a look at a simple app that uses an Excel spreadsheet to generate the data required for the application.

 

[1] With repetition comes an increased risk of data quality issues

JSONify It – CSV to JSON Converter

Go to JSONify It

For those who have some experience creating visualizations, particularly online visualizations using JavaScript and libraries such as D3.js, one thing you will often come across is the need to convert your data. Typically this need arises because the data you receive or collect is in a human-friendly format such as an Excel spreadsheet, while the visualization needs that data in JSON format. Annoyingly, this will often be just a one-time conversion, meaning writing a standalone script to do the job often seems like overkill.

Handily, there are a number of CSV to JSON converters lying around on the internet, and most of them work more or less as expected. However, a problem I encountered when building this Procurement Zoomable Treemap visualization is that you sometimes need the JSON to be nested, and this was not a feature I found in any of the online converters.

In order to address my need (and to see if I could pull it off), when I built that visualization I also used Python/Flask/Pandas to build a simple API that generated nested JSON datasets on the fly from an underlying CSV file. Having this allowed me to build a zoomable treemap that could be reconfigured by the user. That is, the user could specify the categories, and the order, that the treemap would zoom through.
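
The core nesting idea is simpler than it might sound. Below is a simplified sketch (not the actual JSONify It code, and with made-up column and file names) of how pandas' groupby can be used recursively to turn a flat CSV into the hierarchical JSON a zoomable treemap expects:

import json
import pandas as pd

def nest_by(df, group_cols, value_cols):
    # Recursively nest the flat table by each grouping column in turn
    if not group_cols:
        return df[value_cols].to_dict(orient="records")
    col = group_cols[0]
    return [
        {"name": key, "children": nest_by(group, group_cols[1:], value_cols)}
        for key, group in df.groupby(col)
    ]

df = pd.read_csv("procurement.csv")  # hypothetical input file
nested = {"name": "root", "children": nest_by(df, ["ministry", "supplier"], ["value"])}

# default=str guards against numeric types that json can't serialize directly
with open("procurement_nested.json", "w") as f:
    json.dump(nested, f, indent=2, default=str)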

While this was great, it always felt a little incomplete. Then a few months back, I had some time on my hands and decided to take this API and give it a full user interface so that, like many of the online converters, users could copy and paste data straight into the browser, configure some options, and get a JSON formatted dataset back. The result was JSONify It – a simple but (I hope) easy to use app that is not only very flexible in formatting JSON, but as far as I can tell, is the only CSV to JSON converter that allows you to nest the JSON by any column (or columns) you specify.

So, for those interested, feel free to take a look, try it out, look at the code, and if you come across any bugs or issues, or would like any further information, please let me know in the comments below.

 

Data Science: A Kaggle Walkthrough – Creating a Model

This article is Part VI in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part I, Part II, Part III, Part IV and Part V).

Continuing on the walkthrough, in this part we build the model that will predict the first booking destination country for each user based on the dataset created in the earlier parts.

Choosing an Algorithm

The first step to building a model is to decide what type of algorithm to use. Below we look at some of the options.

Decision Tree

Arguably the most well known algorithm, and one of the simplest conceptually, the decision tree works in much the same way as the decision tree you might sketch by hand when trying to work out which decision to make based on a range of variables.

The goal of the decision tree algorithm used for classification problems (like the one we are looking at) is to create one of these decision trees to classify records into a set number of categories. To do this, it starts with all the records in the training dataset and looks through all the features until it finds the one that allows it to most ‘cleanly’ split the records according to their categories. For example, if you are using daily weather data to try and determine whether it will rain the following day (i.e. there are two categories, ‘it does rain’ and ‘it does not rain’), the algorithm will look for a feature that best splits the records (in this case representing days) into those two categories. When it finds that feature, and the value to split on, it creates one point (‘decision node’) on the decision tree. It then takes each subpopulation and does the same thing again, building up a tree until either all the records are correctly classified, or the number in each subpopulation becomes too small to split. Below is an example decision tree using the described weather data to predict if it will rain tomorrow or not (thanks to Graham Williams’ excellent Rattle package for R):

The way to interpret the above tree is to start at the top. The first criterion the algorithm splits on is the humidity at 3pm. Starting with 100% of the records, if the humidity at 3pm is less than 71, as is the case for 93% of the records, we move to the left and find the next decision node. If the humidity at 3pm is greater than or equal to 71, we move to the right, which takes us to a leaf node where the model predicts that there will be rain tomorrow (‘yes’). We can see from the numbers in the node that this represents 7% of all records, and that 74% of the records that reach this node are correctly classified.

The first thing to note is that the model does not accurately predict whether it will rain tomorrow for all records, and in some leaf nodes it is only slightly better than a coin toss. This is not necessarily a bad thing. The biggest problem that data scientists have with decision trees is the classic problem of overfitting. In the example above, parameters have been set to stop the model splitting once the population of records at a given node gets too small (‘minimum split’) and once a certain number of splits have occurred (‘maximum depth’). These values have been chosen to prevent the tree from growing too large. The reason for this is that if the tree gets too large, it will start modelling random noise and hence will not work for data outside the training dataset (it will not ‘generalize’ well).

To picture what this means, imagine extending the example decision tree above until the model starts splitting out single records using criteria like ‘Humidity3pm = 54’ and ‘Humidity3pm = 31’. That type of decision node may work for this particular training data because there is a specific record that meets those criteria, but it is highly unlikely to represent any real predictive ability and so is unlikely to be accurate when applied to other data.

All this discussion of overfitting with decision trees does, however, raise an important question: how do you know how large to grow the tree? How do you set the parameters to avoid overfitting while still keeping an accurate model? The truth is that it is extremely difficult to know in advance. Set the parameters too conservatively and the model will lose too much predictive power. Set them too aggressively and the model will start overfitting the data.
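
To get a feel for the trade-off, here is a small, hedged illustration using scikit-learn's decision tree on one of its bundled datasets (not the competition data). An unconstrained tree will fit the training data almost perfectly, and comparing training and test accuracy across the two settings gives a sense of how much of that fit carries over to unseen data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth=None lets the tree grow until every leaf is pure (aggressive);
# max_depth=3 stops it early (conservative)
for max_depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    print("max_depth=%s  train accuracy: %.3f  test accuracy: %.3f" % (
        max_depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))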

Seeing the Forest for the Trees

Given the limitations of decision trees and the risk of overfitting, it may be tempting to think “why bother?” Fortunately, methods have been found to reduce the risk of overfitting and increase the predictive power of decision trees, and the two most popular methods share the same basic premise – train multiple trees.

One of the most well known algorithms that utilizes decision trees is the ‘random forest’ algorithm. As the name suggests, the algorithm constructs a large number of trees (as defined by the user) by randomly selecting the features that can be used to build each tree (as opposed to using all the features for every tree). Typically, the trees in a random forest also have their parameters set so that each tree is relatively shallow, meaning the algorithm creates a large number of shallow decision trees (decision bonsai?). Once the trees are constructed, each tree is used to predict the outcome for a new record, and these multiple predictions then serve as votes, with a majority-rules approach determining the final classification.
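
As a hedged sketch of the idea (again on a bundled scikit-learn dataset rather than the competition data), a random forest of shallow trees can be trained in a few lines:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 shallow trees, each considering a random subset of features at each split;
# the forest's prediction is effectively a majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5,
                                max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy: %.3f" % forest.score(X_test, y_test))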

Another algorithm, which has become almost the default choice for Kagglers and is the type of model we will use, relies on a method called ‘boosting’, which means it builds trees iteratively so that each tree ‘learns’ from the trees before it. To do this the algorithm builds a first tree – again typically shallower than if you were using a single-tree approach – and makes predictions using that tree. It then finds the records that were misclassified by that tree and assigns them a higher weight of importance than the records that were correctly classified. The algorithm then builds a new tree with these new weightings. This whole process is repeated as many times as specified by the user. Once the specified number of trees have been built, all the trees built during this process are used to classify the records, with a majority-rules approach used to determine the final prediction.

It should be noted that this methodology (‘boosting’) can actually be applied to many classification algorithms, but it has become particularly popular in its decision tree based implementations. It should also be noted that there are several different implementations of boosted trees. In this case, we will be using the very popular XGBoost algorithm.

Alternative Models

So far we have only covered decision trees and decision tree-based algorithms. However, there is a range of different algorithms that can be used for classification problems. Given this is supposed to be a short blog series, I will not go into too much detail on each algorithm here. But if you want more information on these algorithms, or other algorithms that I haven’t covered, there is a growing amount of information online. I also strongly recommend the Data Science specialization offered by Johns Hopkins University, for free, on Coursera.

K-Nearest Neighbors

The K-nearest neighbors algorithm is arguably one of the simplest algorithms conceptually. It classifies a given object by finding the k most similar records[1] and seeing how those records are classified. This type of algorithm is called a lazy learner because during the training phase it essentially just stores the data provided. Only when a new object needs to be classified does the algorithm start looking through the data to find the closest matches.
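
A minimal sketch using scikit-learn's implementation (on the small bundled iris dataset, purely for illustration) looks like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'Training' just stores the data; the real work happens at prediction time,
# when the 5 nearest stored records vote on each new record's class
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy: %.3f" % knn.score(X_test, y_test))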

Neural Networks

As the name suggests, these algorithms simulate biological neural networks by creating a series of nodes and connecting them together. A neural network typically consists of three layers: an input layer, a hidden layer (although there can be multiple hidden layers) and an output layer.

A model is trained by passing records through the network, with the weights at each node continually adjusted to ensure that each record ends up at the right ‘output node’.
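
For illustration only (a simple multi-layer perceptron on scikit-learn's bundled digits dataset, not a deep learning framework), the structure described above can be expressed as:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 50 nodes sits between the input and output layers;
# the weights are adjusted iteratively as records pass through the network
net = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy: %.3f" % net.score(X_test, y_test))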

Support Vector Machines

This type of algorithm, commonly used for text classification problems, is arguably the most difficult to visualize. At the simplest level, the algorithm tries to draw straight lines (or planes, for classifications with more than two features) that best separate the classes provided. Although this sounds like a fairly simplistic approach to classifying objects, it becomes far more powerful thanks to the transformations (sometimes called the ‘kernel trick’) the algorithm can apply to the data before drawing these lines/planes. The mathematics behind this is far too complex to go into here, but the Wikipedia page has some nice visuals to help picture how it works, and this video provides a nice example of how a Support Vector Machine can separate classes using the kernel trick.
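
To make the kernel trick concrete in code, here is a toy sketch (a synthetic dataset, not the competition data): two classes arranged in concentric circles cannot be separated by a straight line, but an RBF kernel handles them easily.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged in concentric circles - no straight line separates them
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf"]:
    svm = SVC(kernel=kernel)
    svm.fit(X_train, y_train)
    print("%s kernel  test accuracy: %.3f" % (kernel, svm.score(X_test, y_test)))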

Creating the Model

Back to the modelling – now that we know what algorithm we are using (the XGBoost algorithm, for those skipping ahead), let’s talk about the approach.

Cross Validation

As mentioned in regard to decision trees, one of the key risks when creating models of any type is overfitting. One of the primary ways data scientists guard against overfitting is to estimate the accuracy of their models on data that was not used to train the model. To do this they typically use a method called cross validation. There are different ways of doing cross validation, but the method we will employ is called k-fold cross validation.

k-fold cross validation involves splitting the training data into k subsets (where k is greater than or equal to 2), training the model using k – 1 of those subsets, then running the model on the subset that was not used in the training process. Because all of the data used in the cross validation process is training data, the correct classification for each record is known and so the predicted category can be compared to the actual category. Once all folds have been completed, the average score across all folds is taken as an estimate of how the model will perform on other data. An example of a 3-fold cross validation is shown below:
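
As a minimal, hedged sketch of 3-fold cross validation (using scikit-learn's current model_selection API rather than the older modules in the full script below, and a bundled dataset rather than the competition data):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 3 folds: train on two thirds of the data, score on the remaining third,
# rotate through all three combinations, then average the scores
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)
print("Fold scores:", scores)
print("Estimated accuracy: %.3f" % scores.mean())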

Parameter Tuning

As you may have realized from the earlier description of the XGBoost algorithm – there are quite a few options (parameters) that we need to define to build the model. How many trees to build? How deep should each tree be? How much extra weight will be attached to each misclassified record? Tuning these parameters to get the best results from the model is often one of the most time consuming things that data scientists do. Fortunately, the process can be automated to a large degree so that we do not have to sit there rerunning the model repeatedly and noting down the results. Even better, using the Scikit-Learn package, we can merge the parameter tuning and cross validation steps into one, allowing us to search for the best combination of parameters while using k-fold cross validation to verify the results.

Training the Model

In order to train the model (using cross validation and parameter tuning as outlined above), we first need to define our training dataset – remembering that we previously combined the training and test data to simplify the cleaning and transforming process. To feed these into the model, we also need to split the training data into the three main components – the user IDs (we don’t want to use these for training as they are randomly generated), the features to use for training (X), and the categories we are trying to predict (y).

# Prepare training data for modelling
df_train.set_index('id', inplace=True)
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)

Now that we have our training data ready, we can use GridSearchCV to run the algorithm with a range of parameters, then select the model that has the highest cross validated score based on the chosen measure of performance (in this case accuracy, but there is a range of metrics we could use depending on our needs).

# Grid Search - Used to find best combination of parameters
XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4, 5], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = grid_search.GridSearchCV(estimator=XGB_model, param_grid=param_grid, scoring='accuracy', verbose=10, n_jobs=1, iid=True, refit=True, cv=3)

model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Please note that running this step can take a significant amount of time. Running the algorithm with 25 trees takes around 2.5 minutes for each cross validation on my Macbook Pro. Running the script above with all the options specified will likely take well over an hour.

Making the Predictions

Now that we have trained a model based on the best parameters, the next step is to use the model to make predictions for the records in the testing dataset. Again we need to extract the testing data out of the combined dataset we created for the cleaning and transformation steps, and again we need to separate the main components for the model. After these steps, we use the model created in the previous step to make the predictions.

# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test[['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values

# Make predictions
y_pred = model.predict_proba(X_test)

As you may have noted from the code above, we have used the predict_proba method instead of the usual predict method. This is done because of the way Kaggle will assess the results for this particular competition. Rather than just assessing one prediction for each user, Kaggle will assess up to 5 predictions for each user. In order to maximize the score, we will use the predicted probabilities that predict_proba produces to select the 5 best predictions. Finally, we will write these results to a file that will be created in the same folder as the script.

#Taking the 5 classes with highest probabilities
ids = [] #list of ids
cts = [] #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

#Generate submission
print("Outputting final results...")
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('./submission.csv', index=False)

For those who wish to, you should be able to submit the file produced by this script to Kaggle. The competition has now finished, so you will not receive an official position on the leaderboard, but your results will still be scored and you will be told where you would have finished.

Wrapping Up

Those who are more experienced with data science may realize that this series, as lengthy as it is, does not even scratch the surface of many topics related to data science. Unsupervised learning, association rules mining, text analytics and deep learning are all topics that have not been covered at all. Unfortunately, the full scope of data science and machine learning is not something that can be covered in a blog series. That said, I did have two goals for those reading these articles.

Firstly, I hope this series demystifies some aspects of data science for those who currently see it as a black box. Although one can spend an entire career working in data science and still not master all aspects of it, even a cursory understanding of how machine learning algorithms work helps to clarify what sorts of questions machine learning can answer, and what sorts of questions are problematic.

Secondly, I hope this series encourages some of you to dig deeper and learn more about this topic. Machine learning is a rapidly growing field that is expanding into every aspect of life: recommendation engines on websites, astronomy – where it helps to identify stars and planets, the pharmaceutical industry – where it is being used to predict which molecular structures are likely to produce useful drugs – and, maybe most famously, training self-driving cars to drive in the real world. Whatever your primary interest, there are likely to be machine learning applications being developed or already in use.

[1] There are a range of metrics that can be used to do this. For available metrics in the Scikit Learn package, see here.

Full script:

import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn import cross_validation, decomposition, grid_search
from sklearn.preprocessing import LabelEncoder

####################################################
# Functions #
####################################################
# Remove outliers
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values<=min_val, col_values>=max_val), np.NaN, col_values)

    return df

# Home made One Hot Encoder
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    id_list = df[id_col].drop_duplicates()

    df_counts = df[[id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns = {category:col_name}, inplace=True)

    return new_df

####################################################
# Cleaning #
####################################################
# Import data
print("Reading in data...")
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath, header=0, index_col=None)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath, header=0, index_col=None)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)

####################################################
# Data Transformation #
####################################################
# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)

####################################################
# Feature Extraction #
####################################################
# Add new date related fields
print("Adding new fields...")
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)

####################################################
# Add data from sessions.csv #
####################################################
# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath, header=0, index_col=False)

# Determine primary device
print("Determining primary device...")
sessions_device = sessions[['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).aggregate(np.sum)
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = aggregated_lvl1.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determining secondary device...")
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = remaining.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions[['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')

# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')

####################################################
# Building Model #
####################################################
# Prepare training data for modelling
df_train.set_index('id', inplace=True)
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)

# Training model
print("Training model...")

# Grid Search - Used to find best combination of parameters
XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = grid_search.GridSearchCV(estimator=XGB_model, param_grid=param_grid, scoring='accuracy', verbose=10, n_jobs=1, iid=True, refit=True, cv=3)

model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

####################################################
# Make predictions #
####################################################
print("Making predictions...")

# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test[['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values

# Make predictions
y_pred = model.predict_proba(X_test)

#Taking the 5 classes with highest probabilities
ids = [] #list of ids
cts = [] #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

#Generate submission
print("Outputting final results...")
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('./submission.csv',index=False)

Data Science: A Kaggle Walkthrough – Adding New Data

This article is Part V in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part I, Part II, Part III and Part IV).

Continuing on the walkthrough, in this part we take the data from sessions.csv that we left aside initially and add it to the transformed and expanded data from Part IV.  This part will cover, in brief, all the steps in Parts II – IV.

Understanding the Data

As we did for the user data in training.csv, the first step here is to understand what the data in sessions.csv looks like. Although this file, with over 10 million rows, is too large to display in its entirety in Excel[1], we can still open it in Excel to get an understanding of what columns we have and what at least the first million rows of data look like. Some sample rows are provided below:

user_id     action            action_type   action_detail        device_type   secs_elapsed
4rvqpxoh3h  campaigns         -unknown-     -unknown-            iPhone        375
4rvqpxoh3h  active            -unknown-     -unknown-            iPhone        728
4rvqpxoh3h  create            -unknown-     -unknown-            iPhone
4rvqpxoh3h  notifications     -unknown-     -unknown-            iPhone        187
4rvqpxoh3h  listings          -unknown-     -unknown-            iPhone        154
4rvqpxoh3h  unavailabilities  -unknown-     -unknown-            iPhone        204
4rvqpxoh3h  index             -unknown-     -unknown-            iPhone        21
4rvqpxoh3h  index             -unknown-     -unknown-            iPhone        886
c8mfesvkv0  confirm_email     click         confirm_email_link   iPad Tablet   1371616
c8mfesvkv0  header_userpic    data          header_userpic       iPad Tablet   8672
c8mfesvkv0  create            submit        create_user          iPad Tablet
xwxei6hdk4  dashboard         view          dashboard            iPhone        1355
xwxei6hdk4  header_userpic    data          header_userpic       iPhone        1246
xwxei6hdk4  message_post      message_post                       iPad Tablet
xwxei6hdk4  ask_question      submit        contact_host         iPad Tablet   386
xwxei6hdk4  ask_question      submit        contact_host         iPad Tablet   424
xwxei6hdk4  message_post      message_post                       iPad Tablet   0
xwxei6hdk4  confirm_email     click         confirm_email_link   iPhone        46262

As can be seen, the dataset contains records of user actions, with each row representing one action a user took. Every time a user reviewed search results, updated a wish list or updated their account information, a new row was created in this dataset. Although this data is likely to be very useful for our goal of predicting which country a user will make their first booking in, it also complicates the process of combining this data with the data from training.csv, as it will have to be aggregated so that there is one row per user (as opposed to many rows for each user, currently).

Aside from details of the actions taken, there are a couple of interesting fields in this data. The first is device_type – this field contains the type of device used for the specified action. The second is the secs_elapsed field, which shows us how long (in seconds) the user spent on a particular action.

Both of these fields provide us with potentially important information that could help to more accurately predict which country a user will make a first booking in. For example, it is not difficult to imagine that people spending relatively little time to make a booking on a phone are likely to be making bookings in locations closer to home (i.e. the US) than someone spending more time to make a booking on a desktop computer. Of course this is just a theory that needs to be proven, but it is a good reason to ensure we are capturing this information in our final training dataset.

Cleaning and Transforming the Data

Now that we have a basic understanding of the data, we need to undertake the cleaning and transformation steps. Because of the structure of this data (and for the sake of brevity), we are going to do both of these things at the same time.

The first step is to import the data:

# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath, header=0, index_col=False)

Extract the primary and secondary devices for each user

Remembering that we need to get the final data into a format that can be merged with the data created in Part IV (i.e. a dataset where one row equals one user), the first piece of information we are going to extract is the primary and secondary device for each user. How do we determine what a user’s primary and secondary devices are? We look at how much time they spent on each device. In short we are going to make the following changes to the data:

[Image: primary and secondary device transformation]

One thing to note as we make these transformations is that by aggregating the data this way, we are also implicitly removing the missing values. The code to do this transformation is shown below:

# Determine primary device
print("Determining primary device...")
sessions_device = sessions[['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).sum()
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = aggregated_lvl1.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determining secondary device...")
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = remaining.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)

Determine Counts of Actions

The next thing we are going to do is count how many times each action was taken by each user. This is a two-step process: the first step is to determine the count of each action for each user, and the second is to pivot those counts so that there is one row per user:

Step 1

[Image: actions step 1 transformation]

Step 2

[Image: actions step 2 transformation]

For you Excel buffs out there, the second step might strike you as something that could be achieved using a pivot table – and you would be right. In fact, the custom function that we use to make this transformation uses a pandas method called ‘pivot’. This is important to note for a couple of reasons. The first is that, with all the talk about new data, people who have worked with data mostly (or entirely) using ‘old technology’ like Excel and SQL are often given the impression that their skills are redundant or not useful in modern data science. As this example shows, the ways of thinking about data that you develop working with Excel and SQL are not only relevant, but often extremely useful.

The second reason is that for people (like me) who do not know all the methods available for pandas dataframes off by heart, being able to identify techniques you have used in other programs and languages provides you with a way to find corresponding methods in new languages. I discovered this method by searching for “pandas pivot”, knowing that this way of manipulating data was likely to have some equivalent in pandas.

Looping Through the Actions Columns

Looking at the examples above, you may have realized that the transformation as shown only works for one action column at a time, but in the data we have three action columns: action, action_type and action_detail.

To handle the multiple action columns, we repeat these steps for each column individually, effectively creating three separate tables. Because we have now created tables where each row represents one user, we can now join (another concept SQL users will be very familiar with) these three tables together on the basis of the user id. The full code for these steps is shown below:

# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    id_list = df[id_col].drop_duplicates()

    df_counts = df[[id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns = {category:col_name}, inplace=True)

    return new_df

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions[['user_id', 'action', 'action_type', 'action_detail']].copy()
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')

Combine Data Sets

The final steps are to combine the various datasets we have created into one large dataset. First we combine the two device dataframes (df_primary and df_secondary) to create a device dataframe. Then we combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from sessions.csv. Finally, we combine the sessions dataframe with the user data dataframe from Part IV. The code for the various combinations is shown below:

# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')

A Note on Joins

For those that can read a little bit of code and are familiar with joins in SQL, you may be asking why I am using (full) outer joins for the first two combinations, but an inner join for the final step[2].

The first step requires an outer join because not all users have a secondary device. That is, some users only logged onto Airbnb using one device (or at least one type of device). Doing an outer join here ensures that our dataset includes all users regardless of this fact.

The second step could use an inner or an outer join, as both the device and actions datasets should contain all users. In this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever reason), we will still capture them. You may also notice that after the second step we fill any missing values with 0s to ensure we do not have any NULL values that may have been generated by these outer joins.

For the third step we use an inner join for a key reason – we want our final training dataset to only include users that also have sessions data. Using an inner join here is an easy way to join the datasets and filter for the users with sessions data in one step.
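
A toy illustration of the difference (with made-up users and columns) shows why the join type matters here:

import pandas as pd

device_data = pd.DataFrame({"primary_secs": [100, 250]}, index=["user_a", "user_b"])
actions_data = pd.DataFrame({"action_search": [3, 7]}, index=["user_b", "user_c"])

# join='outer' keeps every user appearing in either frame (gaps become NaN)
print(pd.concat([device_data, actions_data], axis=1, join="outer"))

# join='inner' keeps only the users appearing in both frames
print(pd.concat([device_data, actions_data], axis=1, join="inner"))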

Wrapping Up

In the first four parts of this series, we looked in detail at some of the various steps in the process of building a model. Although these steps should be distinct thought processes that occur for each model building process, hopefully what this article provides is an insight into how some of these steps can be combined if planned out carefully. In relatively few steps, we have taken a dataset containing 10 million rows of user actions data, cleaned it, extracted a bunch of important information, and added it to our user data, ready for training a model.

The other important thing to take away from this article is how useful ‘old school’ ways of thinking about data still are. For all the talk about unstructured data and NoSQL databases, the fact is that knowing how to work with and manipulate old fashioned columns and rows is still as important as ever. Whether it is joins and aggregation in SQL, pivot tables and VLOOKUPS in Excel, or just the general concept of relational data, not only is that knowledge relevant, but it is often extremely useful.

Next Time

In the next piece, we will finally get to the good stuff and train the algorithm to make the final predictions.

[1] Nope, still doesn’t qualify as ‘Big Data’…

[2] For those that do not understand what I mean by inner and outer joins (and are interested in knowing) – stackoverflow comes to the rescue again with this great illustrated answer.
