Brett Romero

Data Inspired Insights

Category: Technology (page 2 of 2)

Data Science: A Kaggle Walkthrough – Data Transformation and Feature Extraction

This article on data transformation and feature extraction is Part IV in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read Part I, Part II and Part III.

Continuing on the walkthrough, in this part we focus on getting the data we cleaned in Part III ready for use in the classification algorithm. These steps are often referred to as data transformation and feature extraction.

Data Transformation and Feature Extraction as a Concept

The main purpose of data transformation and feature extraction is to enhance the data in such a way that it increases the likelihood that the classification algorithm will be able to make meaningful predictions. Unlike the steps taken during cleaning, which are designed to address problems with the raw data (missing and erroneous values, formatting issues etc.), these steps change the values and/or structure of the data (data transformation) and add additional features (feature extraction).

As you might imagine, this is quite an open-ended process, and hence a lot of the value that data scientists provide comes in these steps. There is no textbook or walkthrough that can tell you exactly what steps you should take for a given dataset, that knowledge can come only from experience, curiosity and trial and error. However, we can take a look at some common methods to provide a sense of what is possible. Please keep in mind this is not an exhaustive list of options.

Data Transformation

Covering steps taken to modify the data, data transformation is undertaken with the intention to enhance the ability of the classification algorithm to extract information from the data. Below are a few common data transformation methods used.


A common method for manipulating numeric data, binning or bucketing is when the numerical values in a particular column are converted from a continuous series into fixed ranges. For example, instead of using the age value of all our users, we could place them into buckets such as 15-20 years old, 21-25 years old and so on.

Typically this technique is used to manage ‘noisy data’. To understand what this means, think of the movements of the stock market over time: it goes up and down on an almost daily basis. However, if you are trying to predict the overall direction of the stock market over the next 6 months, these daily movements become kind of irrelevant – what you really want your model to focus on are the movements over longer periods of time. What is more, the essentially random daily movements in stock prices may actually confuse your prediction model – causing less accurate predictions. In this example, the daily movements are the noise and what you want to extract (the longer term direction of the market) is ‘the signal’.

The same logic can be applied to any numerical field in your dataset. If you are concerned that small changes in a given value may simply be representing random ‘noise’, you may want to consider bucketing/binning to remove that noise.


Although normalization can take on a large number of meanings depending on the context, the type of normalization being referred to here is the statistical type – converting the values of a column into a ‘normalized’ range. This could be translating heights from centimeter values anywhere from 100cm to 220cm to a scale where 0 represents the average (mean) height for your dataset and -1/+1 represent one standard deviation from that average. It could be translating those heights into a range of values from 0 to 1, where 0 is the lowest value in your dataset and 1 is the maximum value. There is a number of other methods that can be used here as well.

This type of transformation is more important for certain types of algorithms than others. For some algorithms – like the one we will be using – this type of transformation is not typically necessary. But for other algorithms, the magnitude of the values in each column will impact the calculations. In these cases, it is optimal to convert (‘normalize’) the values in each column onto the same scale to ensure each column is treated the equally. For a more detailed explanation on this subject, this answer from Quora is a good place to start.

Other Mathematical Transformations

In a similar manner to normalization, there is an almost unlimited number of ways that the numerical values of a given column can be transformed such that they are more suitable for the algorithm being used.

To provide one example, arguably the most common transformation (other than normalization) is to use a logarithm function. This transformation is a commonly used method of dealing with exponential data series (i.e. a column where there a lot of low values and relatively few high values). For those wanting to understand this transformation better, the Wikipedia page on this topic has a great illustrated example.

As I am hemorrhaging readers at this point, I won’t go into detail on the various other transformations possible – the key point is to be aware that there is a large range of possibilities here depending on your needs.

One Hot Encoding

Looking at one more example, and the most relevant one for our Kaggle competition, this transformation is one used for categorical data. What this transformation does is take one column with x categories (x must be greater than 2 for this to make sense) and convert it into x columns where each column represents one category in the original column. An illustrated example is shown below:

data transformation

For those familiar with regression modeling, you may recognize this as the same process of creating dummy variables.

Again there are a few reasons for doing this type of transformation. Some algorithms are structured in such a way that they do not handle categorical data very well – particularly when the categories do not have an inherent order (this answer on Stack Overflow does a good job of explaining why). Some other types of algorithms require numerical data to function. The only way to work out whether this transformation will be beneficial is to either read through the documentation for the algorithm you are using or to test it yourself.

Feature Extraction

Often broken down into sub steps of feature construction and feature selection, here we will focus on feature construction. Below are a couple of ways additional features can be constructed and added to your dataset.

Using Hierarchical Information

It will sometimes be the case that data in your dataset represents one level of a particular hierarchy, and that extracting the other implied levels of that hierarchy will provide the model with useful information.

For example, imagine a dataset with a column containing countries. This column allows the algorithm to look for patterns (in combination with all other columns) at the country level. However, by adding a new ‘region’ column based on the country column (Europe, South Asia, North Africa etc.), you may be providing information to the algorithm that allows it look for patterns across countries.

One of the most common ways to do this is with date fields. Take the date fields in the dataset we are working with as an example. By extracting the day of the week, the month of the year or the hour of the day, we could add important information for the algorithm to use. Maybe people who create their accounts in summer months are more likely to make a booking in a warmer country. Maybe people who were first active late at night are more disorganized travelers and are therefore more likely to make a domestic first booking. Additionally, it could be any combination of these factors that makes the difference (e.g. users first active late at night, in the summer months, on a weekday are more likely to travel to Portugal). The point is not to be able to explain why a factor may be important, but to think of as many factors as possible to test, and allow the algorithm to determine what is important and not important.

Adding External Data

One of the aspects of feature extraction that often gets overlooked is how data can be enriched through the addition of new external data. Using techniques such as record linkage, existing datasets can be greatly expanded by adding new data points for a given record. This new data often provides valuable new information that the algorithm can use to make more accurate predictions.

For example, a training dataset that contains a column with countries could be enriched with demographic data about the country such as population, income per capita or land area – all factors that may allow the algorithm to draw conclusions across similar groups of countries on any of those measures.

Relating this concept to the competition we are working through, consider how much more accurately we could predict a first booking country of a user if we could link the data from their Airbnb profile to data from one of their social media profiles (Facebook, Twitter etc.) or even better, from a Tripadvisor or Expedia account.

The key point here is that it is worth investing time looking for ways to add new and useful data to your existing dataset before moving onto the modeling step. Expending your dataset in this manner will often produce far bigger improvements in prediction accuracy than the choice of algorithm or the tuning of the algorithm parameters.

The Importance of Domain Knowledge

One of the things that may have occurred to you as you read through the various ways to modify and expand a dataset is how are you supposed to know what will help or not?

This is where knowledge about the data you are using and what it represents becomes so important. This knowledge – referred to as domain knowledge – helps guide this entire process, including what was covered in Part III, cleaning the data.

Understanding how the data was collected helps to provide insight into potential errors in the data that might need to be addressed or shortcomings in the way the data was sampled (sample selection bias/errors). Understanding the relevant industry or market can also provide a range of insights including:

  • what additional information is available to expand your dataset
  • what information may help to increase prediction accuracy and what is likely to be irrelevant
  • if the model makes intuitive sense (e.g. can you predict the likelihood of a waking up with a headache based on whether someone slept with their shoes on?[1]), and
  • if the industry or market is changing in such a way that it is likely to make the model redundant in the near future.

In practical terms, where does this leave aspiring data scientists?

The first thing is to realize that, obviously, it is not possible to be a domain expert for every domain. Acknowledging this limitation is important as it forces a second realization – you will almost always need to seek out this expertise. For most of us that means involving and utilizing people who are domain experts when constructing your dataset and model. Having access to that expertise is likely to be the difference between a model that gets thrown out in 6 months and one that fundamentally improves a business and/or fulfills a customer need.

Step by Step

After all the theory, let’s put some of these techniques into practice.

Transforming Categorical Data

The first step we are going to undertake is some One Hot Encoding – replacing the categorical fields in the dataset with multiple columns representing one value from each column.

To do this, the Scikit Learn library comes with a One Hot Encoder method that we could use to do these transformations, but it is often instructive to write your own function, particularly if it is a relative simple one like this. The code snippet below creates a simple function to do the encoding for a specified column, and then uses that function in a loop to convert all the categorical columns (and then delete the original columns).

# Home made One Hot Encoding function
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)

Creating New Features

From Part II of this series, one of the things we observed about the training (and test) datasets is that there is not a huge number of columns to work with. This limits what new features we can add based on the existing data. However, two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in. The code for extracting a range of different data points from these two date columns (and then deleting the original date columns) is shown below:

# Add new date related fields
print("Adding new fields...")
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)

Wrapping Up

In two relatively simple steps, we have changed our training dataset from 14 columns to 163 columns. Although this seems like a lot more information, most of this expansion was caused by the One Hot Encoding, which is not adding more information, but simply expanding out the existing information. We have not added any external data, and I didn’t even really investigate what information we could have extracted from the other non-date columns.

Again, this process is open ended, so there is an almost unlimited range of possibilities that we have not even really begun to explore. As such, if you see an additional transformation or have an idea for the addition of a new feature, please feel free to let me know in a comment!

Next Time

In the next piece, we will look at the data in sessions.csv that we left aside initially and see how we can add that data to our training dataset.


[1] This is an example of the existence of a confounding factor. A model predicting whether someone will wakeup with a headache based on whether they slept with their shoes on ignores that there is a more logical explanation for the headaches – in this case that both the headaches and sleeping with shoes on are caused by a third factor – going to bed drunk.


Data Science: A Kaggle Walkthrough – Cleaning Data

This article on cleaning data is Part III in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, it is recommended that you go back and read Part I and Part II.

In this part we will focus on cleaning the data provided for the Airbnb Kaggle competition.

Cleaning Data

When we talk about cleaning data, what exactly are we talking about? Generally when people talk about cleaning data, there are a few specific things they are referring to:

  1. Fixing up formats – Often when data is saved or translated from one format to another (for example in our case from CSV to Python), some data may not be translated correctly. We saw a good example of this in the last article in csv. The timestamp_first_active column contained numbers like 20090609231247 instead of timestamps in the expected format: 2009-06-09 23:12:47. A typical job when it comes to cleaning data is correcting these types of issues.
  2. Filling in missing values – As we also saw in Part II, it is quite common for some values to be missing from datasets. This typically means that a piece of information was simply not collected. There are several options for handling missing data that will be covered below.
  3. Correcting erroneous values – For some columns, there are values that can be identified as obviously incorrect. This may be a ‘gender’ column where someone has entered a number, or an ‘age’ column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or assumed to be missing.
  4. Standardizing categories – More of a subcategory of ‘correcting erroneous values’, this type of data cleansing is so common it is worth special mention. In many (all?) cases where data is collected from users directly – particularly using free text fields – spelling mistakes, language differences or other factors will result in a given answer being provided in multiple ways. For example, when collecting data on country of birth, if users are not provided with a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values to ensure that there is only one version of each value.

Options for Dealing with Missing Data

Missing data in general is one of the trickier issues that is dealt with when cleaning data. Broadly there are two solutions:

1. Deleting/Ignoring rows with missing values

The simplest solution available when faced with missing values is to not use the records with missing values when training your model. However, there are some issues to be aware of before you starting deleting masses of rows from your dataset.

The first is that this approach only makes sense if the number of rows with missing data is relatively small compared to the dataset. If you are finding that you will be deleting more than around 10% of your dataset due to rows having missing values, you may need to reconsider.

The second issue is that in order to delete the rows containing missing data, you have to be confident that the rows you are deleting do not contain information that is not contained in other rows. For example, in the current Airbnb dataset we have seen that many users have not provided their age. Can we assume that the people who chose not to provide their age are the same as the users who did? Or are they likely to represent a different type of user, perhaps an older and more privacy conscious user, and therefore a user that is likely to make different choices on which countries to visit? If the answer is the latter, we probably do not want to just delete the records.

2. Filling in the Values

The second broad option for dealing with missing data is to fill the missing values with a value. But what value to use? This depends on a range of factors, including the type of data you are trying to fill.

If the data is categorical (i.e. countries, device types, etc.), it may make sense to simply create a new category that will represent ‘unknown’. Another option may be to fill the values with the most common value for that column (the mode). However, because these are broad methods for filling the missing values, this may oversimplify your data and/or make your final model less accurate.

For numerical values (for example the age column) there are some other options. Given that in this case using the mode to fill values makes less sense, we could instead use the mean or median. We could even take an average based on some other criteria – for example filling the missing age values based on an average age for users that selected the same country_destination.

For both types of data (categorical and numerical), we can also use far more complicated methods to impute the missing values. Effectively, we can use a similar methodology that we are planning to use to predict the country_destination to predict the values in any of the other columns, based on the columns that do have data. And just like with modeling in general, there are an almost endless number of ways this can be done, which won’t be detailed here. For more information on this topic, the orange Python library provides some excellent documentation.

Step by Step

With that general overview out of the way, let’s start cleaning the Airbnb data. In relation to the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files – train_users_2.csv and test_users.csv and leave aside sessions.csv.

Loading in the Data

The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from two files train_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once[1].

import pandas as pd

# Import data
print("Reading in data...")
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath, header=0, index_col=None)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath, header=0, index_col=None)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

Clean the Timestamps

Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?

The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.

Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see here). For the timestamp_first_active column, the date format provided is different so we adjust our expression accordingly.

Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because the date_account_created column is sometimes empty, we replace the empty values with the value in the date_account_created column using the fillna function. The code for this step is provided below:

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)

Remove booking date field

Those following along and/or paying attention may have noticed that in the original dataset, there are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in training_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.

This means that this column is not going to be useful for predicting which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chances that the model predicts NDF as those are the records without dates in the training dataset. The code for removing the column is provided below:

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)

Clean the Age column

As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or too low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, but implies we do not know the age value. In other words we are changing the incorrect values into missing values. To do this, we create a simple function that intakes a dataframe (table), a column name, a maximum acceptable value (90) and a minimum acceptable value (15). This function will then replace the values in the specified column that are outside the acceptable range with NaN.

Again from Part II we know there were also a significant number of users who did not provide their age at all – so they also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.

The code for these steps is shown below:

import numpy as np

# Remove outliers function
def remove_outliers(df, column, min_val, max_val):
col_values = df[column].values
df[column] = np.where(np.logical_or(col_values<=min_val, col_values>=max_val), np.NaN, col_values)
return df

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)

As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column. We are selecting this simple method for two main reasons:

  1. Clarity – this series of articles is going to be long enough without adding the complication of a complex methodology for imputing missing ages.
  2. Questionable results – in my testing during the actual competition, I did test several more complex imputation methodologies. However, none of the methods I tested actually produced a better end result than the methodology outlined above.

Identify and fill additional columns with missing values

From more detailed analysis of the data, you may have also realized there is one more column that has missing values – the first_affiliate_tracked column. In the same way we have been filling in the missing values in other columns, we now fill in the values in this column.

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)

Sample Output

So what does the data look like after all these changes? Here is a sample of some rows from our cleaned dataset:

gxn3p5htnndirectdirect-1.0NDF2010-06-28 00:00:00untrackedChromeMac Desktop-unknown-enWeb0facebook2009-03-19 04:32:55
820tgsjxq7seogoogle38.0NDF2011-05-25 00:00:00untrackedChromeMac DesktopMALEenWeb0facebook2009-05-23 17:48:09
4ft3gnwmtxdirectdirect56.0US2010-09-28 00:00:00untrackedIEWindows DesktopFEMALEenWeb3basic2009-06-09 23:12:47
bjjt8pjhukdirectdirect42.0other2011-12-05 00:00:00untrackedFirefoxMac DesktopFEMALEenWeb0facebook2009-10-31 06:01:29
87mebub9p4directdirect41.0US2010-09-14 00:00:00untrackedChromeMac Desktop-unknown-enWeb0basic2009-12-08 06:11:05
osr2jwljorotherother-1.0US2010-01-01 00:00:00omgChromeMac Desktop-unknown-enWeb0basic2010-01-01 21:56:19
lsw9q7uk0jothercraigslist46.0US2010-01-02 00:00:00untrackedSafariMac DesktopFEMALEenWeb0basic2010-01-02 01:25:58
0d01nltbrsdirectdirect47.0US2010-01-03 00:00:00omgSafariMac DesktopFEMALEenWeb0basic2010-01-03 19:19:05
a1vcnhxeijothercraigslist50.0US2010-01-04 00:00:00untrackedSafariMac DesktopFEMALEenWeb0basic2010-01-04 00:42:11
6uh8zyj2gnothercraigslist46.0US2010-01-04 00:00:00omgFirefoxMac Desktop-unknown-enWeb0basic2010-01-04 02:37:58
yuuqmid2rpothercraigslist36.0US2010-01-04 00:00:00untrackedFirefoxMac DesktopFEMALEenWeb0basic2010-01-04 19:42:51
om1ss59ys8othercraigslist47.0NDF2010-01-05 00:00:00untracked-unknown-iPhoneFEMALEenWeb0basic2010-01-05 05:18:12
k6np330cm1directdirect-1.0FR2010-01-05 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-05 06:08:59
dy3rgx56cuothercraigslist37.0NDF2010-01-05 00:00:00linkedFirefoxMac DesktopFEMALEenWeb0basic2010-01-05 08:32:59
ju3h98ch3wothercraigslist36.0NDF2010-01-07 00:00:00untrackedMobile SafariiPhoneFEMALEenWeb0basic2010-01-07 05:58:20
v4d5rl22pxdirectdirect33.0CA2010-01-07 00:00:00untrackedChromeWindows DesktopFEMALEenWeb0basic2010-01-07 20:45:55
2dwbwkx056othercraigslist-1.0NDF2010-01-07 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-07 21:51:25
frhre329auothercraigslist31.0US2010-01-07 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-07 22:46:25
cxlg85pg1rseofacebook-1.0NDF2010-01-08 00:00:00-1-unknown-Other/Unknown-unknown-enWeb0basic2010-01-08 01:56:41
gdka1q5ktddirectdirect29.0FR2010-01-10 00:00:00untrackedChromeMac DesktopFEMALEenWeb0basic2010-01-10 01:08:17

Is that all?

Those more experienced with working with data may be thinking that we have not done all that much cleaning with this data – and you would be right. One of the nice things about Kaggle competitions is that the data provided does not require all that much cleaning as that is not what the providers of the data want participants to focus on. Many of the problems that would be found in real world data (as covered earlier) do not exist in this dataset, saving us significant time.

However, what this relatively easy cleaning process also tells us is that even when datasets are provided with the intention of needing no or minimal cleaning, there is always something that needs to be done.

Next Time

In the next piece, we will focus on transforming the data and feature extraction, allowing us to create a training dataset that will hopefully allow the model to make better predictions. To make sure you don’t miss out, use the subscription feature below.


[1] For those with more data mining experience you may realize that combining the test and training data at this stage is not best practice. The best practice would be to avoid using the test dataset in any of the data preprocessing or model tuning/validation steps to avoid over fitting. However, in the context of this competition, because we are only trying to create the model to classify one unchanging dataset, simply maximizing the accuracy of the model for that dataset is the primary concern.


Data Science: A Kaggle Walkthrough – Understanding the Data

This article on understanding the data is Part II in a series looking at data science and machine learning by walking through a Kaggle competition. Part I can be found here.

Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we focus on understanding the data provided for the Airbnb Kaggle competition.

Reviewing the Data

In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):

  1. What features (columns) does the dataset contain?
  2. How many records (rows) have been provided?
  3. What format is the data in (e.g. what format are the dates provided, are there numerical values, what do the different categorical values look like)?
  4. Are there missing values?
  5. How do the different features relate to each other?

For this competition, Airbnb have provided 6 different files. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:

  1. train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries. Each row represents one user with the columns containing various information such the users’ ages and when they signed up. This is the primary dataset that we will use to train the model.
  2. test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
  3. sessions.csv – This data is supplementary data that can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a  wish list, ran a search etc.) taken by the users in both the testing and training datasets above.

With this information in mind, an easy first step in understanding the data is reviewing the information provided by the data provider – Airbnb. For this competition, the information can be found here. The main points (aside from the descriptions of the columns) are as follows:

  • All the users in the data provided are from the USA.
  • There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’,’DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’.
  • ‘other’ means there was a booking, but in a country not included in the list, while ‘NDF’ means there was not a booking.
  • The training and test sets are split by dates. In the test set, you will predict the destination country for all the new users with first activities after 7/1/2014
  • In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset dates back to 2010.

After absorbing this information, we can start looking at the actual data. For now we will focus on the train_users_2.csv file only.

Table 1 – Three rows (transposed) from train_users_2.csv

Column NameExample 1Example 2Example 3
first_device_typeWindows DesktopiPhoneWindows Desktop

Looking at the sample of three records above provides us with a few key pieces of information about this dataset. The first is that at least two columns have missing values – the age column and date_first_booking column. This tells us that before we use this data for training a model, these missing values need to be filled or the rows excluded altogether. These options will be discussed in more detail in the next part of this series.

Secondly, most of the columns provided contain categorical data (i.e. the values represent one of some fixed number of categories). In fact 11 of the 16 columns provided appear to be categorical. Most of the algorithms that are used in classification do not handle categorical data like this very well, and so when it comes to the data transformation step, we will need to find a way to change this data into a form that is more suited for classification.

Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47. This formatting will need to be corrected if we are to use the date values.

Diving Deeper

Now that we have gained a basic understanding of the data by looking at a few example records, the next step is to start looking at the structure of the data.

Country Destination Values

Arguably, the most important column in the dataset is the one the model will try to predict – country_destination. Looking at the number of records that fall into each category can help provide some insights into how the model should be constructed as well as pitfalls to avoid.

Table 2 – Users by Destination

DestinationRecords% of Total
Grand Total213,451100.0%

Looking at the breakdown of the data, one thing that immediately stands out is that almost 90% of users fall into two categories, that is, they are either yet to make a booking (NDF) or they made their first booking in the US. What’s more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.

Table 3 – Users by Destination and Year


For modeling purposes, this type of split means a couple of things. Firstly, the spread of categories has changed over time. Considering that our final predictions will be made against user data from July 2014 onwards, this change provides us with an incentive to focus on more recent data for training purposes, as it is more likely to resemble the test data.

Secondly, because the vast majority of users fall into 2 categories, there is a risk that if the model is too generalized, or in other words not sensitive enough, it will select one of those two categories for every prediction. A key step will be ensuring the training data has enough information to ensure the model will predict other categories as well.

Account Creation Dates

Let’s now move onto the date_account_created column to see how the values have changed over time.

Chart 1 – Accounts Created Over Time

Chart 1 provides excellent evidence of the explosive growth of Airbnb, averaging over 10% growth in new accounts created per month. In the year to June 2014, the number of new accounts created was 125,884 – 132% increase from the year before.

But aside from showing how quickly Airbnb has grown, this data also provides another important insight, the majority of the training data provided comes from the latest 2 years. In fact, if we limited the training data to accounts created from January 2013 onwards, we would still be including over 70% of all the data. This matters because, referring back to the notes provided by Airbnb, if we want to use the data in sessions.csv we would be limited to data from January 2014 onwards. Again looking at the numbers, this means that even though the sessions.csv data only covers 11% of the time period (6 out of 54 months), it still covers over 30% of the training data – or 76,466 users.

Age Breakdown

Looking at the breakdown by age, we can see a good example of another issue that anyone working with data (whether a Data Scientist or not) faces regularly – data quality issues. As can be seen from Chart 2, there are a significant number of users that have reported their ages as well over 100. In fact, a significant number of users reported their ages as over 1000.

Chart 2 – Reported Ages of Users

So what is going on here? Firstly, it appears that a number of users have reported their birth year instead of their age. This would help to explain why there are a lot of users with ‘ages’ between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age as 105 and 110. This is harder to explain but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Either way, these values would appear to be errors that will need to be addressed.

Additionally, as we saw in the example data provided above, another issue with the age column is that sometimes age has not been reported at all. In fact, if we look across all the training data provided, we can see a large number of missing values in all years.

Table 4 – Missing Ages

YearMissing ValuesTotal Records% Missing

When we clean the data, we will have to decide what to do with these missing values.

First Device Type

Finally, one last column that we will look at is the first_device_used column.

Table 5 – First Device Used

Device20102011201220132014All Years
Mac Desktop37.2%40.4%47.2%44.2%37.3%42.0%
Windows Desktop21.6%25.2%37.7%36.9%31.0%34.1%
Android Phone1.1%1.2%0.7%0.4%2.6%1.3%
Android Tablet0.4%0.4%0.3%0.5%0.9%0.6%
Desktop (Other)0.4%0.4%0.4%0.6%0.7%0.6%
SmartPhone (Other)0.0%0.1%0.1%0.0%0.0%0.0%

The interesting thing about the data in this column is how the types of devices used have changed over time. Windows users have increased significantly as a percentage of all users. iPhone users have tripled their share, while users using ‘Other/unknown’ devices have gone from the second largest group to less than 5% of users. Further, the majority of these changes occurred between 2011 and 2012, suggesting that there may have been a change in the way the classification was done.

Like with the other columns we have reviewed above, this change over time reinforces the presumption that recent data is likely to be the most useful for building our model.

Other Columns

It should be noted that although we have not covered all of them here, having some understanding of all the data provided in a dataset is important for building an accurate classification model. In some cases, this may not be possible due to the presence of a very large number of columns, or due to the fact that the data has been abstracted (that is, the data has been converted into a different form). However, in this particular case, the number of columns is relatively small and the information is easily understandable.

Next Time

Now that we have taken the first step – understanding the data – in the next piece, we will start cleaning the data to get it into a form that will help to optimize the model’s performance.


Data Science: A Kaggle Walkthrough – Introduction

I have spent a lot of time working with spreadsheets, databases, and data more generally. This work has led to me having a very particular set of skills, skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that’ll be the end of it. I will not look for you, I will not pursue you. But if you don’t, I will look for you, I will find you, and I will kill you.

The badassery of Liam Neeson aside, although I have spent years working with data in a range of capacities, the skills and techniques required for ‘data science’ are a very specific subset that do not tend to come up in too many jobs. What is more, data science tends to involve a lot more programming than most other data related work and this can be intimidating for people who are not coming from a computer science background. The problem is, people who work with data in other contexts (e.g. economics and statistics), as well as those with industry specific experience and knowledge, can often bring different and important perspectives to data science problems. Yet, these people often feel unable to contribute because they do not understand programming or the black box models being used.

Something that has nothing to do with data science

Therefore, in a probably futile attempt to shed some light on this field, this will be the first part in a multi-part series looking at what data science involves and some of the techniques most commonly used. This series is not intended to make everyone experts on data science, rather it is intended to simply try and remove some of the fear and mystery surrounding the field. In order to be as practical as possible, this series will be structured as a walk through of the process of entering a Kaggle competition and the steps taken to arrive at the final submission.

What is Kaggle?

For those that do not know, Kaggle is a website that hosts data science problems for an online community of data science enthusiasts to solve. These problems can be anything from predicting cancer based on patient data, to sentiment analysis of movie reviews and handwriting recognition – the only thing they all have in common is that they are problems requiring the application of data science to be solved.

The problems on Kaggle come from a range of sources. Some are provided just for fun and/or educational purposes, but many are provided by companies that have genuine problems they are trying to solve. As an incentive for Kaggle users to compete, prizes are often awarded for winning these competitions, or finishing in the top x positions. Sometimes the prize is a job or products from the company, but there can also be substantial monetary prizes. Home Depot for example is currently offering $40,000 for the algorithm that returns the most relevant search results on

Despite the large prizes on offer though, many people on Kaggle compete simply for practice and the experience. The competitions involve interesting problems and there are plenty of users who submit their scripts publically, providing an excellent opportunity for learning for those just trying to break into the field. There are also active discussion forums full of people willing to provide advice and assistance to other users.

What is not spelled out on the website, but is assumed knowledge, is that to make accurate predictions, you will have to use machine learning.

Machine Learning

When it comes to machine learning, there is a lot of general misunderstanding about what this actually involves. While there are different forms of machine learning, the one that I will focus on here is known as classification, which is a form of ‘supervised learning’. Classification is the process of assigning records or instances (think rows in a dataset) to a specific category in a pre-determined set of categories. Think about a problem like predicting which passengers on the Titanic survived (i.e. there are two categories – ‘survived’ and ‘did not survive’) based on their age, class and gender[1].

Titanic Classification Problem

[table “titanic_example” not found /]

Referring specifically to ‘supervised learning’ algorithms, the way these predictions are made is by providing the algorithm with a dataset (typically the larger the better) of ‘training data’. This training data contains all the information available to make the prediction as well as the categories each record corresponds to. This data is then used to ‘train’ the algorithm to find the most accurate way to classify those records for which we do not know the category.

Training Data


Although that seems relatively straightforward, part of what makes data science such a complex field is the limitless number of ways that a predictive model can be built. There are a huge number of different algorithms that can be trained, mostly with weird sounding names like Neural Network, Random Forest and Support Vector Machine (we will look at some of these in more detail in future installments). These algorithms can also be combined to create a single model. In fact, the people/teams that end up winning Kaggle competitions often combine the predictions of a number of different algorithms.

To make things more complicated, within each algorithm, there is a range of parameters that can be adjusted to significantly alter the prediction accuracy, and these parameters will vary for each classification problem. Finding the optimal set of parameters to maximize accuracy is often an art in itself.

Finally, just feeding the training data into an algorithm and hoping for the best is typically a fast track to poor performance (if it works at all). Significant time is needed to clean the data, correct formats and add additional ‘features’ to maximize the predictive capability of the algorithm. We will go into more detail on both of these requirements in future installments.

OK, so now let’s put all this into context by looking at the competition I entered, provided by Airbnb. The aim of the competition was to predict the country that users will make their first booking in, based on some basic user profile data[2]. In this case, the categories were the different country options and an additional category for users that had not made a previous booking through Airbnb. The training data was a set of users for whom we were provided with the correct category (i.e. what country they made their first booking in). Using the training data, I was required to train the model to accurately predict the country of first booking, and then submit my predictions for a set of users for whom we did not know the outcome.


The aim of this series is to walk through the process of assessing and analyzing data, cleaning, transforming and adding new features, constructing and testing a model, and finally creating final predictions. The primary technology I will be using as I walk through this is Python, in combination with Excel/Google Sheets to analyze some of the outputs. Why Python? There are several reasons:

  1. It is free and open source.
  2. It has a great range of libraries (also free) that provide access to a large number of machine learning algorithms and other useful tools. The libraries I will primarily use are numpy, pandas and sklearn.
  3. It is very popular, meaning when I get stuck on a problem, there is usually plenty of material and documentation to be found online for help.
  4. It is very fast (primarily the reason I have chosen Python over R).

For those that are interested in following this series but do not have a programming background, do not panic – although I will show code snippets as we go – being able to read the code is not vital to understanding what is happening.

Next Time

In the next piece, we will start looking at the data in more detail and discuss how we can clean and transform it, to help optimize the model performance.


[1] This is an actual competition on Kaggle at the moment (no prizes are awarded, it is for experience only).

[2] The data has been anonymized so that users cannot be identified


Web Analytics – Looking Under the Hood

On occasion I get the sense from bloggers that talking about your traffic statistics is a bit like talking about salary – not something to be done amongst polite company. However, unlike discussing pay, which can generate bad feelings, jealousy, poor morale and a range of other negative side effects, discussing website stats should provide a great learning opportunity for everyone taking part. With that said, in the name of transparency, let me offer a peak under the hood here at

Overall Traffic

For those that have not looked at web traffic statistics, first a quick introduction. When it comes to web traffic, there are two primary measures of volume – sessions and page views. A session is a continuous period of time that one user spends on a website. One session can result in multiple page views – or just the one if the user leaves after reading one article as is often the case. Chart 1 below shows the traffic to, as measured in sessions per day.

Chart 1 – All Traffic – Daily

There are a couple of large peaks worth explaining in this chart. The first peak, on 3 November 2015, was the day I discovered just how much traffic can generate. Posting to the TrueReddit subreddit, I posted what, to that point, had been by far my most popular article – 4 Reasons Working Long Hours is Crazy. The article quickly gained over 100 upvotes and, over the course of the day, generated well over 500 sessions. To put that in perspective, the traffic generated from that one post on Reddit in one day is greater than all traffic from LinkedIn and Twitter combined… for the entire time the blog has been online.

The second big peak on 29 December 2015 was also a Reddit generated spike (in fact, all four spikes post 3 November were from Reddit). In this instance it was the posting of the Traffic Accidents Involving Cyclists visualization to two subreddits – the DataIsBeautiful subreddit and the Canberra subreddit.

Aside from these large peaks though, the data as represented in Chart 1 is a bit difficult to decipher – there is too much noise on a day-to-day basis to really see what is going on. Chart 2 shows the same data at a weekly level.

Chart 2 – All Traffic – Weekly

Looking at the weekly data the broader trend seems to show two different periods for the website. The first period, from March to around August has more consistent traffic, around 200 sessions a week, but with smaller spikes. The second period, from August onwards shows less consistent traffic, around 50 sessions a week, but with much larger spikes. But how accurate is this data? Let’s break some of the statistics down.

Breakdown by Channel

When looking at web traffic using Google Analytics, there are a couple of breakdowns worth looking at. The first is the breakdown by ‘channel’ – or how users got to your website for a given session. The four channels are:

  1. Direct – the user typed your website URL directly into the address bar
  2. Referral – the user navigated to your site from another (non-social media) website by clicking on a link
  3. Social – the user accessed your website from a social media website (Facebook, Twitter, Reddit, LinkedIn and so on)
  4. Organic Search – a user searched for something in a search engine (primarily Google) and clicked on a search result to access your site.

The breakdown of sessions by channel for is shown in Table 1 below:

Table 1 – Breakdown by Channel

Channel Grouping








Organic Search




Referral Traffic

Looking at referral traffic specifically, Google Analytics allows you to view which specific sites you are getting referral traffic from. This is shown in Table 2.

Table 2 – Top Referrers

Rank Source






















Total All Referral Traffic


Looking at the top 10 referrers to, the first thing you may notice is that these site addresses look a bit… fake. You would be right. What you are seeing above is a prime example of what is known as ‘referrer spam’. In order to generate traffic to their sites, some unscrupulous people use a hack that tricks Google Analytics into recording visitors to your site coming from a URL they want you to visit. In short, they are counting on you looking at this data, getting curious and trying to work out where all this traffic is coming from. Over time these fake hits can build up to significant levels.

There are ways to customize your analytics to exclude traffic from certain domains, and initially I was doing this. However, I quickly realized that this spam comes from an almost unlimited number of domains and trying to block them all is basically a waste of time.

Looking at the full list of sites that have ‘referred’ traffic to my site, I can actually only find a handful of genuine referrals. These are shown in Table 3.

Table 3 – Genuine Referrers

Rank Source












Total Genuine Referral Traffic


Total Referrer Spam


What does the total traffic look like if I exclude all the referrer spam? Chart 3 below shows the updated results.

Chart 3 – All Traffic Excluding Referrals

As can be seen, a lot of the traffic in the period March through August was actually coming from referrer spam. Although May still looks to have been a strong month, April, June and July now appear to be hovering around that baseline 50 sessions a month.

Search Traffic

Search traffic is generally the key channel for website owners in the long term. Unlike traffic from social media or from referrals, it is traffic that is generated on an ongoing basis without additional effort (posting, promotion and so on) on the part of the website. As you would expect though, to get to the first page of search results for any combination of key words that is searched regularly is very difficult. In fact it is so difficult, an entire industry has developed around trying to achieve this – Search Engine Optimization or SEO.

For, search traffic has been difficult to come by for the most part. Below is a chart showing all search traffic since the website started:

Chart 4 – Search Traffic – All

Keeping in mind the y-axis in this chart is on a smaller scale than the previous charts, there doesn’t seem to be much pattern to this data. August again seemed to be a strong month, as well as the weeks in late May and early June. Recent months have been flatter, but more consistent.

Going one step further, Table 4 shows the keywords that were searched by users to access

Table 4 – Top Search Terms

Rank Keyword


1 (not provided)


2 beat with a shovel the weak google spots




4 непереводимая.рф


5 “why you probably don’t need a financial advisor”






8 vitaly rules google ☆*:.。.゚゚・*ヽ(^ᴗ^)丿*・゚゚.。.:*☆ ¯\_(ツ)_/¯(•ิ_•ิ)(ಠ益ಠ)(ಥ‿ಥ)(ʘ‿ʘ)ლ(ಠ_ಠლ)( ͡° ͜ʖ ͡°)ヽ(゚д゚)ノʕ•̫͡•ʔᶘ ᵒᴥᵒᶅ(=^. .^=)oo




10 ghost spam is free from the politics, we dancing like a paralytics


Again, we see something unexpected – most of the keywords are actually URLs or nonsensical phrases (or both). As you might suspect, this is another form of spam. Other website promoters are utilizing another hack – this one tricks Google Analytics into recording a search session, with the keyword being a message or URL the promoter wants to display. Looking at the full list, the only genuine search traffic appears be the records for which keywords are not provided[1]. Chart 5 shows search traffic with the spam excluded.

Chart 5 – Search Traffic – Spam Removed

With the spam removed, we see something a little bit more positive. After essentially nothing from March through July, we see a spike in activity in August and September, before falling back to a new baseline of around 5-10 sessions per week. Although this is obviously still miniscule, it does suggest that the website is starting to show up regularly in people’s searches.

Referring back to the total sessions over time, Chart 6 shows how removing the spam search impacts our overall number of sessions chart.

Chart 6 – All Traffic Excluding Referrals and Spam Search

Social Traffic and the Reddit Effect

As was shown in Table 1, one of the two main sources of (real) traffic for the website is social media.

Social media provides a real bonus for people who are starting from zero. Most people now have large social networks they can utilize, allowing them to get their content in front of a lot of people from a very early stage. That said, there is a line and spamming your friends with content continuously is more likely to get you muted than generate additional traffic.

Publicizing content on social media can also be a frustrating experience. Competing against a never-ending flood of viral memes and mindless, auto-generated content designed specifically to generate clicks, can often feel like a lost cause. However, even though it seems like posts simply get lost amongst the tsunami of rubbish, social media is still generally a good indicator of how ‘catchy’ a given article is. Better content will almost always generate more likes/retweets/shares.

In terms of the effectiveness of each social media platform, Reddit and Facebook have proven to be the most effective for generating traffic by some margin. Table 5 shows sessions by social media source.

Table 5 – Sessions by Social Media Source

Rank Social Network


1 Reddit


2 Facebook


3 Twitter


4 LinkedIn


5 Blogger


6 Google+


7 Pocket


When looking at the above data, also keep in mind, I only started posting to Reddit at the start of November, effectively giving Facebook a 7 month head start. This means Reddit is by far the most effective tool I have found to date to get traffic to the website. However, there is a catch to posting on Reddit – the audience can be brutal.

Generally on Facebook, Twitter and LinkedIn, people who do not agree with your article will just ignore it. On Reddit, if people do not agree with you – or worse still, if they do not like your writing – they will comment and tell you. They will not be delicate. They will down vote your post (meaning they are actively trying to discourage other people from viewing it). Finally, just to be vindictive, they will down vote any comments you make as well. If you are planning to post on Reddit, make sure you read the rules of the subreddit (many explicitly ban people from promoting their own content) and try to contribute in ways that are not just self‑promotional.

Pages Visited

Finally, let’s look at one final breakdown for Table 5 shows the top 10 pages viewed on

Table 6 – 10 Most Viewed Pages

Rank Page


1 /


2 /wordpress/


3 /wordpress/4-reasons-working-long-hours-is-crazy/


4 /cyclist-accidents-act/


5 /wordpress/climbing-mount-delusion-the-path-from-beginner-to-expert/


6 /wordpress/the-dark-side-of-meritocracy/


7 /wordpress/why-australians-love-fosters-and-other-beer-related-stories/


8 /blog.html


9 /?from=


10 /wordpress/visualizations/


As mentioned earlier, 4 Reasons Working Long Hours is Crazy has been by some margin my popular article. Although Reddit gave this article a boost traffic wise, it was also by some margin the best performing article I have posted to Reddit with over 100 upvotes. The next best performing, the Traffic Accidents Involving Cyclists visualization, only managed 20 upvotes.


As I mentioned at the outset, web traffic statistics tend to be a subject that is not openly discussed all that often. As a result, I have little idea how good or bad these statistics are. Given I have made minimal effort to promote my blog, generate back links (incoming links from other websites) or get my name out there by guest blogging, I suspect that these numbers are pretty unimpressive in the wider scheme of things. Certainly I am not thinking about putting up a pay wall any time soon anyway.

As unimpressive as the numbers may be though, I hope they have provided an interesting glimpse into the world of web analytics and, for those other bloggers out there, some sort of useful comparison.


Spotted something interesting that I missed? Please leave a comment!


[1] For further information on why the keywords are often not provided, this article has a good explanation.

Traffic Accidents Involving Cyclists in the ACT

I’ve had a few days off lately and I decided to try something a bit different. Instead of writing an(other) lengthy article, I thought I would go back to my roots and actually look at some data. To that end I recently discovered a website for open data in Australia, This website has literally thousands of interesting datasets released from all levels of government, covering everything from the tax bills of Australia’s largest companies to the locations of trees in Ballarat.

One of the first datasets that caught my eye was one published by the Australian Capital Territory (ACT) Government on traffic accidents involving cyclists. For those that don’t know, Canberra (the main city in the ACT) is a very bike friendly city and is home to a large number of recreational and more serious cyclists, so seeing where the accidents were/are occurring was something I thought would be interesting.

Using a few new things I have not used before (primarily Mapbox and leaflet.js), I put (slapped?) together an interactive map that uses the data provided and also gives you a few different ways of viewing it. The full version of the map can be accessed by clicking the picture below:



See a bug? Found it particularly useful? Hate it? Leave a comment below!

Excel Tips – Array Functions

Excel’s array functions probably rival pivot tables for the title of most misunderstood and underutilized features available to Excel users. That is a shame as they are powerful tools that can be used to simply and elegantly address some of the key problems that users will regularly encounter. Although there is an almost endless list of ways in which array functions can be used, below I discuss three scenarios that I regularly encountered in which array functions can be useful.

Ensuring Formula Consistency

One of the simplest and most useful ways to utilize array functions is to ensure the consistency of formulae in large spreadsheets. There are numerous infamous examples where simple formula errors have undermined the credibility of analyses done in Excel (including a recent popular economics paper) and although this method (or any method for that matter) can’t prevent all errors, it can help to minimize the places where things can go wrong.

This method can be applied to any situation where you have a column or row in which you need to apply the same formula repeatedly (anywhere you would normally drag the formula across or down). To use an array function in this situation, select the full range of cells you need the formula to apply to and then create the formula in the same way you would normally, except replacing the single cell references with references to the range of cells instead. When the formula is ready, press Control + Shift + Enter to confirm it [1]. A simple example summing two columns is shown below:


Using this method ensures that the formulae are consistent and also that individual formula in the column or row cannot be modified. You can test how this works yourself – once you have created an array formula (or opened the example file at the bottom of this page) try to delete or modify one of the individual cell formulas. You should be presented with an error like the following:


This error prevents any modifications that users may inadvertently make, including the deletion of a row or column in the dataset.

Transposing Datasets

Occasionally the need will arise to transpose data (convert data that runs across a row into data that runs down a column, or vice versa). In my experience, these cases typically involve time series data running across a spreadsheet being transposed to run down the spreadsheet to make it easier to view.

The method here is to highlight the cells where you want the transposed results to display and enter the function “=TRANSPOSE( “, then highlight the values to be transposed. Note, that if the size of the ranges selected do not match (e.g. the number of columns to be transposed is greater than the number of rows in the range selected to display the results) the extra values will be dropped, or in the reverse case, the extra cells in the result set will be display as errors. An example is provided below:


The big advantage of doing this (as opposed to simply copying and pasting the set with the transpose option selected) is that it maintains a live connection. Any updates in the original dataset will be reflected in the transposed dataset.

SUMIF with two conditions[2]

One of the primary reasons I started learning about array functions was this exact problem – how do I do a SUMIF with two conditions? This may seem like an esoteric requirement but, as the datasets you are working with become more complex, the need to summarize the data by multiple criteria becomes increasingly useful. Looking at an example, let’s imagine we have a small dataset of campaign contributions that contains the name, gender, city and the amount contributed for 10 individuals. This data is shown below:


Now let’s imagine you want to identify the total contributions from all females in Boston. Broadly speaking, there are three ways you can do this:

  1. Manually select all the values and add them up
  2. Create a pivot table [3]
  3. Use an array formula

Often, the actual dataset is too big for method 1 (any dataset with more than 10 entries probably falls into this category), and a pivot table can be overkill when you only need to derive one value. In these cases, method 3 can be the simplest option.

To create our SUMIF with two criteria, we create a formula that actually looks like a SUMIF in some ways. Essentially we create two nested IF statements inside a SUM function. Our final formula is going to be as follows (if you are recreating this, don’t forget to use Control + Shift + Enter):


The formula works by creating an array of TRUEs and FALSEs based on the criteria specified in the IF statements (i.e. that the value in the gender column = “F” and the value in the City column = “Boston”). If both conditions are met, the formula captures the corresponding value in the specified range (the contributions column in this case); otherwise, the FALSE value (zero) is captured. Once there is an array of values and zeroes, the SUM function will sum up all the values to provide the answer.

Finally, this structure can also be used for a range of functions including:

  • MAX – replace SUM with MAX
  • MIN – replace SUM with MIN and the false value (zero) with a value larger than the maximum value in the dataset
  • COUNT – set the TRUE value to 1 instead of a range
  • AVERAGE – use the COUNT and SUM functions described above to create an average

Still confused? Please feel free to download this example file: 


[1] This combination confirms to Excel that the formula is an array formula. If you simply press enter, the formula will not work as expected. Completed array formulas will display in the formula bar surrounded by {}, but you do not manually type these characters in.

[2] Since writing this article I have become aware of two new(ish) Excel functions that have effectively replaced array functions for this purpose. SUMIFS and COUNTIFS are built in functions available from the 2007 version of Excel onwards which allow the user to sum or count records and specify multiple criteria.

[3] Pivot tables are very useful for a range of purposes, but will be covered in a separate blog entry

Excel Tips – Template to Table

There was an interesting Excel problem I encountered a while back which I thought I would share as the solution ended up being quite well received and even got me a nice Swiss dinner.


The problem arose out of a reasonably common scenario – a friend of mine had begun working in a new workplace where they had been using a MS Word template to collect the results of survey they had been undertaking. Being more of a data focused person, she managed to convince them to migrate the template to MS Excel as this would allow easier aggregation and analysis of the data. However, in convincing them to migrate, she had to design the Excel template in a manner more or less mirroring the Word template, which meant merged cells, data in various columns and rows on the spreadsheet, and generally no easy way to tabulate the data across multiple surveys.

After the surveys starting coming in from various areas and she had spent an inordinate amount of time manually copying and pasting the results into a table, she asked me if there was a better way to do this.


After a couple of false starts, I came to a point where I thought the only way to do this was going to be a macro. The issues with using a macro are:

  • if it breaks, it is likely that no one else will be able to fix it
  • it isn’t easy for others to determine what exactly the macro is doing, and
  • moving a macro across multiple spreadsheets and workbooks can get messy.

Asides from those issues, non-technical people tend to get a little squeamish around macros.

I ended up stumbling across this solution almost by pure chance as I happened to be doing something at work that resulted in me using an Excel function that I hadn’t used very often: INDIRECT. What INDIRECT does is takes any string (either typed directly into the formula, referenced from another cell, or some combination of both) and tries to read it as a cell reference.

For example – if you type “A3” into a cell, then reference that cell (“=A1” for example), the result will be that both the original cell and the cell with the reference will now read “A3”. But what if you wanted the cell referencing the original cell to actually use the value in that cell as a reference to look up a third value in cell A3? That is where INDIRECT comes in:


In this simple example we have used it to reference a cell in the same sheet, but we can also use INDIRECT to build a reference to a different sheet or workbook. And that’s where this is going. All the templates were the same – the same cell reference for each data point every time – but the name of the sheet (or workbook if they haven’t been added together yet) was always different. So what if we use an INDIRECT formula to dynamically look up the name of the sheet, which the user inputs? We can then use that value to complete the cell references for all the cells looking up the information on that sheet. Here is an example of what that would look like:


Once you have the basic formula working there are 3 steps left:

  1. Wrap the formula in an IFERROR clause ( =IFERROR(formula, “”) ) so that if the ‘Sheet Name’ column is blank, you get blank data fields instead of a ton of #REF errors.
  2. Manually ‘program’ the first row. You will have to ensure all the right cells are being looked up in the right columns.
  3. Drag the formula down for as many rows as you like.

Once this is done, users of the workbook will be able to move a completed template into this workbook, enter the sheet name in the table, and all the results should immediately populate in the table. No Macros, no mess, and a convenient and simple way to convert information from an irregularly formatted template into tabular data.

Still confused? Please feel free to download this example file: 

Newer posts

© 2018 Brett Romero

Theme by Anders NorenUp ↑