Data Science: A Kaggle Walkthrough – Cleaning Data

This article on cleaning data is Part III in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, it is recommended that you go back and read Part I and Part II.

In this part we will focus on cleaning the data provided for the Airbnb Kaggle competition.

Cleaning Data

When we talk about cleaning data, what exactly do we mean? Generally, people are referring to a few specific things:

  1. Fixing up formats – Often when data is saved or translated from one format to another (for example, in our case, from CSV into Python), some data may not be translated correctly. We saw a good example of this in the last article: the timestamp_first_active column contained numbers like 20090609231247 instead of timestamps in the expected format: 2009-06-09 23:12:47. A typical cleaning job is correcting these types of issues.
  2. Filling in missing values – As we also saw in Part II, it is quite common for some values to be missing from datasets. This typically means that a piece of information was simply not collected. There are several options for handling missing data that will be covered below.
  3. Correcting erroneous values – For some columns, there are values that can be identified as obviously incorrect. This may be a ‘gender’ column where someone has entered a number, or an ‘age’ column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or assumed to be missing.
  4. Standardizing categories – More of a subcategory of ‘correcting erroneous values’, this type of data cleansing is so common it is worth special mention. In many (all?) cases where data is collected from users directly – particularly using free text fields – spelling mistakes, language differences or other factors will result in a given answer being provided in multiple ways. For example, when collecting data on country of birth, if users are not provided with a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values so that there is only one version of each value (a short sketch of this kind of standardization follows this list).
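
As a minimal illustration of that last point, here is a sketch of how this kind of standardization might be done with pandas. The column name and mapping are hypothetical – nothing like this appears in the Airbnb data:

import pandas as pd

# Hypothetical free-text column with several spellings of the same country
df = pd.DataFrame({'country_of_birth': ['USA', 'United States', 'U.S.', 'usa', 'Australia']})

# Map the known variants to a single standard value
mapping = {'usa': 'US', 'united states': 'US', 'u.s.': 'US', 'australia': 'AU'}

# Normalize case and whitespace, apply the mapping, and keep the original value where no mapping exists
standardized = df['country_of_birth'].str.strip().str.lower().map(mapping)
df['country_of_birth'] = standardized.fillna(df['country_of_birth'])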

Options for Dealing with Missing Data

Missing data in general is one of the trickier issues that is dealt with when cleaning data. Broadly there are two solutions:

1. Deleting/Ignoring rows with missing values

The simplest solution available when faced with missing values is to not use the records with missing values when training your model. However, there are some issues to be aware of before you start deleting masses of rows from your dataset.

The first is that this approach only makes sense if the number of rows with missing data is relatively small compared to the dataset. If you are finding that you will be deleting more than around 10% of your dataset due to rows having missing values, you may need to reconsider.
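
As a quick sanity check before deleting anything, a couple of lines of pandas will tell you how much of the dataset you would lose. A sketch, assuming a dataframe like the df_train loaded later in this article:

# Proportion of rows that contain at least one missing value
rows_with_missing = df_train.isnull().any(axis=1).mean()
print("{:.1%} of rows have at least one missing value".format(rows_with_missing))

# Dropping those rows is then a one-liner - but only sensible if the proportion is small
df_dropped = df_train.dropna()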

The second issue is that in order to delete the rows containing missing data, you have to be confident that those rows do not contain information that is not also contained in the remaining rows. For example, in the current Airbnb dataset we have seen that many users have not provided their age. Can we assume that the people who chose not to provide their age are otherwise the same as the users who did? Or are they likely to represent a different type of user – perhaps an older and more privacy conscious user, and therefore a user that is likely to make different choices on which countries to visit? If the answer is the latter, we probably do not want to just delete the records.

2. Filling in the Values

The second broad option for dealing with missing data is to fill the missing values with a value. But what value to use? This depends on a range of factors, including the type of data you are trying to fill.

If the data is categorical (i.e. countries, device types, etc.), it may make sense to simply create a new category that will represent ‘unknown’. Another option may be to fill the values with the most common value for that column (the mode). However, because these are broad methods for filling the missing values, this may oversimplify your data and/or make your final model less accurate.
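
Both of these options are one-liners in pandas. A sketch, using the first_affiliate_tracked column purely as an illustration and assuming the combined df_all dataframe created in the loading step below (later in this article that column is simply filled with -1):

# Option 1: treat missing values as their own category
filled_unknown = df_all['first_affiliate_tracked'].fillna('unknown')

# Option 2: fill missing values with the most common value (the mode)
filled_mode = df_all['first_affiliate_tracked'].fillna(df_all['first_affiliate_tracked'].mode()[0])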

For numerical values (for example the age column) there are some other options. Given that in this case using the mode to fill values makes less sense, we could instead use the mean or median. We could even take an average based on some other criteria – for example filling the missing age values based on an average age for users that selected the same country_destination.
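
A sketch of these two options, again assuming the combined df_all dataframe created in the loading step below (this is for illustration only – it is not the approach used later in this article):

# Option A: fill missing ages with the overall median
age_median_fill = df_all['age'].fillna(df_all['age'].median())

# Option B: fill missing ages with the mean age of users with the same country_destination
group_means = df_all.groupby('country_destination')['age'].transform('mean')
age_group_fill = df_all['age'].fillna(group_means)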

For both types of data (categorical and numerical), we can also use far more complicated methods to impute the missing values. Effectively, we can use a similar methodology to the one we are planning to use to predict country_destination to predict the values in any of the other columns, based on the columns that do have data. And just like with modeling in general, there is an almost endless number of ways this can be done, which won’t be detailed here. For more information on this topic, the Orange Python library provides some excellent documentation.
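
As a very rough sketch of the idea – using scikit-learn rather than the Orange library mentioned above, and not the approach taken in this series – you could train a regression model on the rows where age is known and use it to predict the rows where it is missing. The feature list here is purely hypothetical:

from sklearn.ensemble import RandomForestRegressor

# Hypothetical set of predictor columns (they must be numeric and contain no missing values)
features = ['signup_flow']

known = df_all[df_all['age'].notnull()]
missing = df_all[df_all['age'].isnull()]

# Train on the rows where age is known, then predict the missing ages
model = RandomForestRegressor(n_estimators=100)
model.fit(known[features], known['age'])
df_all.loc[df_all['age'].isnull(), 'age'] = model.predict(missing[features])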

Step by Step

With that general overview out of the way, let’s start cleaning the Airbnb data. Of the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files – train_users_2.csv and test_users.csv – and leave sessions.csv aside.

Loading in the Data

The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from two files train_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once[1].

import pandas as pd

# Import data
print("Reading in data...")
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath, header=0, index_col=None)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath, header=0, index_col=None)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

Clean the Timestamps

Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?

The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.

Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see here). For the timestamp_first_active column, the date format provided is different so we adjust our expression accordingly.

Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because the date_account_created column is sometimes empty, we replace the empty values with the value in the timestamp_first_active column using the fillna function. The code for this step is provided below:

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)
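
As a quick illustration of why this matters (not part of the cleaning script itself), once the columns are genuine datetimes, the kinds of operations mentioned above become one-liners:

# Extract the month each account was created
account_month = df_all['date_account_created'].dt.month

# Days between a user's first activity and their account being created
lag_days = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days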

Remove booking date field

Those following along and/or paying attention may have noticed that in the original dataset there are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in train_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column, and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.

This means that this column is not going to be useful for predicting in which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chances that the model predicts NDF, since in the training data the records without dates are exactly the NDF records – and every record in the test data is missing this date. The code for removing the column is provided below:

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)

Clean the Age column

As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or too low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, but here simply means the age is unknown. In other words, we are changing the incorrect values into missing values. To do this, we create a simple function that takes a dataframe (table), a column name, a minimum acceptable value (15) and a maximum acceptable value (90). This function then replaces the values in the specified column that fall outside the acceptable range with NaN.

Again from Part II we know there were also a significant number of users who did not provide their age at all – so they also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.

The code for these steps is shown below:

import numpy as np

# Remove outliers function
def remove_outliers(df, column, min_val, max_val):
    # Replace values outside the acceptable range with NaN (i.e. mark them as missing)
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.nan, col_values)
    return df

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)

As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column. We are selecting this simple method for two main reasons:

  1. Clarity – this series of articles is going to be long enough without adding the complication of a complex methodology for imputing missing ages.
  2. Questionable results – in my testing during the actual competition, I did test several more complex imputation methodologies. However, none of the methods I tested actually produced a better end result than the methodology outlined above.

Identify and fill additional columns with missing values

From more detailed analysis of the data, you may have also realized there is one more column that has missing values – the first_affiliate_tracked column. In the same way we have been filling in the missing values in other columns, we now fill in the values in this column.

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)

Sample Output

So what does the data look like after all these changes? Here is a sample of some rows from our cleaned dataset:

id | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | first_affiliate_tracked | first_browser | first_device_type | gender | language | signup_app | signup_flow | signup_method | timestamp_first_active
gxn3p5htnn | direct | direct | -1.0 | NDF | 2010-06-28 00:00:00 | untracked | Chrome | Mac Desktop | -unknown- | en | Web | 0 | facebook | 2009-03-19 04:32:55
820tgsjxq7 | seo | google | 38.0 | NDF | 2011-05-25 00:00:00 | untracked | Chrome | Mac Desktop | MALE | en | Web | 0 | facebook | 2009-05-23 17:48:09
4ft3gnwmtx | direct | direct | 56.0 | US | 2010-09-28 00:00:00 | untracked | IE | Windows Desktop | FEMALE | en | Web | 3 | basic | 2009-06-09 23:12:47
bjjt8pjhuk | direct | direct | 42.0 | other | 2011-12-05 00:00:00 | untracked | Firefox | Mac Desktop | FEMALE | en | Web | 0 | facebook | 2009-10-31 06:01:29
87mebub9p4 | direct | direct | 41.0 | US | 2010-09-14 00:00:00 | untracked | Chrome | Mac Desktop | -unknown- | en | Web | 0 | basic | 2009-12-08 06:11:05
osr2jwljor | other | other | -1.0 | US | 2010-01-01 00:00:00 | omg | Chrome | Mac Desktop | -unknown- | en | Web | 0 | basic | 2010-01-01 21:56:19
lsw9q7uk0j | other | craigslist | 46.0 | US | 2010-01-02 00:00:00 | untracked | Safari | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-02 01:25:58
0d01nltbrs | direct | direct | 47.0 | US | 2010-01-03 00:00:00 | omg | Safari | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-03 19:19:05
a1vcnhxeij | other | craigslist | 50.0 | US | 2010-01-04 00:00:00 | untracked | Safari | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-04 00:42:11
6uh8zyj2gn | other | craigslist | 46.0 | US | 2010-01-04 00:00:00 | omg | Firefox | Mac Desktop | -unknown- | en | Web | 0 | basic | 2010-01-04 02:37:58
yuuqmid2rp | other | craigslist | 36.0 | US | 2010-01-04 00:00:00 | untracked | Firefox | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-04 19:42:51
om1ss59ys8 | other | craigslist | 47.0 | NDF | 2010-01-05 00:00:00 | untracked | -unknown- | iPhone | FEMALE | en | Web | 0 | basic | 2010-01-05 05:18:12
k6np330cm1 | direct | direct | -1.0 | FR | 2010-01-05 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-05 06:08:59
dy3rgx56cu | other | craigslist | 37.0 | NDF | 2010-01-05 00:00:00 | linked | Firefox | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-05 08:32:59
ju3h98ch3w | other | craigslist | 36.0 | NDF | 2010-01-07 00:00:00 | untracked | Mobile Safari | iPhone | FEMALE | en | Web | 0 | basic | 2010-01-07 05:58:20
v4d5rl22px | direct | direct | 33.0 | CA | 2010-01-07 00:00:00 | untracked | Chrome | Windows Desktop | FEMALE | en | Web | 0 | basic | 2010-01-07 20:45:55
2dwbwkx056 | other | craigslist | -1.0 | NDF | 2010-01-07 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-07 21:51:25
frhre329au | other | craigslist | 31.0 | US | 2010-01-07 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-07 22:46:25
cxlg85pg1r | seo | facebook | -1.0 | NDF | 2010-01-08 00:00:00 | -1 | -unknown- | Other/Unknown | -unknown- | en | Web | 0 | basic | 2010-01-08 01:56:41
gdka1q5ktd | direct | direct | 29.0 | FR | 2010-01-10 00:00:00 | untracked | Chrome | Mac Desktop | FEMALE | en | Web | 0 | basic | 2010-01-10 01:08:17

Is that all?

Those more experienced with working with data may be thinking that we have not done all that much cleaning with this data – and you would be right. One of the nice things about Kaggle competitions is that the data provided does not require all that much cleaning as that is not what the providers of the data want participants to focus on. Many of the problems that would be found in real world data (as covered earlier) do not exist in this dataset, saving us significant time.

However, this relatively easy cleaning process also tells us something important: even when a dataset is provided with the intention that little or no cleaning will be needed, there is always something that needs to be done.

Next Time

In the next piece, we will focus on transforming the data and feature extraction, allowing us to create a training dataset that will hopefully allow the model to make better predictions. To make sure you don’t miss out, use the subscription feature below.

 

[1] For those with more data mining experience, you may realize that combining the test and training data at this stage is not best practice. Best practice would be to avoid using the test dataset in any of the data preprocessing or model tuning/validation steps to avoid overfitting. However, in the context of this competition, because we are only trying to create a model to classify one unchanging dataset, simply maximizing the accuracy of the model for that dataset is the primary concern.
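
For completeness, splitting the combined dataset back into its training and test portions once the cleaning and transformation steps are finished is straightforward. A sketch, relying on the fact that the training rows were concatenated first in the loading step above:

# Recover the original training and test rows from the combined dataframe
df_train_clean = df_all.iloc[:len(df_train)].copy()
df_test_clean = df_all.iloc[len(df_train):].copy()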

 

Data Science: A Kaggle Walkthrough – Understanding the Data

This article on understanding the data is Part II in a series looking at data science and machine learning by walking through a Kaggle competition. Part I can be found here.

Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we focus on understanding the data provided for the Airbnb Kaggle competition.

Reviewing the Data

In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):

  1. What features (columns) does the dataset contain?
  2. How many records (rows) have been provided?
  3. What format is the data in (e.g. what format are the dates provided, are there numerical values, what do the different categorical values look like)?
  4. Are there missing values?
  5. How do the different features relate to each other?

For this competition, Airbnb have provided 6 different files. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:

  1. train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries. Each row represents one user, with the columns containing various information such as the users’ ages and when they signed up. This is the primary dataset that we will use to train the model.
  2. test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
  3. sessions.csv – This data is supplementary data that can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a wish list, ran a search, etc.) taken by the users in both the testing and training datasets above.

With this information in mind, an easy first step in understanding the data is reviewing the information provided by the data provider – Airbnb. For this competition, the information can be found here. The main points (aside from the descriptions of the columns) are as follows:

  • All the users in the data provided are from the USA.
  • There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’.
  • ‘other’ means there was a booking, but in a country not included in the list, while ‘NDF’ means there was not a booking.
  • The training and test sets are split by dates. In the test set, you will predict the destination country for all the new users with first activities after 7/1/2014.
  • In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset dates back to 2010.

After absorbing this information, we can start looking at the actual data. For now we will focus on the train_users_2.csv file only.
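
A few lines of pandas are enough to answer most of the questions listed earlier and to produce the kind of transposed sample shown in Table 1 below. A sketch, assuming the file is in the working directory:

import pandas as pd

df_train = pd.read_csv("./train_users_2.csv")

print(df_train.shape)           # number of rows and columns
print(df_train.dtypes)          # data type of each column
print(df_train.isnull().sum())  # missing values per column
print(df_train.head(3).T)       # first three rows, transposed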

Table 1 – Three rows (transposed) from train_users_2.csv

Column Name | Example 1 | Example 2 | Example 3
id | 4ft3gnwmtx | v5lq9bj8gv | msucfwmlzc
date_account_created | 28/9/10 | 30/6/14 | 30/6/14
timestamp_first_active | 20090609231247 | 20140630234429 | 20140630234729
date_first_booking | 2/8/10 |  | 16/3/15
gender | FEMALE | -unknown- | MALE
age | 56 |  | 43
signup_method | basic | basic | basic
signup_flow | 3 | 25 | 0
language | en | en | en
affiliate_channel | direct | direct | direct
affiliate_provider | direct | direct | direct
first_affiliate_tracked | untracked | untracked | untracked
signup_app | Web | iOS | Web
first_device_type | Windows Desktop | iPhone | Windows Desktop
first_browser | IE | -unknown- | Firefox
country_destination | US | NDF | US

Looking at the sample of three records above provides us with a few key pieces of information about this dataset. The first is that at least two columns have missing values – the age column and date_first_booking column. This tells us that before we use this data for training a model, these missing values need to be filled or the rows excluded altogether. These options will be discussed in more detail in the next part of this series.

Secondly, most of the columns provided contain categorical data (i.e. the values represent one of some fixed number of categories). In fact 11 of the 16 columns provided appear to be categorical. Most of the algorithms that are used in classification do not handle categorical data like this very well, and so when it comes to the data transformation step, we will need to find a way to change this data into a form that is more suited for classification.

Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47. This formatting will need to be corrected if we are to use the date values.

Diving Deeper

Now that we have gained a basic understanding of the data by looking at a few example records, the next step is to start looking at the structure of the data.

Country Destination Values

Arguably, the most important column in the dataset is the one the model will try to predict – country_destination. Looking at the number of records that fall into each category can help provide some insights into how the model should be constructed as well as pitfalls to avoid.
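
The figures in Tables 2 and 3 below can be reproduced with a few lines of pandas. A sketch, assuming df_train has been loaded as above:

import pandas as pd

# Records and share of total by destination (Table 2)
destination_counts = df_train['country_destination'].value_counts()
destination_share = df_train['country_destination'].value_counts(normalize=True) * 100

# Percentage split by year of account creation (Table 3)
year = pd.to_datetime(df_train['date_account_created']).dt.year
share_by_year = pd.crosstab(df_train['country_destination'], year, normalize='columns') * 100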

Table 2 – Users by Destination

Destination | Records | % of Total
NDF | 124,543 | 58.3%
US | 62,376 | 29.2%
other | 10,094 | 4.7%
FR | 5,023 | 2.4%
IT | 2,835 | 1.3%
GB | 2,324 | 1.1%
ES | 2,249 | 1.1%
CA | 1,428 | 0.7%
DE | 1,061 | 0.5%
NL | 762 | 0.4%
AU | 539 | 0.3%
PT | 217 | 0.1%
Grand Total | 213,451 | 100.0%

Looking at the breakdown of the data, one thing that immediately stands out is that almost 90% of users fall into two categories, that is, they are either yet to make a booking (NDF) or they made their first booking in the US. What’s more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.

Table 3 – Users by Destination and Year

Destination | 2010 | 2011 | 2012 | 2013 | 2014 | Overall
NDF | 42.5% | 45.4% | 55.0% | 59.2% | 61.8% | 58.3%
US | 44.0% | 38.1% | 31.1% | 28.9% | 26.7% | 29.2%
other | 2.8% | 4.7% | 4.9% | 4.6% | 4.8% | 4.7%
FR | 4.3% | 4.0% | 2.8% | 2.2% | 1.9% | 2.4%
IT | 1.1% | 1.7% | 1.5% | 1.2% | 1.3% | 1.3%
GB | 1.0% | 1.5% | 1.3% | 1.0% | 1.0% | 1.1%
ES | 1.5% | 1.7% | 1.2% | 1.0% | 0.9% | 1.1%
CA | 1.5% | 1.1% | 0.7% | 0.6% | 0.6% | 0.7%
DE | 0.6% | 0.8% | 0.7% | 0.5% | 0.3% | 0.5%
NL | 0.4% | 0.6% | 0.4% | 0.3% | 0.3% | 0.4%
AU | 0.3% | 0.3% | 0.3% | 0.3% | 0.2% | 0.3%
PT | 0.0% | 0.2% | 0.1% | 0.1% | 0.1% | 0.1%
Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%

For modeling purposes, this type of split means a couple of things. Firstly, the spread of categories has changed over time. Considering that our final predictions will be made against user data from July 2014 onwards, this change provides us with an incentive to focus on more recent data for training purposes, as it is more likely to resemble the test data.

Secondly, because the vast majority of users fall into 2 categories, there is a risk that if the model is too generalized, or in other words not sensitive enough, it will select one of those two categories for every prediction. A key step will be ensuring the training data has enough information to ensure the model will predict other categories as well.

Account Creation Dates

Let’s now move onto the date_account_created column to see how the values have changed over time.

Chart 1 – Accounts Created Over Time

Chart 1 provides excellent evidence of the explosive growth of Airbnb, averaging over 10% growth in new accounts created per month. In the year to June 2014, the number of new accounts created was 125,884 – a 132% increase from the year before.

But aside from showing how quickly Airbnb has grown, this data also provides another important insight: the majority of the training data comes from the most recent two years. In fact, if we limited the training data to accounts created from January 2013 onwards, we would still be including over 70% of all the data. This matters because, referring back to the notes provided by Airbnb, if we want to use the data in sessions.csv we would be limited to data from January 2014 onwards. Again looking at the numbers, this means that even though the sessions.csv data only covers 11% of the time period (6 out of 54 months), it still covers over 30% of the training data – or 76,466 users.
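
A sketch of how these figures can be checked, again assuming df_train as loaded above:

import pandas as pd

created = pd.to_datetime(df_train['date_account_created'])

# New accounts created per month (the series behind Chart 1)
accounts_per_month = created.dt.to_period('M').value_counts().sort_index()

# Share of all training records with accounts created from January 2013 onwards
share_since_2013 = (created >= '2013-01-01').mean()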

Age Breakdown

Looking at the breakdown by age, we can see a good example of another issue that anyone working with data (whether a Data Scientist or not) faces regularly – data quality issues. As can be seen from Chart 2, there are a significant number of users that have reported their ages as well over 100. In fact, a significant number of users reported their ages as over 1000.

Chart 2 – Reported Ages of Users

So what is going on here? Firstly, it appears that a number of users have reported their birth year instead of their age. This would help to explain why there are a lot of users with ‘ages’ between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age as 105 and 110. This is harder to explain but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Either way, these values would appear to be errors that will need to be addressed.

Additionally, as we saw in the example data provided above, another issue with the age column is that sometimes age has not been reported at all. In fact, if we look across all the training data provided, we can see a large number of missing values in all years.

Table 4 – Missing Ages

Year | Missing Values | Total Records | % Missing
2010 | 1,082 | 2,788 | 38.8%
2011 | 4,090 | 11,775 | 34.7%
2012 | 13,740 | 39,462 | 34.8%
2013 | 34,950 | 82,960 | 42.1%
2014 | 34,128 | 76,466 | 44.6%
Total | 87,990 | 213,451 | 41.2%

When we clean the data, we will have to decide what to do with these missing values.
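
The figures in Table 4 can be reproduced by grouping the missing-age indicator by the year each account was created (a sketch, again assuming df_train):

import pandas as pd

year = pd.to_datetime(df_train['date_account_created']).dt.year

# Number missing, total records and proportion missing for each year
missing_age_by_year = df_train['age'].isnull().groupby(year).agg(['sum', 'count', 'mean'])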

First Device Type

Finally, one last column that we will look at is the first_device_type column.

Table 5 – First Device Used

Device | 2010 | 2011 | 2012 | 2013 | 2014 | All Years
Mac Desktop | 37.2% | 40.4% | 47.2% | 44.2% | 37.3% | 42.0%
Windows Desktop | 21.6% | 25.2% | 37.7% | 36.9% | 31.0% | 34.1%
iPhone | 5.8% | 6.3% | 3.8% | 7.5% | 15.9% | 9.7%
iPad | 4.6% | 4.8% | 6.1% | 7.1% | 7.0% | 6.7%
Other/Unknown | 28.8% | 21.3% | 3.8% | 2.8% | 4.6% | 5.0%
Android Phone | 1.1% | 1.2% | 0.7% | 0.4% | 2.6% | 1.3%
Android Tablet | 0.4% | 0.4% | 0.3% | 0.5% | 0.9% | 0.6%
Desktop (Other) | 0.4% | 0.4% | 0.4% | 0.6% | 0.7% | 0.6%
SmartPhone (Other) | 0.0% | 0.1% | 0.1% | 0.0% | 0.0% | 0.0%
Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%

The interesting thing about the data in this column is how the types of devices used have changed over time. Windows users have increased significantly as a percentage of all users. iPhone users have tripled their share, while users using ‘Other/unknown’ devices have gone from the second largest group to less than 5% of users. Further, the majority of these changes occurred between 2011 and 2012, suggesting that there may have been a change in the way the classification was done.

Like with the other columns we have reviewed above, this change over time reinforces the presumption that recent data is likely to be the most useful for building our model.

Other Columns

It should be noted that although we have not covered all of them here, having some understanding of all the data provided in a dataset is important for building an accurate classification model. In some cases, this may not be possible due to the presence of a very large number of columns, or due to the fact that the data has been abstracted (that is, the data has been converted into a different form). However, in this particular case, the number of columns is relatively small and the information is easily understandable.

Next Time

Now that we have taken the first step – understanding the data – in the next piece, we will start cleaning the data to get it into a form that will help to optimize the model’s performance.

 

Data Science: A Kaggle Walkthrough – Introduction

I have spent a lot of time working with spreadsheets, databases, and data more generally. This work has led to me having a very particular set of skills, skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that’ll be the end of it. I will not look for you, I will not pursue you. But if you don’t, I will look for you, I will find you, and I will kill you.

The badassery of Liam Neeson aside, although I have spent years working with data in a range of capacities, the skills and techniques required for ‘data science’ are a very specific subset that do not tend to come up in too many jobs. What is more, data science tends to involve a lot more programming than most other data related work and this can be intimidating for people who are not coming from a computer science background. The problem is, people who work with data in other contexts (e.g. economics and statistics), as well as those with industry specific experience and knowledge, can often bring different and important perspectives to data science problems. Yet, these people often feel unable to contribute because they do not understand programming or the black box models being used.

Something that has nothing to do with data science

Therefore, in a probably futile attempt to shed some light on this field, this will be the first part in a multi-part series looking at what data science involves and some of the techniques most commonly used. This series is not intended to make everyone experts on data science, rather it is intended to simply try and remove some of the fear and mystery surrounding the field. In order to be as practical as possible, this series will be structured as a walk through of the process of entering a Kaggle competition and the steps taken to arrive at the final submission.

What is Kaggle?

For those that do not know, Kaggle is a website that hosts data science problems for an online community of data science enthusiasts to solve. These problems can be anything from predicting cancer based on patient data, to sentiment analysis of movie reviews and handwriting recognition – the only thing they all have in common is that they are problems requiring the application of data science to be solved.

The problems on Kaggle come from a range of sources. Some are provided just for fun and/or educational purposes, but many are provided by companies that have genuine problems they are trying to solve. As an incentive for Kaggle users to compete, prizes are often awarded for winning these competitions, or finishing in the top x positions. Sometimes the prize is a job or products from the company, but there can also be substantial monetary prizes. Home Depot for example is currently offering $40,000 for the algorithm that returns the most relevant search results on homedepot.com.

Despite the large prizes on offer though, many people on Kaggle compete simply for practice and the experience. The competitions involve interesting problems and there are plenty of users who submit their scripts publicly, providing an excellent learning opportunity for those just trying to break into the field. There are also active discussion forums full of people willing to provide advice and assistance to other users.

What is not spelled out on the website, but is assumed knowledge, is that to make accurate predictions, you will have to use machine learning.

Machine Learning

When it comes to machine learning, there is a lot of general misunderstanding about what this actually involves. While there are different forms of machine learning, the one that I will focus on here is known as classification, which is a form of ‘supervised learning’. Classification is the process of assigning records or instances (think rows in a dataset) to a specific category in a pre-determined set of categories. Think about a problem like predicting which passengers on the Titanic survived (i.e. there are two categories – ‘survived’ and ‘did not survive’) based on their age, class and gender[1].

Titanic Classification Problem

Passenger | Age | Class | Gender | Survived?
0001 | 32 | First | Female | ?
0002 | 12 | Second | Male | ?
0003 | 64 | Steerage | Male | ?
0004 | 23 | Steerage | Male | ?
0005 | 11 | Steerage | Male | ?
0006 | 42 | Steerage | Male | ?
0007 | 9 | Second | Female | ?
0008 | 8 | Steerage | Female | ?
0009 | 19 | Steerage | Male | ?
0010 | 55 | First | Male | ?
0011 | 53 | First | Female | ?
0012 | 27 | Second | Male | ?

Referring specifically to ‘supervised learning’ algorithms, the way these predictions are made is by providing the algorithm with a dataset (typically the larger the better) of ‘training data’. This training data contains all the information available to make the prediction as well as the categories each record corresponds to. This data is then used to ‘train’ the algorithm to find the most accurate way to classify those records for which we do not know the category.

Training Data

Passenger | Age | Class | Gender | Survived?
0013 | 23 | Second | Female | 1
0014 | 21 | Steerage | Female | 0
0015 | 46 | Steerage | Male | 0
0016 | 32 | First | Male | 0
0017 | 13 | First | Female | 1
0018 | 24 | Second | Male | 0
0019 | 29 | First | Male | 1
0020 | 80 | Second | Male | 1
0021 | 9 | Steerage | Female | 0
0022 | 44 | Steerage | Male | 0
0023 | 35 | Steerage | Female | 1
0024 | 10 | Steerage | Male | 0

Although that seems relatively straightforward, part of what makes data science such a complex field is the limitless number of ways that a predictive model can be built. There are a huge number of different algorithms that can be trained, mostly with weird sounding names like Neural Network, Random Forest and Support Vector Machine (we will look at some of these in more detail in future installments). These algorithms can also be combined to create a single model. In fact, the people/teams that end up winning Kaggle competitions often combine the predictions of a number of different algorithms.

To make things more complicated, within each algorithm, there is a range of parameters that can be adjusted to significantly alter the prediction accuracy, and these parameters will vary for each classification problem. Finding the optimal set of parameters to maximize accuracy is often an art in itself.
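
To make this a little more concrete, here is a minimal sketch using scikit-learn and made-up data in the spirit of the toy Titanic tables above (this is not the competition code). The point is simply that the algorithm and its parameters take only a handful of lines – the hard work lies elsewhere:

from sklearn.ensemble import RandomForestClassifier

# Toy training data: [age, class (1=First, 2=Second, 3=Steerage), gender (0=male, 1=female)]
X_train = [[23, 2, 1], [21, 3, 1], [46, 3, 0], [32, 1, 0], [13, 1, 1], [24, 2, 0]]
y_train = [1, 0, 0, 0, 1, 0]  # 1 = survived, 0 = did not survive

# The arguments here are the 'parameters' - changing them can significantly alter accuracy
model = RandomForestClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)

# Predict for passengers whose outcome we do not know
print(model.predict([[64, 3, 0], [9, 2, 1]]))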

Finally, just feeding the training data into an algorithm and hoping for the best is typically a fast track to poor performance (if it works at all). Significant time is needed to clean the data, correct formats and add additional ‘features’ to maximize the predictive capability of the algorithm. We will go into more detail on both of these requirements in future installments.

OK, so now let’s put all this into context by looking at the competition I entered, provided by Airbnb. The aim of the competition was to predict the country that users will make their first booking in, based on some basic user profile data[2]. In this case, the categories were the different country options and an additional category for users that had not made a previous booking through Airbnb. The training data was a set of users for whom we were provided with the correct category (i.e. what country they made their first booking in). Using the training data, I was required to train the model to accurately predict the country of first booking, and then submit my predictions for a set of users for whom we did not know the outcome.

How?

The aim of this series is to walk through the process of assessing and analyzing data, cleaning, transforming and adding new features, constructing and testing a model, and finally creating final predictions. The primary technology I will be using as I walk through this is Python, in combination with Excel/Google Sheets to analyze some of the outputs. Why Python? There are several reasons:

  1. It is free and open source.
  2. It has a great range of libraries (also free) that provide access to a large number of machine learning algorithms and other useful tools. The libraries I will primarily use are numpy, pandas and sklearn.
  3. It is very popular, meaning when I get stuck on a problem, there is usually plenty of material and documentation to be found online for help.
  4. It is very fast (primarily the reason I have chosen Python over R).

For those that are interested in following this series but do not have a programming background, do not panic – although I will show code snippets as we go, being able to read the code is not vital to understanding what is happening.

Next Time

In the next piece, we will start looking at the data in more detail and discuss how we can clean and transform it, to help optimize the model performance.

 

[1] This is an actual competition on Kaggle at the moment (no prizes are awarded, it is for experience only).

[2] The data has been anonymized so that users cannot be identified.

 

The argument for taxing capital gains at the full rate

Politicians, both in Australia and the US, when asked how they will find the money to fund various policy proposals, often resort to the magic pudding of funding sources that is “closing the loopholes in the tax code”. After all, who can argue with stopping tax dodgers rorting the system? But as Megan McArdle recently pointed out, raising any significant revenue from closing loopholes requires denying deductions for things that a lot of middle and lower class people also benefit from. This includes, among other things, deductions for mortgage interest, employer-sponsored health insurance, lower (or no) tax on money set aside for pensions and no tax on capital gains when the family house is sold.[1]

Broadly, I agree with McArdle’s point. The public, in general, are far too easily convinced by simplistic arguments about changes to taxation – as if after decades of tax policy changes there are still simple ways to increase revenues without anyone suffering. Any changes made at this point are going to cause winners and losers, and often, the people intended to be the losers (usually the rich) are less affected than some other group that also happened to be taking advantage of a particular deduction.

That said, there is one point, addressed briefly in McArdle’s article, that I thought deserved greater attention – the concessional taxation of capital gains. In the list provided in the article, it was the second most expensive tax deduction in the US at $85 billion a year[2]. You see, for a while now I have been somewhat of a closet skeptic of the need for lower tax rates on capital income (i.e. capital gains and dividends). The reason for my skepticism is twofold:

  1. Everyone seems to be in agreement that concessional rates for capital income are absolutely necessary, but no one seems to really understand why.
  2. Capital income makes up a much larger percentage of income for the wealthy than for the lower or middle class. When you hear that story about billionaire Warren Buffett paying a lower rate of tax than his secretary, it is because of the low rate of tax on capital income.

So, now that I am finally voicing my skepticism, this article is going to look at what arguments are made for lower tax rates on capital income (focusing on capital gains for individuals) and whether those arguments hold water.

Why are capital gains taxed at a lower rate?

Once you start digging, you quickly find there is a range of arguments (of variable quality) being made for why capital gains should be taxed at a lower rate. These arguments can largely be grouped into the following broad categories:

  1. Inflation
  2. Lock-In
  3. Double Taxation
  4. Capital is Mobile
  5. The Consumption – Savings tradeoff

Inflation

Taxing capital gains implies taxing the asset holder for any increases in the price of that asset. In an economy where inflation exists (i.e. every economy) this means you are taxing increases in the price of the asset due to inflation, as well as any increase in the value of the asset itself. Essentially, even if you had an asset which had only increased in value at the exact same rate as inflation (i.e. the asset was tradable for the same amount of goods as when you bought it), you would still have to pay capital gains tax.

The inflation argument, although legitimate, is relatively easy to legislate around by allowing asset holders to adjust up the cost base of their assets by the inflation rate each year.

Lock In

‘Lock-in’ is the idea that investors, to avoid paying capital gains tax, will stop selling their assets. An investor holding onto assets to avoid tax implies they are being incentivized, through the tax system, to invest suboptimally – something economists really dislike. However, to the extent that ‘lock-in’ occurs, it cannot be considered anything other than an irrational reaction. Holding onto assets does not avoid tax, it only delays it, and given inflation is factored into the asset price (as discussed above), there is not even the benefit of time reducing the tax burden. The bottom line is this – to pay more capital gains tax, there must be larger capital gains. That is, even if the capital gains tax rate was 99%, an investor would still be better off making larger capital gains than smaller ones.

The other point to remember when it comes to ‘lock-in’ is that in both the US and Australia, the lower rate of capital gains tax only applies to assets held for more than a year. That means if ‘lock-in’ exists, it is already a major problem. Because asset holders can access a lower rate of tax by holding an asset for a year, they are already strongly incentivized to hold onto their underperforming assets longer than is optimal to access the concessional tax rate. In fact, increasing the long-term capital gains tax rate to the same level as the short-term rate should actually reduce lock-in by removing this incentive.

Double Taxation

The double taxation argument is a genuine concern for economists. The double tax situation arises because companies already pay tax on their profits. Taxing those profits in the hands of investors again, either as capital gains (on that company’s stock) or dividends, implies some high marginal tax rates on investment. This is one of the main reasons capital income is taxed at low rates in most countries.

Ideally, to avoid this situation, the tax code would be simplified by removing company tax altogether, as McArdle herself has argued in the past. However, we should probably both accept that, at best, the removal of corporate tax is a long way away. Nevertheless, this idea can form the basis for policies that achieve similar goals without the political issue of trying to sell the removal of corporate tax.

For dividends, for example, double taxation can be avoided by providing companies with a deduction for the value of dividends paid out to investors. Investors would then pay their full marginal tax rate on the dividends, more than replacing the lost company tax revenues.

Preventing double taxation of capital gains is a little more complicated, but the answer may lie in setting up a quarantined investment pool that companies can move profits into. Profits moved into this pool would not be subject to tax and, once in the pool, the money could only be used for certain legitimate investment activities. This would effectively remove taxation on profits going toward genuine reinvestment, as opposed to fattening bonus checks.

The overall point here is not that I have the perfect policy to avoid double taxation of company profits, but that there are other worthwhile avenues worth exploring that are not simply giving huge tax breaks to wealthy investors.

Capital is Mobile

This is one of the two arguments McArdle briefly mentions in her article. The ‘capital is mobile’ argument is that if we tax wealthy investors too much, they will do a John Galt – take their money and go to another country that won’t be so “mean” to them.

When it comes to moving money offshore, obviously, not everyone is in a position to make the move. Pension funds and some investment vehicles cannot simply move country. Companies and some other investment vehicles do not receive a capital gains tax discount currently, meaning raising tax rates for capital gains for individuals would not impact them at all. Finally, even for investors that would be affected and do have the means, a hike in the capital gains rate does not automatically move all their investments below the required rate of return.

This argument also overlooks the vast array of complications in moving money offshore and the risks involved with that action. Moving assets offshore exposes investors to new risks such as exchange rate risk[3] and sovereign risk[4]. It also significantly complicates the administrative, compliance and legal burden the investor has to manage.

However, even if we concede that yes, some money would move offshore as a result of higher taxes on capital gains, let’s look at the long term picture. What is the logical end point for a world where each country employs a policy of attracting wealthy investors by lowering taxes on capital? A world where no country taxes capital!

Of course, there are alternatives. Countries (and developed countries in particular should take the lead on this) can stop chasing the money through tax policy and focus on other ways of competing for investment capital. Education, productivity, infrastructure, network effects, low administrative and compliance costs are all important factors in the assessment of how attractive a location is for investors. California, for example, is not the home of Silicon Valley because it has low taxes on capital. Pulling the ‘lower taxes to attract investment’ lever is essentially the lazy option.

Consumption vs. Savings

The second point raised by McArdle is the argument that if you reduce the returns from investing (by raising tax rates), people will substitute away from saving and investing (future consumption) and instead spend the money now (immediate consumption).

The way to think of this is not of someone cashing in all their assets and going on a spending spree because the capital gains tax rate increased. That is extremely unlikely to happen and would actually make no sense. The change will come on the margin – because the returns on investment have decreased slightly (for certain asset types), there will be slightly less incentive to save and invest. As a result, over time, less money ends up being invested and is instead consumed.

But let’s consider who would be affected. If we think about the vast majority of people, their only exposure to capital gains is through their pension fund and the property they live in, neither of which would be affected by increasing the individual capital gains tax rate. Day traders, high frequency traders and anyone holding stocks for less than a year on average would also be unaffected. Most investors in start-ups do so through investment vehicles that are, again, not subject to individual capital gains tax[5]. That leaves two main groups of investors impacted by an increase in the capital gains tax rate for individuals:

  1. Property investors
  2. High net worth individual investors

Given property investing is not what most people are thinking about when concerns about capital gains tax rates reducing investment are raised, let’s focus on high wealth investors.

The key issue when considering how these investors would be affected by an increase in the capital gains tax rate is identifying what drives them to invest in the first place. Many of them literally have more money than they could ever spend, which means their investment decisions cannot be driven by a desire for future consumption. Many of their kids will never want for anything either, so even ensuring the financial security of their kids is not an issue. The only real motivation that can be left is simply status, power and prestige. Or as the tech industry has helpfully rebadged it – ‘making the world a better place.’

If that is the motivation though, does a rise in the capital gains tax rate change that motivation?

To my mind, the answer to that question is ‘No’. These people are already consuming everything they want, or in economic parlance, their desire for goods and services has been satiated. They will gain no additional pleasure (‘utility’) from diverting savings to consumption, so there is no incentive to do so even when the gains from investing are reduced.

Of course, there are exceptions, and it is quite possible (even likely) that there are high net worth individuals who live somewhat frugally and as a result of this policy change would really start splashing out. The question is how significant this amount of lost investment would be, and whether the loss of that investment capital outweighs the cost to society more widely of a deduction that flows almost entirely to the wealthy.

The Research

Putting this piece together, I have studiously attempted to avoid confirmation bias.[6] Despite the fact that I would benefit personally from lower tax rates on capital gains (well, at least I would if my portfolio would increase in value for a change), I definitely want to believe that aligning capital gains tax rates with the tax rates on normal income would raise significant amounts of tax, mostly from wealthy individuals, with few negative consequences.

In my attempts to avoid confirmation bias, I have deliberately searched for articles and research papers providing empirical evidence that lower capital gains tax rates lead to higher rates of savings, investment and/or economic growth. I have not been able to find any. There were some papers that claimed to show that decreasing capital gains tax rates actually increased tax revenue, but reading the Australian section of this paper (about which I have some knowledge), it quickly became clear this conclusion had been reached using a combination of cherry-picking dates[7] and leaving out important details.[8]

I did also find some papers that, through theoretical models, concluded higher taxes on capital income would cause a range of negative impacts. But the problem with papers that rely on theoretical models is that for every paper based on a theoretical model that concludes “… a capital income tax… reduces the number of entrepreneurs…”, there is another paper based on a theoretical model that concludes “… higher capital income taxes lead to faster growth…”.

Leaving research aside, there were a number of articles supporting the lowering or removing of capital income taxes. The problem is they all recite the same old arguments (“it will cause lock-in!”) and tend to come from a very specific type of institution. Without going too much into what type of institution, let me just list where almost all the material I located was coming from (directly or indirectly):

Even when I found an article from a less partisan source (Forbes), it turned out to be written by a senior fellow at the Cato Institute, and was rebutted by another article in the same publication.

Of course we should not ignore what people say because they work for a certain type of institution – just because they have an agenda does not mean they are wrong. In fact, it stands to reason that organizations interested in reducing taxation and limiting government would research this particular topic. The problem is that if there are genuine arguments being made, they are being lost amongst the misleading and the nonsensical.

Take this argument for lower taxes on capital as an example. First there is a chart taken from this textbook:

Capital per Worker vs. Income per Worker

The article then uses this as evidence to suggest more capital equals more income for workers. As straightforward as this seems, what this conclusion misleadingly skips over is:

  • income per worker is not equivalent to income for workers, and
  • almost all the countries towards the top right hand corner of this chart (i.e. the rich ones) got to their highly capital intensive states despite having high taxes on capital.

A Change in Attitude?

The timing of this article seems to have conveniently coincided with the announcement by Hillary Clinton of a new policy proposal – a ‘Fair Share Surcharge’. In short, the surcharge would be a 4% tax on all income above $5 million, regardless of the source. Matt Yglesias has done a good job of outlining the details in this article if you are interested.

The interesting aspect of this policy is, given the lower rate of tax typically applied to dividends and capital gains, it is a larger percentage increase in taxes on capital income than wage income. Of course, unless something major changes, this policy is very unlikely to make it past Congress and so may simply be academic, but at least it shows one side of politics may be starting to question the idea that taxes on capital should always be lower.

The Data

Finally, I want to finish up with a few charts. The charts below show how various economic indicators changed as various changes were made to the rate of capital gains tax, historically and across countries. Please note, these charts should not be taken as conclusive evidence one way or the other. The curse of economics is the inability to know (except in rare circumstances) what would have happened if a tax rate had not been raised, or if an interest rate rise had been postponed. The same applies with changes to the capital gains tax rate. Without knowing what would have happened if the capital gains tax rate had not been changed, we cannot draw firm conclusions as to what the result of that change was.

However, what we can see is that the indicators shown below do not seem to be significantly affected by changes in the capital gains tax rate, one way or the other – the effects appear to be drowned out by larger changes in the economy. That could be considered a conclusion in itself.

Chart1 – Maximum Long Term CGT Rate vs. Personal Savings rate, US 1959 to 2014

Chart 2 – Maximum Long Term CGT Rate vs. Annual GDP Growth, US 1961 to 2014

Chart 3 – Maximum Long Term CGT Rate vs. Gross Savings, Multiple Countries, 2011-2015 Average

Gross savings are calculated as gross national income less total consumption, plus net transfers. This amount is then divided by GDP (the overall size of the economy) to normalize the value across countries.

Chart 4 – Maximum Long Term CGT Rate vs. Gross Fixed Capital Formation, Multiple Countries, 2011-2015 Average

Gross fixed capital formation is money invested in assets such as land, machinery, buildings or infrastructure. For the full definition, please see here. This amount is then divided by GDP (the overall size of the economy) to normalize the value across countries.

Chart 5 – Maximum Long Term CGT Rate vs. Gini Index, 2011-2015 Average

The Gini index is a measure of income inequality within a country. A Gini index of 100 represents a country in which one person receives all of the income (i.e. total inequality). An index of 0 represents total equality.

 

[1] Interestingly, two of these four deductions (mortgage interest and employer-sponsored health insurance) will be completely foreign to Australians.

[2] A similar policy (50% tax discount for capital gains) in Australia costs around AUD$6-7 billion per year.

[3] The risk that the exchange rate changes and has an adverse impact on the value of your investments.

[4] The risk that the government of the country you are investing in will change the rules in such a way to hurt your investments.

[5] Capital Gains Tax Policy Toward Entrepreneurship, James M. Poterba, National Tax Journal, Vol. 42, No. 3, Revenue Enhancement and Other Word Games: When is it a Tax? (September, 1989), pp. 375-389

[6] Confirmation bias is the tendency of people, consciously or subconsciously, to disregard or discount evidence that disagrees with their preconceived notions while perceiving evidence that confirms those notions as more authoritative.

[7] “After Australian CGT rates for individuals were cut by 50% in 1999 revenue from individuals grew strongly and the CGT share of tax revenue nearly doubled over the subsequent nine years.” Note the carefully selected time period includes the huge run up in asset prices from 2000 to 2007 and avoids the 2008 financial crisis, which caused huge declines in CGT revenues.

[8] “Individuals enjoyed a larger discount under the 1999 reforms than superannuation funds (50% versus 33%), yet yielded a larger increase in CGT payable.” This neglects to mention that even after the discounts were applied, the rate of capital gains tax for almost all individuals was still higher than for superannuation funds.

Web Analytics – Looking Under the Hood

On occasion I get the sense from bloggers that talking about your traffic statistics is a bit like talking about salary – not something to be done amongst polite company. However, unlike discussing pay, which can generate bad feelings, jealousy, poor morale and a range of other negative side effects, discussing website stats should provide a great learning opportunity for everyone taking part. With that said, in the name of transparency, let me offer a peek under the hood here at BrettRomero.com.

Overall Traffic

For those that have not looked at web traffic statistics before, first a quick introduction. When it comes to web traffic, there are two primary measures of volume – sessions and page views. A session is a continuous period of time that one user spends on a website. One session can result in multiple page views – or just one if the user leaves after reading a single article, as is often the case. Chart 1 below shows the traffic to BrettRomero.com, as measured in sessions per day.

Chart 1 – All Traffic – Daily


There are a couple of large peaks worth explaining in this chart. The first peak, on 3 November 2015, was the day I discovered just how much traffic Reddit.com can generate. I posted to the TrueReddit subreddit what, to that point, had been by far my most popular article – 4 Reasons Working Long Hours is Crazy. The article quickly gained over 100 upvotes and, over the course of the day, generated well over 500 sessions. To put that in perspective, the traffic generated from that one post on Reddit in one day is greater than all traffic from LinkedIn and Twitter combined… for the entire time the blog has been online.

The second big peak on 29 December 2015 was also a Reddit generated spike (in fact, all four spikes post 3 November were from Reddit). In this instance it was the posting of the Traffic Accidents Involving Cyclists visualization to two subreddits – the DataIsBeautiful subreddit and the Canberra subreddit.

Aside from these large peaks though, the data as represented in Chart 1 is a bit difficult to decipher – there is too much noise on a day-to-day basis to really see what is going on. Chart 2 shows the same data at a weekly level.

Chart 2 – All Traffic – Weekly


Looking at the weekly data, the broader trend seems to show two different periods for the website. The first period, from March to around August, has more consistent traffic, around 200 sessions a week, but with smaller spikes. The second period, from August onwards, shows less consistent traffic, around 50 sessions a week, but with much larger spikes. But how accurate is this data? Let’s break some of the statistics down.

Breakdown by Channel

When looking at web traffic using Google Analytics, there are a couple of breakdowns worth looking at. The first is the breakdown by ‘channel’ – or how users got to your website for a given session. The four channels are:

  1. Direct – the user typed your website URL directly into the address bar
  2. Referral – the user navigated to your site from another (non-social media) website by clicking on a link
  3. Social – the user accessed your website from a social media website (Facebook, Twitter, Reddit, LinkedIn and so on)
  4. Organic Search – a user searched for something in a search engine (primarily Google) and clicked on a search result to access your site.

The breakdown of sessions by channel for BrettRomero.com is shown in Table 1 below:

Table 1 – Breakdown by Channel

Channel Grouping      Sessions
Direct                   2,923
Referral                 2,776
Social                   2,190
Organic Search             567
Total                    8,456

Referral Traffic

Looking at referral traffic specifically, Google Analytics allows you to view which specific sites you are getting referral traffic from. This is shown in Table 2.

Table 2 – Top Referrers

Rank  Source                              Sessions
1     floating-share-buttons.com               706
2     traffic2cash.xyz                         177
3     adf.ly                                   160
4     free-share-buttons.com                   152
5     snip.to                                   74
6     get-free-social-traffic.com               66
7     www.event-tracking.com                    66
8     claim60963697.copyrightclaims.org         63
9     free-social-buttons.com                   57
10    sexyali.com                               50
Total All Referral Traffic                 2,776

Looking at the top 10 referrers to BrettRomero.com, the first thing you may notice is that these site addresses look a bit… fake. You would be right. What you are seeing above is a prime example of what is known as ‘referrer spam’. In order to generate traffic to their sites, some unscrupulous people use a hack that tricks Google Analytics into recording visitors to your site coming from a URL they want you to visit. In short, they are counting on you looking at this data, getting curious and trying to work out where all this traffic is coming from. Over time these fake hits can build up to significant levels.

There are ways to customize your analytics to exclude traffic from certain domains, and initially I was doing this. However, I quickly realized that this spam comes from an almost unlimited number of domains and trying to block them all is basically a waste of time.
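If you would rather do the clean-up after the fact than maintain an endless blacklist, one option is to export the referral data and whitelist only the domains you know to be genuine. Below is a minimal sketch of that idea using pandas; the file name and column names are placeholders for whatever your analytics export actually contains.

  import pandas as pd

  # Hypothetical export of referral traffic: one row per referring domain
  referrals = pd.read_csv("referral_traffic.csv")  # columns: source, sessions

  # Domains manually confirmed as genuine referrers (see Table 3)
  genuine_sources = {
      "uberdriverdiaries.com",
      "vladimiriii.github.io",
      "australiancraftbeer.org.au",
      "alexa.com",
      "opendatakosovo.org",
  }

  # Keep the whitelisted rows; everything else gets treated as referrer spam
  genuine = referrals[referrals["source"].isin(genuine_sources)]
  spam_sessions = referrals.loc[~referrals["source"].isin(genuine_sources), "sessions"].sum()

  print(genuine.sort_values("sessions", ascending=False))
  print("Sessions attributed to referrer spam:", spam_sessions)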

Looking at the full list of sites that have ‘referred’ traffic to my site, I can actually only find a handful of genuine referrals. These are shown in Table 3.

Table 3 – Genuine Referrers

Rank  Source                         Sessions
17    uberdriverdiaries.com                35
18    vladimiriii.github.io                33
72    australiancraftbeer.org.au            3
76    alexa.com                             2
95    opendatakosovo.org                    1
Total Genuine Referral Traffic             74
Total Referrer Spam                     2,702

What does the total traffic look like if I exclude all the referrer spam? Chart 3 below shows the updated results.

Chart 3 – All Traffic Excluding Referrals


As can be seen, a lot of the traffic in the period March through August was actually coming from referrer spam. Although May still looks to have been a strong month, April, June and July now appear to be hovering around that baseline of 50 sessions a week.

Search Traffic

Search traffic is generally the key channel for website owners in the long term. Unlike traffic from social media or from referrals, it is generated on an ongoing basis without additional effort (posting, promotion and so on) on the part of the website. As you would expect though, getting onto the first page of search results for any regularly searched combination of keywords is very difficult. In fact, it is so difficult that an entire industry has developed around trying to achieve it – Search Engine Optimization, or SEO.

For BrettRomero.com, search traffic has been difficult to come by for the most part. Below is a chart showing all search traffic since the website started:

Chart 4 – Search Traffic – All


Keeping in mind the y-axis in this chart is on a smaller scale than the previous charts, there doesn’t seem to be much pattern to this data. August again seemed to be a strong month, as well as the weeks in late May and early June. Recent months have been flatter, but more consistent.

Going one step further, Table 4 shows the keywords that were searched by users to access BrettRomero.com.

Table 4 – Top Search Terms

Rank  Keyword                                                                                        Sessions
1     (not provided)                                                                                      272
2     beat with a shovel the weak google spots addons.mozilla.org/en-us/firefox/addon/ilovevitaly/         47
3     erot.co                                                                                              45
4     непереводимая.рф                                                                                     40
5     “why you probably don’t need a financial advisor”                                                    33
6     howtostopreferralspam.eu                                                                             32
7     sexyali.com                                                                                          16
8     vitaly rules google ☆*:.。.゚゚・*ヽ(^ᴗ^)丿*・゚゚.。.:*☆ ¯\_(ツ)_/¯(•ิ_•ิ)(ಠ益ಠ)(ಥ‿ಥ)(ʘ‿ʘ)ლ(ಠ_ಠლ)( ͡° ͜ʖ ͡°)ヽ(゚д゚)ノʕ•̫͡•ʔᶘ ᵒᴥᵒᶅ(=^. .^=)oo      14
9     http://w3javascript.com                                                                              13
10    ghost spam is free from the politics, we dancing like a paralytics                                   11

Again, we see something unexpected – most of the keywords are actually URLs or nonsensical phrases (or both). As you might suspect, this is another form of spam. Other website promoters are utilizing another hack – this one tricks Google Analytics into recording a search session, with the keyword being a message or URL the promoter wants to display. Looking at the full list, the only genuine search traffic appears to be the records for which keywords are not provided[1]. Chart 5 shows search traffic with the spam excluded.

Chart 5 – Search Traffic – Spam Removed


With the spam removed, we see something a little bit more positive. After essentially nothing from March through July, we see a spike in activity in August and September, before falling back to a new baseline of around 5-10 sessions per week. Although this is obviously still minuscule, it does suggest that the website is starting to show up regularly in people’s searches.

Referring back to the total sessions over time, Chart 6 shows how removing the spam search impacts our overall number of sessions chart.

Chart 6 – All Traffic Excluding Referrals and Spam Search

Social Traffic and the Reddit Effect

As was shown in Table 1, one of the two main sources of (real) traffic for the website is social media.

Social media provides a real bonus for people who are starting from zero. Most people now have large social networks they can utilize, allowing them to get their content in front of a lot of people from a very early stage. That said, there is a line, and spamming your friends with content continuously is more likely to get you muted than to generate additional traffic.

Publicizing content on social media can also be a frustrating experience. Competing against a never-ending flood of viral memes and mindless, auto-generated content designed specifically to generate clicks can often feel like a lost cause. However, even though it seems like posts simply get lost amongst the tsunami of rubbish, social media is still generally a good indicator of how ‘catchy’ a given article is. Better content will almost always generate more likes/retweets/shares.

In terms of the effectiveness of each social media platform, Reddit and Facebook have proven to be the most effective for generating traffic by some margin. Table 5 shows sessions by social media source.

Table 5 – Sessions by Social Media Source

Rank  Social Network    Sessions
1     Reddit                 999
2     Facebook               868
3     Twitter                224
4     LinkedIn                69
5     Blogger                 26
6     Google+                  3
7     Pocket                   1

When looking at the above data, also keep in mind that I only started posting to Reddit at the start of November, effectively giving Facebook a seven month head start. That makes Reddit by far the most effective tool I have found to date for getting traffic to the website. However, there is a catch to posting on Reddit – the audience can be brutal.

Generally on Facebook, Twitter and LinkedIn, people who do not agree with your article will just ignore it. On Reddit, if people do not agree with you – or worse still, if they do not like your writing – they will comment and tell you. They will not be delicate. They will down vote your post (meaning they are actively trying to discourage other people from viewing it). Finally, just to be vindictive, they will down vote any comments you make as well. If you are planning to post on Reddit, make sure you read the rules of the subreddit (many explicitly ban people from promoting their own content) and try to contribute in ways that are not just self‑promotional.

Pages Visited

Finally, let’s look at one more breakdown for BrettRomero.com. Table 6 shows the top 10 pages viewed on the site.

Table 6 – 10 Most Viewed Pages

Rank  Page                                                                      Pageviews
1     /                                                                             4,345
2     /wordpress/                                                                   1,450
3     /wordpress/4-reasons-working-long-hours-is-crazy/                             1,038
4     /cyclist-accidents-act/                                                         773
5     /wordpress/climbing-mount-delusion-the-path-from-beginner-to-expert/            306
6     /wordpress/the-dark-side-of-meritocracy/                                        205
7     /wordpress/why-australians-love-fosters-and-other-beer-related-stories/         194
8     /blog.html                                                                      192
9     /?from=http://www.traffic2cash.xyz/                                             177
10    /wordpress/visualizations/                                                      165

As mentioned earlier, 4 Reasons Working Long Hours is Crazy has been by some margin my most popular article. Reddit gave this article a significant boost traffic-wise, and with over 100 upvotes it was also by some margin the best performing article I have posted there. The next best performer, the Traffic Accidents Involving Cyclists visualization, only managed 20 upvotes.

Overall

As I mentioned at the outset, web traffic statistics tend to be a subject that is not openly discussed all that often. As a result, I have little idea how good or bad these statistics are. Given I have made minimal effort to promote my blog, generate back links (incoming links from other websites) or get my name out there by guest blogging, I suspect that these numbers are pretty unimpressive in the wider scheme of things. Certainly I am not thinking about putting up a pay wall any time soon anyway.

As unimpressive as the numbers may be though, I hope they have provided an interesting glimpse into the world of web analytics and, for those other bloggers out there, some sort of useful comparison.

 

Spotted something interesting that I missed? Please leave a comment!

 

[1] For further information on why the keywords are often not provided, this article has a good explanation.

5 Things I Learned in 2015

2015 has been an interesting year in many respects. A new country[1], a new language, a new job, and plenty of new experiences – both at work and in life in general. To get into the year-end spirit, I thought I would list out 5 key things I learned this year.

1. I Love Pandas

Yes, those pandas as well, who doesn’t? But I knew that well before 2015. The pandas I learned to love this year is a data analysis library for the programming language Python. “Whoa, slow down egg head” I hear you say. For those that are not regular coders, what that means is that pandas provides a large range of ways for people writing Python code to interact with data that makes life very easy.

Reading from and writing to Excel, CSV files and JSON (see lesson number 2) is super easy and fast. Manipulating large datasets in table-like structures (dataframes) – check. Slicing, dicing, aggregating – check, check and check. In fact, as a result of pandas, I have almost entirely stopped using R[2]. All the (mostly basic) data manipulation for which I used to use R, I now use Python. Of course R still has an important role to play, particularly when it comes to complex statistical analysis, but that does not tend to come up all that regularly.
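To give a sense of what that looks like in practice, here is a minimal sketch of the read, slice and aggregate workflow described above; the file name and column names are invented for illustration.

  import pandas as pd

  # Reading data in is a one-liner, whether the source is CSV, Excel or JSON
  df = pd.read_csv("sessions.csv")  # e.g. columns: date, channel, sessions

  # Slicing and dicing: keep only the social media traffic from 2015
  social_2015 = df[(df["channel"] == "Social") & (df["date"].str.startswith("2015"))]

  # Aggregating: total sessions per channel, largest first
  totals = df.groupby("channel")["sessions"].sum().sort_values(ascending=False)
  print(totals)

  # Writing the result back out is just as easy
  totals.to_csv("sessions_by_channel.csv")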

2. JSON is Everywhere

JSON, JavaScript Object Notation for the uninitiated, is a data interchange format that has become the default way of transferring data online. Almost any time you see data displayed on a webpage, including all the visualizations on this website, JSON is the format the underlying data is in.

JSON has two big advantages that have led to its current state of dominance. The first is that, as the name suggests, it is native to JavaScript – the key programming language, alongside HTML, that is interpreted by the browser you are reading this on. The second is that JSON is an extremely flexible way of representing data.

However, as someone who comes from a statistics and data background, as opposed to a technology background, JSON can take a while to get used to. The way data is represented in JSON is very different to the traditional tables of data that most people are used to seeing. Gone are the columns and rows, replaced with key-value pairs and lots of curly brackets – “{“ and “}”. If you are interested in seeing what it looks like, there are numerous CSV to JSON convertors online. This one even has a sample dataset to play with.
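As a quick illustration, the snippet below takes two rows of made-up data and prints them as JSON using Python’s standard library, which is enough to see the key-value pairs and curly brackets in action.

  import json

  # The same information you might keep in a small two-column table
  rows = [
      {"name": "Alice", "age": 34},
      {"name": "Bob", "age": 29},
  ]

  # Columns and rows are replaced by key-value pairs and curly brackets
  print(json.dumps(rows, indent=2))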

If you do bother to take a look at some JSON, you will note that it is also much more verbose than your standard tabular format. A table containing 10 columns by 30 rows – something that could easily fit onto one screen in a spreadsheet – runs to 300+ lines of JSON, depending on how it is structured. That does not make it easy for a human reader to get an overview of the data, but that criticism overlooks what JSON is designed for – to be read by computers. The fact that a human can read it at all is seen as one of JSON’s strengths.

For those interested in working with data (or any web based technology), knowing how to read and manipulate JSON is becoming as important as knowing how to use a spreadsheet.

3. Free Tools are Great

There are some people working for software vendors who will read this and be happy I have a very small audience. Having worked in the public sector, for a large corporate and now for a small NGO, one thing I have been pleasantly surprised by in 2015 is the number and quality of free tools available online.

For general office administration there are office communicator applications (Slack), task management tools (Trello) and Google’s free replacements for Excel, Word and PowerPoint. For version control and code management there is GitHub. For data analysis, the aforementioned Python and R are both free and open source. For data storage, there is a huge range of free database technologies available, in both SQL (PostgreSQL, MySQL, SQLite3) and NoSQL (MongoDB, Redis, Cassandra) variations.

To be fair to my previous larger employers and my software-selling friends, most of these tools/applications do have significant catches. Many operate on a ‘freemium’ model. This means that for individuals and small organizations with relatively few users, the service is free (or next to free), but costs quickly rise when you need larger numbers of users and/or want access to additional features, typically the types of features larger organizations need. Many of the above also provide no tech support or guarantees, meaning that executives have no one to blame if the software blows up. If you are responsible for maintaining the personal data of millions of clients, that may not be a risk you are willing to take.

For small business owners and entrepreneurs however, these tools are great news. They bring down barriers to entry for small businesses and make their survival more dependent on the quality of the product rather than how much money they have. That is surely only a good thing.

4. Blogging is a Full Time Job

Speaking of starting a business, a common dream these days is semi-retiring somewhere warm and writing a blog. My realization this year from running a blog (if only part time) is just how difficult it is to get any traction. Aside from being able to write reasonably well, there are two main hurdles that anyone planning to become a full time blogger needs to overcome – note that I have not come close to accomplishing either of these:

  1. You have to generate large amounts of good quality content – at least 2-3 longer form pieces a week if you want to maintain a consistent audience. That may seem easy, but after you have quickly bashed out the 5-10 article ideas you have been mulling over, the grind begins. You will often be writing things that are not super interesting to you. You will often not be happy with what you have written. You will quickly realize that your favorite time is the time immediately after you have finished an article and your least favorite is when you need to start a new piece.
  2. You will spend more time marketing your blog than writing. Yep, if you want a big audience (big enough to generate cash to live on) you will need to spend an inordinate amount of time:
    • cold emailing other blogs and websites, asking them to link to your blog (‘generating back links’ in blogspeak)
    • ensuring everything on your blog is geared towards your blog showing up in people’s Google search results (Search Engine Optimization or SEO)
    • promoting yourself on Facebook
    • building a following on Twitter
    • contributing to discussions on Reddit and LinkedIn to show people you are someone worth listening to, and
    • writing guest blogs for other sites.

None of this is easy: begging strangers for links, incorporating ‘focus words’ into your page titles and headings, posting links on Facebook to something you spent days writing, only to find you get one like (thanks Mum!). Meanwhile, some auto-generated, barely readable click-bait trash from ‘viralnova’ or ‘quandly’ (yes, I am deliberately not linking to those sites) is clocking up likes in the 5 figures. It can be downright depressing.

Of course, there are an almost infinite number of people out there offering their services to help with these things (I should know, they regularly comment on my articles telling me how one weird trick can improve my ‘on page SEO’). The problem is, the only real help they can give you is adding more things to the list above. On the other hand, if you are thinking about paid promotion (buying likes or a similar strategy), I’d recommend watching this video:

Still want to be a blogger? You’re welcome.

5. Do not be Afraid to Try New Things

One of the things that struck me in 2015 is how attached people get to doing things a certain way. To a large degree this makes sense: the more often you use/do something, the better you get at it. I am very good at writing SQL and using Excel – I have spent most of the last 10 years using those two things. As a result, I will often try to use those tools to solve problems because I feel most comfortable using them.

Where this becomes a problem is when you start trying to shoehorn problems into tools not just because you are comfortable with the tool, but to avoid using something you are less comfortable with. As you have seen above, two of the best things I learned this year were two concepts that were completely foreign to a SQL/Excel guy like me. But that is part of what made learning them so rewarding. I gained a completely new perspective on how data can be structured and manipulated and, even though I am far from an expert in those new skills, I now know they are available and which sorts of problems they are useful for.

So, do not be afraid to try new things, even if the usefulness of that experience is not immediately apparent. You never know when that skill might come in handy.

 

Happy New Year to everyone, I hope you have a great 2016!

 

[1] Or ‘Autonomous Province’ depending on your political views

[2] R is another programming language designed specifically for statistical analysis, data manipulation and data mining.

Traffic Accidents Involving Cyclists in the ACT

I’ve had a few days off lately and I decided to try something a bit different. Instead of writing an(other) lengthy article, I thought I would go back to my roots and actually look at some data. To that end I recently discovered a website for open data in Australia, data.gov.au. This website has literally thousands of interesting datasets released from all levels of government, covering everything from the tax bills of Australia’s largest companies to the locations of trees in Ballarat.

One of the first datasets that caught my eye was one published by the Australian Capital Territory (ACT) Government on traffic accidents involving cyclists. For those that don’t know, Canberra (the main city in the ACT) is a very bike friendly city and is home to a large number of recreational and more serious cyclists, so seeing where the accidents were/are occurring was something I thought would be interesting.

Using a few new things I have not used before (primarily Mapbox and leaflet.js), I put (slapped?) together an interactive map that uses the data provided and also gives you a few different ways of viewing it. The full version of the map can be accessed by clicking the picture below:

cyclist-map

 

See a bug? Found it particularly useful? Hate it? Leave a comment below!

Women in the Workplace – Understanding the Data

Cross Posted from OpenDataKosovo.org:

Continuing our series on Gender Inequality and Corruption in Kosovo (see Part I and Part II), in Part III and the next few parts, we are going to take a detailed look at the problems women face in the labour market in Kosovo.

To do this, we will be using information from several sources, including data on participation rates, by gender, from the Gender Statistics database at the World Bank, and a range of labour market statistics from various Kosovo Labour Force Surveys, released by the Kosovo Agency of Statistics.

High Level Concepts

Before diving into the statistics, let’s first visualize and explain some of the high level concepts in labour market statistics.

Chart 1 – Population Breakdown 2014

WAC_3_1

At the highest level, the section of the population that is relevant when looking at labour market statistics is people who are of working age and are able to work. In Kosovo, this population includes all people aged 15 to 64 and is known as the ‘working age population’.

Labour Force and Inactive Populations

At the next level, the working age population can be broken down into two main subgroups – those that are considered in the labour force (i.e. ‘participating’) and those that are ‘inactive’. It is important to note that someone who is ‘inactive’ is not the same as someone who is ‘unemployed’. In Kosovo, to be considered ‘actively looking for work’ (and therefore be classified in the labour force) the following criteria must be met. The person must be:

  • currently available for work, that is, available for paid employment or self-employment within two weeks; and
  • seeking work, that is, have taken specific steps in the previous four weeks to seek paid employment or self-employment.

If either of the above criteria is not met, the person is classified as inactive.
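Expressed as a simple decision rule in code, the classification looks something like the sketch below; the function and argument names are mine, not the Kosovo Agency of Statistics’.

  def is_in_labour_force(available_within_two_weeks, sought_work_last_four_weeks):
      # Both 'actively looking for work' criteria must be met;
      # failing either one means the person is classified as inactive
      return available_within_two_weeks and sought_work_last_four_weeks

  print("labour force" if is_in_labour_force(True, True) else "inactive")   # labour force
  print("labour force" if is_in_labour_force(True, False) else "inactive")  # inactive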

Calculating the Participation Rate

Once the population is classified as either in the labour force or inactive, it is possible to calculate the participation rate, one of the key labour market statistics. The participation rate measures the labour force population (people employed and/or actively looking for work) as a percentage of the working age population.

WAC_E_3_1
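In code, the calculation is straightforward; the counts below are invented purely to show the arithmetic.

  def participation_rate(labour_force, working_age_population):
      # Labour force (employed plus actively looking) as a share of the working age population
      return labour_force / working_age_population * 100

  # Hypothetical counts, for illustration only
  print(f"{participation_rate(500_000, 1_200_000):.1f}%")  # 41.7%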

In Kosovo, the participation rates in 2014 were as follows:

  • Male Participation Rate (2014): 61.8%
  • Female Participation Rate (2014): 21.4%
  • Overall Participation Rate (2014): 41.6%

Unlike the unemployment rate, described below, the participation rate tends to provide stable and reliable data, as it is not affected by short-term fluctuations and the business cycle.

Employed vs. Unemployed

Analyzing the population further, the ‘labour force’ can be subdivided into two populations – those that are employed and those that are unemployed. In most cases it is obvious whether someone is employed or not, but in some situations it may not be so clear (e.g. when a person is working for the family business in an unpaid capacity). To handle these scenarios, the agency tasked with compiling the labour market statistics in each country typically has a specific definition (or definitions) of what qualifies as employment. In Kosovo, to be classified as ‘employed’ a person must meet the following high-level criteria:

“People who during the reference week performed some work for wage or salary, or profit or family gain, in cash or in kind or were temporarily absent from their jobs.”

In addition, the Kosovo Agency of Statistics includes some more detailed criteria in its methodology that clarify when work done on family-owned farms counts as employment. This will become important later.

Calculating the Unemployment Rate

Having separated the employed from the unemployed, it is now possible to calculate the unemployment rate. To do this, we divide the number of unemployed people by the total number of people in the labour force.

WAC_E_3_2
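Again, as a minimal sketch with made-up counts, note that the denominator this time is the labour force, not the whole working age population.

  def unemployment_rate(unemployed, labour_force):
      # Unemployed people as a share of the labour force
      return unemployed / labour_force * 100

  # Hypothetical counts, for illustration only
  print(f"{unemployment_rate(175_000, 500_000):.1f}%")  # 35.0%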

In Kosovo, the unemployment rates in 2014 were as follows:

  • Male Unemployment Rate (2014): 33.1%
  • Female Unemployment Rate (2014): 41.6%
  • Overall Unemployment Rate (2014): 35.3%

The unemployment rate is useful as a more immediate indicator of conditions in the economy. The obvious information it provides is how many people without a job are currently looking for employment. But it also provides information about how much spare capacity an economy has, the risk that inflation may pick up, whether structural issues are keeping people out of work and so on.

Chart 2 – Participation and Unemployment Rates by Gender 2014

What is Next?

In the next article, we will take a look at how the participation rate (for both males and females) in Kosovo compares across the region and internationally. In the meantime, please feel free to play around with the interactive visualization below, which shows the entire working age population of Kosovo broken down into its various subgroups.

Click on the chart below to interact with the data!

sunburst_pic

Sunburst chart created by Festina Ismali

 

 

Corruption in Kosovo: A Comparative Analysis

Cross posted from OpenDataKosovo.org:

Previously in Part I of this series, we looked at corruption in Kosovo from the perspective of Kosovo civil servants, as documented in a United Nations Development Programme (UNDP) report entitled Gender Equality Related Corruption Risks and Vulnerabilities in Civil Service in Kosovo[1].

In Part II we are now going to look at global corruption perception statistics compiled by Transparency International to consider how Kosovo compares internationally.

An International Comparison of Corruption

Transparency International is an organization that works to reduce corruption[2] through increasing the transparency of Governments around the world. Arguably Transparency International’s most well known contribution is the Corruption Perceptions Index (CPI), an index measuring “the perceived levels of public sector corruption worldwide”. In 2014[3] the CPI was calculated by aggregating 12 indices and data sources collected from 11 different independent institutions specializing in governance and business climate analysis over the past 24 months. The 2014 CPI covered 175 countries, including Kosovo.

In addition to the CPI, Transparency International does its own survey and data collection in the form of the Global Corruption Barometer (GCB survey). The GCB survey focuses on the public’s opinion of corruption within their own country, and in 2013 (the latest edition of the GCB available at the time of writing) collected the opinions of over 114,000 people across 107 countries – including Kosovo.

So what did these two reports show?

Results

In the CPI, Kosovo performs poorly, placing 110th out of 175 countries with a score of 33 out of 100 (unchanged from 2013). To give some perspective, Kosovo finished equal 110th with 4 other countries – Albania, Ecuador, Ethiopia, and Malawi. This placed it behind Argentina (107th), Mexico (103rd), China (100th), India (85th) and Greece (69th), countries that are often associated with high levels of corruption. Finally, this was the lowest ranking for any country in the Balkans region (tied with Albania).

Chart 1 – GCB Survey Q6 – Perceptions of Corruption by Institution for 6 Countries

WAC_2_1

The GCB survey, however, shows that the people in Kosovo have a different perception of corruption in several areas to that reported in the CPI. Based on the responses to question 6[4] (see Chart 1) and question 7[5] (see Chart 2) of the GCB survey, people in Kosovo are somewhat more optimistic about the levels of corruption in their country than the low rating on the CPI might indicate. Kosovo scores well in several areas:

  • Only 16% of people reported having paid a bribe in the last 12 months. This placed Kosovo 35th out of the 95 countries that provided a response to question 7.
  • 46% of Kosovars generally believe their public institutions to be corrupt or extremely corrupt. This sounds high but actually puts Kosovo ahead of the US (47%) and only slightly behind Germany (40%). The results for certain institutions were even better:
    • The Military is believed to be corrupt or extremely corrupt by only 8% of those interviewed – only four countries had a lower percentage than Kosovo on this part of question 6.
    • NGOs and Religious bodies were also seen as uncorrupt by large majorities.
    • 44% of people believed public officials and civil servants were corrupt, placing Kosovo ahead of Germany, France and the US, among others.

Chart 2 – GCB Survey Q7 – Reports of Bribes Paid by Institution for 6 Countries

WAC_2_2

But not all the results were positive. Questions 1[6], 4[7] and 5[8] in the GCB survey in particular highlight a more pessimistic outlook:

  • In response to question 1, 66% of Kosovars stated that they believed corruption had increased over the past 2 years, while only 8% believed it had decreased.
  • In response to question 4, 74% of Kosovars stated they believed their Government is run by large entities largely or entirely for their own benefit.
  • In response to question 5, only 11% of Kosovars surveyed believed the actions of their Government in the fight against corruption are effective.

What does all this mean? Why does Kosovo perform so poorly on the CPI, and on some GCB survey questions, but on other questions the perceived level of corruption of people in Kosovo is comparable to some developed nations?

Perceptions vs. Reality

One of the issues when looking at the results of the GCB survey is that the responses to most of these questions are subjective. What constitutes corruption or extreme corruption varies by country and culture based on what people are used to living with. What someone in South Asia or sub-Saharan Africa considers standard practice and harmless may be considered unbelievably corrupt by people in other parts of the world.

These different standards are really highlighted when we compare the percentage of people believing an institution is corrupt with the number of people reporting to have paid a bribe to that institution, using questions 6 and 7 of the GCB survey. There are four institutions that appear as options for both questions, allowing us to make a direct comparison:

  1. Education
  2. Judiciary
  3. Medical and Health, and
  4. Police

In the comparison (see Chart 3), we find numerous examples where the percentage of people that reported paying bribes was higher than the percentage of people who believed the institution was corrupt. The implication of this finding is that significant numbers of people in these countries believe that paying a bribe is not a sign of corruption.

Chart 3 – Comparison of Perceived Corruption with Bribes Paid

WAC_2_3
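For anyone wanting to run this comparison themselves, the sketch below shows one way to do it, assuming the GCB responses are in a table with one row per country and institution; the file and column names are placeholders rather than the survey’s official field names.

  import pandas as pd

  # Hypothetical extract of GCB responses to questions 6 and 7
  gcb = pd.read_csv("gcb_q6_q7.csv")  # columns: country, institution, pct_perceived_corrupt, pct_paid_bribe

  # Flag cases where more people report paying a bribe than consider the institution corrupt
  gcb["bribes_exceed_perception"] = gcb["pct_paid_bribe"] > gcb["pct_perceived_corrupt"]

  flagged = gcb[gcb["bribes_exceed_perception"]]
  print(flagged[["country", "institution", "pct_perceived_corrupt", "pct_paid_bribe"]])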

Kosovo and most developed nations were examples of the opposite case – they generally reported relatively high numbers of people who believed the four comparable institutions were corrupt, and relatively low percentages of people reporting bribes being paid. Bribery, of course, is not the only form of corruption, and this result could simply be an indicator that different forms of corruption are more prevalent in these countries. But it could also be an indicator that people in some countries are particularly cynical about the fidelity of their institutions.

To get a better sense of how concerned people really are about corruption, let’s now take a look at some of the responses to other questions in the survey.

Is a Person’s Willingness to Take Action a Better Indicator?

One of the questions asked on the survey that could potentially reveal some further information was question 10 – “Are you willing to get involved in the fight against corruption?” Respondents were then provided with a range of activities, both active and passive, and were requested to indicate whether they would be willing to participate.

At a high level, the responses to this question appear to show an inverse correlation between the value of the CPI for a country and how willing people in that country were to do something active to fight corruption. In other words, the higher the percentage of people willing to do something active to fight corruption, the lower the CPI index for that country (i.e. a higher level of corruption).

Using a statistical model (such as regression), we can check whether this relationship is real and how strong it is. However, to do this, we need to account for countries with regimes that punish dissent and crack down on protests and/or organizations that might try to combat corruption. In these countries, you would expect a low percentage of people to be willing to take action against corruption despite corruption being high.

To account for this, we need to have some sort of indicator of how worried people are about speaking out in their country. The best piece of information that we have from the GCB survey that can serve this purpose was the question asking if the respondent would be willing to report corruption.

Using these two pieces of information, we can try to test the following hypotheses:

  1. A high percentage of people willing to take action against corruption in a given country is indicative of a high level of corruption.
  2. A low percentage of people willing to take action against corruption in a given country, but a high level willing to report corruption is indicative of a low level of corruption.
  3. A low percentage of people willing to take action against corruption in a given country, and a low level willing to report corruption is indicative of a high level of corruption in a repressive regime.

Based on these hypotheses, we would also expect that there would be no (or very few) cases where there is a high percentage of people willing to take action against corruption and a low percentage willing to report corruption.

Building a Model

Using our two pieces of information described above, and with the assumption that the CPI is the most accurate indicator of the true level of corruption within a country[9], we can build a model to predict CPI for each country and test our hypotheses. The formula for this model will be as follows:

Yi = β0 + β1Xi1 + β2Xi2 + εi

Where:

Yi = the actual value of CPI for country i

β0 = a constant

Xi1 = the percentage of people willing to do something active to fight corruption[10] in country i

β1 = a constant applied to Xi1

Xi2 = the percentage of people willing to report an incidence of corruption in country i

β2 = a constant applied to Xi2

εi = the residual or error

Using ordinary least squares (OLS) and the data for the 101 countries for which the CPI and the two variables (X1 and X2) described above are available, the results of the model are as follows:

                  β0        β1        β2
Coefficient       27.9735   -0.8417   0.8798
Standard Error     4.8427    0.0705   0.0738

R² = 66.7%
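For readers who want to reproduce this kind of model, the sketch below fits the same form of regression using statsmodels; the file and column names are placeholders for wherever the CPI scores and the two GCB-derived percentages are stored.

  import pandas as pd
  import statsmodels.api as sm

  # Hypothetical input: one row per country with the CPI and the two survey-derived percentages
  data = pd.read_csv("cpi_gcb.csv")  # columns: country, cpi, pct_active, pct_report

  # X1 = willingness to do something active, X2 = willingness to report corruption
  X = sm.add_constant(data[["pct_active", "pct_report"]])
  y = data["cpi"]

  model = sm.OLS(y, X).fit()
  print(model.summary())  # coefficients, standard errors and R-squared, as in the table above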

The first thing to note is that the coefficients support the hypotheses we outlined above:

  1. A strongly negative coefficient β1 indicates that the larger the percentage of people willing to do something active to fight corruption, the lower the predicted CPI.
  2. A strongly positive coefficient β2 indicates that the larger the percentage of people willing to report corruption, the higher the predicted CPI.

General Insights

Aside from providing support for our hypotheses, the other thing this model reveals is which countries it does not explain very well. Chart 4 shows the CPI predicted by the model as compared to the actual CPI value for 2014.

Chart 4 – Predicted CPI vs. Actual CPI by Country

WAC_2_4

At a high level, we can split the chart into two parts:

  1. Points below and to the right of the line reflect countries where the actual level of the CPI was lower than the predicted level
  2. Points above and to the left of the line reflect countries where the actual level of the CPI was higher than the predicted level.

Starting with the first group – countries that were more corrupt than the model predicted – these cases appear to fall into two categories:

  • Conflict Affected Countries – In these cases, of which Sudan is the most extreme example, there was typically a low percentage of people willing to do something active to fight corruption, and therefore the CPI was predicted to be significantly higher than it is in reality. This is likely to be due to the citizenry of these countries facing more immediate problems. This pattern was seen across Sudan, Afghanistan, Iraq, Libya and South Sudan.
  • Other – In these cases, of which Russia was the best example, there was generally a high percentage of people willing to report corruption (86% for Russia) and a relatively low percentage of people willing to do something active to fight corruption (47% in Russia). As a result the model predicted a relatively high CPI. The explanation for this is not as clear as above, but the evidence would seem to suggest that the people in these countries are either not aware of the high level of corruption present in their country, or that they have a significantly different opinion as to what constitutes corruption.

Contrasting with the above cases, we can also see there are countries above and to the left of the line in Chart 4. This represents countries that were less corrupt than the model predicted. In these cases the responses to the two questions were indicative of a country with a higher level of corruption than actually existed. The following were two interesting cases:

  • Finland – the model was thrown off by a surprisingly low percentage of people willing to report corruption. Of the respondents from Finland, only 65% of people surveyed reported they would be willing to report corruption – a surprisingly low percentage for a country with a CPI value of 89. In fact, Finland and Japan were the only countries with a CPI above 60 that reported a percentage below 80% for this question.
  • The United States – neither of the data points used for the US in the model was hugely abnormal for countries in the same CPI range. 80% of people said they would be willing to report corruption (a little lower than you would expect) and 50% said they would be willing to do something active to fight corruption (a little higher than you would expect). Both of these potentially show a slightly higher level of mistrust in government than in other developed nations, something that does tie in with the politics of large parts of the US.

Unlike the above examples, Kosovo appeared fairly typical for the model. Let’s now take a deeper look into the results of the model for Kosovo.

Insights for Kosovo

For Kosovo, the model was able to predict the CPI fairly accurately using the two variables described. Kosovo has both a high percentage of people willing to do something active to fight corruption (80%) and a high percentage of people willing to report corruption (84%). As a result, the model predicted a high level of corruption in Kosovo – a CPI of 35, slightly above the actual value of 33.

However, aside from proving the accuracy of the model in this case, these high values reveal something important about the people of Kosovo: they do believe corruption is an issue, and they are willing to do something about it.

Summary

Overall, there are positives and negatives for Kosovo that can be taken from the Transparency International data. On the negative side, the CPI highlights that corruption is a significant issue in Kosovo. Even in a region with consistently low CPI scores (the best performer is Slovenia with a score of 58) Kosovo is a significant underperformer. The most disappointing aspect of this underperformance is that Kosovo has had the significant advantage of 15 years of assistance from various international agencies in setting up infrastructure for good governance.

That said, there is a big positive that comes from the GCB survey data, and it is also potentially an important clue as to the best way forward for Kosovo and the international organizations involved in the region. That positive is that the people of Kosovo appear to be aware of the issues of corruption in their country, and more importantly, they are very willing to take an active role to fight it. Compared to Albania, a country with the same CPI as Kosovo, almost twice the percentage of survey respondents stated they were willing to do something active to fight corruption in Kosovo (80% vs. 44%), and significantly more people said they were willing to report corruption (84% vs. 51%).

What this suggests is that, if harnessed effectively, anti-corruption efforts in Kosovo could be very popular, and therefore powerful. But the right strategies have to be implemented and publicized to garner public support.

Somewhat unsurprisingly, we believe a key strategy has to be raising awareness of how data can be used to reduce corruption and bring about change. This can apply equally to data that is currently collected by government agencies but isn’t publicly released, and to new datasets that the public can assist in collecting. With the right data and the right analysis, these datasets can help to improve governance in numerous ways, including:

  • exposing systematic corruption
  • identifying gaps in anti‑corruption controls, and
  • better targeting of anti-corruption efforts.

Using this open data approach also helps reduce reliance on the bravery of individual whistleblowers. Although whistleblowers are often vital in helping to identify incidents and even patterns of corruption, the fact is that, even in developed nations, they will always risk retaliation and other subtler forms of retribution (reduced career prospects, being ostracized by their peers and generally being perceived as untrustworthy).

Overall, what the results of the Transparency International data show us is that, with better coordination and targeting of anti-corruption efforts, there is the potential to actively involve large numbers of Kosovars. If that can be achieved and funneled into meaningful strategies, the future of Kosovo could be very bright indeed.

Have any suggestions for ways data could be used to fight corruption? Disagree completely? Feel free to leave your thoughts in the comments!

 

[1] Gender Equality Related Corruption Risks and Vulnerabilities in Civil Service Kosovo, United Nations Development Programme. November 2014. Gender Corruption final Eng.pdf

[2] Defined by Transparency International ‘… as “the abuse of entrusted power for private gain”. Corruption can be classified as grand, petty and political, depending on the amounts of money lost and the sector where it occurs.’

[3] The methodology for compiling the CPI is reviewed on a yearly basis with data sources added and removed as needed.

[4] “To what extent do you see the following categories in this country affected by corruption?” – responses of “corrupt” or “extremely corrupt” recorded as a positive response.

[5] “In your contact or contacts with the institutions have you or anyone living in your household paid a bribe in any form in the past 12 months?”

[6] “Over the past 2 years, how has the level of corruption in this country changed?”

[7] “To what extent is this country’s government run by a few big entities acting in their own best interests?”

[8] “How effective do you think your government’s actions are in the fight against corruption?”

[9] By their own admission, Transparency International’s CPI is not a perfect measure of corruption. Corruption by its nature is hidden and so there is no objective measure of the true level of corruption. However, the CPI is currently the most respected measure of corruption available and so we make the assumption that it is also the most accurate for the purposes of constructing this model.

[10] Taken as the average of the percentage of people who said they would take part in a peaceful protest and the percentage of people who said they would join an organization that works to reduce corruption as an active member

Women and Corruption Issues in Kosovo

For those that don’t know, over the past couple of months I have been spending time working with a tech startup/NGO here in Pristina called Open Data Kosovo. The main aim of the organization is to encourage and facilitate the release of data and other information by the government of Kosovo (and related bodies) in order to increase transparency and reduce corruption. So far they have been fantastically successful, getting both national and international media attention, which is all the more impressive when you consider they are only now coming to the end of their first year of existence.

One of the main things I have been working on since joining is putting together some analysis of the various datasets they have been publishing online to see what conclusions can be provided to the public that might help create a more informed discussion of the issues. The first piece has now been published on the Open Data Kosovo website and we are excited to see what kind of feedback we get. If you want to take a look, please click the link below:

More women in leadership would probably reduce corruption, but is there a more effective way? 
