This article on understanding the data is Part II in a series looking at data science and machine learning by walking through a Kaggle competition. Part I can be found here.
Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we focus on understanding the data provided for the Airbnb Kaggle competition.
Reviewing the Data
In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):
- What features (columns) does the dataset contain?
- How many records (rows) have been provided?
- What format is the data in (e.g. what format are the dates provided, are there numerical values, what do the different categorical values look like)?
- Are there missing values?
- How do the different features relate to each other?
For this competition, Airbnb have provided 6 different files. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:
- train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries. Each row represents one user with the columns containing various information such the users’ ages and when they signed up. This is the primary dataset that we will use to train the model.
- test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
- sessions.csv – This data is supplementary data that can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a wish list, ran a search etc.) taken by the users in both the testing and training datasets above.
With this information in mind, an easy first step in understanding the data is reviewing the information provided by the data provider – Airbnb. For this competition, the information can be found here. The main points (aside from the descriptions of the columns) are as follows:
- All the users in the data provided are from the USA.
- There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’,’DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’.
- ‘other’ means there was a booking, but in a country not included in the list, while ‘NDF’ means there was not a booking.
- The training and test sets are split by dates. In the test set, you will predict the destination country for all the new users with first activities after 7/1/2014
- In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset dates back to 2010.
After absorbing this information, we can start looking at the actual data. For now we will focus on the train_users_2.csv file only.
Table 1 – Three rows (transposed) from train_users_2.csv
|Column Name||Example 1||Example 2||Example 3|
|first_device_type||Windows Desktop||iPhone||Windows Desktop|
Looking at the sample of three records above provides us with a few key pieces of information about this dataset. The first is that at least two columns have missing values – the age column and date_first_booking column. This tells us that before we use this data for training a model, these missing values need to be filled or the rows excluded altogether. These options will be discussed in more detail in the next part of this series.
Secondly, most of the columns provided contain categorical data (i.e. the values represent one of some fixed number of categories). In fact 11 of the 16 columns provided appear to be categorical. Most of the algorithms that are used in classification do not handle categorical data like this very well, and so when it comes to the data transformation step, we will need to find a way to change this data into a form that is more suited for classification.
Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47. This formatting will need to be corrected if we are to use the date values.
Now that we have gained a basic understanding of the data by looking at a few example records, the next step is to start looking at the structure of the data.
Country Destination Values
Arguably, the most important column in the dataset is the one the model will try to predict – country_destination. Looking at the number of records that fall into each category can help provide some insights into how the model should be constructed as well as pitfalls to avoid.
Table 2 – Users by Destination
|Destination||Records||% of Total|
Looking at the breakdown of the data, one thing that immediately stands out is that almost 90% of users fall into two categories, that is, they are either yet to make a booking (NDF) or they made their first booking in the US. What’s more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.
Table 3 – Users by Destination and Year
For modeling purposes, this type of split means a couple of things. Firstly, the spread of categories has changed over time. Considering that our final predictions will be made against user data from July 2014 onwards, this change provides us with an incentive to focus on more recent data for training purposes, as it is more likely to resemble the test data.
Secondly, because the vast majority of users fall into 2 categories, there is a risk that if the model is too generalized, or in other words not sensitive enough, it will select one of those two categories for every prediction. A key step will be ensuring the training data has enough information to ensure the model will predict other categories as well.
Account Creation Dates
Let’s now move onto the date_account_created column to see how the values have changed over time.
Chart 1 – Accounts Created Over Time
Chart 1 provides excellent evidence of the explosive growth of Airbnb, averaging over 10% growth in new accounts created per month. In the year to June 2014, the number of new accounts created was 125,884 – 132% increase from the year before.
But aside from showing how quickly Airbnb has grown, this data also provides another important insight, the majority of the training data provided comes from the latest 2 years. In fact, if we limited the training data to accounts created from January 2013 onwards, we would still be including over 70% of all the data. This matters because, referring back to the notes provided by Airbnb, if we want to use the data in sessions.csv we would be limited to data from January 2014 onwards. Again looking at the numbers, this means that even though the sessions.csv data only covers 11% of the time period (6 out of 54 months), it still covers over 30% of the training data – or 76,466 users.
Looking at the breakdown by age, we can see a good example of another issue that anyone working with data (whether a Data Scientist or not) faces regularly – data quality issues. As can be seen from Chart 2, there are a significant number of users that have reported their ages as well over 100. In fact, a significant number of users reported their ages as over 1000.
Chart 2 – Reported Ages of Users
So what is going on here? Firstly, it appears that a number of users have reported their birth year instead of their age. This would help to explain why there are a lot of users with ‘ages’ between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age as 105 and 110. This is harder to explain but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Either way, these values would appear to be errors that will need to be addressed.
Additionally, as we saw in the example data provided above, another issue with the age column is that sometimes age has not been reported at all. In fact, if we look across all the training data provided, we can see a large number of missing values in all years.
Table 4 – Missing Ages
|Year||Missing Values||Total Records||% Missing|
When we clean the data, we will have to decide what to do with these missing values.
First Device Type
Finally, one last column that we will look at is the first_device_used column.
Table 5 – First Device Used
The interesting thing about the data in this column is how the types of devices used have changed over time. Windows users have increased significantly as a percentage of all users. iPhone users have tripled their share, while users using ‘Other/unknown’ devices have gone from the second largest group to less than 5% of users. Further, the majority of these changes occurred between 2011 and 2012, suggesting that there may have been a change in the way the classification was done.
Like with the other columns we have reviewed above, this change over time reinforces the presumption that recent data is likely to be the most useful for building our model.
It should be noted that although we have not covered all of them here, having some understanding of all the data provided in a dataset is important for building an accurate classification model. In some cases, this may not be possible due to the presence of a very large number of columns, or due to the fact that the data has been abstracted (that is, the data has been converted into a different form). However, in this particular case, the number of columns is relatively small and the information is easily understandable.
Now that we have taken the first step – understanding the data – in the next piece, we will start cleaning the data to get it into a form that will help to optimize the model’s performance.
Hi Brett, very insightful viz! Do you have code to accompany the plots?