Data Science: A Kaggle Walkthrough – Adding New Data

This article is Part V in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts (Part I, Part II, Part III and Part IV).

Continuing on the walkthrough, in this part we take the data from sessions.csv that we left aside initially and add it to the transformed and expanded data from Part IV. This part will cover, in brief, all the steps in Parts II – IV.

Understanding the Data

As we did for the user data in training.csv, the first step here is to understand what the data in sessions.csv looks like. Although this file, with over 10 million rows, is too large to display in its entirety in Excel[1], we can still open it in Excel to get an understanding of what columns we have and what at least the first million rows of data look like. Some sample rows are provided below:

user_id	action	action_type	action_detail	device_type	secs_elapsed
4rvqpxoh3h	campaigns	-unknown-	-unknown-	iPhone	375
4rvqpxoh3h	active	-unknown-	-unknown-	iPhone	728
4rvqpxoh3h	create	-unknown-	-unknown-	iPhone
4rvqpxoh3h	notifications	-unknown-	-unknown-	iPhone	187
4rvqpxoh3h	listings	-unknown-	-unknown-	iPhone	154
4rvqpxoh3h	unavailabilities	-unknown-	-unknown-	iPhone	204
4rvqpxoh3h	index	-unknown-	-unknown-	iPhone	21
4rvqpxoh3h	index	-unknown-	-unknown-	iPhone	886
c8mfesvkv0	confirm_email	click	confirm_email_link	iPad Tablet	1371616
c8mfesvkv0	header_userpic	data	header_userpic	iPad Tablet	8672
c8mfesvkv0	create	submit	create_user	iPad Tablet
xwxei6hdk4	dashboard	view	dashboard	iPhone	1355
xwxei6hdk4	header_userpic	data	header_userpic	iPhone	1246
xwxei6hdk4		message_post	message_post	iPad Tablet
xwxei6hdk4	ask_question	submit	contact_host	iPad Tablet	386
xwxei6hdk4	ask_question	submit	contact_host	iPad Tablet	424
xwxei6hdk4		message_post	message_post	iPad Tablet	0
xwxei6hdk4	confirm_email	click	confirm_email_link	iPhone	46262
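If Excel is not at hand, the same first look can be had in pandas by parsing only a limited number of rows. The sketch below is an illustration under assumptions: it reads a few of the sample rows above from an in-memory string (so that it runs on its own) rather than from sessions.csv, and the names sample, preview and missing_per_column are made up for this example. With the real file you would pass its path, and most likely drop the sep argument, since the tab separator here only mirrors how the rows are displayed above.

```python
import io
import pandas as pd

# A few of the sample rows above, tab-separated, standing in for sessions.csv
sample = (
    "user_id\taction\taction_type\taction_detail\tdevice_type\tsecs_elapsed\n"
    "4rvqpxoh3h\tcampaigns\t-unknown-\t-unknown-\tiPhone\t375\n"
    "4rvqpxoh3h\tcreate\t-unknown-\t-unknown-\tiPhone\t\n"
    "c8mfesvkv0\tconfirm_email\tclick\tconfirm_email_link\tiPad Tablet\t1371616\n"
    "xwxei6hdk4\t\tmessage_post\tmessage_post\tiPad Tablet\t0\n"
)

# nrows caps how many rows are parsed, which keeps a peek at a large file cheap
preview = pd.read_csv(io.StringIO(sample), sep="\t", nrows=1000)
print(preview.head())

# Count the missing values in each column (empty fields are read as NaN)
missing_per_column = preview.isnull().sum()
print(missing_per_column)
```

Note that the literal string "-unknown-" is not treated as missing by default, so only the genuinely empty fields show up in the counts.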

As can be seen, the dataset contains records of user actions, with each row representing one action a user took. Every time a user reviewed search results, updated a wish list or updated their account information, a new row was created in this dataset. Although this data is likely to be very useful for our goal of predicting which country a user will make their first booking in, it also complicates the process of combining this data with the data from training.csv, as it will have to be aggregated so that there is one row per user (as opposed to many rows for each user, currently).

Aside from details of the actions taken, there are a couple of interesting fields in this data. The first is device_type – this field contains the type of device used for the specified action. The second interesting field is the secs_elapsed field. This shows us how long (in seconds) was spent on a particular action.

Both of these fields provide us with potentially important information that could help to more accurately predict which country a user will make a first booking in. For example, it is not difficult to imagine that people spending relatively little time to make a booking on a phone are likely to be making bookings in locations closer to home (i.e. the US) than someone spending more time to make a booking on a desktop computer. Of course this is just a theory that needs to be proven, but it is a good reason to ensure we are capturing this information in our final training dataset.

Cleaning and Transforming the Data

Now that we have a basic understanding of the data, we need to undertake the cleaning and transformation steps. Because of the structure of this data (and for the sake of brevity), we are going to do both of these things at the same time.

The first step is to import the data:

import pandas as pd

# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath, header=0, index_col=False)

Extract the primary and secondary devices for each user

Remembering that we need to get the final data into a format that can be merged with the data created in Part IV (i.e. a dataset where one row equals one user), the first piece of information we are going to extract is the primary and secondary device for each user. How do we determine what a user’s primary and secondary devices are? We look at how much time they spent on each device. In short, we total the seconds elapsed for each user on each device type, take the device type with the largest total as that user’s primary device, and then take the largest of the remaining device types as the secondary device.

One thing to note as we make these transformations is that by aggregating the data this way, we are also implicitly removing the missing values. The code to do this transformation is shown below:

# Determine primary device
# Note: convert_to_binary is the helper function defined earlier in this series
print("Determining primary device...")
sessions_device = sessions[['user_id', 'device_type', 'secs_elapsed']]
# Total seconds per user per device (missing values are skipped by sum)
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).sum()
# Flag the rows holding each user's largest total
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform('max') == aggregated_lvl1['secs_elapsed']
df_primary = aggregated_lvl1.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determining secondary device...")
# Remove the primary device rows, then repeat the same logic on what is left
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform('max') == remaining['secs_elapsed']
df_secondary = remaining.loc[idx, ['user_id', 'device_type', 'secs_elapsed']].copy()
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)
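To make the idea easier to follow, here is a small self-contained sketch of the same primary/secondary logic on made-up data. It is not the code used in this walkthrough: instead of comparing against the group maximum, it sorts each user's devices by total time and numbers them with cumcount, which is a swapped-in technique that always yields exactly one primary device per user, even when two devices tie. The toy values and the names toy, per_device, primary and secondary are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for sessions.csv (values are made up for illustration)
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "b", "b"],
    "device_type": ["iPhone", "iPhone", "Mac Desktop", "iPad Tablet", "iPhone", "iPhone"],
    "secs_elapsed": [100.0, None, 500.0, 50.0, 30.0, 20.0],
})

# Total time per user per device; sum() skips the missing value
per_device = toy.groupby(["user_id", "device_type"], as_index=False)["secs_elapsed"].sum()

# Order each user's devices from most to least used, then number them 0, 1, 2, ...
per_device = per_device.sort_values(["user_id", "secs_elapsed"], ascending=[True, False])
per_device["rank"] = per_device.groupby("user_id").cumcount()

# Rank 0 is the primary device, rank 1 the secondary device
primary = per_device[per_device["rank"] == 0].set_index("user_id")
secondary = per_device[per_device["rank"] == 1].set_index("user_id")

print(primary)
print(secondary)
```

User b only ever used one device type, so they appear in primary but not in secondary, which is exactly the situation the joins later in this article have to handle.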

Determine Counts of Actions

The next thing we are going to do is take counts of how many times each action was taken by each user. This is a two-step process. The first step is to determine the count of each action type for each user:

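Step 1

Count the occurrences of each action for each user, giving one row per user and action.

Step 2

Reshape those counts so that each action becomes its own column, giving one row per user.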

For you Excel buffs out there, the second step might strike you as something that could be achieved using a pivot table – and you would be right. In fact, the custom function that we use to make this transformation uses a pandas method called ‘pivot’. This is important to note for a couple of reasons. The first is that, with all the talk about new data, people who have worked with data mostly (or entirely) using ‘old technology’ like Excel and SQL are often given the impression that their skills are redundant or not useful in modern data science. As this example shows, the ways of thinking about data that you develop working with Excel and SQL are not only relevant, but often extremely useful.

The second reason is that for people (like me) who do not know all the methods available for pandas dataframes off by heart, being able to identify techniques you have used in other programs and languages provides you with a way to find corresponding methods in new languages. I discovered this method by searching for “pandas pivot”, knowing that this way of manipulating data was likely to have some equivalent in pandas.
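For readers who want to see the two steps in isolation, the following is a minimal sketch on made-up data. It assumes a tiny dataframe with a single action column, and the names toy, step1 and step2 are purely illustrative. The counting is done here with size(), a slightly different route from the helper function shown later in this article, but the pivot call is the same pandas method.

```python
import pandas as pd

# Toy data: one row per action taken (values are made up for illustration)
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "action": ["index", "index", "create", "index", "listings"],
})

# Step 1: one row per (user, action) pair with the number of occurrences
step1 = toy.groupby(["user_id", "action"]).size().reset_index(name="count")
print(step1)

# Step 2: pivot so each action becomes a column, leaving one row per user.
# Combinations that never occurred come out as NaN, so fill them with 0.
step2 = step1.pivot(index="user_id", columns="action", values="count").fillna(0)
print(step2)
```

The result has one row for each of the two users and one column for each of the three distinct actions, which is the shape needed to merge with the one-row-per-user data from Part IV.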

Looping Through the Actions Columns

Looking at the examples above, you may have realized that the transformation as shown only works for one action column at a time, but in the data we have three action columns: action, action_type and action_detail.

To handle the multiple action columns, we repeat these steps for each column individually, effectively creating three separate tables. Because we have now created tables where each row represents one user, we can now join (another concept SQL users will be very familiar with) these three tables together on the basis of the user id. The full code for these steps is shown below:

# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    # Work on a copy so that adding the helper column does not modify the input
    df_counts = df[[id_col, column_to_convert]].copy()
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    # One row per id, one column per category; missing combinations become 0
    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns = {category:col_name}, inplace=True)

    return new_df

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions[['user_id', 'action', 'action_type', 'action_detail']].copy()
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')

Combine Data Sets

The final steps are to combine the various datasets we have created into one large dataset. First we combine the two device dataframes (df_primary and df_secondary) to create a device dataframe. Then we combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from sessions.csv. Finally, we combine the sessions dataframe with the user data dataframe from Part IV. The code for the various combinations is shown below:

# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')

A Note on Joins

For those who can read a little bit of code and are familiar with joins in SQL, you may be asking why I am using (full) outer joins for the first two combinations, but an inner join for the final step[2].

The first step requires an outer join because not all users have a secondary device. That is, some users only logged onto Airbnb using one device (or at least one type of device). Doing an outer join here ensures that our dataset includes all users regardless of this fact.

The second step could use an inner or an outer join, as both the device and actions datasets should contain all users. In this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever reason), we will still capture them. You may also notice that after the second step we fill any missing values with 0s to ensure we do not have any NULL values that may have been generated by these outer joins.

For the third step we use an inner join for a key reason – we want our final training dataset to only include users that also have sessions data. Using an inner join here is an easy way to join the datasets and filter for the users with sessions data in one step.
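The difference between the two join types can be seen in a small sketch. The dataframes below are made up for illustration (the names primary, secondary, users, outer and inner, and the age column, are assumptions and not the real data), but the pd.concat calls use the same axis and join arguments as the code above.

```python
import pandas as pd

# Toy device data: user b has no secondary device
primary = pd.DataFrame({"primary_secs": [500.0, 50.0]}, index=["a", "b"])
secondary = pd.DataFrame({"secondary_secs": [100.0]}, index=["a"])

# Toy user data: user c has no sessions data at all
users = pd.DataFrame({"age": [30, 41]}, index=["a", "c"])

# Outer join keeps every user found in either table; gaps are filled with 0
outer = pd.concat([primary, secondary], axis=1, join="outer").fillna(0)
print(outer)

# Inner join keeps only users present in both tables
inner = pd.concat([users, outer], axis=1, join="inner")
print(inner)
```

User b survives the outer join with a secondary_secs of 0, while the inner join drops both b (no user record) and c (no sessions data), leaving only user a.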

Wrapping Up

In the first four parts of this series, we looked in detail at some of the various steps in the process of building a model. Although these steps should be distinct thought processes that occur for each model building process, hopefully what this article provides is an insight into how some of these steps can be combined if planned out carefully. In relatively few steps, we have taken a dataset containing 10 million rows of user actions data, cleaned it, extracted a bunch of important information, and added it to our user data, ready for training a model.

The other important thing to take away from this article is how useful ‘old school’ ways of thinking about data still are. For all the talk about unstructured data and NoSQL databases, the fact is that knowing how to work with and manipulate old fashioned columns and rows is still as important as ever. Whether it is joins and aggregation in SQL, pivot tables and VLOOKUPS in Excel, or just the general concept of relational data, not only is that knowledge relevant, but it is often extremely useful.

Next Time

In the next piece, we will finally get to the good stuff and train the algorithm to make the final predictions.

[1] Nope, still doesn’t qualify as ‘Big Data’…

[2] For those who do not understand what I mean by inner and outer joins (and are interested in knowing) – Stack Overflow comes to the rescue again with this great illustrated answer.
