Brett Romero

Data Inspired Insights

Official Release: Visual Analytics

I am proud to announce the release of an application I’ve been working on for the last few months – Visual Analytics. This application is designed to give you a new way to view your Google Analytics data using a range of interactive visualizations, allowing you to get a better understanding of who your users are, how they are getting to your site, and what they are doing when they get there.

For those worried about privacy and personal security, the application has a couple of features that will hopefully ease your mind. Firstly, there are no separate account or login details needed for Visual Analytics. Everything is based on your existing Google account, and the login process is completed using Google authentication.



Secondly, the application does not currently store any user data. In fact, the application has no database at all (sensing a theme here?). That means not only that I cannot sell your data to third parties, but also that even if someone does manage to hack into the application, there is nothing to steal except my hacky code base.

For those interested in the technical specs, the backend of the application was built using Python and the Flask web framework. To access the data, once you are logged in using your Google credentials, the application makes calls to the Google Analytics API and then uses Pandas to handle the data manipulation (where needed). On the front end, the visualizations are created using D3.js and Highcharts (a big shout out to the Highcharts team and Mike Bostock for their excellent work on these libraries).

Anyway, if you have a Google Analytics account and are interested in getting some interesting insights into your data, take a look and let me know what you think. And please, if you find an issue or a bug, let me know!



Why the ‘boring’ part of Data Science is actually the most interesting

For the last 5 years, data science has been one of the world’s hottest professions, but it is also one of the most poorly defined. This can be seen on any career website, where advertisements for ‘Data Scientist’ positions describe everything from what used to be a simple data analyst role, to technical, PhD-only, research positions working on artificial intelligence or autonomous cars.

However, despite the diversity of roles being labelled ‘data scientist’, there is a common thread that runs through any job involving data and building models. And this is that only around 20% of time will be spent building models, with the other 80% of the time spent understanding, cleaning and transforming data to get it to the point where it can be used for modelling (for an overview of all the steps a Data Scientist goes through, see this series).

For many/most people working in the profession, the time spent cleaning and transforming is seen simply as a price to be paid to get to the interesting part – the modelling. If they could, many people would happily hand off this ‘grunt work’ to someone else. At first glance, it is easy to see why this would be the case – it is the modelling that gets all the headlines. There are very few people who hear about a model predicting cancer in hospital patients and think “they must have had some awesome clean data to build that with”.

However, plaudits aside, I am going to make the case that this is backwards: from a creativity and challenge standpoint, it is often the cleaning and transforming parts of the job that are the most interesting parts of data science.

The creativity of cleaning

Over the past 12 years of working with data, one thing that has become painfully obvious is the unbridled creativity of people when it comes to introducing errors and inconsistencies into data. Typos, missing values, numbers in text fields, text in numerical fields, inconsistent spellings of the same item, and changing number formats (e.g. ever notice how most of continental Europe uses “,” as the decimal point instead of “.”?) are just some of the most common issues one will encounter.

To be fair, it is not only the fault of the person doing the data entry (e.g. an end user of an application). Often, the root of the problem is a poorly designed interface and a lack of data validation. For example, why is a user able to submit text in a field that should only ever contain numbers? Why do I have to guess how everyone else types in “the United States” (US, U.S., USA, U.S.A., United States of America, America, Murica) instead of choosing from a standardized list of countries?

However, even with the most carefully validated forms and data entry interface, data quality issues will continue to exist. People fudge their age, lie about their income, enter fake emails, addresses and names, and some, I assume, make honest typos and mistakes.

So why is dealing with these issues a good thing? Because the unlimited creativity on the part of the people creating the data quality issues has to be exceeded by the creativity of the person cleaning the data. For every possible type of error that can be found in the data, the data scientist has to develop a method to address that error. And assuming the dataset is more than a few hundred rows, it will have to be a systematic method, as manually correcting the issues becomes impractical.

As a result, the data scientist has to find a way to address the universe of potential errors, and to do so in an automated, systematic way. How do I go through a column of countries that have all been spelt in different ways in order to standardize the country names? Someone got decimal happy and now I have a column where a lot of the numbers have two decimal points instead of one – how can I systematically work out which decimal point is the correct one, and then remove the other decimal point? A bunch of users put their birthday as 1 January 1900, how can I remove those, should I remove them, and if yes, what values should I put there instead?

All of these scenarios are real examples of interesting, challenging problems to solve, and ones that require a high level of creativity to address.
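As a rough illustration of what "systematic" can look like, the sketch below tackles two of the problems mentioned above in Python: mapping spelling variants of a country onto one canonical name, and repairing numbers that use a decimal comma or contain an extra separator. The alias table, the function names, and the rule of treating the last separator as the decimal point are all assumptions made for this example. Real data will need rules chosen after looking at the actual errors.

```python
import re

# Hypothetical alias table: every known variant maps to one canonical name.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "america": "United States",
    "murica": "United States",
}


def standardize_country(raw):
    """Return the canonical country name, or None if the variant is unknown."""
    key = raw.strip().lower()
    key = key.replace(".", "")          # "u.s.a." -> "usa"
    key = re.sub(r"^the\s+", "", key)   # drop a leading "the"
    key = re.sub(r"\s+", " ", key)      # collapse repeated whitespace
    return COUNTRY_ALIASES.get(key)


def parse_messy_number(raw):
    """Parse a number that may use ',' as the decimal mark or contain more
    than one separator. Assumption: the LAST separator is the decimal point."""
    cleaned = raw.strip().replace(",", ".")
    head, sep, tail = cleaned.rpartition(".")
    if sep:
        # Remove every separator except the last one.
        cleaned = head.replace(".", "") + "." + tail
    return float(cleaned)


print(standardize_country("U.S.A."))     # United States
print(parse_messy_number("1.234,56"))    # 1234.56
```

Returning None for unknown variants, rather than guessing, keeps the unresolved cases visible so they can be reviewed and added to the alias table.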

The creativity of transformation/feature extraction

Once cleaning has been undertaken, typically the next step is to perform transformation and/or feature extraction. These steps are necessary because the data is rarely collected in the form required by the model, and/or there is additional information that can be added to and/or extracted from the data to make the model more effective.

If this sounds like a very open-ended task, that’s because it is. Often, the ability to enhance a dataset is limited only by time, and the creativity and knowledge of the data scientist doing the work. Of course, there are diminishing returns, and at some point, it becomes uneconomic to invest additional effort to improve a dataset, but in many cases there is a huge range of options.

Due to the open-ended nature of this step, there are actually two types of creativity required. The first is the creativity to come up with potential new features that can be extracted from the existing dataset (and developing the methods to create those features). The second is identifying other data that could be used to enhance the dataset (and then developing the methods to import and combine it). Again, both of these are challenging and interesting problems to solve.
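To make the first type of creativity a little more concrete, here is a minimal Python sketch that turns a single date field into several candidate features. The function name and the particular features chosen are illustrative assumptions. Which features are worth extracting depends entirely on the problem being modelled.

```python
from datetime import date


def date_features(d, today):
    """Derive a few simple candidate features from one date value."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),       # Monday = 0 ... Sunday = 6
        "is_weekend": d.weekday() >= 5,   # Saturday or Sunday
        "days_ago": (today - d).days,     # recency relative to a reference date
    }


features = date_features(date(2021, 1, 2), today=date(2021, 1, 12))
print(features["is_weekend"], features["days_ago"])  # True 10
```

The reference date is passed in explicitly so that the same input always produces the same features, which makes the transformation repeatable.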

Making a model is often a mechanical process

Unlike the above, the process of creating the model is a relatively mechanical process. Of course, there are still challenges to overcome, but in most cases, it boils down to choosing an algorithm (or combination of algorithms), then tuning the parameters to improve the results. The issue is that neither of these steps typically involves a lot of creative thinking. Instead, they involve cycling through a lot of options to see what works.

Even the selection of the algorithm, or combination of algorithms, which might seem relatively open ended, is, in the real world, limited by a range of factors. For a given problem, these factors include:

  • The task at hand – whether it be two-class or multi-class classification, cluster analysis, prediction of a continuous variable, or something else – will reduce the algorithm options. Some algorithms will typically perform better in certain scenarios, while others may simply not be able to handle the task at all.
  • The characteristics of the data often also reduce the options. Larger datasets mean some algorithms will take too long to train to be practical. Datasets with large numbers of features suit some algorithms more than others, while sparse datasets (those with lots of 0 values) will suit other algorithms.
  • An often-overlooked factor is the ability to explain to clients and/or bosses how and why a model is making a prediction. Being able to do this typically puts a significant limit on the complexity of the model (particularly ensembles), and makes simpler (and often less accurate) models more appealing.

After all these factors are taken into account, how many algorithms are left to choose from in a given scenario? Probably not too many.

machine learning cheat sheet

An excellent graphic from SAS summarizing how the algorithm choices in data science are often limited by the problem.

Wrapping Up

Taking all the above into account, the picture that starts to form is one where significant creativity is required to clean and create a good dataset for modelling, followed by a relatively mechanical process to create and tune a model. But if this is the case, why doesn’t everyone think the same way I do?

One of the primary reasons is that in most real-world data science scenarios, the above steps (cleaning, transformation, feature extraction and modelling) are not typically conducted in a strictly linear fashion. Often, building the model and assessing which features were the most predictive will lead to additional work transforming and extracting features. Feature extraction and testing a model will often reveal data quality issues that were missed earlier and cause the data scientist to revisit that step to address those issues.

In other words, in practice everything is interlinked and many data scientists view the various steps in the process of constructing a model (including cleaning and transforming) as one holistic process that they enjoy completing. However, because the cleaning and transforming aspects are the most time consuming, these aspects (data cleaning in particular) are often seen as being the major impediment to a completed project.

This is true – almost all projects could be completed significantly quicker if the data was of a higher quality at the outset. The quick turnaround for most Kaggle competition entries (where relatively clean and standardized data are provided to everyone) can attest to this. But to my fellow data scientists, I would say the following. Data science will always involve working with dirty and underdeveloped data – no matter how good we get at data validation, how clean and intuitive the interface, or how much planning is done on what data points to collect. Embrace the dirt, celebrate the grind, and take pride in creating creative solutions to often complex and challenging problems. If you don’t, no one else will.

The Surprising Complexity of Randomness

Previously, in a walkthrough on building a simple application without a database, I touched on randomness. Randomness and generating random numbers is a surprisingly deep and important area of computer science, and also one that few outside of computer science know much about. As such, for my own benefit as much as yours, I thought I would take a deeper look at the surprising complexity of randomness.

Why do we need randomness?

There can be a number of uses for randomness. But firstly, one thing to note is that when it comes to computers and computer science, randomness is typically represented by random numbers – seemingly random sequences of numbers that can then be used for different purposes. These purposes can range from randomly generating words in a flashcard app or shuffling songs in a playlist, to significantly more high-stakes uses, such as generating random keys for secure logins, data encryption, or randomly shuffling a deck of cards in an online game where large amounts of money are at stake.

How are random numbers created at the moment?

Random numbers come in two types, pseudorandom numbers and true random numbers.

Pseudorandom numbers are numbers that are generated to appear random, but are not truly random. Typically, pseudorandom numbers will be generated using a seed value provided by a user or programmer, which is then passed to an algorithm that uses that value to generate a new number. These algorithms often work by taking the remainder of an equation which includes the seed value and several large numbers.

For example, let’s say we use the following very simple equation to generate a series of random numbers:

R = (387 x S + 217) // 953

R is the random number to be produced
S is the seed value for R
// represents modular division (the modulo operation, often written % in programming languages), where the result will be the remainder of the division

Starting with a seed value (S) of 43, the first random number produced by the equation will be:

R = (387 x 43 + 217) // 953

R = 657

To produce the second random number, we then insert 657 as S, back into the equation:

R = (387 x 657 + 217) // 953

R = 25

This process can be repeated as many times as needed, generating an apparently random series of numbers.
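The steps above can be sketched in a few lines of Python. The function and parameter names are made up for this example, the constants 387, 217 and 953 are the ones used in the worked steps, and Python's % operator plays the role of the modular division.

```python
from itertools import islice


def simple_generator(seed, multiplier=387, increment=217, modulus=953):
    """Yield an endless stream of pseudorandom numbers, feeding each
    result back in as the next seed."""
    value = seed
    while True:
        value = (multiplier * value + increment) % modulus
        yield value


first_two = list(islice(simple_generator(43), 2))
print(first_two)  # [657, 25]
```

This is only a toy. Generators used in practice follow the same feedback idea but with far larger state and more carefully chosen constants.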

While this example is a very simple one, this process of feeding the last random number into the same equation to generate a new random number is common to almost all pseudorandom number generators, and will result in two common attributes, regardless of the complexity.

The first is that if the seed value (S) is the same, the sequence of ‘random’ numbers produced by the algorithm will be exactly the same every time. This means that if you know the equation and the seed value, you can predict the entire sequence of ‘random’ numbers.

The second issue is that, eventually, the pattern will repeat. That is, eventually the formula will generate the same number twice, meaning the whole sequence will start again. And depending on the equation and large values chosen, this could be surprisingly soon.
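Both attributes are easy to check for the toy equation used earlier. In the sketch below (the function names are assumptions for illustration), `sequence` shows that the same seed always yields the same numbers, and `steps_until_repeat` counts how many numbers are produced before one of them appears a second time.

```python
def next_value(value, multiplier=387, increment=217, modulus=953):
    """One step of the toy equation: feed in the last number, get the next."""
    return (multiplier * value + increment) % modulus


def sequence(seed, count):
    """Return the first `count` numbers produced from `seed`."""
    values = []
    value = seed
    for _ in range(count):
        value = next_value(value)
        values.append(value)
    return values


def steps_until_repeat(seed):
    """Count how many distinct numbers appear before the sequence repeats."""
    seen = set()
    value = next_value(seed)
    while value not in seen:
        seen.add(value)
        value = next_value(value)
    return len(seen)


print(sequence(43, 5) == sequence(43, 5))  # True: same seed, same sequence
print(steps_until_repeat(43))              # can never exceed 953
```

Because every result is a remainder after dividing by 953, there are at most 953 possible values, so a repeat is guaranteed within that many steps no matter which seed is used.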

Creating true random numbers

The reason we have pseudorandom numbers is that generating true random numbers using a computer is difficult. Computers, by design, are excellent at taking a set of instructions and carrying them out in the exact same way, every single time. It is this predictability which makes them so powerful. However, this predictability also makes it complicated to generate true random numbers.

As such, for a computer to create a truly random number, it has to take in some external input from something that is truly random. This external input can be something like key presses and movements of the mouse by a human operator, or network activity on a busy network in an office setting. But it can also be something far more complex such as the effect of atmospheric turbulence on a laser, or measuring the decay of a radioactive isotope.


Generating random numbers using mouse and keyboard inputs

Why does it matter?

This difference between pseudorandom and true random numbers is important, but only in certain settings.

For uses like selecting a random sample when working with data, shuffling a playlist, or triggering events in a video game, it is less important if pseudorandom or true random numbers are used. How true the randomness is, in these cases, will not impact the quality of the outcomes.

In some cases, using pseudorandom numbers may be advantageous. Take for example the process of selecting a random sample for a scientific study. In this case, using pseudorandom numbers allows others to replicate your results by using the same seed value. In video games, being able to trigger the same ‘random’ events is very useful when the game is being tested.
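As a small illustration of the replication point, the sketch below draws a sample using Python's built-in random module with a fixed seed. The participant list and the seed value are arbitrary choices for the example.

```python
import random


def reproducible_sample(population, k, seed):
    """Draw k items using a private generator initialised with `seed`."""
    rng = random.Random(seed)  # separate generator, global state is untouched
    return rng.sample(population, k)


participants = list(range(1, 101))
first = reproducible_sample(participants, 10, seed=2024)
second = reproducible_sample(participants, 10, seed=2024)
print(first == second)  # True: anyone with the seed can redraw the same sample
```

Publishing the seed alongside the method is what lets someone else reproduce the selection, provided they use the same generator implementation.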

In other cases, using true random numbers is much more important. In applications such as encryption, using true random numbers is particularly important as it helps to ensure that data remains protected. Similarly, for online gambling, gaming companies need to have a very high level of confidence that the way results are being produced in everything from blackjack (how the cards are shuffled), to roulette (where the ball lands) and poker machines (which position the reels stop in) is a truly random process, or they risk someone reverse engineering the algorithm and making a significant profit as a result.
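In Python, the usual starting point for this kind of security-sensitive use is the standard library's secrets module, which draws on the operating system's randomness source instead of a seeded equation. Strictly speaking this is a cryptographically secure generator fed by unpredictable system inputs, which is not necessarily the same thing as a physical true random source, so treat the sketch below as an illustration and not as a gambling-grade design. The card encoding is an assumption for the example.

```python
import secrets

# A random key: 16 bytes from the operating system, shown as 32 hex characters.
token = secrets.token_hex(16)

# Shuffle a deck with a generator that has no user-visible seed to recover.
secure_rng = secrets.SystemRandom()
deck = [rank + suit for suit in "SHDC" for rank in "A23456789TJQK"]
secure_rng.shuffle(deck)

print(len(token), len(deck))  # 32 52
```

Unlike the seeded examples, running this twice is not expected to produce the same token or the same card order.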

True randomness is not what most people expect

When it comes to true randomness, one of its stranger aspects is that it often behaves differently to people’s expectations. Take the two diagrams below – which one do you think is a random distribution, and which has been deliberately created/adjusted?

randomized dots

Only one of these panels shows a random distribution of dots | Source: Bully for Brontosaurus – Stephen Jay Gould

If you said the right panel, you are in good company, as this is most people’s expectation of what randomness looks like. However, this relatively uniform distribution has been adjusted to ensure the dots are evenly spread. In fact, it is the left panel, with its clumps and voids, that reflects a true random distribution. It is also this tendency for randomness to produce clumps and voids that leads to some unintuitive outcomes.

Take Spotify, the digital music service, for example. For years, Spotify listeners have complained about the quality of the playlist shuffle. In fact, the quality of Spotify’s shuffle has been such a topic of discussion that if you type “Spotify shuffle” into Google, one of the first autocomplete options that will come up is “sucks”. When Spotify looked into these complaints, the most common theme centered on songs from the same artist frequently playing one after the other. In short, people’s expectations of randomness were not matching reality. As Spotify explain in this interesting article, their shuffle was actually random, but they have now adjusted it to better align with what people think of as random – by reducing the randomness and ensuring that songs from a given artist will be spread throughout the playlist.
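The clumping effect is easy to simulate. The sketch below shuffles a made-up playlist many times and reports the share of shuffles in which the same artist plays twice in a row at least once. The playlist size (5 artists with 4 songs each), the number of trials and the seed are arbitrary assumptions, and this is not Spotify's algorithm, just a plain uniform shuffle.

```python
import random


def has_back_to_back_artist(playlist):
    """True if any two neighbouring songs are by the same artist."""
    return any(a == b for a, b in zip(playlist, playlist[1:]))


def share_with_repeats(artists=5, songs_each=4, trials=2000, seed=1):
    """Fraction of uniform shuffles containing a same-artist pair."""
    rng = random.Random(seed)
    # Each song is represented only by its artist number.
    playlist = [artist for artist in range(artists) for _ in range(songs_each)]
    hits = 0
    for _ in range(trials):
        rng.shuffle(playlist)
        hits += has_back_to_back_artist(playlist)
    return hits / trials


print(share_with_repeats())
```

With a playlist like this, a shuffle that avoids every back-to-back pairing is the exception, which is consistent with the complaints described above.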

The gambler’s fallacy

As is also covered in the Spotify article, a great example of this misalignment of people’s expectations with the true nature of randomness is the so-called gambler’s fallacy. What the gambler’s fallacy boils down to is two things:

  1. A belief that independent random events (a flip of a coin, a roll of a die) have some sort of inherent tendency to revert to the mean. For example, when flipping a coin, a streak of heads makes the likelihood that the next flip will be tails increase so that the eventual distribution will move back towards 50-50.
  2. As a result of belief 1, people tend to underestimate the likelihood of streaks (or clumps) of outcomes. The classic example of this is the person at the roulette table who looks at the list of previous results and sees a run of five black numbers, and believes that the likelihood of the next number being red is now higher as a result. By the way, this is exactly why casinos show the history, to tempt people into betting when they think the odds are in their favor.

To test your own beliefs on the likelihood of streaks, consider a roulette wheel in a casino. Let’s say the casino is open 12 hours a day, and that on average, it gets spun once per minute, giving us 720 spins in a day. Assuming there is a 50% chance of a red number and a 50% chance of a black number (i.e. we are ignoring the green 0 and 00 tiles for simplicity), what do you think the probability is of a streak of 8 or more numbers of one particular color (say, black) in a row on a given day?

The answer is over 75%. In other words, on three out of four days, you should expect to see at least one streak of 8 or more black numbers during the day, and counting streaks of either color makes a streak even more likely. Extending this, there is a 30% chance of a streak of 10 or more and around an 8% chance of a streak of twelve or more. You can test this and other scenarios using this handy calculator.
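These figures can be checked with a short calculation. The following is a minimal sketch, not the linked calculator's method: it tracks, spin by spin, the probability of being a given number of spins into a run without yet having completed a full streak. The function names are assumptions. `prob_streak` covers a streak of one nominated color, while `prob_streak_either` covers a streak of either color by using the fact that, on a 50-50 wheel, each spin after the first matches the previous one with probability 0.5.

```python
def prob_streak(n_spins, streak_len, p=0.5):
    """Probability of at least one run of `streak_len` or more hits
    (each hit has probability p) somewhere in `n_spins` spins."""
    # state[j] = probability that no streak has happened yet and the
    # current run of hits has length j (0 <= j < streak_len).
    state = [0.0] * streak_len
    state[0] = 1.0
    for _ in range(n_spins):
        new_state = [0.0] * streak_len
        for run, prob in enumerate(state):
            new_state[0] += prob * (1 - p)        # a miss resets the run
            if run + 1 < streak_len:
                new_state[run + 1] += prob * p    # a hit extends the run
            # otherwise the streak is complete and the probability
            # leaves the "no streak yet" states for good
        state = new_state
    return 1.0 - sum(state)


def prob_streak_either(n_spins, streak_len):
    """Same question for a streak of either color on a 50-50 wheel."""
    if streak_len <= 1:
        return 1.0
    # k equal spins in a row means k - 1 consecutive "same as previous" results.
    return prob_streak(n_spins - 1, streak_len - 1, 0.5)


for k in (8, 10, 12):
    print(k, round(prob_streak(720, k), 3), round(prob_streak_either(720, k), 3))
```

Under these assumptions, the percentages quoted above correspond to the one-color calculation, and the either-color probabilities come out higher still.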

What does any of this mean?

In the course of your daily life, not too much. If you are a gambler, you should probably stop, but I am sure I am not the first person to tell you that. If you follow stock pickers, hopefully you will reconsider how much of their ‘skill’ is pure chance, especially when you factor in survivorship bias[1]. Perhaps something here will help you impress your friends at a trivia night.

If none of the above apply however, hopefully this article has introduced you to an interesting and little known area of knowledge with some important and fascinating applications.


[1] Survivorship bias in this context exists because the stock pickers that were not picking the right stocks did not keep writing articles. Over time, this leaves only the people who have been picking the winners (the ‘survivors’) to continue writing, even if their picks were correct purely by chance.


How to create a flashcard app without a database

Last week, I covered how setting up a database may not be necessary when creating an app or visualization, even one that relies on data. This week we are going to walk through an example application that runs off data, but does not need a formal database.

First some background. For the last couple of months, I have been attending some basic Arabic classes to help me get around Jordan more easily. During a recent class, several of the students were discussing the time they had spent putting together physical cardboard flashcards to help them memorize words. Hearing this, and having played around with creating simple applications and visualizations for a couple of years now, it occurred to me that generating flashcards using a simple application would probably be significantly quicker and easier to do. In addition, if done the right way, it could work on your phone, making it available to you anytime you had your phone and an internet connection.

Perhaps as you are reading this, you are thinking that Arabic is a widely spoken language, surely there is already an app for that? And you would be correct, there are multiple apps for learning Arabic. However, the complication with Arabic is that each country/region uses a significantly different version of the language. In addition to these regional dialects, there is Modern Standard Arabic, which is the formal written version of the language. When it comes to the apps currently available, the version of Arabic being presented is almost always Modern Standard Arabic, as opposed to the Levantine Arabic which is spoken throughout Palestine, Jordan, Lebanon and Syria. Additionally, the apps are quite expensive (up to $10), of questionable accuracy/quality, or both.

To address this problem, and for the challenge and learning opportunity, a few weeks back I sat down over a weekend and put together a simple application that would generate Arabic flashcards (you can see the current version here). The application is based on an Excel spreadsheet with translations that I continue to enter over time based on my notes and the class textbook. Using an Excel spreadsheet in this case is advantageous for two key reasons:

  • It is simple to edit and update with new translations
  • It is a format that almost everyone is familiar with so I can recruit other students and teachers to add translations

With that out of the way, let’s take a look at the high-level process for creating a flashcards app.

1. Collecting the Data

The first step is creating the Excel spreadsheet for our translations. In this case, it is fairly simple and looks something like this:

arabic flashcards

In this spreadsheet, each row represents one word and, through the application, will represent one flashcard. In one column, we have the Arabic script for each word, then in the next column the English translation. In addition, we also have a third version of the word with a column header of ‘transcribed’. This column represents how the Arabic word would look/sound if written in Latin script, something that is commonly used in beginner classes when students cannot yet read Arabic script.[1] Finally, in the last column we have a category column. This will be used to provide a feature where the user can filter the words in the application, allowing them to focus their study on particular sets of words.

A quick note, we also have an ID column, which is not used in the application. It is only included to provide a unique key in our datasets, as good practice.

2. Processing the Data

The next step is to take the data from our spreadsheet, convert it to a format that we can use to generate flashcards in the application, then save it. To do this we will use my favorite Python library, pandas, and the short script shown below.

# -*- coding: utf-8 -*-
import pandas as pd

# Read In Data
df = pd.read_excel("./data.xlsx", header=0)

# Create JSON String
json_string = df.to_json(orient="records", force_ascii=False)
json_string = "var data = " + json_string + ";"

# Write to file
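text_file = open("data.js", "w", encoding="utf-8")  # UTF-8 keeps the Arabic text intact
text_file.write(json_string)
text_file.close()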

What this script does is read in the file (in this case, data.xlsx) to a pandas dataframe (line 5). After that (line 8), we use the to_json method to output the contents of the dataframe to a JSON string. In line 9 we add some JavaScript to the beginning and end of that JSON string, then in lines 12-14 we save the string as a JavaScript file, data.js.

There are a couple of important things to note here. The first is that when dealing with non-Latin text characters (like Arabic characters), we need to specify that force_ascii=False (the default value is True). If we don’t do this, the script will return an error and/or convert the Arabic letters into a combination of Latin characters representing the Unicode character (i.e. it will look like gibberish).

The second thing to note for those that have not worked with JSON, or key-value stores more generally, is that this is the format that most data comes in when used in programs and applications. It is a highly flexible structure and, as a result, there are many ways we could represent the data shown above. In this case, we are using the ‘records’ format (as specified by pandas), which will look like this:


"english":"A lot\/Many\/Very",


If this isn’t making any sense, or you would like to see some of the other possibilities, copy and paste some spreadsheet data into this CSV to JSON convertor. Toggling a few options, it should quickly become obvious how many different ways a given dataset can be represented in JSON format.
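To make the ‘records’ layout and the ASCII setting a little more concrete, here is a minimal sketch using only Python's built-in json module, not pandas. The single row, its values and the column names other than ‘transcribed’ are illustrative assumptions, and json's ensure_ascii argument stands in for the force_ascii argument of to_json. One visible difference is that the json module does not escape forward slashes the way the pandas output shown above does.

```python
import json

# One hypothetical spreadsheet row in 'records' form:
# a list with one dictionary per row, keyed by column header.
records = [
    {
        "id": 1,
        "arabic": "كتير",
        "transcribed": "kteer",
        "english": "A lot/Many/Very",
        "category": "adjectives",
    },
]

escaped = json.dumps(records, ensure_ascii=True)    # Arabic becomes \uXXXX codes
readable = json.dumps(records, ensure_ascii=False)  # Arabic is kept as written

# Wrap the JSON in a JavaScript assignment, as the script above does.
js_source = "var data = " + readable + ";"
print(escaped)
print(js_source)
```

Comparing the two printed lines shows why the ASCII setting matters: the escaped version is still valid JSON, but it is unreadable to a person checking the file.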

3. Building the App

Now that the data is ready, we create the files needed for the flashcards application. In this case, it is only three files: an HTML document (index.html) for the page, a CSS file for the styling, and an additional JavaScript file that will use the data in data.js to create the flashcards and generate the various features of the application. For those that are interested in the full code or want to create your own version, please feel free to checkout/fork the GitHub repo. For those that do not want to get too far into the weeds, there are just a few things I want to highlight about what the code is doing.

Firstly, the filtering and language options in the application are being generated directly from the data. What this means is that as more categories are added to the Excel spreadsheet, or if the languages change (i.e. the headings in the spreadsheet change), as soon as I update the underlying Excel and run the script shown above, all the options in the application will also update accordingly.
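The idea of deriving the options from the data can be sketched as follows. The application itself does this in JavaScript, so this Python version is only an illustration of the logic, and the column names and sample rows are assumptions.

```python
def build_options(records, non_language_keys=("id", "category")):
    """Work out the filter and language options from the data itself."""
    # Every distinct category value becomes a filter option.
    categories = sorted({row["category"] for row in records})
    # Every column that is not bookkeeping is treated as a language.
    languages = [key for key in records[0] if key not in non_language_keys]
    return categories, languages


# Hypothetical rows, standing in for the contents of data.js.
sample = [
    {"id": 1, "arabic": "كتير", "transcribed": "kteer",
     "english": "A lot/Many/Very", "category": "adjectives"},
    {"id": 2, "arabic": "بيت", "transcribed": "beit",
     "english": "House", "category": "nouns"},
]
print(build_options(sample))
```

Because nothing is hard-coded, adding a new category or renaming a column in the spreadsheet changes the options without touching the application code.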

Secondly, I added a feature that allows the user to keep score. It is a simple honesty-based system, but I found it does provide some motivation to keep improving, as well as removing an element of self-deception as to how well you are actually doing. Often I would find myself thinking that I was getting almost all of them correct, only to find my correct percentage hovering around 70%.

Finally, a note on randomness. Whether the user is going through the cards unfiltered, or filtering for some category, the application is displaying the flashcards in a random[2] order. This random selection algorithm went through several iterations:

  1. In version 1, the algorithm would simply select four (the number of flashcards presented to the user at one time) random selections from the pool of eligible words.
  2. Upon testing version 1, it was found that, with surprising regularity, the same word would be selected more than once in a group of four flashcards. To address this, in version 2 a condition was added that when randomly selecting a word, it would only be accepted if that word had not already been selected in the given pool of four words.
  3. On further testing, I noticed another annoying issue. As I continually refreshed the four flashcards being displayed, some words would show up repeatedly, while others would take forever to show up, or not show up at all. To avoid this, for version 3, I changed the algorithm again. Now, instead of selecting four words at random, the algorithm instead took the whole list of words, shuffled them in a random order, and ran through the list in the new shuffled order. When the list ran out of words, it took the full list, shuffled it again, and continued.
  4. This was a big improvement. As I refreshed, I got different words, and was able to see all the words before they started repeating. But then I found another issue. In cases where the number of eligible words was not divisible by four, the old shuffled list and the new shuffled list would overlap in a selection of four words. In these cases, there was a possibility that the same word would be repeated. This is a little difficult to visualize, so the illustration below tries to present what was happening using an example list of ten words:

arabic flashcards

To address this, in version 4, a new condition was added. In cases like the example shown above, the algorithm will check the words from the new shuffled list to ensure they are not already selected from the old list. If a word is already selected, it will move that word to the end of the list and instead take the next word on the list. Here is another diagram to show what is happening:

arabic flashcards
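The version 4 behaviour described above can be sketched as follows. The real application is written in JavaScript, so this Python class is an illustrative reconstruction of the described logic and not the code from the repo. The class name, method names and the seed parameter are assumptions.

```python
import random
from collections import deque


class FlashcardDealer:
    """Deal hands of flashcards from repeatedly reshuffled passes over the list."""

    def __init__(self, words, hand_size=4, seed=None):
        self.words = list(dict.fromkeys(words))  # drop duplicates, keep order
        if len(self.words) < hand_size:
            raise ValueError("need at least hand_size distinct words")
        self.hand_size = hand_size
        self.rng = random.Random(seed)
        self.queue = deque()

    def next_hand(self):
        hand = []
        while len(hand) < self.hand_size:
            if not self.queue:
                # The current pass is used up: start a new shuffled pass.
                fresh = self.words[:]
                self.rng.shuffle(fresh)
                self.queue.extend(fresh)
            word = self.queue.popleft()
            if word in hand:
                # Already in this hand from the previous pass: move it to
                # the end of the new pass and take the next word instead.
                self.queue.append(word)
            else:
                hand.append(word)
        return hand


dealer = FlashcardDealer(["word%d" % i for i in range(10)], seed=7)
for _ in range(3):
    print(dealer.next_hand())
```

With ten words and hands of four, the third hand straddles two passes, which is exactly the case the extra check is there to handle.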

4. Finishing Up

Ok, for those stepping through this and creating your own flashcards app, at this point you have copied the code available from the repo, made any changes to the spreadsheet, and rerun the script to refresh the data. For the final step, there are a couple of things that can be done.

If you are only planning to use the app on the same computer as you are using to create the flashcards app, you are done! Just open the index.html file using Chrome, Firefox or Safari (you can try Internet Explorer, but you know…) and you can test and use the app as you would use any website.

If you want to publish your flashcards app online to share with others, by far the easiest way is to use a service such as GitHub pages. I don’t want to turn this into a beginners guide to using git and GitHub, but there is excellent documentation available to help get you started if you would like to do this. You can see my version at the following address:, but there is even an option to redirect it to a domain of your choosing should you have one.

arabic flashcards


I hope this was a helpful guide to how a simple application can be created without a database, even if the application runs on some underlying form of data. Let me know what you think in the comments below!


[1] Because Arabic has many sounds that are difficult to convey in Latin script, this is also why when Arabic is transcribed, you will often find multiple spellings of the same word (e.g. Al-Qaeda vs Al-Qaida).

[2] As will be discussed in a new piece to be written, it is not truly random, and the reasons why are pretty interesting.

Forget SQL or NoSQL – 5 scenarios where you may not need a database at all

A while back, I attended a hackathon in Belgrade as a mentor. This hackathon was the first ‘open data’ hackathon in Serbia and focused on making applications using data that had recently been released by various ministries, government agencies, and independent bodies in Serbia. As we walked around talking to the various teams, one of the things I noticed at the time was that almost all teams were using databases to manage their data. In most cases, the database being used was something very lightweight like SQLite3, but in some cases more serious databases (MySQL, PostgreSQL, MongoDB) were also being used.

What I have come to realize is that in many cases this was probably completely unnecessary, particularly given the tight timeframe the teams were working towards – a functional prototype within 48 hours. However, even if you have more time to build an application, there are several good reasons that you may not need to worry about using a formal database. These are outlined below.

1. The data is small

Firstly, let’s clarify what I mean when I say ‘small data’. For me, small data is any dataset under 10,000 records (assuming a reasonable number of data points for each record). For many non-data people, 10,000 records may seem quite big, but when using programming languages such as Python or JavaScript, this amount of data is usually very quick and easy to work with. In fact, as Josh Zeigler found, even loading 100,000 records or 15MB of data into a page was possible, completing in as little as 463ms (Safari FTW).

Leaving aside the numbers for a second, the key point here is that in many cases, the data being displayed in an application has far fewer than 10,000 records. If your data has fewer than 10,000 records, you should probably ask yourself: do you need a database? It is often far simpler, and requires significantly less overhead, to simply have your data in a JSON file and load it into the page directly. Alternatively, CSV and Excel files can also be converted to JSON and dumped to a file very quickly and easily using a Python/Pandas script.
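To give a feel for how little code the conversion takes, here is a minimal sketch using only the Python standard library (with Pandas, `pandas.read_csv` followed by `DataFrame.to_json(orient='records')` does the same job). The function name `csv_to_json` and the sample data are made up for illustration.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text into a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)

# Made-up sample data, for illustration only.
sample = "country,year,score\nSerbia,2015,0.5\nKosovo,2015,0.4\n"
print(csv_to_json(sample))
```

Note that the `csv` module reads every value as a string, so numeric columns would need converting before being charted.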

ecis visualization

The ECIS Development Tracker uses data from six Worldwide Governance Indicators and two other series over 20 years and 18 countries – a total of almost 3,000 data points and a perfect example of small data.

2. The data is static

Another reason you may not need a database is if you have a reasonable expectation that the data you are using is not going to change. This is often the case where the data is going to be used for read only purposes – for example visualizations, dashboards and other apps where you are presenting information to users. In these cases, again it may make sense to avoid a database, and rely on a flat file instead.

The important point here is that if the data is not changing or being altered, then static files are probably all that is needed. Even if the data is larger, you can use a script to handle any data processing and load the (assumedly) aggregated or filtered results into the page. If your needs are more dynamic (i.e. you want to show different data to different users and do not want to load everything), you may need a backend (something you would need for a database anyway) that extracts the required data from the flat file, but again, a database may be overkill.

kosovo mosaic

The Kosovo Mosaic visualizer – based on data from a survey conducted once every three years – is an example of a case where the data is not expected to change any time soon.

3. The data is simple

One of the big advantages of databases is their ability to store and provide access to complex data. For example, think about representing data from a chain of retail stores on the sale of various products by different sales people. In this case, because there are three related concepts (products, sales people and stores), representing this data without using a database becomes very difficult without a large amount of repetition[1]. In this case, even if the data is small and static, it may simply be better to use a relational database to store the data.

However, in cases where the data can be represented in a table, or multiple unrelated tables, subject to points 1 and 2 above, it may make sense to avoid the overhead of a database.

database schema

If you need a schema diagram like this to describe your data, you can probably skip the rest of this article.

4. The data is available from a good API

I have recently been working on a project to develop an application that is making extensive use of the Google API. While still under development, the app is already quite complex, making heavy use of data to generate charts and tables on almost every page. However, despite this complexity, so far, I have not had to use a database.

One of the primary reasons I have not needed to implement a database is that the Google API is flexible enough for me to effectively use that as a database. Every time I need data to generate a chart or table, the app makes a call to the API (using Python), passes the results to the front end where, because the data is small (the Google API returns a maximum of 10,000 rows in a query), most of the data manipulation is handled using JavaScript on the client side. For the cases where more heavy data manipulation is required, I make use of Python libraries like Pandas to handle the data processing before sending the data to the front end. What this boils down to is a data intensive application that, as yet, still does not need a database.

Of course, this isn’t to say I will not need a database in the future. If I plan to store user settings and preferences, track usage of the application, or collect other metadata, I will need to implement a database to store that information. However, if you are developing an application that will make use of a flexible and reliable API, you may not need to implement your own database.

google apis

Google has APIs available for almost all of its products – most of them with a lot of flexibility and quick response times.

5. The app is being built for a short-term need

While it might seem unusual to build a web app with the expectation that it will not be used six months later, this is a surprisingly common use case. In fact, this is often the expectation for visualizations and other informative pages, or pages built for a specific event.

In these particular use cases, keeping down overhead should be a big consideration, in addition to potential hosting options. Developing these short-term applications without a backend and database means free and easy hosting solutions like that provided by GitHub can be used. Adding a backend or database immediately means a more complex hosting setup is required.

Wrapping up, this is not an argument against databases…

… it is simply an argument to use the best and simplest tools for a given job. As someone who has worked with a number of different databases throughout their career, I am actually a big user of databases and find most of them intuitive and easy to use. There are also a large number of advantages that only a database can provide: from ensuring data consistency, to facilitating large numbers of users simultaneously making updates, to managing large and complex datasets, there are very good reasons to use a database (SQL or NoSQL, whichever flavor you happen to prefer).

But, as we have covered above, there may be some cases where you do not need these features and can avoid adding an unnecessary complication to your app.


Next week we’ll take a look at a simple app that uses an Excel spreadsheet to generate the data required for the application.


[1] With repetition comes an increased risk of data quality issues

Uber Vs Taxi – A Follow-Up

Hi everyone – welcome to 2017! I hope you all had a good Christmas and New Year’s Eve and are geared up for a big 2017.

Kicking off the year, this week, I happened to stumble on a series of articles written by Hubert Horan, who has spent the last 40 years working in the transportation industry, particularly the management and regulation of airlines. In a four-part series (two pieces were later added to respond to reader comments and look at newer evidence) published at, he takes a critical look at the Uber business model and dispels a bunch of myths.

Some of my longer-time readers may remember a two-piece series I wrote looking at the relative advantages of Uber and traditional taxis (Part I and Part II). This series of articles (links at the bottom) actually expands on many of the points I brought up in those articles, particularly Part II where I took a more critical look at some of Uber’s practices. The TLDR is as follows:

  1. Despite huge expansion across the globe, Uber is continuing to burn through investors’ cash at an unprecedented rate (around $2 billion a year).
  2. Although there have been large increases in revenues, there are no signs to date that Uber’s profitability (currently sitting at around -140%!) is improving due to ‘economies-of-scale’, older markets maturing, or other ‘optimizations’. In fact the only thing that has had a measurable impact on profitability has been cutting driver pay.
  3. Uber’s huge losses are primarily due to one thing – the expansion across the globe is being driven by subsidies. According to Horan, current Uber passengers are only paying around 41% of the cost of their rides due to these subsidies (I do note that no source was provided for this number).
  4. Paying drivers more than regular taxi services is one of the main ways Uber is attracting drivers. However, one of the things that allows Uber to do this is the fact that they have pushed one of the most significant costs of running a taxi onto the drivers – the actual ownership and maintenance of the car. Once the expenses of running and maintaining a car are taken into account, it is not clear that drivers are actually any better off, and in many cases, are probably worse off.
  5. This is something I touched on in my articles – many (most?) Uber drivers are simply not across concepts like depreciation and capital risk. For them net profit is simply ‘my share of fare revenue’ minus ‘gas costs’, which leads to a large proportion of Uber drivers continuing to drive when it does not make economic sense for them to do so. A big part of Uber’s success has been their ability to take advantage of this ignorance.

So why are investors continuing to pour money into Uber if it isn’t making money and the current business model does not seem to make sense? I have heard two theories raised in response to this question.

The first is that Uber is simply buying time to get self-driving cars on the road, at which point, it can replace (a.k.a. fire) all its ‘driver-partners’ and Uber’s share of fare revenue goes from 30% to 100%. I was actually a believer in this theory until recently when Noah Smith made the counter-intuitive argument that self-driving technology is likely to be terrible for Uber. Why? Because every person with a self-driving car becomes a potential competitor for Uber. By simply renting out their car when they are not using it, they are competing with Uber and can do so at very low cost because they have none of the overheads Uber has. Sure, Uber will have the app, but the app is easy and cheap to recreate (as is evidenced by the 17 Uber clones in most cities already). But even without an app at all, a large portion of the market is going to go through the minimal hassle of calling or texting (or whatever else the kids are doing these days) someone for a ride if the price is even a couple of dollars better. Finally, even if Uber lowers prices to drive (pun intended) those people out of the market, as soon as prices rise again, all those individuals will re-enter the market due to the close to zero cost of doing so.

The second (and more realistic theory in my mind) is that Uber is aiming to drive all its competitors out of business and create a monopoly. Once it has a monopoly, it can lower driver pay and raise fare prices to extract monopoly profits. Uber’s behavior to date (the subsidies are simply predatory pricing with good publicity), as well as comments from prominent investors, would seem to lend credence to this theory. But even this theory has issues, the biggest of which would seem to be that it has a very limited window to operate in due to the imminent arrival of self-driving cars. I am probably more skeptical than most people on how soon self-driving cars will be on the streets of cities (10-20 years, with long haul probably coming sooner), but even if we take the best case scenario for Uber and say it is going to be 20 years before self-driving cars are on the streets of cities, is that going to be long enough to generate the returns needed to justify the huge sums investors have poured into the company? And if this is the plan, why is Uber trying to speed up the introduction of self-driving cars? I don’t have good answers to either of these questions unfortunately.

For those with any interest in this topic, I strongly encourage you to read at least the first 4 parts over at – here are the links:

Part 1 – Understanding the Economics

Part 2 – Understanding Cost Structures

Part 3 – Innovation and Competitive Advantages

Part 4 – Understanding that Monopoly was Always the Goal

Part 5 – Addressing Reader Comments

Part 6 – Further Evidence

Finally, on an anecdotal note, I have recently moved to Amman where Uber operates, along with a local competitor (Careem) and a large local taxi industry. For those that may be thinking that people will probably be happy to pay a little extra for the improved Uber experience, Amman would offer an example of the opposite case. In Amman, Uber and Careem both cost around 1.5-2 times as much as a metered taxi. Either way it is still cheap (a ride from downtown to the western edge of the city would be $2.50-$3.50 in a taxi, $5.50-$6.50 in an Uber), but even with the truly horrible state of most taxis in Amman, and the inconvenience of having to flag one down, this price differential is enough to make the ride sharing portion of the transport business practically non-existent.

ECIS Development Tracker

View the visualization here.

JSONify It – CSV to JSON Converter

Go to JSONify It

For those who have some experience in creating visualizations, particularly online visualizations using JavaScript and libraries such as D3.js, one thing that you will often come across is the need to convert your data. Typically this need will arise because the data you receive or collect will be in a human-friendly format such as an Excel spreadsheet, and in order for you to use it for the visualization you will need that data in JSON format. Annoyingly, this will often be just a one-time conversion, meaning writing a standalone script to do the conversion often seems like overkill.

Handily, there are a number of CSV to JSON converters lying around on the internet for people to use, and most of them work more or less as expected. However, a problem I encountered when building this Procurement Zoomable Treemap visualization, is that you sometimes need the JSON to be nested, and this was not a feature I encountered on any of the online converters.

In order to address my need (and to see if I could pull it off), when I built that visualization I also used Python/Flask/Pandas to build a simple API that generated nested JSON datasets on the fly from an underlying CSV file. Having this allowed me to build a zoomable treemap that could be reconfigured by the user. That is, the user could specify the categories and the order in which the treemap would zoom through them.

While this was great, it always felt a little incomplete. Then a few months back, I had some time on my hands and decided to take this API and upgrade it to have a full user interface so that, like many of the online converters, users could copy and paste data straight into the browser, configure some options, and get a JSON formatted dataset back. The result was JSONify It – a simple but (I hope) easy to use app that is not only very flexible in formatting JSON, but as far as I can tell, is the only CSV to JSON converter that allows you to nest the JSON by any column (or columns) you specify.
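To illustrate what nesting by column means, here is a small standard-library sketch. It is not the JSONify It code, and the output layout (a dictionary keyed by each column value) is just one possible choice; the function name `nest` and the sample rows are made up for illustration.

```python
import json

def nest(rows, keys):
    """Group row dicts into nested dicts, one level per column name in keys."""
    if not keys:
        return rows
    first, rest = keys[0], keys[1:]
    groups = {}
    for row in rows:
        # Drop the grouping column from the row, since it becomes the key.
        remaining = {k: v for k, v in row.items() if k != first}
        groups.setdefault(row[first], []).append(remaining)
    return {value: nest(group, rest) for value, group in groups.items()}

# Made-up sample rows, for illustration only.
rows = [
    {"ministry": "A", "supplier": "X", "amount": "10"},
    {"ministry": "A", "supplier": "Y", "amount": "5"},
    {"ministry": "B", "supplier": "X", "amount": "7"},
]
nested = nest(rows, ["ministry", "supplier"])
print(json.dumps(nested, indent=2))
```

Changing the order of the column names passed to `nest` changes the order of the nesting, which is the kind of reconfiguration the treemap needed.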

So, for those interested, feel free to take a look, try it out, look at the code, and if you come across any bugs or issues, or would like any further information, please let me know in the comments below.


Can’t Get No Satisfaction – Kosovo Mosaic

Last week, one of the cooler projects I have worked on since I started with Open Data Kosovo finally got released to the public. The project was to visualize data collected by a survey called Kosovo Mosaic. This survey is run every three years across all 38 municipalities in Kosovo and asks citizens, amongst other things, how satisfied they are with a range of services the municipality and government provides. 2015 was the fifth installment of the survey.

Our job[1] was to work with the Kosovo Mosaic data and come up with a way to visualize it so that people who are not into spreadsheets and coding can interact with it and get something from it. Our solution, using D3.js and Highcharts, was to provide users with a (hopefully) easy to understand interface to explore the data and find their own interesting conclusions.

Obviously the data presented in this visualization will appeal mostly to those who have some connection to Kosovo. However, even for those that do not, I thought you may be interested in seeing how we approached the problem and whether it works for you. If you do have any thoughts, things you like, things we could have done better, please feel free to leave a comment.

You can access the interactive visualization by clicking the picture below:

Kosovo Mosaic

[1] Special thanks go to Vullkan Halili who did most of the work on this project.

Data Science: A Kaggle Walkthrough – Creating a Model

This article is Part VI in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part I, Part II, Part III, Part IV and Part V).

Continuing on the walkthrough, in this part we build the model that will predict the first booking destination country for each user based on the dataset created in the earlier parts.

Choosing an Algorithm

The first step to building a model is to decide what type of algorithm to use. Below we look at some of the options.

Decision Tree

Arguably the most well known algorithm, and one of the simplest conceptually. The decision tree works in a similar manner to the decision tree that you might create when trying to understand which decision to make based on a range of variables.

The goal of the decision tree algorithm used for classification problems (like the one we are looking at) is to create one of these decision trees to classify records into a set number of categories. To do this, it starts with all the records in the training dataset and looks through all the features until it finds the one that allows it to most ‘cleanly’ split the records according to their categories. For example, if you are using daily weather data to try and determine whether it will rain the following day (i.e. there are two categories, ‘it does rain’ and ‘it does not rain’), the algorithm will look for a feature that best splits the records (in this case representing days) into those two categories. When it finds that feature, and the value to split on, it creates one point (‘decision node’) on the decision tree. It then takes each subpopulation and does the same thing again, building up a tree until either all the records are correctly classified, or the number in each subpopulation becomes too small to split. Below is an example decision tree using the described weather data to predict if it will rain tomorrow or not (thanks to Graham Williams’ excellent Rattle package for R):

The way to interpret the above tree is to start at the top. The first criterion the algorithm splits on is the humidity at 3pm. Starting with 100% of the records, if the humidity at 3pm is less than 71, as is the case for 93% of the records, we move to the left and find the next decision node. If the humidity at 3pm is greater than or equal to 71, we move to the right, which takes us to a leaf node where the model predicts that there will be rain tomorrow (‘yes’). We can see from the numbers in the node that this represents 7% of all records, and that 74% of the records that reach this node are correctly classified.

The first thing to note is that the model does not accurately predict whether it will rain tomorrow for all records, and in some leaf nodes, it is only slightly better than a coin toss. This is not necessarily a bad thing. The biggest problem that data scientists have with decision trees is the classic problem of overfitting. In the example above, parameters have been set to stop the model splitting once the population of records at a given node gets too small (‘minimum split’) and when a certain number of splits have occurred (‘maximum depth’). These values have been set to prevent the tree from growing too large. The reason for this is that if the tree gets too large, it will start modelling random noise and hence will not work for data not in the training dataset (it will not ‘generalize’ well).

To picture what this means, imagine extending the example decision tree above further until the model starts splitting out single records using criteria like ‘Humidity3pm = 54’ and ‘Humidity3pm = 31’. That type of decision node may work for this particular training data because there is a specific record that meets that criterion, but it is highly unlikely that it represents any predictive ability and so is unlikely to be accurate if applied to other data.

All this discussion of overfitting with decision trees does however raise an important problem. That problem is how do you know how large you should grow the tree? How do you set the parameters to avoid overfitting but still have an accurate model? The truth is that it is extremely difficult to know how to set the parameters. Set them too conservatively and the model will lose too much predictive power. Set them too aggressively and the model will start overfitting the data.
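To make the idea of finding the ‘cleanest’ split a little more concrete, here is a minimal sketch of how a single decision node could be chosen for one numeric feature. It uses Gini impurity as the measure of cleanliness, which is one common choice but an assumption here, and the function names and the tiny humidity dataset are made up for illustration.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def best_split(values, labels):
    """Return (threshold, impurity) for the cleanest 'value < threshold' split."""
    best = (None, gini(labels))  # start from the impurity with no split at all
    for threshold in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        # Weight each side's impurity by the share of records it holds.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up data: humidity at 3pm and whether it rained the next day.
humidity = [30, 45, 60, 71, 80, 90]
rain = ['no', 'no', 'no', 'yes', 'yes', 'yes']
print(best_split(humidity, rain))
```

A full tree builder would run this search over every feature, pick the best one, and then repeat the process on each side of the split until a stopping rule such as minimum split or maximum depth is reached.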

Seeing the Forest for the Trees

Given the limitations of decision trees and the risk of overfitting, it may be tempting to think “why bother?” Fortunately, methods have been found to reduce the risk of overfitting and increase the predictive power of decision trees, and the two most popular methods both have the same basic premise – to train multiple trees.

One of the most well known algorithms that utilizes decision trees is the ‘random forest’ algorithm. As the name suggests, the algorithm constructs a large number of different trees (as defined by the user) by randomly selecting the features that can be used to build each tree (as opposed to using all the features for each tree). Typically, the trees in a random forest also have the parameters set to ensure each tree will also be relatively shallow, meaning that the algorithm creates a large number of shallow decision trees (decision bonsai?). Once the trees are constructed, each tree is used to predict the outcome for a new record, with these multiple predictions then serving as votes, with a majority rules approach applied.

Another algorithm which has become almost the default algorithm of choice for Kagglers, and is the type of model we will use, uses a method called ‘boosting’, which means it builds trees iteratively such that each tree ‘learns’ from earlier trees. To do this the algorithm builds a first tree – again typically a shallower tree than if you were going to use a one tree approach – and makes predictions using that tree. Then the algorithm finds the records that are misclassified by that tree, and assigns a higher weight of importance to those records than the records that were correctly classified. The algorithm then builds a new tree with these new weightings. This whole process is repeated as many times as specified by the user. Once the specified number of trees have been built, all the trees built during this process are used to classify the records, with a majority rules approach used to determine the final prediction.

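It should be noted that this methodology (‘boosting’) can actually be applied to many classification algorithms, but has really grown popular with the decision tree based implementation. It should also be noted there are different implementations of this algorithm even just using trees. In this case, we will be using the very popular XGBoost algorithm. XGBoost is a gradient boosting implementation, so the details differ a little from the description above: rather than reweighting misclassified records, each new tree is fit to the errors left by the trees built so far, and the outputs of the trees are added together rather than counted as votes.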
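As a rough sketch of the two ideas described in this section, the snippet below shows a majority vote across trees and a simple reweighting of misclassified records. The function names and the doubling factor (`boost=2.0`) are made up for illustration; real boosting libraries derive the adjustment from the error of each tree rather than using a fixed factor.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

def reweight(weights, correct, boost=2.0):
    """Multiply the weight of each misclassified record by boost, then renormalize."""
    raised = [w if ok else w * boost for w, ok in zip(weights, correct)]
    total = sum(raised)
    return [w / total for w in raised]

# Three hypothetical trees vote on one record.
print(majority_vote(['rain', 'no rain', 'rain']))

# Four equally weighted records; the last one was misclassified by the first tree.
print(reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False]))
```

After the reweighting step the misclassified record carries twice the weight of each of the others, so the next tree has more incentive to get it right.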

Alternative Models

So far we have only covered decision trees and decision tree-based algorithms. However, there is a range of different algorithms that can be used for classification problems. Given this is supposed to be a short blog series, I will not go into too much detail on each algorithm here. But if you want more information on these algorithms, or other algorithms that I haven’t covered here, there is a growing amount of information online. I also strongly recommend the Data Science specialization offered by Johns Hopkins University, for free, on Coursera.

K-Nearest Neighbors

The K-nearest neighbor algorithms are arguably one of the simplest algorithms in concept. The algorithm classifies a given object by looking at the classification of the k most similar records[1] and seeing how those records are classified. This type of algorithm is called a lazy learner because during the training phase, it essentially just stores the data provided. Only when a new object needs to be classified does the algorithm start looking through the data to try to find the closest matches.
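Because the algorithm is so simple, it fits in a few lines. The sketch below assumes straight-line (Euclidean) distance as the measure of similarity and a simple majority vote among the neighbors; the function name and the sample points are made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # 'Training' was just storing the data; all the work happens here.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up two-feature data with two well separated classes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(points, labels, (2, 2)))
```

Note that every prediction sorts the entire training set, which is why lazy learners can become slow on large datasets.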

Neural Networks

As the name suggests, these algorithms simulate biological networks by creating a series of nodes and connecting them together. A neural network typically consists of three layers: an input layer, a hidden layer (although there can be multiple hidden layers) and an output layer.

A model is trained by passing records through the network, with the weights at each node continually adjusted to ensure that the record ends up at the right ‘output node’.

Support Vector Machines

This type of algorithm, commonly used for text classification problems, is arguably the most difficult to visualize. At the simplest level, the algorithm tries to draw straight lines (or planes for classifications with more than 2 features) that best separate the classes provided. Although this sounds like a fairly simplistic approach to classifying objects, it becomes far more powerful due to the transformations (sometimes called a ‘kernel trick’) the algorithm can apply to the data before drawing these lines/planes. The mathematics behind this are far too complex to go into here, but the Wikipedia page has some nice visuals to help picture how this is working. In addition, this video provides a nice example of how a Support Vector Machine can separate classes using this kernel trick:

Creating the Model

Back to the modelling – now that we know what algorithm we are using (XGBoost algorithm for those skipping ahead), let’s talk about the approach.

Cross Validation

As mentioned in regards to decision trees, one of the key risks when creating models of any type is the risk of overfitting. One of the primary ways data scientists will guard against overfitting is to estimate the accuracy of their models on data that was not used to train the model. To do this they typically use a method called cross validation. There are different methods for doing cross validation, but the method we will employ is called k-fold cross validation.

k-fold cross validation involves splitting the training data into k subsets (where k is greater than or equal to 2), training the model using k – 1 of those subsets, then running the model on the subset that was not used in the training process. Because all of the data used in the cross validation process is training data, the correct classification for each record is known and so the predicted category can be compared to the actual category. Once all folds have been completed, the average score across all folds is taken as an estimate of how the model will perform on other data. An example of a 3-fold cross validation is shown below:
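The splitting logic itself can be sketched in a few lines of plain Python. This is only an illustration of the idea with a made-up function name (Scikit-Learn handles the splitting for us later), and it splits the records in their existing order rather than shuffling them first.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Spread any remainder across the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 3-fold cross validation over nine records.
for train, validation in k_fold_indices(9, 3):
    print("train:", train, "validate:", validation)
```

Each record is used for validation exactly once and for training in the other k – 1 folds.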

Parameter Tuning

As you may have realized from the earlier description of the XGBoost algorithm – there are quite a few options (parameters) that we need to define to build the model. How many trees to build? How deep should each tree be? How much extra weight will be attached to each misclassified record? Tuning these parameters to get the best results from the model is often one of the most time consuming things that data scientists do. Fortunately, the process can be automated to a large degree so that we do not have to sit there rerunning the model repeatedly and noting down the results. Even better, using the Scikit-Learn package, we can merge the parameter tuning and cross validation steps into one, allowing us to search for the best combination of parameters while using k-fold cross validation to verify the results.

Training the Model

In order to train the model (using cross validation and parameter tuning as outlined above), we first need to define our training dataset – remembering that we previously combined the training and test data to simplify the cleaning and transforming process. To feed these into the model, we also need to split the training data into the three main components – the user IDs (we don’t want to use these for training as they are randomly generated), the features to use for training (X), and the categories we are trying to predict (y).

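# Prepare training data for modelling
# (pd, df_train and df_all are assumed to carry over from the earlier parts)
from sklearn.preprocessing import LabelEncoder

df_train.set_index('id', inplace=True)
# Inner join keeps only the training users from the combined dataset
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
# Encode the country labels as integers for the model
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)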

Now that we have our training data ready, we can use GridSearchCV to run the algorithm with a range of parameters, then select the model that has the highest cross validated score based on the chosen measure of performance (in this case accuracy, but there are a range of metrics we could use based on our needs).

# Grid Search - Used to find best combination of parameters
# Note: older Scikit-Learn releases exposed GridSearchCV via sklearn.grid_search
# and accepted iid=True; current releases use sklearn.model_selection and have
# dropped the iid argument.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4, 5], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = GridSearchCV(estimator=XGB_model, param_grid=param_grid, scoring='accuracy', verbose=10, n_jobs=1, refit=True, cv=3)
# Fit every parameter combination using 3-fold cross validation
model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Please note that running this step can take a significant amount of time. Running the algorithm with 25 trees takes around 2.5 minutes for each cross validation on my MacBook Pro. Running the script above with all the options specified will likely take well over an hour.
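To see where that time goes, it helps to count how many models the grid search actually fits. The back-of-the-envelope calculation below takes the 2.5 minute figure as the time for one fit with 25 trees and assumes the time grows linearly with the number of trees, ignoring the effect of tree depth and learning rate; both are assumptions, so treat the result as a rough guide only.

```python
from itertools import product

param_grid = {'max_depth': [3, 4, 5], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
cv_folds = 3
minutes_per_25_trees = 2.5  # figure quoted above, taken as the time for one fit

# Every combination of parameter values in the grid.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

# Assumption: fit time grows linearly with the number of trees.
estimated_minutes = sum(
    cv_folds * minutes_per_25_trees * combo['n_estimators'] / 25
    for combo in combinations
)

print(len(combinations), "combinations,", total_fits, "fits,",
      estimated_minutes, "minutes (estimated)")
```

Under these assumptions the grid has 12 combinations and 36 fits, plus one final refit on all of the training data, so adding even one more value to any parameter list noticeably increases the run time.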

Making the Predictions

Now that we have trained a model based on the best parameters, the next step is to use the model to make predictions for the records in the testing dataset. Again we need to extract the testing data out of the combined dataset we created for the cleaning and transformation steps, and again we need to separate the main components for the model. After these steps, we use the model created in the previous step to make the predictions.

# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test.loc[:,['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values
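
The fillna(-1) call matters because of how the merge is set up. The left merge keeps every test user, but any user that is missing from df_all (for example, one dropped by the earlier inner join with the sessions data) ends up with empty feature values. A minimal sketch with made-up data shows the effect; toy_test, toy_features and their values are invented for illustration and are not part of the competition data.

```python
import pandas as pd

# Toy stand-in for df_test: three test users, indexed by id
toy_test = pd.DataFrame(
    {"date_first_booking": [None, None, None]},
    index=pd.Index(["u1", "u2", "u3"], name="id"),
)

# Toy stand-in for df_all: user "u2" has no feature row
toy_features = pd.DataFrame(
    {"age": [34.0, 27.0], "gende_male": [1, 0]},
    index=pd.Index(["u1", "u3"], name="id"),
)

# Left merge on the index keeps all three test users, in their original order
merged = pd.merge(toy_test, toy_features, how="left",
                  left_index=True, right_index=True, sort=False)

# Drop the placeholder column and fill the gaps left for "u2"
X_toy = merged.drop("date_first_booking", axis=1).fillna(-1)
print(X_toy)
```

Because every test user survives the merge, the model can still produce a prediction for each id that Kaggle expects in the submission file.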

# Make predictions
y_pred = model.predict_proba(X_test)

As you may have noted from the code above, we have used the predict_proba method instead of the usual predict method. This is done because of the way Kaggle will assess the results for this particular competition. Rather than just assessing one prediction for each user, Kaggle will assess up to 5 predictions for each user. In order to maximize the score, we will use the predicted probabilities that predict_proba produces to select the 5 best predictions. Finally, we will write these results to a file that will be created in the same folder as the script.

#Taking the 5 classes with highest probabilities
ids = []  #list of ids
cts = []  #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

#Generate submission
print("Outputting final results...")
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('./submission.csv', index=False)
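
The key step in the loop above is the argsort call. It returns the class positions ordered from lowest to highest probability, so the result is reversed and then cut down to the first 5. The same idea can be sketched on a toy example; the function top_k_labels, the class list and the probabilities below are all made up for illustration, and k is set to 3 to keep the output short.

```python
import numpy as np

def top_k_labels(probabilities, class_names, k=5):
    """Hypothetical helper: return the k most probable class names for each row."""
    probabilities = np.asarray(probabilities)
    class_names = np.asarray(class_names)
    # argsort sorts ascending, so reverse each row and keep the first k positions
    order = np.argsort(probabilities, axis=1)[:, ::-1][:, :k]
    return class_names[order].tolist()

# Made-up classes and predicted probabilities for two users
toy_classes = ["AU", "FR", "NDF", "US", "other"]
toy_probs = [
    [0.02, 0.10, 0.60, 0.25, 0.03],
    [0.30, 0.05, 0.15, 0.40, 0.10],
]

top3 = top_k_labels(toy_probs, toy_classes, k=3)
print(top3)
```

This version works on the whole probability matrix at once rather than looping over users, but it selects the same classes as the loop in the script.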

If you wish, you should be able to submit the file produced by this script on Kaggle. The competition has now finished, so you will not receive an official position on the leaderboard, but your results will be processed and you will be told where you would have finished.

Wrapping Up

Those that are more experienced with data science may realize this series, as lengthy as it is, does not even scratch the surface of a lot of topics related to data science. Unsupervised learning, association rule mining, text analytics and deep learning are all topics that have not been covered at all. Unfortunately, the full scope of data science and machine learning is not something that can be covered in a blog. That said, I did have two goals for those reading these blog articles.

Firstly, I hope that this series demystifies some aspects of data science for those that currently see it as a black box. Although one can spend their career working in data science and still not master all aspects, even a cursory understanding of how machine learning algorithms work can help provide understanding as to what sort of questions machine learning can help to answer, and what sort of questions are problematic.

Secondly, I hope this series encourages some of you to dig deeper and learn more about this topic. Machine learning is a rapidly growing field that is expanding into every aspect of life. This includes recommendation engines on websites; astronomy, where it helps to identify stars and planets; the pharmaceutical industry, where it is being used to predict which molecular structures are likely to produce useful drugs; and, maybe most famously, training self-driving cars to drive in the real world. Whatever your primary interest, there are likely to be machine learning applications being developed or already in use.

[1] There are a range of metrics that can be used to do this. For available metrics in the Scikit Learn package, see here.

Full script:

import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn import cross_validation, decomposition, grid_search
from sklearn.preprocessing import LabelEncoder

# Functions #
# Remove outliers
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values<=min_val, col_values>=max_val), np.nan, col_values)

    return df

# Home made One Hot Encoder
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    id_list = df[id_col].drop_duplicates()

    df_counts = df.loc[:,[id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns = {category:col_name}, inplace=True)

    return new_df

# Cleaning #
# Import data
print("Reading in data...")
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath, header=0, index_col=None)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath, header=0, index_col=None)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)

# Data Transformation #
# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)

# Feature Extraction #
# Add new date related fields
print("Adding new fields...")
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)

# Add data from sessions.csv #
# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath, header=0, index_col=False)

# Determine primary device
print("Determining primary device...")
sessions_device = sessions.loc[:, ['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).aggregate(np.sum)
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = pd.DataFrame(aggregated_lvl1.loc[idx , ['user_id', 'device_type', 'secs_elapsed']])
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determining secondary device...")
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = pd.DataFrame(remaining.loc[idx , ['user_id', 'device_type', 'secs_elapsed']])
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions.loc[:,['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')

# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')

# Building Model #
# Prepare training data for modelling
df_train.set_index('id', inplace=True)
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)

# Training model
print("Training model...")

# Grid Search - Used to find best combination of parameters
XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = grid_search.GridSearchCV(estimator=XGB_model, param_grid=param_grid, scoring='accuracy', verbose=10, n_jobs=1, iid=True, refit=True, cv=3)
model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

# Make predictions #
print("Making predictions...")

# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test.loc[:,['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values

# Make predictions
y_pred = model.predict_proba(X_test)

#Taking the 5 classes with highest probabilities
ids = [] #list of ids
cts = [] #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

#Generate submission
print("Outputting final results...")
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('./submission.csv', index=False)

© 2020 Brett Romero
