Brett Romero

Data Inspired Insights

Tag: pandas

Official Release: Visual Analytics

I am proud to announce the release of an application I’ve been working on for the last few months – Visual Analytics. This application is designed to give you a new way to view your Google Analytics data using a range of interactive visualizations, allowing you to get a better understanding of who your users are, how they are getting to your site, and what they are doing when they get there.

For those worried about privacy and personal security, the application has a couple of features that will hopefully ease your mind. Firstly, there is no separate account or login details needed for Visual Analytics, everything is based on your existing Google account, and the login process is completed using Google authentication.

 

 

Secondly, the application does not currently store any user data. In fact, the application has no database at all (sensing a theme here?). That means that not only does that mean I can not sell your data to third parties, but that even if someone does manage to hack into the application, there is nothing to steal except my hacky code base.

For those interested in the technical specs, the backend of the application was built using Python and the Flask web framework. To access the data, once you are logged in using your Google credentials, the application makes calls to the Google Analytics API and then uses Pandas to handle the data manipulation (where needed). On the front end, the visualizations are created using D3.js and Highcharts (a big shout out to the Highcharts team and Mike Bostock for their excellent work on these libraries).

Anyway, if you have a Google Analytics account and are interested in getting some interesting insights into your data, take a look and let me know what you think. And please, if you find an issue or a bug, let me know!

 

 

How to create a flashcard app without a database

Last week, I covered how setting up a database may not be necessary when creating an app or visualization, even one that relies on data. This week we are going to walk through an example application that runs off data, but does not need a formal database.

First some background. For the last couple of months, I have been attending some basic Arabic classes to help get around Jordan easier. During a recent class, several of the students were discussing the time they had spent putting together physical cardboard flashcards to help them memorize words. Hearing this, and having played around with creating simple applications and visualizations for a couple of years now, it occurred to me that generating flashcards using a simple application would probably be significantly quicker and easier to do. In addition, if done the right way, it could work on your phone, making it available to you anytime you had your phone and an internet connection.

Perhaps as you are reading this, you are thinking that Arabic is a widely spoken language, surely there is already an app for that? And you would be correct, there are multiple apps for learning Arabic. However, the complication with Arabic is that each country/region uses a significantly different version of the language. In addition to these regional dialects, there is Modern Standard Arabic, which is the formal written version of the language. When it comes to the apps currently available, the version of Arabic being presented is almost always Modern Standard Arabic, as opposed to the Levantine Arabic which is spoken throughout Palestine, Jordan, Lebanon and Syria. Additionally, the apps are quite expensive (up to $10), of questionable accuracy/quality, or both.

To address this problem, and for the challenge and learning opportunity, a few weeks back I sat down over a weekend and put together a simple application that would generate Arabic flashcards (you can see the current version here). The application is based on an Excel spreadsheet with translations that I continue to enter over time based on my notes and the class textbook. Using an Excel spreadsheet in this case is advantageous for two key reasons:

  • It is simple to edit and update with new translations
  • It is a format that almost everyone is familiar with so I can recruit other students and teachers to add translations

With that out of the way, let’s take a look at the high-level process for creating a flashcards app.

1. Collecting the Data

The first step is creating the Excel spreadsheet for our translations. In this case, it is fairly simple and looks something like this:

arabic flashcards

In this spreadsheet, each row represents one word and, through the application, will represent one flashcard. In one column, we have the Arabic script for each word, then in the next column the English translation. In addition, we also have a third version of the word with a column header of ‘transcribed’. This column represents how the Arabic word would look/sound if written in Latin script, something that is commonly used in beginner classes when students cannot yet read Arabic script.[1] Finally, in the last column we have a category column. This will be used to provide a feature where the user can filter the words in the application, allowing them to focus their study on particular sets of words.

A quick note, we also have an ID column, which is not used in the application. It is only included to provide a unique key in our datasets, as good practice.

2. Processing the Data

The next step is to take the data from our spreadsheet, convert it to a format that we can use to generate flashcards in the application, then save it. To do this we will use my favorite Python library, pandas, and the short script shown below.

# -*- coding: utf-8 -*-
import pandas as pd

# Read In Data
df = pd.read_excel("./data.xlsx", header=0)

# Create JSON String
json_string = df.to_json(orient="records", force_ascii=False)
json_string = "var data = " + json_string + ";"

# Write to file
text_file = open("data.js", "w")
text_file.write(json_string)
text_file.close()

What this script in does is read in the file (in this case, data.xlsx) to a pandas dataframe (line 5). After that (line 8), we use the to_json method to output the contents of the dataframe to a JSON string. In line 9 we add some JavaScript to the beginning and end of that JSON string, then in lines 12-14 we save the string as a JavaScript file, data.js.

There are a couple of important things to note here. The first is that when dealing with non-Latin text characters (like Arabic characters), we need to specify that force_ascii=False (the default value is True). If we don’t do this, the script will return an error and/or convert the Arabic letters into a combination of Latin characters representing the Unicode character (i.e. it will look like gibberish).

The second thing to note for those that have not worked with JSON, or key-value stores more generally, is that this is the format that most data comes in when used in programs and applications. It is a highly flexible structure and, as a result, there are many ways we could represent the data shown above. In this case, we are using the ‘records’ format (as specified by pandas), which will look like this:

[

{
“id”:1,
“arabic”:”كتير”,
“english”:”A lot\/Many\/Very”,
“transcribed”:”kteer”,
“category”:”Basics”
},
{
“id”:2,
“arabic”:”عَن”,
“english”:”About”,
“transcribed”:”3an”,
“category”:”Basics”
},…

]

If this isn’t making any sense, or you would like to see some of the other possibilities, copy and paste some spreadsheet data into this CSV to JSON convertor. Toggling a few options, it should quickly become obvious how many different ways a given dataset can be represented in JSON format.

3. Building the App

Now that the data is ready, we create the files needed for the flashcards application. In this case, it is only three files, a HTML document (index.html) for the page, a CSS file for the styling, and an additional JavaScript file that will use the data in data.js to create the flashcards and generate the various features of the application. For those that are interested in the full code or want to create your own version, please feel free to checkout/fork the GitHub repo. For those that do not want to get too far into the weeds, there are just a few things I want to highlight about what the code is doing.

Firstly, the filtering and language options in the application are being generated directly from the data. What this means is that as more categories are added to the Excel spreadsheet, or if the languages change (i.e. the headings in the spreadsheet change), as soon as I update the underlying Excel and run the script shown above, all the options in the application will also update accordingly.

Secondly, I added a feature that allows the user to keep score. It is a simple honesty-based system, but I found it does provide some motivation to keep improving, as well as removing an element of self-deception as to how well you are actually doing. Often I would find myself thinking that I was getting almost all of them correct, only to find my correct percentage hovering around 70%.

Finally, a note on randomness. Whether the user is going through the cards unfiltered, or filtering for some category, the application is displaying the flashcards in a random[2] order. This random selection algorithm went through several iterations:

  1. In version 1, the algorithm would simply select four (the number of flashcards presented to the user at one time) random selections from the pool of eligible words.
  2. Upon testing version 1, it was found that, with surprising regularity, the same word would be selected more than once in a group of four flashcards. To address this, in version 2 a condition was added that when randomly selecting a word, it would only be accepted if that word had not already been selected in the given pool of four words.
  3. On further testing, I noticed another annoying issue. As I continually refreshed the four flashcards being displayed, some words would show up repeatedly, while others would take forever to show up, or not show up at all. To avoid this, for version 3, I changed the algorithm again. Now, instead of selecting four words at random, the algorithm instead took the whole list of words, shuffled them in a random order, and ran through the list in the new shuffled order. When the list ran out of words, it took the full list, shuffled it again, and continued.
  4. This was a big improvement. As I refreshed, I got different words, and was able to see all the words before they started repeating. But then I found another issue. In cases where the number of eligible words was not divisible by four, the old shuffled list and the new shuffled list would overlap in a selection of four words. In these cases, there was a possibility that the same word would be repeated. This is a little difficult to visualize, so the illustration below tries to present what was happening using an example list of ten words:

arabic flashcards

To address this, in version 4, a new condition was added. In cases like the example shown above, the algorithm will check the words from the new shuffled list to ensure they are not already selected from the old list. If a word is already selected, it will move that word to the end of the list and instead take the next word on the list. Here is another diagram to show what is happening:

arabic flashcards

4. Finishing Up

Ok, for those stepping through this and creating your own flashcards app, at this point you have copied the code available from the repo, made any changes to the spreadsheet, and rerun the script to refresh the data. For the final step, there are a couple of things that can be done.

If you are only planning to use the app on the same computer as you are using to create the flashcards app, you are done! Just open the index.html file using Chrome, Firefox or Safari (you can try Internet Explorer, but you know…) and you can test and use the app as you would use any website.

If you want to publish your flashcards app online to share with others, by far the easiest way is to use a service such as GitHub pages. I don’t want to turn this into a beginners guide to using git and GitHub, but there is excellent documentation available to help get you started if you would like to do this. You can see my version at the following address: https://vladimiriii.github.io/arabic-flashcards/, but there is even an option to redirect it to a domain of your choosing should you have one.

arabic flashcards

 

I hope this was a helpful guide to how a simple application can be created without a database, even if the application runs on some underlying form of data. Let me know what you think in the comments below!

 

[1] Because Arabic has many sounds that are difficult to convey in Latin script, this is also why when Arabic is transcribed, you will often find multiple spellings of the same word (e.g. Al-Qaeda vs Al-Qaida).

[2] As will be discussed in a new piece to be written, it is not truly random, and the reasons why are pretty interesting.

5 Things I Learned in 2015

2015 has been an interesting year in many respects. A new country[1], a new language, a new job, and plenty of new experiences – both at work and in life in general. To get into the year-end spirit, I thought I would list out 5 key things I learned this year.

1. I Love Pandas

Yes, those pandas as well, who doesn’t? But I knew that well before 2015. The pandas I learned to love this year is a data analysis library for the programming language Python. “Whoa, slow down egg head” I hear you say. For those that are not regular coders, what that means is that pandas provides a large range of ways for people writing Python code to interact with data that makes life very easy.

Reading from and writing to Excel, CSV files and JSON (see lesson number 2) is super easy and fast. Manipulating large datasets in table like structures (dataframes) – check. Slicing, dicing, aggregating – check, check and check. In fact, as a result of pandas, I have almost entirely stopped using R[2]. All the (mostly basic) data manipulation for which I used to use R, I now use Python. Of course R still has an important role to play, particularly when it comes to complex statistical analysis, but that does not tend to come up all that regularly.

2. JSON is Everywhere

JSON, JavaScript Object Notation for the uninitiated, is a data interchange format that has become the default way of transferring data online. Anytime you are seeing data displayed on a webpage, including all the visualizations on this website, JSON is the format the underlying data is in.

JSON has two big advantages that have led to its current state of dominance. The first is that, as the name suggests, it is native to JavaScript – the key programming language, alongside HTML, that is interpreted by the browser you are reading this on. The second is that JSON is an extremely flexible way of representing data.

However, as someone who comes from a statistics and data background, as opposed to a technology background, JSON can take a while to get used to. The way data is represented in JSON is very different to the traditional tables of data that most people are used to seeing. Gone are the columns and rows, replaced with key-value pairs and lots of curly brackets – “{“ and “}”. If you are interested in seeing what it looks like, there are numerous CSV to JSON convertors online. This one even has a sample dataset to play with.

If you do bother to take a look at some JSON, you will note that it is also much more verbose than your standard tabular format. A table containing 10 columns by 30 rows – something that could easily fit into one screen on a spreadsheet – runs to 300+ lines of JSON, depending on how it is structured. That does not make it easy to get an overview of the data for a human reader, but that overlooks what JSON is designed for – to be read by computers. The fact that a human can read it at all is seen as one of JSON’s strengths.

For those interested in working with data (or any web based technology), knowing how to read and manipulate JSON is becoming as important as knowing how to use a spreadsheet.

3. Free Tools are Great

There are some people working for software vendors who will read this and be happy I have a very small audience. Having worked in the public sector, for a large corporate and now for a small NGO, one thing I have been pleasantly surprised by in 2015 is the number and quality of free tools available online.

For general office administration there are office communicator applications (Slack), task management tools (Trello) and Google’s free replacements for Excel, Word and PowerPoint. For version control and code management there is GitHub. For data analysis, the aforementioned Python and R are both free and open source. For data storage, there is a huge range of free database technologies available, in both SQL (PostgreSQL, MySQL, SQLite3) and NoSQL (MongoDB, Redis, Cassandra) variations.

To be fair to my previous larger employers and my software-selling friends, most of these tools/applications do have significant catches. Many operate on a ‘freemium’ model. This means that for individuals and small organizations with relatively few users, the service is free (or next to free), but costs quickly rise when you need larger numbers of users and/or want access to additional features, typically the types of features larger organizations need. Many of the above also provide no tech support or guarantees, meaning that executives have no one to blame if the software blows up. If you are responsible for maintaining the personal data of millions of clients, that may not be a risk you are willing to take.

For small business owners and entrepreneurs however, these tools are great news. They bring down barriers to entry for small businesses and make their survival more dependent on the quality of the product rather than how much money they have. That is surely only a good thing.

4. Blogging is a Full Time Job

Speaking of starting a business, a common dream these days is semi-retiring somewhere warm and writing a blog. My realization this year from running a blog (if only part time) is just how difficult it is to get any traction. Aside from being able to write reasonably well, there are two main hurdles that anyone planning to become a full time blogger needs to overcome – note that I have not come close to accomplishing either of these:

  1. You have to generate large amounts of good quality content – at least 2-3 longer form pieces a week if you want to maintain a consistent audience. That may seem easy, but after you have quickly bashed out the 5-10 article ideas you have been mulling over, the grind begins. You will often be writing things that are not super interesting to you. You will often not be happy with what you have written. You will quickly realize that your favorite time is the time immediately after you have finished an article and your least favorite is when you need to start a new piece.
  2. You will spend more time marketing your blog than writing. Yep, if you want a big audience (big enough to generate cash to live on) you will need to spend an inordinate amount of time:
    • cold emailing other blogs and websites, asking them to link to your blog (‘generating back links’ in blogspeak)
    • ensuring everything on your blog is geared towards your blog showing up in peoples’ Google search results (Search Engine Optimization or SEO)
    • promoting yourself on Facebook
    • building a following on Twitter
    • contributing to discussions on Reddit and LinkedIn to show people you are someone worth listening to, and
    • writing guest blogs for other sites.

None of this is easy. Begging strangers for links, incorporating ‘focus words’ into your page titles and headings, posting links on Facebook to something you spend days writing, only to find you get one like (thanks Mum!). Meanwhile, some auto-generated, barely readable click-bait trash from ‘viralnova’ or ‘quandly’ (yes, I am deliberately not linking to those sites) is clocking up likes in the 5 figures. It can be downright depressing.

Of course, there are an almost infinite number of people out there offering their services to help with these things (I should know, they regularly comment on my articles telling me how one weird trick can improve my ‘on page SEO’). The problem is, the only real help they can give you is adding more things to the list above. On the other hand, if you are thinking about paid promotion (buying like’s or a similar strategy) I’d recommend watching this video:

Still want to be a blogger? You’re welcome.

5. Do not be Afraid to Try New Things

One of the things that struck me in 2015 is how attached people get to doing things a certain way. To a large degree this makes sense, the more often you use/do something, the better you get at it. I am very good at writing SQL and using Excel – I have spent most of the last 10 years using those two things. As a result, I will often try to use those tools to solve problems because I feel most comfortable using them.

Where this becomes a problem is when you start trying to shoehorn problems into tools not just because you are comfortable with the tool, but to avoid using something you are less comfortable with. As you have seen above, two of the best things I learned this year were two concepts that were completely foreign to a SQL/Excel guy like me. But that is part of what made learning them so rewarding. I gained a completely new perspective on how data can be structured and manipulated and, even though I am far from an expert in those new skills, I now know they are available and which sorts of problems they are useful for.

So, do not be afraid to try new things, even if the usefulness of that experience is not immediately apparent. You never know when that skill might come in handy.

 

Happy New Year to everyone, I hope you have a great 2016!

 

[1] Or ‘Autonomous Province’ depending on your political views

[2] R is another programming language designed specifically for statistical analysis, data manipulation and data mining.

© 2018 Brett Romero

Theme by Anders NorenUp ↑