2015 has been an interesting year in many respects. A new country[1], a new language, a new job, and plenty of new experiences – both at work and in life in general. To get into the year-end spirit, I thought I would list out 5 key things I learned this year.
1. I Love Pandas
Yes, those pandas as well, who doesn’t? But I knew that well before 2015. The pandas I learned to love this year is a data analysis library for the programming language Python. “Whoa, slow down egg head” I hear you say. For those that are not regular coders, what that means is that pandas provides people writing Python code with a large range of ways to interact with data, and makes doing so very easy.
Reading from and writing to Excel, CSV files and JSON (see lesson number 2) is super easy and fast. Manipulating large datasets in table-like structures (dataframes) – check. Slicing, dicing, aggregating – check, check and check. In fact, as a result of pandas, I have almost entirely stopped using R[2]. All the (mostly basic) data manipulation for which I used to use R, I now do in Python. Of course R still has an important role to play, particularly when it comes to complex statistical analysis, but that does not tend to come up all that regularly.
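To make that concrete, here is a minimal sketch (the file and column names are invented for illustration) of the kind of reading, slicing, aggregating and writing described above:

```python
import pandas as pd

# Read a (hypothetical) CSV file into a DataFrame - pandas' table-like structure
df = pd.read_csv("survey_results.csv")

# Slice: keep only the rows where the (hypothetical) score column exceeds 3
high_scores = df[df["score"] > 3]
high_scores.to_excel("high_scores.xlsx")

# Aggregate: average score per country
summary = df.groupby("country")["score"].mean()

# Writing back out to CSV or JSON is a one-liner each
summary.to_csv("summary.csv")
df.to_json("survey_results.json", orient="records")
```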
2. JSON is Everywhere
JSON, JavaScript Object Notation for the uninitiated, is a data interchange format that has become the default way of transferring data online. Any time you see data displayed on a webpage, including all the visualizations on this website, chances are the underlying data arrived in JSON.
JSON has two big advantages that have led to its current state of dominance. The first is that, as the name suggests, it is native to JavaScript – the programming language that, alongside HTML and CSS, is interpreted by the browser you are reading this on. The second is that JSON is an extremely flexible way of representing data.
However, for someone who comes from a statistics and data background, as opposed to a technology background, JSON can take a while to get used to. The way data is represented in JSON is very different to the traditional tables of data that most people are used to seeing. Gone are the columns and rows, replaced with key-value pairs and lots of curly brackets – “{” and “}”. If you are interested in seeing what it looks like, there are numerous CSV to JSON converters online. This one even has a sample dataset to play with.
If you do bother to take a look at some JSON, you will note that it is also much more verbose than your standard tabular format. A table of 10 columns by 30 rows – something that could easily fit on one screen in a spreadsheet – runs to 300+ lines of JSON, depending on how it is structured. That does not make it easy for a human reader to get an overview of the data, but complaining about it overlooks what JSON is designed for – to be read by computers. The fact that a human can read it at all is seen as one of JSON’s strengths.
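To see the difference in shape for yourself, here is a small sketch in Python (the data is invented) that turns a few rows of CSV into JSON:

```python
import csv
import io
import json

# A tiny (invented) table in CSV form: 3 columns, 2 rows
csv_text = "name,country,score\nAlice,CH,4\nBob,AU,5\n"

# Read the rows, turning each one into a dict of key-value pairs
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Serialise as JSON: the keys are repeated for every single row,
# which is why JSON runs so much longer than the equivalent table
print(json.dumps(rows, indent=2))
```

Run it and you will see “name”, “country” and “score” repeated in every row; multiply that by 30 rows and 10 columns and the 300+ lines quoted above start to look conservative.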
For those interested in working with data (or any web based technology), knowing how to read and manipulate JSON is becoming as important as knowing how to use a spreadsheet.
3. Free Tools are Great
There are some people working for software vendors who will read this and be happy I have a very small audience. Having worked in the public sector, for a large corporate and now for a small NGO, one thing I have been pleasantly surprised by in 2015 is the number and quality of free tools available online.
For general office administration there are office communicator applications (Slack), task management tools (Trello) and Google’s free replacements for Excel, Word and PowerPoint. For version control and code management there is GitHub. For data analysis, the aforementioned Python and R are both free and open source. For data storage, there is a huge range of free database technologies available, in both SQL (PostgreSQL, MySQL, SQLite3) and NoSQL (MongoDB, Redis, Cassandra) variations.
To be fair to my previous larger employers and my software-selling friends, most of these tools/applications do have significant catches. Many operate on a ‘freemium’ model. This means that for individuals and small organizations with relatively few users, the service is free (or next to free), but costs quickly rise when you need larger numbers of users and/or want access to additional features, typically the types of features larger organizations need. Many of the above also provide no tech support or guarantees, meaning that executives have no one to blame if the software blows up. If you are responsible for maintaining the personal data of millions of clients, that may not be a risk you are willing to take.
For small business owners and entrepreneurs, however, these tools are great news. They bring down barriers to entry and make a small business’s survival depend more on the quality of its product than on how much money it has. That is surely only a good thing.
4. Blogging is a Full Time Job
Speaking of starting a business, a common dream these days is semi-retiring somewhere warm and writing a blog. My realization this year from running a blog (if only part time) is just how difficult it is to get any traction. Aside from being able to write reasonably well, there are two main hurdles that anyone planning to become a full time blogger needs to overcome – note that I have not come close to accomplishing either of these:
You have to generate large amounts of good quality content – at least 2-3 longer form pieces a week if you want to maintain a consistent audience. That may seem easy, but after you have quickly bashed out the 5-10 article ideas you have been mulling over, the grind begins. You will often be writing things that are not super interesting to you. You will often not be happy with what you have written. You will quickly realize that your favorite time is the time immediately after you have finished an article and your least favorite is when you need to start a new piece.
You will spend more time marketing your blog than writing. Yep, if you want a big audience (big enough to generate cash to live on) you will need to spend an inordinate amount of time:
cold emailing other blogs and websites, asking them to link to your blog (‘generating back links’ in blogspeak)
ensuring everything on your blog is geared towards your blog showing up in peoples’ Google search results (Search Engine Optimization or SEO)
promoting yourself on Facebook
building a following on Twitter
contributing to discussions on Reddit and LinkedIn to show people you are someone worth listening to, and
writing guest blogs for other sites.
None of this is easy. Begging strangers for links, incorporating ‘focus words’ into your page titles and headings, posting links on Facebook to something you spent days writing, only to find you get one like (thanks Mum!). Meanwhile, some auto-generated, barely readable click-bait trash from ‘viralnova’ or ‘quandly’ (yes, I am deliberately not linking to those sites) is clocking up likes in the 5 figures. It can be downright depressing.
Of course, there are an almost infinite number of people out there offering their services to help with these things (I should know, they regularly comment on my articles telling me how one weird trick can improve my ‘on page SEO’). The problem is, the only real help they can give you is adding more things to the list above. On the other hand, if you are thinking about paid promotion (buying likes or a similar strategy), I’d recommend watching this video first.
Still want to be a blogger? You’re welcome.
5. Do not be Afraid to Try New Things
One of the things that struck me in 2015 is how attached people get to doing things a certain way. To a large degree this makes sense: the more often you use or do something, the better you get at it. I am very good at writing SQL and using Excel – I have spent most of the last 10 years using those two things. As a result, I will often try to use those tools to solve problems because I feel most comfortable using them.
Where this becomes a problem is when you start trying to shoehorn problems into tools not just because you are comfortable with the tool, but to avoid using something you are less comfortable with. As you have seen above, two of the best things I learned this year were two concepts that were completely foreign to a SQL/Excel guy like me. But that is part of what made learning them so rewarding. I gained a completely new perspective on how data can be structured and manipulated and, even though I am far from an expert in those new skills, I now know they are available and which sorts of problems they are useful for.
So, do not be afraid to try new things, even if the usefulness of that experience is not immediately apparent. You never know when that skill might come in handy.
Happy New Year to everyone, I hope you have a great 2016!
[1] Or ‘Autonomous Province’ depending on your political views
[2] R is another programming language designed specifically for statistical analysis, data manipulation and data mining.
There was an interesting Excel problem I encountered a while back which I thought I would share, as the solution ended up being quite well received – it even got me a nice Swiss dinner.
Problem
The problem arose out of a reasonably common scenario – a friend of mine had begun working in a new workplace where they had been using an MS Word template to collect the results of a survey they had been undertaking. Being more of a data-focused person, she managed to convince them to migrate the template to MS Excel, as this would allow easier aggregation and analysis of the data. However, in convincing them to migrate, she had to design the Excel template in a manner more or less mirroring the Word template, which meant merged cells, data scattered across various columns and rows, and generally no easy way to tabulate the data across multiple surveys.
After the surveys started coming in from various areas and she had spent an inordinate amount of time manually copying and pasting the results into a table, she asked me if there was a better way to do this.
Solution
After a couple of false starts, I came to a point where I thought the only way to do this was going to be a macro. The issues with using a macro are:
if it breaks, it is likely that no one else will be able to fix it
it isn’t easy for others to determine what exactly the macro is doing, and
moving a macro across multiple spreadsheets and workbooks can get messy.
Aside from those issues, non-technical people tend to get a little squeamish around macros.
I ended up stumbling across this solution almost by pure chance, as I happened to be doing something at work that had me using an Excel function I hadn’t used very often: INDIRECT. INDIRECT takes any string (either typed directly into the formula, referenced from another cell, or some combination of both) and tries to interpret it as a cell reference.
For example – if you type “A3” into cell A1 and then reference that cell from elsewhere (“=A1”), both cells will now read “A3”. But what if you wanted the referencing cell to treat that text as a cell reference instead – that is, to look up the value stored in cell A3? That is where INDIRECT comes in:
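As a minimal sketch (the cell contents are invented): suppose cell A1 contains the text “A3” and cell A3 contains the value 42. A plain =A1 returns the text “A3”, whereas =INDIRECT(A1) interprets that text as a reference and returns 42 – the value stored in cell A3.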
In this simple example we have used it to reference a cell in the same sheet, but we can also use INDIRECT to build a reference to a different sheet or workbook. And that’s where this is going. All the templates were the same – the same cell reference for each data point every time – but the name of the sheet (or workbook if they haven’t been added together yet) was always different. So what if we use an INDIRECT formula to dynamically look up the name of the sheet, which the user inputs? We can then use that value to complete the cell references for all the cells looking up the information on that sheet. Here is an example of what that would look like:
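As a sketch of the idea (the sheet name and cell addresses here are purely illustrative): if cell A2 of the results table holds the sheet name the user has entered, then =INDIRECT("'" & $A2 & "'!B4") builds the reference 'Survey 1'!B4 when A2 contains “Survey 1”, and returns whatever value sits in cell B4 of that sheet. The same pattern, with a different target cell in each column (B4, C7, D12 and so on), pulls each data point out of its fixed position in the template.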
Once you have the basic formula working there are 3 steps left:
Wrap the formula in an IFERROR clause ( =IFERROR(formula, "") ) so that if the ‘Sheet Name’ column is blank, you get blank data fields instead of a ton of #REF! errors.
Manually ‘program’ the first row. You will have to ensure all the right cells are being looked up in the right columns.
Drag the formula down for as many rows as you like.
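Putting it all together, a finished formula for one column might look like this (addresses again purely illustrative): =IFERROR(INDIRECT("'" & $A2 & "'!B4"), "") – it stays blank while the Sheet Name cell in column A is empty, and shows the value of cell B4 on the named sheet once a name is entered.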
Once this is done, users of the workbook will be able to move a completed template into this workbook, enter the sheet name in the table, and all the results should immediately populate in the table. No macros, no mess, and a convenient and simple way to convert information from an irregularly formatted template into tabular data.
Still confused? Please feel free to download this example file.