Data Inspired Insights

Tag: Excel

Forget SQL or NoSQL – 5 scenarios where you may not need a database at all

A while back, I attended a hackathon in Belgrade as a mentor. It was the first ‘open data’ hackathon in Serbia and focused on building applications using data that had recently been released by various ministries, government agencies, and independent bodies in Serbia. As we walked around talking to the teams, one of the things I noticed was that almost all of them were using databases to manage their data. In most cases the database was something very lightweight like SQLite3, but in some cases more serious databases (MySQL, PostgreSQL, MongoDB) were being used.


JSONify It – CSV to JSON Converter

Go to JSONify It

For those who have some experience creating visualizations, particularly online visualizations using JavaScript and libraries such as D3.js, one thing you will often come across is the need to convert your data. Typically this need arises because the data you receive or collect is in a human-friendly format such as an Excel spreadsheet, and to use it for the visualization you need that data in JSON format. Annoyingly, this will often be a one-time conversion, meaning writing a standalone script to do it often seems like overkill.


Data Science: A Kaggle Walkthrough – Adding New Data

This article is Part V in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts (Part I, Part II, Part III and Part IV).

Continuing on the walkthrough, in this part we take the data from sessions.csv that we initially left aside and add it to the transformed and expanded data from Part IV. This part will cover, in brief, all the steps in Parts II to IV.


5 Things I Learned in 2015

2015 has been an interesting year in many respects. A new country[1], a new language, a new job, and plenty of new experiences – both at work and in life in general. To get into the year-end spirit, I thought I would list out 5 key things I learned this year.

1. I Love Pandas

Yes, those pandas as well – who doesn’t? But I knew that well before 2015. The pandas I learned to love this year is a data analysis library for the programming language Python. “Whoa, slow down, egghead” I hear you say. For those who are not regular coders, what that means is that pandas provides people writing Python code with a large range of tools for interacting with data, tools that make life very easy.

Reading from and writing to Excel, CSV files and JSON (see lesson number 2) is super easy and fast. Manipulating large datasets in table-like structures (dataframes) – check. Slicing, dicing, aggregating – check, check and check. In fact, as a result of pandas, I have almost entirely stopped using R[2]. All the (mostly basic) data manipulation for which I used to use R, I now do in Python. Of course R still has an important role to play, particularly when it comes to complex statistical analysis, but that need does not tend to come up all that regularly.
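
As a rough sketch of what this looks like in practice (the file name contributions.csv and the column names city, gender and amount are invented for illustration):

    import pandas as pd

    # Read a CSV into a dataframe (read_excel and read_json work the same way)
    df = pd.read_csv("contributions.csv")

    # Slicing: just the rows where the city is Boston
    boston = df[df["city"] == "Boston"]

    # Aggregating: total contribution amount by gender
    totals = df.groupby("gender")["amount"].sum()

    # Writing the result back out to JSON is one line
    totals.to_json("totals_by_gender.json")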

2. JSON is Everywhere

JSON, JavaScript Object Notation for the uninitiated, is a data interchange format that has become the default way of transferring data online. Almost any time you see data displayed on a webpage, including all the visualizations on this website, JSON is the format the underlying data is in.

JSON has two big advantages that have led to its current state of dominance. The first is that, as the name suggests, it is native to JavaScript – the key programming language, alongside HTML, that is interpreted by the browser you are reading this on. The second is that JSON is an extremely flexible way of representing data.

However, for someone who comes from a statistics and data background, as opposed to a technology background, JSON can take a while to get used to. The way data is represented in JSON is very different to the traditional tables of data that most people are used to seeing. Gone are the columns and rows, replaced with key-value pairs and lots of curly brackets – “{“ and “}”. If you are interested in seeing what it looks like, there are numerous CSV to JSON converters online. This one even has a sample dataset to play with.
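
To give a tiny, invented example, a single table row with columns Name, City and Amount might look like this as JSON:

    {
        "name": "Jane",
        "city": "Boston",
        "amount": 100
    }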

If you do take a look at some JSON, you will notice that it is also much more verbose than your standard tabular format. A table containing 10 columns by 30 rows – something that could easily fit on one screen in a spreadsheet – runs to 300+ lines of JSON, depending on how it is structured. That does not make it easy for a human reader to get an overview of the data, but this criticism overlooks what JSON is designed for – to be read by computers. The fact that a human can read it at all is seen as one of JSON’s strengths.

For those interested in working with data (or any web based technology), knowing how to read and manipulate JSON is becoming as important as knowing how to use a spreadsheet.

3. Free Tools are Great

There are some people working for software vendors who will read this and be happy I have a very small audience. Having worked in the public sector, for a large corporate and now for a small NGO, one thing I have been pleasantly surprised by in 2015 is the number and quality of free tools available online.

For general office administration there are office communicator applications (Slack), task management tools (Trello) and Google’s free replacements for Excel, Word and PowerPoint. For version control and code management there is GitHub. For data analysis, the aforementioned Python and R are both free and open source. For data storage, there is a huge range of free database technologies available, in both SQL (PostgreSQL, MySQL, SQLite3) and NoSQL (MongoDB, Redis, Cassandra) variations.

To be fair to my previous larger employers and my software-selling friends, most of these tools/applications do have significant catches. Many operate on a ‘freemium’ model. This means that for individuals and small organizations with relatively few users, the service is free (or next to free), but costs quickly rise when you need larger numbers of users and/or want access to additional features, typically the types of features larger organizations need. Many of the above also provide no tech support or guarantees, meaning that executives have no one to blame if the software blows up. If you are responsible for maintaining the personal data of millions of clients, that may not be a risk you are willing to take.

For small business owners and entrepreneurs, however, these tools are great news. They bring down barriers to entry and make a small business’s survival depend more on the quality of its product than on how much money it has. That is surely only a good thing.

4. Blogging is a Full Time Job

Speaking of starting a business, a common dream these days is semi-retiring somewhere warm and writing a blog. My realization this year from running a blog (if only part time) is just how difficult it is to get any traction. Aside from being able to write reasonably well, there are two main hurdles that anyone planning to become a full-time blogger needs to overcome – note that I have not come close to accomplishing either of these:

  1. You have to generate large amounts of good quality content – at least two to three longer-form pieces a week if you want to maintain a consistent audience. That may seem easy, but after you have quickly bashed out the 5-10 article ideas you have been mulling over, the grind begins. You will often be writing about things that are not super interesting to you. You will often not be happy with what you have written. You will quickly realize that your favorite time is the time immediately after you have finished an article and your least favorite is when you need to start a new piece.
  2. You will spend more time marketing your blog than writing. Yep, if you want a big audience (big enough to generate cash to live on) you will need to spend an inordinate amount of time:
    • cold emailing other blogs and websites, asking them to link to your blog (‘generating back links’ in blogspeak)
    • ensuring everything on your blog is geared towards your blog showing up in people’s Google search results (Search Engine Optimization or SEO)
    • promoting yourself on Facebook
    • building a following on Twitter
    • contributing to discussions on Reddit and LinkedIn to show people you are someone worth listening to, and
    • writing guest blogs for other sites.

None of this is easy. Begging strangers for links, incorporating ‘focus words’ into your page titles and headings, posting links on Facebook to something you spent days writing, only to find you get one like (thanks Mum!). Meanwhile, some auto-generated, barely readable click-bait trash from ‘viralnova’ or ‘quandly’ (yes, I am deliberately not linking to those sites) is clocking up likes in the five figures. It can be downright depressing.

Of course, there are an almost infinite number of people out there offering their services to help with these things (I should know, they regularly comment on my articles telling me how one weird trick can improve my ‘on page SEO’). The problem is, the only real help they can give you is adding more things to the list above. On the other hand, if you are thinking about paid promotion (buying likes or a similar strategy), I’d recommend watching this video:

[Embedded video]

Still want to be a blogger? You’re welcome.

5. Do not be Afraid to Try New Things

One of the things that struck me in 2015 is how attached people get to doing things a certain way. To a large degree this makes sense: the more often you use or do something, the better you get at it. I am very good at writing SQL and using Excel – I have spent most of the last 10 years using those two things. As a result, I will often try to use those tools to solve problems, because I feel most comfortable using them.

Where this becomes a problem is when you start trying to shoehorn problems into tools not just because you are comfortable with the tool, but to avoid using something you are less comfortable with. As you have seen above, two of the best things I learned this year were two concepts that were completely foreign to a SQL/Excel guy like me. But that is part of what made learning them so rewarding. I gained a completely new perspective on how data can be structured and manipulated and, even though I am far from an expert in those new skills, I now know they are available and which sorts of problems they are useful for.

So, do not be afraid to try new things, even if the usefulness of that experience is not immediately apparent. You never know when that skill might come in handy.


Happy New Year to everyone, I hope you have a great 2016!


[1] Or ‘Autonomous Province’ depending on your political views

[2] R is another programming language designed specifically for statistical analysis, data manipulation and data mining.

Excel Tips – Array Functions

Excel’s array functions probably rival pivot tables for the title of most misunderstood and underutilized feature available to Excel users. That is a shame, as they are powerful tools that can be used to simply and elegantly address some of the key problems users regularly encounter. Although there is an almost endless list of ways in which array functions can be used, below I discuss three scenarios I regularly encounter in which array functions can be useful.

Ensuring Formula Consistency

One of the simplest and most useful ways to utilize array functions is to ensure the consistency of formulae in large spreadsheets. There are numerous infamous examples where simple formula errors have undermined the credibility of analyses done in Excel (including a recent popular economics paper) and although this method (or any method for that matter) can’t prevent all errors, it can help to minimize the places where things can go wrong.

This method can be applied to any situation where you have a column or row in which you need to apply the same formula repeatedly (anywhere you would normally drag the formula across or down). To use an array function in this situation, select the full range of cells you need the formula to apply to, then create the formula as you normally would, except replacing the single-cell references with references to the whole ranges. When the formula is ready, press Control + Shift + Enter to confirm it[1]. A simple example summing two columns is shown below:

[Screenshot: an array formula summing two columns]
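
As a sketch (the cell ranges here are purely illustrative): with values in A2:A11 and B2:B11, select C2:C11, type the formula below and press Control + Shift + Enter. Excel will display it in the formula bar wrapped in curly brackets, as {=A2:A11+B2:B11}.

    =A2:A11+B2:B11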

Using this method ensures both that the formulae are consistent and that the individual formulas in the column or row cannot be modified. You can test how this works yourself – once you have created an array formula (or opened the example file at the bottom of this page), try to delete or modify one of the individual cell formulas. You should be presented with an error like the following:

[Screenshot: the error Excel displays when you try to change part of an array]

This error prevents any modifications that users may inadvertently make, including the deletion of a row or column in the dataset.

Transposing Datasets

Occasionally the need will arise to transpose data (convert data that runs across a row into data that runs down a column, or vice versa). In my experience, these cases typically involve time series data running across a spreadsheet being transposed to run down the spreadsheet to make it easier to view.

The method here is to highlight the cells where you want the transposed results to display, enter the function “=TRANSPOSE(”, highlight the values to be transposed, close the parenthesis and press Control + Shift + Enter. Note that if the sizes of the selected ranges do not match (e.g. the number of columns to be transposed is greater than the number of rows in the range selected to display the results), the extra values will be dropped; in the reverse case, the extra cells in the result set will display as errors. An example is provided below:

[Screenshot: a TRANSPOSE array formula converting a row of data into a column]
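
For instance (ranges invented for illustration), to flip a dataset occupying A1:E2 (2 rows by 5 columns), you would select a range 5 rows by 2 columns, enter the formula below and press Control + Shift + Enter:

    =TRANSPOSE(A1:E2)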

The big advantage of doing this (as opposed to simply copying and pasting the set with the transpose option selected) is that it maintains a live connection. Any updates in the original dataset will be reflected in the transposed dataset.

SUMIF with two conditions[2]

One of the primary reasons I started learning about array functions was this exact problem – how do I do a SUMIF with two conditions? This may seem like an esoteric requirement but, as the datasets you are working with become more complex, the need to summarize the data by multiple criteria becomes increasingly useful. Looking at an example, let’s imagine we have a small dataset of campaign contributions that contains the name, gender, city and the amount contributed for 10 individuals. This data is shown below:

[Screenshot: a sample dataset of campaign contributions with name, gender, city and contribution amount for 10 individuals]

Now let’s imagine you want to identify the total contributions from all females in Boston. Broadly speaking, there are three ways you can do this:

  1. Manually select all the values and add them up
  2. Create a pivot table [3]
  3. Use an array formula

Often, the actual dataset is too big for method 1 (any dataset with more than 10 entries probably falls into this category), and a pivot table can be overkill when you only need to derive one value. In these cases, method 3 can be the simplest option.

To create our SUMIF with two criteria, we create a formula that actually looks like a SUMIF in some ways. Essentially we create two nested IF statements inside a SUM function. Our final formula is going to be as follows (if you are recreating this, don’t forget to use Control + Shift + Enter):

[Screenshot: the completed array formula – two nested IF statements inside a SUM function]
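
Assuming the data sits in columns A (name), B (gender), C (city) and D (amount) across rows 2 to 11 – the exact ranges will depend on your sheet – the formula is along these lines:

    =SUM(IF(B2:B11="F", IF(C2:C11="Boston", D2:D11, 0), 0))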

The formula works by creating an array of TRUEs and FALSEs based on the criteria specified in the IF statements (i.e. that the value in the gender column = “F” and the value in the City column = “Boston”). If both conditions are met, the formula captures the corresponding value in the specified range (the contributions column in this case); otherwise, the FALSE value (zero) is captured. Once there is an array of values and zeroes, the SUM function will sum up all the values to provide the answer.

Finally, this structure can also be used for a range of functions including:

  • MAX – replace SUM with MAX
  • MIN – replace SUM with MIN and the false value (zero) with a value larger than the maximum value in the dataset
  • COUNT – set the TRUE value to 1 instead of a range (see the sketch after this list)
  • AVERAGE – use the COUNT and SUM functions described above to create an average
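
For example, the COUNT version of the formula above (using the same assumed ranges) simply swaps the contributions range for a 1, so the SUM adds up one for every row matching both criteria:

    =SUM(IF(B2:B11="F", IF(C2:C11="Boston", 1, 0), 0))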

Still confused? Please feel free to download this example file: 


[1] This combination confirms to Excel that the formula is an array formula. If you simply press enter, the formula will not work as expected. Completed array formulas will display in the formula bar surrounded by {}, but you do not manually type these characters in.


[2] Since writing this article I have become aware of two new(ish) Excel functions that have effectively replaced array functions for this purpose. SUMIFS and COUNTIFS are built-in functions, available from the 2007 version of Excel onwards, which allow the user to sum or count records while specifying multiple criteria.


[3] Pivot tables are very useful for a range of purposes, but will be covered in a separate blog entry

Excel Tips – Template to Table

There was an interesting Excel problem I encountered a while back which I thought I would share as the solution ended up being quite well received and even got me a nice Swiss dinner.

Problem

The problem arose out of a reasonably common scenario – a friend of mine had begun working in a new workplace where they had been using an MS Word template to collect the results of a survey they were undertaking. Being more of a data-focused person, she managed to convince them to migrate the template to MS Excel, as this would allow easier aggregation and analysis of the data. However, in convincing them to migrate, she had to design the Excel template to more or less mirror the Word template, which meant merged cells, data scattered across various columns and rows of the spreadsheet, and generally no easy way to tabulate the data across multiple surveys.

After the surveys started coming in from various areas and she had spent an inordinate amount of time manually copying and pasting the results into a table, she asked me if there was a better way to do this.

Solution

After a couple of false starts, I came to a point where I thought the only way to do this was going to be a macro. The issues with using a macro are:

  • if it breaks, it is likely that no one else will be able to fix it
  • it isn’t easy for others to determine what exactly the macro is doing, and
  • moving a macro across multiple spreadsheets and workbooks can get messy.

Aside from those issues, non-technical people tend to get a little squeamish around macros.

I ended up stumbling across the solution almost by pure chance, as I happened to be doing something at work that had me using an Excel function I hadn’t used very often: INDIRECT. What INDIRECT does is take any string (either typed directly into the formula, referenced from another cell, or some combination of both) and try to read it as a cell reference.

For example, if you type the text “A3” into cell A1 and then reference that cell from another cell (with “=A1”), both the original cell and the referencing cell will now read “A3”. But what if you wanted the referencing cell to treat that text as a cell reference and look up the value sitting in cell A3? That is where INDIRECT comes in:

[Screenshot: a simple INDIRECT formula reading a cell reference from another cell]
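
A minimal sketch (cell addresses invented for illustration): if cell A1 contains the text A3, and cell A3 contains the value 42, then =A1 returns the text A3, while the formula below returns 42:

    =INDIRECT(A1)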

In this simple example we have used it to reference a cell in the same sheet, but we can also use INDIRECT to build a reference to a different sheet or workbook. And that is where this is going. All the templates were identical – the same cell reference for each data point every time – but the name of the sheet (or workbook, if they had not yet been combined) was always different. So what if we use an INDIRECT formula to dynamically look up the name of the sheet, which the user inputs, and then use that value to complete the cell references for all the cells looking up information on that sheet? Here is an example of what that would look like:

[Screenshot: an INDIRECT formula built from a sheet name entered by the user]
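
As a sketch (the cell references are assumptions for illustration), if the user types the sheet name into cell A2 of the results table and the data point you want sits in cell B3 of every template sheet, the lookup formula would be:

    =INDIRECT("'" & $A2 & "'!B3")

The single quotes concatenated around the sheet name allow for sheet names containing spaces.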

Once you have the basic formula working there are 3 steps left:

  1. Wrap the formula in an IFERROR clause ( =IFERROR(formula, “”) ) so that if the ‘Sheet Name’ column is blank, you get blank data fields instead of a ton of #REF errors (a combined sketch follows this list).
  2. Manually ‘program’ the first row. You will have to ensure all the right cells are being looked up in the right columns.
  3. Drag the formula down for as many rows as you like.
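
Putting the pieces together, the finished formula for the hypothetical example above would look like:

    =IFERROR(INDIRECT("'" & $A2 & "'!B3"), "")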

Once this is done, users of the workbook will be able to move a completed template into the workbook, enter the sheet name in the table, and all the results should immediately populate the table. No macros, no mess, and a convenient and simple way to convert information from an irregularly formatted template into tabular data.

Still confused? Please feel free to download this example file: 
