Data Inspired Insights

Month: November 2020

Pandas: Basic data interrogation

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

Once we have our data in a pandas DataFrame, the basic table structure in pandas, the next question is: how do we assess what we have? If you are coming from Excel or RStudio, you are probably used to being able to look at the data any time you want. In Python/pandas we don’t have a spreadsheet to work with, and we don’t even have an exact equivalent of RStudio (although Jupyter notebooks are a similar concept), but we do have several tools that can help you get a handle on what your data looks like.

DataFrame Dimensions

Perhaps the most basic question is: how much data do I actually have? Did I successfully load in all the rows and columns I expected, or are some missing? These questions can be answered with the shape attribute:

import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

print(df.shape)

(1599, 12)

shape returns a tuple (think of it as a list that you can’t alter) which tells you the number of rows and columns, 1599 and 12 respectively in this example. You can also use len to get the number of rows:

print(len(df))

1599

Using len is also slightly quicker than using shape, so if it is just the number of rows you are interested in, go with len.
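As a quick sketch of the two approaches (using a small made-up DataFrame here rather than the wine data):

```python
import pandas as pd

# A small throwaway DataFrame to illustrate shape vs len
df = pd.DataFrame({"a": range(1000), "b": range(1000)})

rows, cols = df.shape   # shape is a (rows, columns) tuple
print(rows, cols)       # 1000 2
print(len(df))          # 1000 -- same as df.shape[0]
```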

Another dimension we might be interested in is the size of the table in memory. For this we can use the memory_usage method:

print(df.memory_usage())

Index                     128
fixed acidity           12792
volatile acidity        12792
citric acid             12792
residual sugar          12792
chlorides               12792
free sulfur dioxide     12792
total sulfur dioxide    12792
density                 12792
pH                      12792
sulphates               12792
alcohol                 12792
quality                 12792
dtype: int64

This tells us the space, in bytes, that each column is taking up in memory. If we want to know the total size of the DataFrame, we can take the sum, and then to get the number into a more readable unit like kilobytes (kB) or megabytes (MB), we can divide by 1024 as many times as needed.

print(df.memory_usage().sum() / 1024)  # Size in kB

150.03125
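One caveat worth knowing: for object (string) columns, memory_usage by default only counts the size of the pointers, not the strings themselves. Passing deep=True makes pandas inspect the actual objects, which is slower but accurate. A small sketch with a made-up DataFrame:

```python
import pandas as pd

# A column of strings: the default measurement only sees 8-byte pointers
df = pd.DataFrame({"label": ["some longish string"] * 1000})

shallow = df.memory_usage().sum()        # pointer sizes only
deep = df.memory_usage(deep=True).sum()  # inspects the string objects themselves
print(shallow, deep)                     # deep is substantially larger
```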

Lastly, for these basic dimension assessments, we can generate a list of the data types of each column. This can be a very useful early indicator that your data has been read in correctly. If you have a column that you believe should be numeric (i.e. float64 or int64) but it is listed as object (pandas’ catch-all type for strings and mixed data), it may be a sign that something has not been interpreted correctly:

print(df.dtypes)

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

Viewing some sample rows

After we have satisfied ourselves that we have the expected volume of data in our DataFrame, we might want to look at some actual rows of data. Particularly for large datasets, this is where the head and tail methods come in handy. As the names suggest, head will return the first n rows of the DataFrame and tail will return the last n rows of the DataFrame (n is set to 5 by default for both).

print(df.head())

Aside from this basic use, we can use head and tail in some very useful ways with some alterations/additions. First off, we can set n to whatever value we want, showing as many or as few rows as desired:

print(df.head(10))

We can also combine it with sort_values to see the top (or bottom) n rows of data sorted by a column, or selection of columns:

print(df.sort_values('fixed acidity', ascending=False).head(10))

Finally, if we have a lot of columns, too many to display all of them in Jupyter or the console, we can combine head/tail with transpose to inspect all the columns for a few rows:

print(df.head().transpose())

Summary Statistics

Moving on, the next step is typically some exploratory data analysis (EDA). EDA is a very open-ended process, so no one can give you an explicit set of instructions on how to do it. Each dataset is different and, to a large extent, you just have to allow your curiosity to run wild. However, there are some tools we can take advantage of in this process.

Describe

The most basic way to summarize the data in your DataFrame is the describe method. This method, by default, gives us a summary of all the numeric fields in the DataFrame, including counts of values (which exclude null values), the mean, standard deviation, min, max and some percentiles.

df.describe()

This is nice, but let’s talk about what isn’t being shown. Firstly, by default, any non-numeric and date fields are excluded. In the dataset we are using in this example we don’t have any non-numeric fields, so let’s add a couple of categorical fields (they will just have random letters in this case), and a date field:

import string
import random


df['categorical'] = [random.choice(string.ascii_letters) for i in range(len(df))]
df['categorical_2'] = [random.choice(string.ascii_letters) for i in range(len(df))]
df['date_col'] = pd.date_range(start='2020-11-01', periods=len(df))

df.describe()

As we can see, the output didn’t change. But we can use some of the parameters for describe to address that. First, we can set include='all' to include all datatypes in the summary:

df.describe(include='all')

Now, for the categorical columns, it tells us some useful numbers: the number of unique values and which value is the most frequent. However, the date column is still being handled like a categorical value. We can change that so it is treated as a numeric value by setting the datetime_is_numeric parameter to True:

df[['date_col']].describe(datetime_is_numeric=True)

pandas_summary Library

Building on top of the kind of summaries produced by describe, some very talented people have developed a library called pandas_summary. This library is purely designed to generate informative summaries of pandas DataFrames. First, though, we need to do some quick setup (you may need to install the library using pip):

from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)

Now let’s take a look at two ways we can use this new DataFrameSummary object. The first one is columns_stats. This is similar to what we saw previously with describe, but with one useful addition: the number and percent of missing values in each column:

dfs.columns_stats

Secondly, my personal favorite: by selecting a column from the DataFrameSummary object, we can get some really detailed statistics for an individual column, plus a histogram thrown in for numeric fields:

dfs['fixed acidity']

Seaborn

Seaborn is a statistical data visualization library for Python with a full suite of charts that you should definitely look into if you have time, but for today we are going to look at just one very nice feature – pairplot. This function will generate pairwise plots for all the columns in your DataFrame with literally one line:

import seaborn as sns

sns.pairplot(df, hue="quality")

The colors of the plot are determined by the column you select for the hue parameter. This allows you to see how the values in that column relate to the two features in each pairwise plot. Along the diagonal, where a feature would otherwise be plotted against itself, we instead get the distribution of that feature for each value of the hue column.

Note, if you have a lot of columns, be aware that this type of chart will become less useful, and will also likely take a lot of time to render.

Wrapping Up

Exploratory data analysis (EDA) should be an open-ended and flexible process that never really ends. However, when we are first trying to understand the basic dimensions of a new dataset and what it contains, there are some common methods we can employ, such as shape, describe and dtypes, and some very useful third-party libraries such as pandas_summary and seaborn. While this explainer does not provide a comprehensive list of methods and techniques, hopefully it has provided you with somewhere to get started.

Pandas: Reading in JSON data

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

When we are working with data in software development, or when the data comes from APIs, it is often not provided in tabular form. Instead, it comes as some combination of key-value stores and arrays, broadly denoted as JavaScript Object Notation (JSON). So how do we read this type of non-tabular data into a tabular format like a pandas DataFrame?

Understanding Structures

The first thing to understand about data stored in this form is that there are effectively infinite ways to represent a single dataset. For example, take a simple dataset, shown here in tabular form:

id  department  first_name  last_name
1   Sales       John        Johnson
2   Sales       Peter       Peterson
3   Sales       Paula       Paulson
4   HR          James       Jameson
5   HR          Jennifer    Jensen
6   Accounting  Susan       Susanson
7   Accounting  Clare       Clareson

Let’s look at some of the more common ways this data might be represented using JSON:

1. “Records”

A list of objects, with each object representing a row of data. The column names are the keys of each object.

[
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson"
  },
  {
    "department": "Sales",
    "first_name": "Peter",
    "id": 2,
    "last_name": "Peterson"
  },
  {
    "department": "Sales",
    "first_name": "Paula",
    "id": 3,
    "last_name": "Paulson"
  },
  ...
]

2. “List”

An object where each key is a column, with the values for that column stored in a list.

{
  "id": [
    1,
    2,
    3,
    4,
    5,
    6,
    7
  ],
  "department": [
    "Sales",
    "Sales",
    "Sales",
    "HR",
    "HR",
    "Accounting",
    "Accounting"
  ],
  "first_name": [
    "John",
    "Peter",
    "Paula",
    "James",
    "Jennifer",
    "Susan",
    "Clare"
  ],
  "last_name": [
    "Johnson",
    "Peterson",
    "Paulson",
    "Jameson",
    "Jensen",
    "Susanson",
    "Clareson"
  ]
}

3. “Split”

An object with two keys, one for the column names, the other for the data which is a list of lists representing rows of data.

{
  "columns": [
    "id",
    "department",
    "first_name",
    "last_name"
  ],
  "data": [
    [
      1,
      "Sales",
      "John",
      "Johnson"
    ],
    [
      2,
      "Sales",
      "Peter",
      "Peterson"
    ],
  ...
  ]
}
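Incidentally, these three shapes map directly onto pandas' own to_dict orientations, which is a handy way to see how each format round-trips (note that orient='split' also includes an extra "index" key):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "department": ["Sales", "Sales"],
    "first_name": ["John", "Peter"],
    "last_name": ["Johnson", "Peterson"],
})

records = df.to_dict(orient="records")  # 1. "Records": list of row objects
lists = df.to_dict(orient="list")       # 2. "List": column name -> list of values
split = df.to_dict(orient="split")      # 3. "Split": columns/data kept separate

print(records[0])  # {'id': 1, 'department': 'Sales', 'first_name': 'John', 'last_name': 'Johnson'}
```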

Creating a DataFrame

So how do we get this data into a pandas DataFrame, given that it could come in different forms? The key is knowing which structures pandas understands. In the first two cases above (#1 Records and #2 List), pandas understands the structure and will automatically convert it to a DataFrame for you. All you have to do is pass the structure to the DataFrame class:

import pandas as pd

list_data = {
  "id": [
    1,
    2,
    3,
    4,
    5,
    6,
    7
  ],
  "department": [
    "Sales",
    "Sales",
    "Sales",
    "HR",
    "HR",
    "Accounting",
    "Accounting"
  ],
  "first_name": [
    "John",
    "Peter",
    "Paula",
    "James",
    "Jennifer",
    "Susan",
    "Clare"
  ],
  "last_name": [
    "Johnson",
    "Peterson",
    "Paulson",
    "Jameson",
    "Jensen",
    "Susanson",
    "Clareson"
  ]
}
df = pd.DataFrame(list_data)
print(df)

   id  department first_name last_name
0   1       Sales       John   Johnson
1   2       Sales      Peter  Peterson
2   3       Sales      Paula   Paulson
3   4          HR      James   Jameson
4   5          HR   Jennifer    Jensen
5   6  Accounting      Susan  Susanson
6   7  Accounting      Clare  Clareson

Initializing a DataFrame this way also gives us a couple of options. We can load in only a selection of columns:

df = pd.DataFrame(list_data, columns=['id', 'department'])
print(df)

   id  department
0   1       Sales
1   2       Sales
2   3       Sales
3   4          HR
4   5          HR
5   6  Accounting
6   7  Accounting

We can also define an index. However, if one of the fields in your dataset is what you want to set as the index, it is simpler to do so after you load the data into a DataFrame:

df = pd.DataFrame(list_data).set_index('id')
print(df)

    department first_name last_name
id                                 
1        Sales       John   Johnson
2        Sales      Peter  Peterson
3        Sales      Paula   Paulson
4           HR      James   Jameson
5           HR   Jennifer    Jensen
6   Accounting      Susan  Susanson
7   Accounting      Clare  Clareson

Acceptable Structures

What are the acceptable structures that pandas recognizes? Here are the ones I have found so far:

Records

[
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson"
  },
  {
    "department": "Sales",
    "first_name": "Peter",
    "id": 2,
    "last_name": "Peterson"
  },
  {
    "department": "Sales",
    "first_name": "Paula",
    "id": 3,
    "last_name": "Paulson"
  },
  ...
]

List

{
  "id": [
    1,
    2,
    3,
    4,
    5,
    6,
    7
  ],
  "department": [
    "Sales",
    "Sales",
    "Sales",
    "HR",
    "HR",
    "Accounting",
    "Accounting"
  ],
  "first_name": [
    "John",
    "Peter",
    "Paula",
    "James",
    "Jennifer",
    "Susan",
    "Clare"
  ],
  "last_name": [
    "Johnson",
    "Peterson",
    "Paulson",
    "Jameson",
    "Jensen",
    "Susanson",
    "Clareson"
  ]
}

Dict

{
  "id": {
      0: 1,
      1: 2,
      2: 3,
      3: 4,
      4: 5,
      5: 6,
      6: 7
  },
  "department": {
    0: "Sales",
    1: "Sales",
    2: "Sales",
    3: "HR",
    4: "HR",
    5: "Accounting",
    6: "Accounting"
  },
  "first_name": {
    0: "John",
    1: "Peter",
    2: "Paula",
    3: "James",
    4: "Jennifer",
    5: "Susan",
    6: "Clare"
  },
  "last_name": {
    0: "Johnson",
    1: "Peterson",
    2: "Paulson",
    3: "Jameson",
    4: "Jensen",
    5: "Susanson",
    6: "Clareson"
  }
}

Matrix

For this one you will have to pass the column names separately.

[
    [
        1,
        "Sales",
        "John",
        "Johnson"
    ],
    [
        2,
        "Sales",
        "Peter",
        "Peterson"
    ],
    [
        3,
        "Sales",
        "Paula",
        "Paulson"
    ],
    ...
]
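A quick sketch of passing the column names alongside a matrix-style structure (the names here mirror the example table above):

```python
import pandas as pd

matrix_data = [
    [1, "Sales", "John", "Johnson"],
    [2, "Sales", "Peter", "Peterson"],
]

# Column names are not part of the structure, so supply them explicitly
df = pd.DataFrame(matrix_data,
                  columns=["id", "department", "first_name", "last_name"])
print(df)
```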

Inconsistencies

What happens if the data is in one of these formats, but has some inconsistencies? For example, what if we have something that looks like this?

[ 
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson",
    "extra_field": "woops!"
  },
  {
    "department": "Sales",
    "first_name": "Peter",
    "id": 2,
    "last_name": "Peterson"
  },
  {
    "department": "Sales",
    "first_name": "Paula",
    "id": 3,
    "last_name": "Paulson"
  },
  ...
]

Fortunately, pandas is fairly robust to these types of inconsistencies, in this case creating an extra column and filling the remaining rows with NaN (null values):

   id  department first_name last_name extra_field
0   1       Sales       John   Johnson      woops!
1   2       Sales      Peter  Peterson         NaN
2   3       Sales      Paula   Paulson         NaN
3   4          HR      James   Jameson         NaN
4   5          HR   Jennifer    Jensen         NaN
5   6  Accounting      Susan  Susanson         NaN
6   7  Accounting      Clare  Clareson         NaN

Something important to note is that, depending on the structure and where the inconsistency occurs in the structure, the inconsistency can be handled differently. It could be an additional column, an additional row, or in some cases it may be ignored completely. The key is, as always, to check your data has loaded as expected.
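A minimal reproduction of this behaviour, with made-up records rather than the full dataset above:

```python
import pandas as pd

records = [
    {"id": 1, "name": "John", "extra_field": "woops!"},
    {"id": 2, "name": "Peter"},
    {"id": 3, "name": "Paula"},
]

df = pd.DataFrame(records)
# Rows without the extra key get NaN in the extra column
print(df["extra_field"].tolist())  # ['woops!', nan, nan]
```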

Explicit Methods

There are two more methods for reading JSON data into a DataFrame: DataFrame.from_records and DataFrame.from_dict. DataFrame.from_records expects data in the ‘Records’ or ‘Matrix’ formats shown above, while DataFrame.from_dict will accept data in either the Dict or List structures. These methods are more explicit in what they do and have several potential advantages:

Clarity

When working with a team or in a situation where other people are going to review your code, being explicit can help them to understand what you are trying to do. Passing some unknown structure to DataFrame and knowing/hoping it will interpret it correctly is using a little too much ‘magic’ for some people. For the sanity of others, and yourself in 6 months when you are trying to work out what you did, you might want to consider the more explicit methods.

Strictness

When writing code that is going to be reused, maintained and/or run automatically, we want to write that code in a very strict way. That is, it should not keep working if the inputs change. Using DataFrame could lead to situations where the input data format changes, but is read in anyway and instead breaks something else further down the line. In a situation like this, someone will likely have the unenviable task of following the trail through the code to work out what changed.

Using the more explicit methods is more likely to cause the error to be raised where the problem actually occurred: reading in data which is no longer in the expected format.

Options

The more explicit methods also give you more options for reading in the data. For example, DataFrame.from_records gives you the option to limit the number of rows to read in, while DataFrame.from_dict allows you to specify the orientation of the data – that is, whether the lists of values represent columns or rows.
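For example, from_dict's orient parameter controls whether the dict keys become column names (the default) or row labels. A sketch with made-up data:

```python
import pandas as pd

data = {"row_1": [1, "Sales"], "row_2": [2, "HR"]}

# orient='index': keys become row labels, each list becomes a row.
# The columns= argument (only valid with orient='index') names the columns.
df = pd.DataFrame.from_dict(data, orient="index", columns=["id", "department"])
print(df)
```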

Coercion

In some cases, your data will not play nice and the generic DataFrame constructor will not correctly interpret it. Using the more explicit methods can help resolve this. For example, if your records are in a column of a DataFrame (i.e. a pandas Series) instead of a list, using DataFrame will give you a DataFrame with one column:

records_data = pd.Series([
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson",
    "test": 0
  },
  {
    "department": "Sales",
    "first": "Peter",
    "id": 2,
    "name": "Peterson"
  },
  {
    "dept": "Sales",
    "firstname": "Paula",
    "sid": 3,
    "lastname": "Paulson"
  },
  {
    "dept": "HR",
    "name": "James",
    "pid": 4,
    "last": "Jameson"
  }
])
print(pd.DataFrame(records_data))

Using the more explicit DataFrame.from_records gives you the expected results:

records_data = pd.Series([
  {
    "department": "Sales",
    "first_name": "John",
    "id": 1,
    "last_name": "Johnson",
    "test": 0
  },
  {
    "department": "Sales",
    "first": "Peter",
    "id": 2,
    "name": "Peterson"
  },
  {
    "dept": "Sales",
    "firstname": "Paula",
    "sid": 3,
    "lastname": "Paulson"
  },
  {
    "dept": "HR",
    "name": "James",
    "pid": 4,
    "last": "Jameson"
  }
])
print(pd.DataFrame.from_records(records_data))

Wrapping Up

We’ve looked at how we can quickly and easily convert JSON format data into tabular data using the DataFrame class and the more explicit DataFrame.from_records and DataFrame.from_dict methods. The downside is this only works if the data is in one of a few structures. The upside is most of the data you will encounter will be in one of these formats, or something that is easily converted into these formats.

If you want to play around with converting data between delimited format and various JSON formats, I can recommend trying an app I built a while back: JSONifyit.

Pandas: Reading in tabular data

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

To get started with pandas, the first thing you are going to need to understand is how to get data into pandas. For this guide we are going to focus on reading in tabular data (i.e. data stored in a table with rows and columns). If you don’t have some data available but want to try some things out, a great place to get some data to play with is the UCI Machine Learning Repository.

Delimited Files

One of the most common ways you will encounter tabular data, particularly data from an external source or publicly available data, is in the form of a delimited file such as comma separated values (CSV), tab separated values (TSV), or separated by some other character. To import this data so you can start playing with it, pandas gives you the read_csv function with a lot of options to help manage different cases. But let’s start with the very basic case:

import pandas as pd

df = pd.read_csv('path/to/file.csv')

# Show the top 5 rows to make sure it read in correctly
print(df.head())

Running this code imports the pandas library (as pd), uses the read_csv function to read the data into a pandas DataFrame called df, then prints the top 5 rows using the head method. Note that the path to the file you want to import can be a path to a file on your computer, or it can be a URL (web address) for a file on the internet. As long as you have internet access (and permission to access the file), it will work just as if you had the file downloaded and saved locally.

When reading the data, unless specified, read_csv will attempt to automatically detect the delimiting character (e.g. “,” for CSV). In most cases this works fine, but in cases where it doesn’t, you can use the sep parameter to specify which character to use. For example, if your file is separated with “;” you might do something like:

import pandas as pd

df = pd.read_csv('path/to/file.csv', sep=';')

# Show the top 5 rows to make sure it is correct
print(df.head())

OK, what if your file has some other junk above and/or below the actual data, such as a few lines of report metadata before the header row?

We have two options for working around this, the header parameter and the skiprows parameter:

import pandas as pd

df_1 = pd.read_csv('path/to/file.csv', header=7)
df_2 = pd.read_csv('path/to/file.csv', skiprows=7)

# Both DataFrames produce the same result
print(df_1.head())
print(df_2.head())

These are equivalent because setting header=7 tells read_csv to look in row 7 (remember the row numbers are 0 indexed) to find the header row, then assume the data starts from the next row. On the other hand, setting skiprows=7 tells read_csv to ignore the first 7 rows (so rows 0 to 6), then it assumes the header row is the first row after the ignored rows.
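We can verify this equivalence with a small sketch, using io.StringIO to stand in for a file. The file contents here are made up, with two junk lines above the header (so header=2 / skiprows=2 rather than 7):

```python
import io
import pandas as pd

raw = "report generated 2020-11-01\nsome other junk\ncol_a,col_b\n1,2\n3,4\n"

df_1 = pd.read_csv(io.StringIO(raw), header=2)    # header is row 2 (0-indexed)
df_2 = pd.read_csv(io.StringIO(raw), skiprows=2)  # skip rows 0 and 1

print(df_1.equals(df_2))  # True
```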

Other Useful read_csv Parameters

There are dozens of other parameters to help you read in your data to handle a range of strange cases, but here are a selection of parameters I have found most useful to date:

Parameter         Description
skipfooter        Skip rows at the end of the file (note: you will need to set engine='python' to use this)
index_col         Column to set as the index (the values in this column will become the row labels)
nrows             Number of rows to read in (useful for reading in a sample of rows from a large file)
usecols           A list of columns to read (can use the column names or the 0 indexed column numbers)
skip_blank_lines  Skip empty rows instead of reading in as NaN (empty values)

For the full list of available parameters, check out the official documentation. One thing to note is that although there are a lot of parameters available for read_csv, many are focused on helping correctly format and interpret data as it is being read in – for example, interpretation of dates, interpretation of boolean values, and so on. In many cases these are things that can be addressed after the data is in a pandas DataFrame, and handling these formatting and standardization steps explicitly after reading in the data can make your code easier to understand for the next person who reads it.
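A sketch combining a few of these parameters, again with io.StringIO and made-up contents standing in for a file:

```python
import io
import pandas as pd

raw = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"

# Read only columns a and c, only the first 2 data rows,
# and use column a as the row index
df = pd.read_csv(io.StringIO(raw), usecols=["a", "c"], nrows=2, index_col="a")
print(df)
```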

Excel Data

Pandas also has a handy wrapper for reading in Excel data: read_excel. Instead of writing your data to a CSV and then reading it in, you can read directly from the Excel file itself. This function has many of the same parameters as read_csv, with options to skip rows, read in a sample of rows and/or columns, specify a header row and so on.

Databases

If your data is in a tabular/SQL database, like PostgreSQL, MySQL, BigQuery or something similar, the setup gets a little more complicated, but once that setup is done, it becomes really simple to repeatedly query data (using SQL) from that database directly into a DataFrame where you can do what you want with it.

The first step is to create a connection to the database holding the data. It is beyond the scope of this particular guide, but the library you will almost certainly need is SQLAlchemy, or in some cases a library created by the creator of the database (for example, Google has a BigQuery API library called google-cloud-bigquery).

Once you have connected to your database, pandas provides three functions for you to extract data into a DataFrame: read_sql_table, read_sql_query and read_sql. The last of these, read_sql, is what’s called a ‘convenience wrapper’ around read_sql_table and read_sql_query – the functionality of both the underlying functions can be accessed from read_sql. But let’s look at the two underlying functions individually to see what the differences are and what options we have.

read_sql_table is a function we can use to extract data from a table in a SQL database. The function requires two parameters: table_name, the name of the table you want to get the data from; and con, the connection to the database the table is in. With these two parameters, all data from the specified table (i.e. SELECT * FROM table_name) will be returned as a DataFrame:

df = pd.read_sql_table(table_name='table_name', con='postgres:///db_name')  

read_sql_table does also give you the option to specify a list of columns to be extracted using the columns parameter.

read_sql_query on the other hand allows you to specify the query you want to run against the database.

query = """
    SELECT column_1
        , column_2
        , column_3
    FROM table_name
    WHERE column_4 > 10
"""
df = pd.read_sql_query(query, 'postgres:///db_name')  

Obviously writing your own query gives you a lot more flexibility to extract exactly what you need. However, also consider the potential upside in terms of processing efficiency. Doing aggregations and transformations in a database, in almost all cases, will be much faster than doing it in pandas after it is extracted. As a result, some careful query planning can save a lot of time and effort later on.
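As a self-contained sketch of read_sql_query, here using Python's built-in sqlite3 in place of a Postgres connection (the table and column names are made up to match the query above):

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database standing in for a real database server
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_name (column_1 INTEGER, column_4 INTEGER)")
con.executemany("INSERT INTO table_name VALUES (?, ?)",
                [(1, 5), (2, 15), (3, 25)])

query = "SELECT column_1 FROM table_name WHERE column_4 > 10"
df = pd.read_sql_query(query, con)
print(df["column_1"].tolist())  # [2, 3]
```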

Other Data Sources

Pandas also has functions for reading in data from a range of other sources, from HTML tables to SPSS, Stata, SAS and HDF files. We won’t go into them here, but being aware that these options exist is often all you really need to know. If a case arises where you need to read data from these sources, you can always refer to the documentation.

Wrapping Up

We’ve looked at how we can use pandas to read in data from various sources of tabular data, from delimited files and Excel, to databases, to some other more uncommon sources. While these functions often have many parameters available, remember most of them will be unnecessary for any given dataset. These functions are designed to work with the minimum parameters provided (e.g. just a file location) in most cases. Also remember that once you have the data in a DataFrame, you will have a tonne of options to fix and change the data as needed – you don’t need to do everything in one function.

© 2020 Brett Romero
