Data Inspired Insights

Tag: python (Page 1 of 2)

Pandas: Where and Mask

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

In this guide we are going to look at three ways to handle a scenario where you want to update the values in a column based on whether some condition is true or not. This condition could be applied based on the same column you want to update, a different column, or a combination of columns.

The three methods (or rather, two methods and a function) we will look at are:

  1. where – a method available on pandas Series and DataFrame objects.
  2. mask – a method available on pandas Series and DataFrame objects, and the inverse of where.
  3. numpy.where – a function in the numpy library.

Reading in a dataset

If you don’t have a dataset you want to play around with, the University of California, Irvine maintains an excellent online repository of datasets. For this explainer we are going to be using the Wine Quality dataset. If you want to follow along, you can import the dataset as follows:

import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

The where method

The where method becomes useful when we want to update the values in a column based on whether a condition is True. For example, in our sample dataset, let’s say we wanted to update the quality rating of a wine if the value in the alcohol column (assumed to be alcohol by volume) is 10% or more. We are worried about the wine being too strong, so we set the quality to 0 to avoid accidentally selecting one of these wines. As per the filtering and segmenting guide, we could do something like this:

df.loc[df["alcohol"] >= 10, "quality"] = 0

This works, but the problem is that it overwrites our original data. To keep the original values, we would first have to copy the column and then run this line on the copy.

The alternative is to use where:

df["quality_updated"] = df["quality"].where(df["alcohol"] < 10, 0)
df

The slightly unintuitive aspect of where, at least for me, is that it updates the rows that do not satisfy the condition, i.e. those that evaluate to False. Accordingly, we have to reverse our condition to df['alcohol'] < 10.
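To see the keep-where-True behaviour in isolation, here is a minimal sketch on hand-made Series (the values are invented, not taken from the wine data):

```python
import pandas as pd

alcohol = pd.Series([8.5, 10.2, 9.1, 11.0])
quality = pd.Series([5, 6, 7, 4])

# where keeps the values where the condition is True
# and substitutes 0 where it is False
updated = quality.where(alcohol < 10, 0)
print(updated.tolist())  # [5, 0, 7, 0]
```

The second and fourth wines are at or above 10% alcohol, so only their quality ratings are replaced with 0.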

The mask method

The mask method is the reverse of where. It is the same concept, but now it updates the rows that do satisfy the condition, i.e. those that evaluate to True. For me this makes more sense, as we are targeting the rows that we want to update. Reproducing the code from above:

df["quality_updated"] = df["quality"].mask(df["alcohol"] >= 10, 0)
df
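A quick sketch on invented values confirms that mask with the original condition matches where with the inverted one:

```python
import pandas as pd

alcohol = pd.Series([8.5, 10.2, 9.1, 11.0])
quality = pd.Series([5, 6, 7, 4])

# mask replaces the values where the condition is True
masked = quality.mask(alcohol >= 10, 0)
# where with the inverted condition produces the same result
kept = quality.where(alcohol < 10, 0)
print(masked.equals(kept))  # True
```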

The numpy.where function

An alternative to both of the above, and actually my preference for a while now, is the numpy version of where. For those who aren’t aware, numpy is a dependency of pandas, so anywhere you have pandas you should have numpy available. Like the where and mask methods, numpy.where takes a condition as its first parameter, but it then takes two more parameters: a value to use where the condition is true, and a value to use where it is false. Let’s take a look at an example that simply reproduces what we were doing above:

import numpy as np

df["quality_updated"] = np.where(df["alcohol"] >= 10, 0, df["quality"])
df

In this example, numpy.where is slightly less concise than the earlier approaches because we have to state the alternative value explicitly. But this can often be an advantage when we don’t want to keep the original values. Consider the following example:

import numpy as np

df["sugar_level"] = np.where(df["residual sugar"] > 2, "Sweet", "Dry")
df[["residual sugar", "sugar_level"]]

In this case we have defined a new categorical column with two values based on the residual sugar column. Are there other ways this could be done? Definitely. But would any of them make your intention as clear to a reader?

Stepping it up

In the cases above we have used very simple conditions to update values. But as was covered in the advanced booleans guide, there is nothing to prevent us from getting more complicated and creating a conditional based on multiple columns. Additionally, we don’t have to provide static values as the true and false values, we can use other columns, or even calculations using other columns.

For example, let’s say we are planning a dinner party and we want to select some wines. But we are fussy, and we have several requirements. We don’t like sweet wine, so we want to penalize wines that have too much residual sugar. We don’t like strong wines, so we want to penalize wines that have too much alcohol.

In this case, for the sake of readability, we separate the condition out onto its own line and save it to a variable called mask, then pass mask as the first parameter to numpy.where:

sugar_cutoff = 4
alcohol_cutoff = 10
mask = ((df["residual sugar"] < sugar_cutoff) & (df["alcohol"] < alcohol_cutoff))
penalty = 0.75
df["quality_updated"] = np.where(mask, df["quality"], df["quality"] * penalty)
df[["residual sugar", "alcohol", "quality", "quality_updated"]].sort_values("quality_updated", ascending=False).head(20)
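The same pattern on a small hand-made frame (the values are invented) makes the behaviour easy to verify by eye:

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame({
    "residual sugar": [1.9, 5.0, 2.5, 6.1],
    "alcohol":        [9.4, 9.8, 11.2, 12.0],
    "quality":        [6, 8, 4, 8],
})

sugar_cutoff, alcohol_cutoff, penalty = 4, 10, 0.75
# True only where the wine is both dry enough and weak enough
mask = (wines["residual sugar"] < sugar_cutoff) & (wines["alcohol"] < alcohol_cutoff)
# Keep the original quality where mask is True, penalize elsewhere
wines["quality_updated"] = np.where(mask, wines["quality"], wines["quality"] * penalty)
print(wines["quality_updated"].tolist())  # [6.0, 6.0, 3.0, 6.0]
```

Only the first wine passes both tests; the other three have their quality multiplied by 0.75.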

Wrapping Up

In this guide we covered several ways to update and/or create new columns by applying if/else logic to an existing column (or columns) in a DataFrame. We looked at the where and mask DataFrame methods, as well as the numpy.where function, all of which take slightly different approaches to the same problem. Finally, although these methods can seem simple, we saw how they can be stepped up very quickly to apply complicated logic in a clear and concise way.

Pandas: Append and Concat

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

In this guide we will look at a few methods we can use to add pandas DataFrames together vertically, stacking them one on top of the other. This will cover the pandas concat function and append method, and a third way that makes use of basic Python data structures. This last approach can often be much faster than working with DataFrames directly, especially if we want to repeatedly append one row at a time to a DataFrame.
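As a quick taste of what the guide covers, a minimal concat sketch (the frames and the column name are invented for illustration):

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2]})
bottom = pd.DataFrame({"a": [3, 4]})

# Stack the two frames vertically; ignore_index renumbers the rows
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked["a"].tolist())  # [1, 2, 3, 4]
```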

If you are looking at joining tables, or adding two tables together horizontally, try the guide on joining tables.

Continue reading

Pandas: Joining tables

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

One of the most fundamental concepts in data science and data work in general is joining two tables together based on some shared column or index. In SQL it is a JOIN. In Excel it is INDEX-MATCH or VLOOKUP. In pandas, two methods are available to join tables together: merge and join. We will look at both of those methods in this guide.
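As a preview, a minimal merge sketch on two invented tables that share a key column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# Inner join on the shared "key" column, analogous to a SQL JOIN
joined = left.merge(right, on="key", how="inner")
print(joined.to_dict("list"))  # {'key': ['b'], 'x': [2], 'y': [3]}
```

Only the key `"b"` appears in both tables, so the inner join returns a single row.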

Continue reading

Pandas: How to Pivot data

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

When I was starting out with pandas, I was coming from an Excel and SQL background. Having spent a solid 8 years with Excel as my primary data munging and modeling tool, I was very comfortable using pivot tables, a tool I found extremely powerful and later discovered to be strangely controversial. My workflow started to involve pivot tables so regularly that my SQL queries were often written to extract data in a format that would make it simpler to aggregate in a pivot table.

Naturally, when I started learning pandas, one of the first things I wanted to learn was: how can I recreate the functionality of an Excel pivot table in pandas? In this guide we will look at several ways to do just that.
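As a small taste, a pivot_table sketch on an invented sales table (the column names and values are made up for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "year":   [2020, 2021, 2020, 2021],
    "amount": [10, 20, 30, 40],
})

# Rows = region, columns = year, values summed – much like an Excel pivot table
pivot = sales.pivot_table(index="region", columns="year",
                          values="amount", aggfunc="sum")
print(pivot.loc["S", 2021])  # 40
```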

Continue reading

Pandas: Advanced Aggregation

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

Building on the basic aggregation guide, in this guide we will look at some more advanced ways we can aggregate data using pandas. We are going to cover three techniques:

  1. Aggregating using different methods at the same time, for example, summing one column and taking the average of another.
  2. Defining and using custom aggregation functions which we can use to calculate aggregates that are not available “out of the box”.
  3. The transform method, which can be used to do some very useful things with aggregated values.

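As a preview of the first technique, a minimal sketch using named aggregations on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})

# Sum one column and average another in a single groupby call
agg = df.groupby("group").agg(x_sum=("x", "sum"), y_mean=("y", "mean"))
print(agg.loc["a", "x_sum"], agg.loc["b", "y_mean"])  # 3 35.0
```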
Continue reading

Pandas: Aggregation

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

A fundamental tool for working in pandas and with tabular data more generally is the ability to aggregate data across rows. Thankfully pandas gives us some easy-to-use methods for aggregation, which includes a range of summary statistics such as sums, min and max values, means and medians, variances and standard deviations, or even quantiles. In this guide we will walk through the basics of aggregation in pandas, hopefully giving you the basic building blocks to go on to more complex aggregations.
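As a preview, a minimal sketch of a few of those summary statistics on an invented Series:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8])

# Easy-to-use aggregation methods, straight off the Series
print(s.sum(), s.mean(), s.max())  # 20 5.0 8
```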

Continue reading

Pandas: SettingWithCopyWarning

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

For many users starting out with pandas, a common and frustrating warning that pops up sooner or later is the following:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.

Try using .loc[row_indexer,col_indexer] = value instead

To the uninitiated, it can be hard to know what it means or if it even matters. In this guide, we’ll walk through what the warning means, why you are seeing it, and what you can do to avoid it.
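As a preview of the fix the warning itself suggests, a minimal sketch on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing like df[df["a"] > 1]["b"] = 0 can trigger the warning,
# because the assignment may land on a temporary copy.
# A single .loc call updates the original frame unambiguously:
df.loc[df["a"] > 1, "b"] = 0
print(df["b"].tolist())  # [4, 0, 0]
```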

Continue reading

Pandas: Advanced booleans

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

In other sections in this series, we’ve looked at how we can use booleans (a value that is either True or False) in pandas. Specifically, we’ve looked at how a list or array of booleans can be used to filter a DataFrame. In those examples we generated lists of booleans using simple comparisons like “are the values in the fixed acidity column > 12?” However, simple comparisons like this are only one of many ways we can create booleans. In this guide we are going to look at a range of methods that allow us to do more complex comparisons, while also making our code more concise and easier to understand.
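As a preview, a minimal sketch of two such methods, isin and between, on an invented Series:

```python
import pandas as pd

s = pd.Series([5, 11, 12, 13])

# More expressive ways to build boolean arrays than a plain comparison
print(s.isin([11, 13]).tolist())   # [False, True, False, True]
print(s.between(11, 12).tolist())  # [False, True, True, False]
```

Note that between is inclusive of both endpoints by default.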

Continue reading

Pandas: Filtering and segmenting

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

One of the most common ways you will interact with a pandas DataFrame is by selecting different combinations of columns and rows. This can be done using the numerical positions of columns and rows in the DataFrame, column names and row indices, or by filtering the rows by applying some criteria to the data in the DataFrame. All of these options (and combinations of them) are available, so let’s dig in!
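As a preview, a minimal sketch of the three selection styles on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df.iloc[0, 1])                  # 4        (by numerical position)
print(df.loc[2, "a"])                 # 3        (by label/index)
print(df[df["a"] > 1]["b"].tolist())  # [5, 6]   (by filtering criteria)
```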

Continue reading

Pandas: Basic data interrogation

This article is part of a series of practical guides for using the Python data processing library pandas. To view all the available parts, click here.

Once we have our data in a pandas DataFrame, the basic table structure in pandas, the next step is to assess what we have. If you are coming from Excel or RStudio, you are probably used to being able to look at the data any time you want. In Python/pandas, we don’t have a spreadsheet to work with, and we don’t even have an equivalent of RStudio (although Jupyter notebooks are a similar concept), but we do have several tools available that can help you get a handle on what your data looks like.
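As a preview, a minimal sketch of a few such tools on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)                    # (3, 2)  – rows and columns
print(df.head(2)["a"].tolist())    # [1, 2]  – peek at the first rows
print(df["b"].describe()["mean"])  # 5.0     – quick summary statistics
```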

Continue reading

© 2021 Brett Romero
