Add Category Column to pandas DataFrame with cut

This post explains how to add a category column to a pandas DataFrame with cut().

cut makes it easy to categorize numerical values in buckets.

Let’s look at a a DataFrame of people and categorize them into “child”, “teenager”, and “adult” buckets based on their age.

Simple example

Suppose you have the following DataFrame.

Here’s the code to create the DataFrame.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {"first_names": ["Todd", "Juan", "Maria", "Hi", "Sal"], "age": [12, 13, 19, 20, 42]}
)

Add an age_bins column to the DataFrame that categorizes the people into these buckets: child, teenager, or adult.

df["age_bins"] = pd.cut(
    x=df["age"],
    bins=[1, 12, 19, np.inf],
    labels=["child", "teenager", "adult"],
)

Here’s what the resulting DataFrame contains.

The rightmost value is inclusive in the bins argument, so the buckets are 1-12, 13-19, and 20-infinity.

dtypes

Let’s inspect the dtypes of the resulting DataFrame.

df.dtypes

first_names      object
age               int64
age_bins       category
dtype: object

You can see that age_bins is a category column.

When to use cut

You can use cut when building features for model training.

You may want to train a machine learning model based on bucket labels, not actual numerical values. cut makes it really easy for you to transform a numerical value into categories.

Conclusion

cut is a convenient way to convert a numerical column into different value buckets.

You’ll find it especially useful when feature engineering.

MungingData

Piles of precious data

Add Category Column to pandas DataFrame with cut

Simple example

dtypes

When to use cut

Conclusion

Registration