Pandas #
Most important notes #
axis: {0 or ‘index’ (rows), 1 or ‘columns’}, default 0
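A throwaway example of what that means in practice (sums along each axis):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.sum(axis=0)  # collapse along rows -> one sum per column: a=3, b=7
df.sum(axis=1)  # collapse along columns -> one sum per row: 0=4, 1=6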
(Maybe) Useful resources #
Import Data #
Dict/JSON #
Doc: pandas.DataFrame.from_dict — pandas 1.3.5 documentation
For dict like {"id1": data1, "id2": data2, ...}
:
import pandas as pd
df = pd.DataFrame.from_dict(
    data_dict,
    orient='index',
    columns=["data"],  # otherwise very troublesome to rename later
)
df.index = df.index.set_names(["ID"])
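For instance, with a hypothetical dict of scalars, the keys become the index:
data_dict = {"id1": 0.5, "id2": 0.7}  # hypothetical data
df = pd.DataFrame.from_dict(data_dict, orient='index', columns=["data"])
#      data
# id1   0.5
# id2   0.7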
For dict like: {"row1": [...], "row2": [...], ...}
, and JSON files:
df = pd.DataFrame.from_dict(
    data_json
).transpose()
Excel #
Doc: pandas.read_excel — pandas 2.0.0 documentation
Extra dependency: openpyxl
Read sheet:
import os
import pandas as pd
df = pd.read_excel(
    os.path.join("data", "data.xlsx"),  # read ./data/data.xlsx
    sheet_name="Sheet 1",
    skiprows=1,     # if there are pre-data rows
    header=[0, 1],  # if there are multiple (combined) header rows
    nrows=100,      # only read first 100 rows
)
Read names of all sheets: (credit)
import os
import pandas as pd
xl = pd.ExcelFile(os.path.join("data", "data.xlsx"))
xl.sheet_names  # list of sheet names
df = xl.parse("Sheet 1")  # if a specific sheet is needed
Read all sheets: (credit)
This is significantly slower than the pd.ExcelFile method above. Only use it when you actually need every single sheet.
import pandas as pd
dfs = pd.read_excel(file_name, sheet_name=None)
dfs.keys() # similar result as xl.sheet_names
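The result is a dict of DataFrames keyed by sheet name:
df = dfs["Sheet 1"]  # pick one sheet's DataFrame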
CSV/TSV #
Pandas #
Doc: pandas.read_csv — pandas 2.0.0 documentation
import os
import pandas as pd
names_list = ["col 1", "col 2", "col 3"]
df = pd.read_csv(
    os.path.join("data", "data.csv"),  # read ./data/data.csv
    # sep='\t',       # if tsv
    # quotechar="'",  # if you see extra quotation marks after import
    header=None,
    index_col=[0, 1],  # only numbers, because there are no names yet
    names=names_list,
    dtype={"Phone number": "str"},  # read columns as the given type
    nrows=100,         # only read first 100 rows
)
CSV module #
See Python #CSV.
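For quick reference, a minimal csv-module sketch (the linked note covers more options):
import csv
import os
with open(os.path.join("data", "data.csv"), newline="") as f:
    for row in csv.reader(f):  # or csv.DictReader(f) for dict rows
        print(row)             # each row is a list of strings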
ODS #
pandas-ods-reader #
Extra dependencies: pandas-ods-reader, ezodf, and lxml
It supports skiprows, but not nrows.
import os
import pandas as pd
from pandas_ods_reader import read_ods
df = read_ods(
    os.path.join("data", "data.ods"),  # read ./data/data.ods
    2,                # get 2nd sheet (1-based), default 1
    # or: "sheet 1",  # get sheet by name
)
pd.read_excel #
Extra dependency: odfpy
Since odfpy does not have a row-selection method when loading files, the nrows option is applied only after the whole file has been loaded into memory, unlike with other table formats. This causes long loading times when the file is huge. See pandas-dev/pandas Issue #53185.
Slightly faster (and more robust) method:
import os
import pandas as pd
ods = pd.ExcelFile(os.path.join("data", "data.ods"))
ods.sheet_names  # get all sheet names
df = ods.parse("sheet 1")
Documented method:
import os
import pandas as pd
df = pd.read_excel(
    os.path.join("data", "data.ods"),  # read ./data/data.ods
)
Export DataFrame #
As Excel file #
df.to_excel(os.path.join("data", "data.xlsx"))
# or if the df is too big
df.iloc[:1000, :1000].to_excel(os.path.join("data", "data.xlsx"))
As plain text file #
Ref: Python, Pandas : write content of DataFrame into text File - Stack Overflow
# method 1
df.to_csv(os.path.join("data", "data.txt"))
# method 2
with open(os.path.join("data", "data.txt"), "w") as f:
    f.write(df.to_string())
Debugging #
Import #
If you get UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position xx: invalid start byte, try: (credit)
df = pd.read_csv(
    os.path.join("data", "data.csv"),  # read ./data/data.csv
    encoding="cp1252",
)
Calculation #
If you get TypeError: unhashable type: 'numpy.ndarray', it means one (or more) of the cells contains a list value (instead of a single value).
Get a subset of all rows containing list-value cells:
df_subset = df[df["Average"].apply(lambda x: isinstance(x, np.ndarray))]
Index #
If you get ValueError: Index contains duplicate entries, cannot reshape, it means there are duplicated index sets (rows where all index levels are identical).
First check that those rows contain the same data, then drop the rows with duplicate indices, as sketched below.
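A minimal check-then-drop sketch (same pattern as in Drop things below):
# inspect all rows whose full index is duplicated
df[df.index.duplicated(keep=False)]
# if they hold the same data, keep only the first occurrence
df = df[~df.index.duplicated(keep='first')]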
Observe things #
Calculate bivariate correlation coefficients and p-values matrix #
Ref: python - pandas columns correlation with statistical significance - Stack Overflow
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
# Correlation coefficients (the two lines below are equivalent):
corr = df.corr()
corr = df.corr(method=lambda x, y: pearsonr(x, y)[0])
# p-values:
p_val = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
# p_val = p_val.round(decimals=4)
Calculate mode by any level of MultiIndex #
Ref: python - GroupBy pandas DataFrame and select most common value - Stack Overflow
So far I haven't found any way to merge/join/concat this mode df back into the original df, except by pivoting the original df first.
mode could return a list of values (when there are ties). Make sure to check the type of your modes before proceeding to other tasks; see the sketch after the example below.
# mode of each ID
df.groupby(["ID"]).agg(lambda x: x.mode())
# mode of each ID x each date
df.groupby(["ID","Date"]).agg(lambda x: x.mode())
Calculate sum of all columns or rows #
# sum of all rows, by column
df.sum(axis=0)
# sum of all columns, by row
df.sum(axis=1)
Count length of levels of MultiIndex #
Ref: python - Hierarhical Multi-index counts in Pandas - Stack Overflow
df.groupby(level=["ID"]).size()
df.groupby(level=["ID","Date"]).size()
# example output:
# ID
# 01 240
# 02 240
# 03 240
# ...
# 57 211
# 59 211
# 60 211
# Length: 60, dtype: int64
Display full width of a DataFrame #
Ref:
- python - How can I display full (non-truncated) dataframe information in HTML when converting from Pandas dataframe to HTML? - Stack Overflow
- Python, Pandas : write content of DataFrame into text File - Stack Overflow
# method 1
with pd.option_context('display.max_colwidth', None):
    display(df)
# method 2
print(df.to_string())
List all unique values of a column #
Ref: python - print the unique values in every column in a pandas dataframe - Stack Overflow
print(df["Status Code"].unique())
List rows with duplicated values of a column #
Ref: How do I get a list of all the duplicate items using pandas in python? - Stack Overflow
df[df.duplicated(["ID"], keep=False)].sort_values("ID")
List rows with duplicated values of indices #
df[df.index.duplicated(keep='first')]
Select (filter) things #
Select columns by name #
# double brackets return a DataFrame (single brackets return a Series)
df_subset = df[['col1']]
df_subset = df[['col1', 'col2', ...]]
Select columns with RegEx #
Credit: python - How to select columns from dataframe by regex - Stack Overflow
rating_cols = df.filter(regex=("rating_*")).columns
# then when need to use
df[rating_cols] # returns a small df
df[rating_cols]["rating_2021_1"] # returns a series
Select rows by column value #
# only keep rows with certain value for a col
df_subset = df[df["Status Code"]=="A"]
# only keep rows without certain value for a col
df_subset = df[df["Status Code"]!="B"]
# only keep rows whose value starts with a certain string
# https://stackoverflow.com/a/17958424/10668706
df_subset = df[df["Name"].str.startswith("One d", na=False)]
# especially, for nan values:
df_subset = df[df["Status Code"].isna()]
df_subset = df[df["Status Code"].notna()]
When the values of some cells are lists, pandas will raise ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). In that case, try this method instead:
Credit: python - How do I select rows from a DataFrame based on column values? - Stack Overflow
# only keep rows with certain value for a col
df_subset = df.query('`Status Code` == "A"')
# only keep rows without certain value for a col
df_subset = df.query('`Status Code` != "B"')
Also see: (not tested)
- python - Pandas filter data frame rows by function - Stack Overflow
- python - Pandas filtering not working - Stack Overflow
Select rows by cell type #
Credit: python - Select row from a DataFrame based on the type of the object(i.e. str) - Stack Overflow
# only keep rows with certain types of value
df_subset = df[df["Rating"].apply(lambda x: isinstance(x, float))]
# or the opposite
df_subset = df[df["Rating"].apply(lambda x: not isinstance(x, float))]
Select rows by index #
Doc: pandas.DataFrame.loc — pandas 2.1.0 documentation
df.loc["A0001"]
Select (mask) triangle of DataFrame #
Ref: python - Melt the Upper Triangular Matrix of a Pandas Dataframe - Stack Overflow
import numpy as np
corr = df.corr()
# mask from np.ones, so zero coefficients are not dropped by accident;
# k=1 for dropping the diagonal, k=0 for keeping it
keep = np.triu(np.ones(corr.shape), k=1).astype('bool').reshape(corr.size)
# or for the lower triangle: k=-1 for dropping the diagonal, k=0 for keeping it
keep = np.tril(np.ones(corr.shape), k=-1).astype('bool').reshape(corr.size)
corr_flat = corr.stack()[keep].reset_index()
corr_flat.columns = ["x1", "x2", "Coef"]
After this, you can do sns.kdeplot(data=corr_flat, x="Coef") to see the kernel density of all the coefficients.
Groupby #
Groupby columns #
Use case: average across multiple columns with similar names, producing one mean column per group.
Step 0: Consider list of groups
groups = ["AAA", "BBB", ...]
len(groups)
# 18
Step 1: Write groupby function
def groupby_fn(col):
    groups = ["AAA", "BBB", ...]
    col_group = col.replace("Rating_", "").split("0")[0]
    if col_group in groups:
        return col_group  # mark col as col_group
    else:
        return "Others"   # mark col as Others
Step 2: Get columns to group by (optional)
cols = df.filter(regex=("Rating_*")).columns
Step 3: Groupby
df_grouped = df.groupby(groupby_fn, axis=1).mean()
# Or if used Step 2:
df_grouped = df[cols].groupby(groupby_fn, axis=1).mean()
# Check no col was marked as Others, otherwise check groupby function
len(df_grouped.columns)
# 18
Column-level operations #
Convert type of columns #
Inspect column types:
df.info()
To numeric:
# float
df["Rating"] = df["Rating"].astype('float')
# auto
df["Rating"] = pd.to_numeric(df["Rating"])
To string:
df = df.astype("string")
df = df.astype({"col1": "string", "col2": "float", ... }) # need quotes
df = df.astype(dict(col1="string", col2="float", ...)) # no quotes
To category:
from pandas.api.types import CategoricalDtype
rate_type = CategoricalDtype(
    categories=["No rating", "Bad", "Good", "Outstanding"],
    ordered=True,
)
df["Rating"] = df["Rating"].astype(rate_type)
With list comprehensions:
# datetime to string
df["Date"] = [ cell.strftime("%Y") for cell in df["Date"] ]
df["Location ODS Code"] = [str(val) for val in df["Location ODS Code"]]
df["Service / Population Group"] = [str(val) for val in df["Service / Population Group"]]
df["Domain"] = [str(val) for val in df["Domain"]]
Combine multiple columns #
Refs:
- python - Combine two columns of text in pandas dataframe - Stack Overflow
- python - pandas combine two strings ignore nan values - Stack Overflow
data["Address"] = data["Address Line 1"].fillna("")
for col in ["Address Line 2", "Address Line 3", "Address Line 4", "Address Line 5", "Postcode"]:
data["Address"] = data["Address"].astype(str) + " " + data[col].fillna("").astype(str)
Reorder columns #
Ref: python - How to change the order of DataFrame columns? - Stack Overflow
# Move the last column to the front
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
Index-level operations #
Unset indices:
# return all indices back to normal columns
df = df.reset_index()
# return only selected indices
df = df.reset_index(level=["Status", "Level", ...])
Set new indices:
df = df.set_index(["ID", ...])
DataFrame-level operations #
Combine two DataFrames #
With same columns (stack):
df_AB = pd.concat([df_A, df_B])
With same indices (join):
Some people online say that you can join a single-indexed DataFrame with a MultiIndex DataFrame using this command. That is not correct (or at least not stable): when the single shared index has values that differ between the two frames, i.e. outer does not equal inner, the join raises an error. To avoid this, first use commands like df_multi_index.index.set_names and df_multi_index.reset_index(level=[<all non-shared indices>]) to make the MultiIndex DataFrame single-indexed, then perform the join, as sketched below.
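A sketch of that flattening workaround (names here are hypothetical: df_multi has MultiIndex ["ID", "Date"], df_single is indexed by "ID" only):
df_multi.index = df_multi.index.set_names(["ID", "Date"])  # name the levels if needed
df_multi = df_multi.reset_index(level=["Date"])  # keep only the shared "ID" index
df_AB = df_multi.join(df_single, how="outer")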
Always join on the index. Do not use on="col" when performing a join; when the column dtypes differ between the two frames, pandas raises ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat.
If you do not want to modify the original df, use df_AB = df_A.set_index("col").join(df_B.set_index("col")).
df_AB = df_A.join(
    df_B,
    how="outer",  # keep all different index values
    lsuffix="A", rsuffix="B",  # if you get ValueError: columns overlap
)
If you select only one column of the left df with single brackets, it becomes a Series and the join fails with AttributeError: 'Series' object has no attribute 'join'. In this case, use double brackets:
df_AB = df_A[["Rating"]].join(
df_B,
how="inner",
)
Drop things #
Rows with duplicate indices:
Doc: Duplicate Labels — pandas 2.1.0 documentation
Credit: python - Remove pandas rows with duplicate indices - Stack Overflow
df.index.is_unique
# False
# Look at content of all duplicated index
df[df.index.duplicated()]
# Drop all rows with duplicated index but the first instance
df = df[~df.index.duplicated(keep='first')]
Columns:
df = df.drop(["drop", ...], axis=1)
Long to wide #
Reverse of pandas.wide_to_long — pandas 2.0.2 documentation.
For example, say you have a very long MultiIndex DataFrame like this (asterisk means index column):
                      VALUE
ID      CODE  LEVEL
*A0001* *A-1* *B-1*    96.0
              *B-2*     5.0
        *A-2* *B-1*    18.0
              *B-2*    18.0
              *B-3*     1.0
        *A-3* ...
...
But what you really want is one ID per row, with everything else in that row.
Step 1:
df_unstack = df.stack().unstack(level=['CODE', 'LEVEL'])
This has almost the same effect as the following (but the pivot version returns only one index level):
df_unstack = pd.pivot(
    df.reset_index(),
    index="ID",
    columns=["CODE", "LEVEL"],
    values="VALUE",
)
Refer to the Index debugging section above when in trouble.
Result:
CODE              A-1           A-2          ...
LEVEL             B-1    B-2    B-1    B-3   ...
ID      (unnamed)
*A0001* *VALUE*    96.0    5.0   18.0   18.0
*A0002* *VALUE*   505.0    5.0  182.0  191.0
*A0003* *VALUE*   275.0    5.0  156.0  159.0
*A0004* *VALUE*   245.0    5.0   72.0   72.0
*A0005* *VALUE*   377.0    5.0  196.0  196.0
Step 2:
df_unstack.columns = ['_'.join(col) for col in df_unstack.columns]
If you want to do this to the index, the same code also works:
df.index = ['_'.join(ind) for ind in df.index]
But make sure all values in the index are strings. If they are not, convert first:
df = df.reset_index()
df["Level"] = [str(val) for val in df["Level"]]
# or if values are timestamps
df["Date"] = [val.strftime("%Y-%m-%d") for val in df["Date"]]
Result:
                  A-1_B-1  A-1_B-2  A-2_B-1  ...
ID      (unnamed)
*A0001* *VALUE*      96.0      5.0     18.0
*A0002* *VALUE*     505.0      5.0    182.0
*A0003* *VALUE*     275.0      5.0    156.0
*A0004* *VALUE*     245.0      5.0     72.0
*A0005* *VALUE*     377.0      5.0    196.0
Step 3:
If you have used pd.pivot, skip this step.
df_unstack.index = df_unstack.index.set_names(["ID", "drop"])  # give the second index a name
df_unstack = df_unstack.reset_index(level=["drop"])  # un-index it
df_unstack = df_unstack.drop("drop", axis=1)  # remove the column
Result:
       A-1_B-1  A-1_B-2  A-2_B-1  ...
ID
A0001     96.0      5.0     18.0
A0002    505.0      5.0    182.0
A0003    275.0      5.0    156.0
A0004    245.0      5.0     72.0
A0005    377.0      5.0    196.0
Done!
Loop over rows #
Ref: python - How to iterate over rows in a DataFrame in Pandas - Stack Overflow
Docs:
- pandas.DataFrame.iterrows — pandas 2.1.0 documentation
- pandas.Series.items — pandas 2.1.0 documentation
row["new_col"]
does not work within the loops.Doc:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
for ind, row in qof_perc.iterrows():
    print("\033[1m" + "ind:" + "\033[0m" + "\n")
    print(ind)   # A0001
    print("\n" + "\033[1m" + "row:" + "\033[0m" + "\n")
    print(row)   # pd.Series
    # print(row["column1"])  # value of df.loc["A0001"]["column1"]
    print("\n" + "\033[1m" + "cells:" + "\033[0m" + "\n")
    for cell in row.items():
        print(cell[0])  # column name
        print(cell[1])  # cell value
    break
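If the goal is a new column, assign it in one go instead of writing inside the loop; a minimal sketch, with hypothetical column names and a hypothetical compute() helper:
# vectorized assignment: preferred over writing row["new_col"] inside iterrows()
df["new_col"] = df["column1"] * 2
# or collect the values first, then assign once
df["new_col"] = [compute(row) for _, row in df.iterrows()]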
Rename things #
Index column:
df.index = df.index.set_names(["ID"])
Normal column:
df = df.rename(columns={"A": "a", "B": "c"})
# or (no quotation marks)
df = df.rename(columns=dict(A="a", B="c"))
Batch rename columns:
# with string operation
df = df.rename(columns=lambda x: x.replace("_FINAL", ""))
df = df.rename(columns=lambda x: x+"_PERC")
# with regex
# https://stackoverflow.com/a/26500249/10668706
import re
df = df.rename(columns=lambda x: re.sub(" $", "", x))
Find & replace in all cells #
Credit: python - How to replace a value in pandas, with NaN? - Stack Overflow
df["rating"] = df["rating"].replace("No ratings", np.NaN)
Sort things #
Rows by values in columns:
df = df.sort_values(["ID","level"])
Rows by index values:
df = df.sort_index()
Columns by column names: (credit)
df = df.reindex(sorted(df.columns), axis=1)