Pandas #
Warning: This post hasn't been updated for over a year. The information may be out of date.
Most important notes #
axis: {0 or ‘index’ (rows), 1 or ‘columns’}, default 0
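For example, a quick sketch of what the axis argument does (the column names are made up):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df.sum(axis=0)  # or axis="index": collapse the rows, one result per column
df.sum(axis=1)  # or axis="columns": collapse the columns, one result per row
df.drop("b", axis=1)  # axis=1/"columns": "b" is treated as a column label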
(Maybe) Useful resources #
Import Data #
Dict/JSON #
Doc: pandas.DataFrame.from_dict — pandas 1.3.5 documentation
For dict like {"id1": data1, "id2": data2, ...}:
df = pd.DataFrame.from_dict(
data_dict,
orient='index',
columns=["data"], # otherwise very troublesome to rename later
)
df.index = df.index.set_names(["ID"])
For dict like: {"row1": [...], "row2": [...], ...}, and JSON files:
df = pd.DataFrame.from_dict(
data_json
).transpose()
Excel #
Doc: pandas.read_excel — pandas 2.0.0 documentation
Extra dependency:
openpyxl
Read sheet:
import pandas as pd
df = pd.read_excel(
os.path.join("data", "data.xlsx"), # read ./data/data.xlsx
sheet_name="Sheet 1",
skiprows=1, # if there are pre-data rows
header=[0,1], # if there are multiple (combined) header rows
nrows=100, # only read first x rows
usecols=["ID", "Rating", "Date"], # only import certain columns
index_col="ID", # set index as soon as import
)
Read names of all sheets: (credit)
import pandas as pd
xl = pd.ExcelFile(os.path.join("data", "data.xlsx"))
xl.sheet_names # list of sheet names
df = xl.parse("Sheet 1") # if need sheets
Read all sheets: (credit)
This is significantly slower than the pd.ExcelFile method above. Only use when you actually need every single sheet.
import pandas as pd
dfs = pd.read_excel(file_name, sheet_name=None)
dfs.keys() # similar result as xl.sheet_names
Custom usecols function for fuzzy column names #
See Python #Match case (Switch)
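For reference, a minimal sketch of such a usecols callable (the column-name patterns are hypothetical); pd.read_excel calls it once per header and keeps a column only when it returns True:
import os
import re
import pandas as pd
class RegexEqual(str):
    def __eq__(self, pattern):
        return bool(re.match(pattern, self))
def usecols_fn(col):
    # return True to keep the column; the patterns below are made-up examples
    match RegexEqual(col):
        case "ID":
            return True
        case r"[A-Za-z ]*Rating":  # fuzzy: "Rating", "Overall Rating", ...
            return True
        case r"[A-Za-z ]*Date":
            return True
    return False
df = pd.read_excel(os.path.join("data", "data.xlsx"), usecols=usecols_fn)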
CSV/TSV #
Pandas #
Doc: pandas.read_csv — pandas 2.0.0 documentation
import pandas as pd
names_list = ["col 1", "col 2", "col 3"]
df = pd.read_csv(
os.path.join("data", "data.csv"), # read ./data/data.csv
# sep='\t', # if tsv
# quotechar="'", # if see extra quotation marks after import
header=None,
index_col=[0, 1], # can use names if given
names=names_list,
usecols=[0, 1, 2, 3], # can use names if given
dtype={"Phone number": "str"}, # read columns as type
nrows=100, # only read first 100 rows
)
CSV module #
See Python #CSV.
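For quick reference, a minimal sketch using the standard library csv module (same file path as above):
import csv
import os
with open(os.path.join("data", "data.csv"), newline="") as f:
    reader = csv.DictReader(f)  # or csv.reader(f) for plain lists
    rows = list(reader)         # list of dicts keyed by the header row
    # for TSV: csv.DictReader(f, delimiter="\t")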
ODS #
pandas-ods-reader #
Extra dependency:
pandas-ods-reader, ezodf, and lxml
The newest code on GitHub supports skiprows, but not nrows. However, the pip version does not support skiprows yet. To get it, download and use the development version of the package.
This is by far the fastest method, if you know the sheet name.
import pandas as pd
from pandas_ods_reader import read_ods
df = read_ods(
os.path.join("data", "data.ods"), # read ./data/data.ods
sheet=2, # get 2nd sheet (1 based), default 1
# or
sheet="sheet 1", # get sheet by name
skiprows=1, # only works if you have the development version
columns=["ID", "Rating", "Date"] # this is NOT usecols!! it renames the columns, and truncates the sheet if list is shorter than # of columns
)
# Simulate usecols callable
usecols = pd.array([usecols_fn(col) for col in df.columns], dtype="boolean")
df = df.loc[:, usecols]
pd.read_excel #
Extra dependency:
odfpy
Since odfpy does not have a row-selection method when loading files, the nrows option is applied only after the whole file has been loaded into memory, unlike with other table formats. This causes long loading times when the file is huge. See pandas-dev/pandas Issue #53185.
Slightly faster (and more robust) method:
import pandas as pd
ods = pd.ExcelFile(os.path.join("data", "data.ods"))
ods.sheet_names # get all sheet names
df = ods.parse("sheet 1")
Documented method:
import pandas as pd
df = pd.read_excel(
os.path.join("data", "data.ods"), # read ./data/data.ods
)
Stata dta #
pd.read_stata #
Convert (all) categorical to strings (objects):
This will use up significantly more memory compared to the second method below.
import pandas as pd
data_path = os.path.join("data", "data.dta")
df = pd.read_stata(
data_path,
convert_categoricals=False,
)
with pd.io.stata.StataReader(data_path) as sr:
value_labels = sr.value_labels()
for col in df:
if col in value_labels:
df[col].replace(value_labels[col], inplace=True)
pd.io.stata.StataReader #
If you get ValueError: Value labels for column col_x are not unique. These cannot be converted to pandas categoricals. The repeated labels are: Not applicable, here is an alternative way of doing the import that I found by digging into the (no longer documented) methods.
It may be easier if you have access to Stata and can do the cleaning up in Stata first. See: python - Stata to Pandas: even if there are repeated Value Labels? - Stack Overflow
More detailed explanation: python - Loading STATA file: Categorial values must be unique - Stack Overflow
Docs:
- IO tools (text, CSV, HDF5, …) — pandas 2.1.1 documentation
- pandas.read_stata — pandas 2.1.1 documentation
Step 1: read the Stata file with the io reader.
import pandas as pd
df_reader = pd.io.stata.StataReader(
os.path.join("data", "data.dta"),
)
Step 2: Verify the repeated labels.
print(df_reader.value_labels()["col_x"])
# {...
# -1: 'Not applicable',
# ...
# 6: 'Not applicable'}
Step 3: Replace the repeated value with something else.
df_reader.value_labels()["col_x"][6] = "Not applicable (1)"
Step 4: Read the file from reader.
df = df_reader.read()
Done!
Since the code states that using StataReader as a context manager is the only supported method, this is the context manager version:
import pandas as pd
data_path = os.path.join("data", "data.dta")
with pd.io.stata.StataReader(data_path) as sr:
value_labels = sr.value_labels()
# print(value_labels["col_x"])
# ...
value_labels["col_x"][6] = "Not applicable (1)"
# ...
df = sr.read()
Export DataFrame #
Excel/ODS #
Extra dependency:
openpyxl or odfpy
df.to_excel(os.path.join("data", "data.xlsx"))
df.to_excel(os.path.join("data", "data.ods"))
# or if the df is too big
df.iloc[:1000, :1000].to_excel(os.path.join("data", "data.xlsx"))
CSV/Plain text #
Ref: Python, Pandas : write content of DataFrame into text File - Stack Overflow
# CSV
df.to_csv(os.path.join("data", "data.csv"))
# text method 1
df.to_csv(os.path.join("data", "data.txt"))
# text method 2
with open(os.path.join("data", "data.txt"), "w") as f:
f.write(df.to_string())
Debugging #
Import #
If get UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position xx: invalid start byte, try: (credit)
df = pd.read_csv(
os.path.join("data", "data.csv"), # read ./data/data.csv
encoding="cp1252",
)
Calculation #
If get TypeError: unhashable type: 'numpy.ndarray', it means one (or more) of the cells contain list value (instead of single value).
Get a subset of all rows containing list-value cells:
df_subset = df[df["Average"].apply(lambda x: isinstance(x, np.ndarray))]Index #
If get ValueError: Index contains duplicate entries, cannot reshape, it means there are duplicate indices sets (rows where all indices are identical).
First check the rows contain same data, then drop rows with duplicate indices.
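A hedged sketch of that check-then-drop sequence (also see #Drop things):
# rows that share an index value with another row
dupes = df[df.index.duplicated(keep=False)]
# True if any duplicated index carries conflicting data
dupes.groupby(dupes.index).nunique().gt(1).any(axis=None)
# keep only the first row for each duplicated index
df = df[~df.index.duplicated(keep='first')]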
PerformanceWarning: DataFrame is highly fragmented could happen after a concat, when no index is present in all sub-dfs, but somehow there are still repeated indices. Try the following:
df = df.reset_index()
df = df.drop(["index"], axis=1)Convert series to frame with title #
sr.name = "Column"
df = sr.to_frame()
Type conversion upon merge or join #
If integer columns pick up missing values after a merge or join and get upcast to float, converting them to the nullable Int64 dtype keeps them as integers:
df = df.astype('Int64')
Observe things #
Descriptive statistics #
Docs:
- pandas.DataFrame.describe — pandas 2.2.2 documentation
- pandas.DataFrame.quantile — pandas 2.2.2 documentation
- Essential basic functionality # Custom describe — pandas 2.2.2 documentation
Basic descriptive table:
df.describe()
# with multiple index
df.groupby("ID").min().describe()
Quantiles:
df.quantile([.1,.25,.5,.75,.9,.99])
# with multiple index
df.groupby("ID").min().quantile([.1,.25,.5,.75,.9,.99])
Custom everything: (results could differ slightly from quantile above)
from functools import partial
q_90 = partial(pd.Series.quantile, q=0.90)
q_90.__name__ = "90%"
q_99 = partial(pd.Series.quantile, q=0.99)
q_99.__name__ = "99%"
df.agg(["count", "mean", "std", "min", "median", q_90, q_99, "max"])
Calculate bivariate correlation coefficients and p-values matrix #
Ref: python - pandas columns correlation with statistical significance - Stack Overflow
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
# Correlation coefficients: (same)
corr = df.corr()
corr = df.corr(method=lambda x, y: pearsonr(x, y)[0])
# p-values:
p_val = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
# p_val = p_val.round(decimals=4)
Calculate mode by any level of MultiIndex #
Ref: python - GroupBy pandas DataFrame and select most common value - Stack Overflow
So far I haven’t found any way to merge/join/concat this mode df with the original df, except to pivot the original df first.
mode could return a list of values. Make sure to check the type of your modes before proceeding to other tasks.
# mode of each ID
df.groupby(["ID"]).agg(lambda x: x.mode())
# mode of each ID x each date
df.groupby(["ID","Date"]).agg(lambda x: x.mode())Calculate sum of all columns or rows #
# sum of all rows, by column
df.sum(axis=0)
# sum of all columns, by row
df.sum(axis=1)
Count length of levels of MultiIndex #
Ref: python - Hierarhical Multi-index counts in Pandas - Stack Overflow
df.groupby(level=["ID"]).size()
df.groupby(level=["ID","Date"]).size()
# example output:
# ID
# 01 240
# 02 240
# 03 240
# ...
# 57 211
# 59 211
# 60 211
# Length: 60, dtype: int64
Count number of values in columns #
df["Ratings"].value_counts()
# If all columns are similar
df.apply(lambda col: col.value_counts(), axis="index")
# Count changes between two columns
df[["2014-2015", "2015-2016"]].value_counts()
df.query('`2014-2015`!=`2015-2016`')[["2014-2015", "2015-2016"]].value_counts()
Elaborated use case: track changes between every two columns #
Refs:
- python - Pandas dataframe total row - Stack Overflow
- python - Creating an empty Pandas DataFrame, and then filling it - Stack Overflow
- python - Using .loc with a MultiIndex in pandas - Stack Overflow
- MultiIndex / advanced indexing — pandas 2.2.1 documentation
Also see: Python #Loop
Track changes of any (string/categorical) value across every two columns
import pandas as pd
from itertools import tee, islice, chain
def p_i_n(some_iterable):
prevs, items, nexts = tee(some_iterable, 3)
prevs = chain([None], prevs)
nexts = chain(islice(nexts, 1, None), [None])
return zip(prevs, items, nexts)
def value_count_across(df):
count_across_list = []
cols_labels = ["Before", "After"]
for p,i,n in p_i_n(df.columns):
if None is p:
print("Start of list")
print("Processing", i)
if None is n:
continue
count = df[[i, n]].value_counts()
count.index = count.index.set_names(cols_labels)
count = count.to_frame().sort_values(cols_labels)
col_name = i + "-" + n.split("-")[1]
count = count.rename(columns={0: col_name})
count_across_list.append(count)
count_across_df = count_across_list[0].join(
count_across_list[1:],
how="outer",
).astype('Int64').sort_values(cols_labels).reset_index()
mask_changed = (count_across_df["Before"] != count_across_df["After"])
count_across_df_changed = count_across_df[mask_changed].set_index(cols_labels)
count_across_df_changed.loc[("Total","Changed"),:] = count_across_df_changed.sum(axis="index")
count_across_df_unchanged = count_across_df[~mask_changed].set_index(cols_labels)
count_across_df_unchanged.loc[("Total","Unchanged"),:] = count_across_df_unchanged.sum(axis="index")
count_across_df = count_across_df.set_index(cols_labels)
count_across_df.loc[("Total","Total"),:] = count_across_df.sum(axis="index")
count_across_df.loc[("Total","Changed"),:] = count_across_df_changed.loc[("Total","Changed"),:]
count_across_df.loc[("Total","Unchanged"),:] = count_across_df_unchanged.loc[("Total","Unchanged"),:]
return count_across_df
Track changes of any (ordinal categorical/numerical) value across every two columns:
import pandas as pd
from itertools import tee, islice, chain
def p_i_n(some_iterable):
prevs, items, nexts = tee(some_iterable, 3)
prevs = chain([None], prevs)
nexts = chain(islice(nexts, 1, None), [None])
return zip(prevs, items, nexts)
def value_compare_across(df):
count_better_list = []
count_worse_list = []
count_same_list = []
for p,i,n in p_i_n(df.columns):
if None is p:
print("Start of list")
print("Processing", i)
if None is n:
continue
better = sum(df[n] > df[i])
count_better_list.append(better)
worse = sum(df[n] < df[i])
count_worse_list.append(worse)
same = sum(df[n] == df[i])
count_same_list.append(same)
count_list = [count_better_list, count_worse_list, count_same_list]
count_df = pd.DataFrame(
count_list,
columns = df.columns[1:],
index=["Increased", "Decreased", "Same"]
)
count_df.loc["Changed"] = count_df.loc[["Increased","Decreased"]].sum(axis="index")
count_df.loc["Total"] = count_df.loc[["Changed","Same"]].sum(axis="index")
return count_df
Find changes within groups (indices) #
# need to reset index if already has groups as index
df_mark = (df.reset_index().groupby(['ID'])
.nunique()
.gt(1)
.replace({True: 'X', False: ''})
)
changed_mask = df_mark["Rating"]=="X"
Assign ranks/observation count to rows #
Ref: python - Number rows within group in increasing order in a pandas dataframe - Stack Overflow
Doc: pandas.DataFrame.rank — pandas 2.2.1 documentation
df["Obs"] = df.groupby(["ID", "Year"])["Date"].rank(method="first", ascending=True).astype('Int64')Use rank instead of staggered dates to pivot:
pd.pivot(
df,
index=["ID", "Year"],
columns="Obs",
values=["Rating", "Date"],
)
Count/find missing values #
Ref: python - How to check if any value is NaN in a Pandas DataFrame - Stack Overflow
# Check the whole df
df.isnull().values.any()
# Check each column
df.isnull().sum(axis="index")
# Check each row
df.isnull().sum(axis="columns")
# For a column, list rows with missing value
# https://stackoverflow.com/a/47377251/10668706
df[df["Bad column"].isnull()]
# For a row, list columns with missing value
# https://stackoverflow.com/a/57141046/10668706
df.loc["Bad_ID", df.isnull().any()]Display full width of a DataFrame #
Ref:
- python - How can I display full (non-truncated) dataframe information in HTML when converting from Pandas dataframe to HTML? - Stack Overflow
- Python, Pandas : write content of DataFrame into text File - Stack Overflow
# method 1
with pd.option_context('display.max_colwidth', None):
display(df)
# method 2
print(df.to_string())
Find the n-th largest and smallest number #
# full list of largest n and smallest n
df["col"].nlargest(3)
df["col"].nsmallest(3)
# only the single point
df["col"].nlargest(3).iloc[[-1]]
df["col"].nsmallest(3).iloc[[-1]]List all index values #
df.index
# Result:
# Index(...)
# MultiIndex(...)
# For MultiIndex, if only want some level
df.index.get_level_values(0)
df.index.get_level_values("Time")
# Result:
# Index(...)
Further conversion: (also see: #By column value (boolean indexing))
df.index.values
# Result:
# array(...)
df.index.to_list()
# Result:
# [...]
List all columns #
Ref: python - How to show all columns’ names on a large pandas dataframe? - Stack Overflow
For huge databases, best to refer to its data manual.
# Cut off after 1,000 columns
print(df.columns.tolist())
# Shows everything, slow:
with pd.option_context("display.max_columns", None):
print(df.head())
List all unique values of a column #
Ref: python - print the unique values in every column in a pandas dataframe - Stack Overflow
print(df["Status Code"].unique())To sort the unique values, use: (dropna in case there is NaN)
sorted(df["Status Code"].dropna().unique())List rows with duplicated values of a column #
Ref: How do I get a list of all the duplicate items using pandas in python? - Stack Overflow
df[df.duplicated(["ID"], keep=False)].sort_values("ID")List rows with duplicated values of indices #
df[df.index.duplicated(keep='first')]
Select columns (filter) #
By name #
# must use double brackets
df_subset = df[['col1']]
df_subset = df[['col1', 'col2', ...]]
# contains
# https://stackoverflow.com/a/11531402
df_subset = df.loc[:, df.columns.str.contains("hello")]
By name with RegEx #
Credit: python - How to select columns from dataframe by regex - Stack Overflow
rating_cols = df.filter(regex=("rating_*")).columns
# then when need to use
df[rating_cols] # returns a small df
df[rating_cols]["rating_2021_1"] # returns a seriesBy column type #
Doc: pandas.DataFrame.select_dtypes — pandas 2.1.1 documentation
# See type of all columns
df.dtypes
# Only get columns of float type
df_subset = df.select_dtypes("float")
Select rows (filter) #
By index #
Doc: pandas.DataFrame.loc — pandas 2.1.0 documentation
df.loc["A0001"]By column value (boolean indexing) #
Doc: Indexing and selecting data # Boolean indexing — pandas 2.2.1 documentation
# only keep rows with certain value for a col
df_subset = df[df["Status Code"]=="A"]
# only keep rows without certain value for a col
df_subset = df[df["Status Code"]!="B"]
# only keep rows with a certain starting…
# https://stackoverflow.com/a/17958424/10668706
df_subset = df[df["Name"].str.startswith("One d", na=False)]
# …or without
df_subset = df[~df["Name"].str.startswith("One d", na=False)]
# especially, for nan values:
df_subset = df[df["Status Code"].isna()]
df_subset = df[df["Status Code"].notna()]
# combining multiple conditions
mask_1 = df["Status Code"].isna()
mask_2 = ~df["Name"].str.startswith("One d", na=False)
df_subset = df[mask_1 * mask_2]
Get MultiIndex as a list from boolean masks: (ref, also see my comment)
import numpy as np
import pandas as pd
mask_1 = df["Status Code"].isna()
index_list = np.array(mask_1.loc[lambda item: item].index.to_list())
# all first indices
index_list.T[0]
# all second indices
index_list.T[1]
By column value (query) #
Docs:
- pandas.DataFrame.query — pandas 2.2.1 documentation
- Indexing and selecting data # The query() Method — pandas 2.2.1 documentation
When the values of some cells are lists, pandas will raise ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). In that case, try this method instead.
Credit: python - How do I select rows from a DataFrame based on column values? - Stack Overflow
Also see: Calculate average of columns with many filters (conditions)
# only keep rows with certain value for a col
# https://stackoverflow.com/a/50697599
df_subset = df.query('`Status Code` == "A"')
# only keep rows with certain value for a col
df_subset = df.query('`Status Code` != "B"')
# no backticks for single word columns
df_subset = df.query('Date != 2015')
# multiple conditions
df_subset = df.query('Date != 2015 & `Status Code` == "A"')
For multi-level columns: (ref)
First backtick-quote the tuple, then single-quote string column names (except for numerical columns).
df_subset = df.query("`('Category', 1)` > `('Category', 2)`")To use variable as column name: (ref, the accepted method doesn’t work in this case)
for i in names_list:
search = df[[i]].query("{}<5".format(i)).index.to_list()
print(search)
break
Also see: (not tested)
- python - Pandas filter data frame rows by function - Stack Overflow
- python - Pandas filtering not working - Stack Overflow
By value from another DataFrame #
Ref: python - pandas - filter dataframe by another dataframe by row elements - Stack Overflow
The core idea is to get a list of index (from another DataFrame), and use the index list to filter the original DataFrame.
This extra column is not within the original DataFrame, the only thing they need to have in common is the index.
With boolean indexing:
i_q_0_258 = df_q_sum[(df_q_sum['q'] < 258) & (df_q_sum['q'] >0)].index
# old way of subsetting
df_q_0_258 = df[df.index.isin(i_q_0_258)]
# new way of subsetting
df_q_0_258 = df.loc[i_q_0_258]
len(df_q_0_258)
# 41
With query:
i_q_0_258 = df_q_sum.query('q < 258 & q > 0').index
# old way of subsetting
df_q_0_258 = df[df.index.isin(i_q_0_258)]
# new way of subsetting
df_q_0_258 = df.loc[i_q_0_258]
len(df_q_0_258)
# 41
By cell value in any column or any row #
Credit: python - Test if any column of a pandas DataFrame satisfies a condition - Stack Overflow
# boolean matrix with same shape as df
(df[df.columns]<=5.0)
(df[df.columns]<=5.0).any(axis="columns") # list of rows
(df[df.columns]<=5.0).any(axis="index") # list of columns
(df[df.columns]<=5.0).any(axis="columns").sum() # number of rows
(df[df.columns]<=5.0).any(axis="index").sum() # number of columnsBy cell type in column #
Credit: python - Select row from a DataFrame based on the type of the object(i.e. str) - Stack Overflow
# only keep rows with certain types of value
df_subset = df[df["Rating"].apply(lambda x: isinstance(x, float))]
# or the opposite
df_subset = df[df["Rating"].apply(lambda x: not isinstance(x, float))]Select triangle of DataFrame #
python - Melt the Upper Triangular Matrix of a Pandas Dataframe - Stack Overflow
corr = df.corr()
# k=1 for dropping diagonal, k=0 for keeping it
keep = np.triu(corr, k=1).astype('bool').reshape(corr.size)
# or for tril, k=-1 for dropping diagonal, k=0 for keeping it
keep = np.tril(corr, k=-1).astype('bool').reshape(corr.size)
corr_flat = corr.stack()[keep].reset_index()
corr_flat.columns = ["x1","x2","Coef"]
After this, can do a sns.kdeplot(data=corr_flat,x="Coef") to see the kernel density of all the coefficients.
Groupby #
Doc: pandas.DataFrame.groupby — pandas 2.1.1 documentation
Calculate average of columns with many filters (conditions) #
import pandas as pd
# Condition: A=Yes or B=Yes
# For and conditions, use &
df_subset = hse_micro.query('A == "Yes" | B == "Yes"')
# For every possible combination of C and D,
# get average of AGE and YEAR
df_subset.groupby(["C", "D"])[["AGE", "YEAR"]].mean()
Groupby column names (custom groupby function) #
Use case: get average columns across multiple columns with similar names.
Step 0: Consider list of groups
groups = ["AAA", "BBB", ...]
len(groups)
# 18
Step 1: Write groupby function
def groupby_fn(col):
groups = ["AAA", "BBB", ...]
col_group = col.replace("Rating_", "").split("0")[0]
if col_group in groups:
return col_group # Mark col as col_group
else:
return "Others" # Mark col as OthersStep 2: Get columns to group by (optional)
cols = df.filter(regex=("Rating_*")).columns
Step 3: Groupby
df_grouped = df.groupby(groupby_fn, axis=1).mean()
# Or if used Step 2:
df_grouped = df[cols].groupby(groupby_fn, axis=1).mean()
# Check no col was marked as Others, otherwise check groupby function
len(df_grouped.columns)
# 18
Done!
Column operations #
Convert number to range categories #
Use case: one column of various numbers => one column of ranges in words.
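Note: when the ranges are plain numeric bins, pd.cut can also do this in one call (a quick sketch; the bin edges reuse the q_sum cut-offs from the methods below):
df["q_groups"] = pd.cut(
    df["q_sum"],
    bins=[-float("inf"), 258, 289, 434, float("inf")],
    labels=["Q<258", "258<=Q<289", "289<=Q<434", "Q>=434"],
    right=False,  # left-closed bins, so 258 falls into "258<=Q<289"
)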
Method 1: empty column #
Ref: python - How to add an empty column to a dataframe? - Stack Overflow
df["q_groups"] = ""
# Note: this used to be: df[(df["q_sum"]>=434)]["q_groups"]
# But newer versions of pandas will raise warning
df.loc[(df["q_sum"]>=434), "q_groups"] = "Q>=434"
df.loc[(df["q_sum"]>=289) & (df["q_sum"]<434), "q_groups"] = "289<=Q<434"
df.loc[(df["q_sum"]>=258) & (df["q_sum"]<289), "q_groups"] = "258<=Q<289"
df.loc[(df["q_sum"]<258), "q_groups"] = "Q<258"Method 2: boolean columns #
Step 1: get range booleans (credit)
The new column names here will be the name of ranges.
df["Q>=434"] = (df["q_sum"] >= 434)
df["289<=Q<434"] = (df["q_sum"] >= 289) & (df["q_sum"] < 434)
df["258<=Q<289"] = (df["q_sum"] >= 258) & (df["q_sum"] < 289)
df["Q<258"] = (df["q_sum"] < 258)Step 2: make backup df (skip if no other columns are boolean)
q_groups = df.copy(deep=True)
q_groups = q_groups.drop(["q_sum", ...], axis=1) # drop all other columns
Step 3: make one column (credit)
q_groups_col = q_groups.eq(True).dot(q_groups.columns + ',').str.rstrip(',')
Step 4: merge back
df["q_groups"] = q_groups_colDone!
Convert type of columns #
Inspect column types:
df.info()
To numeric:
# float
df["Rating"] = df["Rating"].astype(float)
# auto
df["Rating"] = pd.to_numeric(df["Rating"])
# int: useful when doing boolean -> int
# https://stackoverflow.com/a/27362540
df["Dummy"] = df["Dummy"].astype(int)
df["Dummy"] = df["Dummy"].map({'True': 1,'False' :0 })
# Int64: use when astype(float) raises
# TypeError: float() argument must be a string or a real number, not 'NAType'
df["Rating"] = df["Rating"].astype("Int64")
To string:
df = df.astype("string")
df = df.astype({"col1": "string", "col2": "float", ... }) # need quotes
df = df.astype(dict(col1="string", col2="float", ...)) # no quotes
To category:
from pandas.api.types import CategoricalDtype
rate_type = CategoricalDtype(
categories=["No rating", "Bad", "Good", "Outstanding"],
ordered=True,
)
df["Rating"] = df["Rating"].astype(rate_type)To list: (ref)
import ast
df["Lists"] = df["Lists"].map(ast.literal_eval)With inline functions:
# datetime
df["Date"] = [np.datetime64(cell) for cell in df["Date"]]
df["Date"] = [cell.strftime("%Y") for cell in df["Date"]]
# string
df["ID"] = [str(val) for val in df["ID"]]Combine multiple columns #
Refs:
- python - Combine two columns of text in pandas dataframe - Stack Overflow
- python - pandas combine two strings ignore nan values - Stack Overflow
data["Address"] = data["Address Line 1"].fillna("")
for col in ["Address Line 2", "Address Line 3", "Address Line 4", "Address Line 5", "Postcode"]:
data["Address"] = data["Address"].astype(str) + " " + data[col].fillna("").astype(str)Divide columns by columns #
# Divide one column by another column
# https://stackoverflow.com/a/34540647
df["percent"] = df["numerator"].div(df["denominator"], axis=0)
# Divide all elements in a df by all elements in another df
# To avoid errors caused by NA
df_perc = df_numerator.truediv(df_denominator)
# Take the sum of every row first
df_perc = df_numerator.sum(axis="columns").truediv(df_denominator.sum(axis="columns"))Fill NA with certain values #
Using a given fixed value:
# https://stackoverflow.com/a/38134049/10668706
df[["a", "B"]] = df[["a", "B"]].fillna(value="Something")Using existing values:
# Caution: This means filling along the rows
df = df.ffill(axis="columns")
# Caution: This means filling along the column
# https://stackoverflow.com/a/63190278
df[["Some_col"]] = df[["Some_col"]].groupby("ID").apply(lambda y: y.ffill(axis="index"))Get column-wise delta (difference) #
Docs:
- pandas.DataFrame.diff — pandas 2.1.1 documentation
- pandas.DataFrame.pct_change — pandas 2.1.1 documentation
# numerical changes
df_diff = df.diff(axis=1) # first column empty
df_diff = df.diff(periods=2, axis=1) # first two columns empty
# percentage change
df_diff_pct = df.pct_change(axis=1) # first column empty
df_diff_pct = df.pct_change(periods=2, axis=1) # first two columns empty
Reorder columns #
Ref: python - How to change the order of DataFrame columns? - Stack Overflow
# Take the last column to first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
Index operations #
Set index #
From no index, or discard all old index (warning: their values will be lost).
df = df.set_index(["ID", ...])Add extra index columns besides existing index: (ref)
df = df.set_index([df.index, "Status", "Level", ...])
Unset index (return to data columns) #
# return all indices back to normal columns
df = df.reset_index()
# return only selected indices
df = df.reset_index(level=["Status", "Level", ...])Drop redundant levels (index or column) #
df = df.droplevel("Unused", axis="index")
df = df.droplevel("Unused", axis="columns")DataFrame operations #
Combine two DataFrames #
Docs:
- Merge, join, concatenate and compare — pandas 2.1.1 documentation
- pandas.DataFrame.merge — pandas 2.1.1 documentation
- pandas.concat — pandas 2.1.1 documentation
- pandas.DataFrame.join — pandas 2.1.1 documentation
Concat (same columns) #
(Stack):
df_AB = pd.concat([df_A, df_B])
Make all column names upper case or lower case: (credit) (also see: lower case for cell values)
# should also work on pd.merge, not tested
df_AB = pd.concat(
[df_A.rename(columns=str.lower),
df_B.rename(columns=str.lower)]
)
df_AB = pd.concat(
[df_A.rename(columns=str.upper),
df_B.rename(columns=str.upper)]
)
Join (same indices) #
Always use the index to join. Do not use on="Data col" when performing a join, or pandas will raise ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat.
If you do not want to modify the original df, use df_AB = df_A.set_index("col").join(df_B.set_index("col")).
You can join a single-indexed DataFrame with a MultiIndex DataFrame using this command, but it must be approached very carefully. When the shared single index has different values, i.e., outer does not equal inner, the join function returns an error. Possible methods (see the sketch after this list):
- Make the MultiIndex DataFrame single-indexed first, using commands like df_multi_index.index.set_names and df_multi_index.reset_index(level=[<all non-shared indices>]).
- Avoid how="outer". If the MultiIndex DataFrame is the left one, use how="left" and on="<shared_single_index>".
With two parts:
df_AB = df_A.join(
df_B, # Or [df_B, df_C, ...]
how="outer", # Keep all different index values
lsuffix="A", rsuffix="B", # if get ValueError: columns overlap
)
If you select only one column of the left df with single brackets, it will return AttributeError: 'Series' object has no attribute 'join'. In this case, use double brackets:
df_AB = df_A[["Rating"]].join(
df_B,
how="inner",
)
With a single list of DataFrames: (index is ID) (credit)
import pandas as pd
from functools import reduce
df = reduce(lambda x, y: pd.merge(x, y, on="ID", how="outer"), df_list)
Drop things #
Rows with duplicate indices:
Doc: Duplicate Labels — pandas 2.1.0 documentation
Credit: python - Remove pandas rows with duplicate indices - Stack Overflow
df.index.is_unique
# False
# Look at rows of all duplicated index
df[df.index.duplicated(keep=False)]
# Drop all rows with duplicated index but the first instance
df = df[~df.index.duplicated(keep='first')]
Columns:
df = df.drop(["drop", ...], axis=1)Rows:
df = df.drop(["drop", ...], axis=0)Wide to long (melt/unpivot) #
Docs:
- pandas.DataFrame.melt — pandas 2.2.2 documentation
- pandas.melt — pandas 2.2.2 documentation
- pandas.wide_to_long — pandas 2.2.2 documentation
Time series wide to long:
df_long = df.melt(
id_vars="ID",
var_name="year",
value_name="variable",
)
Long to wide (pivot) #
For example, say you have a very long MultiIndex DataFrame like this (asterisk means index column):
VALUE
ID CODE LEVEL
*A0001* *A-1* *B-1* 96.0
*B-2* 5.0
*A-2* *B-1* 18.0
*B-2* 18.0
*B-3* 1.0
*A-3* ...
...
But what you really want is one ID per row, with everything in that row.
Step 1:
df_unstack = df.stack().unstack(level=['CODE', 'LEVEL'])
This has almost identical effects as the pivot below (but the pivot returns only one index):
df_unstack = pd.pivot(
df.reset_index(),
index="ID",
columns=["CODE", "LEVEL"],
values="VALUE",
)
Refer to index debugging when in trouble.
Result:
CODE A-1 A-2 ...
LEVEL B-1 B-2 B-1 B-3 ...
ID (unnamed)
*A0001* *VALUE* 96.0 5.0 18.0 18.0
*A0002* *VALUE* 505.0 5.0 182.0 191.0
*A0003* *VALUE* 275.0 5.0 156.0 159.0
*A0004* *VALUE* 245.0 5.0 72.0 72.0
*A0005* *VALUE* 377.0 5.0 196.0 196.0
Step 2:
df_unstack.columns = ['_'.join(col) for col in df_unstack.columns]
If you want to do this to the index
This code also works for index:
df.index = ['_'.join(ind) for ind in df.index]
But you need to make sure all values in the indices are strings. If not, convert them first:
df = df.reset_index()
df["Level"] = [str(val) for val in df["Level"]]
# or if values are timestamps
df["Date"] = [val.strftime("%Y-%m-%d") for val in df["Date"]]Result:
A-1_B-1 A-1_B-2 A-2_B-1 ...
ID (unnamed)
*A0001* *VALUE* 96.0 5.0 18.0
*A0002* *VALUE* 505.0 5.0 182.0
*A0003* *VALUE* 275.0 5.0 156.0
*A0004* *VALUE* 245.0 5.0 72.0
*A0005* *VALUE* 377.0 5.0 196.0
Step 3:
If you have used pd.pivot, skip this step.
df_unstack.index = df_unstack.index.set_names(["ID", "drop"]) # give the second index a name
df_unstack = df_unstack.reset_index(level=["drop"]) # un-index
df_unstack = df_unstack.drop("drop",axis=1) # removeResult:
A-1_B-1 A-1_B-2 A-2_B-1 ...
ID
A0001 96.0 5.0 18.0
A0002 505.0 5.0 182.0
A0003 275.0 5.0 156.0
A0004 245.0 5.0 72.0
A0005 377.0 5.0 196.0
Done!
Loop over rows #
Docs:
- pandas.DataFrame.iterrows — pandas 2.1.0 documentation
- pandas.Series.items — pandas 2.1.0 documentation
Refs:
- python - How to iterate over rows in a DataFrame in Pandas - Stack Overflow
- Efficiently iterating over rows in a Pandas DataFrame | by Maxime Labonne | Towards Data Science
row["new_col"]does not work within the loops.Doc:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
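Instead of assigning inside the loop, the usual workaround is to collect the values and assign the column afterwards, or to skip the loop with apply (a minimal sketch; the column names and calculation are made up). The iteration example itself follows below.
new_values = []
for ind, row in df.iterrows():
    new_values.append(row["column1"] * 2)  # made-up calculation
df["new_col"] = new_values
# or, without an explicit loop:
df["new_col"] = df.apply(lambda row: row["column1"] * 2, axis="columns")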
for ind, row in df.iterrows():
print("\033[1m" + "ind:" + "\033[0m" + "\n")
print(ind) # A0001
print("\n" + "\033[1m" + "row:" + "\033[0m" + "\n")
print(row) # pd.series
# print(row["column1"]) # value of df.loc["A0001"]["column1"]
print("\n" + "\033[1m" + "cells:" + "\033[0m" + "\n")
for cell in row.items():
print(cell[0]) # column name
print(cell[1]) # cell value
break
Loop over columns #
Method 1: enumerate
Ref: python - How to iterate over columns of a pandas dataframe - Stack Overflow
for i, column in enumerate(df):
print(i, column) # 0-index
print(column) # column name
print(df[column]) # values of the whole column
# 0, col_1; 1, col_2; ...
# completely the same:
# for i, column in enumerate(df.columns)
Method 2: items
Doc: pandas.DataFrame.items — pandas 2.2.1 documentation
for label, content in df.items():
print(label) # column name
print(content) # values of the whole column
break
Method 3: apply
df.apply(lambda x: print(x.name), axis="index")
df.apply(func, axis="index")Loop over list of DataFrames #
Ref: python - Pandas Loop through list of Data Frames and Change Index - Stack Overflow
Things will fail or stop working after the first iteration if you use for df in list. Write this instead:
for i, df in enumerate(df_list):
# ...
Rename things #
Index columns:
df.index = df.index.set_names(["ID"])Normal columns:
df = df.rename(columns={"A": "a", "B": "c"})
# or (no quotation marks)
df = df.rename(columns=dict(A="a", B="c"))
# or just replace everything
df.columns = list_of_column_names
By column index (for None columns that are otherwise helpless): (ref)
df.columns.values[0] = "Safe"Batch rename columns:
# with string operation
df = df.rename(columns=lambda x: x.replace("_FINAL", ""))
df = df.rename(columns=lambda x: x+"_PERC")
# with regex
# https://stackoverflow.com/a/26500249/10668706
import re
df = df.rename(columns=lambda x: re.sub(" $", "", x))
Custom regex rename function: (inspired by ref, also see Python #Match case (Switch))
import re
class RegexEqual(str):
def __eq__(self, pattern):
return bool(re.match(pattern, self))
def rename_fn(col):
match RegexEqual(col):
case "[A-z\s]*Date":
return "Date"
case "Previous Rating": # Checks something more specific
return "Previous"
case "[A-z\s]*Rating": # Then a fuzzy check
return "Rating"
return col # otherwise could get None columns
Find & replace in all cells #
Credit: python - How to replace a value in pandas, with NaN? - Stack Overflow
df["rating"] = df["rating"].replace("No ratings", np.NaN)Using dictionary: (credit)
replace = {"No ratings": np.NaN}
df["rating"].map(replace)
# Or (slower, but also works in a loop)
df["rating"].replace(replace, inplace=True)Sort things #
Rows by values in any combination of columns or indices:
df = df.sort_values(["ID","level"])
# Use different sorting methods
df = df.sort_values(["ID","level"], ascending=[True, False])Rows by index values:
df = df.sort_index()
Columns by column names: (credit)
df = df.reindex(sorted(df.columns), axis=1)