Pandas #
Most important notes #
axis: {0 or ‘index’ (rows), 1 or ‘columns’}, default 0
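A throwaway example of what that means in practice (sums along each axis):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.sum(axis=0)  # collapse along rows -> one sum per column: a=3, b=7
df.sum(axis=1)  # collapse along columns -> one sum per row: 0=4, 1=6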
(Maybe) Useful resources #
Import Data #
Dict/JSON #
Doc: pandas.DataFrame.from_dict — pandas 1.3.5 documentation
For dict like {"id1": data1, "id2": data2, ...}
:
import pandas as pd
df = pd.DataFrame.from_dict(
    data_dict,
    orient='index',
    columns=["data"],  # otherwise very troublesome to rename later
)
df.index = df.index.set_names(["ID"])
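For instance, with a hypothetical dict of scalars, the keys become the index:
data_dict = {"id1": 0.5, "id2": 0.7}  # hypothetical data
df = pd.DataFrame.from_dict(data_dict, orient='index', columns=["data"])
#      data
# id1   0.5
# id2   0.7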
For dict like: {"row1": [...], "row2": [...], ...}
, and JSON files:
df = pd.DataFrame.from_dict(
    data_json
).transpose()
Excel #
Doc: pandas.read_excel — pandas 2.0.0 documentation
Extra dependency: openpyxl
Read sheet:
import os
import pandas as pd
df = pd.read_excel(
    os.path.join("data", "data.xlsx"),  # read ./data/data.xlsx
    sheet_name="Sheet 1",
    skiprows=1,     # if there are pre-data rows
    header=[0, 1],  # if there are multiple (combined) header rows
    nrows=100,      # only read first 100 rows
)
Read names of all sheets: (credit)
import os
import pandas as pd
xl = pd.ExcelFile(os.path.join("data", "data.xlsx"))
xl.sheet_names  # list of sheet names
df = xl.parse("Sheet 1")  # if a specific sheet is needed
Read all sheets: (credit)
This is significantly slower than the pd.ExcelFile method above. Only use it when you actually need every single sheet.
import pandas as pd
dfs = pd.read_excel(file_name, sheet_name=None)
dfs.keys() # similar result as xl.sheet_names
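The result is a dict of DataFrames keyed by sheet name:
df = dfs["Sheet 1"]  # pick one sheet's DataFrame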
CSV/TSV #
Pandas #
Doc: pandas.read_csv — pandas 2.0.0 documentation
import os
import pandas as pd
names_list = ["col 1", "col 2", "col 3"]
df = pd.read_csv(
    os.path.join("data", "data.csv"),  # read ./data/data.csv
    # sep='\t',       # if tsv
    # quotechar="'",  # if you see extra quotation marks after import
    header=None,
    index_col=[0, 1],  # only numbers, because there are no names yet
    names=names_list,
    dtype={"Phone number": "str"},  # read columns as the given type
    nrows=100,         # only read first 100 rows
)
CSV module #
See Python #CSV.
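For quick reference, a minimal csv-module sketch (the linked note covers more options):
import csv
import os
with open(os.path.join("data", "data.csv"), newline="") as f:
    for row in csv.reader(f):  # or csv.DictReader(f) for dict rows
        print(row)             # each row is a list of strings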
ODS #
pandas-ods-reader #
Extra dependencies: pandas-ods-reader, ezodf, and lxml
It supports skiprows, but not nrows.
import os
import pandas as pd
from pandas_ods_reader import read_ods
df = read_ods(
    os.path.join("data", "data.ods"),  # read ./data/data.ods
    2,                # get 2nd sheet (1-based), default 1
    # or: "sheet 1",  # get sheet by name
)
pd.read_excel #
Extra dependency: odfpy
Since odfpy does not have a row-selection method when loading files, the nrows option is applied only after the whole file has been loaded into memory, unlike with other table formats. This causes long loading times when the file is huge. See pandas-dev/pandas Issue #53185.
Slightly faster (and more robust) method:
import os
import pandas as pd
ods = pd.ExcelFile(os.path.join("data", "data.ods"))
ods.sheet_names  # get all sheet names
df = ods.parse("sheet 1")
Documented method:
import os
import pandas as pd
df = pd.read_excel(
    os.path.join("data", "data.ods"),  # read ./data/data.ods
)
Export DataFrame #
As Excel file #
df.to_excel(os.path.join("data", "data.xlsx"))
# or if the df is too big
df.iloc[:1000, :1000].to_excel(os.path.join("data", "data.xlsx"))
As plain text file #
Ref: Python, Pandas : write content of DataFrame into text File - Stack Overflow
# method 1
df.to_csv(os.path.join("data", "data.txt"))
# method 2
with open(os.path.join("data", "data.txt"), "w") as f:
    f.write(df.to_string())
Debugging #
Import #
If you get UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position xx: invalid start byte, try: (credit)
df = pd.read_csv(
    os.path.join("data", "data.csv"),  # read ./data/data.csv
    encoding="cp1252",
)
Calculation #
If you get TypeError: unhashable type: 'numpy.ndarray', it means one (or more) of the cells contains a list value (instead of a single value).
Get a subset of all rows containing list-value cells:
df_subset = df[df["Average"].apply(lambda x: isinstance(x, np.ndarray))]
Index #
If you get ValueError: Index contains duplicate entries, cannot reshape, it means there are duplicated index sets (rows where all index levels are identical).
First check that those rows contain the same data, then drop the rows with duplicate indices, as sketched below.
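A minimal check-then-drop sketch (same pattern as in Drop things below):
# inspect all rows whose full index is duplicated
df[df.index.duplicated(keep=False)]
# if they hold the same data, keep only the first occurrence
df = df[~df.index.duplicated(keep='first')]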
Observe things #
Calculate bivariate correlation coefficients and p-values matrix #
Ref: python - pandas columns correlation with statistical significance - Stack Overflow
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
# Correlation coefficients (the two lines below are equivalent):
corr = df.corr()
corr = df.corr(method=lambda x, y: pearsonr(x, y)[0])
# p-values:
p_val = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
# p_val = p_val.round(decimals=4)
Calculate mode by any level of MultiIndex #
Ref: python - GroupBy pandas DataFrame and select most common value - Stack Overflow
So far I haven't found any way to merge/join/concat this mode df back into the original df, except by pivoting the original df first.
mode could return a list of values (when there are ties). Make sure to check the type of your modes before proceeding to other tasks; see the sketch after the example below.
# mode of each ID
df.groupby(["ID"]).agg(lambda x: x.mode())
# mode of each ID x each date
df.groupby(["ID","Date"]).agg(lambda x: x.mode())
Calculate sum of all columns or rows #
# sum of all rows, by column
df.sum(axis=0)
# sum of all columns, by row
df.sum(axis=1)
Count length of levels of MultiIndex #
Ref: python - Hierarhical Multi-index counts in Pandas - Stack Overflow
df.groupby(level=["ID"]).size()
df.groupby(level=["ID","Date"]).size()
# example output:
# ID
# 01 240
# 02 240
# 03 240
# ...
# 57 211
# 59 211
# 60 211
# Length: 60, dtype: int64
Display full width of a DataFrame #
Ref:
- python - How can I display full (non-truncated) dataframe information in HTML when converting from Pandas dataframe to HTML? - Stack Overflow
- Python, Pandas : write content of DataFrame into text File - Stack Overflow
# method 1
with pd.option_context('display.max_colwidth', None):
    display(df)
# method 2
print(df.to_string())
List all unique values of a column #
Ref: python - print the unique values in every column in a pandas dataframe - Stack Overflow
print(df["Status Code"].unique())
List rows with duplicated values of a column #
Ref: How do I get a list of all the duplicate items using pandas in python? - Stack Overflow
df[df.duplicated(["ID"], keep=False)].sort_values("ID")
List rows with duplicated values of indices #
df[df.index.duplicated(keep='first')]
Select (filter) things #
Select columns by name #
# double brackets return a DataFrame (single brackets return a Series)
df_subset = df[['col1']]
df_subset = df[['col1', 'col2', ...]]
Select columns with RegEx #
Credit: python - How to select columns from dataframe by regex - Stack Overflow
rating_cols = df.filter(regex=("rating_*")).columns
# then when need to use
df[rating_cols] # returns a small df
df[rating_cols]["rating_2021_1"] # returns a series
Select rows by column value #
# only keep rows with certain value for a col
df_subset = df[df["Status Code"]=="A"]
# only keep rows without certain value for a col
df_subset = df[df["Status Code"]!="B"]
# only keep rows whose value starts with a certain string
# https://stackoverflow.com/a/17958424/10668706
df_subset = df[df["Name"].str.startswith("One d", na=False)]
# especially, for nan values:
df_subset = df[df["Status Code"].isna()]
df_subset = df[df["Status Code"].notna()]
When the values of some cells are lists, pandas will raise ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). In that case, try this method instead:
Credit: python - How do I select rows from a DataFrame based on column values? - Stack Overflow
# only keep rows with certain value for a col
df_subset = df.query('`Status Code` == "A"')
# only keep rows without certain value for a col
df_subset = df.query('`Status Code` != "B"')
Also see: (not tested)
- python - Pandas filter data frame rows by function - Stack Overflow
- python - Pandas filtering not working - Stack Overflow
Select rows by cell type #
Credit: python - Select row from a DataFrame based on the type of the object(i.e. str) - Stack Overflow
# only keep rows with certain types of value
df_subset = df[df["Rating"].apply(lambda x: isinstance(x, float))]
# or the opposite
df_subset = df[df["Rating"].apply(lambda x: not isinstance(x, float))]
Select rows by index #
Doc: pandas.DataFrame.loc — pandas 2.1.0 documentation
df.loc["A0001"]
Select (mask) triangle of DataFrame #
Ref: python - Melt the Upper Triangular Matrix of a Pandas Dataframe - Stack Overflow
import numpy as np
corr = df.corr()
# mask from np.ones, so zero coefficients are not dropped by accident;
# k=1 for dropping the diagonal, k=0 for keeping it
keep = np.triu(np.ones(corr.shape), k=1).astype('bool').reshape(corr.size)
# or for the lower triangle: k=-1 for dropping the diagonal, k=0 for keeping it
keep = np.tril(np.ones(corr.shape), k=-1).astype('bool').reshape(corr.size)
corr_flat = corr.stack()[keep].reset_index()
corr_flat.columns = ["x1", "x2", "Coef"]
After this, you can do sns.kdeplot(data=corr_flat, x="Coef") to see the kernel density of all the coefficients.
Groupby #
Groupby columns #
Use case: average across multiple columns with similar names, producing one mean column per group.
Step 0: Consider list of groups
groups = ["AAA", "BBB", ...]
len(groups)
# 18
Step 1: Write groupby function
def groupby_fn(col):
    groups = ["AAA", "BBB", ...]
    col_group = col.replace("Rating_", "").split("0")[0]
    if col_group in groups:
        return col_group  # mark col as col_group
    else:
        return "Others"   # mark col as Others
Step 2: Get columns to group by (optional)
cols = df.filter(regex=("Rating_*")).columns
Step 3: Groupby
df_grouped = df.groupby(groupby_fn, axis=1).mean()
# Or if used Step 2:
df_grouped = df[cols].groupby(groupby_fn, axis=1).mean()
# Check no col was marked as Others, otherwise check groupby function
len(df_grouped.columns)
# 18
Column-level operations #
Convert type of columns #
Inspect column types:
df.info()
To numeric:
# float
df["Rating"] = df["Rating"].astype('float')
# auto
df["Rating"] = pd.to_numeric(df["Rating"])
To string:
df = df.astype("string")
df = df.astype({"col1": "string", "col2": "float", ... }) # need quotes
df = df.astype(dict(col1="string", col2="float", ...)) # no quotes
To category:
from pandas.api.types import CategoricalDtype
rate_type = CategoricalDtype(
    categories=["No rating", "Bad", "Good", "Outstanding"],
    ordered=True,
)
df["Rating"] = df["Rating"].astype(rate_type)
With list comprehensions:
# datetime to string
df["Date"] = [ cell.strftime("%Y") for cell in df["Date"] ]
df["Location ODS Code"] = [str(val) for val in df["Location ODS Code"]]
df["Service / Population Group"] = [str(val) for val in df["Service / Population Group"]]
df["Domain"] = [str(val) for val in df["Domain"]]
Combine multiple columns #
Refs:
- python - Combine two columns of text in pandas dataframe - Stack Overflow
- python - pandas combine two strings ignore nan values - Stack Overflow
data["Address"] = data["Address Line 1"].fillna("")
for col in ["Address Line 2", "Address Line 3", "Address Line 4", "Address Line 5", "Postcode"]:
data["Address"] = data["Address"].astype(str) + " " + data[col].fillna("").astype(str)
Reorder columns #
Ref: python - How to change the order of DataFrame columns? - Stack Overflow
# Move the last column to the front
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
Index-level operations #
Unset indices:
# return all indices back to normal columns
df = df.reset_index()
# return only selected indices
df = df.reset_index(level=["Status", "Level", ...])
Set new indices:
df = df.set_index(["ID", ...])
DataFrame-level operations #
Combine two DataFrames #
With same columns (stack):
df_AB = pd.concat([df_A, df_B])
With same indices (join):
Some people online say that you can join a single-indexed DataFrame with a MultiIndex DataFrame using this command. That is not correct (or at least not stable): when the single shared index has values that differ between the two frames, i.e. outer does not equal inner, the join raises an error. To avoid this, first use commands like df_multi_index.index.set_names and df_multi_index.reset_index(level=[<all non-shared indices>]) to make the MultiIndex DataFrame single-indexed, then perform the join, as sketched below.
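A sketch of that flattening workaround (names here are hypothetical: df_multi has MultiIndex ["ID", "Date"], df_single is indexed by "ID" only):
df_multi.index = df_multi.index.set_names(["ID", "Date"])  # name the levels if needed
df_multi = df_multi.reset_index(level=["Date"])  # keep only the shared "ID" index
df_AB = df_multi.join(df_single, how="outer")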
Always join on the index. Do not use on="col" when performing a join; when the column dtypes differ between the two frames, pandas raises ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat.
If you do not want to modify the original df, use df_AB = df_A.set_index("col").join(df_B.set_index("col")).
df_AB = df_A.join(
    df_B,
    how="outer",  # keep all different index values
    lsuffix="A", rsuffix="B",  # if you get ValueError: columns overlap
)
If you select only one column of the left df with single brackets, it becomes a Series and the join fails with AttributeError: 'Series' object has no attribute 'join'. In this case, use double brackets:
df_AB = df_A[["Rating"]].join(
df_B,
how="inner",
)
Drop things #
Rows with duplicate indices:
Doc: Duplicate Labels — pandas 2.1.0 documentation
Credit: python - Remove pandas rows with duplicate indices - Stack Overflow
df.index.is_unique
# False
# Look at content of all duplicated index
df[df.index.duplicated()]
# Drop all rows with duplicated index but the first instance
df = df[~df.index.duplicated(keep='first')]
Columns:
df = df.drop(["drop", ...], axis=1)
Long to wide #
Reverse of pandas.wide_to_long — pandas 2.0.2 documentation.
For example, say you have a very long MultiIndex DataFrame like this (asterisk means index column):
                      VALUE
ID      CODE  LEVEL
*A0001* *A-1* *B-1*    96.0
              *B-2*     5.0
        *A-2* *B-1*    18.0
              *B-2*    18.0
              *B-3*     1.0
        *A-3* ...
...
But what you really want is one ID per row, with everything else in that row.
Step 1:
df_unstack = df.stack().unstack(level=['CODE', 'LEVEL'])
This has almost the same effect as the following (but the pivot version returns only one index level):
df_unstack = pd.pivot(
    df.reset_index(),
    index="ID",
    columns=["CODE", "LEVEL"],
    values="VALUE",
)
Refer to the Index debugging section above when in trouble.
Result:
CODE              A-1           A-2          ...
LEVEL             B-1    B-2    B-1    B-3   ...
ID      (unnamed)
*A0001* *VALUE*    96.0    5.0   18.0   18.0
*A0002* *VALUE*   505.0    5.0  182.0  191.0
*A0003* *VALUE*   275.0    5.0  156.0  159.0
*A0004* *VALUE*   245.0    5.0   72.0   72.0
*A0005* *VALUE*   377.0    5.0  196.0  196.0
Step 2:
df_unstack.columns = ['_'.join(col) for col in df_unstack.columns]
If you want to do this to the index, the same code also works:
df.index = ['_'.join(ind) for ind in df.index]
But make sure all values in the index are strings. If they are not, convert first:
df = df.reset_index()
df["Level"] = [str(val) for val in df["Level"]]
# or if values are timestamps
df["Date"] = [val.strftime("%Y-%m-%d") for val in df["Date"]]
Result:
                  A-1_B-1  A-1_B-2  A-2_B-1  ...
ID      (unnamed)
*A0001* *VALUE*      96.0      5.0     18.0
*A0002* *VALUE*     505.0      5.0    182.0
*A0003* *VALUE*     275.0      5.0    156.0
*A0004* *VALUE*     245.0      5.0     72.0
*A0005* *VALUE*     377.0      5.0    196.0
Step 3:
If you have used pd.pivot, skip this step.
df_unstack.index = df_unstack.index.set_names(["ID", "drop"])  # give the second index a name
df_unstack = df_unstack.reset_index(level=["drop"])  # un-index it
df_unstack = df_unstack.drop("drop", axis=1)  # remove the column
Result:
       A-1_B-1  A-1_B-2  A-2_B-1  ...
ID
A0001     96.0      5.0     18.0
A0002    505.0      5.0    182.0
A0003    275.0      5.0    156.0
A0004    245.0      5.0     72.0
A0005    377.0      5.0    196.0
Done!
Loop over rows #
Ref: python - How to iterate over rows in a DataFrame in Pandas - Stack Overflow
Docs:
- pandas.DataFrame.iterrows — pandas 2.1.0 documentation
- pandas.Series.items — pandas 2.1.0 documentation
row["new_col"]
does not work within the loops.Doc:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
for ind, row in qof_perc.iterrows():
    print("\033[1m" + "ind:" + "\033[0m" + "\n")
    print(ind)   # A0001
    print("\n" + "\033[1m" + "row:" + "\033[0m" + "\n")
    print(row)   # pd.Series
    # print(row["column1"])  # value of df.loc["A0001"]["column1"]
    print("\n" + "\033[1m" + "cells:" + "\033[0m" + "\n")
    for cell in row.items():
        print(cell[0])  # column name
        print(cell[1])  # cell value
    break
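If the goal is a new column, assign it in one go instead of writing inside the loop; a minimal sketch, with hypothetical column names and a hypothetical compute() helper:
# vectorized assignment: preferred over writing row["new_col"] inside iterrows()
df["new_col"] = df["column1"] * 2
# or collect the values first, then assign once
df["new_col"] = [compute(row) for _, row in df.iterrows()]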
Rename things #
Index column:
df.index = df.index.set_names(["ID"])
Normal column:
df = df.rename(columns={"A": "a", "B": "c"})
# or (no quotation marks)
df = df.rename(columns=dict(A="a", B="c"))
Batch rename columns:
# with string operation
df = df.rename(columns=lambda x: x.replace("_FINAL", ""))
df = df.rename(columns=lambda x: x+"_PERC")
# with regex
# https://stackoverflow.com/a/26500249/10668706
import re
df = df.rename(columns=lambda x: re.sub(" $", "", x))
Find & replace in all cells #
Credit: python - How to replace a value in pandas, with NaN? - Stack Overflow
df["rating"] = df["rating"].replace("No ratings", np.NaN)
Sort things #
Rows by values in columns:
df = df.sort_values(["ID","level"])
Rows by index values:
df = df.sort_index()
Columns by column names: (credit)
df = df.reindex(sorted(df.columns), axis=1)