Stata #

Hugo’s syntax highlighter, Chroma, does not recognise Stata syntax at the time of writing. For readability, all Stata code blocks in this section are marked as JavaScript; they are in fact Stata scripts.

(Maybe) Useful resources #

Basic syntax #

Help function #

help <command>:

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
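
To make the syntax diagram less abstract, here is a small sketch that fills in each slot using the built-in auto dataset. The variable name price_k is made up for this illustration. Note that the in qualifier cannot be combined with the by prefix, so it gets its own line.

```stata
sysuse auto, clear

// [by varlist:] command [varlist] [if exp] [, options]
bysort foreign: summarize price mpg if price > 5000, detail

// command [varlist] [in range]
list make price in 1/5

// command newvar [=exp]  (price_k is a hypothetical name)
generate price_k = price / 1000

// command [varlist] [weight]
summarize mpg [aweight=weight]
```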

Operators #

help operator:

Arithmetic                 Logical     Relational (numeric and string)
+  addition                &  and      >   greater than
-  subtraction             |  or       <   less than
*  multiplication          !  not      >=  > or equal
/  division                ~  not      <=  < or equal
^  power                               ==  equal
-  negation                            !=  not equal
+  string concatenation                ~=  not equal

A double equal sign (==) is used for equality testing.

The order of evaluation (from first to last) of all operators is ! (or ~), ^, - (negation), /, *, - (subtraction), +, != (or ~=), >, <, <=, >=, ==, &, and |.
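
Two consequences of this ordering are easy to trip over: the power operator binds tighter than negation, and & binds tighter than |. A quick sketch to check this interactively (parentheses override the default order):

```stata
// ^ is evaluated before negation, so this is -(2^2)
display -2^2
// parentheses force the negation first
display (-2)^2

// & is evaluated before |, so this is 1 | (0 & 0)
display 1 | 0 & 0
// parentheses force the | first
display (1 | 0) & 0
```

When in doubt, adding parentheses is cheaper than remembering the table.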

Quotes #

".."
String

Basic commands #

help #

Usage:

help <command>
Show documentation of <command>, like ? in R
help help
help contents
Show Stata quickstart guide
help resources
Show a list of resources to learn Stata
The Stata Journal

pwd & cd #

Print working directory, change directory. Just as expected.

doedit & do #

doedit
Brings up the editor
do <filename>
Execute the do file <filename>.do

Logging #

// start logging with file name log1
// replacing any previously existing log1
log using log1, replace

// some commands here...

// stop logging
// with no name given, this closes the unnamed (default) log;
// use `log close _all` to close every open log
log close

// view log file inside Stata
view log1.smcl

// continue logging
log using log1, append

// save log file as plain text file
translate log1.smcl log1.log
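
If the plain-text version is all that is needed, the translate step can probably be skipped by logging as text from the start. The sketch below assumes a hypothetical file name session.log and a hypothetical log name main; it could sit at the top of a do file.

```stata
// close any logs left open by an earlier run, ignoring errors
capture log close _all

// write a plain-text log directly, under the name "main"
log using session.log, text replace name(main)

sysuse auto, clear
summarize price

// close only the log named "main"
log close main
```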

Data #

Import & export data #

Clear all previous datasets and load one dataset (Stata format)

use "~/Documents/Stata/auto.dta", clear
// or
clear
use "~/Documents/Stata/auto.dta"

Use Stata example dataset

// see all the available datasets
sysuse dir

// use an example dataset
sysuse auto, clear

Use Excel dataset: File > Import > Excel spreadsheet

// the point here is to use the graphical interface once
// and then copy-paste the auto-generated command into do file
import excel "~/Documents/Stata/data.xlsx", sheet("Sheet1") firstrow clear

Save data as .dta file

// to current directory
save auto2

// to given path
// overwrite any existing file without asking
save "~/Documents/Stata/auto2", replace

Explore data #

View data

sysuse auto, clear

// read-only data viewer
browse
browse var1 var2 ...

// show data in command results window
list

// editable data viewer
// note: no undo!!
edit
edit var1 var2 ...

Summarise data

sysuse auto, clear

// get a list of variable names, types, labels
describe
des
d

// get a descriptive statistics table
summarize
su

// get filtered results
summarize price if foreign == 1
summarize price if foreign == 0

Summarise data with weights

sysuse census, clear

describe
summarize

// generic weight: summarize assumes analytic weights (aweight)
summarize medage [weight=pop]

// frequency weights
summarize medage [fweight=pop]

// scatter plot with size=pop
scatter divorce marriage [fweight=pop]

tabulate (tab) (counting each category / refined descriptive stats)

sysuse auto, clear

// one-way counting
tabulate rep78
tab rep78

// binary categoricals
tabulate foreign
tabulate foreign, nolabel // showing numbers instead of labels

// two-way counting
tabulate rep78 foreign
tabulate rep78 foreign, col row // with percentages

// n-way counting
// Statistics >> Summaries, tables, and tests >> Tables of frequencies
// two tables by foreign=0 or 1
table ( rep78 ) ( headroom ) ( foreign )
// single table with sub rows
table ( foreign rep78 ) ( headroom )

// tabstat
// Statistics >> Summaries, tables, and tests >> Other tables >> Compact table
// like a transposed `summarize` with custom stats
tabstat mpg rep78 headroom trunk, statistics( mean count sd )

// show stats of price by foreign
// in <=Stata 16, this used to be:
// table foreign, content (mean price count price)
tabulate foreign, summarize(price)

Inspecting single variable

sysuse auto, clear

// display an ASCII-rendered histogram
inspect price

// print percentiles and other stats
summarize price, detail

Missing values #

Stata treats a missing value (.) as larger than any number (effectively positive infinity), so it must be excluded explicitly in comparisons such as > or >=.

sysuse auto, clear

// list number of missing values in data
misstable summarize

// count treating missing value as a category
tabulate rep78, miss

// if not adding second condition,
// the missing values will be included
summarize price if (rep78 >= 5) & (rep78 != .)

// choose another number to be the missing value
mvencode *, mv(-99)
// return to .
mvdecode *, mv(-99)
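
A small sketch of the pitfall described above, using count on the auto dataset. The first line silently includes observations where rep78 is missing; the other two are common ways to exclude them.

```stata
sysuse auto, clear

// missing rep78 is "larger" than 5, so it is counted here
count if rep78 >= 5

// exclude missing values with the missing() function
count if rep78 >= 5 & !missing(rep78)

// equivalent: every missing value sorts at or above .
count if rep78 >= 5 & rep78 < .
```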

Declare variables #

Using gen (generate)

// with booleans (age, race, married come from the nlsw88 example dataset)
sysuse nlsw88, clear
gen sample = 0
replace sample = 1 if (age >= 40) & (age != .) & (race != 1) & (married == 1)

// with simple calculations
sysuse auto, clear
gen newprice = price/weight
gen price1000 = price+1000
gen logprice = log(price)

Using replace and recode

sysuse auto, clear

// modify existing variables
replace rep78=1 if rep78==2

recode rep78 (3=2) (4=3) (5=4)
recode mpg (10/19 = 1) (20/29 = 2) (30/99 = 3)

Using egen

sysuse auto, clear

// extended generate with functions
egen deciles = cut(price), group(10)
egen rowmean = rowmean(price mpg weight)

// quartiles, two ways: cut() numbers the groups from 0, xtile from 1
egen quantile1 = cut(price), group(4)
xtile quantile2 = price, n(4)

Generate dummy variables

sysuse auto, clear

// simple way
gen highprice = 1 if price > 6000
replace highprice = 0 if price <= 6000
// above is the same as
recode price (min/6000 = 0) (else = 1), gen(highprice2)

// categorical -> a set of dummies
tab rep78, gen(rep_dummy)
// this gives rep_dummy1, rep_dummy2, ..., rep_dummy5
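
For reference, the two-step gen/replace pattern above can be collapsed into one line, since a logical expression evaluates to 0 or 1. This is only a sketch; highprice3 is a made-up name, and the if qualifier keeps missing prices missing instead of coding them as 1.

```stata
sysuse auto, clear

// 1 if price > 6000, 0 otherwise, missing if price is missing
gen highprice3 = (price > 6000) if !missing(price)
tab highprice3

// in estimation commands, factor-variable notation avoids
// creating the dummies by hand
regress mpg i.rep78
```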

Rename & relabel variables #

Stata allows you to give multiple variables the same label.

However, if you want to use a Stata .dta file with, say, Python scripts, it is VERY important to have unique labels for each and every variable. I have had huge trouble dealing with this.

Stata is case sensitive.

sysuse auto, clear

// GUI variable manager
varmanage

// rename
rename rep78 repair
tab repair

// using wild cards
rename * *1978
rename *1978 *

// change variable label
label variable repair "repair categories"

// list all categorical labels
label dir

// define new categorical labels
label define repair_cat 1 "A" 2 "B" 3 "C" 4 "D" 5 "E"
label values repair repair_cat

tab repair           // A, B, C, D, E
tab repair, nolabel  // 1, 2, 3, 4, 5

Drop data #

sysuse auto, clear

// drop variables
drop price
keep *e*

// drop or keep certain rows
drop if foreign == 1
keep if rep78 >= 4 & rep78 != .

// keep a temporary snapshot of data in memory
preserve
drop if weight > 3000
// and restore the data
restore

Merge data #

Append (vertical merge: stacks observations)

// data_p1 and data_p2 contain different observations
// of the same group of variables
// * Missing data is allowed

// load data_p1
use data_p1, clear

// append
append using data_p2

// examine the merge results
list

Merge (horizontal merge: adds variables)

// data_p3 and data_p4 contain common ID variable
// with different other variables

use data_p3, clear

// merge <type: 1:1, m:1, 1:m> <key variable(s)> (may be string) using <data>
merge 1:1 make using data_p4

// Result                      Number of obs
// -----------------------------------------
// Not matched                             1
//     from master                         0  (_merge==1)
//     from using                          1  (_merge==2)

// Matched                                 5  (_merge==3)
// -----------------------------------------

// examine the merge results
tab _merge

//    Matching result from |
//                   merge |      Freq.     Percent        Cum.
// ------------------------+-----------------------------------
//          Using only (2) |          1       16.67       16.67
//             Matched (3) |          5       83.33      100.00
// ------------------------+-----------------------------------
//                   Total |          6      100.00
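
Since data_p3 and data_p4 are not shipped with Stata, here is a self-contained sketch that builds two comparable pieces from the auto dataset and merges them on make. The tempfile name p4 is made up, and the local macro only exists while the do file runs, so the lines should be executed together.

```stata
// "using" piece: make + mpg for the first 6 cars
sysuse auto, clear
keep make mpg
keep in 1/6
tempfile p4
save `p4'

// "master" piece: make + price for the first 5 cars
sysuse auto, clear
keep make price
keep in 1/5

// one row per make on both sides, hence 1:1
merge 1:1 make using `p4'
tab _merge

// keep matched rows only, then drop the bookkeeping variable
keep if _merge == 3
drop _merge
```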

Type conversion #

String -> number:

destring income, replace

destring price, replace
// price: contains nonnumeric characters; no replace

destring price, replace force
// price: contains nonnumeric characters; replaced as int
// (1 missing value generated)

Number -> string

tostring income price, replace

String -> categorical

encode geography, gen(geo2)

// list categories
tab geo2

// list categories as number not label
tab geo2, nolabel

Categorical -> string (using labels)

decode geo2, gen (string_geo)
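
When a string variable is numeric apart from formatting characters, destring's ignore() option is often a gentler alternative to force alone. A sketch on a tiny hand-typed dataset; price_str and price_num are made-up names.

```stata
clear
input str10 price_str
"1,200"
"3,450"
"n/a"
end

// strip the thousands separator; "n/a" is still nonnumeric,
// so force turns it into a missing value
destring price_str, generate(price_num) ignore(",") force

list
```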

Graph #

Twoway scatter plot #

sysuse auto, clear

// same x-axis, two variables
graph twoway (scatter price mpg) (scatter weight mpg)

// save as editable format
graph save graph1, replace

// open in graph editor
graph use graph1

With fitted lines & CI

sysuse auto, clear

scatter price mpg

// linear fit & ci
twoway (scatter price mpg) (lfit price mpg)
// `tw` is the abbreviation of `twoway`
twoway (lfitci price mpg) (scatter price mpg)

// quadratic fit & ci
twoway (scatter price mpg) (qfit price mpg)
twoway (qfitci price mpg) (scatter price mpg)

// local polynomial fit, change bandwidth, & ci
twoway (scatter price mpg) (lpoly price mpg)
twoway (scatter price mpg) (lpoly price mpg, bw(0.5))
twoway (lpolyci price mpg) (scatter price mpg)

Bar & dot chart #

sysuse auto, clear

// use the graphical editor to start

// bar chart: option 2
graph bar (count), over(rep78)
graph bar (count), over(rep78) over(foreign)

// bar chart: option 3
graph bar, over(rep78) over(foreign)

// bar chart: option 1
graph bar (mean) price (sd) price, over(rep78) over(foreign)

// dot chart: option 1
graph dot (mean) price (sd) price, over(rep78) over(foreign)

Distribution plots #

sysuse auto, clear

histogram price, bin(20)
kdensity price
twoway (histogram price, bin(20)) (kdensity price)

// quantile plot (against the uniform distribution; `qnorm` compares with the normal)
quantile price

// box plots
graph box price weight
graph box price weight, over(foreign) // over(group)

Statistics & tests #

Tests #

Categorical variables

sysuse auto, clear

tab rep78 foreign

// Chi-squared test w/ h0: no relationships between the two vars
tab rep78 foreign, chi2

tab rep78 foreign, chi2 expected // with expected value if h0 holds

// + Fisher's exact test
tab rep78 foreign, chi2 exact

// All available tests (excluding exact)
tab rep78 foreign, all

Continuous variables

sysuse auto, clear

su price

// univariate mean
ttest price == 2000

// univariate mean by two groups
ttest price, by(foreign)

// multivariate mean by two groups
hotelling price mpg weight, by(foreign)

// standard deviation
sdtest price == 3000

// proportion (mean of 0/1 variable)
prtest foreign == 0.5

More groups than two

sysuse nlsw88, clear

// univariate mean by two or more groups
oneway wage race, tabulate

// two-way ANOVA: mean of one outcome (wage) by two factors (union, race)
anova wage union race

Correlation #

sysuse auto, clear

// Pearson correlation coefficient
// (dropping obs with any missing value)
correlate price mpg weight
// covariance matrix
correlate price mpg weight, cov

// Pairwise correlation coefficient
// (maximising obs)
pwcorr price mpg weight
// test significance
pwcorr price mpg weight, sig

OLS #

sysuse auto, clear

regress price mpg weight length
// `reg` is the abbreviation of `regress`

// Categorical variables
// i.cat_var
// ib<reference_cat>.cat_var
regress price mpg weight length i.foreign ib5.rep78

// Logarithmic form
gen logprice = log(price)
regress logprice mpg length i.foreign

// Quadratic form / Interactions
// not including linear form
regress logprice c.mpg#c.mpg length i.foreign
// including linear form
regress logprice c.mpg##c.mpg length i.foreign

// Predictions & plotting
margins, at(mpg=(0 10 20 30 40 50))
marginsplot

Post estimation checks #

Applied to the last regression that was done.

For more: Statistics > Postestimation

predict xb, xb       // calculate X*b, save as variable xb
predict resid, resid // calculate residual, save as variable resid

summarize xb resid

// check ols assumption: error terms are normal
kdensity resid

// plot residual against fitted values
// should be around 0 w/o patterns
rvfplot

// plot outliers & influence (leverage)
lvr2plot

// check multicollinearity (common rule of thumb: a VIF above 10 is a warning sign)
estat vif

// check heteroskedasticity
// a small p-value rejects constant variance (homoskedasticity)
estat hettest

// check functional form
estat ovtest
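
If estat hettest does reject constant variance, one common response (not the only one) is to re-estimate with heteroskedasticity-robust standard errors. A sketch using the same regressors as the OLS section:

```stata
sysuse auto, clear

regress price mpg weight length
estat hettest

// same coefficients, robust (Huber/White) standard errors
regress price mpg weight length, vce(robust)
```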

Hypothesis testing #

sysuse auto, clear

regress price mpg length i.rep78

// H0: the coefficient of mpg is 100
test mpg = 100

// H0: the coefficients of mpg and length are equal
test mpg = length

// H0: all coefficients of rep78 are 0
test 2.rep78 3.rep78 4.rep78 5.rep78
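
Two related commands may save some typing here: testparm accepts factor-variable notation, so the levels of rep78 need not be listed one by one, and lincom reports the estimate and confidence interval of a linear combination instead of only a test. A sketch on the same regression:

```stata
sysuse auto, clear

regress price mpg length i.rep78

// joint test that all rep78 coefficients are 0
testparm i.rep78

// estimate, standard error, and CI of (coef of mpg) - (coef of length)
lincom mpg - length
```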

Making tables #

Document: estout - Making regression tables in Stata

sysuse auto, clear

ssc install estout, replace

reg price mpg
estimates store a1

reg price mpg length
estimates store a2

reg price mpg length turn
estimates store a3

// Stata's basic table
estimates table a1 a2 a3

// Using estout
esttab a1 a2 a3, se
esttab a1 a2 a3, b(2) se(2) // control number format

// Output to latex
esttab a1 a2 a3 using example.tex, b(2) se(2) label replace ///
    title(Regression table\label{tab1})

Project & file management #

  • .do files: execution file
  • .ado files: program (function)

Preferences #

Show review (command history) window to the left: (credit)

View > Layout > Widescreen

Increase main window scrollback buffer size: (credit)

General Preferences > Windows > Results > Scrollback buffer size: set to 2000 (maximum)

Setting preferences with command: set

3rd Party Plugins #

Cannot find plugin on Apple Silicon Mac #

According to this post and this post, not all plugins are available as native binaries. This should be solvable by ticking Open using Rosetta (in the app's Get Info panel in Finder) before launching Stata; macOS will prompt to install Rosetta.

Or, if the source code is available, download the code, and Stata will compile it automatically if no MACARM64 version is available.

net install <plugin>, from(https://raw.github.com/<user>/<repo>/main/) replace