STATA Guide - Reading Data and Data Management
This guide is suitable for new Stata users or advanced level Stata users looking for information on specific topics. It has been created using Stata version 11.1. The data used in most of the sections is a modified version of the 2014 Toronto Municipal Political Poll. The two other data sets used are fabricated data sets created for the purpose of this guide.
TABLE OF CONTENTS
clear command to remove the data set that is currently in memory in order to use a new data set
CTRL-R to recall submitted commands
help( [enter command here]) returns the help page for the entered command
Setting and exploring directory
Pwd gives the current directory. Use cd to change the directory to the location of your data sets. Dir and ls both show the content of the current directory.
pwd cd "[enter path here]" dir ls
Reading CSV file
Use the insheet command to read a csv (comma-separated values) file. The variable window shows the list of variables, their label and their data type (i.e. byte or float for numeric and string for character variables). Describe command gives us a list of the variables and the number of observations. To view/print the values of all or specific observations, use list.
insheet using ont14.csv describe list in 1
To save a file as a Stata data file (.dta), save it using the File and Save As menu. To use a Stata data file that is in the current directory, apply the use command.
clear use ont14
NOTE: This data set will be used in most sections of this guide.
Reading CSV file with no variable names
When the data set has no variable names, the data columns are named v1, v2, …, vn for n number of variables. We can assign our own variable names for this type of data set as follows.
clear insheet using noname.csv describe
clear insheet vote1 vote2 votenot fordapp toryheard torapp stinheard stinapp chowheard chowapp sokheard sokapp using noname.csv describe
Reading Fixed file
For fixed files, str# (where # is the number of spaces) needs to be included before every character or string type variable along with a spacing dictionary for all variables in the data set. Note that without this information, Stata returns an error message.
clear infix using http://data.library.utoronto.ca/dataut/workshop/ontfixed.fix infix str14 q1 1-14 str22 city 16-37 gender 39 str14 age 41-55 edu 56 income 58 weights 61-70 using [command continues ...] http://data.library.utoronto.ca/dataut/workshop/ontfixed.fix describe
Inputting Data using the Do-file editor
To enter data manually, use the do-file editor under the Window menu. In a new do-file editor, enter the variables using the input command followed by the actual data set in a fixed file format. Precede character variables with the str# option. End the data set with the end command. Enter any other commands you would like, save and execute the do-file. The following data set has 20 observations and 11 variables.
ENTER, SAVE AND EXECUTE
clear input str16 vote1 vote2 city gender age transport property education children income weights "Olivia Chow" 3 1 2 5 2 1 3 2 5 0.282550782 "John Tory" 2 1 2 5 1 1 3 2 6 0.282550782 "Olivia Chow" 3 1 2 5 2 2 3 2 2 0.282550782 "Olivia Chow" 3 1 2 5 5 1 2 2 2 0.282550782 "Olivia Chow" 3 1 2 5 2 1 4 2 7 0.282550782 "Olivia Chow" 3 1 2 5 2 2 3 2 1 0.282550782 "John Tory" 2 1 2 5 5 1 1 2 3 0.282550782 "Olivia Chow" 3 1 2 5 5 1 2 2 4 0.282550782 "Olivia Chow" 3 1 2 5 5 2 2 2 2 0.282550782 "Olivia Chow" 3 1 2 5 2 2 2 2 2 0.282550782 "David Soknacki" 4 1 2 5 5 1 4 2 8 0.282550782 "John Tory" 2 1 2 5 1 1 4 2 6 0.282550782 "Rob Ford" 1 1 2 5 2 2 2 2 2 0.282550782 "Don't know" 4 1 2 5 4 2 4 2 2 0.282550782 "Olivia Chow" 3 1 2 5 1 2 4 2 5 0.282550782 "John Tory" 2 1 2 5 3 1 3 2 3 0.282550782 "Olivia Chow" 3 1 2 5 2 2 3 2 1 0.282550782 "Olivia Chow" 3 1 2 5 2 1 3 2 8 0.282550782 "John Tory" 2 1 2 5 2 2 2 2 3 0.282550782 "Olivia Chow" 3 1 2 5 2 1 3 2 5 0.282550782 END describe list in 1
For this section, we will use the ont14.dta Stata data set saved in the previous section. To view/print specific values of any observations we can use the list command with options for specific variables and observations.
clear use ont14 list list vote1-gender list vote1 city weights list in 1 list in 5/10
Before exploring any further, sometimes we want to save all the commands and output that will ensue in a text file. This can be done with the log command as follows. Here we called our text file unit2.
log using unit2.txt, text replace
To explore variables
Describe enumerates the variables and their type. Codebook gives a frequency table for character variables, basic statistics (mean, std dev etc) for numeric variables and the number of missing observations for both types of variables. Sum is an abbreviation for summarize. It gives basic statistics for numeric variables. It can be modified to give statistics for a subset of the data set. For example, we want the statistics on whether voters have heard of David Soknacki or not only for individuals who have not heard of Olivia Chow (i.e., chowheard==0). We can summarize the same information for voters who have not heard of John Tory (i.e., toryheard==0).
describe codebook summarize toryheard sum toryheard chowheard stinheard sokheard sum sokheard if chowheard==0 sum sokheard if toryheard==0
Tabstat displays a table of means for specified variables by default. Other summary statistics can be displayed if specified in the stat option. These statistics can also be further grouped by another variable using the by option. For example, here we can see the means and the frequencies of approval rates for each candidate within each category of approval of Rob Ford.
tabstat fordapp tabstat chowapp toryapp stinapp sokapp, by (fordapp) tabstat chowapp toryapp stinapp sokapp, by (fordapp) tabstat chowapp toryapp stinapp sokapp, by (fordapp) stat(mean n)
The tab command produces one-way or two-way frequency tables with row (row) and column percentages (col) as options. For example, we can obtain a two-way frequency table of the candidate the voters would vote for (vote1) by the candidate they would not vote for (votenot) along with the row and column percentages.
tab vote1 tab vote1 votenot tab vote1 votenot, row col
To make graphs
The commands histogram, kdensity, graph box, and scatter can be used to make the graphs as described by the command name. These graph commands have options to modify the graph title, color of graph, axis labels using the xtitle and ytitle commands, presence of normal curve etc. The images shown below are made using the last graph command for each graph type.
The histogram command has options bin number or width for continuous variables or discrete for discrete ordinal data. This is a histogram of the income variable which is a discrete ordinal variable in our case. This income variable has 8 levels where '8' denotes missing. Suppose we do not want to include the missing observations.
histogram income hist income if income!=8 hist income if income!=8, normal hist income if income!=8, normal width(0.4) hist income if income!=8, normal bin(15) hist income if income!=8, normal discrete title(Histogram of Income) xtitle(“Income”) color(ltblue) lcolor(white)
kdensity ageavg kdensity ageavg, normal kdensity income, normal
graph box income graph box income, over(fordapp) graph box income, over(fordapp)title(“Income boxplots for each approval category of Rob Ford”)
scatter income ageavg scatter income ageavg, jitter(15) scatter ageavg income, jitter(20) msymbol(d) mcolor(red) scatter ageavg income, jitter(20) msymbol(d) mcolor(red) xtitle("Income") ytitle("Average Age")
A correlation or covariance matrix can be obtained using the correlate command. The pwcorr command, not used here, also displays all pairwise correlation coefficients with more options for significance levels and handling missing values.
correlate income ageavg
Graph matrices can be produced using the graph matrix command.
graph matrix education ageavg income graph matrix education ageavg income, jitter(10)
To save the log file and view it
log close view unit2.txt
To change the order of variables use the order command. For example, we may want all the ford related variables to come right after the first 3 vote variables.
use ont14, clear order vote1 vote2 votenot fordapp voteford fordleave fordres votefordafterrehab
Labels can be created for the data set, a variable and the levels of a variable using define. Once the label for the levels of a variable has been created, it can be assigned to the variable using label values.
label data "2014 Ontario Municipal Elections" label variable fordapp "Approval of Rob Ford" label define fordappvalues 0 "No" 1 "Yes" label values fordapp fordappvalues codebook fordapp
Character variables such as vote1 can be recoded into a numeric variable using the encode command. By default the alphabetical order of the variable’s values is used as can be seen with the variable vote1num. For this reason, category "Don't know" is recoded as 2. If, for example, we want the “Don’t know” category to be the last category, the order can be altered by labeling each value of vote1 with a number as can be seen with the variable vote1numcorrect.
encode vote1, generate(vote1num) label variable vote1num "Vote between 5 Candidates" codebook vote1num
label define vote1numvalues 1 "David Soknacki" 2 "John Tory" 3 "Karen Stintz" 4 "Olivia Chow" 5 "Rob Ford" 6 "Don't know" encode vote1, generate(vote1numcorrect) label(vote1numvalues) codebook vote1numcorrect
codebook voteford label define votefordvalues 0 "No" 1 "Yes" 2 "Don't know" encode voteford, generate(votefordnum) label(votefordvalues) recode votefordnum (2=.) codebook votefordnum
To recode a categorical variable, use the recode command. For example, a copy of the gender variable, gendercopy (1= Male, 2=Female) is recoded and renamed female (1=Female, 0=Male).
generate gendercopy=gender rename gendercopy female recode female(1=0) (2=1) label define femalevalues 1 female 0 male label values female femalevalues codebook female
To drop variables use the drop command. To drop labels use the label drop command.
drop voteford label list label drop femalevalues codebook female
To generate a new variable that is a function of other variables, use the generate command. Below, the variable totalapp (total approval) was created as an average of the candidate approval rates and recoded into a categorical variable with 3 levels. The command egen is used to generate new variables using special functions. For example, the std() function generates standardized values and the rowmean() function generates row means.
generate totalapp = (fordapp + chowapp + toryapp + stinapp + sokapp)/5 summarize totalapp recode totalapp (0/0.33=0 Strict) (0.331/0.66=1 Average) (0.661/1=2 Lenient), generate (voteptmp) decode voteptmp, generate(votep) label variable votep "Voting Personality" codebook voteptmp votep
egen ztotalapp = std(totalapp) summarize ztotalapp list totalapp ztotalapp in 1/5 egen meantotalapp = mean(totalapp), by(income) tab meantotalapp education tab meantotalapp income egen activity = rowmean(fordapp chowapp toryapp stinapp sokapp toryheard chowheard stinheard sokheard) tab activity income
Notes can be attached to the data set or specific variables under memory. These become part of the data set and can be retrieved whenever the data is opened. They can be helpful for future reference or reminders. Notes can be created or found using the notes command.
notes female: the variable gender was renamed to female notes income: values of income coded 8 were recoded to be missing notes: Missing data points denote incomplete information and response "Don't know" notes
Creating ID variable using generate and _n
use ont14, clear generate voterid=_n save ont14, replace
We want to subset, for example, the female voters with the highest income on one hand and the male voters with the lowest education on the other hand. These data sets are saved as subset1 and subset2 respectively.
use ont14, clear keep if income==7 & gender==2 save subset1, replace
use ont14, clear keep if education==4 & gender==1 save subset2, replace
Use the append command to subset
The append command stacks data sets to append therefore the number and order of variables between the two data sets must match in order for the append to be effective.
use subset1 append using subset2 save alldata, replace
Keeping and dropping variables
use ont14, clear keep voterid fordapp voteford fordleave fordres votefordafterrehab save forddata, replace list in 1/5
use ont14, clear drop vote1 - votefordafterrehab save votehist, replace list in 1/5
To merge data sets, sort each one by the merging variable first then use the merge command.
use forddata, clear sort voterid save forddata, replace use votehist, clear sort voterid save votehist, replace use forddata, clear merge voterid using votehist
save votehist, replace
The Stata built-in variables _n and _N are used to count observations for the purpose of generating IDs, finding the lowest and highest scores for each group of observations, finding duplicates etc. The variable _n enumerates all the observations in a specific group whereas _N gives the total number of observations in a specific group.
use ont14, clear generate id = _n generate nsize = _N list id nsize vote1 vote2 votenot in 1/10
Finding lowest and highest score in each group
Here we want to find the voters with the lowest and highest education level for each city by age group. First, we sort by all three variables city, age and education from lowest to highest. In this case, we also sort by id because there are some, city by age, groups with more than one individual in the same education level. Then, we count by education for each city by age group. Listing the observations where n1 equals 1 gives us the lowest education levels and n1 equals n2 gives us the highest education level.
sort city age education id by city age : generate n1 = _n by city age : generate n2 = _N list vote1 vote2 votenot city age education n1 n2 if n1==1 in 1/100
list vote1 vote2 votenot city age education n1 n2 if n1==n2 in 1/100
Checking for duplicates
The two methods below will print out rows of observations that are duplicates.
use ont14, clear sort _all by _all: generate n = _N list if n > 1
use ont14, clear bysort * : generate n = _N list if n > 1
The following data consists of two dates: one in string format and another one is numeric.
use http://data.library.utoronto.ca/datapub/workshop/dates, clear list
To convert string type or numeric dates into elapsed dates or number of days since January 1, 1960 use the date and mdy functions respectively.
generate birthday = date(bdate, "MDY") generate event = mdy(dmon, dday, dyear) drop bdate - dyear list
Note that the third birthday is missing for the variable birthday because the year is not in four digits. This can be fixed manually by replacing the value of birthday for the 3rd observation using the replace command.
replace birthday = mdy(1, 21, 1972) in 3
The format command can be used to control the display of elapsed dates.
format birthday event %dD_m_CY list
The number of days elapsed can be converted into weeks, months and years as follows.
generate diff = event - birthday generate weeks = diff/7 generate months = diff/30.5 generate years = diff/365.25 list
To reshape a data set between wide and long format, use the reshape long or reshape wide commands. For a data set in wide format, the variables that denote the repetition should be named with the variable name followed by a number (i.e. variablename#). For example, in our data set we have drug1, drug2 and drug3. The numbers do not need to be consecutive.
clear use http://data.library.utoronto.ca/datapub/workshop/wide list
describe list reshape long drug, i(id) j(drugtype) list
table drugtype, contents(n drug mean drug sd drug)
author Nadia Muhe, Map and Data Library, University of Toronto.