STATA Guide - Reading Data and Data Management

STATA Guide

This guide is suitable for new Stata users or advanced level Stata users looking for information on specific topics. It has been created using Stata version 11.1. The data used in most of the sections is a modified version of the 2014 Toronto Municipal Political Poll. The two other data sets used are fabricated data sets created for the purpose of this guide.

 

TABLE OF CONTENTS

 

Entering Data

 

Quick Tips

clear command to remove the data set that is currently in memory in order to use a new data set
CTRL-R to recall submitted commands
help( [enter command here]) returns the help page for the entered command

 

Setting and exploring directory

Pwd gives the current directory. Use cd to change the directory to the location of your data sets. Dir and ls both show the content of the current directory.

pwd 
cd "[enter path here]" 
dir 
ls

    Reading CSV file

    DATA: ont14.csv

    Use the insheet command to read a csv (comma-separated values) file. The variable window shows the list of variables, their label and their data type (i.e. byte or float for numeric and string for character variables). Describe command gives us a list of the variables and the number of observations. To view/print the values of all or specific observations, use list.

    insheet using ont14.csv
    describe
    list in 1

      To save a file as a Stata data file (.dta), save it using the File and Save As menu. To use a Stata data file that is in the current directory, apply the use command.

      clear 
      use ont14
      

      NOTE: This data set will be used in most sections of this guide.

        Reading CSV file with no variable names

        DATA: noname.csv

        When the data set has no variable names, the data columns are named v1, v2, …, vn for n number of variables. We can assign our own variable names for this type of data set as follows.

        clear
        insheet using noname.csv
        describe

        clear
        insheet vote1 vote2 votenot fordapp toryheard torapp stinheard stinapp chowheard chowapp sokheard sokapp using noname.csv
        describe
        

        Reading Fixed file

        For fixed files, str# (where # is the number of spaces) needs to be included before every character or string type variable along with a spacing dictionary for all variables in the data set. Note that without this information, Stata returns an error message.

        clear
        infix using http://data.library.utoronto.ca/dataut/workshop/ontfixed.fix
        infix str14 q1 1-14 str22 city 16-37 gender 39 str14 age 41-55 edu 56 income 58 weights 61-70 using [command continues ...]
        http://data.library.utoronto.ca/dataut/workshop/ontfixed.fix
        describe
        

          Inputting Data using the Do-file editor

          To enter data manually, use the do-file editor under the Window menu. In a new do-file editor, enter the variables using the input command followed by the actual data set in a fixed file format. Precede character variables with the str# option. End the data set with the end command. Enter any other commands you would like, save and execute the do-file. The following data set has 20 observations and 11 variables.

          ENTER, SAVE AND EXECUTE

          clear
          input str16 vote1 vote2 city gender age transport property education children income weights
          "Olivia Chow"    3 1 2 5 2 1 3 2 5 0.282550782
          "John Tory"      2 1 2 5 1 1 3 2 6 0.282550782
          "Olivia Chow"    3 1 2 5 2 2 3 2 2 0.282550782
          "Olivia Chow"    3 1 2 5 5 1 2 2 2 0.282550782
          "Olivia Chow"    3 1 2 5 2 1 4 2 7 0.282550782
          "Olivia Chow"    3 1 2 5 2 2 3 2 1 0.282550782
          "John Tory"      2 1 2 5 5 1 1 2 3 0.282550782
          "Olivia Chow"    3 1 2 5 5 1 2 2 4 0.282550782
          "Olivia Chow"    3 1 2 5 5 2 2 2 2 0.282550782
          "Olivia Chow"    3 1 2 5 2 2 2 2 2 0.282550782
          "David Soknacki" 4 1 2 5 5 1 4 2 8 0.282550782
          "John Tory"      2 1 2 5 1 1 4 2 6 0.282550782
          "Rob Ford"       1 1 2 5 2 2 2 2 2 0.282550782
          "Don't know"     4 1 2 5 4 2 4 2 2 0.282550782
          "Olivia Chow"    3 1 2 5 1 2 4 2 5 0.282550782
          "John Tory"      2 1 2 5 3 1 3 2 3 0.282550782
          "Olivia Chow"    3 1 2 5 2 2 3 2 1 0.282550782
          "Olivia Chow"    3 1 2 5 2 1 3 2 8 0.282550782
          "John Tory"      2 1 2 5 2 2 2 2 3 0.282550782
          "Olivia Chow"    3 1 2 5 2 1 3 2 5 0.282550782
          END
          describe
          list in 1
          

          Main Menu

            

           

          Exploring Data

           

          DATA: ont14.dta

          For this section, we will use the ont14.dta Stata data set saved in the previous section. To view/print specific values of any observations we can use the list command with options for specific variables and observations.

          clear
          use ont14
          
          list 
          list vote1-gender 
          list vote1 city weights
          list in 1
          list in 5/10
          

            Before exploring any further, sometimes we want to save all the commands and output that will ensue in a text file. This can be done with the log command as follows. Here we called our text file unit2.

            log using unit2.txt, text replace 
            

              To explore variables

              Describe enumerates the variables and their type. Codebook gives a frequency table for character variables, basic statistics (mean, std dev etc) for numeric variables and the number of missing observations for both types of variables. Sum is an abbreviation for summarize. It gives basic statistics for numeric variables. It can be modified to give statistics for a subset of the data set. For example, we want the statistics on whether voters have heard of David Soknacki or not only for individuals who have not heard of Olivia Chow (i.e., chowheard==0). We can summarize the same information for voters who have not heard of John Tory (i.e., toryheard==0).

              describe
              codebook 
              
              summarize toryheard 
              sum toryheard chowheard stinheard sokheard
              sum sokheard if chowheard==0
              sum sokheard if toryheard==0
              

              Tabstat displays a table of means for specified variables by default. Other summary statistics can be displayed if specified in the stat option. These statistics can also be further grouped by another variable using the by option. For example, here we can see the means and the frequencies of approval rates for each candidate within each category of approval of Rob Ford.

              tabstat fordapp 
              tabstat chowapp toryapp stinapp sokapp, by (fordapp) 
              tabstat chowapp toryapp stinapp sokapp, by (fordapp)
              tabstat chowapp toryapp stinapp sokapp, by (fordapp) stat(mean n)
              

              The tab command produces one-way or two-way frequency tables with row (row) and column percentages (col) as options. For example, we can obtain a two-way frequency table of the candidate the voters would vote for (vote1) by the candidate they would not vote for (votenot) along with the row and column percentages.

              tab vote1 
              tab vote1 votenot
              tab vote1 votenot, row col
              

              To make graphs

              The commands histogram, kdensity, graph box, and scatter can be used to make the graphs as described by the command name. These graph commands have options to modify the graph title, color of graph, axis labels using the xtitle and ytitle commands, presence of normal curve etc. The images shown below are made using the last graph command for each graph type.

              The histogram command has options bin number or width for continuous variables or discrete for discrete ordinal data. This is a histogram of the income variable which is a discrete ordinal variable in our case. This income variable has 8 levels where '8' denotes missing. Suppose we do not want to include the missing observations.

              histogram income 
              hist income if income!=8
              hist income if income!=8, normal
              hist income if income!=8, normal width(0.4) 
              hist income if income!=8, normal bin(15)
              hist income if income!=8, normal discrete title(Histogram of Income) xtitle(“Income”) color(ltblue) lcolor(white)
              

              kdensity ageavg
              kdensity ageavg, normal
              kdensity income, normal
              

              graph box income
              graph box income, over(fordapp)
              graph box income, over(fordapp)title(“Income boxplots for each approval category of Rob Ford”)
              

              scatter income ageavg
              scatter income ageavg, jitter(15)
              scatter ageavg income, jitter(20) msymbol(d) mcolor(red)
              scatter ageavg income, jitter(20) msymbol(d) mcolor(red) xtitle("Income") ytitle("Average Age")
              

              A correlation or covariance matrix can be obtained using the correlate command. The pwcorr command, not used here, also displays all pairwise correlation coefficients with more options for significance levels and handling missing values.

              correlate income ageavg
              

              Graph matrices can be produced using the graph matrix command.

              graph matrix education ageavg income
              graph matrix education ageavg income, jitter(10)
              

                To save the log file and view it

                log close
                view unit2.txt
                

                  Main Menu

                    

                   

                  Modifying Data

                   

                  DATA: ont14.dta

                  To change the order of variables use the order command. For example, we may want all the ford related variables to come right after the first 3 vote variables.

                  use ont14, clear
                  order vote1 vote2 votenot fordapp voteford fordleave fordres votefordafterrehab
                  

                  Labels can be created for the data set, a variable and the levels of a variable using define. Once the label for the levels of a variable has been created, it can be assigned to the variable using label values.

                  label data "2014 Ontario Municipal Elections"
                  label variable fordapp "Approval of Rob Ford"
                  label define fordappvalues 0 "No" 1 "Yes"
                  label values fordapp fordappvalues
                  codebook fordapp
                  

                  Character variables such as vote1 can be recoded into a numeric variable using the encode command. By default the alphabetical order of the variable’s values is used as can be seen with the variable vote1num. For this reason, category "Don't know" is recoded as 2. If, for example, we want the “Don’t know” category to be the last category, the order can be altered by labeling each value of vote1 with a number as can be seen with the variable vote1numcorrect.

                  Example 1

                  encode vote1, generate(vote1num)
                  label variable vote1num "Vote between 5 Candidates" 
                  codebook vote1num
                  

                  label define vote1numvalues 1 "David Soknacki" 2 "John Tory" 3 "Karen Stintz" 4 "Olivia Chow" 5 "Rob Ford" 6 "Don't know"
                  encode vote1, generate(vote1numcorrect) label(vote1numvalues)
                  codebook vote1numcorrect
                  

                  Example 2

                  codebook voteford
                  label define votefordvalues 0 "No" 1 "Yes" 2 "Don't know"
                  encode voteford, generate(votefordnum) label(votefordvalues)
                  recode votefordnum (2=.)
                  codebook votefordnum
                  

                  To recode a categorical variable, use the recode command. For example, a copy of the gender variable, gendercopy (1= Male, 2=Female) is recoded and renamed female (1=Female, 0=Male).

                  generate gendercopy=gender
                  rename gendercopy female
                  recode female(1=0) (2=1)
                  label define femalevalues 1 female 0 male
                  label values female femalevalues
                  codebook female
                  

                  To drop variables use the drop command. To drop labels use the label drop command.

                  drop voteford
                  label list
                  label drop femalevalues
                  codebook female
                  

                    To generate a new variable that is a function of other variables, use the generate command. Below, the variable totalapp (total approval) was created as an average of the candidate approval rates and recoded into a categorical variable with 3 levels. The command egen is used to generate new variables using special functions. For example, the std() function generates standardized values and the rowmean() function generates row means.

                    generate totalapp = (fordapp + chowapp + toryapp + stinapp + sokapp)/5
                    summarize totalapp
                    recode totalapp (0/0.33=0 Strict) (0.331/0.66=1 Average) (0.661/1=2 Lenient), generate (voteptmp)
                    decode voteptmp, generate(votep)
                    label variable votep "Voting Personality"
                    codebook voteptmp votep
                    

                    egen ztotalapp = std(totalapp)
                    summarize ztotalapp
                    list totalapp ztotalapp in 1/5
                    
                    egen meantotalapp = mean(totalapp), by(income)
                    tab meantotalapp education 
                    tab meantotalapp income
                    
                    egen activity = rowmean(fordapp chowapp toryapp stinapp sokapp toryheard chowheard stinheard sokheard)
                    tab activity income
                    

                      Notes can be attached to the data set or specific variables under memory. These become part of the data set and can be retrieved whenever the data is opened. They can be helpful for future reference or reminders. Notes can be created or found using the notes command.

                      notes female: the variable gender was renamed to female
                      notes income: values of income coded 8 were recoded to be missing
                      notes: Missing data points denote incomplete information and response "Don't know"
                      notes
                      

                        Main Menu

                         

                         

                        Managing Data

                         

                        DATA: ont14.dta

                        Creating ID variable using generate and _n

                        use ont14, clear
                        generate voterid=_n
                        save ont14, replace
                        

                          Subsetting data

                          We want to subset, for example, the female voters with the highest income on one hand and the male voters with the lowest education on the other hand. These data sets are saved as subset1 and subset2 respectively.

                          Example 1

                          use ont14, clear
                          keep if income==7 & gender==2 
                          save subset1, replace
                          

                          Example 2

                          use ont14, clear
                          keep if education==4 & gender==1
                          save subset2, replace
                          

                            Use the append command to subset

                            The append command stacks data sets to append therefore the number and order of variables between the two data sets must match in order for the append to be effective.

                            use subset1
                            append using subset2
                            save alldata, replace
                            

                              Keeping and dropping variables

                              use ont14, clear
                              keep voterid fordapp voteford fordleave fordres votefordafterrehab
                              save forddata, replace
                              list in 1/5
                              

                              use ont14, clear
                              drop vote1 - votefordafterrehab
                              save votehist, replace
                              list in 1/5
                              

                              Merging data

                              To merge data sets, sort each one by the merging variable first then use the merge command.

                              use forddata, clear
                              sort voterid
                              save forddata, replace
                              
                              use votehist, clear
                              sort voterid
                              save votehist, replace
                              
                              use forddata, clear
                              merge voterid using votehist
                              

                              save votehist, replace
                              

                              Main Menu

                               

                               

                              Counting Data

                               

                              DATA: ont14.dta

                              The Stata built-in variables _n and _N are used to count observations for the purpose of generating IDs, finding the lowest and highest scores for each group of observations, finding duplicates etc. The variable _n enumerates all the observations in a specific group whereas _N gives the total number of observations in a specific group.

                              Generating IDs

                              use ont14, clear
                              generate id = _n
                              generate nsize = _N
                              list id nsize vote1 vote2 votenot in 1/10
                              

                              Finding lowest and highest score in each group

                              Here we want to find the voters with the lowest and highest education level for each city by age group. First, we sort by all three variables city, age and education from lowest to highest. In this case, we also sort by id because there are some, city by age, groups with more than one individual in the same education level. Then, we count by education for each city by age group. Listing the observations where n1 equals 1 gives us the lowest education levels and n1 equals n2 gives us the highest education level.

                              sort city age education id
                              by city age : generate n1 = _n
                              by city age : generate n2 = _N
                              list vote1 vote2 votenot city age education n1 n2 if n1==1 in 1/100
                              

                               

                              list vote1 vote2 votenot city age education n1 n2 if n1==n2 in 1/100
                              

                              Checking for duplicates

                              The two methods below will print out rows of observations that are duplicates.

                              Method 1

                              use ont14, clear
                              sort _all
                              by _all: generate n = _N
                              list if n > 1
                              

                              Method 2

                              use ont14, clear
                              bysort * : generate n = _N
                              list if n > 1
                              

                                Main Menu

                                 

                                 

                                Dates

                                 

                                The following data consists of two dates: one in string format and another one is numeric.

                                use http://data.library.utoronto.ca/datapub/workshop/dates, clear
                                list
                                

                                To convert string type or numeric dates into elapsed dates or number of days since January 1, 1960 use the date and mdy functions respectively.

                                generate birthday = date(bdate, "MDY")
                                generate event = mdy(dmon, dday, dyear)
                                drop bdate - dyear
                                list
                                

                                Note that the third birthday is missing for the variable birthday because the year is not in four digits. This can be fixed manually by replacing the value of birthday for the 3rd observation using the replace command.

                                replace birthday = mdy(1, 21, 1972) in 3
                                

                                  The format command can be used to control the display of elapsed dates.

                                  format birthday event %dD_m_CY
                                  list
                                  

                                  The number of days elapsed can be converted into weeks, months and years as follows.

                                  generate diff = event - birthday
                                  generate weeks = diff/7
                                  generate months = diff/30.5
                                  generate years = diff/365.25
                                  list
                                  

                                  Main Menu

                                   

                                   

                                  Reshaping Data

                                   

                                  To reshape a data set between wide and long format, use the reshape long or reshape wide commands. For a data set in wide format, the variables that denote the repetition should be named with the variable name followed by a number (i.e. variablename#). For example, in our data set we have drug1, drug2 and drug3. The numbers do not need to be consecutive.

                                  clear
                                  use http://data.library.utoronto.ca/datapub/workshop/wide
                                  list
                                  

                                  describe
                                  list
                                  reshape long drug, i(id) j(drugtype)
                                  list
                                  

                                  table drugtype, contents(n drug mean drug sd drug)
                                  

                                  reshape wide
                                  

                                  Main Menu

                                  Sources

                                  [1] http://www.ats.ucla.edu/stat/stata/topics/data_management.htm
                                   

                                  author Nadia Muhe, Map and Data Library, University of Toronto.