Describing data: part 1

Lecture 7

Aidan Combs

Duke University
SOCIOL 333 - Summer Term 1 2023

2023-05-30

Logistics

  • Final project proposals due tonight

    • Update your first draft: incorporate useful peer feedback
    • Don’t forget to add the last paragraph about what question you think you should go with
    • Any last questions?
  • Late work policy

    • You have 3 no-questions-asked late days to use through the semester
    • Applies to project components and assignments (except the final paper and presentation; I can’t extend those)

Today

  • Data analysis process
  • Univariate summaries: what do my variables look like?

The (approximate) data analysis process

  • Determine topic
  • Find data; learn what observations and variables are available
  • Write research question
  • Describe distributions of relevant variables
  • Prepare data frame for analysis
  • Describe relationships between variables
  • Perform statistical tests/write models
  • Communicate results

The (approximate) data analysis process

  • Determine topic ✓
  • Find data; learn what observations and variables are available ✓
  • Write research question ✓
  • Describe distributions of relevant variables
  • Prepare data frame for analysis
  • Describe relationships between variables
  • Perform statistical tests/write models
  • Communicate results

Project component 2: Descriptive statistics

Goals:

  • Understand how your variables are distributed in your data

  • Make any necessary modifications to your data frame

    • Remove irrelevant observations
    • Modify or create variables as necessary
  • Describe the relationships between your variables

Understanding how your variables are distributed

Today: With numbers (later: with plots)

  • Summary statistics
  • What makes sense depends on variable type

R syntax for today

  • To access specific variables: dataframe$variable
  • Functions do things with variables/other inputs: do_this(with_this)
  • We can save results as objects to use later with <-: an_object_name <- some_function(some_input)
    • e.g., meanvalue <- mean(dataframe$variable)
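
A minimal sketch of these three pieces of syntax working together. The data frame `toy` and its values are made up for illustration; they are not part of the course data.

```r
# Hypothetical toy data frame; names and values are made up for illustration
toy <- data.frame(
  team = c("A", "B", "C"),
  nplayers = c(12, 30, 18)
)

# Access one variable with dataframe$variable
toy$nplayers

# Call a function on it and save the result as an object
meanvalue <- mean(toy$nplayers)
meanvalue
```

Typing an object's name on its own line prints its value, which is a quick way to check what you saved.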

Data set 1 for today: EADA data on sports teams

glimpse(sports)
Rows: 879
Columns: 8
$ school             <chr> "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Du…
$ year               <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 200…
$ division           <chr> "NCAA Division I-A", "NCAA Division I-A", "NCAA Div…
$ teamgender         <chr> "men", "men", "men", "men", "men", "men", "men", "m…
$ sport              <chr> "Baseball", "Basketball", "Fencing", "Football", "G…
$ ncoaches           <dbl> 3, 4, 1, 10, 2, 3, 3, 2, 2, 3, 3, 4, 4, 8, 6, 13, 2…
$ nplayers           <dbl> 33, 15, 19, 81, 12, 45, 23, 32, 11, 50, 35, 35, 30,…
$ player_coach_ratio <dbl> 11.000000, 3.750000, 19.000000, 8.100000, 6.000000,…

Before we dive in: an intro to missing data

  • Sometimes not everyone answers every question or data is not available for a specific observation
  • Creates NA values in your data frame
  • Check for them: they can create unexpected results otherwise
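
One way to check for missing values is with `is.na()`, sketched below. The vector `hrs` is a made-up example, not a variable from today's data sets.

```r
# Hypothetical responses; NA marks a missing answer
hrs <- c(40, NA, 23, NA, 35)

# is.na() returns TRUE for each missing value
is.na(hrs)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(is.na(hrs))
n_missing
```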

Categorical variables

Categorical: What are the options?

  • unique()
unique(sports$school)
[1] "Duke" "UNC" 
unique(sports$sport)
 [1] "Baseball"                  "Basketball"               
 [3] "Fencing"                   "Football"                 
 [5] "Golf"                      "Lacrosse"                 
 [7] "Soccer"                    "Swimming and Diving"      
 [9] "Tennis"                    "Track and Field X Country"
[11] "Wrestling"                 "All Track Combined"       
[13] "Field Hockey"              "Rowing"                   
[15] "Volleyball"                "Gymnastics"               
[17] "Softball"                 

Categorical: counts for one variable

  • Counts: table()

    • useNA = "always" tells R to let you know if there is any missing data on your variables
table(sports$school, useNA = "always")

Duke  UNC <NA> 
 423  456    0 
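
To see why `useNA = "always"` matters, here is a small sketch on a made-up vector with one missing value (the `school` vector below is hypothetical, not the `sports` data).

```r
# Hypothetical school labels with one missing value
school <- c("Duke", "UNC", NA, "Duke")

# By default, table() silently drops the NA
counts_default <- table(school)

# useNA = "always" adds an <NA> column, even when its count is 0
counts_with_na <- table(school, useNA = "always")

counts_default
counts_with_na
```

The default table adds up to 3 even though there are 4 observations, which is easy to miss if you never ask for the `<NA>` column.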

Categorical: counts for two variables

  • Still table(), but use two variables as arguments
table(sports$teamgender, sports$school, useNA = "always")
       
        Duke UNC <NA>
  men    209 209    0
  women  214 247    0
  <NA>     0   0    0

Exercise

glimpse(acs12)
Rows: 2,000
Columns: 13
$ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
$ employment   <fct> not in labor force, not in labor force, NA, not in labor …
$ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
$ race         <fct> white, white, white, white, white, other, white, other, a…
$ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
$ gender       <fct> female, male, female, male, female, female, male, male, m…
$ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ lang         <fct> english, english, english, other, other, other, english, …
$ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
$ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
$ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
$ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…

Question 1 solutions

Response options for the employment variable

unique(acs12$employment)
[1] not in labor force <NA>               employed           unemployed        
Levels: not in labor force unemployed employed

Question 1 solutions

How many people are in each category? How many missing values are there?

table(acs12$employment, useNA = "always")

not in labor force         unemployed           employed               <NA> 
               656                106                843                395 

Question 1 solutions

Create a table that shows the number of people of each gender in each employment category.

table(acs12$gender, acs12$employment, useNA = "always")
        
         not in labor force unemployed employed <NA>
  male                  283         59      470  219
  female                373         47      373  176
  <NA>                    0          0        0    0

Does it look like gender and employment status are related?
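
Raw counts are hard to compare when the groups are different sizes, so one possible approach is to convert each row to proportions. The sketch below retypes the counts from the table above (leaving out the `<NA>` column) and uses `prop.table()`, which goes a step beyond the functions covered so far.

```r
# Counts copied from the table above, leaving out the <NA> column
counts <- matrix(
  c(283, 59, 470,
    373, 47, 373),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("male", "female"),
    employment = c("not in labor force", "unemployed", "employed")
  )
)

# margin = 1 divides each count by its row total,
# so each row sums to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

In this sketch, about 58% of men and about 47% of women with a recorded employment status are employed.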

Numeric variables

Numeric: All about distributions!

Summarizing a distribution

  • Center
  • Spread
  • Shape

Center: Mean

Say we measure heights of 7 people (in inches):

heights <- c(67, 70, 80, 61, 70, 71, 62)
  • aka the average
  • Add everything together and divide by the number of values: (67 + 70 + 80 + 61 + 70 + 71 + 62)/7 = 68.7
  • mean()
mean(heights)
[1] 68.71429
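
The "add everything together and divide" recipe can be checked by hand in R. This is only a sketch to show that `mean()` does the same arithmetic; the object names `total`, `n`, and `by_hand` are made up here.

```r
heights <- c(67, 70, 80, 61, 70, 71, 62)

# Add everything together, then divide by the number of values
total <- sum(heights)      # 481
n <- length(heights)       # 7
by_hand <- total / n

# Same value as the built-in function
by_hand
mean(heights)
```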

Center: Median

heights
[1] 67 70 80 61 70 71 62
  • The middle value
  • Order the values least to greatest, select the one in the middle
  • 61, 62, 67, 70, 70, 71, 80
  • median()
median(heights)
[1] 70
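
The same steps can be sketched by hand: sort, then pick the middle position. The names `ordered`, `middle_position`, and `by_hand` are made up for this example, and the shortcut below only works when the number of values is odd.

```r
heights <- c(67, 70, 80, 61, 70, 71, 62)

# Order the values from least to greatest
ordered <- sort(heights)

# With an odd number of values, the middle one is at position (n + 1) / 2
middle_position <- (length(ordered) + 1) / 2
by_hand <- ordered[middle_position]

by_hand
median(heights)
```

With an even number of values there is no single middle value, and `median()` averages the two middle ones.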

Spread: minimum and maximum

min(heights)
[1] 61
max(heights)
[1] 80

Spread: quartiles

  • Quartiles are percentiles: the 25th, 50th, and 75th
  • The lowest 1/4 of your data is below the 25th percentile, the lowest 1/2 is below the 50th percentile (the median), etc.
  • Calculated similarly to the median
  • quantile()
quantile(heights)
  0%  25%  50%  75% 100% 
61.0 64.5 70.0 70.5 80.0 
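
If you only want some of the percentiles, `quantile()` takes a `probs` argument. A short sketch, assuming R's default method for calculating quantiles:

```r
heights <- c(67, 70, 80, 61, 70, 71, 62)

# probs picks which percentiles to report (as proportions from 0 to 1)
quartiles <- quantile(heights, probs = c(0.25, 0.75))
quartiles

# The 50th percentile is the median
quantile(heights, probs = 0.5)
median(heights)
```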

Spread: standard deviation

  • Measure of degree of variation in data: bigger standard deviation = more variation
  • Used to establish statistical significance (we’ll come back to this later)
  • sd()
heights_moresimilar <- c(60, 63, 62, 60, 61, 61, 64)
heights
[1] 67 70 80 61 70 71 62
heights_moresimilar
[1] 60 63 62 60 61 61 64
sd(heights)
[1] 6.369571
sd(heights_moresimilar)
[1] 1.511858
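
For the curious, here is a sketch of what `sd()` calculates: roughly, the typical distance of a value from the mean. The names `squared_dev` and `by_hand` are made up for this example.

```r
heights <- c(67, 70, 80, 61, 70, 71, 62)

# Squared distance of each value from the mean
squared_dev <- (heights - mean(heights))^2

# Sample variance divides by n - 1; the standard deviation is its square root
by_hand <- sqrt(sum(squared_dev) / (length(heights) - 1))

by_hand
sd(heights)
```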

summary(): (almost) everything all at once

summary(sports$nplayers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   15.00   28.00   35.63   40.00  148.00 

Summary statistics functions and missing values

mean(acs12$income)
[1] NA
  • NA! What!
  • Functions such as mean(), median(), and sd() return NA if any observation is missing a value for the variable; summary() instead counts the NA's separately
  • Always check for missing data!
  • Use the na.rm = TRUE option to tell the function to ignore the NA values.
summary(acs12$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0    3000   23600   33700  450000     377 
mean(acs12$income, na.rm = TRUE)
[1] 23599.98
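
The same behavior is easy to see on a small made-up vector. The `income` values below are hypothetical and chosen only to keep the arithmetic simple.

```r
# Hypothetical incomes; NA marks a missing answer
income <- c(60000, 0, NA, 1700)

# Without na.rm, one missing value makes the whole result NA
mean(income)

# na.rm = TRUE drops the NA values before computing
mean_known <- mean(income, na.rm = TRUE)
median_known <- median(income, na.rm = TRUE)

mean_known
median_known
```

Keep in mind that the result now describes only the observations with a recorded value, so it is worth reporting how many were dropped.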