Describing data: part 2

Lecture 8

Aidan Combs

Duke University
SOCIOL 333 - Summer Term 1 2023

2023-05-31

Logistics

  • Project proposals

    • I will get my feedback to you soon; it will be posted as a GitHub issue (grades will be in Sakai)
  • Project descriptive statistics

    • Due Tuesday June 6. Instructions are up; template and example to come this afternoon.
  • Homework 1

    • Cancelled as a mandatory assignment; time is short
    • BUT practice is the only way to learn this stuff!
    • So, extra credit opportunities will be worked into exercises. I highly recommend doing as much as you can!

Today

  • Finish univariate summaries
  • Preparing data frames

Where we left off: summarizing numeric variables

Summarizing a distribution

  • Center

    • median(dataset$var, na.rm = TRUE)
    • mean(dataset$var, na.rm = TRUE)
  • Spread

    • quantile(dataset$var, na.rm = TRUE)
    • sd(dataset$var, na.rm = TRUE)
  • (Almost) everything

    • summary(dataset$var)
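
A minimal sketch of these functions on a small made-up vector (the `hours` values are invented for illustration and are not from the course data). It also shows why `na.rm = TRUE` matters: without it, a single NA makes the result NA.

```r
# Toy data (made up for illustration); NA marks a missing response
hours <- c(40, 35, NA, 20, 60, 40, NA, 45)

# Center
mean_hours <- mean(hours, na.rm = TRUE)
median_hours <- median(hours, na.rm = TRUE)

# Spread
quartiles <- quantile(hours, na.rm = TRUE)
sd_hours <- sd(hours, na.rm = TRUE)

# Without na.rm = TRUE, one missing value makes the whole result NA
mean_with_na <- mean(hours)

mean_hours     # 40
median_hours   # 40
mean_with_na   # NA

# (Almost) everything at once, including the count of NAs
summary(hours)
```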

Exercise question 2

glimpse(acs12)
Rows: 2,000
Columns: 13
$ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
$ employment   <fct> not in labor force, not in labor force, NA, not in labor …
$ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
$ race         <fct> white, white, white, white, white, other, white, other, a…
$ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
$ gender       <fct> female, male, female, male, female, female, male, male, m…
$ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ lang         <fct> english, english, english, other, other, other, english, …
$ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
$ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
$ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
$ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…

Question 2 solutions

What is the mean of hrs_work? What is the median? (there are two ways to get this info)

Approach 1:

mean(acs12$hrs_work, na.rm = TRUE)
[1] 37.97706
median(acs12$hrs_work, na.rm = TRUE)
[1] 40

Approach 2:

summary(acs12$hrs_work)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   32.00   40.00   37.98   40.00   99.00    1041 

Question 2 solutions

What is the standard deviation of hrs_work?

sd(acs12$hrs_work, na.rm = TRUE)
[1] 13.49768

Question 2 solutions

What proportion of people in the data set are missing information (have NAs) for this variable? (there are many ways to do this)

# Find the number of NA values
summary(acs12$hrs_work)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   32.00   40.00   37.98   40.00   99.00    1041 
# Find the total number of observations
glimpse(acs12)
Rows: 2,000
Columns: 13
$ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
$ employment   <fct> not in labor force, not in labor force, NA, not in labor …
$ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
$ race         <fct> white, white, white, white, white, other, white, other, a…
$ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
$ gender       <fct> female, male, female, male, female, female, male, male, m…
$ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ lang         <fct> english, english, english, other, other, other, english, …
$ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
$ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
$ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
$ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…

Question 2 solutions

Now that we have those values, we can use R to do the math

# Divide!
1041/2000
[1] 0.5205
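
One of the other ways to get this proportion is to let R count the NAs directly with `is.na()`. The sketch below uses a small made-up vector (`hrs` is invented for illustration) so the arithmetic is easy to check by hand.

```r
# Toy data (made up for illustration): 8 responses, 4 of them missing
hrs <- c(40, NA, NA, 23, 84, NA, NA, 40)

# is.na() returns TRUE for each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(hrs))
n_total <- length(hrs)

# Same division as above, but with the counts computed by R
prop_missing <- n_missing / n_total

# Shortcut: the mean of TRUE/FALSE values is the proportion of TRUEs
prop_missing_short <- mean(is.na(hrs))

prop_missing         # 0.5
prop_missing_short   # 0.5
```

On the course data, the same idea would be `mean(is.na(acs12$hrs_work))`.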

Shape

Shape

  • We’ll go over how to make plots later
  • But summary statistics tell you things about shape too

Skew

Right-skewed data

Left-skewed data

Skew

  • Extreme values influence the mean more than the median
  • When the mean is higher than the median, the data might be skewed right; when the mean is lower than the median, the data might be skewed left

Example: American Community Survey commute time

time_to_work: Travel time to work, in minutes.

mean_commute <- mean(acs12$time_to_work, na.rm = TRUE)
median_commute <- median(acs12$time_to_work, na.rm = TRUE)
mean_commute
[1] 25.99745
median_commute
[1] 20

Shape: Quartiles

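  • The location of the quartiles also tells you something about shape: here the median (20) sits halfway between the 1st and 3rd quartiles (10 and 30), but the maximum (163) is far above the 3rd quartile, which suggests a long right tail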
summary(acs12$time_to_work)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1      10      20      26      30     163    1217 
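
The same checks can be sketched on a small made-up vector (`commute` is invented for illustration, not taken from the ACS). It compares the mean to the median, then compares how far the minimum and maximum sit from the quartiles. These are rough hints about shape, not a formal test for skew.

```r
# Toy commute times in minutes (made up): mostly short, one very long
commute <- c(5, 10, 10, 15, 20, 20, 25, 30, 45, 120)

mean_commute <- mean(commute)       # pulled upward by the 120
median_commute <- median(commute)   # not affected by how extreme the 120 is

# Quartiles: names are "0%", "25%", "50%", "75%", "100%"
q <- quantile(commute)

# Distance from the minimum to Q1 versus from Q3 to the maximum
lower_gap <- unname(q["25%"] - q["0%"])
upper_gap <- unname(q["100%"] - q["75%"])

mean_commute     # 30
median_commute   # 20
upper_gap > lower_gap   # TRUE: the upper tail is much longer
```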

Review

  • Categorical data: look at the available categories and how many observations are in each category

    • unique(dataframe$variable)
    • table(dataframe$variable, useNA = "always")
    • table(dataframe$variable1, dataframe$variable2, useNA = "always")

Review

  • Numeric data: Look at summary statistics that tell you something about the distribution
    • summary(dataframe$variable)
    • sd(dataframe$variable, na.rm = TRUE)

Modifying data frames

Removing observations

  • You may not be interested in all observations

  • Example:

    • Research question: How is commute time related to hours worked?

Removing observations

  • How is commute time related to hours worked?
  • Data: ACS 2012
glimpse(acs12)
Rows: 2,000
Columns: 13
$ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
$ employment   <fct> not in labor force, not in labor force, NA, not in labor …
$ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
$ race         <fct> white, white, white, white, white, other, white, other, a…
$ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
$ gender       <fct> female, male, female, male, female, female, male, male, m…
$ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ lang         <fct> english, english, english, other, other, other, english, …
$ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
$ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
$ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
$ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…
  • Need to remove people who are unemployed or not in the labor force (they don't have a job or a commute!)

Removing observations

  • How? With filter()
  • filter(dataframe, condition)
acs12_employedonly <- filter(acs12, employment == "employed")
  • Did it work? Check with table()
table(acs12_employedonly$employment, useNA = "always")

not in labor force         unemployed           employed               <NA> 
                 0                  0                843                  0 

Conditions

  • filter(acs12, employment == "employed")
  • condition: employment == "employed"
  • This specifies which observations you want to keep
  • Common comparison operators:
    • == : equal to (note there are two equals signs!)
    • != : not equal to
    • >, >= : greater than, greater than or equal to
    • <, <= : less than, less than or equal to
  • Values:

    • Can be numbers, letters/words, or TRUE/FALSE; they should match the response options of your variable
    • Put letters/words in quotation marks
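
A minimal sketch of these operators on a small made-up data frame (`people` and its values are invented for illustration; it assumes the dplyr package is installed). Note that `filter()` drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Toy data frame (made up for illustration, not the ACS data)
people <- data.frame(
  employment = c("employed", "unemployed", "employed",
                 "not in labor force", "employed"),
  hrs_work   = c(40, NA, 15, NA, 50)
)

# Words go in quotation marks; numbers do not
employed_only <- filter(people, employment == "employed")
not_employed  <- filter(people, employment != "employed")

# Rows with a missing hrs_work are dropped, because NA >= 40 is NA
long_hours <- filter(people, hrs_work >= 40)

nrow(employed_only)   # 3
nrow(not_employed)    # 2
nrow(long_hours)      # 2
```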

Condition examples

  • filter(acs12, citizen == "no")
  • filter(acs12, income <= 12000)
  • filter(acs12, birth_qrtr != "jan thru mar")
  • filter(acs12, hrs_work > 20)

Other useful condition operators

  • Two or more requirements

    • & : and

    • | : or

    • filter(acs12, citizen == "no" & lang == "english")

    • filter(acs12, race == "black" | race == "asian")

  • Missing values

    • is.na()

    • !is.na()

      • is not missing; good for removing rows with missing values
    • filter(acs12, !is.na(income))

  • More: see R for Data Science
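
The operators above can be sketched on a small made-up data frame (`people` and its values are invented for illustration; it assumes the dplyr package is installed). With `&` a row must meet both requirements; with `|` it only needs to meet one.

```r
library(dplyr)

# Toy data frame (made up for illustration, not the ACS data)
people <- data.frame(
  citizen = c("no", "no", "yes", "yes", "no"),
  lang    = c("english", "other", "english", "other", "english"),
  income  = c(30000, NA, 52000, NA, 0)
)

# & : keep rows that meet BOTH requirements
both <- filter(people, citizen == "no" & lang == "english")

# | : keep rows that meet AT LEAST ONE requirement
either <- filter(people, citizen == "no" | lang == "english")

# is.na() keeps the missing rows; !is.na() keeps the non-missing rows
missing_income <- filter(people, is.na(income))
has_income     <- filter(people, !is.na(income))

nrow(both)             # 2
nrow(either)           # 4
nrow(missing_income)   # 2
nrow(has_income)       # 3
```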

Common mistakes and error messages

# Mistake: one equals sign instead of two
filter(acs12, employment = "employed")
Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `employment == "employed"`?

Common mistakes and error messages

# Mistake: no quotation marks, so R looks for an object named employed
filter(acs12, employment == employed)
Error in `filter()`:
ℹ In argument: `employment == employed`.
Caused by error:
! object 'employed' not found

Common mistakes and error messages

# Mistake: typo in the value. No error, but zero rows are returned
filter(acs12, employment == "employd")
# A tibble: 0 × 13
# ℹ 13 variables: income <int>, employment <fct>, hrs_work <int>, race <fct>,
#   age <int>, gender <fct>, citizen <fct>, time_to_work <int>, lang <fct>,
#   married <fct>, edu <fct>, disability <fct>, birth_qrtr <fct>

Common mistakes and error messages

# Mistake: wrong capitalization. No error, but zero rows are returned
filter(acs12, employment == "Employed")
# A tibble: 0 × 13
# ℹ 13 variables: income <int>, employment <fct>, hrs_work <int>, race <fct>,
#   age <int>, gender <fct>, citizen <fct>, time_to_work <int>, lang <fct>,
#   married <fct>, edu <fct>, disability <fct>, birth_qrtr <fct>
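
The last two mistakes produce no error message, so they are easy to miss. One possible habit, sketched below on made-up data (`people` is invented for illustration; it assumes the dplyr package is installed), is to look at the exact spelling of the categories with `unique()` before filtering and to check the number of rows afterwards.

```r
library(dplyr)

# Toy data frame (made up for illustration, not the ACS data)
people <- data.frame(
  employment = c("employed", "unemployed", "employed", NA)
)

# Step 1: check the exact spelling and capitalization of the categories
unique(people$employment)

# Step 2: filter, then check how many rows are left
typo    <- filter(people, employment == "employd")
correct <- filter(people, employment == "employed")

nrow(typo)      # 0: a sign that the value did not match anything
nrow(correct)   # 2
```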

Exercise: Filtering