5-31-23 exercise: filtering

Setup

First, we need to load some packages. Click the play button to run the code in the block below (remember, you only need to do this once per session).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Filtering

Like yesterday, we will be working with the acs12 data set.

Often, we only want to analyze adult respondents. In the chunk below, create a modified version of acs12 that contains only respondents who are 18 or older. Call your new data frame acs_adults.

acs_adults <- filter(acs12, age >= 18)

Look at the summary statistics of the age variable in your new, filtered data frame. Did the filtering work?

summary(acs_adults$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.00   49.00   49.06   63.00   94.00 
# yes, filtering worked! Our minimum age is now 18

How many respondents are left in this new data frame? (We started with 2,000.)

(Hint: there are multiple ways to do this. Some involve code, and one involves looking at the entry for your new data frame in the “Environment” panel in RStudio.)

glimpse(acs_adults)
Rows: 1,561
Columns: 13
$ income       <int> 60000, 0, 0, 1700, 45000, 8600, 0, 33500, 4000, 19000, 34…
$ employment   <fct> not in labor force, not in labor force, not in labor forc…
$ hrs_work     <int> 40, NA, NA, 40, 84, 23, NA, 55, 8, 35, 25, NA, 12, 40, 8,…
$ race         <fct> white, white, white, other, white, white, white, white, w…
$ age          <int> 68, 88, 77, 35, 27, 69, 69, 52, 67, 36, 40, 27, 18, 35, 3…
$ gender       <fct> female, male, female, female, male, female, male, male, f…
$ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ time_to_work <int> NA, NA, NA, 15, 40, 5, NA, 20, 10, 15, NA, NA, NA, 30, 20…
$ lang         <fct> english, english, other, other, english, english, english…
$ married      <fct> no, no, no, yes, yes, no, yes, yes, yes, yes, no, no, no,…
$ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
$ disability   <fct> no, yes, yes, yes, no, no, yes, no, no, no, yes, no, no, …
$ birth_qrtr   <fct> jul thru sep, jan thru mar, jul thru sep, jul thru sep, o…
# glimpse shows us that there are 1561 rows in the data frame now. We can also see this by looking at the information on acs_adults in the environment tab.
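The hint mentions that there are other ways to count rows with code. Here is a minimal sketch of a few of them, assuming the same packages and the same filter as above. `nrow()` returns the row count as a single number, and dplyr's `count()` returns it as a one-row data frame. Each result should match the 1,561 rows that `glimpse()` reported.

```r
library(dplyr)
library(openintro)

# Same filter as in the exercise above
acs_adults <- filter(acs12, age >= 18)

# Base R: the number of rows as a single integer
nrow(acs_adults)

# dplyr: the same count, returned as a one-row data frame with a column named n
count(acs_adults)

# Count the matching rows without saving a new data frame;
# na.rm = TRUE skips any missing ages, just as filter() drops them
sum(acs12$age >= 18, na.rm = TRUE)
```

`nrow()` is usually the quickest check. `count()` becomes more useful once you want counts by group, for example `count(acs_adults, gender)`.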

Extra credit

Use the chunks below to create other subsets of the acs12 data. Do as many or as few as you feel are useful to you. When you feel like you’ve got the basics down, try out some of the more complex filtering operators we discussed.

Check your results by looking at summary statistics or tables. Look at your new data frames by opening them up from the environment panel.

For practice with simple comparison operators (==, !=, <, <=, >, >=):

  • Include only US citizens

    acs_citizens <- filter(acs12, citizen == "yes")
    
    table(acs_citizens$citizen)
    
      no  yes 
       0 1882 
  • Include only people who are not White

    acs_poc <- filter(acs12, race != "white")
    
    table(acs_poc$race)
    
    white black asian other 
        0   206    87   152 
  • Include only people who are retirement age (>67)

    acs_retired <- filter(acs12, age > 67)
    
    summary(acs_retired$age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      68.00   71.00   76.00   77.21   82.00   94.00 
  • Exclude people who were born in winter

    glimpse(acs12) # variable is birth_qrtr
    Rows: 2,000
    Columns: 13
    $ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
    $ employment   <fct> not in labor force, not in labor force, NA, not in labor …
    $ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
    $ race         <fct> white, white, white, white, white, other, white, other, a…
    $ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
    $ gender       <fct> female, male, female, male, female, female, male, male, m…
    $ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
    $ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
    $ lang         <fct> english, english, english, other, other, other, english, …
    $ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
    $ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
    $ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
    $ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…
    unique(acs12$birth_qrtr) # value is "jan thru mar"
    [1] jul thru sep jan thru mar oct thru dec apr thru jun
    Levels: jan thru mar apr thru jun jul thru sep oct thru dec
    acs_nowinter <- filter(acs12,
           birth_qrtr != "jan thru mar")
    
    table(acs_nowinter$birth_qrtr)
    
    jan thru mar apr thru jun jul thru sep oct thru dec 
               0          479          504          532 

For practice with more complex comparison operators:

  • Include only the people who are not in the labor force but still say they work more than 20 hours per week (??)

    unique(acs12$employment) # "not in labor force" is the value we're interested in
    [1] not in labor force <NA>               employed           unemployed        
    Levels: not in labor force unemployed employed
    acs_confusingwork <- filter(acs12, hrs_work > 20 & employment == "not in labor force")
    
    # table checks the employment variable--we only have people who are not in the labor force
    table(acs_confusingwork$employment)
    
    not in labor force         unemployed           employed 
                    47                  0                  0 
    # and summary checks hrs_work: we only have people who work more than 20 hours.
    summary(acs_confusingwork$hrs_work)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      23.00   33.50   40.00   38.77   40.00   99.00 
  • Include only Black and Asian respondents

    unique(acs12$race)
    [1] white other asian black
    Levels: white black asian other
    acs_blackasian <- filter(acs12, 
                             race == "black" | race == "asian")
    
    table(acs_blackasian$race)
    
    white black asian other 
        0   206    87     0 
  • Include only the people who reported a usable (not-NA) commute time

    acs_withcommutes <- filter(acs12, !is.na(time_to_work))
    
    # no NA's reported!
    summary(acs_withcommutes$time_to_work)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
          1      10      20      26      30     163 
  • Include people who have a disability or are unemployed

    # checking for variable names: we want "disability", which has yes/no values, and "employment", where the relevant value is "unemployed"
    glimpse(acs12)
    Rows: 2,000
    Columns: 13
    $ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
    $ employment   <fct> not in labor force, not in labor force, NA, not in labor …
    $ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
    $ race         <fct> white, white, white, white, white, other, white, other, a…
    $ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
    $ gender       <fct> female, male, female, male, female, female, male, male, m…
    $ citizen      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
    $ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
    $ lang         <fct> english, english, english, other, other, other, english, …
    $ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
    $ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
    $ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
    $ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…
    acs_disabled_or_unemployed <- filter(acs12,
                                         disability == "yes" | employment == "unemployed")
    
    # the table shows that we have people with disabilities, people who are unemployed, and people who both have a disability and are unemployed. But there are no people who neither have a disability nor are unemployed. So this worked as expected! (Note that table() leaves out rows where either variable is NA by default.)
    table(acs_disabled_or_unemployed$disability, acs_disabled_or_unemployed$employment)
    
          not in labor force unemployed employed
      no                   0         85        0
      yes                223         21       70
  • Include non-retired adults (people who are older than 18 but younger than 67)

    # I should have phrased this question better; really if we're interested in adults it should be 18 or older, not older than 18.
    acs_workingadults <- filter(acs12, age >= 18 & age < 67)
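Two of the filters in this list can also be written in other ways. The sketch below assumes the same packages as above, and the `_alt` data frame names are made up for illustration. Conditions separated by commas inside `filter()` are combined the same way as `&`, and `%in%` tests whether a value matches any element of a vector, which can be easier to read than a chain of `|`. The last lines add a check for acs_workingadults, which was created above without one.

```r
library(dplyr)
library(openintro)

# Comma-separated conditions are combined with AND, just like &
acs_confusingwork_alt <- filter(acs12,
                                hrs_work > 20,
                                employment == "not in labor force")

# %in% keeps rows whose value matches any element of the vector,
# like writing race == "black" | race == "asian"
acs_blackasian_alt <- filter(acs12, race %in% c("black", "asian"))

# Row counts should match the tables above: 47, and 206 + 87 = 293
nrow(acs_confusingwork_alt)
nrow(acs_blackasian_alt)

# Check the working-adults filter: the minimum age should be at least 18
# and the maximum should be below 67
acs_workingadults <- filter(acs12, age >= 18 & age < 67)
range(acs_workingadults$age)
```

Which form to use is mostly a matter of readability. `%in%` pays off as the list of values gets longer, while `|` is still needed when the conditions involve different variables.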

Other subsets that interest you:

Submission

When you are done, commit your changes and push them to GitHub! (instructions)