Plotting a single variable

Lecture 11

Aidan Combs

Duke University
SOCIOL 333 - Summer Term 1 2023

2023-06-06

Logistics

  • Project component 2: descriptive statistics

    • Due Thursday June 8 11:59pm
    • this is the last time it’s moving–thank you for your patience. :)

Today

  • Getting help with R
  • Single variable plots
  • Factors

Finding help in R

  • #1: Your classmates and I!

    • Copy/pasting from slides or exercises and editing the pieces you need is a good strategy
  • Documentation! Use ?
?median

Finding help in R

  • Google!

    • Try “how to calculate the median in r” rather than “r how to calculate median”
    • StackOverflow/StackExchange is good
    • Paste in your error message
    • Cite your sources

Finding help in R

  • ChatGPT??
-   ...eh. It can WRITE code, but it can't test it... 
-   So what it gives you probably won't work, or won't work correctly, at least not right out of the box
-   Might be useful for inspiration/finding a starting place, but not for doing your work for you (just like with writing!)
-   **Cite your sources**

Plots

Why plot?

Why plot?

Plotting a single variable

Categorical data: bar charts

Numeric data: histograms

How? With ggplot!

The basic structure of a plot:

ggplot(
  data = DATASET, 
  # aes stands for aesthetics--it is where you tell R what plot features should represent what variables. 
  aes(
    # sometimes--like in the single variable plots we're making today--you 
    # only need to specify one variable (x or y). 
    # If you plot more than one variable you'll specify more than one thing here.
    # eg, maybe x and y variables and a third variable for color
    x = X.VARIABLE, 
    y = Y.VARIABLE, 
    fill = FILL.COLOR.VARIABLE,
    OTHER.STUFF)) + # add more elements to the plot with +
  # geoms tell R what kind of plot you want to make. More options in the ggplot cheatsheet.
  geom_XXXX() 
# the same thing but without the messy comments

ggplot(data = DATASET, 
       aes(x = X.VARIABLE, 
           y = Y.VARIABLE, 
           fill = FILL.COLOR.VARIABLE,
           OTHER.STUFF)) +
  geom_XXXX() 

Back to the bar chart

ggplot(data = acs12, # we want to plot the acs12 dataset
       aes(x = employment)) + # and specifically the employment variable, which should be on the x axis
  geom_bar() # and we want that plot to be a bar chart
  • It’s there but it’s ugly!

Cleaning up the bar chart

  • Once we have a basic plot, we can add more elements to make it nicer
ggplot(data = acs12, aes(x = employment)) + 
  geom_bar() + 
  # this layer changes our axis labels
  labs(y = "", # make the y axis labels blank
       x = "Labor force status") # Change the x axis label

Cleaning up the bar chart

  • What about that NA bar?
  • There are NAs in our plot because there are NAs in our data set
table(acs12$employment, useNA = "always")

not in labor force         unemployed           employed               <NA> 
               656                106                843                395 

Cleaning up the bar chart

  • We can use filter() to get rid of them before plotting the data
# first we filter out the NAs
acs12_filtered <- filter(acs12, !is.na(employment))

# then we plot that new dataset
ggplot(data = acs12_filtered, aes(x = employment)) + 
  geom_bar() + 
  labs(y = "", 
       x = "Labor force status",
       # and let's add a title too
       title = "Number of people in each employment category"
       ) 

Cleaning up the bar chart

  • We could have also used a pipe to do the same thing in one command
# we start with acs12
acs12 |> 
  # then we filter out the NAs
  filter(!is.na(employment)) |> 
  # then we make our plot
  ggplot(aes(x = employment)) + 
  geom_bar() + 
  labs(y = "", 
       x = "Labor force status",
       # and let's add a title too
       title = "Number of people in each employment category"
  ) 

Back to the histogram

  • It has exactly the same structure!
ggplot(data = acs12, aes(x = hrs_work)) + 
  #notice that the geom is different--this is what specifies what kind of plot you want
  geom_histogram()

Cleaning up the histogram

  • Two messages come up:

    • stat_bin() using bins = 30. Pick better value with binwidth.
    • Warning: Removed 1041 rows containing non-finite values (stat_bin()).
  • Let’s start with the second one

    • When there are NA variables in categorical variables, R plots them as another category, in their own bar
    • But when there are NA variables in numeric variables, R doesn’t know what to do
    • So here it’s warning us that it dropped “non-finite” (ie, NA) values itself

Cleaning up the histogram

  • It’s better if we do that explicitly though
  • Same as the bar chart, let’s filter out the NAs before plotting
acs_filtered <- filter(acs12, !is.na(hrs_work))

ggplot(data = acs_filtered, aes(x = hrs_work)) + 
  geom_histogram()

Cleaning up the histogram

  • Now what about that other message?

    • stat_bin() using bins = 30. Pick better value with binwidth.
# Hmm. Let's check the function documentation. 
# Since it didn't give us this message last time, 
# it must have something to do with the histogram function.
?geom_histogram

Cleaning up the histogram

  • The function documentation tells us that geom_histogram gives you two ways to set the width of the bars: bins and binwidth.

    • binwidth sets the width of each bar: eg, each bar should represent 5 hours worked, or 10
    • We could use bins instead to tell R how many bars there should be–10 bars, 20 bars, etc
  • The message is telling us that R took its best dumb guess, but we should pick something better.

Cleaning up the histogram

acs_filtered <- filter(acs12, !is.na(hrs_work))

ggplot(data = acs_filtered, aes(x = hrs_work)) + 
  # the binwidth argument goes inside the geom_histogram function
  # let's say that each bar should equal five hours--so 0-5 hours, 5-10, etc
  geom_histogram(binwidth = 5) 

Cleaning up the histogram

  • What if we picked 8 instead? (logic being that 8 is a standard workday)
acs_filtered <- filter(acs12, !is.na(hrs_work))

ggplot(data = acs_filtered, aes(x = hrs_work)) + 
  geom_histogram(binwidth = 10) 

Cleaning up the histogram

  • What about 1?
acs_filtered <- filter(acs12, !is.na(hrs_work))

ggplot(data = acs_filtered, aes(x = hrs_work)) + 
  geom_histogram(binwidth = 1) 

Cleaning up the histogram

  • Now we’ve eliminated all the messages–let’s add labels!
acs_filtered <- filter(acs12, !is.na(hrs_work))

ggplot(data = acs_filtered, aes(x = hrs_work)) + 
  geom_histogram(binwidth = 5) + 
  labs(x = "Number of hours worked per week",
       y = "Number of people",
       title = "Distribution of hours worked per week",
       subtitle = "American Community Survey 2012")

Exercise: Plotting one variable