Lecture 8
Duke University
SOCIOL 333 - Summer Term 1 2023
2023-05-31
Project proposals
Project descriptive statistics
Homework 1
Center
Spread
(Almost) everything
Rows: 2,000
Columns: 13
$ income <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600, 0,…
$ employment <fct> not in labor force, not in labor force, NA, not in labor …
$ hrs_work <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA, N…
$ race <fct> white, white, white, white, white, other, white, other, a…
$ age <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 10, …
$ gender <fct> female, male, female, male, female, female, male, male, m…
$ citizen <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA, NA…
$ lang <fct> english, english, english, other, other, other, english, …
$ married <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes, no…
$ edu <fct> college, hs or lower, hs or lower, hs or lower, hs or low…
$ disability <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, yes, …
$ birth_qrtr <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec, j…
What is the mean of hrs_work? What is the median? (There are two ways to get this info.)
Approach 1:
What is the standard deviation of hrs_work?
What proportion of people in the data set are missing information (have NAs) for this variable? (there are many ways to do this)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00   32.00   40.00   37.98   40.00   99.00    1041
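A minimal sketch of how these questions could be answered. The vector below is a made-up stand-in for acs12$hrs_work (the values are illustrative, not the real data), and the two approaches shown (individual functions versus summary()) are one possible reading of "two ways", not necessarily the ones used in class. In the real output above, 1041 of the 2,000 rows are NA, which is about 52% missing.

```r
# Toy stand-in for acs12$hrs_work; these values are invented for illustration
hrs_work <- c(40, NA, NA, 40, 84, 23, NA, 35)

# One approach: individual functions (na.rm = TRUE drops missing values first)
mean_hrs   <- mean(hrs_work, na.rm = TRUE)
median_hrs <- median(hrs_work, na.rm = TRUE)

# Another approach: summary() reports mean, median, quartiles, and the NA count
summary(hrs_work)

# Standard deviation, again ignoring missing values
sd_hrs <- sd(hrs_work, na.rm = TRUE)

# Proportion missing: is.na() gives TRUE/FALSE, and the mean of that is a proportion
prop_missing <- mean(is.na(hrs_work))
```

Without na.rm = TRUE, mean(), median(), and sd() all return NA as soon as the variable contains a single missing value.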
Right-skewed data
Left-skewed data
time_to_work: Travel time to work, in minutes.
Categorical data: look at the available categories and how many observations are in each category
unique(dataframe$variable)
table(dataframe$variable, useNA = "always")
table(dataframe$variable1, dataframe$variable2, useNA = "always")
summary(dataframe$variable)
sd(dataframe$variable, na.rm = TRUE)
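The templates above can be tried on a small data frame. This is a sketch only: the data frame toy and its values are hypothetical, built to mimic a few acs12 columns.

```r
# Hypothetical mini data set; column names mirror acs12, values are invented
toy <- data.frame(
  employment   = factor(c("employed", "not in labor force", NA,
                          "employed", "unemployed")),
  gender       = factor(c("female", "male", "female", "male", "female")),
  time_to_work = c(15, NA, NA, 40, NA)
)

# Categorical: which categories exist, and how many observations are in each?
unique(toy$employment)
emp_counts <- table(toy$employment, useNA = "always")  # includes an NA count

# Two categorical variables at once (a cross-tabulation)
table(toy$employment, toy$gender, useNA = "always")

# Numeric: five-number summary plus mean, then the standard deviation
summary(toy$time_to_work)
commute_sd <- sd(toy$time_to_work, na.rm = TRUE)
```

Setting useNA = "always" matters because table() silently leaves missing values out of the counts by default.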
You may not be interested in all observations
Example:
- Need to remove people who are unemployed/not in labor force (and don't have a job or a commute!)
filter()
filter(dataframe, condition)
filter(acs12, employment == "employed")
employment == "employed"
==: equal to (note there are two equals signs!)
!=: not equal to
>, >=: greater than, greater than or equal to
<, <=: less than, less than or equal to
Values:
filter(acs12, citizen == "no")
filter(acs12, income <= 12000)
filter(acs12, birth_qrtr != "jan thru mar")
filter(acs12, hrs_work > 20)
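A small sketch of these comparisons in action, assuming the dplyr package is installed. The data frame toy is hypothetical and only imitates a few acs12 columns.

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
toy <- data.frame(
  citizen  = c("yes", "no", "yes", "no"),
  income   = c(60000, 0, NA, 8600),
  hrs_work = c(40, NA, 84, 23)
)

# Text values go in quotes; numbers do not
non_citizens <- filter(toy, citizen == "no")
low_income   <- filter(toy, income <= 12000)
long_hours   <- filter(toy, hrs_work > 20)

# filter() keeps only rows where the condition is TRUE,
# so rows where the comparison is NA are dropped as well
nrow(low_income)
```

filter() returns a new data frame and leaves the original untouched, so the result needs to be saved to a name if it will be used later.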
Two or more requirements
&: and
|: or
filter(acs12, citizen == "no" & lang == "english")
filter(acs12, race == "black" | race == "asian")
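To see the difference between the two operators, here is a sketch on a hypothetical four-row data frame (again assuming dplyr is installed; the values are invented).

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
toy <- data.frame(
  citizen = c("no", "no", "yes", "yes"),
  lang    = c("english", "other", "english", "other"),
  race    = c("black", "white", "asian", "other")
)

# & keeps rows that meet BOTH requirements
both <- filter(toy, citizen == "no" & lang == "english")

# | keeps rows that meet AT LEAST ONE requirement
either <- filter(toy, race == "black" | race == "asian")
```

With &, each added requirement can only shrink the result; with |, each added requirement can only grow it.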
Missing values
is.na()
!is.na()
filter(acs12, !is.na(income))
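A brief sketch of how is.na() and !is.na() behave inside filter(), using a hypothetical one-column data frame (assumes dplyr is installed; the values are invented).

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
toy <- data.frame(income = c(60000, 0, NA, 1700, NA))

# is.na() returns TRUE where a value is missing
is.na(toy$income)

# Keep only rows with a recorded income
has_income <- filter(toy, !is.na(income))

# Keep only rows where income is missing
missing_income <- filter(toy, is.na(income))
```

Note that income == NA does not work for this purpose: comparing anything to NA gives NA rather than TRUE or FALSE, so is.na() is needed.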
Clone and open the project repo now (ex-5-31-yourusername)
Then open the .qmd file and try out some filtering