Project component 1: Research question proposal

Author

Aidan Combs

Project proposal

Please see this page for instructions on how to write your project proposal. Put your work in this document. Render the document and commit your changes regularly (see instructions here). When you are done, push your changes to GitHub.

Please write both research questions and their accompanying supporting information and push your work to GitHub before class on Thursday, May 25.

Then, by Tuesday, May 30th, incorporate your peers’ feedback into your initial work and add your thoughts on which question you will move forward with (details on the version to submit for grading can be found here).

Setup

In the YAML at the top of the page, replace my name under author with yours.

Install the tidyverse and openintro packages by running the code chunk below (click the green play button). This only needs to be done once.

install.packages("tidyverse")
install.packages("openintro")

Then load them with the library function by running the code chunk below. This needs to be done once per session (i.e., once each time you open this document).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Research question 1

Research question: How was the change in a person’s income during the COVID-19 pandemic associated with the industry in which they were employed?

Target population: Non-retired adults who were employed before the pandemic hit.

Hypothesis: I hypothesize that people employed in service-related industries saw greater income drops than people in other industries. I expect this because jobs in service-related industries are more likely to depend on in-person interactions with customers, and these interactions were particularly disrupted by the pandemic.

Data set: I plan to use the reddit_finance data set from the openintro package to answer this question. These data are from a survey that was distributed to users of the subreddit r/financialindependence in 2020. It is an observational study. The sample contains 1,998 Reddit users.

Variables: To answer this question, I would use the variables “pan_inc_chg”, which measures whether someone’s income increased, decreased, or stayed the same as a result of the pandemic, and “industry”, which measures the industry that the respondent was employed in.
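As a first look at how these two variables relate, the comparison could be sketched as below. This is only a minimal sketch, not a final analysis: the helper name income_change_by_industry is hypothetical, and it assumes a data frame with the columns “industry” and “pan_inc_chg” as they appear in reddit_finance. It drops respondents missing either variable and computes, within each industry, the share of respondents in each income-change category.

```r
library(dplyr)

# Hypothetical helper (not part of the assignment template):
# share of respondents in each income-change category, within each industry.
income_change_by_industry <- function(data) {
  data %>%
    # Drop respondents who did not report an industry or an income change
    filter(!is.na(industry), !is.na(pan_inc_chg)) %>%
    # Count respondents in each industry x income-change cell
    count(industry, pan_inc_chg, name = "n") %>%
    # Convert counts to proportions within each industry
    group_by(industry) %>%
    mutate(prop = n / sum(n)) %>%
    ungroup()
}
```

With the openintro package loaded, calling income_change_by_industry(reddit_finance) would return one row per combination of industry and income-change category. Proportions are used rather than raw counts because the industries are unlikely to be equally represented among respondents.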

Strengths and weaknesses: A strength of this data set is that it contains detailed measures of both pandemic-related income change and employment sector. It has two main weaknesses. First, it uses a convenience sample of Reddit users active in the r/financialindependence subreddit. This group is likely much more interested in, and perhaps more knowledgeable about, finances than people who do not spend time in this subreddit, so their financial trajectories during the pandemic likely differed from those of others. This limits the generalizability of the results. Second, the “pan_inc_chg” variable is a self-reported estimate, not an exact value. People may have misremembered or misjudged their income change. This limits the accuracy of the analysis.

glimpse(reddit_finance)
Rows: 1,998
Columns: 65
$ num_incomes             <chr> "1", "2", "1", "1", "1", "1", "1", "2", "2", "…
$ pan_inc_chg             <chr> "Stayed the same", "Stayed the same", "Increas…
$ pan_inc_chg_pct         <chr> "No change", "No change", "1-10%", "41-50%", "…
$ pan_exp_chg             <chr> "Decreased", "Decreased", "Decreased", "Decrea…
$ pan_exp_chg_pct         <chr> "11-20%", "1-10%", "1-10%", "11-20%", "11-20%"…
$ pan_fi_chg              <chr> "No change", "No change", "No change", "No cha…
$ pan_ret_date_chg        <chr> "No change", "No change", "No change", "Become…
$ pan_financial_impact    <chr> "Neutral", "Positive", "Neutral", "Positive", …
$ political               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ race_eth                <chr> "White / Caucasian", "White / Caucasian", "Whi…
$ gender                  <chr> "Male", "Male", "Male", "Male", "Male", "Male"…
$ age                     <chr> "24-28", "29-33", "29-33", "24-28", "29-33", "…
$ edu                     <chr> "Bachelor's Degree", "Bachelor's Degree", "Bac…
$ rel_status              <chr> "Single, never married", "Married", "In a rela…
$ children                <chr> "Do not have children, but intend to", "Do not…
$ country                 <chr> "Australia", "Australia", "Australia", "Austra…
$ fin_indy                <chr> "No", "No", "No", "No", "No", "No", "No", "No"…
$ fin_indy_num            <dbl> 3500000, 1000000, 1000000, 750000, 2800000, 10…
$ fin_indy_pct            <dbl> 12.00, 26.00, 8.00, 27.00, 78.00, 7.65, 2.00, …
$ retire_invst_num        <dbl> 2500000, 1250000, 1750000, 2000000, 4800000, N…
$ tgt_sf_wthdrw_rt        <dbl> 4.00, 4.00, 4.00, 4.00, 2.50, 3.50, 3.50, 3.75…
$ max_retire_sup          <dbl> 20000, 5000, NA, 20000, 89000, NA, 40000, 3500…
$ retire_exp              <dbl> 70000, 50000, 50000, 80000, 80000, 40000, 1500…
$ whn_fin_indy_num        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fin_indy_lvl            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ retire_age              <chr> "49-53", "34-38", "44-48", "39-43", "44-48", "…
$ stp_whn_fin_indy        <chr> "Partially", "Partially", "No", "No", "Undecid…
$ industry                <chr> "Financial Services", "Utilities", "Public Adm…
$ employer                <chr> "Public corporation", "Public corporation", "P…
$ role                    <chr> "Management - Middle / Lower", "Professional (…
$ ft_status               <chr> "For an Organization (Employed)", "For an Orga…
$ pt_status               <chr> NA, NA, NA, NA, NA, NA, NA, "N/A", NA, NA, NA,…
$ gig_status              <chr> NA, NA, NA, NA, NA, NA, NA, "For Myself (Self-…
$ ne_status               <chr> NA, NA, NA, NA, NA, NA, NA, "N/A", NA, NA, NA,…
$ edu_status              <chr> "Not a student", "Not a student", "Not a stude…
$ housing                 <chr> "Rent", "Rent", "Rent", "Own", "Live with fami…
$ home_value              <dbl> 0, 0, NA, 565000, NA, 0, NA, 0, 320000, 0, 100…
$ brokerage_accts_tax     <dbl> 589000, 160000, 10000, 205000, 870000, 29000, …
$ retirement_accts_tax    <dbl> 74000.00, 90000.00, 39000.00, 41000.00, 130000…
$ cash                    <dbl> 10000.00, 10000.00, 28000.00, 25000.00, 260000…
$ invst_accts             <dbl> NA, 0.00, NA, 0.00, NA, 21400.00, 0.00, 0.00, …
$ spec_crypto             <dbl> 180, 0, NA, 30000, NA, 421, 1000, 0, 0, 0, 100…
$ invst_prop_bus_own      <dbl> NA, 0, NA, 0, 2880000, 0, 0, 400000, 0, 100000…
$ other_val               <dbl> 5000, NA, NA, 0, NA, NA, NA, 0, 30000, NA, 400…
$ student_loans           <dbl> 12000, 10000, 33000, 127000, NA, NA, 40000, 20…
$ mortgage                <dbl> NA, NA, NA, 505000, NA, NA, NA, NA, 33000, NA,…
$ auto_loan               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ credit_personal_loan    <dbl> 3500, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ medical_debt            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ invst_prop_bus_own_debt <dbl> NA, NA, NA, NA, 2130000, NA, NA, 206000, NA, N…
$ other_debt              <dbl> 220000, NA, NA, NA, NA, NA, NA, NA, NA, NA, 18…
$ `2020_gross_inc`        <dbl> 95000, 160000, 120000, 121000, 260000, 89000, …
$ `2020_housing_exp`      <dbl> 14000, 20000, 17500, 10400, NA, 11000, 27000, …
$ `2020_utilities_exp`    <dbl> 1300, 1500, 700, 480, NA, 650, 2000, 1400, 493…
$ `2020_transp_exp`       <dbl> 2300, 2000, 1154, 6600, NA, 700, 2000, 3700, 4…
$ `2020_necessities_exp`  <dbl> 18200, 6000, 16000, 5400, 10000, 8300, 4000, 1…
$ `2020_lux_exp`          <dbl> 2400, 2000, 5000, 2400, 5000, 1800, 20000, 200…
$ `2020_child_exp`        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 3471, NA, 2000…
$ `2020_debt_repay`       <dbl> 19200, NA, 8767, NA, NA, NA, 5000, 13000, NA, …
$ `2020_invst_save`       <dbl> 15000, 70000, 10000, 50000, 150000, NA, 40000,…
$ `2020_charity`          <dbl> 40, NA, NA, NA, NA, NA, NA, NA, 360, 500, 500,…
$ `2020_healthcare_exp`   <dbl> 70, NA, 450, NA, NA, 1400, 2000, 630, 1720, 36…
$ `2020_taxes`            <dbl> 23000, NA, 36500, 30000, 95000, 20000, 39000, …
$ `2020_edu_exp`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1000, 4400…
$ `2020_other_exp`        <dbl> NA, NA, 2500, NA, NA, NA, NA, NA, NA, NA, 1500…

Research question 2

Research question: How does the number of coaches assigned to US college sports teams vary by the gender of the team?

Target population: College sports teams in the US

Hypothesis: I predict that for sports in which there is both a men’s and a women’s team, coaching levels will be equivalent. I expect this because universities are required to adhere to Title IX, which would prohibit differential levels of resources for equivalent sports.

Data set: These data were reported by Duke and UNC annually from 2003 to 2021. It is an observational study. All coaches from both institutions are represented in the data set. It contains 879 observations (team-years).

Variables: I will use the teamgender, sport, and ncoaches variables to answer this question.
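One way these three variables could be combined is sketched below. This is a rough sketch under stated assumptions rather than a final analysis: the helper name coaches_by_gender is hypothetical, and I am assuming that teamgender is coded as "men" and "women" (only "men" is visible in the preview of the data). Because my hypothesis concerns sports that have both a men’s and a women’s team, the sketch keeps only those sports before averaging ncoaches across team-years.

```r
library(dplyr)

# Hypothetical helper (not part of the assignment template):
# mean number of coaches per team-year, by sport and team gender,
# restricted to sports that have both a men's and a women's team.
coaches_by_gender <- function(data) {
  data %>%
    # Assumption: team gender is coded "men" / "women"
    filter(teamgender %in% c("men", "women")) %>%
    # Keep only sports in which both team genders appear
    group_by(sport) %>%
    filter(n_distinct(teamgender) == 2) %>%
    # Average coaches over team-years within each sport x gender cell
    group_by(sport, teamgender) %>%
    summarize(
      mean_coaches = mean(ncoaches, na.rm = TRUE),
      team_years = n(),
      .groups = "drop"
    )
}
```

Once the sports data are loaded, coaches_by_gender(sports) would give two rows per qualifying sport, which makes the men’s and women’s averages easy to compare side by side. Sports with a team for only one gender (football, for example, if it has no women’s team in these data) would be excluded.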

Strengths and weaknesses: A strength of this data set is that it is administrative data, and so represents all coaches who were employed by these institutions. A weakness is that only two institutions are represented, and both are highly selective institutions with strong athletics programs. This limits the generalizability of the results to institutions that are less selective or that have less well-resourced athletics programs.

sports <- readRDS("sports_teams.rds") # This data set is not part of openintro, so this line is needed to load it from a file instead.

glimpse(sports)
Rows: 879
Columns: 8
$ school             <chr> "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Du…
$ year               <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 200…
$ division           <chr> "NCAA Division I-A", "NCAA Division I-A", "NCAA Div…
$ teamgender         <chr> "men", "men", "men", "men", "men", "men", "men", "m…
$ sport              <chr> "Baseball", "Basketball", "Fencing", "Football", "G…
$ ncoaches           <dbl> 3, 4, 1, 10, 2, 3, 3, 2, 2, 3, 3, 4, 4, 8, 6, 13, 2…
$ nplayers           <dbl> 33, 15, 19, 81, 12, 45, 23, 32, 11, 50, 35, 35, 30,…
$ player_coach_ratio <dbl> 11.000000, 3.750000, 19.000000, 8.100000, 6.000000,…

Post-peer review reflections

I think I should pursue research question 2, on the level of staffing of college sports teams by team gender. Although I believe both research questions are sufficiently specific, feasible, and sociological, I think the data set I would use to answer question 2 is a better fit for the question than the data set I would use to answer question 1. The fact that my data set for question 1 is a convenience sample of users of one particular subreddit dramatically reduces the ability of any analysis to generalize to a broad population. The data set that I would use for question 2 is also limited by its sample: it contains coaches at just two universities. However, it contains every coach who worked at each of those institutions, and those institutions are representative of a broader group of universities that would be interesting to generalize to (selective universities with strong athletics programs).