Title: | A Toolbox for Using the CPS’s Voting and Registration Supplement |
---|---|
Description: | Provides automated methods for downloading, recoding, and merging selected years of the Current Population Survey's Voting and Registration Supplement, a large-N national survey about registration, voting, and non-voting in United States federal elections. Provides documentation for appropriate use of sample weights to generate statistical estimates, drawing from Hur & Achen (2013) <doi:10.1093/poq/nft042> and McDonald (2018) <http://www.electproject.org/home/voter-turnout/voter-turnout-data>. |
Authors: | Jay Lee [aut, cre], Paul Gronke [aut], Canyon Foot [ctb] |
Maintainer: | Jay Lee <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-02-20 05:00:44 UTC |
Source: | https://github.com/reed-evic/cpsvote |
This is a 10,000-row sample of the data that comes out of cps_read(years = 2016).
cps_2016_10k
A tibble with 10,000 rows and 17 columns:
Which default file the case came from
Year of interview
State postal abbreviation
Person's age as of the end of survey week; topcoded at 80 and 85
Binary sex
Highest level of school completed or degree received
Race
Hispanic status
Original CPS survey weight
Whether respondent voted in the election; self-reported
Whether respondent was registered to vote in the election; self-reported
Reason for not being registered to vote
Reason for not voting
Whether respondent voted by mail
Whether respondent voted on election day or before
Method of registration
Duration of time living at current address
This is a 10,000-row sample of the data that comes out of cpsvote::cps_load_basic.
cps_allyears_10k
A tibble with 10,000 rows and 25 columns:
Which default file the case came from
Year of interview
State postal abbreviation
Person's age as of the end of survey week; topcoded at 90 until 2002, 80 in 2004, and 80/85 after
Binary sex
Highest level of school completed or degree received
Race
Hispanic status
Original CPS survey weight
Whether respondent voted in the election; self-reported
Whether respondent was registered to vote in the election; self-reported
What time of day respondent voted
Duration of time living at current address
Reason for not voting
Method of voting, pre-2004
Whether respondent had registered to vote since 1995
Whether respondent registered at the DMV
Method of registration
Reason for not being registered to vote
Whether respondent voted by mail, 2004 on
Whether respondent voted on election day or before, 2004 on
A consolidation of VRS_VOTEMETHOD_1996to2002, VRS_VOTEMODE_2004toPRESENT, and VRS_VOTEWHEN_2004toPRESENT
Recode of VRS_VOTE for CPS turnout calculation
Recode of VRS_VOTE for adjusted Hur & Achen turnout calculation
Adjusted weight for calculating voter turnout (per Hur & Achen)
Because the CPS is a fixed-width file that changes data locations (and variable names) across years, to correctly read the data you have to specify which start/end positions correspond to which column names in each year. This is one such specification. To add extra data or change column names, see the Vignette.
cps_cols
A data frame with 204 rows and 8 columns:
year
original column name as given by the CPS
a new name, which tries to describe the variable and join sensibly across multiple years
which character of a line the variable starts with
which character of a line the variable ends with
whether the column is character, numeric, or a factor
the question text/description from the CPS
any notes for question administration or analysis
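As a rough illustration of how a start/end specification drives fixed-width parsing, the base R sketch below cuts two made-up records at the positions listed in a small specification table. The column names start_pos, end_pos, and col_type, the positions, and the records are all assumptions for illustration; they are not the real CPS layout or the actual column names of cps_cols.

```r
# Hypothetical column specification: names and positions are invented for
# illustration and are NOT the real CPS layout.
spec <- data.frame(
  new_name  = c("YEAR", "STATE", "AGE"),
  start_pos = c(1, 5, 7),
  end_pos   = c(4, 6, 8),
  col_type  = c("n", "c", "n"),
  stringsAsFactors = FALSE
)

# Two fake fixed-width records laid out according to `spec`.
raw_lines <- c("2016OR34", "2016WA80")

# Cut every line at the start/end positions given for each column,
# converting to numeric where the specification asks for it.
parsed <- lapply(seq_len(nrow(spec)), function(i) {
  value <- substring(raw_lines, spec$start_pos[i], spec$end_pos[i])
  if (spec$col_type[i] == "n") as.numeric(value) else value
})
names(parsed) <- spec$new_name
parsed <- as.data.frame(parsed, stringsAsFactors = FALSE)
parsed
```

Because the positions live in a data frame rather than in code, supporting a year with a different layout only requires adding rows to the specification.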
Download CPS microdata
cps_download_data( path = "cps_data", years = seq(1994, 2018, 2), overwrite = FALSE )
path |
A file path (relative or absolute) where the downloads should go. |
years |
Which years of data to download. Defaults to all even-numbered years from 1994 to 2018. |
overwrite |
Logical, whether to write over existing files or not. Defaults to FALSE. |
File names will be written in the style "cps_nov2018.zip", with the appropriate years.
The Voting and Registration Supplement is only conducted in even-numbered
years (since 1964), so any entry in years outside of this will be skipped.
Currently the package only supports downloads from 1994 onwards, so any
entry in years before 1994 will be skipped.
## Not run: cps_download_data(path = "cps_data", years = 2016, overwrite = TRUE) ## End(Not run)
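The skipping rules and file naming described above can be sketched in base R. This is only an illustration of the stated behaviour (even-numbered years, 1994 onwards, names in the "cps_nov2018.zip" style), not the package's internal code; the variable names are hypothetical.

```r
# Years a user might request; two of them fall outside the supported set.
requested <- c(1992, 2015, 2016, 2018)

# Keep only even-numbered years from 1994 onwards.
supported <- requested[requested %% 2 == 0 & requested >= 1994]

# Build file names in the documented "cps_nov2018.zip" style.
file_names <- sprintf("cps_nov%d.zip", supported)
file.path("cps_data", file_names)
```

Here 1992 is dropped for being before 1994 and 2015 for being odd, leaving files for 2016 and 2018.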
Download CPS technical documentation
cps_download_docs( path = "cps_docs", years = seq(1994, 2018, 2), overwrite = FALSE )
path |
A file path (relative or absolute) where the downloads should go. |
years |
Which years of documentation to download. Defaults to all even-numbered years from 1994 to 2018. |
overwrite |
Logical, whether to write over existing files or not. Defaults to FALSE. |
File names will be written in the style "cps_nov2018.pdf", with the appropriate years.
The Voting and Registration Supplement is only conducted in even-numbered
years (since 1964), so any entry in years outside of this will be skipped.
Currently the package only supports downloads from 1994 onwards, so any
entry in years before 1994 will be skipped.
## Not run: cps_download_docs(path = "cps_docs", years = 2016, overwrite = TRUE) ## End(Not run)
Because the CPS changes factor levels across years, to correctly read the data you have to specify which numeric codes correspond to which character values in each year. This is one such specification. To add extra data, see the Vignette.
cps_factors
A data frame with 204 rows and 8 columns:
year
original column name as given by the CPS
a new name, which tries to describe the variable and join sensibly across multiple years
the numeric code contained in the raw CPS data
the character value corresponding to each numeric code
These match the exact specifications from the CPS, including NA codes and any typos that occur (e.g., "Hipsanic" is common in older years).
The CPS publishes their data in a numeric format, with a separate
PDF codebook (not machine readable) describing factor values. This function
labels the raw numeric CPS data according to a supplied factor key. Codes
that appear in a given year and are not included in factors will be
recoded as NA.
cps_label( data, factors = cpsvote::cps_factors, names_col = "new_name", na_vals = c("-1", "BLANK", "NOT IN UNIVERSE"), expand_year = TRUE, rescale_weight = TRUE, toupper = TRUE )
data |
The raw CPS data that factors should be applied to |
factors |
A data frame containing the label codes to be applied |
names_col |
Which column of |
na_vals |
Which character values should be considered "missing" across the dataset and be set to NA after labelling |
expand_year |
Whether to change the two-digit year listed in earlier surveys (94, 96) into a four-digit year (1994, 1996) |
rescale_weight |
Whether to rescale the weight, dividing by 10,000. The CPS describes the given weight as having "four implied decimals", so this rescaling adjusts the weight to produce sensible population totals. |
toupper |
Whether to convert all factor levels to uppercase |
CPS data with factor labels in place of the raw numeric data
cps_label(cps_2016_10k)
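The labelling steps described above (look up each numeric code in a key, turn unknown codes into NA, uppercase the levels, blank out the na_vals, and rescale the weight) can be sketched in base R. The key, codes, and weights below are invented for illustration, and the column names code and value are assumptions rather than the real column names of cps_factors.

```r
# Hypothetical factor key for one variable in one year.
key <- data.frame(
  code  = c(1, 2, -1),
  value = c("Male", "Female", "Not in universe"),
  stringsAsFactors = FALSE
)
key$value <- toupper(key$value)           # mirrors toupper = TRUE

raw_codes <- c(1, 2, 2, -1, 9)            # 9 does not appear in the key
na_vals   <- c("-1", "BLANK", "NOT IN UNIVERSE")

# Codes missing from the key come back from match() as NA.
labels <- key$value[match(raw_codes, key$code)]

# Values listed in na_vals are treated as missing after labelling.
labels[labels %in% na_vals] <- NA
labelled <- factor(labels)

# "Four implied decimals": divide the raw weight by 10,000.
raw_weight <- c(15234567, 9876543)
weight <- raw_weight / 10000
```

In this toy input, the code 9 becomes NA because it is absent from the key, and the code -1 becomes NA because its label is listed in na_vals.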
This function is a quick starter to working with the CPS, using all of the
defaults that are baked into this package. Because the data is so large, it
made more sense to ship a "basic" CPS data set as a function rather than as a
package data object (which would have been over 10 MB). This function will
take you from nothing to having some basic CPS data in your environment, with
the option to save this data locally for future ease. A sample of the data
that comes out of this function is provided as cpsvote::cps_allyears_10k.
cps_load_basic(years = seq(1994, 2018, 2), datadir = "cps_data", outdir = NULL)
years |
Which years should be read |
datadir |
The location where the CPS zip files live (or should be downloaded to) |
outdir |
The location where the final data file should be saved to |
## Not run: cps_load_basic(years = 2016, outdir = "data")
Load multiple years of data from the Current Population Survey.
This function will also download the data for you, if it is not present in
the given dir.
cps_read( years = seq(1994, 2018, 2), dir = "cps_data", cols = cpsvote::cps_cols, names_col = "new_name", join_dfs = TRUE )
years |
Which years to read in. This function will read data from files
in |
dir |
The folder where the CPS data files live. These files should follow a naming scheme that contains the 4-digit year of the results in question, and have a ".zip" or ".gz" extension. |
cols |
Which columns to read. This must be a data frame, with required
columns |
names_col |
The column in |
join_dfs |
Whether to combine all of the years into a single data frame,
or leave them as a list of data frames. Defaults to |
a data frame, or list of data frames
## Not run: cps_read(years = 2016, names_col = "new_name")
Read one year of data from the Current Population Survey
cps_read_year( file, cols = cpsvote::cps_cols, names_col = "new_name", year = as.numeric(stringr::str_extract(file, "\\d{4}")) )
file |
Where the fixed-width or zip/gz file for this year's data lives |
cols |
Which columns to read. This must be a data frame, with required
columns |
names_col |
The column in |
year |
Which year is being read; defaults to 4-digit year in file name |
a data frame, with dimensions depending on the year and columns specified
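The default for year pulls the first run of four digits out of the file name. A base R equivalent of that extraction, shown here without stringr and with a made-up file path, looks like this:

```r
# Hypothetical path following the "cps_nov2016.zip" naming style.
file <- "cps_data/cps_nov2016.zip"

# Extract the first four-digit run, as the default `year` argument does.
year <- as.numeric(regmatches(file, regexpr("[0-9]{4}", file)))
year
```

Since only the first match is used, a path whose folder name also contains four digits would yield the folder's digits instead; passing year explicitly avoids that ambiguity.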
When the CPS calculates voter turnout, they consider the values "Don't know",
"Refused", and "No response" to be non-voters; that is, they lump these in
with "No". With increased levels of survey non-response in recent years, this
has caused turnout estimates to artificially deflate when compared to
measures of voter turnout from state election offices. This function adds two
recodes of the original voting variable, one which applies the CPS recoding
where multiple categories map to "No", and one which follows the guidelines
from Hur & Achen (2013) of setting these categories to NA. See the Vignette
for more information on this process.
cps_recode_vote( data, vote_col = "VRS_VOTE", items = c("DON'T KNOW", "REFUSED", "NO RESPONSE") )
data |
the input data set |
vote_col |
which column contains the voting variable |
items |
which items should be "No" in the CPS coding and |
data with two columns attached, cps_turnout and hurachen_turnout,
voting variables recoded according to the process above
cps_recode_vote(cps_refactor(cps_label(cps_2016_10k)))
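To make the difference between the two recodes concrete, the base R sketch below applies both rules to a small invented response vector. The "YES"/"NO" output labels are an assumption for illustration; the actual columns produced by cps_recode_vote may be coded differently.

```r
# Invented responses; the last one is already missing.
vote  <- c("YES", "NO", "DON'T KNOW", "REFUSED", "NO RESPONSE", NA)
items <- c("DON'T KNOW", "REFUSED", "NO RESPONSE")

# CPS rule: the listed items are lumped in with "NO".
cps_turnout <- ifelse(vote %in% items, "NO", vote)

# Hur & Achen rule: the listed items are set to NA instead.
hurachen_turnout <- ifelse(vote %in% items, NA, vote)

# Unweighted share of "YES" among non-missing cases under each rule.
cps_rate      <- mean(cps_turnout == "YES", na.rm = TRUE)
hurachen_rate <- mean(hurachen_turnout == "YES", na.rm = TRUE)
c(cps = cps_rate, hurachen = hurachen_rate)
```

With these six responses the CPS rule gives a rate of 1 in 5, while the Hur & Achen rule gives 1 in 2, because the three non-responses leave the denominator.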
The response sets in certain CPS questions change between years. This function
consolidates several of these response sets across years (and fixes typos
from the CPS documentation), specifically race, Hispanic status, duration of
residency, reason for not voting, and method of registration. Additionally,
this creates a new column VRS_VOTEMETHOD_CON which consolidates multiple
expressions of vote method across years (By Mail, Early, and Election Day)
into one variable.
cps_refactor(data, move_levels = TRUE)
data |
A dataset containing already-labelled CPS data |
move_levels |
Whether to move the levels "OTHER", "DON'T KNOW", and "REFUSED" to the end of each factor's level set |
While consolidating response sets across multiple surveys can be
fraught with peril, this function attempts to combine disparate levels for
race and other CPS variables across multiple years. Some of these are
relatively straightforward typo fixes ("NON-HIPSANIC" should clearly match
"NON-HISPANIC"), but others have differing degrees of subjectivity applied.
Take this function with a grain of salt, as it depends on some exact variable
names you may or may not be using, and recode variables as needed for your
own uses. To explore exactly how these variables were recoded, you can run
table(data$RACE, cps_refactor(data)$RACE) in the console, substituting
your column of interest in for RACE.
cps_refactor(cps_label(cps_2016_10k))
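The kind of consolidation described above can be sketched in base R on a small invented vector: one typo level is merged into its correct spelling, the catch-all levels are moved to the end, and a cross-tabulation shows how old levels map to new ones. This is an illustration of the idea only and does not reproduce the package's actual recoding rules.

```r
# Invented responses, including the typo level found in older CPS years.
hisp <- factor(c("HISPANIC", "NON-HIPSANIC", "NON-HISPANIC",
                 "DON'T KNOW", "REFUSED"))

# Merge the typo into the correctly spelled level.
fixed <- as.character(hisp)
fixed[fixed == "NON-HIPSANIC"] <- "NON-HISPANIC"

# Move "OTHER", "DON'T KNOW", and "REFUSED" to the end of the level set.
tail_levels <- c("OTHER", "DON'T KNOW", "REFUSED")
lv <- sort(unique(fixed))
lv <- c(setdiff(lv, tail_levels), intersect(tail_levels, lv))
fixed <- factor(fixed, levels = lv)

# Cross-tabulate old against new levels to audit the recode.
table(hisp, fixed)
```

The cross-tabulation is the same audit suggested above for cps_refactor output: every original level should land in exactly one consolidated level.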
While the U.S. Census Bureau provides one weight with the CPS, a modified
weight is needed to properly calculate voter turnout. This data set provides
those calculations, according to Hur and Achen (2013). The comparison data
comes from Dr. Michael McDonald's estimates of voter turnout among the
voting-eligible population (VEP). It can be joined with CPS data to
calculate the new weights needed for analysis, using the function
cps_reweight_turnout.
cps_reweight
A tibble with 1,326 rows and 6 columns:
year
state
indicator of turnout in recent election
proportion of turnout indicator, calculated by McDonald
proportion of turnout indicator, calculated by CPS
the factor by which to scale original CPS weights
Turnout data from http://www.electproject.org/home/voter-turnout/voter-turnout-data
This function applies the turnout correction recommended by Hur & Achen
(2013). The data set containing the scaling factor is cpsvote::cps_reweight.
cps_reweight_turnout(data)
data |
the input data set, containing columns |
cps_reweight_turnout(cps_recode_vote(cps_refactor(cps_label(cps_2016_10k))))
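A minimal sketch of the reweighting idea, assuming the scaling factor for each turnout group is the ratio of the external (McDonald) proportion to the CPS proportion for that group in a given state and year. All numbers below are invented, and the variable names are hypothetical rather than the columns of cps_reweight.

```r
# Invented proportions for a single state-year; not real estimates.
vep_turnout <- 0.60   # share voting according to external returns
cps_turnout <- 0.70   # share reporting a vote in the survey

# Assumed scaling factors: external proportion over survey proportion.
scale_yes <- vep_turnout / cps_turnout
scale_no  <- (1 - vep_turnout) / (1 - cps_turnout)

# Ten invented respondents with equal original weights: 7 voters, 3 non-voters.
resp <- data.frame(
  voted  = c(rep("YES", 7), rep("NO", 3)),
  weight = rep(1000, 10),
  stringsAsFactors = FALSE
)

# Scale each respondent's weight by the factor for their turnout group.
resp$turnout_weight <- resp$weight *
  ifelse(resp$voted == "YES", scale_yes, scale_no)

# Weighted turnout before and after the adjustment.
before <- sum(resp$weight[resp$voted == "YES"]) / sum(resp$weight)
after  <- sum(resp$turnout_weight[resp$voted == "YES"]) / sum(resp$turnout_weight)
c(before = before, after = after)
```

Under these assumptions the weighted turnout moves from the survey's 0.70 to the external 0.60, while the total weight stays the same.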
na_if
vectorized na_if
na_ifin(x, y)
x |
the vector to be checked |
y |
the values which should be replaced with NA |
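One way a helper with this behaviour could be written in base R is shown below. The function name na_ifin_sketch is hypothetical and this is not necessarily how the package implements na_ifin; it only illustrates replacing every element of x that appears anywhere in y.

```r
# Hypothetical stand-in for na_ifin(): set elements of x found in y to NA.
na_ifin_sketch <- function(x, y) {
  x[x %in% y] <- NA
  x
}

cleaned <- na_ifin_sketch(c("YES", "-1", "BLANK", "NO"), c("-1", "BLANK"))
cleaned
```

Using %in% means y can hold any number of values, and existing NA entries in x are left untouched.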