The 2020 general election was one of the most important and historic elections in the United States, with record turnout and Joe Biden receiving more votes than any presidential candidate in history.
The objective of this project is to analyze voter turnout data for each state of the United States in the 2020 general election. Throughout this tutorial, we will look for trends relating each state to its voter turnout rate and to its number of eligible voters, and examine how the two correlate. Additionally, we will look at the relationship between the winner of the election in each state and the voter turnout rate in that state.
During this step, we collect the data from websites and files. The turnout data comes from https://data.world/government/vep-turnout, which compiled it from each state's election site; we load it into a dataframe with the pandas.read_csv function (more info at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). Additionally, we collected data about the winner of each state from https://graphics.reuters.com/USA-ELECTION/RESULTS-LIVE-US/jbyprxelqpe/. Having both datasets in dataframes lets us clean and manipulate the data in the next step.
import pandas as pd
import numpy as np
df = pd.read_csv("2020 November General Election - Turnout Rates.csv")
winners = pd.read_csv("Winner of Election by State.csv")
df.head()
During this step, we tidy the data so that it is more readable and easier to manipulate and analyze. First, we merge the two datasets so that the dataset with the turnout rates also includes the winner of each state. In the untidied data, most columns are titled "Unnamed: N" and the real column titles sit in the first row. We rename each column to the corresponding title from the first row and then delete that row. Additionally, we drop the Source and State Abv columns, along with a few other columns we do not use, since they are unnecessary.
df = pd.merge(df, winners)
df = df.rename(columns={"Unnamed: 1": "Source", "Unnamed: 2": "DropThis",
"Unnamed: 3": "Total Ballots Counted (Estimate)", "Unnamed: 4": "Vote for Highest Office (President)",
"Unnamed: 5": "VEP Turnout Rate (Total Ballots Counted)", "Unnamed: 6": "VEP Turnout Rate (Highest Office)",
"Denominators": "Voting-Eligible Population (VEP)", "Unnamed: 8": "Voting-Age Population (VAP)",
"VEP Components (Modifications to VAP to Calculate VEP)": "% Non-citizen)", "Unnamed: 10": "Prison",
"Unnamed: 11": "Probation", "Unnamed: 12": "Parole", "Unnamed: 13": "Total Ineligible Felon",
"Unnamed: 14": "Overseas Eligible", "Unnamed: 15": "State Abv"})
df = df.drop(axis=1, labels=['Source', 'DropThis', 'State Abv'])
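# Row 0 holds the real column titles; drop it now that the columns are renamed.
df = df.drop(axis=0, index=0)
# Remove thousands separators so the numeric strings can be parsed later.
df = df.replace(',','', regex=True)
df = df.drop(axis=0, index=1)
# Renumber the rows from 0 without keeping the old index as a column.
df = df.reset_index(drop=True)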
df = df.drop(axis=1, labels=['Overseas Eligible'])
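The header fix above renames each column by hand. As a rough alternative, the first row can be promoted to column names generically. This is only a sketch on a made-up frame: the values, the shortened column titles, and the helper name promote_first_row are all hypothetical and not part of the real CSV or the pipeline above.

```python
import pandas as pd

# Toy frame imitating a CSV whose real header sits in the first data row.
# All values are made up for illustration.
raw = pd.DataFrame({
    "State": ["State", "Alabama", "Alaska"],
    "Unnamed: 1": ["Total Ballots", "1,000,000", "200,000"],
    "Unnamed: 2": ["Turnout Rate", "60.0%", "65.0%"],
})

def promote_first_row(frame):
    """Use the first row as column names, then drop that row."""
    out = frame.copy()
    out.columns = out.iloc[0]
    out = out.iloc[1:].reset_index(drop=True)
    out.columns.name = None  # remove the leftover label on the column index
    return out

tidy = promote_first_row(raw)
print(tidy)
```

This only works when the first row contains a usable title for every column; in the real file some titles live in the original header instead, which is why the tutorial renames columns explicitly.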
Since all the numbers in the data table are currently strings, we have to convert them to numeric types (integers for counts, floats for percentages) in order to analyze them later on.
df['Total Ballots Counted (Estimate)'] = df['Total Ballots Counted (Estimate)'].astype(int)
df['Vote for Highest Office (President)'] = df['Vote for Highest Office (President)'].astype(int)
df['Voting-Eligible Population (VEP)'] = df['Voting-Eligible Population (VEP)'].astype(int)
df['Voting-Age Population (VAP)'] = df['Voting-Age Population (VAP)'].astype(int)
df['Prison'] = df['Prison'].astype(int)
df['Probation'] = df['Probation'].astype(int)
df['Parole'] = df['Parole'].astype(int)
df['Total Ineligible Felon'] = df['Total Ineligible Felon'].astype(int)
df['VEP Turnout Rate (Total Ballots Counted)'] = df['VEP Turnout Rate (Total Ballots Counted)'].str.rstrip('%').astype('float')
df['VEP Turnout Rate (Highest Office)'] = df['VEP Turnout Rate (Highest Office)'].str.rstrip('%').astype('float')
df['% Non-citizen)'] = df['% Non-citizen)'].str.rstrip('%').astype('float')
df
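The same string-to-number conversion can be sketched with pandas.to_numeric, which infers integer or float output from the cleaned strings. The frame below uses made-up values and shortened, hypothetical column names.

```python
import pandas as pd

# Made-up values in the same string formats as the raw table.
toy = pd.DataFrame({
    "Ballots": ["1,234,567", "89,000"],
    "Rate": ["65.5%", "70.0%"],
})

# Strip thousands separators, then parse as integers.
toy["Ballots"] = pd.to_numeric(toy["Ballots"].str.replace(",", "", regex=False))
# Strip the trailing percent sign, then parse as floats.
toy["Rate"] = pd.to_numeric(toy["Rate"].str.rstrip("%"))

print(toy.dtypes)
```

Another option, if the file is being re-read anyway, is the thousands="," parameter of pandas.read_csv, which handles the separators at load time.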
Now that our data is all cleaned up and easy to use, we can begin analyzing it! We start by calculating statistics for the total ballots counted, such as the mean, median, minimum, maximum, and standard deviation (more info on numpy statistics at https://www.tutorialspoint.com/numpy/numpy_statistical_functions.htm). These are the basic statistics for any data set. They will help us see the central tendency and spread and get a better understanding of our data.
mean = np.mean(df['Total Ballots Counted (Estimate)'])
median = np.median(df['Total Ballots Counted (Estimate)'])
mini = np.min(df['Total Ballots Counted (Estimate)'])
maxi = np.max(df['Total Ballots Counted (Estimate)'])
stddev = np.std(df['Total Ballots Counted (Estimate)'])
print('SUMMARY STATS FOR TOTAL BALLOTS COUNTED')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
mean = np.mean(df['Voting-Age Population (VAP)'])
median = np.median(df['Voting-Age Population (VAP)'])
stddev = np.std(df['Voting-Age Population (VAP)'])
print('SUMMARY STATS FOR VOTING AGE POPULATION')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median)
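For comparison, pandas can produce most of these statistics in one call with describe(). One detail worth knowing is that numpy's std defaults to the population formula (ddof=0), while pandas defaults to the sample formula (ddof=1), so the two can disagree slightly. A small sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up ballot counts, only to illustrate the calls.
ballots = pd.Series([100, 200, 300, 400, 1000], name="ballots")

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = ballots.describe()
print(summary)

# np.std uses ddof=0 (population) by default; Series.std uses ddof=1 (sample).
population_std = np.std(ballots.to_numpy())
sample_std = ballots.std()
print(population_std, sample_std)
```

With only about fifty states the difference between the two formulas is small, but it explains why describe() will not exactly match the np.std output above.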
As shown, the mean for total ballots counted is 3,130,066 and the mean for voting-age population is 5,051,080. Because the ratio of the two means equals total ballots divided by total voting-age population, this means that 3,130,066/5,051,080, or about 62%, of voting-age people cast a ballot. Note that the voting-age population includes people who are not eligible to vote, so this figure is lower than the turnout among eligible voters. Now let’s find out which states have the highest turnout rate. We can start by making a bar chart of the voter turnout rate in every state (more info on creating bar graphs at https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.bar.html).
import matplotlib.pyplot as plt
import matplotlib
fig = plt.figure(figsize=(127,50))
plt.rcParams.update({'font.size': 22})
ax = fig.add_axes([0,0,1,1])
# Fix the y-axis to 0-100% on the axes that will hold the bars.
ax.set_ylim([0, 100])
states = df['State']
turnout = df['VEP Turnout Rate (Total Ballots Counted)']
matplotlib.rc('ytick', labelsize=100)
# Draw the bars first so the state tick labels exist before styling them.
plt.bar(states, turnout, align='center', alpha=0.5)
plt.setp(ax.get_xticklabels(), fontsize=100, rotation='vertical')
plt.title("Voter Turnout Rate Per State", fontsize = 200)
plt.xlabel("State", fontsize = 150)
plt.ylabel("Voter Turnout Rate (%)", fontsize = 150)
plt.show()
Based on the graph, Minnesota appears to have the highest voter turnout percentage and Oklahoma the lowest. Minnesota has a long-standing reputation for clean elections and for making it easy to vote. It is one of the few states that allow same-day voter registration, and voters can submit absentee ballots starting 40 days before the election (https://www.minnpost.com/politics-policy/2016/09/five-reasons-why-voter-turnout-minnesota-so-high/). Oklahoma’s low turnout rate also makes sense, as its felon population is quite high and Oklahoma state law makes felons ineligible to vote (https://tulsaworld.com/news/local/govt-and-politics/the-recipe-for-oklahomas-low-voter-turnout-rate/collection_c40200d0-d6ca-11e8-8027-cf0e5c78159d.html#6).
Now let’s find out what states have the highest ballot count. We will make a bar chart for the total ballots counted in every state.
fig= plt.figure(figsize=(127,50))
plt.rcParams.update({'font.size': 22})
ax = fig.add_axes([0,0,1,1])
states = df['State']
turnout = df['Total Ballots Counted (Estimate)']
matplotlib.rc('ytick', labelsize=100)
plt.bar(states, turnout, align='center', alpha=0.5)
plt.setp(ax.get_xticklabels(), fontsize=100, rotation='vertical')
plt.title("Total Ballots Counted Per State", fontsize = 200)
plt.xlabel("State", fontsize = 150)
plt.ylabel("Total Ballots Counted (Tens of Millions)", fontsize = 150)
plt.show()
The graph shows us that California is the state with the highest ballot count and Wyoming is the state with the lowest. This makes sense, as California has the largest population in the United States and Wyoming has the smallest.
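Reading the tallest and shortest bars off a chart with this many states is error-prone, so it can help to confirm the extremes numerically. A minimal sketch on a made-up frame (state names "A" to "D" and the rates are placeholders, not real results):

```python
import pandas as pd

# Made-up turnout rates for illustration only.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Turnout": [61.0, 79.5, 55.2, 70.1],
})

# idxmax/idxmin give the row labels of the largest and smallest values.
highest = toy.loc[toy["Turnout"].idxmax(), "State"]
lowest = toy.loc[toy["Turnout"].idxmin(), "State"]
print(highest, lowest)

# nlargest returns the top rows, already sorted in descending order.
top_two = toy.nlargest(2, "Turnout")["State"].tolist()
print(top_two)
```

The same calls should work on the tutorial's dataframe by substituting its column names, for example 'VEP Turnout Rate (Total Ballots Counted)' or 'Total Ballots Counted (Estimate)'.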
Next, we will graph the voter turnout rates over all Biden-winning states (blue states) vs. Trump-winning states (red states). First we will calculate the total number of ballots cast in red states and the total number of ballots cast in blue states. Then we will add up the total number of eligible voters in red states and the total number of eligible voters in blue states, and divide the number of ballots in red states by the number of eligible voters in red states and the same for blue states. This gives us the voter turnout in red states vs. blue states.
trump_ballots = 0
red_eligible_voters = 0
biden_ballots = 0
blue_eligible_voters = 0
# Accumulate ballots and eligible voters separately for each winner.
for i, row in df.iterrows():
    if row['Winner of State'] == 'Trump':
        trump_ballots += row['Total Ballots Counted (Estimate)']
        red_eligible_voters += row['Voting-Eligible Population (VEP)']
    else:
        # Use += so the totals cover every Biden-winning state, not only the last one.
        biden_ballots += row['Total Ballots Counted (Estimate)']
        blue_eligible_voters += row['Voting-Eligible Population (VEP)']
red_voter_turnout = trump_ballots/red_eligible_voters
blue_voter_turnout = biden_ballots/blue_eligible_voters
print("Trump-Winning States Voter Turnout: ")
print(red_voter_turnout)
print("Biden-Winning States Voter Turnout: ")
print(blue_voter_turnout)
candidates = ['Trump-Winning States', 'Biden-Winning States']
turnout = [red_voter_turnout*100, blue_voter_turnout*100]
matplotlib.rc('xtick', labelsize=10)
matplotlib.rc('ytick', labelsize=10)
barlist = plt.bar(candidates, turnout, align='center', alpha=0.5, color = 'blue')
plt.title("Voter Turnout Rates for States Based on their Election Winners", fontsize = 15)
plt.xlabel("Type of State", fontsize = 15)
plt.ylabel("Voter Turnout Rate (%)", fontsize = 15)
barlist[0].set_color('r')
plt.show()
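The pooled turnout described above (total ballots divided by total eligible voters within each group) can also be sketched with a groupby instead of an explicit loop. The numbers and the shortened column names below are made up for illustration. The sketch also shows that the pooled rate weights large states more heavily than a plain average of per-state rates does.

```python
import pandas as pd

# Made-up numbers; column names are shortened for the example.
toy = pd.DataFrame({
    "Winner": ["Trump", "Biden", "Trump", "Biden"],
    "Ballots": [600, 1500, 300, 700],
    "VEP": [1000, 2000, 500, 1000],
})

# Sum ballots and eligible voters within each group, then divide.
totals = toy.groupby("Winner")[["Ballots", "VEP"]].sum()
pooled = totals["Ballots"] / totals["VEP"]
print(pooled)

# Compare with the unweighted mean of the per-state rates.
toy["Rate"] = toy["Ballots"] / toy["VEP"]
unweighted = toy.groupby("Winner")["Rate"].mean()
print(unweighted)
```

Which of the two is appropriate depends on the question: the pooled rate describes voters in those states as a whole, while the unweighted mean describes the typical state.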
Next let's create a boxplot showing the overall distribution of voter turnout among Trump winning and Biden winning states. We will create this boxplot using seaborn, a Python data visualization library (more info at https://seaborn.pydata.org/generated/seaborn.boxplot.html).
import seaborn as sns
matplotlib.rc('xtick', labelsize=10)
colors = ['red', 'blue']
ax = sns.boxplot(x=df['Winner of State'], y=df['VEP Turnout Rate (Total Ballots Counted)'], data=df, palette = colors)
plt.title("Voter Turnout Rates for States Based on their Election Winners", fontsize = 15)
plt.xlabel("Winner of State", fontsize = 15)
plt.ylabel("Voter Turnout Rate (%)", fontsize = 15)
The bar graph shows us that turnout was higher in states that Biden won than in states that Trump won, where it was approximately 65%. The box plot shows us that in Trump-winning states the voter turnout rate is concentrated between 62% and 68%, whereas in Biden-winning states it is concentrated between 66% and 75%. This is consistent with the result of the election, which Biden won: turnout was higher in blue states. Additionally, more mail-in voting tends to increase voter turnout. Biden supporters were more likely to vote by mail than Trump supporters, and universal mail-in voting has a positive effect on turnout. Many blue states also offer easier ways of voting; for example, Colorado, Oregon, Washington, and New Jersey all send registered voters their ballots more than two weeks in advance. Most red states do not have these options. It was easier for people in blue states to vote since they were offered more ways to vote, and Biden voters were more likely to send in mail-in or absentee ballots, increasing the voter turnout in these states (more info at https://www.capradio.org/articles/2020/05/18/does-voting-by-mail-lead-to-higher-turnout-in-red-blue-and-purple-states-its-not-that-simple/).
To perform hypothesis testing, let’s use the statsmodels package, which contains functions for statistical analysis including t-testing, linear regression, ANOVA, and much more (more info at https://www.statsmodels.org/stable/index.html).
Let's test the hypothesis that a higher percentage of non-citizens in a state is associated with a higher voter turnout rate for that state. We think that there might be a correlation between the two since states with more immigrants are more diverse, and more diverse areas typically vote for Democrats. Also, based on the prior analysis, states with higher voter turnout seem to vote for Democrats, so there is a possibility that the two are correlated.
To perform this test, we set our null and alternative hypotheses. The null hypothesis is that the slope coefficient of the linear model is not different from zero: H0: β1 = 0 versus Ha: β1 ≠ 0.
import statsmodels.api as sm
X = df[['% Non-citizen)']].values
X = sm.add_constant(X)
y = df['VEP Turnout Rate (Total Ballots Counted)'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
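Since the p-value is < 0.05, we reject our null hypothesis that β1 = 0: there is statistically significant evidence of a linear association between the percent of non-citizens in a state and the voter turnout rate for that state. (A p-value of 0.05 or higher would instead mean there is not enough evidence of a linear relationship between the two.) The sign of the coefficient in the summary gives the direction of the association, and an association alone does not establish causation.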
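As a reference for reading the p-value in a regression summary, the decision rule for H0: β1 = 0 can be sketched as below. This sketch uses scipy.stats.linregress rather than statsmodels and runs on synthetic data with a built-in trend, so the numbers say nothing about the election data; the 0.05 threshold is the conventional significance level used in this tutorial.

```python
import numpy as np
from scipy import stats

# Synthetic data with a built-in linear trend (not the election data).
x = np.arange(20, dtype=float)
wiggle = 0.1 * (-1.0) ** np.arange(20)  # small deterministic "noise"
y = 3.0 + 0.5 * x + wiggle

# linregress reports a two-sided p-value for the null hypothesis slope == 0.
fit = stats.linregress(x, y)

ALPHA = 0.05
if fit.pvalue < ALPHA:
    decision = "reject H0: evidence of a linear relationship"
else:
    decision = "fail to reject H0: not enough evidence of a linear relationship"
print(fit.slope, fit.pvalue, decision)
```

A small p-value leads to rejecting H0 (no relationship), while a large one means the data are compatible with a zero slope.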
Now let's test whether a lower number of total ineligible felons in a state is associated with a higher voter turnout rate for that state. We think that there might be a negative correlation between the two, since more ineligible felons means fewer citizens are able to vote. To perform this test, we again set our null and alternative hypotheses. The null hypothesis is that the slope coefficient of the linear model is not different from zero: H0: β1 = 0 versus Ha: β1 ≠ 0.
import statsmodels.api as sm
X = df[['Total Ineligible Felon']].values
X = sm.add_constant(X)
y = df['VEP Turnout Rate (Total Ballots Counted)'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
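This one also has a p-value < 0.05, and therefore we again reject our null hypothesis that β1 = 0: there is statistically significant evidence of a linear association between the number of ineligible felons in a state and the voter turnout rate for that state. (A p-value of 0.05 or higher would instead mean there is not enough evidence of a linear relationship between the two.) As before, this is evidence of association rather than causation.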
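Now let’s try performing regression with a random forest regressor, using the voter turnout rate as the feature and the voting-eligible population as the target, and scoring it with 5-fold cross-validation.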
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
import seaborn as sns
X = df[['VEP Turnout Rate (Total Ballots Counted)']].values
y = df['Voting-Eligible Population (VEP)'].values
# random_state only takes effect (and is only accepted) when shuffle=True.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    regr = RandomForestRegressor(max_depth=10, random_state=0, n_estimators=100)
    regr.fit(X_train, y_train)
    # score() returns the R^2 of the predictions on the held-out fold.
    scores.append(regr.score(X_test, y_test))
scores
_ = sns.violinplot(y=scores)
_ = plt.title("Random Forest Score Distribution")
_ = plt.ylabel("Score")
In this case, the random forest was not able to accurately predict the voting-eligible population from the voter turnout rate, the single feature we provided. In the future, we could add more features, perform parameter tuning, or try a different regressor to see if these scores improve.
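To interpret the scores, it helps to recall that a regressor's score in scikit-learn is the coefficient of determination, R² = 1 - SS_res / SS_tot. The hand-written helper below (r_squared is a hypothetical name, and the targets are made up) shows why a score can be zero or even negative when the model does no better than predicting the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]  # made-up targets

perfect = r_squared(y_true, y_true)                 # exact predictions
baseline = r_squared(y_true, [2.5] * 4)             # always predict the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed predictions
print(perfect, baseline, worse)
```

Scores near or below zero on the held-out folds therefore indicate that the feature carries little usable information about the target.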
Now we've walked through the entire data science pipeline and performed data curation, parsing, management, exploratory data analysis, hypothesis testing, and machine learning. The main trend that we noticed through analysis was that states with higher voter turnout rates tended to vote for Biden rather than Trump.
Some potential future analyses include examining how voter turnout rate correlates with race, ethnicity, and gender. These are all important factors that affect a state’s voter turnout rate, because many key demographic groups are more likely to vote or to behave in a certain way. Another potential future analysis is to examine voter turnout among people of different education levels and compare this to how states voted, based on the predominant education level in each state. We hope that this tutorial serves as an eye-opener to us all that we as a country need to work to increase our voter turnout rates. Voting is a basic right and a civic duty to fulfill; we should not take it for granted, and we should work to make sure that our freedoms remain secure.
We hope you enjoyed our tutorial! Thanks for reading!