The Problem With Polling

I have gotten a lot of questions about political polls lately and I have found myself having the same conversation over and over about the reliability of polling in general. Those conversations have centered around the concept of Total Survey Error (TSE), or all the different ways that a survey can go wrong. So, I thought I would take a break from what I should be doing and write about the five basic forms of TSE.

Tallying the results of the presidential election on Nov. 2, 1948.

Coverage Error – This form of error occurs when your sampling frame (i.e. your list of people to potentially poll) does not accurately represent the population you are measuring. For example, if you conduct a poll by phone and your list only includes landlines, you leave out everyone who does not have a landline. “Dewey Defeats Truman” is a classic example of this kind of TSE. Pollsters that year (1948) only surveyed people with telephones, who were typically far wealthier and more likely to vote for Dewey than for Truman, the eventual winner. This coverage issue also led to nonresponse bias (see below).

Specification Error – This error occurs when what is being measured isn’t clearly defined. Typically, this applies to psychological constructs, which are often multidimensional. A political example would be ideology: we know that most people’s political beliefs lie along a spectrum, and those beliefs may be nuanced and context dependent. The Pew Research Center has an excellent example of measuring ideology as a construct. Fortunately, there is an easy way around this for political polls: ask respondents specifically which candidate(s) they are voting for.

Nonresponse Error – This form of bias has to do with who responded to the poll and, relatedly, who didn’t. It can be unit nonresponse (i.e. someone refuses to participate) or item nonresponse (i.e. someone refuses to answer a specific question). Again using the phone poll example: even if your call list includes all numbers (cell phones and landlines), people with caller ID are less likely to pick up, and almost all cell phones have caller ID built in. This means that people with landlines – typically older people – are more likely to answer; younger people, less so.

Measurement Error – This is probably the most well-studied form of error in survey methodology, because it has so many parts. The order of the questions, the tone of the interviewer’s voice, the interviewer’s appearance, and the wording of the questions themselves may all unintentionally cause someone to answer a certain way. For example, I have seen many projections based solely on party identification, which does not account for people who plan on voting for one party in every race except one (i.e. “ticket splitters“). I imagine there will be a large number of people who cast their votes for all but one member of their preferred party this election. If you want to see an example of how not to predict an outcome, I humbly submit this one as an example of both specification error and measurement error.

Processing Error – Processing error covers all the ways that things can go wrong with the data AFTER it is collected. This can occur during encoding, editing, and weighting. The weighting piece is especially tricky, because it adjusts results based on known population parameters. For example, if 80% of a poll’s respondents are female, we would need to weight the male respondents up, because that population parameter is known to be roughly 50%. Now imagine that we are also accounting for race, income, education level, and age; things get complicated in a hurry. One strategy for handling this is an iterative approach known as “raking.”
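The raking strategy just mentioned can be sketched as iterative proportional fitting: scale the weighted cells to match each known population margin in turn, and repeat until all margins agree. The numbers below are hypothetical, not from any real poll.

```python
# Raking (iterative proportional fitting) sketch -- hypothetical numbers.
def rake(cells, row_targets, col_targets, iters=100, tol=1e-10):
    """cells: 2-D table of weighted sample counts.
    row_targets / col_targets: known population margins."""
    for _ in range(iters):
        # Scale each row to hit its target margin...
        for i, target in enumerate(row_targets):
            s = sum(cells[i])
            cells[i] = [c * target / s for c in cells[i]]
        # ...then each column; alternate until both sets of margins converge.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / s
        if all(abs(sum(cells[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return cells

# An 80%-female sample: rows = gender, columns = a second trait (e.g. age group).
sample = [[50.0, 30.0],   # female
          [12.0,  8.0]]   # male
raked = rake(sample, row_targets=[50, 50], col_targets=[60, 40])
print([round(sum(row), 3) for row in raked])  # gender margins now 50/50
```

With more variables (race, income, education, age), the same alternating loop simply cycles through more margins, which is exactly why the weighting step gets complicated in a hurry.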

Supporters of presidential candidate Hillary Clinton watch televised coverage of the U.S. presidential election at Comet Tavern in the Capitol Hill neighborhood of Seattle on Nov. 8. (Photo by Jason Redmond/AFP/Getty Images)

So, what does all this mean? There are lots of ways things can go wrong, and good surveys are incredibly expensive. They take time to construct and a shocking amount of money and manpower to collect. Also, many political polls are commissioned to drive media viewership, which means they are often more concerned with expediency than accuracy. That right there should be enough to give you pause.

The 2016 election gave polling – and to a certain extent, statistics – a bad name. However, many people don’t realize that the national polls (i.e. the popular vote) were right on the money. The popular vote is one model. The electoral college tally is 51 models (all 50 states plus DC), which may take different strategies for collecting and analyzing, depending on the state. That’s a lot of room for mistakes. If we want to predict who will likely win the popular vote, the statistical evidence that Biden will win is pretty solid. Does that mean it is a certainty? Objectively, no. Of course, the election is decided by the electoral college, which again is 51 separate models. Some of those states are pretty clear. Others, not so much.

“A margin of error of plus or minus 3 percentage points at the 95 percent confidence level means that if we fielded the same survey 100 times, we would expect the result to be within 3 percentage points of the true population value 95 of those times.”

5 key things to know about the margin of error in election polls
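The quoted definition corresponds to the standard formula MOE = z · √(p(1 − p)/n). A quick sketch, using the conservative p = .5 (my assumption, not something the quote specifies):

```python
# Margin of error at the 95% confidence level (z = 1.96), worst-case p = .5.
import math

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# A typical national poll of ~1,000 respondents:
print(round(100 * margin_of_error(1000), 1))  # → 3.1 (percentage points)
```

Note that halving the margin of error requires roughly quadrupling the sample size, which is one more reason good surveys are so expensive.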

Finally, it appears that we are headed for record levels of turnout, due in part to enthusiasm, mail-in voting, COVID-19, etc. The unprecedented nature of these factors makes polling even more prone to error. I would encourage anyone following the polls closely to lower their expectations considerably. That doesn’t mean the polls are wrong, but they should be viewed with a healthy amount of circumspection. With that being said, if you are like me and cannot help yourself, look at Nate Cohn’s and Nate Silver’s work. It is typically the most robust and transparent, and not surprisingly, theirs are often the most accurate predictions.

Tl;dr – Ignore the polls. We won’t really know much of anything until we see actual vote totals being counted. The rest is just theater.

EDA: Open & Closed Data


Each summer, nearly 8,000 incoming students attend New Student Orientation (NSO) at Penn State’s University Park campus. At the conclusion of NSO, each student is sent a survey to gather their perspectives on various aspects of their session. Questions range from their experiences at check-in to their understanding of student services and various initiatives, such as Penn State’s commitment to diversity and inclusion. These data provide an opportunity to assess which aspects of NSO warrant further exploration.

Cleaning & Inspecting the Data

Survey results were provided by the office of New Student Orientation for the sessions occurring in the summers of 2017, 2018, & 2019. Each of these databases was inspected to find which variables were consistent across all three spreadsheets. Variables that were not consistent across each survey were discarded; the remaining variables were combined into one master database, coded by their respective years. The variables that were consistent across all three years were:

  • Leader Connection – The extent to which a meaningful connection was made with their Orientation Leader during NSO.
  • Substances – The extent to which their understanding changed related to the consequences of alcohol and drug use and abuse during NSO.
  • Assault Resources – The extent to which their understanding changed as a result of attending NSO related to reporting and support services Penn State provides for victims of sexual harassment and sexual assault.
  • Bystander Prevention – The extent to which their understanding changed as a result of attending NSO related to how to handle dangerous situations.
  • Health Resources – The extent to which their understanding changed as a result of attending NSO related to support services Penn State provides for mental health, physical health, and personal well-being.
  • Safety Resources – The extent to which their understanding changed as a result of attending NSO related to support services Penn State provides to help keep them safe.
  • Diversity / Inclusion – The extent to which their understanding changed related to the importance of diversity and inclusion on our campus.
  • Definition of Consent – An open-ended survey question asking participants to define the term “consent.”

The final step in cleaning the data was to remove any personally identifiable information and missing values. Basic demographic information, such as race, gender, sexual orientation, residency status, and matriculation date, was maintained but not utilized for this analysis. Any observations that broke off from the survey prior to answering the 8 variables of interest were discarded as well.

Exploratory Data Analysis

Once data were gathered and cleaned, an exploratory data analysis was conducted to examine patterns in the data. Survey questions that utilized a Likert scale were compared against one another, revealing similar, right-skewed distributions on each of the factors, with the exception of leader connection, which was more widely distributed among Likert responses (figure 3.1). The leader connection variable, when examined in a bar chart grouped by year (figure 3.2), showed the widest variety of distributions in comparison to the remaining variables visualized in the same way (figure 3.3). Since Likert scale data is ordinal in nature, a Kruskal-Wallis test was conducted on each variable to examine differences by year, followed by a post-hoc analysis using the Dunn-Bonferroni correction to reveal where differences occurred. Each variable showed an upward trend in Likert responses over time, with statistically significant differences (α < .05) in each variable over time, with the exception of the variable measuring the importance of diversity and inclusion at Penn State.
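For illustration, the Kruskal-Wallis H statistic can be computed from scratch. This is a toy implementation with made-up data, not the actual NSO analysis, and it omits the Dunn-Bonferroni post-hoc step:

```python
# Kruskal-Wallis H: rank all observations together (mid-ranks for ties,
# which Likert data produces constantly), then compare rank sums by group.
from collections import Counter

def kruskal_wallis_h(*groups):
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    h = 12 / (n * (n + 1)) * sum(
        sum(ranks[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    # Correction for ties:
    counts = Counter(pooled)
    correction = 1 - sum(t**3 - t for t in counts.values()) / (n**3 - n)
    return h / correction

print(round(kruskal_wallis_h([1, 2, 3], [4, 5, 6]), 3))  # → 3.857
```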

The open-ended survey question asking for a definition of consent revealed interesting results across the three surveys. In 2017 (figure 3.4) & 2018 (figure 3.6), the top two words used to define consent were “consent” and “yes,” respectively. However, in 2019 (figure 3.8), the top two words were “given” and “freely.” The 2019 version of the Results Will Vary interactive play that all NSO students see featured a scene related to consent. In this scene, the acronym F.R.I.E.S. is used to represent that consent must be freely given, reversible, informed, enthusiastic, & specific. Each of these words, along with the aforementioned acronym, occurs in the top 10 words for the 2019 survey, while none of them were found in the top 10 of the previous two NSO years.

Likert Data Comparison (figure 3.1)

Leader Connection (FIGURE 3.2)

Protocols, Services, & Resources (FIGURE 3.3)

2017 Open Ended Data (FIGURE 3.4)

Frequency chart of top 20 Words (FIGURE 3.5)

2018 Open Ended Data (FIGURE 3.6)

Frequency chart of top 20 Words (FIGURE 3.7)

2019 Open Ended Data (FIGURE 3.8)

Frequency chart of top 20 Words (FIGURE 3.9)

Conclusion / Suggestions for the Future

While it is difficult to make inferences from observational data, we can see some trends toward greater understanding among students who attend NSO at Penn State’s University Park campus. These increases in understanding could be due to any variety of factors, including changes in the population of interest (e.g. incoming students) or changes within the NSO experience itself. The open-ended survey data defining consent showed the clearest picture of the differences between years, with the 2019 data pointing clearly toward connections made with a scene dedicated to consent in the Results Will Vary interactive play.

To gain a greater understanding of students’ perceptions of new student orientation, a variety of opportunities exist. A clear definition of what kinds of insights you would like to gain from students regarding NSO should inform question formation. For example, an argument could be made that leader connection is a multi-dimensional construct that cannot be measured accurately with one question. Notably, leader connection was the variable with both the lowest score and the widest distribution of responses among participants; further investigation is warranted. Finally, the use of open-ended survey responses could provide a wealth of feedback on specific initiatives, particularly if they are formed in conjunction with specific experiences during NSO. For example, the F.R.I.E.S. scene from the Results Will Vary interactive play demonstrated clear connections to changes in understanding of consent. Future new student orientations could benefit from exploring these connections in other topics covered within Results Will Vary to measure both the effectiveness of the play and the perceptions of students.

Survey Data Analysis (The Hard Way)


In the summer of 2019, Penn State held 38 New Student Orientation (NSO) sessions and 3 International Student Orientation (ISO) sessions, during which all incoming freshmen watched an interactive play titled “Results Will Vary.” This play touches on a variety of topics related to the college experience and typical pitfalls for students in their first year. At the conclusion of each night of the play, incoming students filled out a card with questions they had about the play and/or the college experience.

The “Results Will Vary” database consists of 7,588 handwritten responses that incoming freshmen provided after being given the following prompt: “What lingering questions do you have regarding the show?” The 2019 freshman class at University Park was approximately 8,000 students, providing a coverage rate close to 100% and a response rate approaching 95%. With this information, we have the opportunity to gain perspective on the effectiveness of the play and the general concerns of incoming students. After reading, counting, and numbering these cards, some themes began to emerge. Questions about consent, the consequences of underage drinking, alcohol, drugs, roommate issues, and various campus services rose to the top.

Sampling Procedure

Due to the size of the dataset and available resources, the decision was made to draw a sample from the full dataset (N = 7,588). Since we knew the size of the population, the desired sample size was calculated using the finite population correction:

Since this analysis is exploratory in nature, the population proportion is unknown, so we chose the most conservative estimate of .5, resulting in a desired sample of 366 cards:

The sample itself was drawn across all observations using a random number generator. A sample of this size provides us with a ± 5% margin of error at the 95% confidence level, and all analysis was conducted strictly on the sampled data.
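The calculation above follows Cochran’s sample-size formula with the finite population correction; a sketch that reproduces the 366-card target:

```python
# n0 = z^2 * p(1-p) / e^2, then corrected for a finite population of size N.
import math

def sample_size(N, p=0.5, e=0.05, z=1.96):
    n0 = z**2 * p * (1 - p) / e**2             # infinite-population size (384.16)
    return math.ceil(n0 / (1 + (n0 - 1) / N))  # finite population correction

print(sample_size(7588))  # → 366
```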

Cleaning & Inspecting the Data

Once the cards were sampled, the text from each one was entered into Excel verbatim and open coded once again. Using RStudio, the text was cleaned by converting it to lowercase and removing punctuation. In text analysis, it is standard practice to remove extremely common words such as “if”, “and”, “but”, and “the,” as they have little to no value in determining key terms in the vocabulary. These words are called “stop words.”

Zipf’s law, named after linguist and mathematician George Zipf, states that given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table. This creates a long-tailed distribution as the number of words approaches infinity, with the most frequent word occurring approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. By removing “stop words”, we trim the bulk of the data along the Zipfian distribution and gain a greater understanding of what the central themes are in the data. Within this database, we removed standard English stop words in addition to a custom list of stop words to uncover the central themes within the data. Those words were “can”, “people”, “get”, “penn”, “state”, “student”, “thing”, “what”, “what’s”, “just”, “one”, “know”, “like”, “students”, and “campus”.
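The stop-word step can be sketched as a simple filter over word counts. The stop list below is a small illustrative subset (standard words plus a few of the custom additions named above), not the full list used in the analysis:

```python
# Count word frequencies after removing stop words.
import re
from collections import Counter

STOP_WORDS = {"if", "and", "but", "the", "a", "is", "to", "of", "what",
              # a few of the custom additions described above:
              "can", "people", "get", "penn", "state"}

def top_terms(text, k=3):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(k)

card = "What is the Responsible Action Protocol? Is drinking and consent covered?"
print(top_terms(card, k=10))
```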

Once “stop words” were removed, the result was 792 unique words, represented in a word cloud below:

Upon further inspection, we see that the most frequent words are associated with alcohol, safety, health services, consent, and the consequences of underage drinking. These can be seen in the frequency plot below:

Developing Themes

After inspecting term frequency, open codes were revisited, resulting in 42 unique codes, with some observations requiring multiple codes. For example, one survey respondent wrote:

“I want to get involved in LGBT clubs / activities, but I am not out to my parents and they aren’t very accepting. Will they find out I am in those organizations (Through the internet or something)? Like, can they see what clubs I join?”

This example fell under the following four codes:

  1. Student resources
  2. FERPA / HIPAA / Privacy
  3. LGBTQ Community, concerns, resources
  4. Clubs

In addition, there were some responses that didn’t fit within any themes, which were simply labeled “miscellaneous / unrelated.” Examples of this were “What is your favorite color?” and “Who’s got the best gas on campus?”

Questions regarding alcohol, underage drinking, Responsible Action Protocol (RAP), and the consequences of alcohol/underage drinking occupied the largest portion of the data. In addition, questions of campus safety, student resources, residence halls, and consent emerged as central themes in the data. Feedback on the show was generally complimentary, with some questions regarding the intended meaning of scenes. For example, some participants cited the “ASMR scene” and the “Rollercoaster Scene” as areas of confusion. The distribution of the codes can be found in the frequency plot of all themes:

Sentiment & Emotional Analysis

Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews of products and services, as well as open-ended survey data. With RStudio, we have access to open-source packages that use natural language processing and text analysis to examine the sentiment and emotional content of each observation. Using the sentimentr package, text files were analyzed by comparing the data against known words that are associated with positive, negative, or neutral sentiments. Each sentence is extracted and scored from -1 to 1, with any sentence scoring above .3 considered positive and any sentence scoring below -.3 considered negative. Sentences falling between those cutoffs are considered neutral. This data was overwhelmingly neutral, with a mean of .031, a median of 0, and a standard deviation of .243. When investigating sentiment by participant and organizing by time, we see patterns in the data suggesting that the sentiment of the audience changed at times throughout the run of the show:
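The ±.3 cutoff rule is easy to state in code. A Python sketch (the actual analysis used the sentimentr R package):

```python
# Classify a sentence-level polarity score using the +/- .3 cutoffs.
def classify(score, cutoff=0.3):
    if score > cutoff:
        return "positive"
    if score < -cutoff:
        return "negative"
    return "neutral"

scores = [0.031, -0.45, 0.62, 0.0, 0.12]  # hypothetical sentence scores
print([classify(s) for s in scores])
# → ['neutral', 'negative', 'positive', 'neutral', 'neutral']
```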

Finally, we conducted an emotional analysis, once again using the sentimentr package in RStudio. This analysis is conducted by comparing known words associated with various emotions (anger, disgust, fear, joy, sadness, surprise, anticipation, trust, etc.) against the data set, classifying each sentence within an emotional category. Trust (n = 153) was shown to be the most common emotion, with more than twice the number of occurrences of the second most common emotion, joy (n = 71). Fear (n = 59), anticipation (n = 57), sadness (n = 38), surprise (n = 20), anger (n = 15), & disgust (n = 13) rounded out the emotions within the sampled data. These results are displayed in the frequency chart and pie chart below:

Conclusion / Suggestions for the Future

The 2019 interactive theater experience, “Results Will Vary,” discussed some of the common pitfalls of first-year students, including sex, consent, alcohol, drug use, and peer pressure. Following each performance of the show, the incoming freshmen who viewed the production were asked: “What lingering questions do you have about the show?” These questions were written on index cards and collected at the end of each orientation session, resulting in 7,588 responses. Survey cards (N = 7,588) were then read, numbered, sampled (n = 366), transcribed, and analyzed. The results indicated that the top questions from students were related to the consequences of alcohol and underage drinking. In addition, questions regarding campus safety, student resources, residence halls, and consent emerged as central themes from the audience members. With the sentimentr package in RStudio, the text was analyzed for overall sentiment and emotional content, which suggested that students enjoyed the show, with the most common emotional response being “trust,” followed by “joy.” This production, which is written and performed by current Penn State students, provides an interesting model for engaging in difficult conversations.

As the Office of New Student Orientation and Penn State’s School of Theatre begin plans for the future, a survey instrument that includes both open- and closed-ended questions may provide a better window into student perceptions and understanding. In addition, implementation of an online questionnaire that can store and share results quickly through mobile devices should be considered. An online questionnaire, if properly executed, should also allow for more granularity within the data and increased speed of data collection and analysis.

2017 US News Rankings (Part 2)

The U.S. News and World Report has collected, compiled, and published a list of the top colleges and universities around the country. This report is based on annual surveys sent to each school as well as general opinion surveys of university faculties and administrators who do not belong to the schools on the list. These rankings are among the most widely quoted of their kind in the United States and have played an important role among students making their college decisions. However, other factors may prove to be meaningful when making these decisions. The data may indicate an interaction between some of the explanatory variables, such as tuition, cost of living, enrollment, and rank, warranting further investigation:

Heat Plot (2017 US News & World Report School Rankings)

Data Description:

The data consists of 222 observations, with 8 variables describing the 2017 edition of the US News University Rankings, as well as cost of living and population by state based on US Census Bureau predictions for 2017. The databases for this analysis are available online and on the US Census Bureau website.

Building the Model:

To determine the model, both stepwise regression and best subsets selection were used to find the best fit. Before stepwise regression, the full model was evaluated:

According to the summary of the full model, the adjusted R-squared is 0.6588, indicating that the full model explains 65.9% of the variance in the response variable. Since the p-value is below .001 (< 2.2e-16), this association does not appear to have occurred by chance. Based on the results of the ANOVA of the full model, we can expect that several explanatory variables, including tuition and enrollment, could be significant predictors for determining the best model fit.

Regression Assumptions of the Final Model:

The next step is to evaluate the regression assumptions. The assumptions are listed below:

  • Linear: Mean ranking at each set of the explanatory variables is a linear function of the explanatory variables.
  • Independent: Observations in the data set do not depend on one another.
  • Normal: Ranking at each set of the explanatory variables is normally distributed.
  • Equal Variance: Ranking at each set of the explanatory variables has equal variance (i.e. homoscedastic).

To analyze linearity and equal variance, a residual vs. fitted value plot is used. To evaluate normality, a normal Q-Q plot is generated:

According to the residual vs. fitted value plot, we see no pattern in the data and conclude that the equal variance assumption has been met. According to the Q-Q plot, we see some deviation at the tails of the distribution, but it appears that the normality assumption has been met.

The Cook’s distance plot indicates three potential outliers influencing the line of best fit. Surprisingly, BYU was not one of the outliers exerting the most leverage on the model. Instead, those were:

  • University of Central Florida (#51) – Rank: 176; Tuition: 22467; Enrollment: 54513.
  • University of Hawaii at Manoa (#56) – Rank: 169; Tuition: 33764; Enrollment: 13689.
  • SUNY College of Environmental Science and Forestry (#141) – Rank: 99; Tuition: 17620; Enrollment: 1839.

Based on these findings, the full model should suffice for concluding that there is a meaningful relationship between tuition, enrollment, and university ranking. However, further exploration is needed to determine whether or not this is the best model to explain the potential relationship between the explanatory variables and the response variable.

Model Development

Following both the stepwise and best subsets regression, we see that tuition, enrollment, and population are recommended as predictors in the regression model:

When comparing the reduced model to the full model through an F-test, we see that there is not a significant difference (p-value: .77) between the two models:

The stepwise regression indicates that the model with tuition, enrollment, and population has an AIC of 1628.18, while another model that includes cost of living has an AIC of 1630. A model that accounts for the interaction between population and cost of living is worth exploring. However, after testing for multicollinearity using Variance Inflation Factors, we see significant evidence of multicollinearity between population & cost of living:

Finally, a Variance Inflation Factor (VIF) test was conducted on the reduced model, which found no evidence of multicollinearity, suggesting that the reduced model is the better fit. Once the VIF test was conducted, the assumptions were checked again, with results similar to the full model and all conditions met. The model was then cross-validated using k-fold cross-validation.
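In the two-predictor case, the variance inflation factor reduces to VIF = 1 / (1 − r²), where r is the correlation between the predictors. A sketch with synthetic, nearly collinear data (not the actual rankings data set):

```python
# VIF for two predictors: how much correlation between them inflates
# the variance of their coefficient estimates.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def vif(x, y):
    return 1 / (1 - pearson_r(x, y) ** 2)

population = [1, 2, 3, 4, 5, 6]                  # synthetic values
cost_of_living = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]  # nearly collinear with population
print(round(vif(population, cost_of_living)))    # far above the usual cutoff of 5
```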

Summary & Conclusions:

After the initial exploratory data analysis (EDA) found here, a number of patterns emerged. Clearly, the cost of tuition is strongly associated with ranking in the US News & World Report. What was not clear is the effect of the other variables (enrollment, region, & cost of living) on the response variable. While enrollment is weakly associated with ranking, it is moderately associated with tuition and cost of living. After initially analyzing the full model, we found statistical evidence of a relationship between tuition and enrollment and university ranking. Best subsets and stepwise regression suggested a model with tuition, enrollment, and population as the predictor variables. Comparing this model against the full model did not yield a significant difference between the two, suggesting that the smaller model with only tuition, enrollment, and population would yield the best results. An additional model (model 3) was investigated to include the interaction of cost of living and population, but it showed significant evidence of multicollinearity between those two variables. Multicollinearity was examined in the reduced model (model 2) using a VIF test, which found no evidence of multicollinearity within that model. Assumptions of linearity, normality, and equal variance were satisfied after examining a plot of residuals vs. fitted values as well as a Q-Q plot of residuals. With a sample of 221 observations and a p-value of less than .001, we have statistical evidence to suggest that tuition, enrollment, and population are significant predictors of performance in the US News and World Report’s Best College Rankings. The final model is summarized below:


As previously stated, cost of living data was available by state rather than by the city or county where the university is located. So, a university located in a community with a high cost of living may be in a state with an overall low COL index score, and vice versa. This eliminates some precision in our predictions. In addition, this list consists of the 231 schools which opted to participate in the US News & World Report Ranking Index. According to US News, there are over 4,000 colleges and universities in the United States. This raises the concern of non-response bias and limits generalizability beyond the scope of participating institutions in the US News Rankings.

One example of this is the University of Minnesota, which chose not to participate in the US News Best College Rankings. Minnesota’s in-state undergraduate tuition and fees are $14,142. The enrollment is 19,819, and the state population is 5,568,155 (in 2017):

This results in a 95% prediction interval of 111 to 266 for the University of Minnesota’s US News Best College Ranking. However, when we compare this to their US News 2019 ranking, we see that Minnesota is ranked #76 (tied with Virginia Tech). This suggests that the model is not an accurate predictor of an individual school’s ranking, but rather serves as an illustration of overall national trends between tuition, enrollment, and population with regard to university ranking in this report.

R Code Chunks:

# Load required packages:
library(corrplot)  # correlation plots
library(mosaic)    # mplot() model diagnostics
library(leaps)     # best subsets regression
library(car)       # variance inflation factors

# Heat Map Correlation Plot:
heat <- cor(rankingsreduced)
corrplot(heat, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)

# Full Model with all variables:
fullmodel <- lm(Rank ~ Tuition + Enrollment + Region + CostOfLiving + Population, data = rankings ) 
# Model Summary (Full Model):
summary(fullmodel)

# ANOVA Table (Full Model):
anova(fullmodel)

# Residuals vs. Fitted:
mplot(fullmodel, which = 1)

# QQ Plot:
mplot(fullmodel, which = 2)

# Cook's Distance:
mplot(fullmodel, which = 4)

#Stepwise Regression:
step(fullmodel, direction="both")

# Best subsets:
BestSubsets <- regsubsets(Rank ~ Tuition + Enrollment + Region + CostOfLiving + Population, data = rankings, method = "exhaustive", nbest = 2)  
Result <- summary(BestSubsets)

# Append fit statistics to include R^2, adj R^2, Mallows' Cp, BIC:
data.frame(Result$outmat, Result$rsq, Result$adjr2, Result$cp, Result$bic) 

# Model #2 (Based on Stepwise & Best Fits):
mod2 <- lm(Rank ~ Tuition + Enrollment + Population, data = rankings)
# Model Summary (Reduced Model):
summary(mod2)

# Model Comparison between Full & Reduced Model:
anova(mod2, fullmodel)

# Model with Interaction of Population & Cost of Living:
mod3 <- lm(Rank ~ Tuition + Enrollment + Population + CostOfLiving + Population:CostOfLiving, data = rankings)  

# Model Summary (Model 3):
summary(mod3)

# Variance inflation factor (Model 3):
VIFtest1 <- lm(Rank ~ Tuition + Enrollment + Population + CostOfLiving + Population:CostOfLiving, data = rankings)
vif(VIFtest1)

# Variance inflation factor (reduced model):
VIFtest2 <- lm(Rank ~ Tuition + Enrollment + Population, data = rankings)
vif(VIFtest2)

# Checking model accuracy against "real world" data:
minnesota <- data.frame(Tuition = 14142, Enrollment = 29819, Population = 5568155)

# Prediction interval (95%):
predict(mod2, minnesota, interval = "prediction")

2017 US News Rankings (Part 1)

Since 1985, the U.S. News and World Report has collected, compiled, and published a list of the top colleges and universities around the country. This report is based on annual surveys sent to each school as well as general opinion surveys of university faculties and administrators who do not belong to the schools on the list. These rankings are among the most widely quoted of their kind in the United States and have played an important role among students making their college decisions. However, other factors may prove to be meaningful when making these decisions. For example, is cost of tuition associated with the ranking of a university? Said another way, do “better” schools cost more money to attend? Are other factors, such as enrollment, state population, cost of living, and region associated with these rankings? For potential students looking to get ahead in a global economy, these may be important considerations, especially for those who come from lower socioeconomic backgrounds.

Objectives & Variables of Interest

The purpose of this study is to investigate the associations of tuition, enrollment, cost of living, population, and region of the country with the 2017 US News & World Report’s Best College Rankings. The variables of interest are university ranking, undergraduate tuition, undergraduate enrollment, cost of living index by state, state population according to 2017 Census data, and region of the country (Northeast, Midwest, South, & West). The response variable is ranking, and the potential explanatory variables are undergraduate tuition, undergraduate enrollment, cost of living index, state population, and region of the country.


Cost of living (COL) data was available by state rather than by community. A university located in a community with a high cost of living may be in a state with an overall low COL index score (and vice versa), which reduces the precision of our predictions. In addition, many schools chose not to participate in this ranking report, which introduces non-response bias into the design.

Exploratory Data Analysis

The first step in exploratory data analysis was to look at the shape of each of the variables (not pictured) before examining the associations between continuous variables, represented in the correlation matrix and the correlation plots below:

Correlation Matrix (2017 US News & World Report School Rankings)
Correlation Plot (2017 US News & World Report School Rankings)

Looking at the data, we see some moderate to strong associations among rank, tuition, and enrollment that warrant further investigation. Before building models, the data was explored by comparing some of these associations by region:

Boxplots of Rank by Region (2017 US News & World Report School Rankings)
Scatter Plot of Rank vs Enrollment, coded by Region (2017 US News & World Report School Rankings)

Finally, we can see the strongest association between two variables by visualizing Rank vs Tuition. When coded by Region we see some slight curvature in the data, but a similar negative shape and slope across parts of the United States:

Scatter Plot of Rank vs Tuition, coded by Region (2017 US News & World Report School Rankings)

Summary & Initial Observations of EDA:

Within the explanatory variables, we see a strong association between cost of tuition and ranking in the US News and World Report metric. One outlier (Brigham Young University–Provo) reports a relatively high ranking (68) in comparison to its tuition ($5,300). This observation appears to be influencing the line of best fit, lowering the correlation coefficient. Even with that influential data point, we still see a strong negative correlation (-.75) between tuition and rank. Enrollment does not appear to have a significant effect on university ranking, but it does appear to be positively associated with tuition (.37) and negatively with cost of living (-.19). Cost of Living Index (COL) and state population appear to have a weak, negative association with university ranking, and a moderate positive association with the cost of tuition. The data may indicate an interaction between some of the explanatory variables, such as tuition, cost of living, enrollment, and rank, warranting further investigation.
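As a quick illustration of how a single influential point can pull a correlation coefficient toward zero, here is a minimal R sketch with made-up tuition/rank numbers. Only the $5,300 tuition and rank of 68 come from the BYU–Provo example above; everything else is fabricated for illustration.

```r
# Illustrative data only: a perfectly linear negative tuition-rank trend
tuition <- seq(10000, 50000, length.out = 30)
rank    <- 100 - 0.002 * tuition          # cor(tuition, rank) is exactly -1

# Add one influential point modeled on the BYU-Provo case
# (tuition $5,300, rank 68 -- far off the fitted line):
tuition_out <- c(tuition, 5300)
rank_out    <- c(rank, 68)

cor(tuition, rank)          # -1
cor(tuition_out, rank_out)  # closer to 0: the single point weakens the correlation
```

The same mechanism, in miniature, is what drags the observed tuition–rank correlation down to -.75 in the real data.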

The relationships between variables are summarized in the heat map below:

Through this EDA, we have a better understanding of the shape and relationships among the variables, which should inform model construction and analysis. The second part of this project tackles those objectives, found here. Code for the plots above can be found below:

R Code Chunks

# Correlation Matrix (sketch -- original code not preserved; numeric columns only):
cor(DATA[, sapply(DATA, is.numeric)])

# Plot 1: Correlation Plot (sketch, using the corrplot package):
library(corrplot)
corrplot(cor(DATA[, sapply(DATA, is.numeric)]))

# Plot 2: Boxplots of Ranking (by Region):
ggplot(DATA, aes(x = Region, y = Rank, fill = Region)) + 
    geom_boxplot(alpha = 0.3)

# Plot 3: Interactive Plot of Tuition vs Rank (by Region)
p3 <- ggplot(data = DATA, aes(x = Tuition, y = Rank)) +
  geom_point(aes(text = paste("Enrollment:", Enrollment)), size = .5) +
  geom_smooth(aes(colour = Region, fill = Region))
ggplotly(p3, tooltip = "text")  # interactivity via the plotly package

# Plot 4: Interactive Plot of Enrollment vs Rank (by Region)
p <- ggplot(data = DATA, aes(x = Enrollment, y = Rank)) +
  geom_point(aes(text = paste("Enrollment:", Enrollment)), size = .5) +
  geom_smooth(aes(colour = Region, fill = Region))
ggplotly(p, tooltip = "text")  # interactivity via the plotly package

# Plot 5: Heat Map Correlation Plot
heat <- cor(DATA)
corrplot(heat, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Exploratory Data Analysis: Spotify Song Popularity

Who doesn’t want to know what makes a song popular? As a musician, I have spent decades trying to get an answer to that question, with little success. People just seem to like what they like. So, when I enrolled in an applied statistics course that took a deep-dive into regression analysis, I got my chance. We were required to conduct an exploratory data analysis (EDA) on a data set of our choosing, but we had to go out and find it. This led me to Kaggle and Spotify’s million song dataset.

The purpose of this EDA was to investigate what variables may influence song popularity while developing a greater understanding of statistical procedures. More specifically, the following questions were to be addressed:

  1. What variables are associated with popularity of song choice by Spotify users? 
  2. Is one variable associated with popularity above others? 
  3. If there is an association, is it linear? 

My intuition before conducting the analysis was that danceability, energy, and valence would be the most highly associated with song popularity, but not necessarily in a linear manner (or in that order).

Description of Data

The original dataset included 228,159 observations and 17 variables describing the Spotify Tracks Database created by Tim Igolo and posted on Kaggle. The data was harvested through Spotify for Developers in April of 2019. Unfortunately, the data included music from soundtracks and “movie music,” as well as opera (but not classical) and a number of other musical styles that could make any kind of regression analysis difficult. So, I filtered the data to include only popular music types, resulting in 130,663 observations (i.e. songs) with 11 variables of interest. Those variables of interest are popularity, acousticness, danceability, duration (in milliseconds), energy, liveness, loudness, mode (major or minor), speechiness, tempo, & valence. The response variable is popularity, and the potential explanatory variables are acousticness, danceability, duration, energy, liveness, loudness, mode, speechiness, tempo, & valence.
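A rough sketch of that filtering step in base R is below. The column name `genre`, the genre labels, and the toy rows are all assumptions for illustration; the actual dataset's labels may differ.

```r
# Toy stand-in for the Spotify export; `genre` and its labels are assumed:
spotify <- data.frame(
  genre      = c("Pop", "Rock", "Opera", "Soundtrack", "Hip-Hop"),
  Popularity = c(81, 74, 30, 45, 88)
)

# Keep only popular-music styles by dropping soundtrack/movie/opera tracks:
spotify_pop <- subset(spotify, !genre %in% c("Soundtrack", "Movie", "Opera"))

nrow(spotify_pop)                # 3 of the 5 toy rows survive the filter
summary(spotify_pop$Popularity)  # quick numeric summary of the response
```

On the real data, the same `subset()` call (with the dataset's actual genre labels) takes the 228,159 rows down to the 130,663 used here.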

Exploratory Data Analysis (EDA)

Originally proposed by John Tukey, inventor of the Tukey test, EDAs are typically used to summarize a dataset’s main characteristics. This can be done through simple summary statistics (measures of central tendency, the five-number summary, etc.), but often includes data visualization as well. Simply put, we use EDAs to look for any patterns or problems in the data, and right away I found one:

This variable indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0. As you can see, there are more songs in minor (79409) than in major (51254), and the mode does not appear to have an appreciable effect on the popularity of a song:

While these two visualizations seem pretty clear, there is one glaring issue: most songs in popular music are neither major nor minor. Instead, they are modal, and they sometimes shift tonalities throughout, so categorizing all of these songs dichotomously is a problem. Upon further investigation, you see some songs listed multiple times and in multiple categories. This is almost certainly due to how Spotify classifies their songs, which is fine, but poses problems when trying to investigate the questions stated above.
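For reference, the tabulation behind counts like those above can be sketched in base R as follows. The rows are toy values, not the real data; in the Spotify encoding, 1 = major and 0 = minor.

```r
# Toy data with the mode and Popularity columns named in this post:
DATA <- data.frame(
  mode       = c(1, 0, 0, 1, 0, 0),
  Popularity = c(50, 62, 55, 47, 71, 58)
)
DATA$mode <- factor(DATA$mode, levels = c(0, 1), labels = c("minor", "major"))

table(DATA$mode)                                 # counts per modality
aggregate(Popularity ~ mode, data = DATA, mean)  # mean popularity by mode
```

Run on the real dataset, `table()` produces the 79,409 / 51,254 split reported above.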

Adjusting Plots for Better Visualization:

I was very new to using R at this point and didn’t have the time to sort through these issues, so I ended up choosing another dataset to analyze. But while I had this data, I decided to explore some ways to visualize it. Fortunately, this dataset did provide some interesting obstacles to overcome with data visualization.

The first problem was the sheer number of observations. I used the mplot function from the mosaic package for many of these initial plots, which is super convenient for beginners in R. However, since the dataset was so large, many of the plots were not helpful, like these two below:

Correlation Plot (i.e. corrplot) of Continuous Variables
Popularity with respect to Speechiness in Spotify Dataset

These first two plots show the relationships between continuous variables, but they are not the most helpful. We can get a sense of some linear relationships in the data, but it’s pretty tough to really see what is going on simply due to the number of data points. So, a solution was to use a heat plot instead to show correlations across the continuous variables:

Heat Plot of Continuous Variables

Similarly, the plot below of these two variables makes the patterns in the data much clearer, simply by changing the size and transparency of the data points themselves:

Popularity with respect to Speechiness in Spotify Dataset (adjusted size & transparency of data points)

While this was an early foray for me into R, and not a dataset I wanted to investigate further to make inferences from, it did provide some interesting obstacles to overcome in using the software to visualize data. Ultimately, my interests gravitated to a different question: investigating factors for school ranking in the US News & World Report, which can be found here, with the code I used to create the plots above here:

R Code Chunks

# Plot 1: Correlation plot of Popularity vs Explanatory Variables

# Plot 2: Scatter Plot of Popularity vs Speechiness
gf_point(Popularity ~ Speechiness, data = DATA) %>% 
gf_labs(title = "Popularity vs Speechiness", caption = "Spotify Dataset")

# Plot 3: Heat plot of Popularity vs Explanatory Variables
heat <- cor(DATA)
corrplot(heat, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

# Plot 4: Scatterplot of Speechiness vs. Popularity
DATA %>%
  ggplot(aes(x = Speechiness, y = Popularity)) +
    geom_point(color = "darkblue", size = 1.75, alpha = 0.01)

My Pathway to PBL

In 2008, I was hired to teach on the music faculty at Texas A&M–Commerce. Great job. Great people there. The final sentence of my contract said “other duties as assigned,” and one of those duties ended up being a music technology course. I had no experience teaching a class like that and no formal training in the subject, but that didn’t change the fact that I was going to teach it. So, I got started on figuring out how to do that without feeling like an idiot in the process.

The first step was to find an existing syllabus, but none was to be found. The next step was to ask the faculty what had been covered in the class in the past, which nobody really knew (this is not uncommon). There was a textbook that had been used, but it was not geared towards the population of students I would be teaching (according to my boss… and he wasn’t wrong). So, the next logical step was to get a sense from the faculty (and my bosses) of what they would like me to cover in the course.

What I got was a laundry list of software and technologies students should know, much of which was incredibly outdated or not relevant to the entire class. For example, being able to use drill-writing software like Pyware or EnVision is not going to be a meaningful exercise for people who intend to teach choir, orchestra, elementary, or middle school. It is also really difficult to learn, because it requires knowledge in other areas beyond just how to operate the software. This left me with these questions to resolve:

  1. What software and skills are transferable across all students in the class?
  2. What can we reasonably expect students to know and be able to do within the confines of the semester and the technological resources available?
  3. What kinds of activities and assessments can we create to achieve these goals?

The result was a series of projects centered on 3 main areas of being a music teacher and how technology could be used to serve them: teaching, creativity, and administration. Projects were designed to scaffold learners in a way that not only helped them gain knowledge and understanding of software, but also helped them leverage those skills against their own interests to create new knowledge and digital fluency. Some projects were the same across degree tracks (choral, instrumental, general music, etc.), while others were specific to each track. Examples of projects that remained consistent across tracks included the missing part assignment and the digital audio assignments, while some of the projects that bifurcated were the orchestration projects and the final projects.

The result was something unexpected. Students got really into the projects, and we all ended up helping each other learn together. Eventually I became the most knowledgeable person in the room with respect to the various technologies, but we never stopped learning together. With each passing year, I found that the more interesting and authentic the assignments were, and the more they allowed students’ interests and creativity to come out, the more wonderful the projects became.

Years later, I came to understand that this approach was known as Problem Based Learning (PBL) and Authentic Context Learning (ACL). These approaches have become huge interests of mine across all domains of learning, and I have had some wonderful experiences in classes that use them, in areas like the Applied Statistics program at Penn State. Now, as I finish my PhD in the time of Covid-19, I see the importance of creating meaningful projects that students and teachers can both learn from to increase engagement and understanding. Below are some examples of projects I have developed in the past for any of those who are interested:

Music Notation Projects:

  1. Recreate a Score
  2. Transform a Score
  3. Creating & Composing with Music Notation Software

Digital Audio Projects:

  1. Recording & Editing Project
  2. Digital Audio Transformation Project
  3. Composing & Creating with Digital Audio Workstations

Music Administration:

  1. Data Cleaning Project
  2. Tracking Expenses
  3. Create a Mail Merge