Leveraging Power BI and R to Analyze Student Success in Higher Education

Research question

How strongly does retention rate predict completion rate across different racial/ethnic groups in higher education, and how does this relationship vary by institutional type (4-year institutions vs. less than 4-year institutions)?

Model

This study uses linear regression in R to examine the relationship between retention rate and completion rate across racial/ethnic groups and institutional types.

Factors

  • Independent variable (X): Retention rate, analyzed separately for each race/ethnicity group
  • Dependent variable (Y): Completion rate
  • Analysis approach: Separate linear regression models were fitted for 4-year and less-than-4-year institutions to assess how the relationship between retention rate and completion rate differs across racial/ethnic groups.
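
The approach described above can be sketched in R roughly as follows. This is a minimal illustration on invented numbers, not the script used for this project: the data frame `schools`, its column names, and the helper `fit_by_group` are all hypothetical.

```r
# Toy data: one row per institution, one completion-rate column per group.
# All values are invented for illustration.
schools <- data.frame(
  inst_type        = rep(c("4-year", "less-than-4-year"), each = 5),
  retention        = c(0.60, 0.70, 0.80, 0.90, 0.75, 0.50, 0.60, 0.70, 0.65, 0.55),
  completion_white = c(0.50, 0.58, 0.70, 0.82, 0.66, 0.30, 0.41, 0.47, 0.45, 0.36),
  completion_black = c(0.40, 0.52, 0.61, 0.75, 0.55, 0.25, 0.33, 0.44, 0.38, 0.29)
)

# Hypothetical helper: fit completion ~ retention once per group column
# and return the retention slope of each model.
fit_by_group <- function(df, group_cols) {
  sapply(group_cols, function(col) {
    model <- lm(reformulate("retention", response = col), data = df)
    unname(coef(model)["retention"])
  })
}

group_cols <- c("completion_white", "completion_black")

# One set of models per institutional type.
slopes <- lapply(split(schools, schools$inst_type), fit_by_group,
                 group_cols = group_cols)
print(slopes)
```

Splitting by institutional type first and then looping over the group columns mirrors the "separate models" design: no model shares data across types or groups.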

Data source

College Scorecard API (see the API documentation for details)

Selected Data Elements from College Scorecard

Name of Data Element | Developer-friendly name | API data type
First-time, full-time student retention rate at four-year institutions | retention_rate.four_year.full_time_pooled | float
First-time, full-time student retention rate at less-than-four-year institutions | retention_rate.lt_four_year.full_time_pooled | float
First-time, part-time student retention rate at four-year institutions | retention_rate.four_year.part_time_pooled | float
First-time, part-time student retention rate at less-than-four-year institutions | retention_rate.lt_four_year.part_time_pooled | float
Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for white students | completion_rate_4yr_150_white | float
Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for black students | completion_rate_4yr_150_black | float
Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for Hispanic students | completion_rate_4yr_150_hispanic | float
Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for Asian students | completion_rate_4yr_150_asian | float
Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for American Indian/Alaska Native students | completion_rate_4yr_150_aian | float
  • A few things to note about the dataset:
    • Number of institutions: As of March 2025, the dataset contains data for more than 6,400 institutions. Because the data is pulled through the API, the number of rows may change as the dataset is updated.
    • Full-time students only: The analysis is based on full-time student data, as completion rates for part-time students are not available in the College Scorecard dataset.
    • Data collection period: As of March 2025, the data was last updated on January 16, 2025, so the visualizations below reflect the data available as of that update. To refresh the analysis, download the .pbix file from my GitHub and select Refresh in Power BI. The most recent update date can always be found here: https://collegescorecard.ed.gov/data/
    • 150% completion rate: In the context of a 4-year institution, 150% completion rate refers to students who graduate within 6 years (150% of the standard 4-year program length).
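
To make the data elements listed above more concrete, here is a minimal R sketch that only builds a College Scorecard API request URL without sending it. The endpoint, the query parameter names, and the `latest.student.` / `latest.completion.` field prefixes are written from memory of the public API and should be treated as assumptions to verify against the API documentation. The helper `build_request_url` and the `YOUR_API_KEY` placeholder are hypothetical.

```r
# Build (but do not send) a College Scorecard API request URL.
# Endpoint and parameter names are assumptions; check the API documentation.
base_url <- "https://api.data.gov/ed/collegescorecard/v1/schools"

# Developer-friendly names from the table above. The "latest.student." and
# "latest.completion." prefixes are assumptions, not taken from this write-up.
fields <- c(
  "school.name",
  "latest.student.retention_rate.four_year.full_time_pooled",
  "latest.student.retention_rate.lt_four_year.full_time_pooled",
  "latest.completion.completion_rate_4yr_150_white",
  "latest.completion.completion_rate_4yr_150_black"
)

# Hypothetical helper: assemble the query string for one page of results.
build_request_url <- function(api_key, fields, page = 0, per_page = 100) {
  query <- c(
    api_key  = api_key,
    fields   = paste(fields, collapse = ","),
    page     = page,
    per_page = per_page
  )
  paste0(base_url, "?", paste0(names(query), "=", query, collapse = "&"))
}

request_url <- build_request_url("YOUR_API_KEY", fields)
print(request_url)
```

Because results are paged, a full download would loop over `page` until no more rows come back. That loop is left out to keep the sketch short.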

Regression Analysis and Data Visualization as of March 2025

R script: available on my GitHub page

Linear Regression Table

  • Race_Ethnicity: The racial/ethnic group whose completion rate is the dependent variable in that row's model.
  • Retention Rate (Predictor): The term of the regression model that the row reports:
    • Baseline Completion Rate (if Retention=0), i.e. the intercept: The expected completion rate when the retention rate is zero. It is not an independent variable but a baseline value that is part of the regression equation, and it is usually not meaningful on its own.
    • FT Retention Rate at 4yr Institutions: The slope, i.e. the change in completion rate for every 1-unit increase in the retention rate.
  • Predicted Completion Rate = Baseline Completion Rate (if Retention=0) + FT Retention Rate at 4yr Institutions × FT AVG Retention Rate (71.8% for four-year institutions and 69.67% for less-than-four-year institutions)
  • Coefficient: How much the dependent variable (completion rate) is expected to change for each 1 percentage point increase in the independent variable (retention rate).
    • Baseline Completion Rate (if Retention=0): Not always meaningful on its own; it is used to calculate the Predicted Completion Rate.
    • FT Retention Rate at 4yr Institutions: For every 1 percentage point increase in retention rate, the completion rate changes by this value. A coefficient close to or above 1 means completion rises about as fast as, or faster than, retention. (Note: a negative coefficient indicates an inverse relationship, e.g. as retention increases, completion decreases.)
  • p_value: Whether the independent variable's relationship with the dependent variable is statistically meaningful. If p < 0.05, there is strong evidence of a relationship between retention rate and completion rate (statistically significant). If p ≥ 0.05, there is weak or no evidence of a real effect (not statistically significant).
  • R_squared: How well the independent variable explains the variability in the dependent variable. A value closer to 1 means the model explains most of the variance in completion rates (stronger explanatory power). A value closer to 0 means other factors not included in the model are influencing completion rates (weaker explanatory power). For educational data, an R² of 0.3 to 0.5 is typically considered moderate, while above 0.6 is strong.
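
The columns described above map onto a fitted model in R roughly as shown below. This is a sketch on invented numbers for a single hypothetical group, not the project's data or script. The variable names are chosen to match the column descriptions.

```r
# Invented institution-level rates (in percent) for a single group.
retention  <- c(55, 62, 68, 71, 75, 80, 84, 90)
completion <- c(38, 45, 52, 50, 61, 66, 70, 79)

model <- lm(completion ~ retention)
fit   <- summary(model)

intercept   <- unname(coef(model)["(Intercept)"])         # baseline (retention = 0)
coefficient <- unname(coef(model)["retention"])           # change per 1-point increase
p_value     <- fit$coefficients["retention", "Pr(>|t|)"]  # significance of the slope
r_squared   <- fit$r.squared                              # share of variance explained

# Predicted completion rate at the average retention rate.
avg_retention <- mean(retention)
predicted     <- intercept + coefficient * avg_retention

print(data.frame(intercept, coefficient, p_value, r_squared, predicted))
```

In a one-predictor least-squares fit like this toy example, the prediction at the sample's own mean retention equals its mean completion rate. The predicted rate is therefore best read as "the completion rate of a typical institution" rather than as a forecast.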

4-year institutions

Linear Regression Table - 4-year institutions
  • Overall, the retention rate has a positive relationship with the completion rate for all racial groups. Given that an R-squared of 0.3 to 0.5 is typically considered moderate for educational data, the model explains the variability reasonably well for 2 or More Races, Black, and White students.
  • The coefficient is highest for Native Hawaiian/Pacific Islander (NHPI) students (predicted completion rate of 85.76%, coefficient of 1.20), followed by 2 or More Races students (78.68%, 1.10) and Unknown students (76.57%, 1.07).
  • White students have the lowest coefficient (0.84), meaning retention rate has a comparatively smaller effect on completion for them.
  • The p_value for all races rounds to 0.00, which means the relationship between retention rate and completion rate is statistically significant across all races. Note that such small p-values are expected with a large sample (about 6,500 institutions), where even weak relationships reach statistical significance.
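
The point about sample size can be illustrated with a small simulation. The sketch below uses invented parameters (a weak true slope and a lot of noise) and a hypothetical helper `simulate_fit`. It is meant only to show that, at roughly this dataset's size, a p-value near zero can coexist with a low R-squared.

```r
set.seed(42)  # reproducible toy simulation; all numbers are invented

# Hypothetical helper: simulate n institutions and fit completion ~ retention.
simulate_fit <- function(n) {
  retention  <- runif(n, min = 0.3, max = 1.0)
  # Weak true relationship buried in a lot of noise.
  completion <- 0.20 + 0.15 * retention + rnorm(n, sd = 0.20)
  fit <- summary(lm(completion ~ retention))
  c(n         = n,
    p_value   = fit$coefficients["retention", "Pr(>|t|)"],
    r_squared = fit$r.squared)
}

large <- simulate_fit(6500)  # roughly the number of institutions in the dataset
small <- simulate_fit(50)    # same relationship, far fewer institutions

print(rbind(large, small))
```

This is why the p-value and R-squared are best read together: the first says a relationship is unlikely to be zero, the second says how much of the variation it accounts for.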

Less than 4-year institutions

Linear Regression Table - Less-than-4-year institutions
  • Although the average predicted completion rate is lower than at 4-year institutions, the retention rate has a positive relationship with the completion rate for all racial groups enrolled at less than 4-year institutions. The R-squared values for all racial groups are below 0.3, which suggests that for students at less than 4-year institutions, factors other than the retention rate play a larger role in the completion rate.
  • NHPI students have the highest predicted completion rate (63.16%, coefficient of 0.91), as they do at 4-year institutions. Hispanic and White students have the second- and third-highest predicted completion rates (61.18% with a coefficient of 0.88, and 57.09% with a coefficient of 0.82, respectively).
  • As at 4-year institutions, the p_value for every group at less than 4-year institutions rounds to 0.00. Again, this may be due to the large sample size (N = 6,480+).

Visualized plots for each race

  • What do the black dots represent?
    • Each dot represents a college or university.
    • The position of a dot on the graph shows:
      • The x-axis (horizontal): The average first-year retention rate at that institution (how many students return after their first year).
      • The y-axis (vertical): The completion rate for that student group at the institution (how many eventually graduate).
  • What does the colored line mean?
    • The line shows the general trend between retention and completion rates.
    • A steeper line suggests a stronger relationship: Institutions with higher retention rates tend to have higher completion rates for that student group.
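
A plot of this kind can be sketched in base R as follows. The numbers are invented, and the published charts were produced with the project's own data and tooling, so this only illustrates the dots-plus-trend-line layout described above.

```r
# Invented institution-level rates (in percent) for one student group.
retention  <- c(52, 58, 63, 67, 70, 74, 79, 83, 88, 92)
completion <- c(35, 44, 41, 55, 53, 62, 60, 71, 76, 80)

trend <- lm(completion ~ retention)

# Write the chart to a temporary PDF so the sketch also runs non-interactively.
plot_file <- file.path(tempdir(), "retention_vs_completion.pdf")
pdf(plot_file)
plot(retention, completion,
     pch = 16, col = "black",                  # one black dot per institution
     xlab = "First-year retention rate (%)",
     ylab = "Completion rate (%)",
     main = "Retention vs. completion (toy data)")
abline(trend, col = "steelblue", lwd = 2)      # fitted trend line
invisible(dev.off())

slope <- unname(coef(trend)["retention"])      # steepness of the colored line
print(slope)
```

The slope printed at the end is the same quantity reported as the coefficient in the regression tables, which is what ties each chart back to its table row.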

4-year institutions

Visualized plots for each race/ethnicity - 4 year institution

Less than 4-year institutions

Visualized plots for each race/ethnicity - Less Than 4 year institution

Findings

Comparison between students enrolled at 4-year institutions and less than 4-year institutions

A simplified regression table and a plot with multiple lines - 4 year institution
  • The average predicted completion rate for students enrolled at 4-year institutions (72.06%) is 18.27 percentage points higher than for students enrolled at less than 4-year institutions (53.79%).
  • The two racial groups with the largest gaps in predicted completion rates between 4-year and less than 4-year institutions are AIAN and Unknown students (33.34 and 32.69 percentage points, respectively). This suggests that AIAN and Unknown students may face greater barriers to completion at less than 4-year institutions compared to other racial groups.
  • On the other hand, the two racial groups with the smallest gaps in predicted completion rates are Hispanic and White students (3.29 and 3.19 percentage points, respectively). This could indicate that these groups experience more similar outcomes regardless of institutional type, though further exploration would be needed to understand why.
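
The gaps above are differences in percentage points, not relative percent changes. The short R sketch below makes the distinction explicit using only predicted rates quoted in this write-up (the overall averages and the NHPI figures). The data frame layout and column names are illustrative.

```r
# Predicted completion rates (%) quoted in this write-up.
predicted <- data.frame(
  group        = c("Average (all groups)", "NHPI"),
  four_year    = c(72.06, 85.76),
  lt_four_year = c(53.79, 63.16)
)

# Absolute gap, in percentage points.
predicted$gap_points <- predicted$four_year - predicted$lt_four_year

# Relative gap, in percent of the less-than-4-year rate.
predicted$gap_percent <- 100 * predicted$gap_points / predicted$lt_four_year

print(predicted)
```

The absolute gap for the averages comes out to 18.27 percentage points, matching the figure above, while the relative gap is noticeably larger because it is measured against the lower less-than-4-year rate.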

A simplified regression table and a plot with multiple lines - Less Than 4 year institution

Takeaways for college professionals

  • Retention Rate Matters but Is Not the Only Factor:
    • Retention rate has a statistically significant effect on completion rate (p-value ≈ 0.00). However, low R-squared values indicate that retention alone does not strongly explain completion rates, so other key factors that affect completion are missing from the model (see the limitations and suggestions for future analysis).
  • Targeted Support for NHPI Students May Have a High Impact
    • Among racial groups, NHPI students show the highest predicted completion rates when retention improves regardless of the institution type.
    • Institutional interventions aimed at improving NHPI student retention—such as culturally responsive advising, mentoring, and academic support—are likely to significantly boost their completion rates.
  • Completion at Less-Than-4-Year Institutions Is More Complex
    • The lower R-squared values suggest that completion at less-than-4-year institutions is influenced by multiple factors beyond retention.

Limitations/suggestions for future analysis

  • Full-Time vs. Part-Time Students
    • The analysis is based on full-time student data, as completion rates for part-time students are not available in the College Scorecard dataset.
    • If future data includes part-time student completion rates, a more comprehensive analysis could assess whether retention impacts completion differently for part-time vs. full-time students.
  • Geographic Differences
    • This analysis does not account for regional variations in the relationship between retention and completion rates.
    • Future research could explore state-level differences to determine whether geographic differences influence completion outcomes.

© copyright SEVIS SAVVY 2025