Five Modeling Lessons Learned From the Pandemic

This blog post is a follow-up to Anomalies: Drivers of Progress. In that article, we discussed data science anomalies and higher education, including current tactics (i.e., during the unfolding of the pandemic), the September postmortem diagnosis, and future model updates.

Here we briefly revisit each of these areas, look at what actually happened, and discuss how we can move forward.


The Takeaway Message

One question we regularly receive from our higher ed partners is how data and modeling are affected by the pandemic and whether using data and models is still possible.

The short answer is that despite changes in data and behavior due to the pandemic, data and models are still your best, and really only, bet in achieving your goals and obtaining your desired class (what would even be the alternative?).

While some adjustments have to be made, and one has to be careful to avoid certain types of bias, data can still be used to create models, make predictions, and gain insights.

Here are a few lessons learned about data and modeling during the pandemic for higher education.

Five Lessons Learned

  1. The Fundamentals Have Not Changed
    Regardless of the pandemic, data was still collected and used to build models and gain insights. There are roughly two main changes due to the pandemic that matter: (1) changes in data collection because of the pandemic (more about that below), and (2) changes in probabilities (or likelihoods) due to changes in behavior (e.g., not attending college due to safety concerns) and actions caused by the pandemic (e.g., more aid to try to offset a decrease in campus visits).

    The main issue we discuss now is that the pandemic may have changed a student’s likelihood of enrolling, either directly or indirectly. However, now that we are able to use the data from during the pandemic to update the models, it turns out that the probabilities (in most cases) did not change dramatically.

    Any sizeable changes are now reflected in the models because that same data was used to train them. Of course, it is important not to look only at average probabilities but also to take into account other factors such as distance, academic quality variables, and aid allocation, all of which can impact the likelihood of enrollment. By comparing these to previous years, you will be able to see the impact of the pandemic. This is something that can be achieved with our explainable AI (more about that later).

    Another relevant insight is that even if the probabilities changed across the board, their rankings (relative order) in terms of individual probabilities should still be approximately the same (obviously, the impact of the pandemic on some subpopulations could be larger than others, but everyone should be affected to some degree).

    More specifically, for many use cases, probabilities are ranked from low to high so that different actions can be performed for different probability ranges. This often takes the form of deciles, where the first decile contains the lowest probability individuals, and the highest decile contains the highest probability individuals.

    For example, these deciles can be used to determine who to focus on in marketing campaigns or who to invite for a visit. Even if the average probability is slightly lower due to the pandemic (as the models show in some cases), the relative probabilities should still be similar, and it is perfectly fine to use these.
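
    The decile approach described above can be sketched in a few lines of Python. This is a minimal illustration, not our production method: the function name `assign_deciles` and the toy probabilities are hypothetical, and the pandemic effect is assumed to be a uniform 10% scaling, which is the simplest case of an across-the-board change.

```python
from statistics import mean


def assign_deciles(probabilities):
    """Return a decile per individual: 1 = lowest probabilities, 10 = highest."""
    n = len(probabilities)
    # Indices ordered from lowest to highest probability.
    order = sorted(range(n), key=lambda i: probabilities[i])
    deciles = [0] * n
    for rank, idx in enumerate(order):
        # Map rank 0..n-1 onto deciles 1..10 in roughly equal-sized groups.
        deciles[idx] = rank * 10 // n + 1
    return deciles


# Hypothetical pre-pandemic enrollment probabilities for 20 individuals.
pre_pandemic = [((i * 7) % 20 + 1) / 21 for i in range(20)]
# Assumed across-the-board drop: every probability scaled down by 10%.
pandemic = [0.9 * p for p in pre_pandemic]

deciles_before = assign_deciles(pre_pandemic)
deciles_after = assign_deciles(pandemic)

# The average drops, but the relative order (and so the deciles) is unchanged.
average_dropped = mean(pandemic) < mean(pre_pandemic)
deciles_unchanged = deciles_before == deciles_after
```

    If the pandemic affected some subpopulations more than others, the order would shift somewhat; comparing the two decile lists is a quick way to see how much.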

  2. Lifecycle Modeling is Essential
    It is important to build separate models for different stages of the enrollment cycle because different variables are important at different stages. For example, distance tends to play an important role in early-stage models, and aid is obviously more important in later models.

    In addition, different lifecycle stages may have been impacted differently because of the timing of the pandemic, which makes it essential to use lifecycle modeling. As was hinted at above, using our explainable AI, it is possible to see how the current year is doing compared to the previous year for a given stage and pinpoint which (sub)populations are under- or over-performing and due to which variables. Especially with the volatility of the pandemic, this allows us to properly attribute the results and keep a close eye on progress.

    Lifecycle modeling makes it possible to track different behavior variables at different stages and optimize for them. With the timing of the pandemic, the later phases were impacted more, so the impact of the lack of visits may be something to look at. For example, it is possible to assess the impact of a decrease in visits, although obviously there is a difference between not visiting and not being able to visit.

    So, let’s look at that in a bit more detail.

  3. Virtual Visits can be Incorporated
    It turns out that in many cases, virtual visits are comparable to on-campus visits (both express strong interest) and can be merged with them. Whether this holds for a given institution is something the data can tell us.

    If it turns out there is a large difference in effectiveness between on-campus and virtual visits, we can apply a useful modeling trick and still model visits as a single variable (thereby keeping things simple). Essentially, we model the options from least positively impactful on enrollment probability to most positively impactful on enrollment probability.

    In most cases, this means “no visit,” “no virtual visit,” “a virtual visit,” and “an on-campus visit.”

    Assuming there is enough data for each of these options, we then use machine learning to assess the impact of each of these across all individuals and enrollment years.

    Our findings indicate that in most cases, virtual visits are not quite as impactful as on-campus visits, but they are close.

    The reason for this might be that often the visit is just validation and not really the deciding factor in making a college decision. In other words, causality in both directions happens, but one direction is more common (decide, then visit) than the other (visit, then decide).
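
    The single-variable visit encoding described above can be sketched as follows. This is an illustration under stated assumptions, not our actual implementation: the names `VISIT_LEVELS`, `encode_visit`, and `enrollment_rate_by_level` are hypothetical, the three-level ordering is a simplification, and the records are made up. In practice, the ordering should be checked against your own data before it is used.

```python
# Assumed ordering from least to most positively associated with enrollment.
VISIT_LEVELS = {"none": 0, "virtual": 1, "on_campus": 2}


def encode_visit(visit_types):
    """Collapse a student's visit history into one ordinal value (highest level reached)."""
    return max((VISIT_LEVELS[v] for v in visit_types), default=VISIT_LEVELS["none"])


def enrollment_rate_by_level(records):
    """Return the observed enrollment rate for each visit level present in the data.

    `records` is a list of (visit_types, enrolled) pairs, with enrolled as 0 or 1.
    """
    totals, enrolled_counts = {}, {}
    for visit_types, enrolled in records:
        level = encode_visit(visit_types)
        totals[level] = totals.get(level, 0) + 1
        enrolled_counts[level] = enrolled_counts.get(level, 0) + enrolled
    return {level: enrolled_counts[level] / totals[level] for level in sorted(totals)}


# Toy records, made up purely for illustration.
records = [
    ([], 0), ([], 0), ([], 1), ([], 0),
    (["virtual"], 1), (["virtual"], 0),
    (["on_campus"], 1), (["virtual", "on_campus"], 1),
]
rates = enrollment_rate_by_level(records)
```

    A student with both a virtual and an on-campus visit is assigned the higher level, so each individual ends up with exactly one value for the visit variable.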

  4. Going Test Optional is Supported
    Many institutions were already considering going test optional for admissions, and the pandemic has accelerated this trend. Because the historical data in these cases does not contain missing values for test scores, we cannot keep the test score variables in the model. (Note that it is also not a good idea to use a model with a test score variable only for students who do submit test scores: whether a student submits is not a random process, so the results would be biased.) We have run extensive experiments, and it turns out that after removing the test score variables, model performance decreases only slightly in most cases.

    Usually, there is enough other information, e.g., related to high school performance, to determine whether someone is going to enroll (something similar can be said for aid allocation, although it is more complicated). For example, when predicting who was historically admitted with and without test scores, the results were fairly similar in most cases.

    In our platform, we can provide a so-called "secondary predict" that indicates whether someone would historically have been admitted (but without knowing the test score!), which admissions counselors can use. Some of our partners have implemented this approach.
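
    One way to quantify the with/without-test-score comparison described above is to score the same individuals with both models and compare a ranking metric. The sketch below uses AUC, which is our choice for illustration only; the function name `auc`, the labels, and both sets of scores are hypothetical toy values, not results from our experiments.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy outcomes (1 = admitted) and toy scores from two hypothetical models.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores_with_test = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4, 0.6, 0.1]
scores_without_test = [0.9, 0.8, 0.35, 0.3, 0.2, 0.4, 0.6, 0.1]

auc_with = auc(labels, scores_with_test)
auc_without = auc(labels, scores_without_test)
# How much ranking quality is lost when the test score variables are removed.
auc_drop = auc_with - auc_without
```

    A small `auc_drop` on held-out historical data would suggest that the remaining variables carry most of the signal, which is the pattern described above.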

  5. There is Variation Across Institutions
    Our models are made to order and tailored to a particular institution’s situation, processes, and goals. This also means that if your data has certain trends due to the pandemic, they will be reflected in your models. We have seen different impacts among our partner institutions, often because of the geographical location and the local COVID-19 situation.

    Due to these regional differences, it is important to use an institution’s data to the fullest extent and utilize lifecycle modeling, which allows each institution to define its own set of variables for different stages of the enrollment funnel and to execute the right actions at the right time.

Conclusion: Data and Modeling Are Still Valuable 

While the pandemic caused some changes in student behavior and data collection, data and modeling are still extremely valuable in reaching your goals and obtaining your desired class. Looking back at what was discussed in Anomalies: Drivers of Progress, some adjustments had to be made, as discussed in the five points above, but no major overhauls were necessary.

We still have to listen very carefully to the stories the data and modeling tell us and make the best possible decision with all the available information.