Predicting Student Dropout Using Supervised Learning

Data Source: Study Group’s internal data
Rows: 25,059 | Features: 16, covering the student, where they’re studying, and their course
Tech Used: Python, Pandas, Numpy, Scikit-learn, XGBoost, TensorFlow/Keras, Matplotlib, Seaborn

Colab Notebook: Here

🧩 The Problem?

Each year, universities invest heavily in international students, supporting them academically, socially, and financially. But not all students finish what they started. Some drop out quietly without enough early warning for universities to step in.

The mission: Use real-world student data to identify who might drop out, and why. This insight would help educational institutions offer targeted support and improve student outcomes.

To do this, I was provided with data capturing students at three different stages: early in their studies, mid-way through, and late on. I built a separate set of models for each stage.
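
The sketch below shows one way that three-stage setup could be organised in pandas. The toy rows, column names, and `Outcome` labels are illustrative placeholders, not the actual Study Group schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (25,059 rows, 16 features);
# column names and values are illustrative only.
df = pd.DataFrame({
    "Age":                    [18, 21, 19, 24, 20, 22, 18, 23],
    "CourseLevel":            ["UG", "PG", "UG", "PG", "UG", "UG", "PG", "UG"],
    "AuthorisedAbsenceCount": [1, 4, 0, 7, 2, 9, 1, 3],
    "AssessedModules":        [6, 6, 6, 6, 6, 6, 6, 6],
    "PassedModules":          [6, 3, 6, 2, 5, 1, 6, 4],
    "Outcome":                ["Completed", "Dropout", "Completed", "Dropout",
                               "Completed", "Dropout", "Completed", "Completed"],
})

# Each stage only uses the features available at that point in the student journey.
stage_features = {
    "stage_1_demographics": ["Age", "CourseLevel"],
    "stage_2_attendance":   ["Age", "CourseLevel", "AuthorisedAbsenceCount"],
    "stage_3_academics":    ["Age", "CourseLevel", "AuthorisedAbsenceCount",
                             "AssessedModules", "PassedModules"],
}

y = (df["Outcome"] == "Completed").astype(int)   # 1 = completed, 0 = dropout

splits = {}
for stage, cols in stage_features.items():
    X = pd.get_dummies(df[cols], drop_first=True)   # one-hot encode categoricals
    splits[stage] = train_test_split(X, y, test_size=0.25,
                                     stratify=y, random_state=42)
```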

🔍 The Challenge:

This wasn’t a simple prediction problem. We had three separate datasets tracking:

  • Student background and course details

  • Attendance and engagement

  • Academic outcomes

Here’s what made things tricky:

  • The data wasn’t always clean - we had to remove messy or overly complex columns.

  • The target variable was imbalanced - 85% of students completed their course, so detecting dropouts (just 15%) was like finding needles in a haystack.

  • I couldn’t just rely on accuracy - a model predicting “everyone graduates” would still score 85% correct. So we had to go deeper, focusing on recall, precision, AUC, and confusion matrices to understand what the models were getting right (and wrong).
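
As a quick illustration of that last point, the toy sketch below scores a scikit-learn DummyClassifier on labels with the same 85/15 split. It is not part of the actual pipeline, just a demonstration of why accuracy alone is misleading:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels with the same 85/15 split as the real data (1 = completed, 0 = dropout).
y = np.array([1] * 850 + [0] * 150)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the majority class ("everyone graduates").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))              # 0.85 - looks "good"
print(recall_score(y, pred, pos_label=0))   # 0.0  - catches no dropouts at all
```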

🛠️ The Methodology:

🧪 Model Evaluation: How We Measured Success

To truly understand how well our models performed, we looked beyond just accuracy and used a combination of metrics. This was especially important because the dataset was imbalanced, with far more students completing their courses than dropping out. Here's what I used and why:

Accuracy

What it is: The percentage of total predictions the model got right.
Why it’s not enough: Since dropouts are the minority class, accuracy can give a false sense of performance. What we really care about is:

  • Were our dropout predictions correct, or false alarms? (Precision)

  • Did we catch the students who were going to drop out? (Recall)

Precision

What it is: Out of all the students the model predicted would drop out, how many did?
Why it matters: High precision means fewer false positives - useful if you want to avoid mistakenly flagging students who are doing fine.

Recall

What it is: Out of all the students who actually dropped out, how many did the model correctly catch? (With “completed” as the positive class, these correctly caught dropouts appear as true negatives in the confusion matrices below.)
Why it matters: High recall ensures at-risk students don’t slip through the cracks.

F1-Score

What it is: The harmonic mean of precision and recall, which is especially useful when you care about both.
Why it matters: It gives a single score that reflects both false positives and false negatives - ideal when the classes are imbalanced.

AUC (Area Under the ROC Curve)

What it is: A measure of how well the model separates dropouts from enrolled students, regardless of the prediction threshold.
Why it matters: The closer to 100%, the better the model is at distinguishing the two groups.

Confusion Matrix

What it is: A table showing how many students were correctly or incorrectly predicted as enrolled or dropped out.
Why it matters: It gives a detailed view of model performance - how many real dropouts were missed, and how many enrolled students were wrongly flagged.

How to read it:

🔵 Top-left: correctly predicted dropouts (True Negatives)
🔵 Top-right: missed dropouts (False Positives)

🔴 Bottom-left: predicted dropout but student actually completed (False Negatives)
🟡 Bottom-right: correctly predicted completions (True Positives)
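
To make these definitions concrete, here is roughly how they map onto scikit-learn calls. The tiny arrays are toy stand-ins for a model’s hold-out labels and predictions; per-class recall (e.g. recall for dropouts, class 0) would add `pos_label=0`:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 1])   # 1 = completed, 0 = dropout
y_pred = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 1])   # hard class predictions
y_prob = np.array([.9, .8, .95, .7, .4, .85, .2, .6, .1, .75])  # P(completed)

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of the two
print("auc      :", roc_auc_score(y_true, y_prob))     # uses probabilities, not labels
print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
```

A per-class breakdown of precision, recall and F1 for both dropouts and enrolled students can also be printed in one go with `classification_report(y_true, y_pred)`.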

🧩 Stage 1: Baseline Models

I trained and then separately tuned both an XGBoost model and a Neural Network on the initial dataset of demographic and course information, searching over the following hyperparameters (a tuning sketch follows the list):

  • XGBoost tuning: learning_rate, max_depth, and n_estimators

  • Neural Network tuning: optimizers (adam, rmsprop), activations (relu, tanh), and neuron counts
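
Below is a hedged sketch of that tuning step. The search grids mirror the hyperparameters listed above, but the synthetic dataset, layer sizes and epoch counts are placeholders rather than the settings actually used:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
from tensorflow import keras

# Synthetic, imbalanced stand-in for the Stage 1 data (~85% class 1).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.15, 0.85], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# --- XGBoost: grid search over learning_rate, max_depth and n_estimators ---
xgb_grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"learning_rate": [0.05, 0.1, 0.3],
                "max_depth": [3, 5, 7],
                "n_estimators": [100, 300]},
    scoring="roc_auc", cv=3)
xgb_grid.fit(X_tr, y_tr)
print("best XGBoost params:", xgb_grid.best_params_)

# --- Neural network: loop over optimizers, activations and neuron counts ---
def build_nn(optimizer="adam", activation="relu", units=32):
    model = keras.Sequential([
        keras.layers.Input(shape=(X_tr.shape[1],)),
        keras.layers.Dense(units, activation=activation),
        keras.layers.Dense(units // 2, activation=activation),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=optimizer, loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC(name="auc")])
    return model

best = None
for opt in ["adam", "rmsprop"]:
    for act in ["relu", "tanh"]:
        for units in [32, 64]:
            nn = build_nn(opt, act, units)
            hist = nn.fit(X_tr, y_tr, validation_split=0.2,
                          epochs=20, batch_size=64, verbose=0)
            val_auc = hist.history["val_auc"][-1]  # validation AUC of the last epoch
            if best is None or val_auc > best[0]:
                best = (val_auc, opt, act, units)
print("best NN config (val AUC, optimizer, activation, units):", best)
```

Scoring the grid search on ROC AUC rather than accuracy keeps the tuning honest given the class imbalance discussed above.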

🟢 XGBoost got better at identifying dropouts (recall ↑ from 94.25% to 95.57%), but became worse at identifying those who stayed enrolled (recall ↓ from 46.06% to 41.79%).

Confusion matrices for an untuned XGBoost model (left) vs a tuned XGBoost model (right).

🟡 Neural Network improved in identifying enrolled students and raised its F1-score, but made more false predictions overall.

Confusion matrices for an untuned Neural Network model (left) vs a tuned Neural Network model (right).

🧩 Stage 2: Adding Attendance

I introduced attendance-related features like AuthorisedAbsences. I dropped UnauthorisedAbsences to avoid multicollinearity.
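
Below is a minimal sketch of the kind of multicollinearity check that motivates dropping one of two highly correlated columns. The toy values are illustrative, not the real attendance data:

```python
import pandas as pd

# Toy attendance data; in practice these columns come from the attendance file.
attendance = pd.DataFrame({
    "AuthorisedAbsences":   [2, 5, 0, 7, 3, 9, 1, 4],
    "UnauthorisedAbsences": [1, 6, 0, 8, 2, 10, 1, 5],
    "AttendancePercentage": [96, 80, 99, 70, 92, 60, 97, 85],
})

# If two absence columns are strongly correlated, keeping both adds little
# extra signal and can destabilise some models, so one is dropped.
print(attendance.corr())

features = attendance.drop(columns=["UnauthorisedAbsences"])
```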

🎯 Both models improved across the board:

  • Accuracy climbed (NN: 88.67%, XGBoost: 88.83%)

  • AUC scores jumped (NN: 86.89%, XGBoost: 88.60%)

  • Both became better at distinguishing students at risk, as shown by the increased number of True Negatives (top left) and decreased number of False Positives (top right - missed dropouts) in the confusion matrices below.

Confusion matrices for a tuned Neural Network (left) vs tuned XGBoost model (right).

This stage showed us that behavioural data (like attendance) is a much stronger indicator of risk than demographics alone.

🧩 Stage 3: Introducing Academic Performance

Next, I added academic features, AssessedModules and PassedModules, to measure progress.

🏆 XGBoost stole the show:

  • Accuracy: 98.92%

  • AUC: 99.90%

  • It outperformed the Neural Network across all key metrics, even without tuning

Confusion matrices for an untuned Neural Network model (left) vs an untuned XGBoost model (right).

📉 Neural Network struggled with the new complexity and delivered lower recall and AUC. We chose not to tune it further to keep the model comparison fair.

Conclusion: Dropout Risk Changes Over Time

This project shows that different features matter at different stages (a feature-importance sketch follows the lists below):

🟡 Early Risk (Stage 1 – Demographics)

  • Top feature: Age

  • Risk is tied to student background. Younger students may be less prepared.

  • Models struggled with recall, showing that demographics alone aren't enough.

🟠 Mid Risk (Stage 2 – Attendance)

  • Top feature: AuthorisedAbsenceCount

  • Students who miss more classes are at higher risk.

  • Models improved at spotting both dropouts and enrolled students.

🔴 Late Risk (Stage 3 – Academic Performance)

  • Top features: AssessedModules, PassedModules

  • Poor academic performance signals high dropout risk.

  • XGBoost performed best, showing strong accuracy and AUC.
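
One common way to surface “top features” like those listed above is to read a trained XGBoost model’s built-in importance scores. The sketch below shows the idea on synthetic data with illustrative column names; the exact method used in this project may differ:

```python
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

feature_names = ["Age", "AuthorisedAbsenceCount", "AssessedModules", "PassedModules"]

# Synthetic stand-in for one stage's feature matrix.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=4,
                           n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=feature_names)

model = XGBClassifier(eval_metric="logloss").fit(X, y)

# Higher scores = features the trees relied on most when splitting.
importances = (pd.Series(model.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances)
```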

📌 Key Takeaways:

Dropout risk shifts from who the student is, to how they behave, to how they perform. XGBoost is especially effective at late-stage predictions, making it ideal for early-warning systems.

Ultimately, we demonstrated that with the right data and models, dropout risk can be predicted early enough for educators to intervene. XGBoost, in particular, is a strong candidate for powering real-time support systems in education.
