INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
Page 1401
www.rsisinternational.org
Improving Customer Retention in Aviation Industry: A
Machine Learning Perspective
Cosmas Akanna Ogbobe¹, Ezema Miracle Chikamso², Olakunmi Olayinka Odumosu³
¹University of East London
²Babcock University, Nigeria
³National Open University

1824 Published: 14 November 
ABSTRACT
Nigeria’s aviation sector faces intense competition, rising operational costs, and volatile passenger loyalty.
This study employs a Random Forest classifier to predict passenger churn using anonymized flight data,
developing a model that achieves high precision in identifying at-risk passengers. Key predictors include
flight delay duration, customer service interactions, and travel class. The results inform targeted retention
strategies, such as predictive dashboards and loyalty programs, offering actionable insights for airline
operations and revenue protection.
Keywords: Machine Learning; Customer Retention; Aviation Industry; Churn Prediction; Random Forest;
Nigeria; Passenger Behavior.
INTRODUCTION
In Nigeria's highly competitive aviation sector, customer retention represents a critical survival strategy. Amid
rising operating costs, fierce competition from both domestic and international carriers, and volatile passenger
loyalty, airlines face unprecedented pressure to identify and retain their most valuable customers (Olatokun &
Alabi, 2018). Retaining existing passengers is consistently more cost-effective and sustainable than customer
acquisition. However, many Nigerian airlines struggle to detect early warning signs of customer churn. This
research proposes the implementation of Machine Learning (ML) technology, specifically a Random Forest
classifier, as an innovative solution to address this pervasive challenge through predictive analytics (FAAN,
2024).
LITERATURE REVIEW AND BACKGROUND
The aviation industry's competitive nature necessitates sophisticated approaches to customer relationship
management. Traditional methods of customer retention often rely on reactive strategies implemented after
customer dissatisfaction becomes apparent instead of adopting a proactive approach (Vercellis, 2009).
However, contemporary data science approaches, particularly machine learning algorithms, enable predictive
analytics that can identify at-risk customers before they defect to competing airlines (Han, Kamber, & Pei,
2012).
Customer churn prediction represents a well-established application of machine learning across various
industries, with aviation presenting unique characteristics that influence passenger loyalty. Factors such as
flight delays, service quality, pricing structures, and route accessibility significantly impact passenger retention
rates. The Nigerian aviation market presents additional complexities, including infrastructure challenges,
regulatory variations, and diverse passenger demographics across major route networks (Kim & Kim, 2017).
METHODOLOGY
This research employed a machine learning approach to predict passenger churn using actual
flight data from a Nigerian airline. The dataset was anonymized to protect passenger privacy while maintaining
analytical integrity. The primary research question was whether machine learning algorithms could
accurately predict the likelihood that a passenger will cease flying with a specific airline, based on historical
travel patterns and service interaction data.
Data Collection and Processing
The dataset comprised approximately 200 records of passenger activity across major Nigerian aviation routes,
including high-traffic connections such as Abuja-Port Harcourt and Lagos-Enugu. The collected data
encompassed several key variables:
• Anonymized passenger identification information and Passenger Name Records (PNRs)
• Origin and destination airports with route classifications
• Flight scheduling information and actual travel dates
• Travel class categorizations, including Economy Discount, Economy Flex, Economy Saver, Business Saver, and Business Flex
• Customer service interaction frequency through support call records
• Flight delay duration measurements in minutes
• Additional temporal features and churn classification labels derived from passenger activity patterns
Fig 1. An Overview of the Dataset
Preprocessing and Training Procedures
The data preprocessing phase involved several critical steps to ensure model accuracy and reliability:
• Categorical variable encoding: Route classifications and travel class categories were transformed using scikit-learn's LabelEncoder, which converts text-based categories into integer codes suitable for machine learning algorithms.
• Numerical feature treatment: Continuous variables, including delay duration, support call frequency, and ticket pricing, were normalized and outlier-clipped to prevent extreme values from skewing model performance.
• Training and testing split: The dataset was divided using a 70-30 stratified split, ensuring representative distributions of churn and retention cases in both the training and testing sets.
• Noise injection protocol: To simulate real-world labeling uncertainty and test model robustness, 10% label noise was deliberately introduced into the churn labels, with the aim of discouraging overfitting to potentially mislabeled training examples.
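The preprocessing steps above can be sketched as follows. This is a minimal, library-free illustration under stated assumptions, not the study's code: the toy records, field names, clipping bounds, and random seed are hypothetical, `label_encode` only mirrors the sorted-category-to-integer mapping that LabelEncoder produces, and applying the label noise to the training labels only is an assumption, since the text does not say where the noise was applied.

```python
import random
from collections import defaultdict

def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def clip(values, low, high):
    """Clip numeric values into [low, high] to limit outlier influence."""
    return [min(max(v, low), high) for v in values]

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def inject_label_noise(labels, noise_rate, rng):
    """Flip a fixed fraction of binary labels (0 <-> 1)."""
    noisy = list(labels)
    n_flip = round(len(noisy) * noise_rate)
    for idx in rng.sample(range(len(noisy)), n_flip):
        noisy[idx] = 1 - noisy[idx]
    return noisy

rng = random.Random(42)  # hypothetical seed

# Hypothetical toy data: 200 records, 50 of them labeled as churners.
class_names = ["Economy Saver", "Economy Flex", "Business Flex",
               "Economy Discount", "Business Saver"]
travel_class = [class_names[i % 5] for i in range(200)]
delay_minutes = [rng.randint(0, 400) for _ in range(200)]
churn = [1 if i % 4 == 0 else 0 for i in range(200)]

encoded_class, class_mapping = label_encode(travel_class)
clipped_delay = clip(delay_minutes, 0, 240)  # hypothetical clipping bounds
train_idx, test_idx = stratified_split(churn, 0.30, rng)

# Assumption: noise is injected into the training labels only.
train_labels = [churn[i] for i in train_idx]
noisy_train_labels = inject_label_noise(train_labels, 0.10, rng)
```

With these toy inputs the split yields 140 training and 60 testing records, and 14 training labels are flipped.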
The Random Forest Algorithm
The research utilized a Random Forest (RF) classifier, selected for its robustness, its ability to handle mixed
data types, its resistance to overfitting, and the interpretable feature importance rankings it provides, which
facilitate actionable business insights (Breiman, 2001). Random Forest is an ensemble learning technique that
constructs multiple decision trees during training and outputs a final prediction based on the majority vote of
the individual trees (for classification) or the average of their predictions (for regression).
The final prediction for an input instance x is expressed mathematically as:
$$H(x) = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_K(x)\}$$
where $h_k(x)$ represents the prediction of the k-th decision tree in the ensemble, and $K$ denotes the total number
of trees.
Each tree in the Random Forest is trained on a bootstrapped sample of the training dataset, with a random
subset of features considered at each split. This process introduces randomness that enhances generalization
and reduces overfitting.
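The bootstrap sampling, random feature subsetting, and majority vote described above can be illustrated with a short sketch. It is not the study's implementation: the row indices, feature names, and tree votes are hypothetical, and no trees are actually trained here. Note also that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by a hard vote, so the vote below follows the formula in the text rather than that library's behavior.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one tree's training sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, k, rng):
    """Pick k candidate features without replacement for one split."""
    return rng.sample(features, k)

def majority_vote(tree_predictions):
    """H(x): the most common class among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # hypothetical seed
rows = list(range(10))  # hypothetical row indices of a tiny training set
features = ["delay_minutes", "support_calls", "travel_class", "route"]  # hypothetical names

sample = bootstrap_sample(rows, rng)                 # some rows repeat, some are left out
candidates = random_feature_subset(features, 2, rng)

# Hypothetical predictions h_1(x), ..., h_5(x) from K = 5 trees for one passenger.
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)  # three of five trees vote "churn"
```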
The quality of a split in each decision tree is commonly evaluated using the Gini Impurity metric, which
measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled
according to the class distribution in the node. The Gini Impurity is defined as:
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $C$ is the number of classes and $p_i$ represents the probability of selecting an item belonging to class $i$.
Minimizing the Gini Impurity at each split ensures that the resulting nodes are as pure as possible, thereby
improving classification accuracy.
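As a small worked example of the Gini Impurity, the sketch below scores a parent node and one candidate split. The labels are hypothetical, and the size-weighted average used to score the split is the usual convention for decision trees rather than something stated in the text.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum_i p_i^2 over the class proportions in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Hypothetical node: 3 churners (1) and 5 retained passengers (0).
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

parent_gini = gini_impurity(parent)                # 1 - (3/8)^2 - (5/8)^2 = 0.46875
split_gini = weighted_split_impurity(left, right)  # 0.5 * 0.375 + 0.5 * 0.0 = 0.1875
```

The split lowers the impurity from 0.46875 to 0.1875, so a tree would prefer it over leaving the node unsplit.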
Model Evaluation and Results
The Random Forest model was evaluated using standard classification metrics. The decision threshold was set
so that a passenger was flagged as a churn risk only when the predicted churn probability was at least 0.70. This
ensured that the flagged list was reliable enough to act on.
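A minimal sketch of this thresholding rule is shown below. The probabilities are hypothetical; in scikit-learn they would typically come from a fitted classifier's `predict_proba` output for the churn class, and the function name `flag_churn_risk` is illustrative only.

```python
def flag_churn_risk(churn_probabilities, threshold=0.70):
    """Flag a passenger (1) only when the churn probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in churn_probabilities]

# Hypothetical predicted churn probabilities for five passengers.
probs = [0.91, 0.70, 0.69, 0.55, 0.12]

flags = flag_churn_risk(probs)                         # [1, 1, 0, 0, 0]
default_flags = flag_churn_risk(probs, threshold=0.50) # [1, 1, 1, 1, 0]
```

Raising the threshold from 0.50 to 0.70 shortens the flagged list from four passengers to two in this toy case.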
Performance Metrics:
The following classification report details the model's overall performance. Although the general accuracy was
53%, the model's strategic importance becomes clear when looking specifically at the metrics for the churn
class.
Accuracy:
Accuracy measures the overall proportion of correct predictions (both churners and non-churners) among all
evaluated instances. It provides a general sense of the model’s predictive performance.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision:
Precision measures the proportion of correctly predicted positive cases (i.e., passengers predicted to churn who
actually did). This metric is particularly important for assessing the impact of false positives, where a
passenger is incorrectly classified as likely to churn.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity):
Recall quantifies the proportion of actual churners that the model correctly identified. This metric highlights
the cost of false negatives, which occur when high-value passengers intending to leave are not detected by the
model.
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-Score:
The F1-Score represents the harmonic mean of Precision and Recall, offering a balanced measure that accounts
for both false positives and false negatives. It is particularly useful in cases where the dataset is imbalanced
(i.e., when churners are fewer than retained passengers).
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Confusion Matrix and Detailed Metrics
The model’s performance was further analyzed using a Confusion Matrix, which provides a detailed view of
the classifier’s prediction accuracy for each class. The matrix illustrates the Random Forest model’s capability
in correctly identifying both churners (True Positives) and retained customers (True Negatives), while also
indicating the occurrence of False Positives and False Negatives.
Table 1. Confusion Matrix for Passenger Churn Prediction
| Actual / Predicted | Retained (0) | Churn (1) |
|---|---|---|
| Retained (0) | True Negatives (TN) | False Positives (FP) |
| Churn (1) | False Negatives (FN) | True Positives (TP) |
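Given the four confusion matrix counts, Accuracy, Precision, Recall, and F1-Score follow directly from their definitions. The sketch below uses hypothetical counts that are not the study's confusion matrix. The last line combines the precision (0.73) and recall (0.24) reported in this paper into an F1 value, which is a derived figure and not one stated by the authors.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)

# F1 implied by the reported churn-class precision and recall: roughly 0.36.
implied_f1 = f1_score(0.73, 0.24)
```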
The Confusion Matrix outcomes show how the Random Forest model trades sensitivity (the ability to correctly
identify churners) against specificity (the ability to correctly identify retained customers): most passengers it
flags are true churners, but a large share of actual churners are not flagged.
Fig 2. Confusion Matrix for Passenger Churn Prediction
To further quantify performance, Table 2 presents the detailed evaluation metrics.
Table 2. Model Evaluation Metrics
| Metric | Interpretation |
|---|---|
| Accuracy | The model correctly predicted the outcome for 53% of the passengers in the test group. |
| Precision (Churn) | When the model predicts a passenger will churn, it is correct 73% of the time. This is the model's key strength, providing a high-confidence list of at-risk passengers. |
| Recall (Churn) | The model correctly identifies only 24% of all actual churners. This is an intentional trade-off for achieving high precision. |
| F1-Score | The harmonic mean of Precision and Recall. The lower score reflects the model's conservative approach, which gives priority to precision over recall. |
Fig 3. Model Evaluation Metrics
Fig 4. Churn Prediction Outcome
The key finding is the precision of 0.73 for the churn class (1). This signifies that when the model identifies a
passenger as likely to churn, it is correct 73% of the time. This level of precision provides a reliable basis for
deploying targeted retention strategies, so that most retention spending reaches customers who are genuinely
at risk.
This high precision comes at the cost of a lower recall of 0.24, meaning the model identifies only 24% of
actual churners and roughly three in four go undetected. This outcome is a direct consequence of the 0.70
prediction threshold, which makes the model more conservative. From a business perspective, this is a valuable
result, as it provides a smaller, high-confidence list of at-risk passengers for immediate intervention.
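The precision-recall trade-off behind this threshold choice can be illustrated on a toy example. The labels and scores below are hypothetical and are not drawn from the study's data. They only show how raising the threshold can increase precision while lowering recall.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for the positive class at a given score threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical true labels (1 = churned) and predicted churn probabilities.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.55, 0.30, 0.65, 0.50, 0.40, 0.20, 0.10]

for threshold in (0.50, 0.70):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, moving the threshold from 0.50 to 0.70 raises precision from about 0.67 to 1.00 while recall falls from 0.80 to 0.40.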
Key Churn Predictors
Systematic ranking of predictive features provided clear guidance for operational and service priorities:
1. Delay duration: Emerged as the primary predictor of customer churn. Passengers experiencing frequent
or prolonged delays demonstrated a substantially higher likelihood of switching carriers.
2. Customer service interaction frequency: High volumes of support calls served as strong indicators of
cumulative passenger dissatisfaction and subsequent churn risk.
3. Travel class segmentation: Economy class passengers exhibited higher churn rates compared to Business
class travelers, suggesting greater price and service sensitivity among the former segment.
4. Route and temporal patterns: Specific city pair combinations and particular days of the week showed
elevated churn rates, indicating localized service challenges or intense competitive pressures.
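A ranking of this kind is commonly obtained from the `feature_importances_` attribute of a fitted scikit-learn RandomForestClassifier. The sketch below assumes that approach and uses synthetic data: the feature names, value ranges, label rule, and hyperparameters are hypothetical and do not reproduce the study's dataset or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)  # hypothetical seed
n = 200

# Synthetic, already-encoded features (hypothetical names and ranges).
feature_names = ["delay_minutes", "support_calls", "travel_class", "route"]
X = np.column_stack([
    rng.integers(0, 240, n),  # delay in minutes
    rng.integers(0, 6, n),    # number of support calls
    rng.integers(0, 5, n),    # encoded travel class
    rng.integers(0, 4, n),    # encoded route
])

# Synthetic churn label driven by delay and support calls plus noise.
y = (X[:, 0] + 20 * X[:, 1] + rng.normal(0, 30, n) > 170).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances describe how much each feature contributed to the trees' splits. They indicate association with the label in the training data, not a causal effect on churn.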
Fig 5. Feature Importance
Business Applications and Strategic Implications
Operational Recommendations
The machine learning insights generate several actionable recommendations for Nigerian aviation operators:
• Predictive dashboard implementation: Airlines should integrate churn prediction capabilities into existing Customer Relationship Management (CRM) systems, enabling real-time identification of at-risk passengers and automated alert generation for retention teams.
• Targeted loyalty programs: High-risk passengers identified through the predictive model should receive personalized retention offers, including route-specific discounts, upgrade opportunities, and enhanced service recovery protocols.
• Service recovery automation: Airlines should implement automated follow-up procedures for passengers affected by delays or service disruptions, particularly focusing on economy class travelers who demonstrate higher churn sensitivity.
• Operational enhancement focus: Churn prediction insights should guide operational improvements, with particular attention to delay reduction on high-risk routes and enhanced customer service training for support staff.
Strategic Business Impact
The implementation of machine learning-driven retention strategies offers multiple strategic advantages:
• Revenue protection: Proactive identification of at-risk customers enables targeted retention investments that protect existing revenue streams more cost-effectively than customer acquisition programs.
• Service quality optimization: Understanding the specific factors that drive customer churn allows airlines to prioritize operational improvements with the highest retention impact.
• Competitive positioning: Data-driven customer retention capabilities provide competitive advantages in Nigeria's crowded aviation market by enabling more responsive and personalized customer service.
• Resource allocation efficiency: Predictive analytics enable more efficient allocation of retention resources by focusing efforts on passengers with the highest churn probability and lifetime value potential.
Technical Implementation Considerations
The proposed system architecture leverages Python and its ecosystem (Pandas, NumPy, scikit-learn, Seaborn,
etc.) for accessibility and scalability. The Random Forest algorithm provides a suitable balance of performance
and computational efficiency for integration with existing Nigerian airline infrastructure. Future deployment
would involve containerization (e.g., Docker) and API integration to serve predictions in real-time.
Limitations and Future Research Directions
Current Study Limitations:
• Sample size constraints: The dataset of approximately 200 passenger records, while sufficient for proof-of-concept development, represents a limited sample that may not capture the full diversity of the Nigerian aviation market and passenger behavior.
• Temporal scope: The analysis focuses on a specific time period and may not account for seasonal variations, economic fluctuations, or evolving market conditions that influence passenger behavior. For instance, market demand and passenger behavior during festive seasons in Nigeria differ greatly from those in other periods of the year.
• Feature limitations: While the selected variables provide useful predictive signal, additional factors such as passenger demographics, loyalty program participation, and external economic indicators could enhance model accuracy.
Future Research Opportunities
• Expanded Dataset Analysis: Larger, multi-airline datasets could provide more comprehensive insights into industry-wide churn patterns and competitive dynamics.
• Advanced Algorithm Exploration: Investigation of deep learning approaches, other ensemble methods, and specialized time-series algorithms could improve prediction accuracy and capture more complex behavioral patterns.
• Integration with External Data: Incorporation of economic indicators, weather patterns, and competitive pricing data could enhance model sophistication and practical applicability.
• Real-Time Implementation Studies: Research focused on operational deployment challenges and real-time performance optimization would provide valuable implementation guidance.
CONCLUSION
This research indicates that machine learning can be a useful tool for helping Nigerian airlines retain their
customers. The model identified the main factors associated with passenger churn: flight delays, customer
service interactions, and travel class. When the model flags a passenger as a churn risk, it is correct 73% of
the time, which gives marketing teams a high-confidence basis for targeted retention campaigns, although at a
recall of 24% it detects only a minority of actual churners.
Ultimately, these findings give airlines information they can act on. Instead of just reacting to problems, they
can proactively use this technology to improve the entire customer experience, which protects their revenue,
keeps passengers happy, and makes sure resources are used wisely.
REFERENCES
1. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
2. Federal Airports Authority of Nigeria (FAAN). (2024). Nigerian aviation sector overview and
challenges. Lagos: FAAN Annual Report.
3. Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Morgan
Kaufmann.
4. Kim, Y. H., & Kim, Y. J. (2017). Predicting customer churn in the airline industry: A deep learning
approach. Journal of Air Transport Management, 62, 268–275.
https://doi.org/10.1016/j.jairtraman.2017.06.025
5. Olatokun, F. A. O., & Alabi, A. K. F. (2018). An empirical analysis of customer loyalty in the Nigerian
aviation industry. International Journal of Transportation Science and Technology, 7(4), 312–320.
https://doi.org/10.1016/j.ijtst.2018.07.002
6. Vercellis, C. (2009). Business intelligence: Data mining and optimization for decision making. John
Wiley & Sons.