Robust Hybrid Data-Level Approach for Handling Skewed Fat-Tailed Distributed Datasets and Diverse Features in Financial Credit Risk

Details
Abstract

Title:: Robust Hybrid Data-Level Approach for Handling Skewed Fat-Tailed Distributed Datasets and Diverse Features in Financial Credit Risk
Authors:: Musara, Keith R.
Ranganai, Edmore
Chimedza, Charles
Matarise, Florence
Munyira, Sheunesu
Publication date:: 2025
Keywords:: skewed fat-tailed distributed datasets
diverse features
noisy instances
Language:: English
Content provider:: BazTech
: Article


Skewed fat-tailed distributed (imbalanced or class-imbalanced) datasets cause severe aberrations in numerous machine learning (ML) algorithms in real-life applications, especially in credit risk modelling, where default cases (the minority class) are often outnumbered by non-default cases (the majority class), or vice versa. Data-level (DL) approaches have been suggested in the recent literature as remedies for skewed fat-tailed distributed datasets. The most popular DL approach in contemporary studies is the synthetic minority over-sampling technique (SMOTE) and its variants, which can mitigate the risk of overfitting and reduce generalization error. However, these approaches can introduce noisy instances that diminish the robustness of ML algorithms. They are also often vulnerable to nominal features with mismatching labels, which are inherent in real-world datasets. To bridge these gaps, we propose a hybrid framework that mitigates the aberrations caused by nominal features with mismatching labels and by noisy instances simultaneously. The proposed approach is SMOTE-edited nearest neighbors-encoding nominal and continuous (SMOTEENN-ENC) features. Its efficacy was evaluated against DL approaches from the literature that are designed to handle skewed fat-tailed distributed datasets with inherent diverse features. The approach was coupled with widely employed ensemble algorithms, namely random forest (RF) and extreme gradient boosting (XGBoost). The results suggest that SMOTEENN-ENC, integrated with the XGBoost algorithm, demonstrated superior and more stable predictive performance when applied to skewed fat-tailed distributed datasets with inherent diverse features.
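
The abstract names the three ingredients of the hybrid (encoding nominal features, SMOTE over-sampling, and edited-nearest-neighbours cleaning) without giving implementation details. The following is only a minimal pure-Python sketch of how such a pipeline could be chained, not the authors' SMOTEENN-ENC procedure. The one-hot encoding, the unscaled Euclidean distance, the toy income/employment data, and all function names (`one_hot`, `nearest`, `smote`, `enn`) are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def one_hot(rows, nominal_idx):
    """Replace each nominal column with 0/1 indicator columns (assumed encoding)."""
    levels = {j: sorted({r[j] for r in rows}) for j in nominal_idx}
    encoded = []
    for r in rows:
        vec = []
        for j, v in enumerate(r):
            if j in levels:
                vec.extend(1.0 if v == lv else 0.0 for lv in levels[j])
            else:
                vec.append(float(v))
        encoded.append(vec)
    return encoded

def nearest(X, i, k, pool):
    """Indices of the k points in pool (excluding i) closest to X[i]."""
    others = [j for j in pool if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def smote(X, y, minority, k, rng):
    """Add interpolated minority points until the two classes are balanced."""
    min_idx = [i for i, c in enumerate(y) if c == minority]
    n_new = len(y) - 2 * len(min_idx)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)
        j = rng.choice(nearest(X, i, k, min_idx))
        gap = rng.random()  # position on the segment between X[i] and X[j]
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out

def enn(X, y, k):
    """Drop every point whose label differs from the majority of its k neighbours."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = Counter(y[j] for j in nearest(X, i, k, everyone))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: (income, employment type); label 1 = default (minority).
rng = random.Random(0)
rows = [(rng.gauss(50, 5), rng.choice(["salaried", "self-employed"])) for _ in range(40)]
rows += [(rng.gauss(20, 5), rng.choice(["salaried", "self-employed"])) for _ in range(8)]
labels = [0] * 40 + [1] * 8

X = one_hot(rows, nominal_idx={1})
X_bal, y_bal = smote(X, labels, minority=1, k=3, rng=rng)
X_clean, y_clean = enn(X_bal, y_bal, k=3)
```

In this sketch, `X_clean` and `y_clean` would then be passed to a classifier such as RF or XGBoost. Note that interpolating one-hot columns yields fractional indicator values, which is one reason a dedicated treatment of nominal features, as the abstract proposes, may be preferable to this naive encoding.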
