This repository contains the data science and model source code for the AI Hack Thailand 2025 Competition, submitted by Team GoonSquad69.
The objective is to build a machine learning model that accurately predicts the probability of a loan applicant becoming a Long Overdue Debtor (LOD).
> [!IMPORTANT]
> The actual datasets have been deleted because they contain confidential information and must not be publicly disclosed.
The dataset contains 42 columns, including the information listed below:
- Loan Application Date
- Address
- Gender
- Marital Status
- Residence Information
- Applicant's Company Information
- Occupation & Position
- Salary & Other Income
- Loan Request Information
- General Information
- LOD Result (Only in the training dataset)
> [!IMPORTANT]
> All outputs in the Jupyter notebooks have been deleted to prevent confidential data leakage.
| File | Description |
|---|---|
| `data-preprocessing.ipynb` | Analyzes the information in all columns of the dataset and identifies relationships and conflicts between columns. |
| `data-cleaning.ipynb` | Cleans both the training and evaluation datasets. Feature engineering techniques are also applied here. |
| `pseudo-label.ipynb` | Adds high-confidence rows from a previous submission to the training dataset (pseudo-labeling). |
This section details the analysis of all columns in the provided dataset, focusing on identifying relationships and data quality issues.
The provided dataset had many conflicts and missing values, such as:
- The `application_date` column contained two different date formats.
- The `gender` column used `F1` for married females and `F2` for unmarried females, but many rows conflicted with the `marital_status` column.
- Some entries in the `postal_code` column were invalid.
- The `number_of_children` and `number_of_resident` columns contained several extreme outlier values.
- The column for the number of employees at the applicant's company had many rows with `9,999`, which is an ambiguous value.
- Some applicants had invalid loan histories or requests.
- Some applicants were listed as 'single' but reported income from a spouse.
- Some categorical columns contained invalid values.
- Some columns had more than 50% missing or invalid values.
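A minimal pandas sketch of how a few of these issues could be handled; the column names follow the list above, but the file path, date formats, and conflict rules are illustrative assumptions rather than the exact logic in `data-cleaning.ipynb`.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path; the real data is not public

# Normalize the two date formats into one datetime column (formats are assumptions).
parsed = pd.to_datetime(df["application_date"], format="%Y-%m-%d", errors="coerce")
fallback = pd.to_datetime(df["application_date"], format="%d/%m/%Y", errors="coerce")
df["application_date"] = parsed.fillna(fallback)

# Flag rows where the gender code (F1 = married female, F2 = unmarried female)
# conflicts with marital_status, and treat the conflicting code as missing.
conflict = (
    ((df["gender"] == "F1") & (df["marital_status"] != "Married"))
    | ((df["gender"] == "F2") & (df["marital_status"] == "Married"))
)
df.loc[conflict, "gender"] = pd.NA

# Drop postal codes that are not 5-digit strings (assumed format).
valid_postal = df["postal_code"].astype(str).str.fullmatch(r"\d{5}")
df.loc[~valid_postal, "postal_code"] = pd.NA
```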
Based on the analysis, most columns were categorical. We applied the following cleaning techniques:
- For columns with extreme outliers, we used the Interquartile Range (IQR) to identify and cap them.
- For columns with a wide range of values, we used quantile binning to convert them into categorical features.
  - If a column contained a special value (e.g., `9999`), we separated it into its own bin.
- Used one-hot encoding for categorical columns.
- Added new interaction features for more insights, such as `age_at_application`, `total_income`, and `expected_loan_div_income`.
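The cleaning steps above could look roughly like the following pandas sketch. The thresholds, bin counts, and extra column names (`company_size`, `birth_date`, `salary`, `other_income`, `loan_amount`) are assumptions, not values taken from the actual notebooks.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def quantile_bin(s: pd.Series, q: int = 10, special=None) -> pd.Series:
    """Quantile-bin a wide-range column, keeping a special value in its own bin."""
    mask = s.eq(special) if special is not None else pd.Series(False, index=s.index)
    bins = pd.qcut(s.where(~mask), q=q, duplicates="drop").astype(str)
    bins[mask] = "special"
    return bins

df["number_of_children"] = cap_outliers_iqr(df["number_of_children"])
df["number_of_resident"] = cap_outliers_iqr(df["number_of_resident"])
df["company_size_bin"] = quantile_bin(df["company_size"], special=9999)

# Interaction features mentioned above.
df["age_at_application"] = df["application_date"].dt.year - df["birth_date"].dt.year
df["total_income"] = df["salary"] + df["other_income"]
df["expected_loan_div_income"] = df["loan_amount"] / df["total_income"]

# One-hot encode categorical columns.
df = pd.get_dummies(df, columns=["gender", "marital_status", "company_size_bin"])
```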
> [!NOTE]
> After the private leaderboard was released, it appeared that one-hot encoding was causing overfitting. A better approach might have been to use target-based encoding for categorical columns and a log transformation for wide-range numerical columns.
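A hedged sketch of that alternative: smoothed mean-target encoding for a categorical column plus a log transform for a wide-range numeric one. This was not used in the actual submission, the column names are assumptions, and in practice the encoding should be computed out-of-fold to avoid target leakage.

```python
import numpy as np

def target_encode(train_df, other_df, col, target="LOD", smoothing=10.0):
    """Smoothed mean-target encoding learned on train_df and applied to both frames."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(["mean", "count"])
    enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    for frame in (train_df, other_df):
        frame[f"{col}_te"] = frame[col].map(enc).fillna(global_mean)

# `train`/`test` are the cleaned frames; "occupation" is a hypothetical categorical column.
target_encode(train, test, "occupation")

# Log transform compresses wide-range numeric columns instead of binning them.
for frame in (train, test):
    frame["total_income_log"] = np.log1p(frame["total_income"])
```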
We added pseudo-labeled data from high-confidence predictions on a previous submission to improve model performance.
- Added high-confidence negative predictions (rows with a predicted probability between `0.00` and `0.30`).
- Added high-confidence positive predictions (rows with a predicted probability between `0.70` and `1.00`).
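A minimal sketch of this pseudo-labeling step; the thresholds come from the list above, while the file names and the `id`/`LOD` column names are assumptions.

```python
import pandas as pd

train = pd.read_csv("train_clean.csv")         # hypothetical cleaned datasets
test = pd.read_csv("test_clean.csv")
prev = pd.read_csv("previous_submission.csv")  # assumed columns: id, LOD (probability)

# Keep only high-confidence predictions and turn them into hard labels.
neg = prev[prev["LOD"] <= 0.30].assign(LOD=0)
pos = prev[prev["LOD"] >= 0.70].assign(LOD=1)
pseudo = pd.concat([neg, pos])[["id", "LOD"]]

# Attach the pseudo labels to the matching evaluation rows and extend the training set.
pseudo_rows = test.merge(pseudo, on="id")
train = pd.concat([train, pseudo_rows], ignore_index=True)
```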
> [!IMPORTANT]
> All outputs in the Jupyter notebooks have been deleted to prevent confidential data leakage.
| File | Description |
|---|---|
| `models.ipynb` | Training for CatBoost, LightGBM, and Logistic Regression models. |
| `autogluon.ipynb` | Uses AutoGluon to automatically train models and generate a submission. |
| `coffee-blender.ipynb` | A custom script for manually blending submission files. |
- First, we used a `CatBoost` model for feature selection to eliminate low-importance features.
- Trained a second `CatBoost` model on the feature-selected dataset.
- Trained a `LightGBM` model on the feature-selected dataset.
- Trained a `Logistic Regression` model on the feature-selected dataset.
- Finally, we manually set weights to blend the results from each model, generating our best submission.
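A sketch of this pipeline, assuming `train` and `X_test` are the cleaned frames with an `LOD` target column. The iteration counts and the importance threshold are assumptions; the blend weights follow the note below.

```python
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

X, y = train.drop(columns=["LOD"]), train["LOD"]

# Step 1: rank features with a first CatBoost model and keep the important ones.
selector = CatBoostClassifier(iterations=500, verbose=0).fit(X, y)
importance = pd.Series(selector.get_feature_importance(), index=X.columns)
selected = importance[importance > 0.1].index   # threshold is an assumption

# Step 2: train the three final models on the selected features.
cat = CatBoostClassifier(iterations=1000, verbose=0).fit(X[selected], y)
lgbm = LGBMClassifier(n_estimators=1000).fit(X[selected], y)
logreg = LogisticRegression(max_iter=1000).fit(X[selected], y)

# Step 3: blend the predicted probabilities with manually chosen weights.
proba = (
    0.4 * cat.predict_proba(X_test[selected])[:, 1]
    + 0.4 * lgbm.predict_proba(X_test[selected])[:, 1]
    + 0.2 * logreg.predict_proba(X_test[selected])[:, 1]
)
```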
> [!NOTE]
> The combination of 40% CatBoost + 40% LightGBM + 20% Logistic Regression appeared to be the optimal blend.
> [!TIP]
> Future work could include adding more models (beyond CatBoost, LightGBM, and Logistic Regression) to the blend for potentially different results.
We used AutoGluon to automatically train a wide range of models and generate an optimized submission.
> [!WARNING]
> The XGBoost model consistently caused an error on Kaggle for an unknown reason.
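A minimal AutoGluon sketch, assuming the target column is `LOD` and `train`/`test` are the cleaned frames. The time budget and preset are assumptions, and excluding XGBoost (if that was indeed the failing model type) is shown only as one possible workaround for the error above.

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label="LOD", eval_metric="roc_auc").fit(
    train_data=train,
    time_limit=3600,
    presets="best_quality",
    excluded_model_types=["XGB"],  # skip XGBoost to sidestep the Kaggle error
)

submission = pd.DataFrame({
    "id": test["id"],
    "LOD": predictor.predict_proba(test)[1],  # probability of the positive class
})
submission.to_csv("submission_autogluon.csv", index=False)
```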
This script imports high-quality submission files (as Kaggle datasets) and manually blends them using weighted averaging to generate a final, optimized submission.
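A sketch of the weighted-averaging idea, assuming each submission file has `id` and `LOD` columns; the file names and weights here are placeholders, not the actual inputs.

```python
import pandas as pd

# Hypothetical submission files and weights; the real inputs were Kaggle datasets.
parts = [
    ("submission_catboost.csv", 0.4),
    ("submission_lightgbm.csv", 0.4),
    ("submission_logreg.csv", 0.2),
]

blended = None
for path, weight in parts:
    scores = pd.read_csv(path).set_index("id")["LOD"] * weight
    blended = scores if blended is None else blended + scores

blended.rename("LOD").reset_index().to_csv("submission_blend.csv", index=False)
```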
- Leader: Worralop Srichainont (`reisenx`)
- Member: Nuttapong Jinda
- Member: Naytipat Phothipan (`SNaytiP`)
- Member: Sippapas Chavanont (`iammmaimai`)
- Member: Theeraphat Jinanarong (`SzipeRy`)