This repository contains the data science and model source code for the AI Hack Thailand 2025 Competition, submitted by Team GoonSquad69.
The objective is to build a machine learning model that accurately predicts the probability of a loan applicant becoming a Long Overdue Debtor (LOD).
> [!IMPORTANT]
> The actual datasets have been deleted because they contain confidential information and must not be publicly disclosed.
The dataset contains 42 columns, including the information listed below:
- Loan Application Date
- Address
- Gender
- Marital Status
- Residence Information
- Applicant's Company Information
- Occupation & Position
- Salary & Other Income
- Loan Request Information
- General Information
- LOD Result (Only in the training dataset)
> [!IMPORTANT]
> All outputs in the Jupyter notebooks have been deleted to prevent confidential data leakage.
| File | Description |
|---|---|
| `data-preprocessing.ipynb` | Analyzes the information in all columns of the dataset and identifies relationships and conflicts between columns. |
| `data-cleaning.ipynb` | Cleans both the training and evaluation datasets. Feature engineering techniques are also applied here. |
| `pseudo-label.ipynb` | Adds high-confidence rows from a previous submission to the training dataset (pseudo-labeling). |
This section details the analysis of all columns in the provided dataset, focusing on identifying relationships and data quality issues.
The provided dataset had many conflicts and missing values, such as:
- The `application_date` column contained two different date formats.
- The `gender` column used `F1` for married females and `F2` for unmarried females, but many rows conflicted with the `marital_status` column.
- Some entries in the `postal_code` column were invalid.
- The `number_of_children` and `number_of_resident` columns contained several extreme outlier values.
- The column for the number of employees at the applicant's company had many rows with `9,999`, which is an ambiguous value.
- Some applicants had invalid loan histories or requests.
- Some applicants were listed as 'single' but reported income from a spouse.
- Some categorical columns contained invalid values.
- Some columns had more than 50% missing or invalid values.
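A minimal pandas sketch of how a few of these issues could be handled; the column names follow the list above, but the file path, date formats, and conflict rules are illustrative assumptions rather than the exact logic in `data-cleaning.ipynb`.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path; the real data is not public

# Normalize the two date formats into one datetime column (formats are assumptions).
parsed = pd.to_datetime(df["application_date"], format="%Y-%m-%d", errors="coerce")
fallback = pd.to_datetime(df["application_date"], format="%d/%m/%Y", errors="coerce")
df["application_date"] = parsed.fillna(fallback)

# Flag rows where the gender code (F1 = married female, F2 = unmarried female)
# conflicts with marital_status, and treat the conflicting code as missing.
conflict = (
    ((df["gender"] == "F1") & (df["marital_status"] != "Married"))
    | ((df["gender"] == "F2") & (df["marital_status"] == "Married"))
)
df.loc[conflict, "gender"] = pd.NA

# Drop postal codes that are not 5-digit strings (assumed format).
valid_postal = df["postal_code"].astype(str).str.fullmatch(r"\d{5}")
df.loc[~valid_postal, "postal_code"] = pd.NA
```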
Based on the analysis, most columns were categorical. We applied the following cleaning techniques:
- For columns with extreme outliers, we used the Interquartile Range (IQR) to identify and cap them.
- For columns with a wide range of values, we used quantile binning to convert them into categorical features.
  - If a column contained a special value (e.g., `9999`), we separated it into its own bin.
- Used one-hot encoding for categorical columns.
- Added new interaction features for more insights, such as `age_at_application`, `total_income`, and `expected_loan_div_income`.
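The cleaning steps above could look roughly like the following pandas sketch. The thresholds, bin counts, and extra column names (`company_size`, `birth_date`, `salary`, `other_income`, `loan_amount`) are assumptions, not values taken from the actual notebooks.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def quantile_bin(s: pd.Series, q: int = 10, special=None) -> pd.Series:
    """Quantile-bin a wide-range column, keeping a special value in its own bin."""
    mask = s.eq(special) if special is not None else pd.Series(False, index=s.index)
    bins = pd.qcut(s.where(~mask), q=q, duplicates="drop").astype(str)
    bins[mask] = "special"
    return bins

df["number_of_children"] = cap_outliers_iqr(df["number_of_children"])
df["number_of_resident"] = cap_outliers_iqr(df["number_of_resident"])
df["company_size_bin"] = quantile_bin(df["company_size"], special=9999)

# Interaction features mentioned above.
df["age_at_application"] = df["application_date"].dt.year - df["birth_date"].dt.year
df["total_income"] = df["salary"] + df["other_income"]
df["expected_loan_div_income"] = df["loan_amount"] / df["total_income"]

# One-hot encode categorical columns.
df = pd.get_dummies(df, columns=["gender", "marital_status", "company_size_bin"])
```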
> [!NOTE]
> After the private leaderboard was released, it appeared that one-hot encoding was causing overfitting. A better approach might have been to use target-based encoding for categorical columns and a log transformation for wide-range numerical columns.
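A hedged sketch of that alternative: smoothed mean-target encoding for a categorical column plus a log transform for a wide-range numeric one. This was not used in the actual submission, the column names are assumptions, and in practice the encoding should be computed out-of-fold to avoid target leakage.

```python
import numpy as np

def target_encode(train_df, other_df, col, target="LOD", smoothing=10.0):
    """Smoothed mean-target encoding learned on train_df and applied to both frames."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(["mean", "count"])
    enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    for frame in (train_df, other_df):
        frame[f"{col}_te"] = frame[col].map(enc).fillna(global_mean)

# `train`/`test` are the cleaned frames; "occupation" is a hypothetical categorical column.
target_encode(train, test, "occupation")

# Log transform compresses wide-range numeric columns instead of binning them.
for frame in (train, test):
    frame["total_income_log"] = np.log1p(frame["total_income"])
```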
We added pseudo-labeled data from high-confidence predictions on a previous submission to improve model performance.
- Added high-confidence negative predictions (rows with a predicted probability between `0.00` and `0.30`).
- Added high-confidence positive predictions (rows with a predicted probability between `0.70` and `1.00`).
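A minimal sketch of this pseudo-labeling step; the thresholds come from the list above, while the file names and the `id`/`LOD` column names are assumptions.

```python
import pandas as pd

train = pd.read_csv("train_clean.csv")         # hypothetical cleaned datasets
test = pd.read_csv("test_clean.csv")
prev = pd.read_csv("previous_submission.csv")  # assumed columns: id, LOD (probability)

# Keep only high-confidence predictions and turn them into hard labels.
neg = prev[prev["LOD"] <= 0.30].assign(LOD=0)
pos = prev[prev["LOD"] >= 0.70].assign(LOD=1)
pseudo = pd.concat([neg, pos])[["id", "LOD"]]

# Attach the pseudo labels to the matching evaluation rows and extend the training set.
pseudo_rows = test.merge(pseudo, on="id")
train = pd.concat([train, pseudo_rows], ignore_index=True)
```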
> [!IMPORTANT]
> All outputs in the Jupyter notebooks have been deleted to prevent confidential data leakage.
| File | Description |
|---|---|
| `models.ipynb` | Training for CatBoost, LightGBM, and Logistic Regression models. |
| `autogluon.ipynb` | Uses AutoGluon to automatically train models and generate a submission. |
| `coffee-blender.ipynb` | A custom script for manually blending submission files. |
- First, we used a `CatBoost` model for feature selection to eliminate low-importance features.
- Trained a second `CatBoost` model on the feature-selected dataset.
- Trained a `LightGBM` model on the feature-selected dataset.
- Trained a `Logistic Regression` model on the feature-selected dataset.
- Finally, we manually set weights to blend the results from each model, generating our best submission.
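A sketch of this pipeline, assuming `train` and `X_test` are the cleaned frames with an `LOD` target column. The iteration counts and the importance threshold are assumptions; the blend weights follow the note below.

```python
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

X, y = train.drop(columns=["LOD"]), train["LOD"]

# Step 1: rank features with a first CatBoost model and keep the important ones.
selector = CatBoostClassifier(iterations=500, verbose=0).fit(X, y)
importance = pd.Series(selector.get_feature_importance(), index=X.columns)
selected = importance[importance > 0.1].index   # threshold is an assumption

# Step 2: train the three final models on the selected features.
cat = CatBoostClassifier(iterations=1000, verbose=0).fit(X[selected], y)
lgbm = LGBMClassifier(n_estimators=1000).fit(X[selected], y)
logreg = LogisticRegression(max_iter=1000).fit(X[selected], y)

# Step 3: blend the predicted probabilities with manually chosen weights.
proba = (
    0.4 * cat.predict_proba(X_test[selected])[:, 1]
    + 0.4 * lgbm.predict_proba(X_test[selected])[:, 1]
    + 0.2 * logreg.predict_proba(X_test[selected])[:, 1]
)
```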
> [!NOTE]
> The combination of 40% CatBoost + 40% LightGBM + 20% Logistic Regression appeared to be the optimal blend.
> [!TIP]
> Future work could include adding more models (beyond CatBoost, LightGBM, and Logistic Regression) to the blend for potentially different results.
We used AutoGluon to automatically train a wide range of models and generate an optimized submission.
> [!WARNING]
> The XGBoost model consistently caused an error on Kaggle for an unknown reason.
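A minimal AutoGluon sketch, assuming the target column is `LOD` and `train`/`test` are the cleaned frames. The time budget and preset are assumptions, and excluding XGBoost (if that was indeed the failing model type) is shown only as one possible workaround for the error above.

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label="LOD", eval_metric="roc_auc").fit(
    train_data=train,
    time_limit=3600,
    presets="best_quality",
    excluded_model_types=["XGB"],  # skip XGBoost to sidestep the Kaggle error
)

submission = pd.DataFrame({
    "id": test["id"],
    "LOD": predictor.predict_proba(test)[1],  # probability of the positive class
})
submission.to_csv("submission_autogluon.csv", index=False)
```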
This script imports high-quality submission files (as Kaggle datasets) and manually blends them using weighted averaging to generate a final, optimized submission.
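A sketch of the weighted-averaging idea, assuming each submission file has `id` and `LOD` columns; the file names and weights here are placeholders, not the actual inputs.

```python
import pandas as pd

# Hypothetical submission files and weights; the real inputs were Kaggle datasets.
parts = [
    ("submission_catboost.csv", 0.4),
    ("submission_lightgbm.csv", 0.4),
    ("submission_logreg.csv", 0.2),
]

blended = None
for path, weight in parts:
    scores = pd.read_csv(path).set_index("id")["LOD"] * weight
    blended = scores if blended is None else blended + scores

blended.rename("LOD").reset_index().to_csv("submission_blend.csv", index=False)
```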
- Leader: Worralop Srichainont (`reisenx`)
- Member: Nuttapong Jinda
- Member: Naytipat Phothipan (`SNaytiP`)
- Member: Sippapas Chavanont (`iammmaimai`)
- Member: Theeraphat Jinanarong (`SzipeRy`)