Long Overdue Debtor (LOD) Prediction Model

This repository contains the data science and model source code for the AI Hack Thailand 2025 Competition, submitted by Team GoonSquad69.

The objective is to build a machine learning model that accurately predicts the probability of a loan applicant becoming a Long Overdue Debtor (LOD).


Training Datasets

Important

The actual datasets have been removed because they contain confidential information that must not be publicly disclosed.

The dataset contains 42 columns, including the information listed below:

  • Loan Application Date
  • Address
  • Gender
  • Marital Status
  • Residence Information
  • Applicant's Company Information
  • Occupation & Position
  • Salary & Other Income
  • Loan Request Information
  • General Information
  • LOD Result (Only in the training dataset)

Data Science

Important

All outputs in the Jupyter notebooks have been deleted to prevent confidential data leakage.

  • data-preprocessing.ipynb: Analyzes the information in all columns of the dataset and identifies relationships and conflicts between columns.
  • data-cleaning.ipynb: Cleans both the training and evaluation datasets; feature engineering techniques are also applied here.
  • pseudo-label.ipynb: Adds high-confidence rows from a previous submission to the training dataset (pseudo-labeling).

Data Preprocessing

This section details the analysis of all columns in the provided dataset, focusing on identifying relationships and data quality issues.

The provided dataset had many conflicts and missing values, such as the issues listed below (a couple of them are illustrated in the sketch after the list):

  • The application_date column contained two different date formats.
  • The gender column used F1 for married females and F2 for unmarried females, but many rows conflicted with the marital_status column.
  • Some entries in the postal_code column were invalid.
  • The number_of_children and number_of_resident columns contained several extreme outlier values.
  • The column for the number of employees at the applicant's company had many rows with 9,999, which is an ambiguous value.
  • Some applicants had invalid loan histories or requests.
  • Some applicants were listed as 'single' but reported income from a spouse.
  • Some categorical columns contained invalid values.
  • Some columns had more than 50% missing or invalid values.
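As an illustration, here is a minimal pandas sketch of how two of these issues could be handled. The two date formats, the marital_status values, and the reconciliation strategy are assumptions, since the real (confidential) data is not available.

```python
import pandas as pd

# Hypothetical example: the two date formats below and the marital_status
# values are assumptions; the real data is confidential.
def parse_application_date(s: pd.Series) -> pd.Series:
    # Try both formats and keep whichever one parses successfully.
    first = pd.to_datetime(s, format="%d/%m/%Y", errors="coerce")
    second = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
    return first.fillna(second)

def flag_gender_marital_conflicts(df: pd.DataFrame) -> pd.DataFrame:
    # F1 = married female, F2 = unmarried female; flag rows that disagree with
    # marital_status so they can be reviewed, imputed, or dropped later.
    df = df.copy()
    df["gender_marital_conflict"] = (
        ((df["gender"] == "F1") & (df["marital_status"] != "married"))
        | ((df["gender"] == "F2") & (df["marital_status"] == "married"))
    )
    return df
```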

Data Cleaning

Based on the analysis, most columns were categorical. We applied the following cleaning techniques (sketched in the snippet after this list):

  • For columns with extreme outliers, we used the interquartile range (IQR) to identify and cap them.
  • For columns with a wide range of values, we used quantile binning to convert them into categorical features.
    • If a column contained a special value (e.g., 9999), we placed it in its own bin.
  • We applied one-hot encoding to categorical columns.
  • We added new interaction features for more insight, such as age_at_application, total_income, and expected_loan_div_income.
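A minimal sketch of these cleaning steps in pandas. Column names such as birth_date, salary, other_income, loan_amount, and company_employees are placeholders, since the real schema is confidential, and the bin count is illustrative.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    # Cap values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def quantile_bin_with_special(s: pd.Series, special=9999, q: int = 10) -> pd.Series:
    # Put the special value in its own bin and quantile-bin everything else.
    regular = s.where(s != special)
    binned = pd.qcut(regular, q=q, duplicates="drop").astype(str)
    return binned.mask(s == special, "special")

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Column names (birth_date, salary, other_income, loan_amount, ...) are
    # placeholders; the real schema is confidential.
    df = df.copy()
    df["number_of_children"] = cap_outliers_iqr(df["number_of_children"])
    df["company_size_bin"] = quantile_bin_with_special(df["company_employees"], special=9999)
    df["age_at_application"] = (df["application_date"] - df["birth_date"]).dt.days // 365
    df["total_income"] = df["salary"] + df["other_income"]
    df["expected_loan_div_income"] = df["loan_amount"] / df["total_income"]
    return pd.get_dummies(df, columns=["gender", "marital_status", "occupation"])
```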

Note

After the private leaderboard was released, it appeared that one-hot encoding was causing overfitting. A better approach might have been to use target-based encoding for categorical columns and a log transformation for wide-range numerical columns.
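A rough sketch of that alternative, assuming the target column is named lod; the smoothing constant is illustrative, and in practice the encoding should be computed out-of-fold to avoid target leakage.

```python
import numpy as np
import pandas as pd

def smoothed_target_encode(train: pd.DataFrame, col: str, target: str = "lod",
                           smoothing: float = 20.0) -> pd.Series:
    # Blend each category's mean target with the global mean, weighted by the
    # category count, so that rare categories are pulled toward the global mean.
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1.0 - weight) * global_mean
    return train[col].map(encoding).fillna(global_mean)

def log_transform(s: pd.Series) -> pd.Series:
    # Compress wide-range numerical columns instead of quantile-binning them.
    return np.log1p(s.clip(lower=0))
```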

Pseudo-Labeling

To improve model performance, we added pseudo-labeled data taken from the high-confidence predictions of a previous submission (see the sketch after the list below).

  • Added high-confidence negative predictions (rows with a predicted probability between 0.00 and 0.30).
  • Added high-confidence positive predictions (rows with a predicted probability between 0.70 and 1.00).
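A minimal sketch of this step; the thresholds follow the list above, while the file names and the lod / lod_probability column names are placeholders.

```python
import pandas as pd

# Thresholds follow the list above; file and column names are placeholders.
LOW, HIGH = 0.30, 0.70

train = pd.read_csv("train_cleaned.csv")
test = pd.read_csv("eval_cleaned.csv")
probs = pd.read_csv("previous_submission.csv")["lod_probability"]  # aligned row-for-row with test

# Treat confident predictions as ground truth and append them to the training set.
confident_neg = test[probs <= LOW].assign(lod=0)
confident_pos = test[probs >= HIGH].assign(lod=1)
train_pseudo = pd.concat([train, confident_neg, confident_pos], ignore_index=True)
```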

Model Training

Important

All outputs in the Jupyter notebooks have been deleted to prevent confidential data leakage.

  • models.ipynb: Trains the CatBoost, LightGBM, and Logistic Regression models.
  • autogluon.ipynb: Uses AutoGluon to automatically train models and generate a submission.
  • coffee-blender.ipynb: A custom script for manually blending submission files.

Models

  • First, we trained a CatBoost model for feature selection to eliminate low-importance features (see the sketch after this list).
  • We then trained a second CatBoost model on the feature-selected dataset.
  • We trained a LightGBM model on the feature-selected dataset.
  • We trained a Logistic Regression model on the feature-selected dataset.
  • Finally, we manually set weights to blend the predictions from each model, generating our best submission.
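A rough sketch of the CatBoost-based feature selection step; the iteration count and the importance threshold are illustrative, not the values used in models.ipynb.

```python
from catboost import CatBoostClassifier
import pandas as pd

def select_features(X: pd.DataFrame, y: pd.Series, threshold: float = 1.0) -> list:
    # Train a quick CatBoost model and keep features whose importance
    # exceeds the threshold (threshold value is illustrative).
    model = CatBoostClassifier(iterations=500, verbose=False, random_seed=42)
    model.fit(X, y)
    importance = pd.Series(model.get_feature_importance(), index=X.columns)
    return importance[importance > threshold].index.tolist()

# Usage (X_train / y_train are the cleaned features and the LOD labels):
# selected = select_features(X_train, y_train)
# X_train_selected = X_train[selected]
```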

Note

The combination of 40% CatBoost + 40% LightGBM + 20% Logistic Regression appeared to be the optimal blend.

Tip

Future work could include adding more models (beyond CatBoost, LightGBM, and Logistic Regression) to the blend, which could yield different and potentially better results.

AutoGluon

We used AutoGluon to automatically train a wide range of models and generate an optimized submission.
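A minimal sketch of the AutoGluon workflow; the file names, the label column lod, the id column, and the roc_auc metric are assumptions, since the actual competition setup is not described here.

```python
from autogluon.tabular import TabularPredictor
import pandas as pd

train = pd.read_csv("train_cleaned.csv")  # placeholder file names
test = pd.read_csv("eval_cleaned.csv")

predictor = TabularPredictor(
    label="lod",            # assumed name of the LOD result column
    problem_type="binary",
    eval_metric="roc_auc",  # the actual competition metric is not stated here
).fit(train, presets="best_quality", time_limit=3600)

# predict_proba returns one column per class; take the positive class (assumed label 1).
submission = pd.DataFrame({
    "id": test["id"],  # placeholder id column
    "lod_probability": predictor.predict_proba(test)[1],
})
submission.to_csv("submission.csv", index=False)
```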

Warning

The XGBoost model consistently caused an error on Kaggle for an unknown reason.

Coffee Blending

This script imports high-quality submission files (as Kaggle datasets) and manually blends them using weighted averaging to generate a final, optimized submission.
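A minimal sketch of this weighted averaging, using the 40/40/20 weights from the note above; the file and column names are placeholders.

```python
import pandas as pd

# Weights follow the note above (40% CatBoost, 40% LightGBM, 20% Logistic Regression);
# file and column names are placeholders.
files_and_weights = {
    "catboost_submission.csv": 0.40,
    "lightgbm_submission.csv": 0.40,
    "logreg_submission.csv": 0.20,
}

blend = None
for path, weight in files_and_weights.items():
    sub = pd.read_csv(path)
    if blend is None:
        blend = sub.copy()
        blend["lod_probability"] = weight * sub["lod_probability"]
    else:
        blend["lod_probability"] += weight * sub["lod_probability"]

blend.to_csv("blended_submission.csv", index=False)
```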


Team GoonSquad69 Members

  • Leader: Worralop Srichainont (reisenx)
  • Member: Nuttapong Jinda
  • Member: Naytipat Phothipan (SNaytiP)
  • Member: Sippapas Chavanont (iammmaimai)
  • Member: Theeraphat Jinanarong (SzipeRy)
