This repository contains the code and reproducible research pipeline for a machine learning project focused on ATR-FTIR spectra of saliva and gingival crevicular fluid.
The project investigates how data augmentation affects supervised model performance, probability calibration, PCA-based data geometry and the quality of synthetic spectra in biomedical datasets with limited sample size and class imbalance.
The repository supports a master’s thesis in Data Science and is designed as a compact, reproducible and verifiable research codebase.
The project includes three connected analysis tracks:
- Supervised evaluation: comparison of baseline and augmented scenarios using identical splits, models, preprocessing profiles, threshold rules and calibration schemes.
- Geometry-first analysis: analysis of changes in PCA-based data geometry and factor–PC associations, especially for the small-n gingival crevicular fluid dataset.
- Synthetic-data quality control: evaluation of synthetic spectra using real-vs-synthetic AUC, kNN overlap, Wasserstein distance and downstream sanity checks.
The key methodological principle is:
The effect of augmentation is evaluated as a paired difference: Augmented − Baseline, while keeping all other experimental conditions fixed.
This design makes it possible to interpret performance and geometry changes as the effect of augmentation rather than as a consequence of different preprocessing, model settings or validation schemes.
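As an illustration of the paired design, the following minimal sketch computes per-split differences between the two scenarios. The metric values are hypothetical placeholders, not project results, and `paired_delta` is an illustrative helper rather than a function from this repository.

```python
import statistics

# Hypothetical per-split metric values (e.g. PR-AUC), for illustration only.
# Index i refers to the same split and seed in both scenarios.
baseline = [0.71, 0.68, 0.74, 0.70, 0.69]
augmented = [0.73, 0.69, 0.73, 0.74, 0.70]


def paired_delta(aug, base):
    """Return per-split differences Augmented - Baseline."""
    if len(aug) != len(base):
        raise ValueError("both scenarios must be evaluated on the same splits")
    return [a - b for a, b in zip(aug, base)]


deltas = paired_delta(augmented, baseline)
mean_delta = statistics.mean(deltas)
sd_delta = statistics.stdev(deltas)
print(f"mean delta = {mean_delta:+.3f}, sd = {sd_delta:.3f}")
```

Because each difference is taken within one split, split-to-split variability cancels out and only the change attributable to augmentation remains.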
In this project, augmentation is treated not only as a way to improve supervised metrics but also as a methodological intervention that may affect:
- the geometry of the spectral feature space;
- the stability of cluster structure;
- the distribution of clinical-factor information across principal components;
- the calibration and reliability of predicted probabilities;
- the risk of overly optimistic supervised results in small biomedical datasets.
The final summary tables and figure-ready plots used for reporting are stored in:
- reports/final/
- reports/figs/
Large intermediate artifacts and full run-level reports are not stored in GitHub in order to keep the repository compact and reviewable.
The repository includes:
- processed datasets in parquet format;
- a universal supervised learning pipeline;
- batch experiment scripts for saliva and GDB small-n datasets;
- scripts for PCA and factor–PC association analysis;
- scripts for cluster stability analysis;
- scripts for synthetic-data QC and comparison of classic augmentation, VAE and WGAN approaches;
- result aggregation scripts and summary table generation;
- figure-ready plotting scripts;
- a minimal smoke test to verify that the main supervised pipeline runs without errors.
The project uses processed versions of an open ATR-FTIR saliva dataset for COVID-19 screening:
- data/processed/train.parquet
- data/processed/external.parquet
train.parquet contains 183 spectra from 61 subjects, with three spectral replicates per subject. It is used for supervised modeling.
external.parquet contains one record per subject and is used for PCA and exploratory geometry-oriented analysis.
Important note: external.parquet is not an independent external test set, because subject identifiers overlap with train.parquet. Therefore, it is used only for exploratory and geometry-oriented analysis.
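A check of this kind can be sketched with plain Python sets. The subject IDs below are hypothetical, and `subject_overlap` is an illustrative helper, not part of the repository; in practice the IDs would come from the subject-identifier column of each parquet file.

```python
def subject_overlap(train_ids, external_ids):
    """Return the sorted subject IDs that occur in both collections."""
    return sorted(set(train_ids) & set(external_ids))


# Hypothetical IDs: three replicates per subject in train,
# one record per subject in external.
train_ids = ["S01", "S01", "S01", "S02", "S02", "S02", "S03", "S03", "S03"]
external_ids = ["S01", "S02", "S03"]

shared = subject_overlap(train_ids, external_ids)
is_independent = len(shared) == 0
print(f"{len(shared)} shared subjects; independent test set: {is_independent}")
```

Any non-empty overlap means the second file cannot serve as an independent test set.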
The diabetes saliva dataset is stored as:
data/processed/diabetes_saliva.parquet
It contains 1040 ATR-FTIR saliva spectra together with metadata such as population, gender, age, glucose, glucose_group, hemoglobin and other variables.
In the current version of the project, results for this dataset are interpreted at the sample level rather than at the patient level.
The gingival crevicular fluid dataset (referred to in this repository as the GDB small-n dataset) is a small-n biomedical dataset provided by the research group of P. V. Seredin and previously described in the following publications:
The GDB small-n dataset is not included in the public repository due to access restrictions. For this dataset, the repository provides scripts, aggregated tables, final figures and a description of the data preparation and analysis workflow.
The dataset contains 18 gingival crevicular fluid spectra and the following clinical factors:
- Gender
- Age_factor
- caries_factor
- Parodont
- Anamnes_factor
It also includes derived binary classification tasks:
- y_parodont_H_vs_path
- y_anamnes_H_vs_path
- y_healthy_vs_any
For this dataset, the main focus is not direct improvement of supervised metrics, but analysis of data geometry, PCA/dimdesc-like factor associations, cluster stability and synthetic-data quality control.
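The idea of a dimdesc-like factor–PC association can be sketched as follows: compute PCA scores, then measure how much of each component's variance a categorical factor explains (one-way ANOVA R², also called eta squared). This is a minimal NumPy sketch on random stand-in data. The function names and the exact R² definition are assumptions for illustration and may differ from what scripts/pca_dimdesc_r2.py does.

```python
import numpy as np


def pca_scores(X, n_components):
    """PCA via SVD of the mean-centred matrix.

    Returns the component scores and the explained-variance ratios.
    """
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = S**2 / np.sum(S**2)
    return U[:, :n_components] * S[:n_components], var_ratio[:n_components]


def factor_r2(scores, factor):
    """One-way ANOVA R^2 of each PC on a categorical factor."""
    factor = np.asarray(factor)
    grand_mean = scores.mean(axis=0)
    total = ((scores - grand_mean) ** 2).sum(axis=0)
    between = np.zeros(scores.shape[1])
    for level in np.unique(factor):
        grp = scores[factor == level]
        between += len(grp) * (grp.mean(axis=0) - grand_mean) ** 2
    return between / total


# Random stand-in data: 18 "spectra", 40 points, one two-level factor.
rng = np.random.default_rng(1)
factor = np.array(["H"] * 9 + ["path"] * 9)  # hypothetical labels
X = rng.normal(size=(18, 40))
X[factor == "path", :5] += 2.0  # inject an artificial group effect

scores, var_ratio = pca_scores(X, n_components=5)
r2 = factor_r2(scores, factor)
best_pc = int(np.argmax(r2)) + 1  # 1-based index of the best PC
print("R^2 per PC:", np.round(r2, 3), "best PC:", best_pc)
```

Running the same computation on baseline and augmented data and comparing R² per component shows whether augmentation moves factor information between components.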
nir-ftir/
├── configs/ # Configuration files and service settings
├── data/
│ ├── raw/ # Raw data, stored locally and not versioned
│ └── processed/ # Processed parquet datasets
├── reports/ # Final summary tables and figure-ready plots
├── scripts/ # Experiment, aggregation and plotting scripts
├── src/ # Core pipeline code
├── tests/ # Minimal smoke tests
├── environment.yml # Conda/mamba environment
├── pyproject.toml
└── README.md
src/train_baselines.py is the main universal supervised learning pipeline.
It includes:
- loading parquet datasets;
- identifying spectral columns;
- preprocessing;
- leakage-safe splitting;
- train-only augmentation;
- model training;
- optional probability calibration;
- metric calculation;
- saving JSON reports.
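The leakage-safe splitting and train-only augmentation steps listed above can be sketched as follows. This is a minimal NumPy sketch on random stand-in data, not the code in src/train_baselines.py. The helper names, the hold-out fraction and the Gaussian-noise perturbation are assumptions for illustration.

```python
import numpy as np


def group_holdout_split(groups, test_fraction=0.25, seed=0):
    """Split row indices so every subject is entirely in train or in test."""
    rng = np.random.default_rng(seed)
    unique = np.unique(groups)
    rng.shuffle(unique)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_mask = np.isin(groups, unique[:n_test])
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def augment_with_noise(X, y, n_copies=2, noise_sd=0.01, seed=0):
    """Append noisy copies of X (Gaussian noise is an assumed perturbation)."""
    rng = np.random.default_rng(seed)
    noisy = [X + rng.normal(0.0, noise_sd, size=X.shape) for _ in range(n_copies)]
    return np.vstack([X, *noisy]), np.concatenate([y] * (n_copies + 1))


# Random stand-in data: 8 subjects x 3 replicates, 50 spectral points.
rng = np.random.default_rng(42)
groups = np.repeat(np.arange(8), 3)
X = rng.normal(size=(24, 50))
y = np.repeat(rng.integers(0, 2, size=8), 3)

train_idx, test_idx = group_holdout_split(groups)

# Augmentation is applied after the split and only to the training rows.
X_train_aug, y_train_aug = augment_with_noise(X[train_idx], y[train_idx])
X_test, y_test = X[test_idx], y[test_idx]  # the held-out rows stay untouched
print(X_train_aug.shape, X_test.shape)
```

Splitting by subject keeps replicates of one subject on the same side of the split, and augmenting after the split keeps perturbed copies of test spectra out of training.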
- src/prepare_data.py: preparation of saliva datasets;
- src/preprocess_diabetes_saliva.py: preparation of the diabetes saliva dataset;
- src/prepare_gdb_smalln.py: preparation of the GDB small-n dataset;
- src/eda_qc.py: EDA and QC for saliva datasets;
- src/cluster_analysis.py: extended exploratory and cluster analysis.
- scripts/run_all_experiments.sh: main experiment series for saliva datasets
  - COVID-19: baseline vs classic augmentation;
  - diabetes: baseline vs strong augmentation.
- scripts/run_gdb_study.sh: supervised stability experiments for GDB small-n;
- scripts/run_gdb_qc_r2.sh: synthetic-data QC and downstream sanity checks for GDB;
- scripts/run_gdb_dimdesc_r2.sh: PCA and dimdesc-like analysis for GDB.
- scripts/pca_dimdesc_r2.py: PCA and factor–PC association analysis;
- scripts/cluster_pca_stability.py: cluster stability analysis;
- scripts/gdb_qc_r2_generators.py: QC comparison of classic augmentation, VAE and WGAN synthetic data.
- scripts/aggregate_reports.py: summary tables for saliva experiments;
- scripts/aggregate_gdb_smalln_reports.py: aggregation of supervised GDB small-n runs;
- scripts/aggregate_dimdesc_r2.py: PCA/dimdesc-like analysis summaries;
- scripts/plot_summary.py: final comparative plots;
- scripts/plot_dimdesc_r2_curves.py: R² curves across principal components;
- scripts/make_figs.py: export of selected figure-ready plots.
All experiment series follow the same comparison principle:
baseline → augmentation
In the baseline scenario, models are trained only on real spectra.
In the augmented scenario, synthetically perturbed versions of the training spectra are added to the training data.
All other conditions are kept identical, including:
- data splits;
- preprocessing profile;
- model type;
- threshold selection rule;
- calibration setting;
- metric calculation.
This makes it possible to interpret Augmented − Baseline differences as the effect of augmentation.
The project covers two different analysis regimes.
For saliva datasets, augmentation is mainly evaluated as a possible regularizer for supervised models.
The following metrics are analyzed:
- Recall
- F1
- PR-AUC
- ROC-AUC
- Specificity
- Brier score
- Expected Calibration Error
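For reference, the two calibration metrics in the list above can be computed as in the following minimal NumPy sketch. It assumes a binary task, positive-class probabilities and equal-width bins for the Expected Calibration Error. The binning scheme used in the project pipeline may differ, and the toy labels and probabilities are hypothetical.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE with equal-width bins over the positive-class probability.

    Each non-empty bin contributes |observed positive rate - mean predicted
    probability|, weighted by the share of samples in that bin.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Map each probability to a bin index; p == 1.0 goes into the last bin.
    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pred[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Hypothetical toy predictions.
y = [0, 0, 1, 1]
p = [0.1, 0.3, 0.7, 0.9]
brier = brier_score(y, p)
ece = expected_calibration_error(y, p)
print(f"Brier = {brier:.3f}, ECE = {ece:.3f}")
```

Lower values are better for both metrics. With very small test folds many bins are empty, so ECE estimates become noisy.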
For the GDB small-n dataset, augmentation is treated primarily as an intervention into data geometry.
Additional analyses include:
- PCA and redistribution of variance;
- dimdesc-like factor–PC associations;
- best-PC and top-k analysis;
- cluster stability;
- synthetic-data QC:
  - real-vs-synthetic AUC;
  - kNN overlap;
  - Wasserstein distance.
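The three synthetic-data QC metrics can be illustrated with a minimal NumPy sketch. The definitions below are simplified assumptions and not necessarily those used in scripts/gdb_qc_r2_generators.py. The real-vs-synthetic AUC comes from a leave-one-out kNN score instead of a trained classifier, kNN overlap is the mean share of neighbours that come from the other set, and the Wasserstein distance is averaged over features using the equal-sample-size shortcut. The "synthetic" spectra are random stand-ins.

```python
import numpy as np


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance for two samples of equal size."""
    a, b = np.sort(a), np.sort(b)
    if a.shape != b.shape:
        raise ValueError("this shortcut assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))


def knn_synthetic_fraction(X_real, X_syn, k=3):
    """For each pooled sample, the share of its k nearest neighbours that are synthetic."""
    X = np.vstack([X_real, X_syn])
    is_syn = np.r_[np.zeros(len(X_real)), np.ones(len(X_syn))]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    return is_syn[nn].mean(axis=1), is_syn


def pairwise_auc(scores, labels):
    """ROC-AUC as P(score_pos > score_neg), counting ties as 0.5."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


rng = np.random.default_rng(0)
X_real = rng.normal(size=(18, 20))
X_syn = X_real + rng.normal(scale=0.05, size=X_real.shape)  # stand-in data

frac_syn, is_syn = knn_synthetic_fraction(X_real, X_syn)
auc = pairwise_auc(frac_syn, is_syn)  # near 0.5: hard to tell apart
knn_overlap = float(np.mean(np.where(is_syn == 1, 1 - frac_syn, frac_syn)))
w1 = float(np.mean([wasserstein_1d(X_real[:, j], X_syn[:, j])
                    for j in range(X_real.shape[1])]))
print(f"AUC = {auc:.2f}, kNN overlap = {knn_overlap:.2f}, W1 = {w1:.3f}")
```

Under these definitions, an AUC far from 0.5 or a kNN overlap close to 0 indicates that synthetic spectra occupy a region of feature space distinct from the real ones.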
mamba env create -f environment.yml
mamba activate ftir311_local
bash scripts/run_all_experiments.sh
bash scripts/run_gdb_study.sh
bash scripts/run_gdb_qc_r2.sh
bash scripts/run_gdb_dimdesc_r2.sh
python scripts/aggregate_reports.py
python scripts/aggregate_gdb_smalln_reports.py
python scripts/aggregate_dimdesc_r2.py
A minimal smoke test is included to verify that the environment is correctly configured and that the main supervised pipeline can be imported and executed on a test example.
python -m pytest -q tests/test_smoke_pipeline.py
Expected result:
1 passed
This test does not reproduce the full experiment and does not replace the main experiment series. Its purpose is to quickly check the technical integrity of the repository after changes in code, environment or project structure.
Python file compilation can also be checked with:
python -m compileall src scripts tests
The repository intentionally includes:
- project code;
- configuration files;
- key processed parquet datasets;
- compact summary files;
- selected final figures.
The repository does not include:
- raw data;
- large intermediate artifacts;
- full run-level reports;
- automatically generated QC and figure folders.
For convenient review, the most important final artifacts are stored in compact folders.
reports/final/ contains the main summary tables:
- gdb_dimdesc_window_summary.csv: comparison of spectral windows for GDB small-n;
- gdb_dimdesc_best_pc_per_factor.csv: best-PC results for clinical factors;
- gdb_qc_amide3_method_summary.csv: synthetic-data QC for the Amide III region;
- gdb_qc_broad_method_summary.csv: synthetic-data QC for the broader control range;
- diabetes_meta_only_holdout.csv: compact summary for the diabetes saliva dataset.
reports/figs/ contains the final figure-ready plots:
- fig1_dimdesc_windows.png/.pdf;
- fig2_pc_curve_amide3_Anamnes_factor.png/.pdf.
These files serve as a compact and verifiable representation of the main project results.
The repository reflects the final research structure of the master’s thesis project:
- open saliva datasets;
- GDB small-n dataset;
- baseline vs augmentation experiments;
- supervised evaluation;
- geometry-first analysis;
- cluster stability checks;
- synthetic-data quality control.
Keywords: ATR-FTIR, biomedical spectra, saliva, gingival crevicular fluid, data augmentation, machine learning, medical AI, healthcare analytics, PCA, synthetic data quality control, small-n biomedical data, model calibration, reproducible research.