Team: Pranav M R · Jayant Chandwani · Ishaan Gupta · Abhav Garg
Check out these slides for a detailed explanation of the entire pipeline and our approach!
🖼️ View the full presentation
This repository contains our submission for the Robust Vision challenge. The goal is to train on a noisy / label-poisoned source dataset and robustly adapt to an unlabeled target domain at test time.
Key idea: train a noise-robust classifier on source_toxic.pt, then apply lightweight, unsupervised test-time adaptation (BatchNorm stat alignment + entropy minimization) and label-shift correction (BBSE) on the target distribution.
- `train.py` – end-to-end multi-phase pipeline (training → estimation → test-time adaptation → submission)
- `model_submission.py` – model definition (`RobustClassifier`) + `load_weights(path)` helper for graders
- `requirements.txt` – Python dependencies
- `data/` – datasets (place the provided `.pt` files here)
- `checkpoints/` – intermediate checkpoints created by `train.py`
Create a virtual environment and install dependencies:
Windows (PowerShell):

```powershell
conda create -n rv python=3.11
conda activate rv
pip install -r requirements.txt
```

Linux/macOS (bash/zsh):

```bash
conda create -n rv python=3.11
conda activate rv
pip install -r requirements.txt
```

Place the challenge-provided files under `data/`:
- `source_toxic.pt` – training set (expects keys: `images`, `labels`)
- `static.pt` – unlabeled target set (expects key: `images`)
- `test_suite_public.pt` – scenario suite used to generate the final submission (expects scenario tensors)
- `val_sanity.pt` – small clean validation set (expects keys: `images`, `labels`)
Run the full pipeline:
```bash
# Full pipeline: train from scratch to generate submission.csv
python train.py
```

`train.py` is fully resumable. If interrupted, re-running it picks up from the last saved checkpoint. Delete `checkpoints/` to force a clean retraining run.
Outputs:
- `weights.pth` – final weights for submission
- `submission.csv` – predictions in the expected format
- `checkpoints/` – phase checkpoints for resuming runs
- Phase 1 — Robust training: train from random initialization on noisy labels using Symmetric Cross-Entropy (SCE).
- Phase 2 — Confusion estimation: estimate and correct a confusion matrix used for label-shift estimation.
- Phase 3 — Test-time adaptation: re-estimate BatchNorm running stats on target data and optionally apply TENT-style entropy minimization (BN affine params only).
- Phase 4 — BBSE prior correction: estimate target priors and apply a log-prior correction to predictions.
Model: Custom ResNet-style classifier (RobustClassifier). It is a 3-stage residual network for 1×28×28 grayscale inputs: a stem (Conv→BN→ReLU), then three residual stages (64→64→128→256 channels; 28×28→14×14→7×7), followed by AdaptiveAvgPool, Dropout(0.25), and a Linear(256, 10) head. Each residual block contains two BatchNorm2d layers (plus one on the skip connection when channels or stride change), for 15 BatchNorm2d layers total. This is intentional because BatchNorm statistics are our main handle for test-time adaptation.
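The architecture described above can be sketched as follows. The number of residual blocks per stage is not stated explicitly, so this sketch assumes two basic blocks per stage, which is what reproduces the stated count of 15 BatchNorm2d layers; treat it as an illustration, not the exact `model_submission.py` code.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with BN; a 1x1-conv + BN shortcut when shape changes."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.shortcut = nn.Sequential()
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride, bias=False),
                nn.BatchNorm2d(c_out))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

class RobustClassifier(nn.Module):
    """Stem (1 -> 64) + three 2-block residual stages + pooled linear head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True))
        self.stage1 = nn.Sequential(BasicBlock(64, 64), BasicBlock(64, 64))        # 28x28
        self.stage2 = nn.Sequential(BasicBlock(64, 128, 2), BasicBlock(128, 128))  # 14x14
        self.stage3 = nn.Sequential(BasicBlock(128, 256, 2), BasicBlock(256, 256)) # 7x7
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(0.25), nn.Linear(256, num_classes))

    def forward(self, x):
        x = self.stem(x)
        x = self.stage3(self.stage2(self.stage1(x)))
        return self.head(x)
```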
Initialization: Random initialization. No pre-trained weights are used at any stage.
Augmentation: Standard geometry and photometric transforms only: RandomCrop(28, padding=4, reflect), RandomHorizontalFlip, RandomRotation(15°), RandomErasing(p=0.25). We do not use corruption-simulating augmentations (no AugMix, PixMix, or corruption-mimicking transforms).
To compare approaches during development, we built a private eval_suite from labeled data, containing 8 corruption scenarios at fixed severity levels:
| Scenario | Type |
|---|---|
| `contrast_reduction` | Photometric |
| `defocus_blur` | Blur |
| `gaussian_noise` | Additive noise |
| `impulse_noise_medium` | Salt-and-pepper noise |
| `impulse_noise_heavy` | Salt-and-pepper noise (severe) |
| `pixelate_medium` | Spatial downsampling |
| `posterize` | Bit-depth reduction |
| `shot_noise` | Poisson noise |
Each scenario contains 5,000 labeled images with a deliberately uneven class distribution (e.g., Sandal = 35%, Trouser = 1.2%) to simulate the type of prior shift we expect in the hidden evaluation. The training set is near-uniform (≈6,000 per class), so the model must handle both covariate shift and label shift. We used Macro-F1 as the primary comparison metric because it weights all 10 classes equally regardless of support, making it sensitive to failures on minority classes. All ablations reported here (GCE vs. SCE, TENT on/off, temperature scaling) were measured on this suite.
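A toy example (hypothetical numbers) of why Macro-F1 is the right lens for uneven class distributions: a predictor that always outputs the majority class keeps high accuracy but collapses on Macro-F1.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # class 1 is the minority
y_pred = [0] * 10                          # majority-only predictor

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# accuracy = 0.8, yet the predictor is useless on class 1
macro = f1_score(y_true, y_pred, average="macro", labels=[0, 1], zero_division=0)
# macro-F1 averages per-class F1 with equal weight: (8/9 + 0) / 2 ~= 0.44
```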
Compliance note: eval_suite was used only for evaluation and hyperparameter tuning. We did not use it to train the submitted model in any form (no gradient updates and no mixing into the training set).
Approach: We train on source_toxic.pt (60,000 images, 30% symmetric label noise) using Symmetric Cross-Entropy (SCE) loss. Optimization uses SGD (momentum=0.9, weight_decay=5e-4) with a 5-epoch linear warm-up followed by cosine annealing, for 100 epochs total. We keep the checkpoint with the best noisy-validation accuracy.
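A minimal sketch of the SCE loss; `alpha`, `beta`, and the log-clip constant `A` are shown with the defaults from Wang et al., and the values actually used in `train.py` may differ.

```python
import torch
import torch.nn.functional as F

def sce_loss(logits, targets, alpha=0.1, beta=1.0, A=-4.0):
    """Symmetric Cross-Entropy: alpha * CE + beta * RCE."""
    ce = F.cross_entropy(logits, targets)
    # RCE swaps the roles of prediction and label:
    # -sum_k p(k|x) * log q(k|x), with log(0) on the one-hot label clipped to A.
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    log_label = torch.where(one_hot > 0,
                            torch.zeros_like(one_hot),
                            torch.full_like(one_hot, A))
    rce = -(pred * log_label).sum(dim=1).mean()
    return alpha * ce + beta * rce
```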
Training-data restriction note: all learned weights come only from source_toxic.pt, as instructed. We do not train on any additional labeled data, and we avoid banned augmentation techniques (for example, Mixup). Only the permitted augmentations listed in the first section are used.
What we tried: We initially trained with Generalized Cross-Entropy (GCE) loss. Comparing GCE against SCE on our eval suite (see the ablations noted above), SCE gave the better Macro-F1, so the submitted model uses SCE.
Justification: SCE combines a standard CE term with a Reverse CE (RCE) term:

$$\mathcal{L}_{\mathrm{SCE}} = \alpha\,\mathcal{L}_{\mathrm{CE}} + \beta\,\mathcal{L}_{\mathrm{RCE}}, \qquad \mathcal{L}_{\mathrm{RCE}} = -\sum_{k=1}^{K} p(k \mid x)\,\log q(k \mid x),$$

where $p(\cdot \mid x)$ is the model's predicted distribution and $q(\cdot \mid x)$ is the one-hot label distribution with $\log 0$ clipped to a constant $A$. With symmetric label noise at rate $\eta$, the RCE term is a symmetric loss and is therefore robust to the noisy labels, while the CE term supplies the strong learning signal that RCE alone lacks: together they fit the clean structure of the data without memorizing the 30% of corrupted labels the way plain CE does.
Approach: After Phase 1, we compute a soft-count confusion matrix $\hat{C}$ for the trained model on a held-out validation split, then correct it for the known label noise; the corrected matrix $C_{true}$ is what the BBSE inversion in Phase 4 consumes.
What we tried: We first tried a standard hard-prediction confusion matrix on the clean val_sanity.pt set, but that set has only 100 samples, so the estimates were noisy. Switching to the full 10% noisy validation split with soft counts was more stable. We also experimented with temperature scaling before BBSE inversion, aiming for a better-calibrated confusion matrix.
Justification: The key insight, formalized by Patrini et al., is that under class-dependent label noise the noisy posterior and the clean posterior are related by the transition matrix $T$, with $T_{ij} = P(\tilde{y} = j \mid y = i)$. For symmetric noise at rate $\eta$ over $K$ classes, every clean label stays correct with probability $1 - \eta$ and flips to each of the other $K - 1$ classes with probability $\eta / (K - 1)$, so $T$ has $1 - \eta$ on the diagonal and $\eta / (K - 1)$ everywhere else. Because the model is trained on noisy labels, the confusion matrix estimated against noisy validation labels mixes the model's genuine errors with this label-noise transition; we therefore use $T$ to strip out the noise component and recover the corrected matrix $C_{true}$ used for BBSE.
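A sketch of the estimation and correction step under these definitions. `symmetric_T`, `soft_confusion`, and `denoise_confusion` are illustrative names, and the uniform-prior inversion in `denoise_confusion` is one simple correction scheme, not necessarily the exact one in `train.py`.

```python
import numpy as np

def symmetric_T(eta, K=10):
    """Symmetric-noise transition matrix: T[i, j] = P(noisy = j | clean = i)."""
    T = np.full((K, K), eta / (K - 1))
    np.fill_diagonal(T, 1.0 - eta)
    return T

def soft_confusion(probs, labels, K=10):
    """Soft-count confusion matrix: row i is the mean predicted distribution
    over samples whose (noisy) label is i."""
    return np.stack([probs[labels == i].mean(axis=0) for i in range(K)])

def denoise_confusion(C_noisy, T):
    """Approximately remove the label-noise mixing from the rows. Assuming
    roughly uniform clean priors, C_noisy ~= T^T @ C_true, so we invert T^T,
    clip, and renormalize the rows."""
    C = np.linalg.solve(T.T, C_noisy)
    C = np.clip(C, 1e-8, None)
    return C / C.sum(axis=1, keepdims=True)
```

Sanity check on the construction: for a perfect classifier under symmetric noise, the noisy-label confusion matrix equals $T^\top$, and denoising it recovers the identity.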
Approach: For each test scenario, three sequential adaptation steps are applied to a freshly reloaded copy of the trained model:

1. BN Statistics Reset & Re-estimation (`adapt_bn`): All BatchNorm running means and variances are zeroed and recomputed over the target batch in a single no-grad forward pass (`momentum=None` for a cumulative average). This replaces source-domain BN statistics with target-domain statistics and helps correct covariate shift induced by sensor noise.
2. TENT Entropy Minimization (`tent_adapt`): Only the BN affine parameters ($\gamma$, $\beta$) are unfrozen and updated for 10 Adam steps (lr=2e-3), minimizing Shannon entropy $H = -\sum_k p_k \log p_k$ over the target batch. This encourages the model to produce confident, low-entropy predictions under the new domain.
3. BBSE Prior Correction (`predict_with_prior`): The target class prior $\hat{p}_t$ is estimated by inverting the BBSE equation $\hat{\mu}_t = C_{true}^\top \hat{p}_t$, where $\hat{\mu}_t$ is the empirical prediction-frequency vector. Final predictions use a log-ratio correction: $\hat{y}(x) = \arg\max_k \left[\log p(k \mid x) + \log \hat{p}_t(k) - \log \hat{p}_s(k)\right]$, where $\hat{p}_s$ is the source class prior.
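The three adaptation steps can be sketched as follows. The function names match those used in this README, but the bodies are a minimal reconstruction, not the exact `train.py` code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn(model, x_target):
    """Zero BN running stats and re-estimate them on the target batch
    (momentum=None gives a cumulative average over forward passes)."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()
            m.momentum = None
    model.train()        # BN updates running stats only in train mode
    model(x_target)
    model.eval()

def tent_adapt(model, x_target, steps=10, lr=2e-3):
    """TENT: unfreeze only BN affine params (gamma, beta) and minimize
    prediction entropy over the target batch."""
    model.train()
    for p in model.parameters():
        p.requires_grad_(False)
    bn_params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            bn_params += [m.weight, m.bias]
    opt = torch.optim.Adam(bn_params, lr=lr)
    for _ in range(steps):
        p = model(x_target).softmax(dim=1)
        entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    model.eval()

@torch.no_grad()
def predict_with_prior(model, x, p_src, p_tgt):
    """Log-ratio prior correction:
    argmax_k [log p(k|x) + log p_t(k) - log p_s(k)]."""
    log_probs = model(x).log_softmax(dim=1)
    return (log_probs + p_tgt.log() - p_src.log()).argmax(dim=1)
```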
What we tried: We ran the full pipeline with and without TENT. For mild corruptions like contrast_reduction and defocus_blur, the difference was small. For harder shifts like pixelate_medium, TENT produced a meaningful improvement because entropy minimization encourages more decisive predictions when BN statistics alone do not fully close the domain gap. TENT can become unstable if the learning rate or step count is too high (it may collapse to a single class), so we kept it conservative at 8-10 steps with lr=2e-3. We found 8 steps to perform slightly better.
Generalization Justification: BN re-estimation and TENT are both unsupervised and react to the test distribution as it arrives, without assuming a specific corruption type. By updating only BN parameters (a small fraction of the total weights), the feature representation learned during Phase 1 stays largely intact. Only the internal normalization shifts to match the new domain, which is why we expect the approach to transfer to unseen corruptions in hidden_eval.pt.
- Symmetric Cross-Entropy Loss for Robust Learning with Noisy Labels (Wang et al., ECCV 2020) -> SCE
- Generalized Cross-Entropy Loss for Training Deep Neural Networks with Noisy Labels (Zhang et al., NeurIPS 2018) -> GCE
- Making Deep Neural Networks Robust to Label Noise: a Loss Correction Approach (Patrini et al., CVPR 2017) -> Noise Transition Matrix
- Test-Time Training with Self-Supervision for Generalization under Distribution Shifts (Sun et al., ICML 2020) -> BNStats
- Black Box Shift Estimation (Lipton et al., ICML 2018) -> BBSE
- Tent: Fully Test-Time Adaptation by Entropy Minimization (Wang et al., ICLR 2021) -> Tent
- On Calibration of Modern Neural Networks (Guo et al., ICML 2017) -> Temperature Scaling