Data-Centric Approach for Wake Vision Challenge

This document outlines the data-centric methodology used to enhance the dataset for the Edge AI Foundation Wake Vision 2 challenge.

Participant: Petri Oksanen
Date: June 7, 2025


1. Introduction

The primary goal of this project was to improve the performance of a given baseline classifier model. Rather than modifying the model's architecture, the approach was entirely data-centric: enhancing the training dataset to make the model more robust and accurate. Development of the VQ-VAE model and augmentation strategies began on May 28th, 2025; training on the complete dataset commenced on June 7th, 2025, and the final quantized model was generated on June 15th, 2025. The core idea was to leverage a sophisticated generative model to create new, high-quality training samples that expose the classifier to a wider variety of visual scenarios. Critically, this method aims to improve accuracy without altering the final classifier's architecture, thereby preserving its small footprint and suitability for edge deployment.

2. Methodology: Latent Space Augmentation

The chosen approach involved training a powerful Vector Quantized Variational Autoencoder (VQ-VAE) on the provided training data. This generative model learns a compressed, quantized representation (the "latent space") of the images. A key feature of this specific VQ-VAE is its use of a residual quantization scheme, which employs multiple codebooks to capture image features at different levels of abstraction—from coarse shapes to fine-grained textures.

Throughout the development of the VQ-VAE, TensorBoard was used extensively for logging and visualization. This allowed for close monitoring of training metrics, analysis of the latent space, and visualization of reconstructed images, which was critical for guiding hyperparameter tuning and model improvements.
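
For illustration, the sketch below shows the kind of TensorBoard logging used during VQ-VAE training; the log directory, metric names, and helper function are assumptions rather than the exact code in data_centric.py.

```python
import tensorflow as tf

# Hypothetical log directory; the path actually used during training may differ.
writer = tf.summary.create_file_writer("logs/vqvae")

def log_training_step(step, losses, originals, reconstructions):
    """Log scalar losses and a few reconstructed images for one training step."""
    with writer.as_default():
        # Scalar curves, e.g. total, reconstruction, and commitment losses.
        for name, value in losses.items():
            tf.summary.scalar(name, value, step=step)
        # Image grids for visual inspection (values expected in [0, 1]).
        tf.summary.image("originals", originals, step=step, max_outputs=4)
        tf.summary.image("reconstructions", reconstructions, step=step, max_outputs=4)
```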

The overall project workflow followed three distinct phases, as illustrated below:

graph TD
    subgraph "Phase 1: Generative Model Training"
        A["Image Dataset"] --> B["Train VQ-VAE Model"];
        B --> C["Trained VQ-VAE <br/>(Encoder + Decoder + Codebooks)"];
    end

    subgraph "Phase 2: Classifier Initialization"
        C -- "Copy Encoder Layer Weights" --> E;
        D["New Classifier Model<br/>(Untrained)"] --> E{"Initialize First N Layers<br/>from VQ-VAE Encoder"};
        E --> F["Classifier with<br/>Pre-trained Feature Extractor<br/>(Frozen Initial Layers)"];
    end

    subgraph "Phase 3: Data-Centric Classifier Training"
        A -- "Provide Images for Augmentation" --> G["Augmentation Pipeline<br/>(Described in Chapter 3)"];
        F -- "Train on Augmented Data" --> H{"Train Classifier"};
        G -- "Generate Augmented Batch" --> H;
    end

    H --> I[("Final Trained Classifier<br/>(.tflite)")];

    classDef darkTheme fill:#222,color:#fff,stroke:#aaa,stroke-width:1px;
    class A,B,C,D,E,F,G,H,I darkTheme;

The following diagram illustrates this residual quantization process:

graph TD
    A["Input Image"] --> B("Encoder");
    B --> Z{"Latent Representation<br/>z"};

    subgraph "Residual Quantization Process"
        Z -- "r_0 = z" --> Stage1;
        
        subgraph "Stage 1"
            Stage1("Quantize r_0<br/>with Codebook 1");
            Stage1 --> z_q1["Quantized<br/>Vector z_q1"];
            Stage1 --> R1{"Residual r_1<br/>(r_0 - z_q1)"};
        end
        
        R1 -- "Input to Stage 2" --> Stage2;
        
        subgraph "Stage 2"
            Stage2("Quantize r_1<br/>with Codebook 2");
            Stage2 --> z_q2["Quantized<br/>Vector z_q2"];
            Stage2 --> R2{"Residual r_2<br/>(r_1 - z_q2)"};
        end
        
        R2 --> Dots["..."];
        
        Dots -- "Input to Stage 6" --> Stage6;
        
        subgraph "Stage 6"
            Stage6("Quantize r_5<br/>with Codebook 6");
            Stage6 --> z_q6["Quantized<br/>Vector z_q6"];
        end
    end

    subgraph "Aggregation"
        Sum["Σ"];
        Sum -- "Summation" --> FinalZq{"Final Quantized Latent<br/>z_q = Σ z_qi"};
    end
    
    z_q1 -.-> Sum;
    z_q2 -.-> Sum;
    z_q6 -.-> Sum;

    FinalZq --> Decoder("Decoder");
    Decoder --> Output["Reconstructed Image"];
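
To make the residual quantization concrete, the following is a minimal sketch of the forward pass over a list of codebooks (six in this project). Shapes and names are illustrative rather than the exact implementation in vqvae_model.py, and the straight-through gradient estimator and commitment losses used during training are omitted.

```python
import tensorflow as tf

def residual_quantize(z, codebooks):
    """Residual vector quantization of a latent tensor.

    z:         latent tensor of shape (batch, h, w, d)
    codebooks: list of embedding matrices, each of shape (num_codes, d)
    Returns the summed quantized latent z_q and the per-stage code indices.
    """
    residual = z
    z_q = tf.zeros_like(z)
    indices = []
    for codebook in codebooks:
        # Nearest-neighbour lookup for the current residual.
        flat = tf.reshape(residual, [-1, residual.shape[-1]])        # (N, d)
        distances = (
            tf.reduce_sum(flat ** 2, axis=1, keepdims=True)
            - 2.0 * tf.matmul(flat, codebook, transpose_b=True)
            + tf.reduce_sum(codebook ** 2, axis=1)
        )                                                            # (N, num_codes)
        codes = tf.argmin(distances, axis=1)
        quantized = tf.reshape(tf.gather(codebook, codes), tf.shape(residual))
        # Accumulate the quantized vector and pass the residual to the next stage.
        z_q += quantized
        residual -= quantized
        indices.append(codes)
    return z_q, indices
```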

To further enhance knowledge transfer, the initial blocks of the VQ-VAE's encoder were intentionally designed to mirror the architecture of the early layers of the baseline classifier. When training the final classifier, the weights from these aligned blocks of the pre-trained VQ-VAE encoder were used to initialize the corresponding layers in the classifier, and those layers were then frozen. This strategy aimed to provide the classifier with a powerful, pre-trained feature extractor, potentially accelerating convergence and improving its ability to recognize low-level features.
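
A minimal sketch of this initialization step is shown below, assuming both models are Keras models whose first N layers line up one-to-one; the actual layer count and model objects in data_centric.py may differ.

```python
import tensorflow as tf

def transfer_encoder_weights(vqvae_encoder, classifier, num_layers):
    """Copy the first `num_layers` of the VQ-VAE encoder into the classifier
    and freeze them.

    Assumes the first `num_layers` layers of both models were built with
    identical architectures, so their weight shapes match one-to-one.
    """
    for src, dst in zip(vqvae_encoder.layers[:num_layers],
                        classifier.layers[:num_layers]):
        dst.set_weights(src.get_weights())
        dst.trainable = False  # keep the transferred feature extractor fixed

# Hypothetical usage; the layer count and model objects are assumptions.
# transfer_encoder_weights(vqvae.encoder, classifier, num_layers=4)
# classifier.compile(...)  # recompile after changing `trainable` flags
```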

Once the VQ-VAE was trained, it was used as a tool for a novel data augmentation strategy called Latent Space Perturbation. The process is as follows:

  1. An image from the training set labeled as "human" is selected.
  2. The image is passed through the VQ-VAE's encoder to obtain its latent representation.
  3. This latent representation is then intentionally manipulated or "perturbed."
  4. The perturbed latent representation is passed through the VQ-VAE's decoder to generate a new, unique image that is similar to the original but contains novel variations.

The specific perturbation technique used was Aggressive Channel Scaling. In this method, a random subset of channels in the latent space vector is selected. The values in these channels are then aggressively scaled—either amplified or attenuated. This has the effect of altering an image's textures and fine details while preserving its fundamental structure, creating a useful new training example.
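
The sketch below illustrates this kind of channel scaling on a quantized latent tensor; the fraction of channels perturbed and the scale range are assumptions, not the values actually used in the submission.

```python
import tensorflow as tf

def perturb_latent_channels(z_q, scale_fraction=0.25, scale_range=(0.3, 2.5)):
    """Aggressive channel scaling of a quantized latent tensor.

    z_q:            quantized latent of shape (batch, h, w, channels)
    scale_fraction: fraction of channels to perturb
    scale_range:    (min, max) multiplicative scale for the chosen channels
    """
    num_channels = z_q.shape[-1]
    # Randomly select a subset of channels to scale.
    mask = tf.cast(tf.random.uniform([num_channels]) < scale_fraction, z_q.dtype)
    # Draw one aggressive scale factor per channel (amplify or attenuate).
    scales = tf.random.uniform([num_channels], scale_range[0], scale_range[1],
                               dtype=z_q.dtype)
    # Selected channels are scaled; the rest are left untouched.
    per_channel = mask * scales + (1.0 - mask)
    return z_q * per_channel  # broadcasts over batch and spatial dimensions

# Hypothetical end-to-end augmentation: encode, quantize, perturb, decode.
# z = encoder(images)
# z_q, _ = residual_quantize(z, codebooks)
# new_images = decoder(perturb_latent_channels(z_q))
```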


3. Data Augmentation Pipeline

To create a robust training dataset, a combined augmentation strategy was employed. For each batch of images during classifier training, a random selection process determines which augmentation technique is applied to each individual image.

  • 75% of images are processed using the VQ-VAE-based Latent Space Perturbation.
  • 25% of images undergo standard Geometric Augmentation (random horizontal flips and rotations).

This hybrid approach ensures that the classifier benefits from both the novel variations generated by the VQ-VAE and the traditional robustness provided by geometric transformations. The entire pipeline is illustrated below.

graph TD
    subgraph "Combined Augmentation Strategy"
        Input[("Input Image Batch")] --> Branch{"For each image,<br/>randomly choose path"};

        Branch -- "75% Probability" --> VQVAE_Augmentation;
        Branch -- "25% Probability" --> Geometric_Augmentation;

        subgraph VQVAE_Augmentation ["VQ-VAE Latent Perturbation"]
            direction TB
            A1["Encode Image<br/>to Latent Space"] --> A2["Quantize Latent<br/>(using RVQ)"];
            A2 --> A3{"Apply Perturbations"};
            subgraph "Perturbation Steps"
                direction LR
                A3 --> A4["Channel Scaling<br/>(Amplify/Attenuate a<br/>subset of channels)"];
                A3 --> A5["Spatial Warping<br/>(Randomly shift<br/>latent features)"];
            end
            A5 --> A6["Decode to<br/>Generate New Image"];
            A4 --> A6
        end

        subgraph Geometric_Augmentation ["Standard Geometric Augmentation"]
            direction TB
            B1["Random<br/>Horizontal Flip"] --> B2["Random<br/>Rotation"];
        end

        VQVAE_Augmentation --> Merge["Combine Results"];
        Geometric_Augmentation --> Merge;

        Merge --> Output[("Final Augmented<br/>Training Batch")];
    end
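
For reference, here is a minimal sketch of the per-image selection logic described above; `vqvae_perturb` is a hypothetical stand-in for the encode-perturb-decode path, and the 90-degree rotation is a simple placeholder for the rotation actually used in the pipeline.

```python
import tensorflow as tf

def vqvae_perturb(image):
    """Placeholder for the encode -> perturb -> decode path sketched earlier."""
    return image  # the real pipeline returns a decoded, perturbed image

def augment_image(image):
    """Per-image choice: 75% latent perturbation, 25% geometric augmentation."""
    def latent_branch():
        return vqvae_perturb(image)

    def geometric_branch():
        flipped = tf.image.random_flip_left_right(image)
        return tf.image.rot90(flipped, k=tf.random.uniform([], 0, 4, dtype=tf.int32))

    return tf.cond(tf.random.uniform([]) < 0.75, latent_branch, geometric_branch)

# Hypothetical dataset wiring:
# train_ds = train_ds.map(lambda x, y: (augment_image(x), y),
#                         num_parallel_calls=tf.data.AUTOTUNE)
```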

Examples of Augmentation

The following images illustrate the process. An original "human" image is encoded and then decoded to create a standard reconstruction. A separate, perturbed version is also generated, showing clear variations from the original.

Augmentation Example 1 (image)

Augmentation Example 2 (image)

Augmentation Example 3 (image)


4. Submission Details

Submitted Files

The final submission for this competition includes the following key files:

  • data_centric.py: The main script containing the logic for data loading, VQ-VAE training, latent space augmentation, and classifier training.
  • vqvae_model.py: The Python module defining the architecture of the Vector Quantized Variational Autoencoder (VQ-VAE).
  • wv_quality_mcunet-320kb-1mb_vww.tflite: The final, quantized classifier model ready for deployment.
  • README.md: This documentation file describing the project architecture, augmentation strategy, and implementation details.

Training Environment

Initially, the training scripts and the VQ-VAE model were developed and tested locally on a laptop using a small fraction of the training data. For full-scale training with the complete dataset, the environment was moved to a Google Cloud Platform (GCP) virtual machine instance equipped with an NVIDIA L4 GPU, using TensorFlow version 2.17.0.


5. Next Steps

Due to time constraints, several promising avenues for future work were identified but not fully explored. These represent potential next steps to further improve the data augmentation strategy:

  • Contrastive Loss: Implement a contrastive loss function during the VQ-VAE training phase. This would explicitly push the latent space representations of "human" and "non-human" images further apart, potentially leading to more distinct and effective augmentations (a minimal sketch appears after this list).

  • Tuning Channel Scaling: The aggressive channel scaling technique, while effective, sometimes introduces visual artifacts into the augmented images. Further tuning of the scaling parameters could help minimize these artifacts while retaining the desired level of variation.

  • Region-Specific Augmentation: Use a separate object detection model (e.g., YOLO, R-CNN) to first identify the precise bounding box of the human in an image. The latent space augmentation could then be applied only to the corresponding region in the latent space, leaving the background untouched.

  • Codebook Frequency Analysis: The project included an initial analysis of the VQ-VAE's codebook usage, counting the frequency of codes for human vs. non-human images. This frequency data could be used in more sophisticated ways, such as guiding the generation of entirely new images from scratch by sampling codes based on their statistical likelihood of appearing in a "human" image.

  • Analyze Weight Initialization Strategy: The current approach initializes the classifier with weights from the pre-trained VQ-VAE encoder and freezes those layers. A systematic experiment is needed to quantify the true benefit of this strategy. This would involve training the classifier from scratch (random initialization) and comparing its performance against the VQ-VAE-initialized version to determine if the weight transfer provides a significant and reliable advantage.

  • Deployment and Optimization on Target Hardware: A next step is to deploy the final model to a Nordic Semiconductor NRF54H20 board. This would involve profiling using hardware tracing to measure and optimize inference speed and RAM usage, ensuring the model performs efficiently on the target edge device.
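
As a starting point for the contrastive-loss idea in the first bullet above, here is a minimal sketch of a margin-based term that could be added to the VQ-VAE training loss; the mean pooling, margin value, and weighting are all assumptions.

```python
import tensorflow as tf

def contrastive_latent_loss(z, labels, margin=1.0):
    """Margin-based contrastive term over pooled latent vectors.

    z:      latent tensor of shape (batch, h, w, d), pooled to one vector per image
    labels: binary labels of shape (batch,), 1 = human, 0 = non-human
    Pulls same-class latents together and pushes different-class latents
    at least `margin` apart.
    """
    pooled = tf.reduce_mean(z, axis=[1, 2])                             # (batch, d)
    # Pairwise Euclidean distances between all latents in the batch.
    diffs = tf.expand_dims(pooled, 1) - tf.expand_dims(pooled, 0)
    dists = tf.sqrt(tf.reduce_sum(tf.square(diffs), axis=-1) + 1e-12)   # (batch, batch)
    labels = tf.cast(labels, tf.float32)
    same = tf.cast(tf.equal(tf.expand_dims(labels, 1),
                            tf.expand_dims(labels, 0)), tf.float32)
    # Same-class pairs: penalize distance; different-class pairs: penalize
    # being closer than the margin.
    loss = (same * tf.square(dists)
            + (1.0 - same) * tf.square(tf.maximum(margin - dists, 0.0)))
    return tf.reduce_mean(loss)
```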
