DeepNet-Trojan: A Deep Learning and NLP Pipeline for Hardware Trojan Detection

This repository contains a complete Python implementation of the hardware security methodology presented in the IEEE paper: "Deep Learning-based Hardware Trojan Detection with Block-based Netlist Information Extraction."

This project provides a full, end-to-end pipeline to train a deep learning model that can detect malicious circuits (Hardware Trojans - HTs) by analyzing their structure in gate-level netlist files.

Based on the research paper:

S. Yu, C. Gu, W. Liu and M. O'Neill, "Deep Learning-based Hardware Trojan Detection with Block-based Netlist Information Extraction," IEEE Transactions on Emerging Topics in Computing, 2021.
DOI: 10.1109/TETC.2021.3116484

🧠 Core Concept: Circuits as a Language

The fundamental idea of this project is to treat the structural layout of a hardware circuit as a human language. This allows us to apply powerful Natural Language Processing (NLP) and Deep Learning techniques.

The pipeline works as follows:

1. File (.v) -> Graph: A gate-level netlist (Verilog file) is parsed into a detailed pin-to-pin directed graph using networkx.
2. Graph -> Sentences: A Breadth-First Search (BFS) is run from every gate (the "center component") to extract all signal paths that pass through it. Each path becomes a "sentence" (a PCP Trace).
3. Connections -> Words: Each connection within a path (e.g., InputPin -> Cell -> OutputPin) is treated as a "word" (a PCP Word).
4. NLP Model (Net2Vec): A Word2Vec (Skip-gram) model is trained on millions of these "sentences" to build an embedding dictionary (net2vec.vectors). This dictionary maps each "word" (circuit connection) to a 100-dimensional vector that represents its contextual meaning.
5. DL Model (LSTM): An LSTM (Long Short-Term Memory) network is trained on these vectorized sentences. It learns to distinguish the "grammar" and "structure" of sentences from normal circuits versus those found in Hardware Trojans.
6. Voter (Final Detection): To detect a Trojan in a new file, the model classifies every trace, and a Voter module aggregates the results per gate. A gate that is the center of many "suspicious" traces is flagged as part of a Trojan.
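
To make the graph-to-sentences and connections-to-words steps above more concrete, here is a minimal, dependency-free sketch. It is not the repository's code: the toy `NETLIST`, the helper names, the `depth` limit, and the `inpin_CELL_outpin` word format are all assumptions made for illustration, and it uses a plain adjacency dict with bounded recursion instead of networkx and BFS.

```python
from collections import defaultdict

# Toy netlist (hypothetical): gate -> (cell type, {input pin: net}, (output pin, net)).
NETLIST = {
    "U1": ("NAND2", {"A": "n1", "B": "n2"}, ("Y", "n3")),
    "U2": ("INV",   {"A": "n3"},            ("Y", "n4")),
    "U3": ("NOR2",  {"A": "n4", "B": "n5"}, ("Y", "n6")),
}

def build_edges(netlist):
    """Return (succ, pred): gate -> [(neighbour gate, input pin on the sink gate)]."""
    driver = {out[1]: gate for gate, (_cell, _ins, out) in netlist.items()}
    succ, pred = defaultdict(list), defaultdict(list)
    for gate, (_cell, inputs, _out) in netlist.items():
        for pin, net in inputs.items():
            if net in driver:  # nets without a driver are primary inputs
                succ[driver[net]].append((gate, pin))
                pred[gate].append((driver[net], pin))
    return succ, pred

def backward(gate, pred, depth):
    """Paths of (gate, entry pin) steps that end at `gate`, at most `depth` hops long."""
    if depth == 0 or not pred.get(gate):
        return [[(gate, None)]]
    return [path + [(gate, pin)]
            for src, pin in pred[gate]
            for path in backward(src, pred, depth - 1)]

def forward(gate, succ, depth):
    """Continuations that start after `gate`, at most `depth` hops long."""
    if depth == 0 or not succ.get(gate):
        return [[]]
    return [[(dst, pin)] + rest
            for dst, pin in succ[gate]
            for rest in forward(dst, succ, depth - 1)]

def to_sentence(path, netlist):
    """Turn each step into one word: entry pin, cell type, output pin."""
    words = []
    for gate, in_pin in path:
        cell, _inputs, (out_pin, _net) = netlist[gate]
        words.append(f"{in_pin or '-'}_{cell}_{out_pin}")
    return words

def extract_traces(center, netlist, depth=2):
    """All bounded paths through `center`, each rendered as a list of words."""
    succ, pred = build_edges(netlist)
    return [to_sentence(b + f, netlist)
            for b in backward(center, pred, depth)
            for f in forward(center, succ, depth)]

print(extract_traces("U2", NETLIST))
# [['-_NAND2_Y', 'A_INV_Y', 'A_NOR2_Y']]
```

The depth limit keeps the enumeration finite even when the circuit contains feedback loops, which is common in sequential designs.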

✨ Features

Full 4-Phase Pipeline: Implements all stages described in the paper:
1. Netlist Parsing & Feature Extraction (Algorithm 1 & 2)
2. Net2Vec (NLP) Word Embedding
3. LSTM Detector Training
4. Component-Level Voter for Final Evaluation
Dynamic & Robust Parser: The parser (detector.py) is designed to handle multiple Verilog netlist formats, including:
- Explicit port mapping (e.g., .Q(wireA)) used in the TRIT-TC/TS training benchmarks.
- Implicit (positional) port mapping (e.g., (wireA, wireB)) used in standard ISCAS benchmarks.
Cell Normalization: Includes a normalization map (NORM_MAP) to translate different cell library names (e.g., nnd2s1, nand2_1, nand) into a single generic token (e.g., NAND2). This makes the model more robust and "library-agnostic".
Optimized for Large Datasets: The data loaders (dataset.py) are designed to handle massive datasets (7M+ samples) on low-RAM (16GB) machines by loading the data into memory once and using num_workers=0, so that no DataLoader worker processes are spawned that could each hold their own copy.
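
As an illustration of the two port-mapping styles and the cell normalization described above, the sketch below parses a single-line gate instantiation and maps its cell name to a generic token. It is only a sketch under assumptions: the regular expressions, the `parse_instance` helper, the `KEYWORDS` set, and the three-entry `NORM_MAP` are hypothetical, and the project's actual parser and table are likely more complete.

```python
import re

# Hypothetical normalization table; the project's real NORM_MAP may differ.
NORM_MAP = {"nnd2s1": "NAND2", "nand2_1": "NAND2", "nand": "NAND2"}

# Verilog keywords that look like "name name ( ... );" but are not gate instances.
KEYWORDS = {"module", "function", "task"}

# "<cell> <instance> ( <body> );" on a single line.
INSTANCE_RE = re.compile(r"^\s*(\w+)\s+(\w+)\s*\((.*)\)\s*;\s*$")
# Explicit port mapping: ".PIN(net)".
NAMED_PORT_RE = re.compile(r"\.(\w+)\s*\(\s*([^()\s]*)\s*\)")

def parse_instance(line):
    """Return (generic cell, instance name, connections) or None if not an instance."""
    match = INSTANCE_RE.match(line)
    if match is None:
        return None
    cell, instance, body = match.groups()
    if cell.lower() in KEYWORDS:
        return None
    # Unknown cells fall back to their upper-cased library name.
    generic = NORM_MAP.get(cell.lower(), cell.upper())
    named = NAMED_PORT_RE.findall(body)
    if named:
        connections = dict(named)                # explicit: pin name -> net
    else:
        nets = [n.strip() for n in body.split(",") if n.strip()]
        connections = dict(enumerate(nets))      # positional: index -> net
    return generic, instance, connections

print(parse_instance("nnd2s1 U5 (.DIN1(n1), .DIN2(n2), .Q(n3));"))
print(parse_instance("nand U7 (n3, n1, n2);"))
```

Both lines resolve to the same `NAND2` token, so the downstream vocabulary does not depend on which cell library the netlist was synthesized with. Positional connections keep only the argument index, because the pin names are not present in the source line.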

📂 Project Structure

.
├── 📁 Dataset/                  # Contains training/testing data
├── 📁 English/                 # English implementation files and documentation
│   ├── 🐍 cleaner.py           # Data cleaning utilities
│   ├── 🐍 create_balanced_dataset.py  # Dataset balancing tools
│   ├── 🐍 dataset.py           # PyTorch dataset implementation
│   ├── 🐍 detector.py          # Main detection tool for scanning netlists
│   ├── 🐍 evaluate.py          # Model evaluation tools
│   ├── 🐍 model.py             # LSTM model definition
│   ├── 🐍 netlist_parser.py    # Netlist parsing utilities
│   ├── 🐍 phase1_graph_utils.py # Graph conversion and trace extraction
│   ├── 🐍 preprocess_nlp.py    # NLP preprocessing tools
│   ├── 🐍 process_originals.py # Processing original (non-trojan) designs
│   ├── 🐍 run_batch_extraction.py # Batch processing for feature extraction
│   ├── 🐍 train_detector_updated.py # LSTM model training
│   ├── 🐍 train_nlp.py         # Net2Vec model training
│   └── 📄 README.md            # This file
├── 📁 jsonl_dataset/           # JSONL formatted training data
│   ├── corpus_ALL.jsonl        # Complete corpus for NLP model
│   └── labeled_traces_BALANCED.jsonl  # Balanced training dataset
├── 📁 Model/                   # Trained models
│   ├── net2vec.vectors         # Trained word embeddings
│   └── trojan_detector_final.pth  # Trained LSTM detector
├── 📁 Persian/                 # Persian implementation files (duplicate of English)
│   ├── 🐍 cleaner.py
│   ├── 🐍 create_balanced_dataset.py
│   ├── 🐍 dataset.py
│   ├── 🐍 detector.py
│   ├── 🐍 evaluate.py
│   ├── 🐍 model.py
│   ├── 🐍 netlist_parser.py
│   ├── 🐍 phase1_graph_utils.py
│   ├── 🐍 preprocess_nlp.py
│   ├── 🐍 process_originals.py
│   ├── 🐍 run_batch_extraction.py
│   ├── 🐍 train_detector_updated.py
│   └── 🐍 train_nlp.py
└── 📄 .gitignore               # Git ignore patterns

🚀 Quickstart & Usage

1. Installation

Clone this repository:

git clone <your-repo-url>
cd VHDL_AI_Project

Create and activate a Python virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install all required libraries:

pip install torch torchvision torchaudio
pip install networkx gensim tqdm numpy

2. Option A: Re-Train The Entire Model (Recommended)

This is the full pipeline to replicate the paper's results from scratch.

# === Phase 1: Feature Extraction ===
# (This runs Alg 1 & 2 on all training files and creates .gpickle and .json files)
# (This will take a long time)
echo "--- Running Phase 1 ---"
python English/run_batch_extraction.py
python English/process_originals.py

# === Phase 2: NLP & Data Prep ===
# (This gathers all .json files and trains the language model)
echo "--- Running Phase 2 ---"
python English/preprocess_nlp.py
python English/train_nlp.py
python English/create_balanced_dataset.py

# === Phase 3: Train the Detector ===
# (This uses your GPU to train the LSTM)
echo "--- Running Phase 3 ---"
python English/train_detector_updated.py

# === Phase 4: Evaluate the Model ===
# (This runs the Voter and gives you the final accuracy metrics)
echo "--- Running Phase 4 ---"
python English/evaluate.py
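
Phase 2 above ends with create_balanced_dataset.py, which produces labeled_traces_BALANCED.jsonl. How that script balances the classes is not documented here; one common approach is random undersampling of the majority class, sketched below. The `balance` helper and the `trace` and `label` field names are assumptions made for illustration only.

```python
import json
import random

def balance(records, label_key="label", seed=0):
    """Randomly undersample every class down to the size of the smallest one."""
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical JSONL content: one labeled trace per line.
lines = [json.dumps({"trace": ["A_INV_Y"], "label": 0}) for _ in range(10)]
lines += [json.dumps({"trace": ["A_NOR2_Y"], "label": 1}) for _ in range(2)]

records = [json.loads(line) for line in lines]
balanced = balance(records)
print(len(balanced), sum(rec["label"] for rec in balanced))
# 4 2
```

Undersampling discards data from the larger class; it is shown here only because it is simple to reason about, not because the repository is known to use it.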

3. Option B: Use the Pre-Trained Model (Detection Tool)

Once you have the net2vec.vectors and trojan_detector_final.pth files, you can use detector.py as a standalone tool to scan new, unseen netlist files.

Usage:

python English/detector.py <path_to_your_netlist.v>

Example (scanning a known-good file):

python English/detector.py Test/s713.v

--- 🔬 Phase 1: Processing s713.v ---
  (1/3) 📄 Parsing Netlist (Dynamic Mode)...
  (2/3) 🧱 Generating Blocks (Alg 1): 100%|...| 412/412
  (3/3) 💬 Extracting Traces (Alg 2): 100%|...| 412/412
✅ Phase 1 completed. 1,095 traces extracted from 412 gates.

--- 🧠 Phase 3: Loading models ---
  (Using device: cuda)
  ✅ Net2Vec dictionary (vectors) loaded.
  ✅ Trojan Detector model (trojan_detector_final.pth) loaded.

--- 🤖 Phase 4: Running Inference and Voting ---
  (1/2) 🧠 Inference running: 100%|...| 9/9
  (2/2) 🗳️ Voting process: 100%|...| 412/412

==================================================
🏁 Scan complete
==================================================
  ✅ Result: No hardware trojans found in this file.

⏱️ Total scan time: 12.52 seconds

Example (scanning a known-bad file):

python English/detector.py Test/s38417_T0099_C.v

--- 🔬 Phase 1: Processing s38417_T0099_C.v ---
  ... (parsing and extraction) ...
✅ Phase 1 completed. 10,000+ traces extracted.
...
--- 🤖 Phase 4: Running Inference and Voting ---
  ... (inference and voting) ...

==================================================
🏁 Scan complete
==================================================
  🚨 Warning: 4 trojan-suspected gates found!
--------------------------------------------------
  List of suspicious gates:
    1. troj49_0_U1
    2. troj49_0_U2
    3. troj49_0_U3
    4. troj49_0_U4

⏱️ Total scan time: 14.20 seconds
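
The voting step that turns per-trace predictions into a list of suspicious gates like the one above can be sketched as follows. This is a minimal illustration, not the repository's Voter: the `vote` function, its input format, and the majority `threshold` of 0.5 are assumptions.

```python
from collections import Counter

def vote(trace_predictions, threshold=0.5):
    """Flag gates whose share of suspicious traces exceeds `threshold`.

    trace_predictions: iterable of (center gate, is_suspicious) pairs,
    one pair per trace classified by the detector.
    """
    total, flagged = Counter(), Counter()
    for gate, suspicious in trace_predictions:
        total[gate] += 1
        if suspicious:
            flagged[gate] += 1
    return sorted(g for g in total if flagged[g] / total[g] > threshold)

predictions = [
    ("troj49_0_U1", True), ("troj49_0_U1", True), ("troj49_0_U1", False),
    ("U12", False), ("U12", True), ("U12", False),
]
print(vote(predictions))
# ['troj49_0_U1']
```

Aggregating per gate means a single misclassified trace does not flag a gate on its own; only gates that are the center of mostly suspicious traces are reported.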

📚 Citations

Primary Paper: S. Yu, et al. "Deep Learning-based Hardware Trojan Detection with Block-based Netlist Information Extraction." IEEE TETC, 2021.
Benchmark Data: J. Cruz, et al. "An Automated Configurable Trojan Insertion Framework for Dynamic Trust Benchmarks." DATE, 2018. (Provided by Trust-Hub)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
