NeuraSearch

NeuraSearch is a fully self-built, open-source search engine backend written in Python. It crawls the web, indexes pages through a multi-stage NLP pipeline, ranks results using a multi-signal PageRank system, and exposes everything through a clean REST API.

The project mimics the core architecture of how a real search engine works, from the priority-queue frontier that decides which URL to fetch next, through lemmatization-then-stemming text normalization, all the way to Panda/Penguin quality scoring on top of iterative PageRank. Every component is readable, documented, and meant to be understood. The data stored is also designed to be rich enough for AI/ML training use cases.

Why We Built This

Most people treat search engines as a black box. After reading through various papers on information retrieval and spending time understanding how Google's early architecture worked, it became clear that the only real way to understand it was to build one from scratch.

NeuraSearch is the result of that effort. It implements every layer of the pipeline correctly: a polite resumable web crawler; a POS-aware lemmatization + stemming NLP pipeline; a BM25 inverted index with full term provenance; Named Entity Recognition; a PageRank algorithm with proper dangling-node handling; and Panda/Penguin content and link quality scorers.

Every stored data point is designed to be useful, not just for search, but also as a structured dataset for AI/ML training.

Features

Crawler

Multi-threaded crawling with a configurable thread pool (default 10 workers)
Priority-based URL frontier using a max-heap scored by depth, TLD weight, and in-link count
Full robots.txt compliance including Crawl-Delay directives
Per-domain rate limiting with random jitter
Automatic retry with exponential backoff on failed requests
Content deduplication via MD5 hash
Respects noindex and nofollow meta directives
Extracts h1–h6 headings, meta keywords, and canonical URL from each page
NLP-processed anchor text (lemmatize + stem) for higher-quality anchor index
Graceful shutdown on CTRL-C with automatic checkpoint save
Crash recovery: interrupted URLs automatically resume on next run

Text Processing

POS tagging (NLTK averaged_perceptron_tagger) before lemmatization
WordNet lemmatization (POS-aware) before Porter stemming: "running"→"run"→"run", "geese"→"goose"→"goos"
Full provenance per token: { stem: { lemma, originals, positions, pos_tag } }
Named Entity Recognition via NLTK ne_chunk: extracts PERSON, ORGANIZATION, GPE, LOCATION, etc.
Bigram extraction: common 2-gram phrases indexed as separate terms for phrase-match bonus
Snippet generation: extracts the most relevant context window around query hits
WordNet query expansion (optional, via ?expand=true)

Indexer

BM25 (Robertson-Walker) inverted index with per-(term, document) provenance
Every index entry stores: stemmed term, lemma, original word forms, word positions, field flags (title / description / heading / URL / anchor), dominant POS tag
Field-level boosts: title (4.0), headings (2.5), description (2.0), anchor (3.0), URL (1.5)
Bigrams indexed alongside unigrams; phrase query gets a +2.0 multiplier at query time
URL path tokenization: /wiki/Machine_learning → index terms machine, learn
Named entities stored in a dedicated named_entities table per page
Incremental: already-indexed pages are always skipped — safe to resume after any interruption
CTRL-C checkpointing: current page finishes cleanly and a JSON snapshot is exported

Ranking

Iterative PageRank with dangling-node handling and warm-start from previous run
Mid-run checkpointing: scores saved every 20 iterations so CTRL-C loses minimal work
Panda scorer: content quality from 5 signals (length, keyword density, readability, title quality, vocabulary diversity)
Penguin scorer: link quality from 3 signals (anchor diversity, domain diversity, link count)
Combined authority: PageRank × Panda × Penguin multiplied into BM25 at query time
Phrase/bigram match bonus in final scoring

API

FastAPI REST API with Swagger/ReDoc documentation
Paginated search with per-query timing
Query expansion (?expand=true) using WordNet synonyms
Snippet generation in every search result
Entity lookup endpoint: find pages by named entity
Autocomplete using lemma forms (human-readable) rather than raw stems
Background task management for all long-running operations
Full export endpoint (POST /export) to write all JSON data snapshots

System Architecture

                    +---------------------------+
                    |        Seed URLs          |
                    +------------+--------------+
                                 |
                                 v
+------------------------------------+------------------------------------+
|                              CRAWLER                                   |
|                                                                        |
|  +---------------------+         +------------------------------+     |
|  | Frontier (max-heap) | pop()   |   Thread Pool (10 workers)   |     |
|  | scored by:          +-------> |      _crawl_worker()         |     |
|  |  - depth            |         +---+----------+----------+----+     |
|  |  - TLD weight       |             |          |          |          |
|  |  - in-link count    |             v          v          v          |
|  +---------------------+        robots.txt  HTTP fetch  rate limit    |
|                                       |      + retry    per domain    |
|                                       v                               |
|                              +-------------------+                    |
|                              | HTML Parser       |                    |
|                              | - h1-h6 headings  |                    |
|                              | - meta_keywords   |                    |
|                              | - canonical URL   |                    |
|                              | - link extractor  |                    |
|                              | - NLP anchor text |                    |
|                              +--------+----------+                    |
|                                       |                               |
+---------------------------------------+-------------------------------+
                                        |
                                        v
                        +---------------+---------------+
                        |         SQLite Database       |
                        | pages / links / crawl_queue   |
                        | inverted_index / anchors      |
                        | named_entities / domain_stats |
                        +---+------------------+--------+
                            |                  |
               +------------+                  +-------------+
               |                                             |
               v                                             v
   +-----------+----------+                   +--------------+----------+
   |        INDEXER       |                   |    PAGERANK PIPELINE    |
   |                      |                   |                         |
   | NLP Pipeline:        |                   | 1. Panda (content)      |
   |  - POS tagging       |                   | 2. Penguin (links)      |
   |  - Lemmatization     |                   | 3. PageRank (graph)     |
   |  - Porter stemming   |                   | 4. authority =          |
   |  - Provenance        |                   |    PR × Panda × Penguin |
   |  - NER extraction    |                   |                         |
   |  - Bigram indexing   |                   | Mid-run checkpointing:  |
   |  - URL tokenization  |                   | saves every 20 iters    |
   |                      |                   |                         |
   | BM25 inverted index  |                   | CTRL-C: saves & exits   |
   | CTRL-C: saves & exits|                   |                         |
   +----------+-----------+                   +--------------+----------+
              |                                              |
              +--------------------+-------------------------+
                                   |
                                   v
                    +--------------+--------------+
                    |        FastAPI REST API      |
                    |                             |
                    | GET  /search?q=...&expand=  |
                    | GET  /suggest               |
                    | GET  /entity                |
                    | GET  /stats                 |
                    | POST /crawl                 |
                    | POST /index                 |
                    | POST /pagerank              |
                    | POST /export                |
                    +--------------+--------------+
                                   |
                                   v
                            Frontend Client

Crawler

The crawler is multi-threaded, polite (robots.txt + rate limits), resumable (all state in SQLite), and extracts rich metadata from each page.

What is extracted per page:

Field	Source
`title`	`<title>` tag
`description`	`<meta name="description">` or first `<p>`
`headings`	All `h1`–`h6` tags → JSON `{"h1":[...], "h2":[...]}`
`meta_keywords`	`<meta name="keywords">`
`canonical_url`	`<link rel="canonical">`
`content`	Full cleaned body text (boilerplate removed)
`word_count`	Word count of cleaned body
`content_hash`	MD5 for deduplication

Anchor text processing uses the full NLP pipeline (lemmatize + stem) so anchor terms in the index match the same stems used for body content.

Frontier (Priority Queue)

URL priority score = (1 / (1 + depth)) × TLD_boost × length_factor × clean_url_bonus

TLD boosts:  .edu → 1.4  |  .gov → 1.3  |  .org → 1.1  |  .com → 1.0
Heap: push O(log n) | pop O(log n) | dedup O(1) via MD5 seen-set
DB-backed: every URL written to crawl_queue so crashes are resumable

Robots.txt and Politeness

Each domain: fetch /robots.txt once → cache in memory → check NeuraSearchBot permission → enforce Crawl-Delay with random jitter (0–0.3 s).

Text Processing Pipeline

The NLP pipeline in indexer/text_processor.py is applied to both documents at index time and queries at search time, ensuring consistent normalization.

Raw text input
     |
     v
+---------------------------------------+
| Tokenize                              |
| regex: [a-zA-Z]+(?:['-][a-zA-Z]+)*   |
| lowercase, preserves hyphenation      |
+---------------------------------------+
     |
     v
+---------------------------------------+
| Filter                                |
| - remove stop words (NLTK + built-in)|
| - discard tokens len < 2 or > 50     |
| - discard non-alpha tokens           |
+---------------------------------------+
     |
     v
+---------------------------------------+
| POS Tagging (NLTK)                    |
| averaged_perceptron_tagger            |
| tags: NN, VB, JJ, RB ...             |
+---------------------------------------+
     |
     v
+---------------------------------------+
| WordNet Lemmatization (POS-aware)     |
| "running"  → "run"   (VB)            |
| "better"   → "good"  (JJ)            |
| "geese"    → "goose" (NN)            |
| "quickly"  → "quickly" (RB)          |
| cached: lru_cache(500,000)           |
+---------------------------------------+
     |
     v
+---------------------------------------+
| Porter Stemming                       |
| "running"  → "run"                   |
| "language" → "languag"               |
| applied to the lemma, not raw token  |
| cached: lru_cache(500,000)           |
+---------------------------------------+
     |
     v
Provenance dict per stemmed term:
{
  "languag": {
    "lemma":     "language",
    "originals": ["language", "languages"],
    "positions": [3, 17, 42],
    "pos":       "NN"
  },
  ...
}

Named Entity Recognition — runs separately over the first 6,000 characters of title + description + headings + content:

NLTK ne_chunk pipeline:
  word_tokenize → pos_tag → ne_chunk

Entity types returned:
  PERSON, ORGANIZATION, GPE (geo-political), LOCATION, FACILITY

Bigram extraction — sliding window over stemmed body tokens, stored in the same inverted index as terms like "machin learn". At query time, consecutive query stems are also tried as a bigram and given a +2.0 phrase-match multiplier.

Indexer and BM25 Scoring

Each (term, document) pair stored in inverted_index now carries:

Column	Description
`term`	Porter-stemmed form (search key)
`lemma`	WordNet-lemmatized intermediate
`original_forms`	JSON list of distinct raw word forms seen
`positions`	JSON list of word positions in cleaned stream
`frequency`	Raw count + synthetic field-boost counts
`bm25`	Robertson-Walker BM25 score
`in_title`	Term found in `<title>`
`in_description`	Term found in meta description
`in_url`	Term found in URL path segments
`in_anchor`	Term found in inbound anchor texts
`pos_tag`	Dominant NLTK POS tag for this term in this doc

BM25 formula:

IDF = log( (N - df + 0.5) / (df + 0.5) + 1 )

TF_norm = tf × (k1 + 1) / (tf + k1 × (1 - b + b × dl / avgdl))

BM25(t, D) = IDF × TF_norm      k1=1.5  b=0.75

Query-time scoring:

final_score(Q, D) =

  ( Σ_t [  BM25(t, D) × field_multiplier(t, D)
           + anchor_freq(t, D) × ANCHOR_BOOST     ]
    + bigram_bonus if consecutive query terms match a bigram
  )
  × coverage_bonus          (matched_terms / total_query_terms)
  × log-normalized PageRank factor
  × panda_score(D)
  × penguin_score(D)

Field multipliers:
  TITLE_BOOST       = 4.0
  DESCRIPTION_BOOST = 2.0
  HEADING_BOOST     = 2.5
  ANCHOR_BOOST      = 3.0
  URL_BOOST         = 1.5
  Bigram phrase match = +2.0

PageRank Pipeline

Step 1: Panda Scorer (content quality)
  5 signals → panda_score ∈ (0, 1]
  - Content length (logistic curve, saturates ~800 words)
  - Keyword density (ideal 2–8%, penalizes stuffing)
  - Flesch-Kincaid readability (bell curve around 60)
  - Title quality (word overlap with body content)
  - Vocabulary diversity (type-token ratio)

Step 2: Penguin Scorer (link quality)
  3 signals → penguin_score ∈ (0, 1]
  - Anchor text diversity (unique/total, log scale)
  - Linking domain diversity (log scale, caps at 50+ domains)
  - In-link count (log scale, saturates at 200)

Step 3: PageRank (iterative, Brin & Page 1998)
  Initialize: PR(p) = 1/N
  Each iteration:
    dangling_sum = d × Σ(PR(dangling)) / N
    PR(p) = (1-d)/N + dangling_sum + d × Σ(PR(q)/out_deg(q))
  Convergence: L1-delta < 1e-8  or  200 iterations
  Warm-start from previous DB scores (fewer iterations needed)
  Checkpoint: saves to DB every 20 iterations

Step 4: authority(P) = PageRank(P) × Panda(P) × Penguin(P)
  Written to pages.final_score

Step 5: BM25 refresh so updated authority affects search immediately

Database Schema

All data is stored in data/neurasearch.db (SQLite, WAL mode, per-thread connections).

pages
  id, url, title, description, content, content_hash
  word_count, language, crawled_at, last_modified
  status_code, depth, in_link_count, out_link_count
  pagerank_score, panda_score, penguin_score, final_score
  is_indexed
  headings        -- JSON {"h1":[...], "h2":[...], ...}
  entities_json   -- JSON [{"text":"...","label":"..."}]
  meta_keywords   -- raw meta keywords string
  canonical_url   -- canonical URL if different from url

links
  id, src_url, dst_url, anchor_text, rel

crawl_queue
  id, url, priority, depth, added_at
  status: pending | processing | done | failed

inverted_index
  id, term, doc_id
  frequency, positions        -- count + JSON positions
  tf_idf, bm25                -- relevance scores
  in_title, in_description    -- field flags (0/1)
  in_url, in_anchor           -- field flags (0/1)
  lemma                       -- lemmatized form
  original_forms              -- JSON list of raw forms
  pos_tag                     -- dominant POS tag

named_entities
  id, doc_id, entity, entity_type, frequency

anchor_index
  id, term, dst_url, frequency

domain_stats
  domain, page_count, avg_pagerank, last_crawled
  crawl_delay, robots_txt, is_blocked

Indexes:
  idx_inv_term, idx_inv_doc, idx_inv_lemma
  idx_links_src, idx_links_dst
  idx_pages_score, idx_queue_pri, idx_anchor_term
  idx_ne_entity, idx_ne_doc, idx_ne_type

Data Exports (JSON)

Run python main.py export (or POST /export) to write all data to human-readable JSON files inside data/.

data/
├── crawled/
│   ├── pages.json          all crawled pages + headings + entities + scores
│   └── links.json          full link graph {src: [dst, ...]}
├── pagerank/
│   └── scores.json         per-page authority scores, sorted best-first
├── index/
│   ├── vocabulary.json     each term with FULL PROVENANCE per source page:
│   │                         term, lemma, doc_freq,
│   │                         sources: [{url, title, field_flags,
│   │                                    frequency, positions, bm25,
│   │                                    original_forms, pos_tag,
│   │                                    page_scores}]
│   └── indexed_pages.json  all indexed pages with headings
└── entities/
    ├── named_entities.json all extracted entities + source pages
    └── entity_types.json   entity type distribution (PERSON, ORG, GPE ...)

The vocabulary.json file is the richest data store — every word in the index comes with a full history: where it was found, in which field, at which positions, with what BM25 score, what its original capitalized form was, and what part of speech it was tagged as. This makes it directly usable for AI/ML training tasks.

REST API

The API runs on http://localhost:8000 by default. All long-running operations are dispatched to background threads.

GET  /
     Health check.

GET  /stats
     { pages_indexed, links_found, unique_terms, unique_bigrams,
       named_entities, domains }

GET  /tasks
     Status of background tasks: "running" | "done" | "error"

GET  /search?q=...&page=1&limit=10&expand=false
     - q       : search query (required)
     - page    : result page (default 1)
     - limit   : results per page, max 50 (default 10)
     - expand  : if true, expand query with WordNet synonyms
     Returns: {
       query, processed_terms, expanded_terms,
       total, page, limit, took_ms,
       results: [{
         url, title, description,
         snippet,          ← context window around query term hits
         score, pagerank, panda, penguin,
         matched_terms     ← which query stems matched this result
       }]
     }

GET  /suggest?q=...&limit=8
     Autocomplete using lemma forms (real words, not stems).

GET  /entity?name=...&limit=10
     Find pages by named entity name.
     Returns: { entity, page_count, pages: [{url, title, pagerank}] }

POST /crawl    { seeds, limit, max_depth, max_workers }
POST /index
POST /pagerank
POST /export   → write all JSON data files in background

Interactive docs: http://localhost:8000/docs (Swagger) | http://localhost:8000/redoc

Getting Started

Requirements

Python 3.10+
pip

Installation

git clone https://github.com/neura-spheres/NeuraSearch.git
cd NeuraSearch/backend

python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate

pip install -r requirements.txt

# Download NLTK data (run once — includes POS tagger and NER models)
python main.py setup

Quick Start

# Step 1: Crawl pages from a seed URL
python main.py crawl https://en.wikipedia.org/wiki/Python --limit 500 --depth 3

# Step 2: Index crawled pages (builds NLP pipeline, BM25, NER, bigrams)
python main.py index

# Step 3: Run PageRank + Panda + Penguin scoring
python main.py pagerank

# Step 4: Export all data to JSON
python main.py export

# Step 5: Start the API server
python main.py serve

Or in one command:

python main.py all --seeds https://en.wikipedia.org/wiki/Python --limit 500

Then search:

curl "http://localhost:8000/search?q=python+programming&expand=true"

CLI Usage

python main.py <command> [options]

Commands:
  setup
    Download NLTK data (punkt, wordnet, POS tagger, NER models).
    Run this once before first use.

  serve [--port INT] [--reload]
    Start the FastAPI server (default port 8000).

  crawl [seeds...] [--limit INT] [--workers INT] [--depth INT]
    Crawl from one or more seed URLs.
    Automatically resumes any previously interrupted session.

  index
    Build the inverted index for all un-indexed pages.
    CTRL-C exits cleanly after the current page; next run resumes.

  pagerank
    Run Panda + Penguin + PageRank pipeline.
    CTRL-C saves scores accumulated so far; next run warm-starts.

  export
    Write all data to JSON files under data/.

  all [seeds...] [--limit] [--workers] [--depth] [--port] [--reload]
    Run crawl → index → pagerank → export → serve in sequence.

Examples:

# Crawl Wikipedia with 15 threads, depth 4
python main.py crawl https://en.wikipedia.org/wiki/Machine_learning \
               --limit 2000 --workers 15 --depth 4

# Resume the indexer (automatically skips already-indexed pages)
python main.py index

# Search with query expansion
curl "http://localhost:8000/search?q=neural+network&expand=true"

# Find pages mentioning a named entity
curl "http://localhost:8000/entity?name=Guido+van+Rossum"

Checkpoint and Resume

Every process is designed to survive interruption and resume from where it stopped.

Process	How it checkpoints
Crawler	Every URL is saved to `crawl_queue` with status `pending/processing/done`. On restart, any `processing` URL is reset to `pending`. JSON is saved every 500 pages. CTRL-C triggers a final save.
Indexer	Each page is marked `is_indexed=1` immediately after indexing. On restart, `get_pages_not_indexed()` skips all already-indexed pages. CTRL-C finishes the current page, exports a JSON checkpoint, and exits.
PageRank	Scores are written to the database every 20 iterations. CTRL-C saves scores accumulated so far. On next run, warm-start loads those saved scores and resumes near the interrupted point.

There is no manual intervention needed to resume. Simply re-run the same command and it picks up from where it left off.

Language Filter

You can restrict crawling to specific languages by setting CRAWL_LANGUAGES in config.py.

# English only
CRAWL_LANGUAGES = ["en"]

# English + Indonesian
CRAWL_LANGUAGES = ["en", "id"]

# No filter — crawl everything (default)
CRAWL_LANGUAGES = []

How detection works (in priority order):

<html lang="..."> attribute — most reliable; present on almost all modern sites
<meta http-equiv="Content-Language"> tag
langdetect library — install with pip install langdetect for content-based fallback detection

If the language cannot be detected and CRAWL_LANGUAGES is non-empty, the page content is skipped (not stored), but its outbound links are still followed — so if a French page links to an English page, the English page is still discovered and crawled.

Supported language codes (ISO 639-1):

Code	Language	Code	Language	Code	Language
`af`	Afrikaans	`id`	Indonesian	`sk`	Slovak
`ar`	Arabic	`it`	Italian	`sl`	Slovenian
`bg`	Bulgarian	`ja`	Japanese	`sq`	Albanian
`bn`	Bengali	`ka`	Georgian	`sr`	Serbian
`ca`	Catalan	`ko`	Korean	`sv`	Swedish
`cs`	Czech	`lt`	Lithuanian	`sw`	Swahili
`cy`	Welsh	`lv`	Latvian	`ta`	Tamil
`da`	Danish	`mk`	Macedonian	`te`	Telugu
`de`	German	`ml`	Malayalam	`th`	Thai
`el`	Greek	`mr`	Marathi	`tl`	Filipino
`en`	English	`ms`	Malay	`tr`	Turkish
`es`	Spanish	`mt`	Maltese	`uk`	Ukrainian
`et`	Estonian	`nl`	Dutch	`ur`	Urdu
`fa`	Persian	`no`	Norwegian	`vi`	Vietnamese
`fi`	Finnish	`pl`	Polish	`zh`	Chinese
`fr`	French	`pt`	Portuguese	`zu`	Zulu
`gu`	Gujarati	`ro`	Romanian
`hi`	Hindi	`ru`	Russian
`hr`	Croatian	`hu`	Hungarian

Configuration

All tuneable values live in config.py.

# ── Crawler ───────────────────────────────────────────────────────────────────
MAX_WORKERS      = 10          # concurrent crawler threads
CRAWL_LIMIT      = 50_000      # max pages per crawl session
REQUEST_TIMEOUT  = 10          # HTTP request timeout (seconds)
MIN_DELAY        = 0.5         # minimum per-domain politeness delay (seconds)
MAX_DELAY        = 3.0         # maximum per-domain delay (seconds)
MAX_RETRIES      = 3           # HTTP retry attempts before giving up
MAX_DEPTH        = 6           # max link depth from seed URLs
MAX_CONTENT_SIZE = 5_242_880   # max page content: 5 MB

# ── BM25 / Indexer ────────────────────────────────────────────────────────────
BM25_K1           = 1.5        # term frequency saturation
BM25_B            = 0.75       # document length normalization
TITLE_BOOST       = 4.0
DESCRIPTION_BOOST = 2.0
HEADING_BOOST     = 2.5        # h1-h6 headings
ANCHOR_BOOST      = 3.0
URL_BOOST         = 1.5

# ── NLP ───────────────────────────────────────────────────────────────────────
BIGRAM_MIN_FREQ    = 2         # minimum frequency for a bigram to be indexed
MAX_BIGRAMS_PER_DOC = 200      # cap bigrams per document
MAX_ENTITIES_PER_DOC = 50      # cap named entities per document
NER_TEXT_LIMIT     = 6000      # characters fed to NER (capped for speed)
SNIPPET_WINDOW     = 40        # words of context around a query hit

# ── PageRank ──────────────────────────────────────────────────────────────────
DAMPING_FACTOR          = 0.85
MAX_PAGERANK_ITERATIONS = 200
PAGERANK_TOLERANCE      = 1.0e-8

# ── FastAPI ───────────────────────────────────────────────────────────────────
API_HOST                = "0.0.0.0"
API_PORT                = 8000
DEFAULT_RESULTS_PER_PAGE = 10
MAX_RESULTS             = 100

Project Structure

backend/
│
├── main.py                  CLI entry point
├── config.py                All tunable constants
├── setup_nltk.py            NLTK data downloader (run once)
│
├── crawler/
│   ├── crawler.py           Crawler class, thread pool, heading extraction
│   ├── frontier.py          Priority queue (max-heap) + DB persistence
│   ├── url_utils.py         URL normalization, scoring, link extraction
│   └── robots_handler.py    robots.txt fetching, caching, checking
│
├── indexer/
│   ├── indexer.py           Indexing orchestrator (NER, bigrams, checkpoint)
│   ├── inverted_index.py    BM25 builder + query engine (provenance, bigrams)
│   └── text_processor.py    NLP: POS tag → lemmatize → stem, NER, bigrams,
│                             snippet generation, readability
│
├── pagerank/
│   ├── pagerank.py          Iterative PageRank + checkpoint callback + SIGINT
│   ├── panda_scorer.py      Content quality (5 signals)
│   └── penguin_scorer.py    Link quality (3 signals)
│
├── database/
│   └── db.py                SQLite layer (schema, migration, all queries)
│
├── api/
│   └── app.py               FastAPI routes: search, suggest, entity, export
│
├── utils/
│   └── json_exporter.py     Rich JSON exports (vocabulary with full provenance,
│                             named entities, pages, pagerank, link graph)
│
└── data/                    Auto-generated at runtime (gitignored)
    ├── neurasearch.db
    ├── crawled/             pages.json, links.json
    ├── index/               vocabulary.json (with provenance), indexed_pages.json
    ├── pagerank/            scores.json
    └── entities/            named_entities.json, entity_types.json

Roadmap

Known Limitations

Scale. SQLite works well up to a few hundred thousand pages. Millions of documents would require Elasticsearch or a custom shard-based index.
NER accuracy. NLTK's ne_chunk is a statistical chunker trained on newswire. Accuracy on general web text varies. SpaCy would be significantly more accurate.
PageRank recomputation. The full graph is recomputed every run. Incremental updates would help for large crawls.
English only. The NLP pipeline and stop word list are English. Non-English pages index poorly.
Memory. The frontier and PageRank graph are held in memory. Very large crawls (10M+ URLs) would need on-disk alternatives.

Contributing

Contributions of any kind are welcome — bug fixes, new features, documentation improvements, or ideas.

Please read CONTRIBUTING.md before submitting a pull request.

Fork the repository and create your branch from main.
Write clear, readable code that matches the existing style.
Describe what you changed and why in your PR.
If fixing a bug, include reproduction steps.

To report a bug or suggest a feature, open an issue on GitHub.

License

This project is licensed under the MIT License. See LICENSE for the full text.

Acknowledgements

The Anatomy of a Large-Scale Hypertextual Web Search Engine — Brin & Page (1998). The original PageRank paper.
Introduction to Information Retrieval — Manning, Raghavan, and Schütze. BM25 derivation in chapter 11.
Google Panda and Penguin — algorithm documentation and SEO research for the quality signal design.
The Python ecosystem: FastAPI, BeautifulSoup, NLTK, SQLite, and the standard library.

Built by students, for anyone who wants to understand how search really works.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NeuraSearch

Table of Contents

Why We Built This

Features

System Architecture

Crawler

Frontier (Priority Queue)

Robots.txt and Politeness

Text Processing Pipeline

Indexer and BM25 Scoring

PageRank Pipeline

Database Schema

Data Exports (JSON)

REST API

Getting Started

Requirements

Installation

Quick Start

CLI Usage

Checkpoint and Resume

Language Filter

Configuration

Project Structure

Roadmap

Known Limitations

Contributing

License

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
__pycache__		__pycache__
api		api
crawler		crawler
database		database
indexer		indexer
pagerank		pagerank
utils		utils
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Launch WebGraph.bat		Launch WebGraph.bat
README.md		README.md
config.py		config.py
gui.py		gui.py
main.py		main.py
requirements.txt		requirements.txt
setup_nltk.py		setup_nltk.py

Folders and files

Latest commit

History

Repository files navigation

NeuraSearch

Table of Contents

Why We Built This

Features

System Architecture

Crawler

Frontier (Priority Queue)

Robots.txt and Politeness

Text Processing Pipeline

Indexer and BM25 Scoring

PageRank Pipeline

Database Schema

Data Exports (JSON)

REST API

Getting Started

Requirements

Installation

Quick Start

CLI Usage

Checkpoint and Resume

Language Filter

Configuration

Project Structure

Roadmap

Known Limitations

Contributing

License

Acknowledgements

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages