Patent Scraper → Patent Intelligence Pipeline

Keywords: patent search · patent analytics · Google Patents · patent scraper · tech transfer · IP intelligence · intellectual property · patent data Python · assignee search · patent landscape

A Python-based pipeline for patent search, retrieval, and structuring across all major patent jurisdictions (US, WO, EP, JP, CN, and more). Built for tech transfer professionals, IP attorneys, researchers, and analysts who need structured, analysis-ready patent data without manual searching.

Enables downstream workflows including patent landscape analysis, competitive intelligence, prior art research, and technology trend detection — all from a single command-line tool.

Overview

Patent data is large-scale, semi-structured, and difficult to work with directly.
This project automates:

Retrieval of patent records across jurisdictions
Parsing and normalization of metadata (assignees, inventors, dates)
Deduplication and consolidation of records
Export into analysis-ready structured formats
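
The normalization and deduplication steps listed above could look roughly like the following. This is a minimal standard-library sketch, not the project's actual code: the field names (patent_number, title, assignee, publication_date), the accepted date formats, and both helper functions are assumptions made for illustration.

```python
from datetime import datetime


def normalize_record(raw):
    """Trim whitespace and convert the publication date to ISO format."""
    date_text = raw.get("publication_date", "").strip()
    try:
        # Assumed input formats: YYYYMMDD or YYYY-MM-DD.
        parsed = datetime.strptime(date_text.replace("-", ""), "%Y%m%d")
        iso_date = parsed.date().isoformat()
    except ValueError:
        iso_date = "N/A"  # keep the record, flag the missing date
    return {
        "patent_number": raw["patent_number"].strip().upper(),
        "title": " ".join(raw.get("title", "").split()),
        "assignee": " ".join(raw.get("assignee", "").split()),
        "publication_date": iso_date,
    }


def deduplicate(records):
    """Merge records sharing a patent number, keeping every distinct title."""
    merged = {}
    for rec in records:
        key = rec["patent_number"]
        if key not in merged:
            merged[key] = dict(rec, titles=[rec["title"]])
        elif rec["title"] not in merged[key]["titles"]:
            merged[key]["titles"].append(rec["title"])
    result = []
    for rec in merged.values():
        # Hypothetical convention: join alternate titles with " / ".
        rec["title"] = " / ".join(rec.pop("titles"))
        result.append(rec)
    return result
```

Keying on the patent number means a patent returned twice under different titles collapses into one row, which matches the single-record behavior described under Key Features.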

The output is designed for immediate use in Excel, data science workflows, or strategic analysis.

Key Features

Search any assignee across global patent systems (US, WO, EP, JP, CN, etc.)
Filter by publication date (from a specified start date through present, or a specific date range)
Dual dataset output:
- Granted patents only
- All activity (grants + applications)
Merges the granted-only and all-activity queries so that no granted patents are missed
Deduplicates patents with multiple titles (merged into a single record)
Built-in assignee review step lets users validate the matched assignee names before results are saved
Tracks API credit usage per run
Caches inventor and co-assignee data for efficiency across runs
Parallelized data retrieval (optional) for faster execution
Auto-installs missing Python dependencies on first run
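
One way to picture the dual-output and query-merging features: results from the granted-only query and the all-activity query are combined by patent number, and both datasets are then derived from the combined set. The sketch below assumes each result is a dict with patent_number and grant_date keys; those names, the granted flag, and both functions are hypothetical rather than taken from the script.

```python
def merge_queries(granted_results, all_results):
    """Union two result lists by patent number.

    A record returned by the granted-only query is marked as granted even
    if the all-activity query returned it without a grant date.
    """
    combined = {}
    for rec in all_results:
        combined[rec["patent_number"]] = dict(rec, granted=bool(rec.get("grant_date")))
    for rec in granted_results:
        entry = combined.setdefault(rec["patent_number"], dict(rec))
        entry["granted"] = True
        # Prefer a known grant date over a missing one.
        entry["grant_date"] = entry.get("grant_date") or rec.get("grant_date")
    return list(combined.values())


def split_datasets(merged):
    """Return (granted_only, all_activity) views of the merged records."""
    granted_only = [rec for rec in merged if rec["granted"]]
    return granted_only, merged
```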

Example Output

Each run produces a structured Excel file and summary:

search_20260101_120000_Thomas_Jefferson_University_20250101/
├── TJU_20250101_patents.xlsx
└── TJU_20250101_summary.txt

Sample Record (for illustrative purposes)

Patent Number	Title	Assignee	Co-Assignees	Publication Date	Grant Date
US12269155B2	Multivalent vaccines for rabies virus and coronaviruses	Thomas Jefferson University	University of Maryland Baltimore, US Department of Health and Human Services	2025-04-08	2025-04-08
AU2022206776B2	Methods and compositions for treating cancers	Thomas Jefferson University	None	2025-05-08	2025-05-08

Why This Matters

Structured patent data enables:

Identification of emerging technology trends
Competitive intelligence on company R&D pipelines
Prior art and landscape analysis
Support for IP strategy and licensing decisions across industries

Technical Design

Pipeline overview:

SerpAPI → Retrieval Layer → Parsing & Normalization → Deduplication → Structured Output (Excel)

Key components:

API querying (SerpAPI / Google Patents)
HTML parsing (BeautifulSoup)
Data structuring (pandas)
Local caching for performance optimization
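
As an illustration of the local caching component, the sketch below keeps per-patent details in a JSON file and only calls the fetch function on a cache miss. The on-disk layout (a flat dict keyed by patent number) and the fetch_details callable are assumptions; the actual structure of patent_cache.json may differ.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached dict, or an empty one if the file is missing or corrupt."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_cache(path, cache):
    """Write the cache back to disk as indented JSON."""
    Path(path).write_text(json.dumps(cache, indent=2), encoding="utf-8")


def get_details(patent_number, cache, fetch_details):
    """Return cached details, calling fetch_details only on a cache miss."""
    if patent_number not in cache:
        cache[patent_number] = fetch_details(patent_number)
    return cache[patent_number]
```

Because the cache is keyed by patent number rather than by search, a second search that returns an already-seen patent needs no new HTTP request for it.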

Getting Started

Requirements

Python 3.8 or higher — download at python.org/downloads
SerpAPI key — free at serpapi.com (250 searches/month, no credit card required). Once logged in, your API key is available at serpapi.com/manage-api-key
Python packages are installed automatically on first run (or manually via pip install -r requirements.txt)

Setup

git clone https://github.com/elichter/patent-scraper.git
cd patent-scraper
cp config.yaml.example config.yaml
nano config.yaml

Replace your_serpapi_key_here with your actual key, then save with Ctrl+O, Enter, Ctrl+X.

Run

python3 scrape_serpapi.py

You will be prompted for:

Auto-install missing packages? (y/n): enter y to have the script check for and install any missing dependencies automatically, or n to manage dependencies yourself. With n, the script will fail if any required package is missing.
Assignee name: e.g. Thomas Jefferson University
Date filter: either a start date through the present (e.g. 20250101) or a specific date range (e.g. 20230101 to 20251231)
Parallel fetch? (y/n): enter y for faster inventor/co-assignee scraping using parallel requests, or n for sequential requests (safer, less likely to be rate-limited). If y, you will also be asked how many parallel workers to use (recommended: 3-10, default 5)
Assignee review: after fetching, the script displays all unique assignee names returned by SerpAPI. Type the numbers to include (e.g. 1,6,8) and press Enter to keep only those, or press Enter alone to keep all. If left unattended, the script proceeds with all assignees after a timeout that scales with list size (5 s per assignee, minimum 30 s)
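
The prompt inputs described above can be interpreted with a few small helpers. This is a sketch under assumptions: the function names are hypothetical, and the exact input validation in scrape_serpapi.py is not shown in this README. Only the formats (YYYYMMDD, "start to end", comma-separated numbers) and the timeout rule come from the text.

```python
from datetime import datetime


def parse_date_filter(text):
    """Parse 'YYYYMMDD' or 'YYYYMMDD to YYYYMMDD' into (start, end).

    end is None when only a start date is given (start through present).
    """
    parts = [p.strip() for p in text.lower().split("to")]
    if len(parts) not in (1, 2):
        raise ValueError(f"unrecognized date filter: {text!r}")
    dates = [datetime.strptime(p, "%Y%m%d").date() for p in parts]
    start, end = dates[0], (dates[1] if len(dates) == 2 else None)
    if end is not None and end < start:
        raise ValueError("end date precedes start date")
    return start, end


def parse_selection(text, assignees):
    """Map input such as '1,6,8' to assignee names; empty input keeps all."""
    if not text.strip():
        return list(assignees)
    indices = [int(tok) for tok in text.split(",") if tok.strip()]
    # Numbers shown to the user are 1-based; out-of-range entries are ignored.
    return [assignees[i - 1] for i in indices if 1 <= i <= len(assignees)]


def review_timeout(n_assignees):
    """5 seconds per assignee, with a 30-second minimum."""
    return max(30, 5 * n_assignees)
```

For example, a list of 10 assignee names gives a 50-second review window, while any list of 6 or fewer falls back to the 30-second minimum.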

Notes

A typical run uses ~2 API credits (one for the granted query, one for all activity). The script tracks monthly usage locally in usage_tracker.json and resets the count automatically each month. Each run reports the credits used this run, the total used this month, and the estimated credits remaining against your plan limit (default 250 for the free tier, configurable in config.yaml). If you upgrade to a paid plan, update monthly_limit in config.yaml; the warning threshold adjusts automatically.
Large assignees (Fortune 500, major universities) may return 100+ results — use parallel fetch with 5-10 workers
Ambiguous names (city-based institutions, common words) — always review the assignee list carefully
Non-English patents — Japanese, Korean, and Chinese assignee names are expected and correct; do not exclude them unless you are certain they refer to a different entity
Cache management — patent_cache.json accumulates over time. Delete it only if you suspect stale data
Inventor and co-assignee data is cached locally in patent_cache.json — repeat runs skip HTTP requests for patents already seen, making subsequent runs faster. The cache is shared across searches so running multiple institutions on the same device will reuse cached data
config.yaml, patent_cache.json, and any output directories are excluded from version control via .gitignore
To install dependencies manually: pip install -r requirements.txt (includes requests, pandas, openpyxl, pyyaml, beautifulsoup4, matplotlib, scikit-learn, wordcloud)
Conda users who prefer to keep pip usage to a minimum: run conda install requests pandas openpyxl pyyaml beautifulsoup4 matplotlib scikit-learn && pip install wordcloud before running the script, then answer n when prompted about auto-installing packages
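
The monthly credit tracking mentioned in these notes can be sketched as a small JSON-backed counter that resets when the calendar month changes. The file layout ({"month": ..., "used": ...}) and the record_usage function are assumptions for illustration; only the file name usage_tracker.json, the monthly_limit setting, and the default of 250 come from this README.

```python
import json
from datetime import date
from pathlib import Path


def record_usage(path, credits_used, monthly_limit=250, today=None):
    """Add credits to this month's total and return a small summary dict."""
    today = today or date.today()
    month_key = today.strftime("%Y-%m")
    tracker_file = Path(path)
    try:
        state = json.loads(tracker_file.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    if state.get("month") != month_key:
        # New month (or first run): start counting from zero.
        state = {"month": month_key, "used": 0}
    state["used"] += credits_used
    tracker_file.write_text(json.dumps(state), encoding="utf-8")
    return {
        "this_run": credits_used,
        "this_month": state["used"],
        "remaining": max(0, monthly_limit - state["used"]),
    }
```

The remaining figure is an estimate only, since it counts credits spent through this script and not any other use of the same API key.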

Examples

📊 See EXAMPLES.md for full walkthroughs with charts, ML analysis, and sample outputs.

Sample output: granted patents by year for Thomas Jefferson University (illustrative data)

Walkthroughs include:

University tech transfer searches (Thomas Jefferson University)
Handling ambiguous assignee names (Philadelphia University)
Large corporate assignees with international filings
Downstream analysis and visualization: grants by year, co-assignees, jurisdiction activity, word cloud, technology clustering (t-SNE), and semantic similarity search
Monitoring a patent portfolio over time with incremental runs

Known Limitations

Non-US jurisdiction data gaps — Google Patents does not always index grant dates, priority dates, or co-assignee information for non-US patents (e.g. NZ, AU, some EP filings). These fields may appear as N/A even when the patent is granted.
SerpAPI broad matching — the assignee filter may return patents from similarly named institutions. The interactive assignee review step mitigates this but manual verification is recommended for ambiguous names.
Data currency — SerpAPI reflects Google Patents data which may lag official patent office records by days to weeks.
Rate limiting — parallel fetching may trigger rate limiting from Google Patents. Reduce worker count or switch to sequential mode if you encounter errors.
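
To make the rate-limiting trade-off concrete, the sketch below fetches patent details with a configurable number of workers and retries failed requests with exponential backoff. It is not the script's implementation: fetch is a hypothetical callable standing in for the per-patent request, and the retry counts and delays are illustrative defaults.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, patent_number, retries=3, base_delay=1.0):
    """Call fetch(patent_number), backing off exponentially on failure."""
    for attempt in range(retries):
        try:
            return fetch(patent_number)
        except Exception:  # real code should catch narrower, HTTP-specific errors
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def fetch_all(fetch, patent_numbers, workers=5, **retry_kwargs):
    """Fetch details for a list of patents; workers=1 means sequential mode."""
    def task(number):
        return fetch_with_retry(fetch, number, **retry_kwargs)

    if workers <= 1:
        return {n: task(n) for n in patent_numbers}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each number with its result.
        return dict(zip(patent_numbers, pool.map(task, patent_numbers)))
```

Lowering workers reduces the number of simultaneous requests, and workers=1 corresponds to the sequential mode recommended when errors persist.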

Contributing

Contributions are welcome. Please fork the repo and submit a pull request.

This project uses git-cliff for changelog generation — see generate_changelog.sh for the release workflow.

License

MIT
