🌐 Live Demo: https://wowo515151.github.io/Scrape/
Scrape Engine is a high-performance HTML/text harvester and automated prompt-injection security analysis scanner. It allows you to search the web using multiple fallback tiers of DuckDuckGo, download targets directly while bypassing common blockages, extract clean prose, and review the natural English text for potential prompt-injection threats.
Warning
Security Guard Disclaimer: Although harvested text is thoroughly scanned for prompt injection threats, these automated scans are not 100% foolproof and there are no guarantees of absolute safety. Always review and verify extracted files before importing them into downstream LLM pipelines.
- Tiered DuckDuckGo Scraping Proxy: Bypasses rate limits and scraping protections using cascading fallback strategies (DuckDuckGo Lite forms POST -> DuckDuckGo HTML GET -> DuckDuckGo Lite GET -> Wikipedia OpenSearch -> Static fallbacks) without requiring any paid API keys.
- Wayback Machine Fallback: If a live page returns access restriction codes (e.g., HTTP 403 Forbidden or 401 Unauthorized), the engine automatically checks and downloads the closest text snapshot from the Internet Archive's Wayback Machine.
- Smart English Prose Extractor: Heuristically extracts core article prose by purging HTML markup, stripping code blocks/syntaxes, filtering script/stylesheet assets, excluding title-cased promotional headers, and filtering out noisy short sections (<=3 words with no periods).
- Security Guard Analysis: Evaluates extracted prose for prompt injection attacks using rolling window splits.
- React 19 & Tailwind CSS: Elegant, dark-mode-first slate dashboard with real-time logging, status tags, and layout animations.
- Express Backend: Secure full-stack Node.js server that serves the API endpoints (/api/search and /api/fetch).
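The short-section filter and the rolling-window split listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the assumption that text arrives already split into sections, and the word-based window size and step are all hypothetical, since the feature list does not state the real thresholds.

```typescript
// Hypothetical helpers; names and defaults are illustrative assumptions,
// not the project's API.

/** Count whitespace-separated words in a string. */
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

/** Drop sections of three words or fewer that contain no period. */
function dropNoisyShortSections(sections: string[]): string[] {
  return sections.filter((section) => {
    const isShort = countWords(section) <= 3;
    const hasPeriod = section.indexOf(".") !== -1;
    return !(isShort && !hasPeriod);
  });
}

/**
 * Split text into overlapping windows of `size` words, advancing by `step`
 * words, so a short phrase that straddles one boundary can still appear
 * whole in a neighbouring window.
 */
function rollingWindows(text: string, size = 50, step = 25): string[] {
  if (size <= 0 || step <= 0) {
    throw new RangeError("size and step must be positive");
  }
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  const windows: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    windows.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return windows;
}
```

For instance, `rollingWindows("a b c d e", 3, 2)` yields `["a b c", "c d e"]`. A scanner would then evaluate each window independently, which keeps any single check small while still covering text near the window edges.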
You will need Node.js (v18 or higher) installed.
- Clone the repository:
    git clone https://github.com/wowo515151/Scrape.git
    cd Scrape
- Install dependencies:
    npm install
- (Optional) Configure environment variables. See .env.example to define any custom API keys or values required.
To start the full-stack development server:
    npm run dev
The application will launch on http://localhost:3000.
Run the integrated suite of heuristic extractor and security guard tests:
    npm run test
Compile the React bundle and the TypeScript Express server into a production-optimized build, then start it:
    npm run build
    npm start
To maximize versatility, the Scrape Engine uses a Dual-Use Engine Design that runs either as a full-stack system or as a purely static site:
- Automatic Backend Detection: On startup, the UI checks /api/health. If the Express server is detected and healthy, the app operates in Full-Stack Mode, routing search and download steps through server-side proxies.
- Serverless Static Mode Fallback: If the app is hosted on a static-only provider like GitHub Pages, where the backend is unreachable, it disables Full-Stack Mode and switches to Static Mode.
- CORS Proxy Support: In Static Mode:
  - Search queries fall back to direct, client-safe endpoints (e.g. Wikipedia OpenSearch API and related integrations that support browser requests).
  - Article fetches use customizable CORS proxies (like https://api.allorigins.win/raw?url=), configurable in the sidebar settings.
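The mode selection described above can be sketched roughly as below. This is a sketch under stated assumptions rather than the project's implementation: `detectMode`, `articleRequestUrl`, the `Mode` type, and the `?url=` query parameter on /api/fetch are hypothetical. Only the /api/health probe, the /api/fetch endpoint name, and the allorigins prefix come from the text.

```typescript
// Hypothetical names throughout; the `url` query parameter on /api/fetch
// is an assumption made for illustration.

type Mode = "full-stack" | "static";

/** Minimal shape of a health response; a browser fetch Response satisfies it. */
interface HealthResponse {
  ok: boolean;
}

/**
 * Probe the backend once. A healthy response selects Full-Stack Mode;
 * a non-OK status or a rejected request (e.g. on a static host) selects
 * Static Mode.
 */
function detectMode(probe: () => Promise<HealthResponse>): Promise<Mode> {
  return probe().then(
    (res): Mode => (res.ok ? "full-stack" : "static"),
    (): Mode => "static",
  );
}

/** Build the URL an article fetch would request in each mode. */
function articleRequestUrl(
  mode: Mode,
  target: string,
  corsProxy = "https://api.allorigins.win/raw?url=",
): string {
  const encoded = encodeURIComponent(target);
  return mode === "full-stack"
    ? `/api/fetch?url=${encoded}`
    : `${corsProxy}${encoded}`;
}
```

In a browser, the probe could be supplied as `() => fetch("/api/health")`. Passing the probe in as a parameter keeps the decision logic free of network code, so it can be exercised with a stub.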
We have preconfigured a fully automated deployment pipeline inside .github/workflows/deploy.yml:
- Create a new repository on GitHub and commit this codebase.
- Go to your repository Settings -> Pages (in the sidebar).
- Under Build and deployment -> Source, select GitHub Actions.
- Push your changes to the main or master branch. The Actions pipeline will compile the production build and publish it automatically.
If you prefer to run the full-stack version to leverage direct, proxy-free server-side scraping:
- Google Cloud Run: Preconfigured container host (builds and runs automatically using standard files in the repository).
- Render: Connect your repository as a "Web Service":
  - Build Command: npm run build
  - Start Command: npm run start
- Fly.io: Run fly launch in your project directory to autoconfigure and deploy the Express backend service.
- Railway: Link your repository to build and launch the container.
This project is open-source and licensed under the MIT License.