Built to uncover hidden marketing signals on Reddit — and help power smarter growth for Cronlytic.com 🚀
📺 Click the thumbnail above to watch a full explainer — why I built this tool, how it works, and how you can use it to automate Reddit lead generation using GPT-4.
A Python application that scrapes Reddit for potential marketing leads, analyzes them with GPT models, and identifies high-value opportunities. Includes an interactive Streamlit dashboard for browsing and filtering results.
- Overview
- Setup
- Configuration
- Running
- GUI Dashboard
- Results
- Project Structure
- Cost Controls
- Contributors
- Why This Exists
- License
- Third-Party Licenses
This tool uses a combination of Reddit's API and AI models (OpenAI or Anthropic) to:
- Scrape relevant subreddits for discussions across diverse domains (tech, finance, parenting, fitness, business, and more)
- Identify posts that express pain points with real product-building potential
- Score and analyze posts using multi-dimensional metrics including technical depth, implementability, and emotional intensity
- Store results in a local SQLite database for review
- Browse and filter results through an interactive web dashboard
The application maintains a balance between focused and exploratory subreddits, intelligently refreshing the exploratory list based on discoveries. This exploration process happens automatically as part of the main workflow.
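Conceptually, a single run boils down to: scrape, pre-filter cheaply, analyze deeply, store. The sketch below illustrates that flow with placeholder stubs; none of these function names exist in the repo (the real orchestration lives in `scheduler/runner.py`):

```python
# Illustrative sketch of one pipeline pass. Every function below is a
# placeholder stub, not the project's actual API.

def scrape_subreddit(name):
    # Stand-in for the PRAW-based, rate-limited scraper.
    return [{"subreddit": name, "title": f"Example pain point from r/{name}"}]

def cheap_filter(posts):
    # Stand-in for the low-cost-model pre-filter that drops irrelevant posts.
    return [p for p in posts if "pain" in p["title"].lower()]

def deep_analyze(posts):
    # Stand-in for the batch-API deep analysis that attaches scores.
    return [dict(p, relevance_score=0.8) for p in posts]

def run_pipeline(primary, exploratory):
    """One end-to-end pass: scrape -> pre-filter -> analyze -> results."""
    posts = []
    for sub in primary + exploratory:
        posts.extend(scrape_subreddit(sub))
    return deep_analyze(cheap_filter(posts))

results = run_pipeline(["SaaS"], ["selfhosted"])
print(len(results))  # 2
```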
- Python 3.10+
- Reddit API credentials (create an app at https://www.reddit.com/prefs/apps)
- OpenAI API key or Anthropic API key (configurable via `config.yaml`)
1. Clone the repository:

   ```bash
   git clone https://github.com/Mohamedsaleh14/Reddit_Scrapper.git
   cd Reddit_Scrapper
   ```

2. Create a virtual environment:

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables by copying `.env.template` to `.env`:

   ```bash
   cp .env.template .env
   ```

5. Edit `.env` and add your API credentials:

   ```
   REDDIT_CLIENT_ID=your_client_id
   REDDIT_CLIENT_SECRET=your_client_secret
   REDDIT_USER_AGENT=script:cronlytic-reddit-scraper:v1.0 (by /u/yourusername)
   OPENAI_API_KEY=your_openai_api_key
   ANTHROPIC_API_KEY=your_anthropic_api_key
   ```
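Before a full run, you can sanity-check that the credentials are present. A minimal standalone snippet (not part of the repo; variable names taken from `.env.template` — only the Reddit variables are always required, since the AI key you need depends on the provider selected in `config.yaml`):

```python
import os

# Names from .env.template; OPENAI_API_KEY or ANTHROPIC_API_KEY is needed
# depending on the configured provider, so only the Reddit variables are
# checked unconditionally here.
REQUIRED = ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT"]

def missing_env(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env()
    print("Missing: " + ", ".join(missing) if missing else "Credentials look set.")
```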
Configure the application by editing config/config.yaml. Key settings include:
- AI provider: Choose between `openai` or `anthropic` as the batch processing backend
- Target subreddits: Primary subreddits and exploratory subreddit settings
- Post age range: Only analyze posts within the configured age window
- API rate limits: Prevent hitting Reddit API limits
- AI models: Per-provider model configuration for filtering and deep analysis
- Monthly budget: Cap total API spending
- Scoring weights: How to weight different factors (relevance, pain point clarity, emotional intensity, implementability, technical depth) when scoring posts
- Token limits: Per-model enqueued token limits for batch API submissions
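For orientation only, those settings might map onto YAML along these lines. The key names below are illustrative guesses, not the project's actual schema; consult `config/config.yaml` for the real keys:

```yaml
# Illustrative only -- the real key names live in config/config.yaml.
ai_provider: openai          # or: anthropic
monthly_budget_usd: 20.0     # hard cap on API spend
post_age_days: 7             # only analyze posts newer than this
scoring_weights:
  relevance: 0.3
  pain_point_clarity: 0.25
  emotional_intensity: 0.15
  implementability: 0.15
  technical_depth: 0.15
```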
To run the pipeline once:
```bash
python3 main.py
```
This will:
- Scrape posts from configured primary subreddits
- Automatically discover and scrape from exploratory subreddits
- Analyze all posts with GPT models
- Store results in the database
To run the pipeline daily at the configured time (TODO: fix the scheduler):

```bash
python3 scheduler/daily_scheduler.py
```
After running the pipeline at least once, you can explore the results using the interactive Streamlit dashboard:
```bash
./run_gui.sh
```

Or run it directly:

```bash
streamlit run gui/gui.py --server.port 8501 --server.address localhost
```

The dashboard provides:
- Score filtering — Adjust sliders for ROI, relevance, pain score, emotion score, implementability, and technical depth to focus on the posts that matter most
- Subreddit filters — Multi-select filters to narrow results by source subreddit
- Sorting — Sort by any score metric (including technical depth) or post date, ascending or descending
- Pagination — Browse through large result sets 10 posts at a time
- Post cards — Each post displays scores, pain point summary, product opportunity, technical depth, tags, and a link to the original Reddit thread
- Expandable details — Click into any post to read the body text, AI-generated justification, affected audience, business type, existing alternatives, build complexity, business model, and technical moat analysis
- Summary statistics — Sidebar shows total posts with average relevance, pain, emotion, and tech depth scores for the current filter
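If you prefer scripted analysis over the dashboard, the same data can be pulled with Python's standard `sqlite3` module. The sketch below builds an in-memory stand-in table; the column names are inferred from the sample queries in the Results section, so verify them against `db/schema.py` before relying on them:

```python
import sqlite3

# Demo stand-in: the real database is at data/db.sqlite and the real
# schema lives in db/schema.py. Columns here mirror the sample SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (title TEXT, roi_weight REAL, "
    "relevance_score REAL, tags TEXT, processed_at TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("Need a cron alternative", 0.9, 0.8, "serverless,cron", "2024-01-02"),
        ("Unrelated rant", 0.1, 0.2, "misc", "2024-01-01"),
    ],
)

# Highest-value lead first, same ordering as the "top leads" query below.
top = conn.execute(
    "SELECT title FROM posts "
    "ORDER BY roi_weight DESC, relevance_score DESC LIMIT 1"
).fetchone()
print(top[0])  # Need a cron alternative
```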
Results are stored in a SQLite database at data/db.sqlite. Besides the GUI, you can query it directly:
```sql
-- Today's top leads
SELECT * FROM posts
WHERE processed_at >= DATE('now')
ORDER BY roi_weight DESC, relevance_score DESC
LIMIT 10;

-- Posts with specific tag
SELECT * FROM posts
WHERE tags LIKE '%serverless%'
ORDER BY processed_at DESC;
```

```
Reddit_Scrapper/
├── config/                        # Configuration files
│   ├── config.yaml                # Main configuration
│   └── config_loader.py           # Config + prompt loading
├── db/                            # Database interaction
│   ├── schema.py                  # Table definitions
│   ├── reader.py                  # Read queries
│   ├── writer.py                  # Write operations
│   └── cleaner.py                 # Old entry cleanup
├── gpt/                           # AI integration (OpenAI & Anthropic)
│   ├── batch_api.py               # OpenAI Batch API submission & polling
│   ├── anthropic_batch.py         # Anthropic Message Batches API integration
│   ├── batch_provider.py          # Provider routing layer (OpenAI/Anthropic)
│   ├── filters.py                 # Pre-filtering prompt builder
│   ├── insights.py                # Deep insight prompt builder
│   └── prompts/                   # Prompt templates
│       ├── filter.txt
│       ├── insight.txt
│       ├── community_discovery.txt
│       └── community_discovery_system.txt
├── gui/                           # Web dashboard
│   └── gui.py                     # Streamlit application
├── reddit/                        # Reddit API interaction
│   ├── scraper.py                 # Post & comment scraping
│   ├── discovery.py               # Exploratory subreddit discovery
│   └── rate_limiter.py            # API rate limiting
├── scheduler/                     # Scheduling & cost tracking
│   ├── runner.py                  # Main pipeline orchestration
│   └── cost_tracker.py            # Monthly budget tracking
├── utils/                         # Utility functions
│   ├── helpers.py                 # Token estimation, sanitization
│   └── logger.py                  # Logging setup
├── scripts/                       # Utility scripts
│   └── clean_openai_storage.py    # Clean accumulated OpenAI batch files
├── .env.template                  # Template for environment variables
├── main.py                        # Application entry point
├── run_gui.sh                     # GUI launcher script
└── requirements.txt               # Python dependencies
```
The application includes several safeguards to control API costs:
- Monthly budget cap (configurable in `config.yaml`)
- Efficient batch processing using OpenAI's Batch API or Anthropic's Message Batches API
- Per-model enqueued token limits to avoid provider quota issues
- Automatic OpenAI storage cleanup (removes accumulated batch input/output files)
- Parallel batch submission with token-aware scheduling
- Partial result recovery from expired batches
- Pre-filtering with less expensive models before using more powerful models
- Cost tracking and logging
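The budget cap works by refusing to submit batches that would push monthly spend over the limit. A toy sketch of the idea (this is not the project's actual `scheduler/cost_tracker.py`):

```python
class BudgetTracker:
    """Toy monthly budget cap: refuse work once spend would exceed the cap."""

    def __init__(self, monthly_cap_usd):
        self.cap = monthly_cap_usd
        self.spent = 0.0

    def can_afford(self, estimated_cost):
        # Check the estimate against remaining budget before submitting.
        return self.spent + estimated_cost <= self.cap

    def record(self, actual_cost):
        # Log real spend after the batch completes.
        self.spent += actual_cost

tracker = BudgetTracker(monthly_cap_usd=10.0)
if tracker.can_afford(2.5):
    tracker.record(2.5)          # submit the batch, then log its cost
print(tracker.can_afford(9.0))   # False: 2.5 + 9.0 exceeds the 10.0 cap
```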
| Feature | Status | Notes |
|---|---|---|
| Reddit Scraping (Posts & Comments) | ✅ Done | Age-filtered, deduplicated, tracked via history table |
| Primary & Exploratory Subreddit Logic | ✅ Done | With refreshable exploratory_subreddits.json |
| GPT Filtering | ✅ Done | Via batch API, scoring + threshold-based selection |
| GPT Insight Extraction | ✅ Done | With batch API, structured JSON, ROI + tags |
| SQLite Local DB Storage | ✅ Done | Full schema, type handling (post/comment) |
| Rate Limiting | ✅ Done | Real limiter applied to avoid Reddit bans |
| Budget Control | ✅ Done | Tracks monthly cost, blocks over-budget batches |
| Daily Runner Pipeline | ✅ Done | Logs step-by-step, fail-safe batch handling |
| Anthropic Batch API Provider | ✅ Done | Full alternative to OpenAI with config-based switching |
| Parallel Batch Processing | ✅ Done | Token-aware scheduling, partial result recovery |
| Technical Depth Scoring | ✅ Done | Measures engineering complexity and defensibility |
| Implementability Scoring | ✅ Done | Feasibility assessment with willingness-to-pay signals |
| OpenAI Storage Cleanup | ✅ Done | Auto-cleans accumulated batch files before each run |
| Cached Summaries → GPT Discovery | ✅ Done | Based on post text, fallback if prompt fails |
| Comment scraping toggle | ✅ Done | Controlled via config key (include_comments) |
| Retry on GPT Batch Failures | ✅ Done | With exponential backoff and item-level retry |
| Streamlit GUI Dashboard | ✅ Done | Filter, sort, browse, and analyze results visually |
| Feature | Status | Suggestion |
|---|---|---|
| Parallel subreddit fetching | 🟡 Manual (sequential) | Consider async/threaded fetch in future |
| Tagged CSV Export / CLI | 🟡 Missing | Useful for non-technical review/debug |
| Multi-language / non-English handling | 🟡 Not supported | Detect & skip or flag for English-only use |
| Unit tests / mocks | 🟡 Not present | Add test coverage for scoring and DB logic |
Thanks to the following people who have contributed to this project:
| Contributor | Contributions |
|---|---|
| @Mohamedsaleh14 | Creator & maintainer |
| @Dieterbe | Bug fixes, prompt system refactoring, enhanced logging, GUI, batch optimization, and many quality-of-life improvements |
| Claude Code | AI pair programmer — code implementation, issue triage, and PR integration |
This tool was created as part of the growth strategy for Cronlytic.com — a serverless cron job scheduler designed for developers, indie hackers, and SaaS teams.
If you're building something and want to:
- Run scheduled webhooks or background jobs
- Get reliable cron-like execution in the cloud
- Avoid over-engineering with full servers
👉 Check out Cronlytic — and let us know what you'd love to see.
This project is open source for personal and non-commercial use only. Commercial use (including hosting it as a backend or integrating into products) requires prior approval.
See the LICENSE file for full terms.
This project uses open source libraries, which are governed by their own licenses:
- PRAW — MIT License
- APScheduler — MIT License
- OpenAI Python SDK — MIT License
- Anthropic Python SDK — MIT License
- Streamlit — Apache License 2.0
- Pandas — BSD 3-Clause License
- Reddit API — Subject to Reddit's Terms of Service
Use of this project must also comply with these third-party licenses and terms.
