This crawler gathers structured content from a collection of Nepali-language websites, turning raw articles, posts, and pages into an organized dataset. It’s great for researchers, data scientists, or anyone looking to build language-specific corpora, text analytics data, or localized content archives.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for Nepali Dataset Crawler you've just found your team — Let's Chat. 👆👆
The Nepali Dataset Crawler fetches public webpages from Nepali web domains and normalizes their content — including text bodies, metadata, titles, publish dates, and possibly images — into a structured dataset. It’s ideal for NLP tasks, local-market research, cultural analysis, or language-specific data pipelines.
- Crawl multiple Nepali websites quickly and at scale.
- Collect article content, metadata, and publication information.
- Build a clean text dataset for NLP, sentiment analysis, or translation projects.
- Archive web content from Nepali sources for future processing.
| Feature | Description |
|---|---|
| Multi-site Crawling | Supports scraping from multiple Nepali websites. |
| Content Extraction | Extracts titles, article content, publish date, and metadata. |
| Structured Output | Returns cleaned data in JSON or other export-friendly formats. |
| Dataset Building | Useful for building full corpora for NLP or research. |
| Scalable Execution | Can process many pages with efficient crawling logic. |
| Field Name | Field Description |
|---|---|
| url | Original article page URL. |
| title | Title of the article or content. |
| publishDate | Date when the content was published (if available). |
| author | Author name (if available). |
| contentText | Main article text stripped of HTML tags. |
| contentHtml | Full content including HTML (optional). |
| language | Language of the content (expected “Nepali”). |
| metadata | Any additional metadata (tags, categories, categories, etc.). |
[
{
"url": "https://nepalnews.example.com/article/1234",
"title": "काठमाडौं उपत्यकामा मौसम अपडेट",
"publishDate": "2024-08-15",
"author": "राम शर्मा",
"contentText": "आज काठमाडौं उपत्यकामा मौसम सफा रहेको छ ...",
"contentHtml": "<p>आज काठमाडौं उपत्यकामा ...</p>",
"language": "ne",
"metadata": {
"tags": ["weather", "local news"]
}
}
]
Nepali Dataset Crawler/
├── src/
│ ├── main.js
│ ├── crawler/
│ │ ├── site_list_loader.js
│ │ ├── page_fetcher.js
│ │ └── content_parser.js
│ ├── utils/
│ │ ├── normalizer.js
│ │ └── html_cleaner.js
│ └── config/
│ └── settings.example.json
├── data/
│ ├── sample_input.json
│ └── sample_output.json
├── package.json
└── README.md
- NLP Researchers build Nepali language corpora for modeling, translation, or sentiment analysis.
- Journalism Analysts archive news articles for trend tracking.
- Cultural Researchers collect content for sociolinguistic or cultural studies.
- Developers build localized content feeds or recommendation systems.
- Data Engineers ingest scraped data into data warehouses or pipelines.
Which websites does it crawl?
It supports a configurable list of Nepali-language websites — you supply domains in settings.
What formats are outputted?
Structured JSON is default; data can be converted to CSV or other formats as needed.
Can it handle large-scale crawls?
Yes — the crawler is built to scale across many pages efficiently.
Is language detection supported?
Content is tagged as Nepali (“ne”) by default; further language checks can be added.
Primary Metric:
Processes dozens of pages per second under optimal network conditions.
Reliability Metric:
High success rate (~98%) on publicly accessible sites with consistent HTML structure.
Efficiency Metric:
Lightweight parsing and normalization reduce overhead and speed up processing.
Quality Metric:
Produces cleaned, normalized text outputs ideal for analysis, modeling, or archiving.
