Reimplement a simplified scraping tool and make it available within this repository.
Make it a package, ParCzech::Scraper, because it will be needed in multiple scripts.
Features:
- downloading (ParCzech::Scraper::Down)
  - allow using delays between HTTP requests
  - allow saving the downloaded file (use the same path as the URL)
  - save metadata in a TSV file if available (see change folder structure #212)
  - allow using cached data
  - change relative URLs to absolute
- parsing and processing (ParCzech::Scraper::Parse)
  - preprocess function
  - provide some built-in preprocess functions that can be used (allow some text polishing, like character replacements)
  - allow not parsing data at all and saving it raw (e.g. for audio files)
  - traversing data (allow adding a context node)
  - string and node results
  - scalar or array context
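The downloading features listed above (delay between requests, saving under a path that mirrors the URL, cache reuse, relative-to-absolute URLs) could be sketched roughly as follows. This is an illustration in Python only; the package proposed in this issue is Perl, and every name here (`Downloader`, `url_to_path`, `absolutize`, the `fetch` and `sleep` parameters) is hypothetical. Metadata in TSV and query-string handling are left out.

```python
import time
from pathlib import Path
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


def url_to_path(cache_dir, url):
    """Map a URL to a local path that mirrors the URL's host and path."""
    parts = urlsplit(url)
    rel = parts.path.lstrip("/")
    if not rel or rel.endswith("/"):
        # Assumption: directory-like URLs are stored under a default file name.
        rel += "index.html"
    return Path(cache_dir) / parts.netloc / rel


def absolutize(base_url, href):
    """Turn a relative link found on base_url into an absolute URL."""
    return urljoin(base_url, href)


class Downloader:
    def __init__(self, cache_dir, delay=1.0, fetch=None, sleep=time.sleep):
        self.cache_dir = cache_dir
        self.delay = delay                      # minimum seconds between HTTP requests
        self._fetch = fetch or self._http_get   # injectable, so it can be tested offline
        self._sleep = sleep
        self._last_request = None

    @staticmethod
    def _http_get(url):
        with urlopen(url) as response:
            return response.read()

    def get(self, url, use_cache=True):
        path = url_to_path(self.cache_dir, url)
        if use_cache and path.exists():
            return path.read_bytes()            # cached copy, no request and no delay
        if self._last_request is not None:
            wait = self.delay - (time.monotonic() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        data = self._fetch(url)
        self._last_request = time.monotonic()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)                  # save under the URL-shaped path
        return data
```

Because the delay is applied only right before a real request, cached files are returned immediately, which keeps repeated runs over already downloaded data fast.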
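The parsing and processing features could be sketched in a similar way. Again this is only a Python illustration under stated assumptions, not the Perl package itself: `Parser`, `replace_chars`, `nodes`, `strings`, and `first_string` are hypothetical names, the input is assumed to be well-formed XML after preprocessing, and the paths use the limited XPath subset of `xml.etree.ElementTree`. Perl's scalar versus array context is approximated by separate "first" and "all" methods.

```python
import xml.etree.ElementTree as ET


def replace_chars(table):
    """Build a preprocess function that applies simple string replacements."""
    def preprocess(text):
        for old, new in table.items():
            text = text.replace(old, new)
        return text
    return preprocess


class Parser:
    def __init__(self, data, preprocess=None, raw=False):
        self.raw = raw
        self.data = data
        self.root = None
        if raw:
            return                              # keep bytes untouched, e.g. audio files
        text = data.decode("utf-8") if isinstance(data, bytes) else data
        if preprocess is not None:
            text = preprocess(text)             # polish the text before parsing
        self.root = ET.fromstring(text)

    def nodes(self, path, context=None):
        """Return all matching nodes, searched from the root or a context node."""
        if self.root is None:
            raise ValueError("raw data was not parsed")
        start = self.root if context is None else context
        return start.findall(path)

    def strings(self, path, context=None):
        """Return the text content of all matching nodes ("array context")."""
        return ["".join(n.itertext()).strip() for n in self.nodes(path, context)]

    def first_string(self, path, context=None):
        """Return the first matching text or None ("scalar context")."""
        found = self.strings(path, context)
        return found[0] if found else None
```

A short usage example: the `&nbsp;` entity is not defined in plain XML, so the built-in style replacement has to run before parsing.

```python
doc = "<root><item><name>A&nbsp;B</name></item><item><name>C</name></item></root>"
parser = Parser(doc, preprocess=replace_chars({"&nbsp;": " "}))
items = parser.nodes("item")                    # node results
names = parser.strings("name", context=items[1])  # string results relative to a context node
```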