make web scraper accessible #213

@matyaskopp

Reimplement a simplified scraping tool and make it available within this repository.
Package it as ParCzech::Scraper, because it will be needed in multiple scripts.

features

  • downloading (ParCzech::Scraper::Down)
    • allow delays between HTTP requests
    • allow saving the downloaded file (use the same path as the URL)
    • save metadata in a TSV file if available (see "change folder structure", #212)
    • allow using cached data
    • convert relative URLs to absolute
  • parsing and processing (ParCzech::Scraper::Parse)
    • preprocess function
    • provide some built-in preprocess functions that can be used (for text polishing, such as character replacements)
    • allow skipping parsing entirely and saving the raw data (e.g. for audio files)
    • traversing data (allow adding context node)
      • string and node results
      • scalar or array context
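
Two of the downloading features above, saving a file under the same path as its URL and converting relative URLs to absolute, can be sketched as follows. This is only an illustration in Python, not the proposed ParCzech::Scraper::Down code. The function names, the `downloads` root directory, the `index.html` placeholder, and the example.org URLs are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlsplit


def url_to_local_path(url, root="downloads"):
    """Map a URL to a local path that mirrors it: root/host/path.

    Hypothetical helper; the query string and fragment are ignored here.
    """
    parts = urlsplit(url)
    path = parts.path
    # A URL with no path, or one ending in "/", has no file name,
    # so an assumed placeholder name is appended.
    if not path or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result stays under root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return str(PurePosixPath(root, parts.netloc, *segments))


def absolutize(base_url, href):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_url, href)


if __name__ == "__main__":
    page = "https://example.org/a/b/page.html"
    print(url_to_local_path(page))
    print(absolutize(page, "../c/file.mp3"))
```

Mirroring the URL in the file path also makes the cache check simple: if the mapped file already exists, the request (and the delay before it) can be skipped.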
