A long-context Wikipedia dataset constructed by expanding links in Wikipedia articles.
Set up the project:

```bash
make install
```

We use Dumpster Dive to extract the data from a Wikipedia dump. Install Node.js (at least v6) and MongoDB (at least v3), then:
```bash
# install this script (this gives you the global command `dumpster`)
npm install -g dumpster-dive

# start mongo up
mongod --config /opt/homebrew/etc/mongod.conf
```
LANG="da"# Download the latest Wikipedia dump for your chosen language
# Note: This can be several GB and may take time depending on your connection
wget "https://dumps.wikimedia.org/${LANG}wiki/latest/${LANG}wiki-latest-pages-articles.xml.bz2"# Unzip the compressed XML file
bzip2 -d "./${LANG}wiki-latest-pages-articles.xml.bz2"# Parse Wikipedia XML and load into MongoDB (ensure MongoDB is running)
dumpster "./${LANG}wiki-latest-pages-articles.xml" \
--infoboxes=false --citations=false --categories=false --images=false --links=true --plaintext=true --db_url "mongodb://127.0.0.1:27017" --db "${LANG}wiki"Note: This step requires MongoDB to be running. Start it with the command from step 1.
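Before exporting, you can optionally sanity-check the import from Python. This is a minimal sketch, assuming the `pymongo` package is installed; the database name and the `pages` collection match the `mongoexport` command below.

```python
"""Optional sanity check of the Dumpster Dive import (assumes `pymongo` is installed)."""
from pymongo import MongoClient

LANG = "da"  # same language code as in the shell commands above

client = MongoClient("mongodb://127.0.0.1:27017")
pages = client[f"{LANG}wiki"]["pages"]  # the collection that `mongoexport` reads below

# Rough number of parsed articles
print("Articles loaded:", pages.estimated_document_count())

# Peek at the top-level fields of one parsed article
print("Fields on one document:", sorted(pages.find_one().keys()))
```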
```bash
# Export the processed data from MongoDB to JSON
mongoexport --db="${LANG}wiki" --collection=pages --out="${LANG}wiki_pages.jsonl"
```
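The export writes one JSON document per line. As a quick check (a sketch, assuming the same `LANG` as above), you can count the exported records and compare the number with the article count from the previous step:

```python
"""Count the exported records; this should match the article count in MongoDB."""
LANG = "da"  # same language code as above

with open(f"{LANG}wiki_pages.jsonl", encoding="utf-8") as f:
    print("Exported records:", sum(1 for _ in f))
```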
Build five JSON files that will be used to construct the expanded Wikipedia dataset:

```bash
python src/scripts/process.py --jsonl-file="${LANG}wiki_pages.jsonl"
```
Build the expanded Wikipedia dataset by expanding links in the articles:

```bash
python src/scripts/build_dataset.py \
    --include-strategy=prepend \
    --max-link-expansions=10 \
    --num-tokens-threshold=15000
```
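For intuition, here is a conceptual sketch of what the flags above suggest the build step does: the plaintext of linked articles is prepended to the source article (`--include-strategy=prepend`), at most 10 links are expanded (`--max-link-expansions=10`), and expansion stops once the document would exceed roughly 15,000 tokens (`--num-tokens-threshold=15000`). This is not the repository's implementation; the field names (`text`, `links`) and the `count_tokens` helper are hypothetical.

```python
"""Conceptual sketch of a "prepend" link-expansion step (not the repository's code)."""
from typing import Callable


def expand_article(
    article: dict,
    articles_by_title: dict[str, dict],
    count_tokens: Callable[[str], int],
    max_link_expansions: int = 10,
    num_tokens_threshold: int = 15_000,
) -> str:
    """Prepend linked articles' text until the link or token budget is hit."""
    expanded = article["text"]
    expansions = 0
    for title in article["links"]:
        if expansions >= max_link_expansions:
            break
        linked = articles_by_title.get(title)
        if linked is None:
            continue  # linked page missing from the dump (e.g. a red link)
        candidate = linked["text"] + "\n\n" + expanded  # the "prepend" strategy
        if count_tokens(candidate) > num_tokens_threshold:
            break
        expanded = candidate
        expansions += 1
    return expanded
```

A rough stand-in for `count_tokens` could be `lambda s: len(s.split())`; the real script presumably uses an actual tokenizer.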
Developer:

- Oliver Kinch ([email protected])
To install new PyPI packages, run:

```bash
uv add <package-name>
```

To remove them again, run:

```bash
uv remove <package-name>
```

To show all installed packages, run:

```bash
uv pip list
```