alexandrainst/wiki_expanded

Expanded Wikipedia Dataset

A long-context Wikipedia dataset, constructed by expanding the links in Wikipedia articles.

Installation

make install

Extract data from a Wikipedia dump

We use Dumpster Dive to extract the data from a Wikipedia dump.

1️⃣ Install dependencies

Install Node.js (v6 or later) and MongoDB (v3 or later).

# Install dumpster-dive
npm install -g dumpster-dive # (this gives you the global command `dumpster`)
# Start MongoDB (this config path is the Homebrew default on Apple Silicon; adjust it for your system)
mongod --config /opt/homebrew/etc/mongod.conf

2️⃣ Prepare the Wikipedia dump

2a. Set language code

# Choose your target language (e.g., "da" for Danish, "en" for English)
# Note: LANG is also the shell's locale variable, so run these steps in a dedicated shell session
LANG="da"

2b. Download Wikipedia dump

# Download the latest Wikipedia dump for your chosen language
# Note: The dump can be several GB, so the download may take a while depending on your connection
wget "https://dumps.wikimedia.org/${LANG}wiki/latest/${LANG}wiki-latest-pages-articles.xml.bz2"
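The URL pattern in the `wget` command above can also be built programmatically, which may help when looping over several languages. The following is a minimal sketch, not part of this repository: the function name `dump_url` and its validation rule are hypothetical, and it only handles plain lowercase language codes such as "da" or "en".

```python
import re

DUMP_HOST = "https://dumps.wikimedia.org"


def dump_url(lang: str) -> str:
    """Return the latest pages-articles dump URL for a language code.

    Hypothetical helper mirroring the wget command in step 2b.
    """
    # Assumption: only plain lowercase codes are accepted in this sketch;
    # hyphenated codes are rejected rather than guessed at.
    if not re.fullmatch(r"[a-z]+", lang):
        raise ValueError(f"unsupported language code: {lang!r}")
    wiki = f"{lang}wiki"
    return f"{DUMP_HOST}/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"
```

For example, `dump_url("da")` yields the same URL that step 2b downloads when `LANG="da"`.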

2c. Extract the dump

# Decompress the bzip2-compressed XML file (this deletes the .bz2 file; pass -k to keep it)
bzip2 -d "./${LANG}wiki-latest-pages-articles.xml.bz2"
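If `bzip2` is not available, the same decompression can be done with Python's standard library. This is a sketch under that assumption; `decompress_bz2` is a hypothetical helper name, and unlike `bzip2 -d` it leaves the compressed file in place.

```python
import bz2
import shutil


def decompress_bz2(src: str, dst: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream-decompress `src` into `dst` in chunks.

    Streaming avoids loading a multi-GB dump into memory at once.
    """
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
```

A call such as `decompress_bz2("dawiki-latest-pages-articles.xml.bz2", "dawiki-latest-pages-articles.xml")` would produce the XML file that step 2d expects.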

2d. Load data into MongoDB

# Parse Wikipedia XML and load into MongoDB (ensure MongoDB is running)
dumpster "./${LANG}wiki-latest-pages-articles.xml" \
  --infoboxes=false --citations=false --categories=false --images=false \
  --links=true --plaintext=true \
  --db_url "mongodb://127.0.0.1:27017" --db "${LANG}wiki"

Note: This step requires MongoDB to be running. Start it with the command from step 1.
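Before starting a long import, it can be worth confirming that something is listening on the MongoDB port. The sketch below uses only the standard library; `mongo_is_reachable` is a hypothetical helper, and the defaults are taken from the `--db_url` above. It checks only that a TCP connection is accepted, not that the listener is actually MongoDB.

```python
import socket


def mongo_is_reachable(host: str = "127.0.0.1", port: int = 27017,
                       timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If `mongo_is_reachable()` returns `False`, start `mongod` as shown in step 1 before running `dumpster`.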

2e. Export processed data

# Export the processed data from MongoDB to a JSON Lines file (one document per line)
mongoexport --db="${LANG}wiki" --collection=pages --out="${LANG}wiki_pages.jsonl"
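A quick sanity check on the exported file can catch an empty or truncated export before the processing step. The sketch below makes no assumption about the fields dumpster-dive writes; it simply counts records and tallies whichever top-level keys appear. `summarize_jsonl` is a hypothetical helper, not part of this repository.

```python
import json
from collections import Counter


def summarize_jsonl(path: str, limit: int = 1000):
    """Return (records_read, key_counts) for the first `limit` records.

    Each non-empty line is expected to hold one JSON object.
    """
    key_counts = Counter()
    records_read = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            key_counts.update(record.keys())
            records_read += 1
            if records_read >= limit:
                break
    return records_read, key_counts
```

Calling `summarize_jsonl("dawiki_pages.jsonl")` and printing the result shows which fields are present and how often, which is a reasonable first look at the export.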

3️⃣ Process the extracted data

Build five JSON files that will be used to construct the expanded Wikipedia dataset.

python src/scripts/process.py --jsonl-file="${LANG}wiki_pages.jsonl"

4️⃣ Build the expanded Wikipedia dataset

Build the expanded Wikipedia dataset by expanding links in the articles.

python src/scripts/build_dataset.py --include-strategy=prepend --max-link-expansions=10 --num-tokens-threshold=15000
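To give a rough intuition for what the three options control, here is a toy illustration of link expansion. It is not the implementation in `build_dataset.py`: the function names, the data layout (`texts` and `links` dictionaries), the whitespace-based token count, and the exact stopping rule are all assumptions made for the example.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokenizer tokens.
    return len(text.split())


def expand_article(title, texts, links,
                   max_link_expansions=10, num_tokens_threshold=15000):
    """Prepend linked articles to `title`'s text until a limit is reached.

    texts: dict mapping article title -> plain text
    links: dict mapping article title -> list of linked titles
    """
    main_text = texts[title]
    total_tokens = count_tokens(main_text)
    prepended = []
    for target in links.get(title, []):
        # Assumed stopping rule: stop once either limit has been reached.
        if len(prepended) >= max_link_expansions:
            break
        if total_tokens >= num_tokens_threshold:
            break
        linked_text = texts.get(target)
        if linked_text is None or target == title:
            continue  # skip links to missing articles and self-links
        prepended.append(linked_text)
        total_tokens += count_tokens(linked_text)
    # "prepend" strategy: linked articles come before the main article.
    return "\n\n".join(prepended + [main_text])
```

In this sketch, a larger `max_link_expansions` or `num_tokens_threshold` produces longer documents, since more linked articles are placed in front of the main text.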

License: MIT

Developer:

Adding and Removing Packages

To install new PyPI packages, run:

uv add <package-name>

To remove them again, run:

uv remove <package-name>

To show all installed packages, run:

uv pip list
