A long-context Wikipedia dataset constructed by expanding links in Wikipedia articles.
Set up the project:

```bash
make install
```

We use Dumpster Dive to extract the data from a Wikipedia dump. Install Node.js (at least v6) and MongoDB (at least v3), then:
```bash
# install this script (this gives you the global command `dumpster`)
npm install -g dumpster-dive

# start mongo up
mongod --config /opt/homebrew/etc/mongod.conf
```
LANG="da"# Download the latest Wikipedia dump for your chosen language
# Note: This can be several GB and may take time depending on your connection
wget "https://dumps.wikimedia.org/${LANG}wiki/latest/${LANG}wiki-latest-pages-articles.xml.bz2"# Unzip the compressed XML file
bzip2 -d "./${LANG}wiki-latest-pages-articles.xml.bz2"# Parse Wikipedia XML and load into MongoDB (ensure MongoDB is running)
dumpster "./${LANG}wiki-latest-pages-articles.xml" \
--infoboxes=false --citations=false --categories=false --images=false --links=true --plaintext=true --db_url "mongodb://127.0.0.1:27017" --db "${LANG}wiki"Note: This step requires MongoDB to be running. Start it with the command from step 1.
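Before exporting, you can optionally sanity-check the import from Python. This is a minimal sketch, assuming the `pymongo` package is installed; the database name and the `pages` collection match the `mongoexport` command below.

```python
"""Optional sanity check of the Dumpster Dive import (assumes `pymongo` is installed)."""
from pymongo import MongoClient

LANG = "da"  # same language code as in the shell commands above

client = MongoClient("mongodb://127.0.0.1:27017")
pages = client[f"{LANG}wiki"]["pages"]  # the collection that `mongoexport` reads below

# Rough number of parsed articles
print("Articles loaded:", pages.estimated_document_count())

# Peek at the top-level fields of one parsed article
print("Fields on one document:", sorted(pages.find_one().keys()))
```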
```bash
# Export the processed data from MongoDB to JSON
mongoexport --db="${LANG}wiki" --collection=pages --out="${LANG}wiki_pages.jsonl"
```
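The export writes one JSON document per line. As a quick check (a sketch, assuming the same `LANG` as above), you can count the exported records and compare the number with the article count from the previous step:

```python
"""Count the exported records; this should match the article count in MongoDB."""
LANG = "da"  # same language code as above

with open(f"{LANG}wiki_pages.jsonl", encoding="utf-8") as f:
    print("Exported records:", sum(1 for _ in f))
```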
Build five JSON files that will be used to construct the expanded Wikipedia dataset:

```bash
python src/scripts/process.py --jsonl-file="${LANG}wiki_pages.jsonl"
```
Build the expanded Wikipedia dataset by expanding links in the articles:

```bash
python src/scripts/build_dataset.py \
    --include-strategy=prepend \
    --max-link-expansions=10 \
    --num-tokens-threshold=15000
```
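For intuition, here is a conceptual sketch of what the flags above suggest the build step does: the plaintext of linked articles is prepended to the source article (`--include-strategy=prepend`), at most 10 links are expanded (`--max-link-expansions=10`), and expansion stops once the document would exceed roughly 15,000 tokens (`--num-tokens-threshold=15000`). This is not the repository's implementation; the field names (`text`, `links`) and the `count_tokens` helper are hypothetical.

```python
"""Conceptual sketch of a "prepend" link-expansion step (not the repository's code)."""
from typing import Callable


def expand_article(
    article: dict,
    articles_by_title: dict[str, dict],
    count_tokens: Callable[[str], int],
    max_link_expansions: int = 10,
    num_tokens_threshold: int = 15_000,
) -> str:
    """Prepend linked articles' text until the link or token budget is hit."""
    expanded = article["text"]
    expansions = 0
    for title in article["links"]:
        if expansions >= max_link_expansions:
            break
        linked = articles_by_title.get(title)
        if linked is None:
            continue  # linked page missing from the dump (e.g. a red link)
        candidate = linked["text"] + "\n\n" + expanded  # the "prepend" strategy
        if count_tokens(candidate) > num_tokens_threshold:
            break
        expanded = candidate
        expansions += 1
    return expanded
```

A rough stand-in for `count_tokens` could be `lambda s: len(s.split())`; the real script presumably uses an actual tokenizer.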
Developer:

- Oliver Kinch ([email protected])
To install new PyPI packages, run:

```bash
uv add <package-name>
```

To remove them again, run:

```bash
uv remove <package-name>
```

To show all installed packages, run:

```bash
uv pip list
```