
Commit 6af7f96
Merge branch 'main' into tolga/WebDatasetUpdates
2 parents: 8582e67 + 63b6759
40 files changed: +1472 −302 lines

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -34,6 +34,7 @@ var/
 *.egg
 .eggs/
 *.egg-info
+build_*/

 # postgresql
 postgres-data/
```

README.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@

<div align="center">
<h1>Mixtera</h1>

---

[![GitHub Workflow Status](https://github.com/eth-easl/mixtera/actions/workflows/workflow.yaml/badge.svg)](https://github.com/eth-easl/mixtera/actions/workflows/workflow.yaml)
[![License](https://img.shields.io/github/license/eth-easl/mixtera)](https://img.shields.io/github/license/eth-easl/mixtera)

Mixtera is an open-source, data-centric training data plane built for modern LLM/VLM training. It enables ML engineers to declaratively filter, mix, and distribute large-scale training datasets on the fly, while supporting dynamic adjustment based on model feedback. Learn more in our [paper](https://mboether.com/assets/pdf/bother2024mixtera.pdf).

</div>
## ⚡️ Quickstart

Mixtera can run as a server or, for single-GPU training, in-process. In both cases, you need to install the dependencies and Mixtera itself into your environment, for example as follows:

```bash
# In case you don't have micromamba yet
# macOS:
brew install micromamba
# alternatively:
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

# Start here if you have micromamba already
micromamba env create -f ./environment.yml
micromamba activate mixtera
pip install -e .
pip install -r dev-requirements.txt
```

The Mixtera server can then be started using the `mixtera-server` command.
## 🔁 What is Mixtera used for?

Modern large language and vision models rely on training datasets with fine-grained properties such as language, source, topic, or license. Traditionally, ML engineers have managed these datasets manually using ad hoc scripts and directory structures, which is time-consuming, tedious, and error-prone. Mixtera addresses these issues with a lightweight, declarative data plane that lets you seamlessly filter and dynamically mix data on the fly, without the overhead of redundant data processing.

Whether you need to enforce fixed data ratios (say, 70% JavaScript code and 30% Python) or want to adjust proportions during training using feedback-driven algorithms like [ADO](https://arxiv.org/abs/2410.11820), Mixtera offers a flexible interface for both static and dynamic mixing. Beyond efficiency, Mixtera ensures that distributed training jobs receive identical, reproducible data inputs across all nodes, which is crucial for consistency and accurate model results.
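To make the static-ratio idea concrete, here is a minimal, self-contained Python sketch of fixed-ratio sampling over two labeled pools. This is an illustration only, not Mixtera's API; the names (`mix_samples`, `pools`, `ratios`) are hypothetical.

```python
import random


def mix_samples(pools, ratios, n, seed=0):
    """Draw n samples from labeled pools according to fixed mixing ratios.

    pools:  dict mapping a domain name (e.g. "JavaScript") to a list of samples
    ratios: dict mapping the same domain names to weights that sum to 1
    """
    rng = random.Random(seed)  # fixed seed, so every node draws the same mixture
    domains = list(ratios)
    weights = [ratios[d] for d in domains]
    # First pick a domain per slot according to the ratios, then a sample from it.
    return [
        (domain, rng.choice(pools[domain]))
        for domain in rng.choices(domains, weights=weights, k=n)
    ]


# 70% JavaScript / 30% Python, as in the example above.
pools = {"JavaScript": ["js-0", "js-1"], "Python": ["py-0", "py-1"]}
mixed = mix_samples(pools, {"JavaScript": 0.7, "Python": 0.3}, n=1000)
js_share = sum(1 for domain, _ in mixed if domain == "JavaScript") / len(mixed)
```

Because the sampler is seeded, repeated runs (e.g., on different training nodes) yield the same sequence, which is the property the reproducibility claim above relies on.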
Mixtera is a centralized sample management layer built on DuckDB. It abstracts away the complexities of file-system-based data management and supports samples stored in various formats (e.g., jsonl, parquet, webdataset), letting users focus on model research rather than data wrangling.
## 🚀 Usage

Using Mixtera typically consists of (1) registering your data and (2) running queries/trainings on top of it. We maintain several [examples](https://github.com/eth-easl/mixtera/blob/main/examples/) of how to use Mixtera. A good first read is the [local-only example](https://github.com/eth-easl/mixtera/blob/main/examples/client_local_example.py), which walks you through the basics of registering data in Mixtera and running a query on it. Afterwards, the [server example](https://github.com/eth-easl/mixtera/blob/main/examples/client_server_example.py) shows you how to run a server with the `mixtera-server` command, and how to register data and query it via client-server interaction.
We provide a [full guide](examples/torchtitan.md) on how to run a training with Mixtera and torchtitan, covering how to run the server, register the dataset, and start training jobs, for both bare-metal and Slurm (e.g., SwissAI/CSCS/Alps/Clariden) deployments.
## ✨ Mixtera’s System Overview

<div align="center">
<img src="img/system.png" height=300 alt="Mixtera system design"/>
</div>

Mixtera follows a server-client model. During training, the server runs on one node, and each training node runs client instances. The query is executed at the server in two phases. First, Mixtera applies the static filters from the query (e.g., English-only) to obtain all samples we could train on. This gives us a [QueryResult](https://github.com/eth-easl/mixtera/blob/main/mixtera/core/query/query_result.py). Second, during training, the server distributes [chunks](https://github.com/eth-easl/mixtera/blob/main/mixtera/core/query/result_chunk.py) of that query result to the client(s). A chunk is a collection of pointers to samples in files. These pointers tell the receiving client which samples in which file to load (e.g., sample 10 in file `wikipedia.jsonl.zst`).
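As a rough mental model, resolving such a chunk of pointers against jsonl files could look like the sketch below. This is not Mixtera's actual `ResultChunk` implementation; the `load_chunk` helper and the dict-of-row-indices layout are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_chunk(chunk: dict[str, list[int]], data_dir: Path) -> list[str]:
    """Resolve a chunk of (file, row-index) pointers into sample payloads."""
    samples = []
    for filename, indices in chunk.items():
        lines = (data_dir / filename).read_text().splitlines()
        wanted = set(indices)
        # Keep only the rows the chunk points at, in file order.
        samples.extend(
            json.loads(line)["text"] for i, line in enumerate(lines) if i in wanted
        )
    return samples


# Demo: a tiny jsonl file with five samples; the chunk points at rows 1 and 3.
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "wikipedia.jsonl"
    path.write_text("\n".join(json.dumps({"text": f"sample-{i}"}) for i in range(5)))
    samples = load_chunk({"wikipedia.jsonl": [1, 3]}, Path(d))
```

Because a chunk carries only pointers, the server never ships sample payloads itself; each client reads the referenced rows directly from shared storage.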
## ✉️ About

Mixtera is being developed at the [Efficient Architectures and Systems Lab (EASL)](https://anakli.inf.ethz.ch/#Group) at the [ETH Zurich Systems Group](https://systems.ethz.ch/). Please reach out to `mboether [at] inf [dot] ethz [dot] ch` or open an issue on GitHub if you have any questions or inquiries related to Mixtera and its usage.

cmake/dependencies.cmake

Lines changed: 16 additions & 15 deletions
```diff
@@ -45,21 +45,6 @@ FetchContent_Declare(
 FetchContent_MakeAvailable(indicators)
 target_compile_options(indicators INTERFACE -Wno-zero-as-null-pointer-constant -Wno-sign-compare)

-################### abseil ####################
-
-message(STATUS "Making abseil available.")
-
-FetchContent_Declare(
-  absl
-  GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
-  GIT_TAG 20240722.0
-)
-FetchContent_MakeAvailable(absl)
-
-# Required for GCC
-target_compile_options(absl_flat_hash_map INTERFACE -Wno-pedantic)
-target_compile_options(absl_base INTERFACE -Wno-pedantic)

 ################### Arrow ####################

@@ -104,3 +89,19 @@ else()
 endif()

 target_compile_options(Arrow::arrow_shared INTERFACE -Wno-redundant-move)
+
+################### abseil ####################
+
+# Abseil needs to be loaded after arrow, otherwise we run into issues on the alps/clariden cluster.
+message(STATUS "Making abseil available.")
+
+FetchContent_Declare(
+  absl
+  GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
+  GIT_TAG 20240722.0
+)
+FetchContent_MakeAvailable(absl)
+
+# Required for GCC
+target_compile_options(absl_flat_hash_map INTERFACE -Wno-pedantic)
+target_compile_options(absl_base INTERFACE -Wno-pedantic)
```

examples/clariden/Dockerfile

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

```dockerfile
FROM nvcr.io/nvidia/pytorch:25.01-py3

RUN apt-get update && apt-get upgrade -y && apt-get install ca-certificates lsb-release wget python3-pip neovim autoconf build-essential gdb software-properties-common curl unzip cmake gzip protobuf-compiler libtool zstd liblz4-dev lz4 -y

RUN wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
RUN apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
RUN apt update
RUN apt install -y -V libparquet-glib-dev libparquet-dev libarrow-dataset-glib-dev libarrow-dataset-dev libarrow-glib-dev libarrow-dev

RUN pip install pip==24.*

# If you encounter pyarrow issues, ensure the version here matches the version downloaded above!!
RUN pip install tqdm loguru psutil numpy==1.26.4 dill datasets transformers pyarrow==19.* xxhash xopen scipy tenacity
RUN pip install duckdb polars==1.15 pillow pybind11 pytest flake8 mypy pylint autopep8 isort black tensorboard tiktoken blobfile tabulate wandb torchdata>=0.8.0 tomli>=1.1.0 dacite pyyaml packaging safetensors sentencepiece jupyter seaborn webdataset lz4 git+https://github.com/tmbdev/[email protected] mosaicml-streaming grain
RUN pip install lm_eval typer # for evaluation

# Test torch nightly
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

RUN git clone --recurse-submodules -b v1.64.3 --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
    cd grpc && mkdir -p cmake/build && cd cmake/build && \
    cmake -DgRPC_PROTOBUF_PROVIDER=module -DABSL_ENABLE_INSTALL=On -DgRPC_BUILD_CSHARP_EXT=Off -DABSL_BUILD_TESTING=Off -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release ../.. && \
    make -j64 && make install && cd ../../

RUN bash -c "cp /usr/local/lib/libutf8* /usr/lib"

## For nanotron
RUN pip uninstall -y ninja && pip install ninja
RUN MAX_JOBS=12 numactl --membind=0-3 pip install flash-attn --no-build-isolation
```

examples/client_local_example.py

Lines changed: 53 additions & 19 deletions
```diff
@@ -47,17 +47,31 @@ class TestMetadataParser(MetadataParser):
     def get_properties(cls) -> list[MetadataProperty]:
         return [
             MetadataProperty(
-                name="language", dtype="ENUM", multiple=False, nullable=False, enum_options={"JavaScript", "HTML"}
+                name="language",
+                dtype="ENUM",
+                multiple=False,
+                nullable=False,
+                enum_options={"JavaScript", "HTML"},
             ),
             MetadataProperty(
-                name="license", dtype="STRING", multiple=False, nullable=False, enum_options={"CC", "MIT"}
+                name="license",
+                dtype="STRING",
+                multiple=False,
+                nullable=False,
+                enum_options={"CC", "MIT"},
             ),  # Could be ENUM but we are using string to test
             MetadataProperty(
-                name="doublelanguage", dtype="ENUM", multiple=True, nullable=False, enum_options={"JavaScript", "HTML"}
+                name="doublelanguage",
+                dtype="ENUM",
+                multiple=True,
+                nullable=False,
+                enum_options={"JavaScript", "HTML"},
             ),
         ]

-    def parse(self, line_number: int, payload: Any, **kwargs: Optional[dict[Any, Any]]) -> None:
+    def parse(
+        self, line_number: int, payload: Any, **kwargs: Optional[dict[Any, Any]]
+    ) -> None:
         metadata = payload["meta"]
         self.add_metadata(
             sample_id=line_number,
@@ -69,49 +83,69 @@ def parse(self, line_number: int, payload: Any, **kwargs: Optional[dict[Any, Any

 def parsing_func(sample):
     import json
+
     return json.loads(sample)["text"]

+
 def setup_local_client(directory: Path):
     # Writing JSONL data to the directory, which simulates the dataset.
     write_jsonl(directory / "testd.jsonl")

     # Instantiating a client from a local directory to interact with the datasets locally.
     client = MixteraClient.from_directory(directory)

     # Register the metadata parser.
     client.register_metadata_parser("TEST_PARSER", TestMetadataParser)

     # Registering the dataset with the client.
-    client.register_dataset(
-        "local_integrationtest_dataset", directory / "testd.jsonl", JSONLDataset, parsing_func, "TEST_PARSER"
-    )
-
+    if not client.register_dataset(
+        "local_integrationtest_dataset",
+        directory / "testd.jsonl",
+        JSONLDataset,
+        parsing_func,
+        "TEST_PARSER",
+    ):
+        raise RuntimeError("Error while registering dataset!")
+
     return client

+
 def run_query(client: MixteraClient, chunk_size: int):
-    job_id = str(round(time.time() * 1000))  # Get some job ID based on current timestamp
-    query = Query.for_job(job_id).select(("language", "==", "JavaScript"))  # In our example, we want to query all samples tagged JavaScript
+    job_id = str(
+        round(time.time() * 1000)
+    )  # Get some job ID based on current timestamp
+    query = Query.for_job(job_id).select(
+        ("language", "==", "JavaScript")
+    )  # In our example, we want to query all samples tagged JavaScript

     mixture = ArbitraryMixture(chunk_size=chunk_size)
     qea = QueryExecutionArgs(mixture=mixture)
     client.execute_query(query, qea)
+    client.wait_for_execution(job_id)

     rsa = ResultStreamingArgs(job_id=job_id)
     result_samples = list(client.stream_results(rsa))

     # Checking the number of results and their validity.
-    assert len(result_samples) == 500, f"Got {len(result_samples)} samples instead of the expected 500!"
-    for _, sample in result_samples:  # The first argument is the index in the current chunk, needed for state recovery
+    assert (
+        len(result_samples) == 500
+    ), f"Got {len(result_samples)} samples instead of the expected 500!"
+    for (
+        _,
+        _,
+        sample,
+    ) in result_samples:  # The first argument is the index in the current chunk, needed for state recovery. The second argument is the domain id.
         assert int(sample) % 2 == 0, f"Sample {sample} should not appear for JavaScript"

+
 def main():
     with tempfile.TemporaryDirectory() as temp_dir:
         # Setup the local client with a temporary directory.
         # This also populates the database with a dummy dataset, where 50% of data is tagged HTML and 50% is tagged JavaScript.
         client = setup_local_client(Path(temp_dir))
-        chunk_size = 42 # Size of the result chunks of the query
+        chunk_size = 42  # Size of the result chunks of the query
         run_query(client, chunk_size)

 if __name__ == "__main__":
-    main()
+    main()
```
