⚡ arnio

The C++-fueled pre-processor for Pandas.
Stop wasting time writing ad-hoc cleaning scripts for messy CSVs.

The Problem • The Solution • Benchmarks • Quickstart

Pandas is incredible for analysis, but it can be slow and memory-hungry when ingesting and cleaning raw CSVs.
Arnio exists to do exactly one thing: intercept your messy CSVs, clean them natively in C++, and hand you a pristine Pandas DataFrame, in roughly half the time on the benchmark below.

🧨 The Problem

Every data project starts the same way. You load a CSV. It exhausts your RAM. You load it again in chunks. You find random nulls, inconsistent capitalization, and trailing whitespace. You write a 15-line script chaining .apply(), .dropna(), and .str.strip(). You copy-paste this script into your next five Jupyter notebooks.

It's slow. It's unreadable. It's error-prone.

✨ The Solution: Arnio

Arnio replaces your messy ingestion script with a high-performance, declarative pipeline powered by pybind11 and C++.

❌ The Old Way (Pandas)	⚡ The Arnio Way
Memory Spikes: Python loads the entire raw string file before casting.	C++ Native: Parses and infers types directly into columnar memory.
Spaghetti Code: `.apply()` lambda functions scattered across cells.	Declarative: A strict, readable list of cleaning steps.
Slow Execution: Python loops over strings to strip whitespace.	Blazing Fast: Cleaning primitives run as compiled C++ at near-metal speed.

🚀 Getting Started

If you have Python 3.9+, you are 5 seconds away from faster data pipelines.

pip install arnio

The 3-Step Workflow

Drop Arnio into the very top of your Jupyter Notebook or Python script.

import arnio as ar

# 1. Load the raw file using the C++ core (no Python overhead)
frame = ar.read_csv("messy_sales_data.csv")

# 2. Define a strict, readable cleaning pipeline
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 3. Export to a clean pandas DataFrame and start your analysis!
df = ar.to_pandas(clean_frame)

# -> Now, use `df` exactly like you always have.
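
To make the step-list idea concrete, here is a minimal pure-Python sketch of how a list of `(name, options)` tuples could be dispatched to cleaning functions. It is not Arnio's implementation, which runs in C++. `run_pipeline`, `STEPS`, and the list-of-dicts row format are assumptions made for illustration, and `None` stands in for a null value.

```python
from typing import Any, Optional

Row = dict[str, Any]  # hypothetical row format: column name -> value


def strip_whitespace(rows: list[Row]) -> list[Row]:
    # Trim leading/trailing whitespace from every string value.
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]


def normalize_case(rows: list[Row], case_type: str = "lower") -> list[Row]:
    if case_type not in ("lower", "upper"):
        raise ValueError(f"unsupported case_type: {case_type!r}")
    convert = str.lower if case_type == "lower" else str.upper
    return [{k: convert(v) if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]


def fill_nulls(rows: list[Row], value: Any,
               subset: Optional[list[str]] = None) -> list[Row]:
    # Replace None with `value`, optionally only in the listed columns.
    return [{k: value if v is None and (subset is None or k in subset) else v
             for k, v in r.items()} for r in rows]


def drop_nulls(rows: list[Row]) -> list[Row]:
    # Keep only rows that have no None values left.
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_duplicates(rows: list[Row]) -> list[Row]:
    # Exact-match deduplication that keeps the first occurrence.
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))  # assumes hashable values
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Registry mapping step names to functions (hypothetical, for this sketch only).
STEPS = {f.__name__: f for f in (strip_whitespace, normalize_case,
                                 fill_nulls, drop_nulls, drop_duplicates)}


def run_pipeline(rows: list[Row], steps: list[tuple]) -> list[Row]:
    # Apply each step in order; a step is ("name",) or ("name", {options}).
    for name, *rest in steps:
        options = rest[0] if rest else {}
        rows = STEPS[name](rows, **options)
    return rows


raw = [
    {"region": "  North ", "revenue": 10.0},
    {"region": "north", "revenue": 10.0},
    {"region": "SOUTH", "revenue": None},
    {"region": None, "revenue": 5.0},
]
cleaned = run_pipeline(raw, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])
print(cleaned)
```

Step order matters in this sketch: filling `revenue` before `drop_nulls` keeps the "south" row, and normalizing case before `drop_duplicates` lets the two "north" rows collapse into one.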

🏎️ Benchmarks: Arnio vs Pandas

Arnio isn't just cleaner to write. In the benchmark below it also ran about twice as fast and used less memory.

Tested on a 1-million-row CSV (12 columns, mixed types, dirty strings) on an M2 MacBook Pro.

Metric	`pandas.read_csv` + cleaning	`arnio.pipeline`	Improvement
Execution Time	`4.24 seconds`	`2.11 seconds`	🔥 2x Faster
Peak Memory	`620 MB`	`380 MB`	📉 ~39% Less RAM
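
If you want to run a similar comparison on your own data, a small harness along these lines may help. It is a sketch using only the standard library, not the script that produced the numbers above. `time_call`, `peak_python_memory`, and `build_rows` are hypothetical names. Note that `tracemalloc` only sees memory allocated through Python's allocator, so allocations made inside a C++ extension would not be counted; a process-level measure such as resident set size would be needed for that.

```python
import time
import tracemalloc
from math import inf


def time_call(fn, *args, repeats=3):
    # Best-of-N wall-clock time, measured without memory tracing enabled.
    best = inf
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def peak_python_memory(fn, *args):
    # Peak bytes allocated via Python's allocator while fn runs.
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_rows(n):
    # Stand-in workload: replace with your own load-and-clean function.
    return [f"  Row-{i}  ".strip().lower() for i in range(n)]


best_seconds = time_call(build_rows, 10_000)
peak_bytes = peak_python_memory(build_rows, 10_000)
print(f"best of 3: {best_seconds:.4f} s, "
      f"peak: {peak_bytes / 1e6:.2f} MB (Python allocations only)")
```

Timing and memory are measured in separate runs because tracing allocations slows the code being measured.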

🔍 Want to peek at a massive file without loading it?

Arnio can scan a massive CSV and infer its schema without loading the data into memory.

import arnio as ar

schema = ar.scan_csv("100GB_file.csv")
print(schema)
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}
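
For intuition, schema inference without a full load can be approximated by reading only a bounded number of rows and testing which type every sampled value fits. The sketch below uses the standard `csv` module and is not how `scan_csv` is implemented; whether Arnio samples or streams the whole file is not stated here. `scan_schema` and `infer_type` are hypothetical helpers, and the `FLOAT64` label is an assumption that follows the naming of the types shown above.

```python
import csv
import io
from itertools import islice


def _parses(caster, text):
    # True if caster(text) succeeds, e.g. int("42").
    try:
        caster(text)
        return True
    except ValueError:
        return False


def infer_type(values):
    # Pick the narrowest type that fits every non-empty sampled value.
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "BOOL"
    if all(_parses(int, v) for v in non_empty):
        return "INT64"
    if all(_parses(float, v) for v in non_empty):
        return "FLOAT64"  # assumed label, not taken from the output above
    return "STRING"


def scan_schema(stream, sample_rows=1000):
    # Read the header plus at most `sample_rows` rows, so memory stays bounded.
    reader = csv.reader(stream)
    header = next(reader)
    columns = [[] for _ in header]
    for row in islice(reader, sample_rows):
        for column, value in zip(columns, row):
            column.append(value.strip())
    return {name: infer_type(vals) for name, vals in zip(header, columns)}


sample = io.StringIO(
    "id,name,is_active,score\n"
    "1,Ada,true,9.5\n"
    "2,Grace,false,7\n"
    "3,,true,\n"
)
schema = scan_schema(sample)
print(schema)
```

A sample-based scan trades accuracy for speed: a column that looks like integers in the first rows may contain text further down, so the result is best treated as a hint rather than a guarantee.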

🛠️ What's Inside?

Arnio ships with a growing library of hyper-optimized C++ cleaning primitives:

drop_nulls: Rip out bad rows instantly.
fill_nulls: Patch holes with scalar values.
drop_duplicates: Deduplicate rows based on exact matches.
strip_whitespace: Trim invisible spaces from string columns.
normalize_case: Force upper or lower case instantly.
rename_columns & cast_types: Shape your data exactly how you need it.

🤝 Join the Movement

We are actively looking for contributors! Arnio is a hybrid Python/C++ project, making it the perfect playground if you want to learn pybind11, columnar memory formats, or high-performance Python.

git clone https://github.com/im-anishraj/arnio.git
cd arnio
pip install -e ".[dev]"
pytest tests/ -v

Have a feature request? Want a new cleaning primitive? Drop an issue in the repo!

Stop fighting your data. Let Arnio clean it.
