Falcon: GPU-Based Floating-point Adaptive Lossless Compression

Falcon is a GPU-accelerated lossless compression framework designed for floating-point time series data. It combines a high compression ratio with high throughput on modern GPU architectures through three key techniques: an asynchronous pipeline, precise float-to-integer conversion, and adaptive sparse bit-plane encoding.

📊 Performance Highlights

Compression Ratio: Average 0.299, lower is better (21% improvement over best CPU competitors)
Compression Throughput: Average 10.82 GB/s (2.43× faster than fastest GPU competitors)
Decompression Throughput: Average 12.32 GB/s (2.4× faster than fastest GPU competitors)

🚀 Key Features

🎯 Asynchronous Pipeline

Event-Driven Scheduler: Hides I/O latency during CPU-GPU data transmission
Multi-stream Processing: Supports up to 16 concurrent streams
Bidirectional PCIe Utilization: Overlaps H2D and D2H communications
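
To illustrate why overlapping H2D copies, kernels, and D2H copies helps, here is a minimal host-side timing model. It is only a sketch, not Falcon's scheduler: `StageTimes`, `sequential_makespan`, and `pipelined_makespan` are hypothetical names, and the model assumes each of the three resources handles one chunk at a time with fixed stage durations.

```cpp
#include <algorithm>
#include <cstddef>

// Stage durations for one chunk, in arbitrary time units (illustrative only).
struct StageTimes {
    double h2d;     // host-to-device copy
    double kernel;  // compression kernel
    double d2h;     // device-to-host copy
};

// Sequential baseline: a chunk finishes all three stages before the next starts.
double sequential_makespan(std::size_t chunks, StageTimes t) {
    return static_cast<double>(chunks) * (t.h2d + t.kernel + t.d2h);
}

// Pipelined model: the H2D engine, the compute units, and the D2H engine each
// process one chunk at a time, but different chunks may occupy different
// stages at the same moment.
double pipelined_makespan(std::size_t chunks, StageTimes t) {
    double h2d_done = 0.0, kernel_done = 0.0, d2h_done = 0.0;
    for (std::size_t i = 0; i < chunks; ++i) {
        h2d_done += t.h2d;                                         // copies are serialized
        kernel_done = std::max(h2d_done, kernel_done) + t.kernel;  // needs its input on the device
        d2h_done = std::max(kernel_done, d2h_done) + t.d2h;        // needs the kernel result
    }
    return d2h_done;
}
```

With four chunks and stage times of 1, 2, and 1 units, this model gives 16 units for sequential execution and 10 units when the stages overlap.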

🔢 Precision-Preserving Conversion

Theoretical Guarantees: Eliminates floating-point arithmetic errors
Adaptive Digit Transformation: Handles both normal (β≤15, α≤22) and exceptional cases
Lossless Recovery: Exact reconstruction of original floating-point values
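
The idea of a lossless float-to-integer conversion can be sketched on the host as follows. This is not Falcon's algorithm: it is a simple search that verifies each candidate by an exact round trip, and it reads α as the number of decimal places, which is an assumption. The names `DecimalInt`, `to_decimal_int`, and `from_decimal_int` are hypothetical. The bound of 22 is used because 10^22 is the largest power of ten that a double represents exactly.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical result type: v == digits / 10^alpha.
struct DecimalInt {
    std::int64_t digits;
    int alpha;  // number of decimal places (assumed meaning of alpha)
};

// True when a and b have identical bit patterns (so -0.0 differs from 0.0).
inline bool same_bits(double a, double b) {
    std::uint64_t x, y;
    std::memcpy(&x, &a, sizeof x);
    std::memcpy(&y, &b, sizeof y);
    return x == y;
}

// Finds the smallest alpha in [0, 22] for which the conversion round-trips
// bit-exactly. Returns false for values that need another path (the
// "exceptional case" in this sketch), e.g. NaN, infinity, or -0.0.
bool to_decimal_int(double v, DecimalInt& out) {
    if (!std::isfinite(v)) return false;
    const double kMaxExact = 9007199254740992.0;  // 2^53
    double pow10 = 1.0;  // 10^alpha, exact for alpha <= 22
    for (int alpha = 0; alpha <= 22; ++alpha, pow10 *= 10.0) {
        double scaled = v * pow10;
        if (std::fabs(scaled) >= kMaxExact) break;  // integer no longer exact in a double
        std::int64_t n = std::llround(scaled);
        if (same_bits(static_cast<double>(n) / pow10, v)) {
            out = DecimalInt{n, alpha};
            return true;
        }
    }
    return false;
}

// Inverse of to_decimal_int: reconstructs the original double.
double from_decimal_int(DecimalInt d) {
    double pow10 = 1.0;
    for (int i = 0; i < d.alpha; ++i) pow10 *= 10.0;
    return static_cast<double>(d.digits) / pow10;
}
```

For example, 3.14 maps to digits 314 with alpha 2, while `0.1 + 0.2` has no 16-digit decimal form and falls through to the exceptional path in this sketch.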

🎚️ Adaptive Sparse Bit-Plane Encoding

Dual Storage Schemes: Sparse storage for zero-dominated planes, dense storage for others
Outlier Resilience: Mitigates sparsity degradation caused by anomalies
Warp Divergence Minimization: Optimized for GPU parallel execution
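
The choice between sparse and dense storage per bit-plane can be sketched as below. The byte layout is an assumption made for illustration (a flag bitmap followed by the non-zero bytes), not Falcon's on-disk format, and `EncodedPlane`, `extract_plane`, `encode_plane`, and `decode_plane` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one encoded bit-plane.
struct EncodedPlane {
    bool sparse;
    // dense: raw plane bytes; sparse: flag bitmap followed by the non-zero bytes
    std::vector<std::uint8_t> bytes;
};

// Packs bit `bit` of every value into a byte array (value i -> bit i%8 of byte i/8).
std::vector<std::uint8_t> extract_plane(const std::vector<std::uint64_t>& values, unsigned bit) {
    std::vector<std::uint8_t> plane((values.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < values.size(); ++i)
        if ((values[i] >> bit) & 1u) plane[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    return plane;
}

// Uses sparse storage only when it is strictly smaller than the dense plane.
EncodedPlane encode_plane(const std::vector<std::uint8_t>& plane) {
    const std::size_t flag_bytes = (plane.size() + 7) / 8;
    std::size_t nonzero = 0;
    for (std::uint8_t b : plane) nonzero += (b != 0);

    EncodedPlane out;
    out.sparse = flag_bytes + nonzero < plane.size();
    if (!out.sparse) { out.bytes = plane; return out; }

    out.bytes.assign(flag_bytes, 0);
    for (std::size_t i = 0; i < plane.size(); ++i) {
        if (plane[i] != 0) {
            out.bytes[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));  // mark byte i as present
            out.bytes.push_back(plane[i]);                                 // append its payload
        }
    }
    return out;
}

// Rebuilds the original plane; plane_size is the dense size in bytes.
std::vector<std::uint8_t> decode_plane(const EncodedPlane& enc, std::size_t plane_size) {
    if (!enc.sparse) return enc.bytes;
    std::vector<std::uint8_t> plane(plane_size, 0);
    std::size_t next = (plane_size + 7) / 8;  // payload starts after the flag bitmap
    for (std::size_t i = 0; i < plane_size; ++i)
        if (enc.bytes[i / 8] & (1u << (i % 8))) plane[i] = enc.bytes[next++];
    return plane;
}
```

In this layout a single outlier that sets a high bit costs one flag byte plus one payload byte for that plane, rather than forcing the whole plane into dense storage.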

🛠️ Prerequisites

Verified Environments

Base Environment 1 (WSL2)

OS: Ubuntu 22.04.5 LTS
Compiler: g++ 11.4
Build System: CMake 3.22.1
CUDA: nvcc 12.8/11.6
GPU: NVIDIA GeForce RTX 3050

Base Environment 2 (Native Ubuntu)

OS: Ubuntu 24.04.2 LTS
Compiler: g++ 11.4
Build System: CMake 3.28.1
CUDA: nvcc 12.0
GPU: NVIDIA GeForce RTX 5080

Required Dependencies

Essential Build Tools

# For Ubuntu 22.04/24.04
sudo apt update && sudo apt upgrade
sudo apt install -y git build-essential

CMake Installation

# Ubuntu 22.04 (CMake 3.22)
sudo apt install -y cmake

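# Optional: install a newer CMake from the Kitware APT repository.
# The 'jammy' suite below targets Ubuntu 22.04; Ubuntu 24.04 already provides CMake 3.28 through apt.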
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | sudo apt-key add -
sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ jammy main'
sudo apt update
sudo apt install -y cmake

CUDA Toolkit Installation

# For CUDA 12.x (compatible with RTX 3050/5080)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-0

# For CUDA 11.x (if needed for compatibility)
sudo apt install -y cuda-toolkit-11-8

Required Libraries

# Boost (program_options component)
sudo apt install -y libboost-all-dev

# Google Test (GTest)
sudo apt install -y libgtest-dev
cd /usr/src/gtest
sudo cmake .
sudo make
sudo cp lib/*.a /usr/lib

# Google Benchmark
sudo apt install -y libbenchmark-dev

# NVIDIA nvcomp (for baseline comparisons)
sudo apt-get -y install nvcomp-cuda-11
# or
sudo apt-get -y install nvcomp-cuda-12

Environment Verification

# Check compiler versions
g++ --version
cmake --version
nvcc --version

# Verify CUDA installation
nvidia-smi

🏗️ Code Architecture

Header Files Structure

GPU Base Version (1025 elements per thread)

Falcon_compressor.cuh - Optimized GPU compressor (1 thread processes 1025 elements)
Falcon_decompressor.cuh - Optimized GPU decompressor (1 thread processes 1025 elements)

GPU Single Precision Version

Falcon_float_compressor.cuh - Single precision floating-point GPU compressor
Falcon_float_decompressor.cuh - Single precision floating-point GPU decompressor

GPU Pipeline Version

Falcon_pipeline.cuh - Pipeline implementation with ablation interfaces
Falcon_float_pipeline.cuh - Single precision floating-point pipeline implementation

Source Implementation

src/
├── gpu/           # GPU kernel implementations
└── utils/         # Bit stream utilities and helper functions

Parallelism Design

Chunk Size: 1025 elements per GPU thread
Thread Mapping: Each thread processes one complete chunk
Warp Efficiency: Optimized for 32-thread warp execution
Memory Access: Coalesced global memory access patterns
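
The thread-to-chunk mapping described above can be written down as plain index arithmetic. This is a host-side sketch under the stated one-chunk-per-thread rule; `ChunkRange`, `chunk_count`, and `chunk_for_thread` are hypothetical names, and handling of a short final chunk is an assumption.

```cpp
#include <algorithm>
#include <cstddef>

const std::size_t kChunkSize = 1025;  // elements per GPU thread, as stated above

// Half-open element range [begin, end) owned by one thread.
struct ChunkRange {
    std::size_t begin;
    std::size_t end;
};

// Number of chunks (and therefore threads) needed for n elements.
std::size_t chunk_count(std::size_t n) {
    return (n + kChunkSize - 1) / kChunkSize;
}

// Range processed by thread `thread_id`; the last chunk may be shorter, and
// threads past the end of the data receive an empty range.
ChunkRange chunk_for_thread(std::size_t thread_id, std::size_t n) {
    std::size_t begin = std::min(thread_id * kChunkSize, n);
    std::size_t end = std::min(begin + kChunkSize, n);
    return ChunkRange{begin, end};
}
```

For 2051 elements this yields three chunks, the last of which covers the single element at index 2050.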

🔨 Building

Quick Build Script

#!/bin/bash
set -x
mkdir -p build
cd build
cmake ..
make -j$(nproc)

Manual Building

Clone the repository:
```
git clone <repository-url>
cd Falcon
```

Generate the CMake build system:

cmake -S . -B ./build -DCMAKE_BUILD_TYPE=Release

Build all targets:

cmake --build ./build --config Release -j$(nproc)

🧪 Testing

Test Structure

test/
├── baseline/          # Comparison algorithms (ALP, ndzip, elf, etc.)
├── data/             # Test datasets
├── Falcon_test_*.cu  # Main GPU test suites
└── test_*.cpp/cu     # Specific algorithm tests

Running Tests

Basic Usage for All Tests

./test/test_${test_name} --dir ../test/data/use/

Benchmark Tests (vs Baselines)

# Main GPU implementation 
./test/test_gpu --dir ../test/data/use/

# GPU without packing optimization
./test/test_gpu_nopack --dir ../test/data/use/

# GPU with bit-reduction optimization
./test/test_gpu_br --dir ../test/data/use/

# GPU with sparse optimization
./test/test_gpu_spare --dir ../test/data/use/

Multi-stream Performance Tests

# Multi-stream with 3-step blocking
./test/test_muti_3step_block --dir ../test/data/use/

# Multi-stream with 3-step non-blocking
./test/test_muti_3step_noblock --dir ../test/data/use/

# Optimized multi-stream
./test/test_muti_stream --dir ../test/data/use/

Ablation Studies

Encoding Strategy Ablation

Full Sparse: All bit-planes use sparse storage
Full Dense: All bit-planes use dense storage
Brute-force Error: Inaccurate decimal place calculation
Standard: Adaptive sparse/dense selection (default)

Pipeline Ablation

Single-stream: Sequential processing
Blocking: Synchronous multi-stream
Non-blocking: Asynchronous multi-stream
Standard: Event-driven scheduler (default)

Complete Test Script

#!/bin/bash
set -x
cd Falcon
mkdir -p build
cd build

# Compile project
cmake ..
make -j$(nproc)

# Run all tests
run_test() {
    local test_name=$1
    echo "===== Running ${test_name} ====="
    ./test/test_${test_name} --dir ../test/data/use/
}

# Core GPU tests
run_test "gpu"
run_test "gpu_nopack"
run_test "gpu_br"
run_test "gpu_spare"

# Multi-stream tests
run_test "muti_3step_block"
run_test "muti_3step_noblock"
run_test "muti_stream_opt"

📊 Experimental Results

Compression Ratio Comparison

| Method | Average Ratio | Relative to Falcon |
|--------|---------------|--------------------|
| Falcon | 0.299 | - |
| ALP | 0.329 | 10.0% worse |
| Elf* | 0.339 | 13.4% worse |
| Elf | 0.380 | 27.1% worse |
| ndzip | 0.996 | 233% worse |

Throughput Performance

| Operation | Falcon | Best Competitor | Speedup |
|-----------|--------|-----------------|---------|
| Compression | 10.82 GB/s | 4.46 GB/s (GDeflate) | 2.43× |
| Decompression | 12.32 GB/s | 5.13 GB/s (GPU:Elf*) | 2.4× |
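
The derived columns in the two tables above can be reproduced with the small helpers below. They assume that the compression ratio is compressed size divided by original size (so lower is better) and that "worse" is measured relative to Falcon's ratio; the function names are hypothetical.

```cpp
#include <cmath>

// Assumed definition: compressed size / original size (lower is better).
double compression_ratio(double compressed_bytes, double original_bytes) {
    return compressed_bytes / original_bytes;
}

// How much larger `other` is than `baseline`, as a fraction of `baseline`.
double relative_increase(double other, double baseline) {
    return (other - baseline) / baseline;
}

// Throughput speedup of `ours` over `theirs` (both in the same unit, e.g. GB/s).
double speedup(double ours, double theirs) {
    return ours / theirs;
}
```

For instance, `relative_increase(0.380, 0.299)` is about 0.271, matching the 27.1% listed for Elf, and `speedup(10.82, 4.46)` is about 2.43.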

🔧 Configuration

Default Parameters

Chunk Size: 1025 elements per thread
Batch Size: 1025 × 1024 × 4 elements
Pipeline Streams: 16
GPU Architecture: Compute Capability 7.0+
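
The default sizes above imply the following derived quantities. This is only arithmetic on the listed defaults; the constant names are hypothetical, and the byte counts assume 8-byte doubles and 4-byte floats.

```cpp
#include <cstddef>

const std::size_t kChunkElems = 1025;             // elements per thread
const std::size_t kBatchElems = 1025 * 1024 * 4;  // elements per batch
const std::size_t kChunksPerBatch = kBatchElems / kChunkElems;
const std::size_t kBatchBytesF64 = kBatchElems * sizeof(double);
const std::size_t kBatchBytesF32 = kBatchElems * sizeof(float);

static_assert(kBatchElems == 4198400, "1025 * 1024 * 4 elements per batch");
static_assert(kChunksPerBatch == 4096, "one thread per chunk gives 4096 threads per batch");

// Number of batches needed to cover n elements (the last batch may be partial).
std::size_t batch_count(std::size_t n) {
    return (n + kBatchElems - 1) / kBatchElems;
}
```

With one chunk per thread, a full batch therefore corresponds to 4096 threads, and a batch of doubles occupies 33,587,200 bytes.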

Chunk Size Considerations

1025 elements: Optimized for memory space utilization
Thread Mapping: Each GPU thread processes exactly one chunk

Build Options

-DCMAKE_BUILD_TYPE=Release for optimized performance
-DCMAKE_CUDA_ARCHITECTURES=70 for specific GPU architecture

📚 Citation

If you use Falcon in your research, please cite:

@article{falcon2025,
  title={Falcon: GPU-Based Floating-point Adaptive Lossless Compression},
  author={Li, Zheng and Wang, Weiyan and Li, Ruiyuan and Chen, Chao and Long, Xianlei and Zheng, Linjiang and Xu, Quanqing and Yang, Chuanhui},
  journal={PVLDB},
  volume={14},
  number={1},
  pages={XXX--XXX},
  year={2025},
  publisher={VLDB Endowment}
}

👥 Authors

Zheng Li (Chongqing University)
Weiyan Wang (Chongqing University)
Ruiyuan Li (Chongqing University)
Chao Chen (Chongqing University)
Xianlei Long (Chongqing University)
Linjiang Zheng (Chongqing University)
Quanqing Xu (OceanBase, Ant Group)
Chuanhui Yang (OceanBase, Ant Group)

📄 License

This project is available for academic and research use. Please refer to the specific license terms in the repository.

🔗 Related Publications

Note: This project has been verified to work on both WSL2 (Ubuntu 22.04) and native Ubuntu 24.04 environments with the specified dependencies. For questions about specific implementations or performance characteristics, please refer to the corresponding header files and test cases.
