Falcon is a GPU-accelerated lossless compression framework designed for floating-point time series data. It improves both compression ratio and throughput over existing CPU and GPU compressors through three key innovations: an asynchronous pipeline, precise float-to-integer conversion, and adaptive sparse bit-plane encoding.
- Compression Ratio: Average 0.299 (compressed size relative to original size, lower is better; 21% improvement over the best CPU competitor)
- Compression Throughput: Average 10.82 GB/s (2.43× faster than the fastest GPU competitor)
- Decompression Throughput: Average 12.32 GB/s (2.4× faster than the fastest GPU competitor)
- Event-Driven Scheduler: Hides I/O latency during CPU-GPU data transmission
- Multi-stream Processing: Supports up to 16 concurrent streams
- Bidirectional PCIe Utilization: Overlaps H2D and D2H communications
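The benefit of overlapping transfers with computation, as the pipeline bullets above describe, can be illustrated with a small timing model. This is a minimal sketch and not Falcon's scheduler: `StageCost`, `serial_makespan`, and `pipelined_makespan` are hypothetical names, and the model assumes one H2D copy engine, one compute engine, and one D2H copy engine that each process batches in order, ignoring the stream-count limit.

```cpp
#include <algorithm>
#include <cstddef>

// Per-batch cost of each stage, in arbitrary time units (hypothetical model).
struct StageCost {
    double h2d;     // host-to-device copy
    double kernel;  // GPU compression or decompression kernel
    double d2h;     // device-to-host copy
};

// No overlap: a batch finishes all three stages before the next one starts.
double serial_makespan(std::size_t batches, StageCost c) {
    return static_cast<double>(batches) * (c.h2d + c.kernel + c.d2h);
}

// Full overlap: the three engines run concurrently, but each engine
// handles one batch at a time and a stage waits for the previous stage
// of the same batch.
double pipelined_makespan(std::size_t batches, StageCost c) {
    double h2d_done = 0.0, kernel_done = 0.0, d2h_done = 0.0;
    for (std::size_t i = 0; i < batches; ++i) {
        h2d_done = h2d_done + c.h2d;
        kernel_done = std::max(h2d_done, kernel_done) + c.kernel;
        d2h_done = std::max(kernel_done, d2h_done) + c.d2h;
    }
    return d2h_done;
}
```

With 4 batches and costs of 2, 3, and 2 units, the serial schedule takes 28 units while the overlapped schedule takes 16, because the copies in both directions proceed while the kernel for another batch is running.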
- Theoretical Guarantees: Eliminates floating-point arithmetic errors
- Adaptive Digit Transformation: Handles both normal (β≤15, α≤22) and exceptional cases
- Lossless Recovery: Exact reconstruction of original floating-point values
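To make the float-to-integer idea concrete, the following CPU-side sketch searches for a decimal scale under which a double becomes an integer and accepts it only if the reverse conversion reproduces the original bits. It is an illustration under assumptions, not Falcon's algorithm: `DecimalForm`, `to_decimal_form`, and `from_decimal_form` are hypothetical names, and α is taken here to mean the number of decimal places. The bound of 22 matches the α≤22 limit above and is also the largest power of ten that a double represents exactly.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

struct DecimalForm {
    std::int64_t digits;  // integer payload
    int alpha;            // decimal places: value == digits / 10^alpha
};

// Powers of ten from 10^0 to 10^22 are exactly representable as doubles.
static const double kPow10[23] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};

// Bitwise equality, so that -0.0 and +0.0 are treated as different values.
static bool same_bits(double a, double b) {
    std::uint64_t x, y;
    std::memcpy(&x, &a, sizeof x);
    std::memcpy(&y, &b, sizeof y);
    return x == y;
}

double from_decimal_form(const DecimalForm& d) {
    return static_cast<double>(d.digits) / kPow10[d.alpha];
}

// Returns true and fills *out if some alpha in [0, 22] round-trips exactly.
// Returns false for values that need the exceptional path (NaN, infinity,
// -0.0, or values with no short decimal form).
bool to_decimal_form(double v, DecimalForm* out) {
    if (!std::isfinite(v)) return false;
    for (int alpha = 0; alpha <= 22; ++alpha) {
        double scaled = v * kPow10[alpha];
        if (std::fabs(scaled) >= 9007199254740992.0) break;  // 2^53
        DecimalForm candidate{static_cast<std::int64_t>(std::llround(scaled)), alpha};
        if (same_bits(from_decimal_form(candidate), v)) {
            *out = candidate;
            return true;
        }
    }
    return false;
}
```

For example, 3.14 is accepted as digits 314 with α = 2. The explicit round-trip check is what keeps the sketch lossless: any value whose scaled product was perturbed by rounding is rejected rather than stored incorrectly.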
- Dual Storage Schemes: Sparse storage for zero-dominated planes, dense storage for others
- Outlier Resilience: Mitigates sparsity degradation caused by anomalies
- Warp Divergence Minimization: Optimized for GPU parallel execution
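The choice between sparse and dense storage for a bit-plane can be sketched on the CPU as follows. The layout is an assumption made for illustration (a byte-level presence mask followed by the nonzero bytes) and is not Falcon's on-disk format; `extract_plane`, `encode_plane`, and `EncodedPlane` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Collect bit `plane` of every value: value i maps to bit (i % 8) of byte (i / 8).
std::vector<std::uint8_t> extract_plane(const std::vector<std::uint64_t>& values, int plane) {
    std::vector<std::uint8_t> bytes((values.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if ((values[i] >> plane) & 1u) {
            bytes[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return bytes;
}

struct EncodedPlane {
    bool sparse;                        // true: mask + nonzero bytes, false: raw bytes
    std::vector<std::uint8_t> payload;
};

// Pick whichever representation is smaller for this plane.
EncodedPlane encode_plane(const std::vector<std::uint8_t>& plane) {
    std::size_t nonzero = 0;
    for (std::uint8_t b : plane) nonzero += (b != 0);
    const std::size_t mask_bytes = (plane.size() + 7) / 8;

    EncodedPlane out;
    out.sparse = (mask_bytes + nonzero) < plane.size();
    if (!out.sparse) {
        out.payload = plane;  // dense: store the plane unchanged
        return out;
    }
    out.payload.assign(mask_bytes, 0);  // sparse: presence mask first
    for (std::size_t i = 0; i < plane.size(); ++i) {
        if (plane[i] != 0) {
            out.payload[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
            out.payload.push_back(plane[i]);  // then the nonzero bytes, in order
        }
    }
    return out;
}
```

In this sketch a single outlier sets one bit in an otherwise empty plane, so the sparse form costs one mask byte plus one data byte instead of the full plane, which is one way to read the outlier-resilience point above.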
- OS: Ubuntu 22.04.5 LTS
- Compiler: g++ 11.4
- Build System: CMake 3.22.1
- CUDA: nvcc 12.8/11.6
- GPU: NVIDIA GeForce RTX 3050
- OS: Ubuntu 24.04.2 LTS
- Compiler: g++ 11.4
- Build System: CMake 3.28.1
- CUDA: nvcc 12.0
- GPU: NVIDIA GeForce RTX 5080
# For Ubuntu 22.04/24.04
sudo apt update && sudo apt upgrade
sudo apt install -y git build-essential
# Ubuntu 22.04 (CMake 3.22)
sudo apt install -y cmake
# Ubuntu 24.04 (CMake 3.28), or to get a newer CMake release from the Kitware repository
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | sudo apt-key add -
# Use the codename of the running release (jammy on 22.04, noble on 24.04)
sudo apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main"
sudo apt update
sudo apt install -y cmake
# For CUDA 12.x (compatible with RTX 3050/5080)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-0
# For CUDA 11.x (if needed for compatibility)
sudo apt install -y cuda-toolkit-11-8
# Boost (program_options component)
sudo apt install -y libboost-all-dev
# Google Test (GTest)
sudo apt install -y libgtest-dev
cd /usr/src/gtest
sudo cmake .
sudo make
sudo cp lib/*.a /usr/lib
# Google Benchmark
sudo apt install -y libbenchmark-dev
# NVIDIA nvcomp (for baseline comparisons)
sudo apt-get -y install nvcomp-cuda-11
# or
sudo apt-get -y install nvcomp-cuda-12
# Check compiler versions
g++ --version
cmake --version
nvcc --version
# Verify CUDA installation
nvidia-smi
- `Falcon_compressor.cuh` - Optimized GPU compressor (1 thread processes 1025 elements)
- `Falcon_decompressor.cuh` - Optimized GPU decompressor (1 thread processes 1025 elements)
- `Falcon_float_compressor.cuh` - Single-precision floating-point GPU compressor
- `Falcon_float_decompressor.cuh` - Single-precision floating-point GPU decompressor
- `Falcon_pipeline.cuh` - Pipeline implementation with ablation interfaces
- `Falcon_float_pipeline.cuh` - Single-precision floating-point pipeline implementation
src/
├── gpu/ # GPU kernel implementations
└── utils/ # Bit stream utilities and helper functions
- Chunk Size: 1025 elements per GPU thread
- Thread Mapping: Each thread processes one complete chunk
- Warp Efficiency: Optimized for 32-thread warp execution
- Memory Access: Coalesced global memory access patterns
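The one-thread-per-chunk mapping above comes down to simple index arithmetic. The sketch below shows it on the host side; the helper names and the threads-per-block parameter are hypothetical, and only the chunk size of 1025 comes from this README.

```cpp
#include <cstddef>

constexpr std::size_t kChunkSize = 1025;  // elements handled by one GPU thread

// Half-open element range [begin, end) owned by one thread.
struct ChunkRange {
    std::size_t begin;
    std::size_t end;
};

// Number of chunks (and therefore threads) needed for n elements.
std::size_t chunk_count(std::size_t n) {
    return (n + kChunkSize - 1) / kChunkSize;
}

// Range for a given global thread index; the last chunk may be shorter,
// and threads past the end of the data get an empty range.
ChunkRange chunk_range(std::size_t thread_id, std::size_t n) {
    std::size_t begin = thread_id * kChunkSize;
    std::size_t end = begin + kChunkSize;
    if (begin > n) begin = n;
    if (end > n) end = n;
    return ChunkRange{begin, end};
}

// Number of thread blocks for a launch; threads_per_block is an assumed
// tuning parameter, normally a multiple of the 32-thread warp size.
std::size_t block_count(std::size_t n, std::size_t threads_per_block) {
    const std::size_t chunks = chunk_count(n);
    return (chunks + threads_per_block - 1) / threads_per_block;
}
```

For 3000 elements this gives three chunks, with the third thread covering elements 2050 to 2999.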
#!/bin/bash
set -x
mkdir -p build
cd build
cmake ..
make -j$(nproc)
- Clone the repository:
  git clone <repository-url>
  cd Falcon
- Generate the CMake build system:
  cmake -S . -B ./build -DCMAKE_BUILD_TYPE=Release
- Build all targets:
  cmake --build ./build --config Release -j$(nproc)
test/
├── baseline/ # Comparison algorithms (ALP, ndzip, elf, etc.)
├── data/ # Test datasets
├── Falcon_test_*.cu # Main GPU test suites
└── test_*.cpp/cu # Specific algorithm tests
./test/test_${test_name} --dir ../test/data/use/
# Main GPU implementation
./test/test_gpu --dir ../test/data/use/
# GPU without packing optimization
./test/test_gpu_nopack --dir ../test/data/use/
# GPU with bit-reduction optimization
./test/test_gpu_br --dir ../test/data/use/
# GPU with sparse optimization
./test/test_gpu_spare --dir ../test/data/use/
# Multi-stream with 3-step blocking
./test/test_muti_3step_block --dir ../test/data/use/
# Multi-stream with 3-step non-blocking
./test/test_muti_3step_noblock --dir ../test/data/use/
# Optimized multi-stream
./test/test_muti_stream --dir ../test/data/use/
- Full Sparse: All bit-planes use sparse storage
- Full Dense: All bit-planes use dense storage
- Brute-force Error: Inaccurate decimal place calculation
- Standard: Adaptive sparse/dense selection (default)
- Single-stream: Sequential processing
- Blocking: Synchronous multi-stream
- Non-blocking: Asynchronous multi-stream
- Standard: Event-driven scheduler (default)
#!/bin/bash
set -x
cd Falcon
mkdir -p build
cd build
# Compile project
cmake ..
make -j$(nproc)
# Run all tests
run_test() {
local test_name=$1
echo "===== Running ${test_name} ====="
./test/test_${test_name} --dir ../test/data/use/
}
# Core GPU tests
run_test "gpu"
run_test "gpu_nopack"
run_test "gpu_br"
run_test "gpu_spare"
# Multi-stream tests
run_test "muti_3step_block"
run_test "muti_3step_noblock"
run_test "muti_stream_opt"
| Method | Average Ratio (lower is better) | Relative to Falcon |
|---|---|---|
| Falcon | 0.299 | - |
| ALP | 0.329 | 10.0% worse |
| Elf* | 0.339 | 13.4% worse |
| Elf | 0.380 | 27.1% worse |
| ndzip | 0.996 | 233% worse |
| Operation | Falcon | Best Competitor | Speedup |
|---|---|---|---|
| Compression | 10.82 GB/s | 4.46 GB/s (GDeflate) | 2.43× |
| Decompression | 12.32 GB/s | 5.13 GB/s (GPU:Elf*) | 2.4× |
- Chunk Size: 1025 elements per thread
- Batch Size: 1025 × 1024 × 4 elements
- Pipeline Streams: 16
- GPU Architecture: Compute Capability 7.0+
- 1025 elements: Optimized for memory space utilization
- Thread Mapping: Each GPU thread processes exactly one chunk
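The configuration values above imply the following sizes per batch. This is a sketch of the arithmetic only: the round-robin assignment of batches to the 16 streams is an assumption made for illustration, not a description of Falcon's event-driven scheduler, and the helper names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kChunkElems = 1025;                    // elements per thread
constexpr std::size_t kBatchElems = kChunkElems * 1024 * 4;  // 4,198,400 elements
constexpr std::size_t kStreams = 16;                         // pipeline streams

// Threads needed for one full batch (one thread per chunk).
constexpr std::size_t threads_per_batch() {
    return kBatchElems / kChunkElems;
}

// Raw size of one full batch for a given element width (8 for double, 4 for float).
constexpr std::size_t batch_bytes(std::size_t elem_size) {
    return kBatchElems * elem_size;
}

// Number of batches for n elements; the last batch may be partial.
constexpr std::size_t batch_count(std::size_t n) {
    return (n + kBatchElems - 1) / kBatchElems;
}

// Assumed round-robin mapping of batch index to stream index.
constexpr std::size_t stream_for_batch(std::size_t batch) {
    return batch % kStreams;
}
```

A full batch therefore uses 4096 threads and holds 33,587,200 bytes of double-precision input, roughly 32 MiB.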
- `-DCMAKE_BUILD_TYPE=Release` for optimized performance
- `-DCMAKE_CUDA_ARCHITECTURES=70` for a specific GPU architecture
If you use Falcon in your research, please cite:
@article{falcon2025,
title={Falcon: GPU-Based Floating-point Adaptive Lossless Compression},
author={Li, Zheng and Wang, Weiyan and Li, Ruiyuan and Chen, Chao and Long, Xianlei and Zheng, Linjiang and Xu, Quanqing and Yang, Chuanhui},
journal={PVLDB},
volume={14},
number={1},
pages={XXX--XXX},
year={2025},
publisher={VLDB Endowment}
}
- Zheng Li (Chongqing University)
- Weiyan Wang (Chongqing University)
- Ruiyuan Li (Chongqing University)
- Chao Chen (Chongqing University)
- Xianlei Long (Chongqing University)
- Linjiang Zheng (Chongqing University)
- Quanqing Xu (OceanBase, Ant Group)
- Chuanhui Yang (OceanBase, Ant Group)
This project is available for academic and research use. Please refer to the specific license terms in the repository.
- Elf: Erasing-Based Lossless Floating-Point Compression
- ALP: Adaptive Lossless Floating-Point Compression
- Serf: Streaming Error-Bounded Floating-Point Compression
Note: This project has been verified to work on both WSL2 (Ubuntu 22.04) and native Ubuntu 24.04 environments with the specified dependencies. For questions about specific implementations or performance characteristics, please refer to the corresponding header files and test cases.