Just follow this README and run the code snippets; everything is automated.
All data for each experiment run, including the raw CSV, graphs, decision-tree PDFs, decision-tree rules, and node info, is stored in the CSV folder of the repo.
This code assumes a user called gpuio is present in the /home directory.
```shell
cd /home/gpuio/
git clone https://github.com/aalhadsawane/gpuIO.git
cd gpuIO
```
This folder (gpuIO) is the recommended workspace folder.

```shell
./build_hdf5_with_vfd_gds.sh
```
Build Instructions:
- This script builds HDF5 with MPI and Direct VFD (vfd_gds) support
- Installation location: `/home/gpuio/gpuIO/hdf5_install` (within workspace)
- Builds from source in `/home/gpuio/hdf5_build/hdf5`
- Automatically sets up environment variables in `~/.bashrc`
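The `~/.bashrc` setup the script performs can be sketched as an idempotent append, so rerunning the build does not stack duplicate export lines. This is a sketch only: the `add_env_line` helper and the `demo_bashrc` file are hypothetical, and only `HDF5_HOME` and the install path come from this README.

```shell
# Sketch: append an export line to a shell rc file, skipping it if already present.
# Writes to a demo file here instead of the real ~/.bashrc.
add_env_line() {
    local line="$1" file="$2"
    grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

add_env_line 'export HDF5_HOME=/home/gpuio/gpuIO/hdf5_install' demo_bashrc
add_env_line 'export HDF5_HOME=/home/gpuio/gpuIO/hdf5_install' demo_bashrc  # second call is a no-op
```

After both calls, `demo_bashrc` contains the line exactly once.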
Key HDF5 build flags used:
```shell
cmake -DCMAKE_INSTALL_PREFIX=$HDF5_HOME \
      -DCMAKE_C_COMPILER=mpicc \            # Use MPI compiler
      -DHDF5_ENABLE_PARALLEL=ON \           # Enable MPI parallel I/O
      -DHDF5_ENABLE_THREADSAFE=ON \         # Enable thread safety
      -DHDF5_ENABLE_DIRECT_VFD=ON \         # Enable Direct VFD (vfd_gds)
      -DHDF5_ENABLE_Z_LIB_SUPPORT=ON \      # Enable compression
      -DHDF5_ENABLE_SZIP_SUPPORT=ON \       # Enable SZIP compression
      -DALLOW_UNSUPPORTED=ON \              # Allow experimental features
      ..
```
Installation location: `/home/gpuio/gpuIO/hdf5_install` (within workspace)
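As a quick sanity check after configuring, you can grep the generated `CMakeCache.txt` for the flags above. This is a hedged sketch, not part of the build script: the `check_flag` helper is hypothetical, and the block fakes a minimal cache file so it is self-contained; in a real build you would run the grep inside the HDF5 build directory.

```shell
# Sketch: verify the key HDF5 options were picked up by CMake.
check_flag() {
    grep -q "^$1:BOOL=ON" CMakeCache.txt && echo "$1 ON" || echo "$1 MISSING"
}

# Fake a minimal CMakeCache.txt so the sketch is runnable anywhere.
printf '%s\n' 'HDF5_ENABLE_PARALLEL:BOOL=ON' 'HDF5_ENABLE_DIRECT_VFD:BOOL=ON' > CMakeCache.txt

check_flag HDF5_ENABLE_PARALLEL      # -> HDF5_ENABLE_PARALLEL ON
check_flag HDF5_ENABLE_DIRECT_VFD    # -> HDF5_ENABLE_DIRECT_VFD ON
```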
```shell
./build_h5bench.sh
```
Build Instructions:
- This script builds a custom version of h5bench with CUDA memory allocation support
- Uses the HDF5 installation from step 1 (`/home/gpuio/gpuIO/hdf5_install`)
- Builds in the `benchmarks/h5bench/build_cuda/` directory
- Links against CUDA runtime libraries for GPU memory allocation
- All executables are built with MPI and CUDA support
This builds a custom version of h5bench with CUDA memory allocation support for GPU Direct Storage benchmarking.
What's different in our custom h5bench:
- Replaced all `malloc()`/`calloc()` with `cudaMalloc()` in dataset buffer allocation
- Replaced all `free()` with `cudaFree()` for GPU memory deallocation
- Added `cudaMemcpy()` for data initialization (CPU → GPU transfer)
- Files modified:
  - `commons/h5bench_util.c`: Core memory allocation functions
  - `h5bench_patterns/h5bench_write.c`: Write pattern data preparation
  - `h5bench_patterns/h5bench_read.c`: Read pattern memory allocation
  - `h5bench_patterns/h5bench_append.c`: Append pattern memory allocation
- Updated `CMakeLists.txt` to include CUDA support:
  - Added `find_package(CUDA REQUIRED)`
  - Linked CUDA libraries to all executables
  - Added CUDA include directories
  - Fixed HDF5 include/library paths for proper linking
- Read operations now work with GPU memory (not just writes):
  - HDF5 reads directly into GPU VRAM when using vfd_gds
  - Data remains on GPU for processing after read operations
  - Both read and write patterns use the same GPU memory allocation strategy
- Added programmatic vfd_gds configuration in `h5bench_util.c`
- Automatic detection of the `HDF5_DRIVER=gds` environment variable
- Graceful fallback to the traditional HDF5 driver when vfd_gds is unavailable
- Runtime switching between GPU Direct Storage and traditional paths:
  - Traditional path: SSD → CPU DRAM → GPU VRAM (via `cudaMemcpy`)
  - Direct path: SSD → GPU VRAM (via vfd_gds + GPU memory allocation)
- Automatic benchmarking of both paths for performance comparison
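The runtime switch above is driven by the `HDF5_DRIVER` environment variable. In shell terms, the selection logic amounts to something like the following sketch (this mirrors the behavior, not the actual C code in `h5bench_util.c`; `select_path` is a hypothetical helper):

```shell
# Sketch: which data path the custom h5bench takes, based on HDF5_DRIVER.
select_path() {
    if [ "${HDF5_DRIVER:-}" = "gds" ]; then
        echo "direct: SSD -> GPU VRAM (vfd_gds)"
    else
        echo "traditional: SSD -> CPU DRAM -> GPU VRAM (cudaMemcpy)"
    fi
}

export HDF5_DRIVER=gds
select_path        # direct path
unset HDF5_DRIVER
select_path        # traditional path
```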
```c
// CPU memory allocation
void *buf = malloc(size);
// ... use buf for HDF5 operations ...
free(buf);
```

```c
// GPU memory allocation
void *buf;
cudaError_t err = cudaMalloc(&buf, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
// ... use buf for HDF5 operations (now in GPU VRAM) ...
cudaFree(buf);
```

```c
// Allocate and initialize on CPU first
float *data_cpu = malloc(particle_cnt * sizeof(float));
// ... initialize data_cpu with random values ...

// Allocate GPU memory
float *data_gpu;
cudaMalloc(&data_gpu, particle_cnt * sizeof(float));

// Copy from CPU to GPU
cudaMemcpy(data_gpu, data_cpu, particle_cnt * sizeof(float), cudaMemcpyHostToDevice);

// Free CPU memory
free(data_cpu);

// Use data_gpu for HDF5 operations
```

- Memory Alignment: `cudaMalloc()` provides the 4KB alignment required for GDS I/O
- Error Handling: All CUDA operations include proper error checking
- Memory Management: CPU structs still use `malloc()`, GPU data arrays use `cudaMalloc()`
- Compatibility: Falls back gracefully when vfd_gds is not available
- Performance: Enables true zero-copy SSD → GPU transfers when using vfd_gds
The benchmark uses specific pattern numbers that determine which memory preparation function is called:
```c
typedef enum write_pattern {
    WRITE_PATTERN_INVALID,      // 0
    CONTIG_CONTIG_1D,           // 1 - calls prepare_data_contig_1D()
    CONTIG_COMPOUND_1D,         // 2
    COMPOUND_CONTIG_1D,         // 3
    COMPOUND_COMPOUND_1D,       // 4
    CONTIG_CONTIG_STRIDED_1D,   // 5
    CONTIG_CONTIG_2D,           // 6
    CONTIG_COMPOUND_2D,         // 7
    COMPOUND_CONTIG_2D,         // 8
    COMPOUND_COMPOUND_2D,       // 9
    CONTIG_CONTIG_3D,           // 10 - calls prepare_data_contig_3D()
} write_pattern;
```
Important: The benchmark configuration `MEM_PATTERN=CONTIG` and `FILE_PATTERN=CONTIG` with `NUM_DIMS=3` maps to pattern 10 (`CONTIG_CONTIG_3D`), not pattern 1 (`CONTIG_CONTIG_1D`). This is why we modified both the `prepare_data_contig_1D()` and `prepare_data_contig_3D()` functions.
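For the CONTIG/CONTIG case, the mapping from `NUM_DIMS` to the pattern number can be sketched directly from the enum above (`pattern_id` is a hypothetical helper for illustration, not part of h5bench):

```shell
# Sketch: pattern number for MEM_PATTERN=CONTIG, FILE_PATTERN=CONTIG by NUM_DIMS,
# following the write_pattern enum above.
pattern_id() {
    case "$1" in
        1) echo 1 ;;    # CONTIG_CONTIG_1D
        2) echo 6 ;;    # CONTIG_CONTIG_2D
        3) echo 10 ;;   # CONTIG_CONTIG_3D
        *) echo 0 ;;    # WRITE_PATTERN_INVALID
    esac
}

pattern_id 3   # -> 10, so prepare_data_contig_3D() is called, not the 1D variant
```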
True verification of GPU usage is through `nvidia-smi dmon` (not snapshots):
```shell
# Real-time GPU monitoring (better than nvidia-smi snapshots)
nvidia-smi dmon -s pucvmet -c 100

# Or use our monitoring script
./monitor_gpu.sh -d 3600 -i 5
```
Expected output when GPU memory is allocated:
```
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    22    42     -     0    42     0     0  6001   300
```
The `mem` column shows GPU memory utilization percentage, and `pwr` shows increased power consumption.
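To turn a captured dmon log into a single number, the `mem` column can be averaged with awk. This is a sketch: the column layout matches the sample output above, and the here-document stands in for a real monitoring log such as `gpu_demon.log`.

```shell
# Sketch: average GPU memory utilization (6th column, "mem") from dmon-format output.
cat > dmon_sample.log <<'EOF'
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    22    42     -     0    42     0     0  6001   300
    0    24    43     -     5    50     0     0  6001   300
EOF

# Skip header lines starting with '#', accumulate column 6, print the mean.
awk '!/^#/ { sum += $6; n++ } END { if (n) printf "avg mem%%: %.1f\n", sum / n }' dmon_sample.log
# -> avg mem%: 46.0
```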
- Added `find_package(CUDA REQUIRED)` for CUDA detection
- Linked `${CUDA_LIBRARIES}` to all executable targets
- Added `${CUDA_INCLUDE_DIRS}` for CUDA headers
- Fixed HDF5 include/library paths for proper compilation

- `prepare_contig_memory()`: Replaced `malloc()` with `cudaMalloc()` for all data arrays
- `prepare_contig_memory_multi_dim()`: Same GPU memory allocation pattern
- `free_contig_memory()`: Replaced `free()` with `cudaFree()` for GPU arrays
- `configure_vfd_gds_fapl()`: New function for programmatic vfd_gds setup

- `prepare_data_interleaved()`: CPU allocation → initialization → GPU copy → CPU free
- `prepare_data_contig_1D()`: Multiple GPU arrays with individual CPU → GPU transfers
- `data_free()`: Updated to use `cudaFree()` for GPU-allocated data
- Added CUDA headers for GPU memory support
- Read operations now work with GPU-allocated buffers
- Added CUDA headers for consistency
- Append operations support GPU memory allocation
- Added declaration for the `configure_vfd_gds_fapl()` function
Custom h5bench with CUDA support has been successfully built!
- `h5bench_write` - Main write benchmark with CUDA memory allocation
- `h5bench_read` - Read benchmark with GPU memory support
- `h5bench_append` - Append benchmark with CUDA support
- `h5bench_overwrite` - Overwrite benchmark with GPU memory
- `h5bench_write_unlimited` - Unlimited write benchmark
- `h5bench_write_var_normal_dist` - Variable normal distribution write
- ✅ MPI-enabled HDF5 installed (`libhdf5-openmpi-dev`)
- ✅ CUDA runtime and libraries linked
- ✅ All memory allocation functions converted to GPU memory
- ✅ MPI headers added to all pattern files
- ✅ Build configuration updated for OpenMPI HDF5
The custom h5bench with CUDA support has been successfully built! All executables are working correctly with proper MPI support.
Key achievements:
- ✅ Full MPI I/O support: Using custom-built HDF5 with MPI enabled
- ✅ CUDA memory allocation: All dataset buffers use GPU memory (`cudaMalloc`)
- ✅ GPU Direct Storage ready: vfd_gds plugin integration prepared
- ✅ All executables functional: Tested and working correctly
Remaining warnings (non-critical):
- Format string mismatches (cosmetic; do not affect functionality)
- Volatile qualifier warnings (non-critical; do not affect functionality)
The build is production-ready for GPU Direct Storage benchmarking!
- Zero-copy transfers: Direct SSD → GPU VRAM when using vfd_gds
- Reduced memory bandwidth: Eliminates CPU DRAM bottleneck
- Lower latency: Fewer memory copies in the data path
- Higher throughput: GPU memory bandwidth utilization
- Direct comparison: Traditional vs GPU Direct Storage paths
- Real-world scenarios: Actual HDF5 workload performance testing
- Scalability analysis: Multi-threaded GPU I/O performance
- Bottleneck identification: CPU vs GPU memory transfer analysis
- GPU Direct Storage evaluation: Performance characteristics of vfd_gds
- HDF5 optimization: Memory allocation strategy impact
- I/O pattern analysis: Different access patterns with GPU memory
- System tuning: Optimal configuration for GPU-accelerated HDF5
NOTE: What has been done so far (Steps 0 to 2) only needs to be done once per node, as setup.
The steps ahead are reused for every experiment.
Edit `benchmark_config.conf` in the source dir of the repo.
Set the number of I/O threads, dataset sizes, block sizes, and modes (GPU/CPU) there.
Available modes:
- GPU: Uses GPU Direct Storage (vfd_gds) for direct SSD → GPU VRAM transfers
- CPU: Uses the traditional path (SSD → CPU DRAM → GPU VRAM via `cudaMemcpy`)
NOTE: These I/O threads are hardware threads, not virtual ones, and on the colva nodes they max out at 20. Do not set the number of I/O threads higher than 20.
For WRITE benchmarks:
```shell
nohup ./run_h5bench_write.sh > h5bench_write_output.log 2>&1 &
```
For READ benchmarks:
```shell
nohup ./run_h5bench_read.sh > h5bench_read_output.log 2>&1 &
```
With GPU monitoring (recommended):
```shell
# Run with GPU VRAM monitoring to confirm CUDA memory allocation
./run_benchmark_with_monitoring.sh run_h5bench_write.sh
./run_benchmark_with_monitoring.sh run_h5bench_read.sh
```
Manual GPU monitoring:
```shell
# Show current GPU status
./monitor_gpu.sh -s

# Monitor GPU during benchmarks (optimized for long runs)
./monitor_gpu.sh                 # Monitor for 2 hours (10s intervals)
./monitor_gpu.sh -d 3600         # Monitor for 1 hour
./monitor_gpu.sh -i 5            # Sample every 5 seconds

# Analyze existing monitoring data
./monitor_gpu.sh -a              # Analyze gpu_demon.log
./monitor_gpu.sh -a custom.log   # Analyze a custom log file
```
What the monitoring tracks:
- GPU Memory: VRAM usage (confirms CUDA memory allocation)
- GPU Utilization: SM%, Memory bandwidth usage
- GPU Clocks: Graphics, Memory, SM clock speeds
- Temperature & Power: Thermal and power consumption
- CPU Usage: Overall CPU utilization and load average
- Processes: Running processes on GPU (PIDs, memory usage)
- Timestamps: Precise timing for correlation with benchmarks
The scripts automatically:
- Run both GPU and CPU modes as configured in `benchmark_config.conf`
- Switch between vfd_gds (GPU Direct Storage) and traditional paths
- Compare performance between direct SSD → GPU and SSD → CPU → GPU transfers
The output (`raw_output.csv`) is stored in a new folder `/home/gpuio/gpuIO/CSV/Run*/`, where `Run*` indicates the latest Run folder (created automatically).
A text file called `nodename.txt` inside the Run folder is also created; it stores the hostname (e.g. colva2).
You can check the `h5bench_*_output.log` file to see how far the experiment has progressed (look for "Benchmark progress").
To kill the experiment midway: `ps aux | grep run_h5bench` and `kill -9` the PID. You can also send a SIGINT (Ctrl+C) signal with `kill -2` to skip a single benchmark that might be stuck.
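The kill procedure can be rehearsed safely with a dummy background process (a sketch; `sleep` stands in for `run_h5bench_write.sh`, and in a real run you would get the PID from `ps aux | grep run_h5bench`):

```shell
# Sketch: hard-kill a running process by PID, using a dummy process.
sleep 300 &
pid=$!

kill -9 "$pid"             # hard kill, as when ending the whole experiment
wait "$pid" 2>/dev/null || true   # reap the process; exit status reflects the signal

# kill -2 "$pid" would instead deliver SIGINT, which the run scripts
# treat as "skip the current benchmark" rather than "stop everything".
```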
Create a venv with:
```shell
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
My plotting script is provided as `plot.py` in the source dir of the repo. Please make a copy of it if you want to edit it; don't change this one.
```shell
sudo apt-get install graphviz
python plot.py
```
It will take the input CSV as `CSV/Run*/raw_output.csv`, which is the latest output CSV made by the `run_h5bench.sh` script.
It will store all PNG images in `CSV/Run*/graphs`, where `Run*` indicates the latest Run.
It will also make a decision tree as a PDF and a txt file defining the rules of the decision tree in `CSV/Run*`.
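Resolving the "latest Run folder" correctly matters once many runs accumulate, since a plain lexical sort would place `Run10` before `Run2`. A version-aware sort handles this; the sketch below fakes a few run folders and uses `sort -V` (how `plot.py` actually resolves the folder is not shown here):

```shell
# Sketch: resolve the latest CSV/Run* folder with a version-aware sort.
mkdir -p CSV/Run1 CSV/Run2 CSV/Run10

latest=$(ls -d CSV/Run*/ | sort -V | tail -n 1)
echo "$latest"   # -> CSV/Run10/
```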
### Custom CSV path and output path
`plot.py` takes 2 arguments:
- `--data-path`: by default it will use the `raw_output.csv` in the latest run folder `CSV/Run*`
- `--output-dir`: by default it will make a `graphs` dir in the latest run folder `CSV/Run*`

Example of using a custom output directory for graphs:
```shell
python plot.py --output-dir /home/gpuio/gpuIO/custom_graphs
```
If you make any large files, please make sure to add them to `.gitignore` to avoid committing them to the repository.
Push after every successful run.
Sometimes individual benchmark configurations may fail or produce anomalous results (blank entries or 0 values) in the `raw_output.csv` file. A script to handle that is being written and should be ready soon.
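Until that script lands, a one-liner can drop rows with blank or zero fields from a copy of the CSV. This is a sketch only: the sample column layout below is hypothetical, and the real `raw_output.csv` is assumed to be plain comma-separated values with a header row.

```shell
# Sketch: keep the header, drop data rows containing an empty field or a bare 0.
cat > sample.csv <<'EOF'
mode,threads,throughput
GPU,4,1200
CPU,4,0
GPU,8,
CPU,8,950
EOF

awk -F, 'NR == 1 { print; next }
         { for (i = 1; i <= NF; i++) if ($i == "" || $i == "0") next; print }' \
    sample.csv > cleaned.csv

cat cleaned.csv   # header plus the two clean rows remain
```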