Just follow this README and run the code snippets; everything is automated.
All data for each experiment run, including the raw CSV, graphs, decision-tree PDFs, decision-tree rules, and node info, is stored in the CSV folder of the repo.
This code assumes a user called gpuio is present in the /home directory.
```shell
cd /home/gpuio/
git clone https://github.com/aalhadsawane/gpuIO.git
cd gpuIO
```
This folder (gpuIO) is the recommended workspace folder.

```shell
./build_hdf5_with_vfd_gds.sh
```
Build Instructions:
- This script builds HDF5 with MPI and Direct VFD (vfd_gds) support
- Installation location: `/home/gpuio/gpuIO/hdf5_install` (within workspace)
- Builds from source in `/home/gpuio/hdf5_build/hdf5`
- Automatically sets up environment variables in `~/.bashrc`
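The `~/.bashrc` setup the script performs can be sketched as an idempotent append, so rerunning the build does not stack duplicate export lines. This is a sketch only: the `add_env_line` helper and the `demo_bashrc` file are hypothetical, and only `HDF5_HOME` and the install path come from this README.

```shell
# Sketch: append an export line to a shell rc file, skipping it if already present.
# Writes to a demo file here instead of the real ~/.bashrc.
add_env_line() {
    local line="$1" file="$2"
    grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

add_env_line 'export HDF5_HOME=/home/gpuio/gpuIO/hdf5_install' demo_bashrc
add_env_line 'export HDF5_HOME=/home/gpuio/gpuIO/hdf5_install' demo_bashrc  # second call is a no-op
```

After both calls, `demo_bashrc` contains the line exactly once.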
Key HDF5 build flags used:
```shell
cmake -DCMAKE_INSTALL_PREFIX=$HDF5_HOME \
      -DCMAKE_C_COMPILER=mpicc \            # Use MPI compiler
      -DHDF5_ENABLE_PARALLEL=ON \           # Enable MPI parallel I/O
      -DHDF5_ENABLE_THREADSAFE=ON \         # Enable thread safety
      -DHDF5_ENABLE_DIRECT_VFD=ON \         # Enable Direct VFD (vfd_gds)
      -DHDF5_ENABLE_Z_LIB_SUPPORT=ON \      # Enable compression
      -DHDF5_ENABLE_SZIP_SUPPORT=ON \       # Enable SZIP compression
      -DALLOW_UNSUPPORTED=ON \              # Allow experimental features
      ..
```
Installation location: `/home/gpuio/gpuIO/hdf5_install` (within workspace)
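As a quick sanity check after configuring, you can grep the generated `CMakeCache.txt` for the flags above. This is a hedged sketch, not part of the build script: the `check_flag` helper is hypothetical, and the block fakes a minimal cache file so it is self-contained; in a real build you would run the grep inside the HDF5 build directory.

```shell
# Sketch: verify the key HDF5 options were picked up by CMake.
check_flag() {
    grep -q "^$1:BOOL=ON" CMakeCache.txt && echo "$1 ON" || echo "$1 MISSING"
}

# Fake a minimal CMakeCache.txt so the sketch is runnable anywhere.
printf '%s\n' 'HDF5_ENABLE_PARALLEL:BOOL=ON' 'HDF5_ENABLE_DIRECT_VFD:BOOL=ON' > CMakeCache.txt

check_flag HDF5_ENABLE_PARALLEL      # -> HDF5_ENABLE_PARALLEL ON
check_flag HDF5_ENABLE_DIRECT_VFD    # -> HDF5_ENABLE_DIRECT_VFD ON
```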
```shell
./build_h5bench.sh
```
Build Instructions:
- This script builds a custom version of h5bench with CUDA memory allocation support
- Uses the HDF5 installation from step 1 (`/home/gpuio/gpuIO/hdf5_install`)
- Builds in the `benchmarks/h5bench/build_cuda/` directory
- Links against CUDA runtime libraries for GPU memory allocation
- All executables are built with MPI and CUDA support
This builds a custom version of h5bench with CUDA memory allocation support for GPU Direct Storage benchmarking.
What's different in our custom h5bench:
- Replaced all `malloc()`/`calloc()` with `cudaMalloc()` in dataset buffer allocation
- Replaced all `free()` with `cudaFree()` for GPU memory deallocation
- Added `cudaMemcpy()` for data initialization (CPU → GPU transfer)
- Files modified:
  - `commons/h5bench_util.c`: Core memory allocation functions
  - `h5bench_patterns/h5bench_write.c`: Write pattern data preparation
  - `h5bench_patterns/h5bench_read.c`: Read pattern memory allocation
  - `h5bench_patterns/h5bench_append.c`: Append pattern memory allocation
- Updated `CMakeLists.txt` to include CUDA support:
  - Added `find_package(CUDA REQUIRED)`
  - Linked CUDA libraries to all executables
  - Added CUDA include directories
  - Fixed HDF5 include/library paths for proper linking
- Read operations now work with GPU memory (not just writes):
  - HDF5 reads directly into GPU VRAM when using vfd_gds
  - Data remains on GPU for processing after read operations
  - Both read and write patterns use the same GPU memory allocation strategy
- Added programmatic vfd_gds configuration in `h5bench_util.c`
- Automatic detection of the `HDF5_DRIVER=gds` environment variable
- Graceful fallback to the traditional HDF5 driver when vfd_gds is unavailable
- Runtime switching between GPU Direct Storage and traditional paths:
  - Traditional path: SSD → CPU DRAM → GPU VRAM (via `cudaMemcpy`)
  - Direct path: SSD → GPU VRAM (via vfd_gds + GPU memory allocation)
- Automatic benchmarking of both paths for performance comparison
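The runtime switch above is driven by the `HDF5_DRIVER` environment variable. In shell terms, the selection logic amounts to something like the following sketch (this mirrors the behavior, not the actual C code in `h5bench_util.c`; `select_path` is a hypothetical helper):

```shell
# Sketch: which data path the custom h5bench takes, based on HDF5_DRIVER.
select_path() {
    if [ "${HDF5_DRIVER:-}" = "gds" ]; then
        echo "direct: SSD -> GPU VRAM (vfd_gds)"
    else
        echo "traditional: SSD -> CPU DRAM -> GPU VRAM (cudaMemcpy)"
    fi
}

export HDF5_DRIVER=gds
select_path        # direct path
unset HDF5_DRIVER
select_path        # traditional path
```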
```c
// CPU memory allocation
void *buf = malloc(size);
// ... use buf for HDF5 operations ...
free(buf);
```

```c
// GPU memory allocation
void *buf;
cudaError_t err = cudaMalloc(&buf, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
// ... use buf for HDF5 operations (now in GPU VRAM) ...
cudaFree(buf);
```

```c
// Allocate and initialize on CPU first
float *data_cpu = malloc(particle_cnt * sizeof(float));
// ... initialize data_cpu with random values ...

// Allocate GPU memory
float *data_gpu;
cudaMalloc(&data_gpu, particle_cnt * sizeof(float));

// Copy from CPU to GPU
cudaMemcpy(data_gpu, data_cpu, particle_cnt * sizeof(float), cudaMemcpyHostToDevice);

// Free CPU memory
free(data_cpu);

// Use data_gpu for HDF5 operations
```

- Memory Alignment: `cudaMalloc()` provides the 4KB alignment required for GDS I/O
- Error Handling: All CUDA operations include proper error checking
- Memory Management: CPU structs still use `malloc()`, GPU data arrays use `cudaMalloc()`
- Compatibility: Falls back gracefully when vfd_gds is not available
- Performance: Enables true zero-copy SSD → GPU transfers when using vfd_gds
The benchmark uses specific pattern numbers that determine which memory preparation function is called:
```c
typedef enum write_pattern {
    WRITE_PATTERN_INVALID,      // 0
    CONTIG_CONTIG_1D,           // 1 - calls prepare_data_contig_1D()
    CONTIG_COMPOUND_1D,         // 2
    COMPOUND_CONTIG_1D,         // 3
    COMPOUND_COMPOUND_1D,       // 4
    CONTIG_CONTIG_STRIDED_1D,   // 5
    CONTIG_CONTIG_2D,           // 6
    CONTIG_COMPOUND_2D,         // 7
    COMPOUND_CONTIG_2D,         // 8
    COMPOUND_COMPOUND_2D,       // 9
    CONTIG_CONTIG_3D,           // 10 - calls prepare_data_contig_3D()
} write_pattern;
```
Important: The benchmark configuration `MEM_PATTERN=CONTIG` and `FILE_PATTERN=CONTIG` with `NUM_DIMS=3` maps to pattern 10 (`CONTIG_CONTIG_3D`), not pattern 1 (`CONTIG_CONTIG_1D`). This is why we modified both the `prepare_data_contig_1D()` and `prepare_data_contig_3D()` functions.
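For the CONTIG/CONTIG case, the mapping from `NUM_DIMS` to the pattern number can be sketched directly from the enum above (`pattern_id` is a hypothetical helper for illustration, not part of h5bench):

```shell
# Sketch: pattern number for MEM_PATTERN=CONTIG, FILE_PATTERN=CONTIG by NUM_DIMS,
# following the write_pattern enum above.
pattern_id() {
    case "$1" in
        1) echo 1 ;;    # CONTIG_CONTIG_1D
        2) echo 6 ;;    # CONTIG_CONTIG_2D
        3) echo 10 ;;   # CONTIG_CONTIG_3D
        *) echo 0 ;;    # WRITE_PATTERN_INVALID
    esac
}

pattern_id 3   # -> 10, so prepare_data_contig_3D() is called, not the 1D variant
```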
True verification of GPU usage is through `nvidia-smi dmon` (not snapshots):
```shell
# Real-time GPU monitoring (better than nvidia-smi snapshots)
nvidia-smi dmon -s pucvmet -c 100

# Or use our monitoring script
./monitor_gpu.sh -d 3600 -i 5
```
Expected output when GPU memory is allocated:
```
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    22    42     -     0    42     0     0  6001   300
```
The `mem` column shows GPU memory utilization percentage, and `pwr` shows increased power consumption.
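To turn a captured dmon log into a single number, the `mem` column can be averaged with awk. This is a sketch: the column layout matches the sample output above, and the here-document stands in for a real monitoring log such as `gpu_demon.log`.

```shell
# Sketch: average GPU memory utilization (6th column, "mem") from dmon-format output.
cat > dmon_sample.log <<'EOF'
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    22    42     -     0    42     0     0  6001   300
    0    24    43     -     5    50     0     0  6001   300
EOF

# Skip header lines starting with '#', accumulate column 6, print the mean.
awk '!/^#/ { sum += $6; n++ } END { if (n) printf "avg mem%%: %.1f\n", sum / n }' dmon_sample.log
# -> avg mem%: 46.0
```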
- Added `find_package(CUDA REQUIRED)` for CUDA detection
- Linked `${CUDA_LIBRARIES}` to all executable targets
- Added `${CUDA_INCLUDE_DIRS}` for CUDA headers
- Fixed HDF5 include/library paths for proper compilation

- `prepare_contig_memory()`: Replaced `malloc()` with `cudaMalloc()` for all data arrays
- `prepare_contig_memory_multi_dim()`: Same GPU memory allocation pattern
- `free_contig_memory()`: Replaced `free()` with `cudaFree()` for GPU arrays
- `configure_vfd_gds_fapl()`: New function for programmatic vfd_gds setup

- `prepare_data_interleaved()`: CPU allocation → initialization → GPU copy → CPU free
- `prepare_data_contig_1D()`: Multiple GPU arrays with individual CPU → GPU transfers
- `data_free()`: Updated to use `cudaFree()` for GPU-allocated data
- Added CUDA headers for GPU memory support
- Read operations now work with GPU-allocated buffers
- Added CUDA headers for consistency
- Append operations support GPU memory allocation
- Added declaration for the `configure_vfd_gds_fapl()` function
Custom h5bench with CUDA support has been successfully built!
- `h5bench_write` - Main write benchmark with CUDA memory allocation
- `h5bench_read` - Read benchmark with GPU memory support
- `h5bench_append` - Append benchmark with CUDA support
- `h5bench_overwrite` - Overwrite benchmark with GPU memory
- `h5bench_write_unlimited` - Unlimited write benchmark
- `h5bench_write_var_normal_dist` - Variable normal distribution write
- ✅ MPI-enabled HDF5 installed (`libhdf5-openmpi-dev`)
- ✅ CUDA runtime and libraries linked
- ✅ All memory allocation functions converted to GPU memory
- ✅ MPI headers added to all pattern files
- ✅ Build configuration updated for OpenMPI HDF5
The custom h5bench with CUDA support has been successfully built! All executables are working correctly with proper MPI support.
Key achievements:
- ✅ Full MPI I/O support: Using custom-built HDF5 with MPI enabled
- ✅ CUDA memory allocation: All dataset buffers use GPU memory (`cudaMalloc`)
- ✅ GPU Direct Storage ready: vfd_gds plugin integration prepared
- ✅ All executables functional: Tested and working correctly
Remaining warnings (non-critical):
- Format string mismatches (cosmetic; do not affect functionality)
- Volatile qualifier warnings (non-critical; do not affect functionality)
The build is production-ready for GPU Direct Storage benchmarking!
- Zero-copy transfers: Direct SSD → GPU VRAM when using vfd_gds
- Reduced memory bandwidth: Eliminates CPU DRAM bottleneck
- Lower latency: Fewer memory copies in the data path
- Higher throughput: GPU memory bandwidth utilization
- Direct comparison: Traditional vs GPU Direct Storage paths
- Real-world scenarios: Actual HDF5 workload performance testing
- Scalability analysis: Multi-threaded GPU I/O performance
- Bottleneck identification: CPU vs GPU memory transfer analysis
- GPU Direct Storage evaluation: Performance characteristics of vfd_gds
- HDF5 optimization: Memory allocation strategy impact
- I/O pattern analysis: Different access patterns with GPU memory
- System tuning: Optimal configuration for GPU-accelerated HDF5
NOTE: What has been done so far (Steps 0 to 2) only needs to be done once per node, as setup.
The steps ahead are reused for every experiment.
Edit `benchmark_config.conf` in the source dir of the repo.
Set the number of I/O threads, dataset sizes, block sizes, and modes (GPU/CPU) there.
Available modes:
- GPU: Uses GPU Direct Storage (vfd_gds) for direct SSD → GPU VRAM transfers
- CPU: Uses the traditional path (SSD → CPU DRAM → GPU VRAM via `cudaMemcpy`)
NOTE: These I/O threads are hardware threads, not virtual ones, and on the colva nodes they max out at 20. Do not set the number of I/O threads higher than 20.
For WRITE benchmarks:
```shell
nohup ./run_h5bench_write.sh > h5bench_write_output.log 2>&1 &
```
For READ benchmarks:
```shell
nohup ./run_h5bench_read.sh > h5bench_read_output.log 2>&1 &
```
With GPU monitoring (recommended):
```shell
# Run with GPU VRAM monitoring to confirm CUDA memory allocation
./run_benchmark_with_monitoring.sh run_h5bench_write.sh
./run_benchmark_with_monitoring.sh run_h5bench_read.sh
```
Manual GPU monitoring:
```shell
# Show current GPU status
./monitor_gpu.sh -s

# Monitor GPU during benchmarks (optimized for long runs)
./monitor_gpu.sh                 # Monitor for 2 hours (10s intervals)
./monitor_gpu.sh -d 3600         # Monitor for 1 hour
./monitor_gpu.sh -i 5            # Sample every 5 seconds

# Analyze existing monitoring data
./monitor_gpu.sh -a              # Analyze gpu_demon.log
./monitor_gpu.sh -a custom.log   # Analyze a custom log file
```
What the monitoring tracks:
- GPU Memory: VRAM usage (confirms CUDA memory allocation)
- GPU Utilization: SM%, Memory bandwidth usage
- GPU Clocks: Graphics, Memory, SM clock speeds
- Temperature & Power: Thermal and power consumption
- CPU Usage: Overall CPU utilization and load average
- Processes: Running processes on GPU (PIDs, memory usage)
- Timestamps: Precise timing for correlation with benchmarks
The scripts automatically:
- Run both GPU and CPU modes as configured in `benchmark_config.conf`
- Switch between vfd_gds (GPU Direct Storage) and traditional paths
- Compare performance between direct SSD → GPU and SSD → CPU → GPU transfers
The output (`raw_output.csv`) is stored in a new folder `/home/gpuio/gpuIO/CSV/Run*/`, where `Run*` indicates the latest Run folder (created automatically).
A text file called `nodename.txt` inside the Run folder is also created; it stores the hostname (e.g. colva2).
You can check the `h5bench_*_output.log` file to see how far the experiment has progressed (look for "Benchmark progress").
To kill the experiment midway: `ps aux | grep run_h5bench` and `kill -9` the PID. You can also send a SIGINT (Ctrl+C) signal with `kill -2` to skip a single benchmark that might be stuck.
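The kill procedure can be rehearsed safely with a dummy background process (a sketch; `sleep` stands in for `run_h5bench_write.sh`, and in a real run you would get the PID from `ps aux | grep run_h5bench`):

```shell
# Sketch: hard-kill a running process by PID, using a dummy process.
sleep 300 &
pid=$!

kill -9 "$pid"             # hard kill, as when ending the whole experiment
wait "$pid" 2>/dev/null || true   # reap the process; exit status reflects the signal

# kill -2 "$pid" would instead deliver SIGINT, which the run scripts
# treat as "skip the current benchmark" rather than "stop everything".
```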
Create a venv with:
```shell
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
My plotting script is provided as `plot.py` in the source dir of the repo. Please make a copy of it if you want to edit it; don't change this one.
```shell
sudo apt-get install graphviz
python plot.py
```
It will take the input CSV as `CSV/Run*/raw_output.csv`, which is the latest output CSV made by the `run_h5bench.sh` script.
It will store all PNG images in `CSV/Run*/graphs`, where `Run*` indicates the latest Run.
It will also make a decision tree as a PDF and a txt file defining the rules of the decision tree in `CSV/Run*`.
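Resolving the "latest Run folder" correctly matters once many runs accumulate, since a plain lexical sort would place `Run10` before `Run2`. A version-aware sort handles this; the sketch below fakes a few run folders and uses `sort -V` (how `plot.py` actually resolves the folder is not shown here):

```shell
# Sketch: resolve the latest CSV/Run* folder with a version-aware sort.
mkdir -p CSV/Run1 CSV/Run2 CSV/Run10

latest=$(ls -d CSV/Run*/ | sort -V | tail -n 1)
echo "$latest"   # -> CSV/Run10/
```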
### Custom CSV path and output path
`plot.py` takes 2 arguments:
- `--data-path`: by default it will use the `raw_output.csv` in the latest run folder `CSV/Run*`
- `--output-dir`: by default it will make a `graphs` dir in the latest run folder `CSV/Run*`

Example of using a custom output directory for graphs:
```shell
python plot.py --output-dir /home/gpuio/gpuIO/custom_graphs
```
If you make any large files, please make sure to add them to `.gitignore` to avoid committing them to the repository.
Push after every successful run.
Sometimes individual benchmark configurations may fail or produce anomalous results (blank entries or 0 values) in the `raw_output.csv` file. A script to handle that is being written and should be ready soon.
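Until that script lands, a one-liner can drop rows with blank or zero fields from a copy of the CSV. This is a sketch only: the sample column layout below is hypothetical, and the real `raw_output.csv` is assumed to be plain comma-separated values with a header row.

```shell
# Sketch: keep the header, drop data rows containing an empty field or a bare 0.
cat > sample.csv <<'EOF'
mode,threads,throughput
GPU,4,1200
CPU,4,0
GPU,8,
CPU,8,950
EOF

awk -F, 'NR == 1 { print; next }
         { for (i = 1; i <= NF; i++) if ($i == "" || $i == "0") next; print }' \
    sample.csv > cleaned.csv

cat cleaned.csv   # header plus the two clean rows remain
```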