vramio

Know your VRAM before you run.

A dead-simple API to estimate GPU memory requirements for any HuggingFace model.

Live Demo


The Problem

You found a cool model on HuggingFace. Now what?

  • "Will it fit on my 24GB GPU?"
  • "What quantization do I need?"
  • "How much VRAM for inference?"

The answers are buried — scattered across model cards, config files, or simply missing. You either dig through safetensors metadata yourself, or download the model and pray.

The Solution

One API call. Instant answer.

curl "https://vramio.ksingh.in/model?hf_id=meta-llama/Llama-2-7b"
{
  "model": "meta-llama/Llama-2-7b",
  "total_parameters": "6.74B",
  "memory_required": "12.55 GB",
  "current_dtype": "F16",
  "recommended_vram": "15.06 GB",
  "other_precisions": {
    "fp32": "25.10 GB",
    "fp16": "12.55 GB",
    "int8": "6.27 GB",
    "int4": "3.14 GB"
  },
  "overhead_note": "Includes 20% for activations/KV cache (2K context)"
}

recommended_vram is what you actually need: the weight footprint plus 20% inference overhead (12.55 GB × 1.2 ≈ 15.06 GB in the example above).
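
If you'd rather hit the endpoint from Python than the shell, here is a minimal sketch using httpx (any HTTP client works; the endpoint and query parameter are the same as in the curl example above):

import httpx

# Query the hosted API for a model's VRAM estimate.
resp = httpx.get(
    "https://vramio.ksingh.in/model",
    params={"hf_id": "meta-llama/Llama-2-7b"},
    timeout=30,
)
resp.raise_for_status()
estimate = resp.json()
print(estimate["recommended_vram"])  # e.g. "15.06 GB"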

How It Works

  1. Fetches safetensors metadata from HuggingFace (just headers, not weights)
  2. Parses tensor shapes and dtypes
  3. Calculates memory for each precision
  4. Adds 20% overhead for activations + KV cache

No model downloads. No GPU required. Just math.
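
A rough sketch of the core idea (not the repo's actual code): safetensors files begin with an 8-byte little-endian header length followed by a JSON header listing every tensor's dtype and shape, so two small Range requests are enough to do the math. This sketch assumes a single, ungated model.safetensors file; real models are often sharded across several files and may require an auth token.

import json
import math
import struct

import httpx

# Bytes per parameter for common safetensors dtypes.
BYTES_PER_DTYPE = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I8": 1, "U8": 1}

def fetch_header(hf_id: str, filename: str = "model.safetensors") -> dict:
    """Fetch only the safetensors JSON header via HTTP Range requests."""
    url = f"https://huggingface.co/{hf_id}/resolve/main/{filename}"
    with httpx.Client(follow_redirects=True) as client:
        # First 8 bytes: little-endian u64 giving the JSON header's length.
        head = client.get(url, headers={"Range": "bytes=0-7"})
        (header_len,) = struct.unpack("<Q", head.content)
        # Then the header itself: tensor name -> {dtype, shape, data_offsets}.
        body = client.get(url, headers={"Range": f"bytes=8-{7 + header_len}"})
    return json.loads(body.content)

def estimate_gib(header: dict, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Total parameters x bytes per parameter, plus inference overhead."""
    n_params = sum(
        math.prod(t["shape"])
        for name, t in header.items()
        if name != "__metadata__"  # skip the optional metadata entry
    )
    return n_params * bytes_per_param * (1 + overhead) / 2**30

header = fetch_header("openai-community/gpt2")  # small, ungated model for a quick test
print(f"fp16: {estimate_gib(header, BYTES_PER_DTYPE['F16']):.2f} GiB")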

Read more about the implementation in this blog post.

Self-Host

# Clone and run
git clone https://github.com/ksingh-scogo/vramio.git
cd vramio
pip install "httpx[http2]"
python server_embedded.py
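
Then query it locally just like the hosted demo (the port below is a guess; check server_embedded.py for the one it actually binds):

curl "http://localhost:8000/model?hf_id=meta-llama/Llama-2-7b"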

Or deploy free on Render using the included render.yaml.

Tech Stack

  • 160 lines of Python
  • Zero frameworks — just stdlib http.server + httpx
  • 1 dependency: httpx[http2]

Credits

Built on memory estimation logic from hf-mem by @alvarobartt.

License

MIT
