guide : setting up NVIDIA DGX Spark with ggml #16514
Replies: 3 comments 3 replies
This is an outstanding guide and I was able to get everything up and running on the Spark with minimal fuss. I did run into an interesting issue, though. Right now, out of the box with the latest updates, the system defaults to a newer gcc. Apparently, there were some changes made to the gcc vector type definitions starting with version 13, so I had to revert and rebuild with gcc 12 by applying these changes to your shell script:

```sh
# ...{everything before line 48}

printf "[I] Installing llama.cpp\n"
git clone https://github.com/ggml-org/llama.cpp ~/ggml-org/llama.cpp
cd ~/ggml-org/llama.cpp
# Build with CUDA, forcing gcc-12/g++-12 as the host compilers for nvcc
cmake -B build-cuda -DCMAKE_C_COMPILER=/usr/bin/gcc-12 -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA=ON
cmake --build build-cuda -j

printf "[I] Installing whisper.cpp\n"
git clone https://github.com/ggml-org/whisper.cpp ~/ggml-org/whisper.cpp
cd ~/ggml-org/whisper.cpp
# Same compiler overrides for the whisper.cpp CUDA build
cmake -B build-cuda -DCMAKE_C_COMPILER=/usr/bin/gcc-12 -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA=ON
cmake --build build-cuda -j

# ...{everything after line 58}
```
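For reference, gcc 12 can usually be installed alongside the default toolchain from the standard Ubuntu repositories (a sketch, assuming an Ubuntu-based DGX OS image):

```sh
# Install GCC 12 next to the system default compiler
sudo apt update
sudo apt install -y gcc-12 g++-12
```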
Just got my Spark. I had to install the libcurl dev package so CMake could find curl, but otherwise there were no issues.
Why is the model Llama 3.3 70B (4-bit quantization) so slow on the DGX Spark, using the latest version of llama.cpp (checked out 25 October 2025)? On other consumer architectures this model is very fast with llama.cpp. Can you give me some suggestions? Thank you. The command line follows (the model was quantized two days ago with the latest version of llama.cpp), along with a video.
Overview
In this guide we will configure the NVIDIA DGX™ Spark as a local and private AI assistant using the ggml software stack. The guide is geared towards developers and builders. We are going to set up the following AI capabilities:
- Chat with a local LLM
- Inline code completions (FIM)
- A coding agent
- Document and image processing with a multimodal model
- Audio transcription (speech-to-text)
These features will run simultaneously on your local network, allowing you to fully utilize the power of your device at home or in the office.
Software
We are going to use the following open-source software:
- llama.cpp
- whisper.cpp
- llama.vim
- llama.vscode
Setup
Simply run the following command in a terminal on your NVIDIA DGX™ Spark:
```sh
bash <(curl -s https://ggml.ai/dgx-spark.sh)
```

Note: The `dgx-spark.sh` script above is quite basic and is merely one of the many possible ways you can configure your device for AI use cases. It is provided here mainly for convenience and as an example. Feel free to inspect it and adjust it for your needs.

The command downloads and builds the latest version of the `ggml` software stack and starts multiple HTTP REST services, as shown in the following table:

| Endpoint | Purpose |
| --- | --- |
| http://localhost:8021 | embeddings |
| http://localhost:8022 | inline code completions (FIM) |
| http://localhost:8023 | chat |
| http://localhost:8024 | vision (multimodal) |
| http://localhost:8025 | speech-to-text |

Running the command for the first time can take a few minutes to download the model weights. If everything goes OK, you should see the following output:
At this point, the machine is fully configured and ready to be used. An internet connection is no longer necessary.
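One quick way to confirm the services are responding is to poll their health endpoints. A sketch, assuming the llama.cpp-based services expose llama-server's standard /health route:

```sh
# Report the HTTP status of each service's health endpoint.
# The whisper service on port 8025 may not implement /health.
for port in 8021 8022 8023 8024; do
  curl -s -o /dev/null -w "port $port: %{http_code}\n" "http://localhost:$port/health"
done
```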
Here's sample output of `nvidia-smi` while the `ggml` services are running:

Use cases
Here is a small fraction of the AI use cases that are possible with this configuration.
Basic chat
Simply point your browser to the chat endpoint http://localhost:8023.
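The chat endpoint can also be queried from the command line via llama-server's OpenAI-compatible REST API (assuming the script starts a standard llama-server instance; the loaded model is used by default, so no model field is needed):

```sh
# Minimal chat completion request against the local OpenAI-compatible API
curl -s http://localhost:8023/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Hello! Summarize what you can do." }
    ]
  }'
```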
Inline code completions (FIM)
Install the llama.vim plugin in your Vim/Neovim editor and configure it to use the FIM endpoint http://localhost:8022. In VSCode, install the llama.vscode extension and configure it in a similar way to use the FIM endpoint.
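You can also sanity-check the FIM service directly from the shell. A sketch using llama-server's /infill route (assuming the service is a standard llama-server instance serving an infill-capable model):

```sh
# Ask the FIM endpoint to fill in the body of a small function
curl -s http://localhost:8022/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def fib(n):\n    ",
    "input_suffix": "\nprint(fib(10))\n"
  }'
```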
Coding agent
In VSCode, configure the llama.vscode extension to use the endpoints for completions, chat, embeddings and tools.
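To verify the embeddings service from the shell, llama-server also exposes an OpenAI-compatible embeddings route. A sketch, assuming the embeddings service is the one on port 8021:

```sh
# Request an embedding vector for a short input string
curl -s http://localhost:8021/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{ "input": "hello world" }'
```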
Document and image processing
Submit PDFs and image documents in the WebUI to analyze them with a multimodal LLM. For visuals, use the vision endpoint http://localhost:8024.
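The vision endpoint can also be queried programmatically via the OpenAI-compatible chat API. A sketch that sends a local image as a base64 data URI (assuming the service accepts image_url content parts, as llama-server's multimodal mode does):

```sh
# Describe a local image via the multimodal chat endpoint
IMG=$(base64 -w0 photo.jpg)   # photo.jpg is a placeholder path
curl -s http://localhost:8024/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,'"$IMG"'" } }
      ]
    }]
  }'
```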
Audio transcription
Use the speech-to-text endpoint at http://localhost:8025 to quickly transcribe audio files.
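From the command line, the transcription service can be exercised with a multipart upload. A sketch, assuming it runs whisper.cpp's server example, which accepts POST /inference with a 16 kHz WAV file:

```sh
# Transcribe a local WAV file (sample.wav is a placeholder path)
curl -s http://localhost:8025/inference \
  -F file=@sample.wav \
  -F response_format=json
```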
Performance
For performance numbers, see Performance of llama.cpp on NVIDIA DGX Spark.
Conclusion
The new NVIDIA DGX Spark is a great choice for serving the latest AI models locally and privately. With 128GB of unified system memory, it has the capacity to host multiple AI services simultaneously. And the `ggml` software stack is the best way to do that.