Releases: google/gemma.cpp
v0.1.4
What's Changed
- Refactor Gemma ctor and improve pool NUMA support by @copybara-service in #520
- Fix the prompt wrapping of gemma3-1b by @ufownl in #523
- Add note on attention length and SFP by @copybara-service in #521
- Add support for a secondary EOS token by @copybara-service in #525
- Update app argument documentation by @copybara-service in #526
- Set the secondary EOS for Gemma2 by @ufownl in #527
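The secondary EOS entries above mean that generation can stop on either of two end-of-sequence tokens. The following is a minimal sketch of that idea, not the gemma.cpp implementation. `EosConfig`, `IsEos`, `TakeUntilEos`, and the `-1` "not set" sentinel are hypothetical names and conventions chosen for illustration, and any token ids passed in are arbitrary.

```cpp
#include <vector>

// Hypothetical illustration only: these names are not taken from gemma.cpp.
struct EosConfig {
  int primary_eos;    // token id that always ends generation
  int secondary_eos;  // optional second end token; -1 means "not set" (assumed)
};

// Returns true if `token` should stop generation under `config`.
inline bool IsEos(const EosConfig& config, int token) {
  if (token == config.primary_eos) return true;
  return config.secondary_eos >= 0 && token == config.secondary_eos;
}

// Copies tokens from `stream` up to, but not including, the first EOS token.
inline std::vector<int> TakeUntilEos(const EosConfig& config,
                                     const std::vector<int>& stream) {
  std::vector<int> out;
  for (int token : stream) {
    if (IsEos(config, token)) break;
    out.push_back(token);
  }
  return out;
}
```

With this shape, a model that emits a different end-of-turn token than its primary EOS only needs `secondary_eos` set; the generation loop itself does not change.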
Full Changelog: v0.1.3...v0.1.4
v0.1.3
- Support for PaliGemma 2 and Gemma 3.
- Major update to MatMul and MatMul-using operations; significant performance increases in multiple parts of the codebase.
- Codebase simplifications and refactors in many areas.
- Bugfixes
What's Changed
- Add more ops: Sigmoid, (Two)MatVecAdd. Faster TwoMatVec. by @veluca93 in #129
- Improve weight handling. by @veluca93 in #130
- Remove unused includes by @copybara-service in #132
- Add a benchmark and additional tests. by @veluca93 in #131
- Adding Griffin implementation. by @pculliton in #136
- Change NumGemmaLayers and NumGriffinLayers to constants in configs by @ufownl in #139
- Mention Makefile contributed by @jart by @copybara-service in #141
- Refactor data structures to reduce memory usage by @ufownl in #142
- Added functionality of storing layers activations output. by @atorero in #145
- Further improve IO, enable multiple backends without -D. by @copybara-service in #148
- Use lambda to split function and Make stream_token can break prefill by @zeerd in #156
- Simplify prefill early-exit (originally Merge #156) by @copybara-service in #158
- Fix underflow in NUQ ClusterCost() by @copybara-service in #162
- Add error-checking for py binding, add missing include+hwasan check by @copybara-service in #163
- Simplify threading: remove the use of inner_pool. by @szabadka in #167
- Use more parallelism in the QKV projections in MQA mode. by @szabadka in #170
- Fix kv offset computation for MHA config. by @szabadka in #172
- Use more parallelism in the final output of the attention block. by @szabadka in #175
- Use more parallelism in the QKV projections of the MHA block. by @szabadka in #176
- Factor out deinterleaving of bf16 vectors for MatVecs. by @samkaufman in #166
- Use more parallelism in attention block in prefill mode. by @szabadka in #177
- work with cmake install by @xinpingwang in #169
- 2x speedup of SFP decode (1.4x overall) on AVX3_DL+. by @copybara-service in #178
- Support additional scaling by @copybara-service in #181
- Store tokens/sec in auxiliary struct TimingInfo. by @copybara-service in #183
- Add TTFT to TimingInfo by @copybara-service in #186
- Make BlobWriter::Add() accept const void* by @copybara-service in #188
- Adds Kaggle testing to CI workflow by @pculliton in #189
- Fix normalization in Softmax function. by @szabadka in #194
- Clarified README by @zond in #137
- Unrolled / tiled 4x4 MatMul by @copybara-service in #199
- Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T by @copybara-service in #202
- Add first version of backpropagation support. by @szabadka in #203
- Fix for GenerateZeroMat call in TestTiledMatMul by @copybara-service in #206
- Remove no longer required stats.h - use Highway version instead by @copybara-service in #208
- Simplifications: remove GemmaInterface and GemmaImpl by @copybara-service in #209
- Implement mixed mode matmul: f32 * bf16 by @copybara-service in #210
- Fix Softmax on SVE by @copybara-service in #213
- Fix fix for weight type define, refs #198 by @copybara-service in #216
- Add Adam optimizer. by @szabadka in #212
- Add support for custom sampling function to runtime config. by @szabadka in #217
- Shifting large matrix init to heap in ops_test.cc by @copybara-service in #220
- Add CPU output, error if not C++17, simplify tokenizer ctor by @copybara-service in #222
- Use CompressedWeights<TConfig> in backpropagation. by @szabadka in #224
- Update benchmark with internal init by @copybara-service in #225
- Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc by @copybara-service in #227
- Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB -sized chunks of the second matrix. by @copybara-service in #231
- Add benchmark dependency to cmake build. by @szabadka in #234
- Fix numerical issue in Softcap by subtracting max. by @copybara-service in #236
- Extends Transformer() to prepare for batched processing. by @copybara-service in #238
- Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding. by @copybara-service in #239
- Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in (f32, by @copybara-service in #237
- Increase parallelism in ops_test by @copybara-service in #233
- Added MatMul_4x4_Batch which is MatMul_4x4, but with the first template arg moved to the first function arg, so the batch size (num A rows) can be variable at run-time. by @copybara-service in #241
- Reduce duplication in Config* by inheriting no-SSM by @copybara-service in #242
- Major duplicated code reduction in test/benchmarks by @copybara-service in #240
- Implement a missing (bf16, f32) tiled MatMul kernel. by @copybara-service in #245
- Removed now redundant non-batch matmul by @copybara-service in #246
- Integrate matmul into FFW: 4.3x prefill speedup by @copybara-service in #243
- Internal change. by @copybara-service in #244
- Added bias vector addition to MatMul by @copybara-service in #247
- Refactor CompressedWeights. by @copybara-service in #248
- Fix DASSERT - TiledBatch requires at least 2 vectors. by @copybara-service in #253
- Move raw_weights into separate header, used mainly by compress_weights. by @copybara-service in #249
- Further simplification to ForEachTensor, thanks I.K. by @copybara-service in #254
- Update developer docs and mention asan/msan by @copybara-service in #255
- 1.15x 7b sfp prefill speedup: Matmul in attention by @copybara-service in #256
- Fix Py binding/run_example: use GemmaEnv by @copybara-service in #257
- Simplify Attention. by @copybara-service in #258
- Fix debug_prompt and other binaries (internal init) by @copybara-service in #259
- Move kGriffinLayers into ConfigNoSSM, set kGemmaLayers directly by @copybara-service in #260
- Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions. by @copybara-service in #261
- Move test placeholder to a later pos. by @copybara-service in #263
- Code cleanup by @copybara-service in #264
- Refactor kCachePosSize and kCacheLayerSize into separate functors. by @copybara-service in #262
- Fixing two typos. by @copybara-service in #265
- Fix compilation errors in clang by @ufownl in #267
- Fix KV cache size calculation error by @ufownl in #266
- Skip the last RMSNormInplaceBatched in the Prefill phase. by @copybara-service in #268
- Improve logging when running Gemma examples: fix the issue when max_tokens, max_generated_tokens and temperature were logging without any trailing space/newline. by @copybara-service in #270
- Use hwy::ThreadPool::MaxThreads() to determine the number of threads to use. by @copybara-service in https://github.com/google/gem...
v0.1.2
- MQA implementation
- Ops refactorings and optimizations
- Bugfixes
- Model exporting script (util/convert_weights.py)
Important Note: With the MQA implementation, older 2B model artifacts need to be updated. Please re-download weights from Kaggle and ensure you have the latest version (-mqa or version 3).
What's Changed
- Clean up docs for developers by @austinvhuang in #102
- MQA Implementation for 2B models by @ufownl in #114
- Enhancing Utility Functions in ops.h by @enum-class in #105
- Added a missing space in app.h by @villesundell in #115
- Fix compilation error when HWY_COMPILER_GCC_ACTUAL < 1300 by @ufownl in #120
- .bazelversion: Bazel 7.1.1 by @LINKIWI in #122
- Add standalone tool to compress weights. by @szabadka in #125
- 1.07x speedup: merge MQA parallel sections as suggested by @veluca93 by @copybara-service in #126
- Fix off-by-one errors in generation code and token streaming callback. by @szabadka in #127
New Contributors
- @villesundell made their first contribution in #115
- @LINKIWI made their first contribution in #122
- @szabadka made their first contribution in #125
Full Changelog: v0.1.1...v0.1.2
v0.1.1
- Refactor library interfaces
- Fixes to enable Android and Windows builds, plus general build improvements
- Bazel builds
- CI automation
- Allow either HF or Kaggle (vs Kaggle only) for artifact downloads
- Many small fixes and quality-of-life improvements since the initial 0.1.0 release
What's Changed
- Dev -> Main sync by @austinvhuang in #24
- Update build.yml by @eltociear in #22
- Fix typos by @shirayu in #32
- Allow building on Windows using clang-cl toolchain by @dcoles in #6
- Do not pass explicitly -O2 flag to compiler in Release build by @traversaro in #3
- Fix build. by @dan-zheng in #35
- reset conversation by @kishida in #34
- Rename BUILD to BUILD.bazel. by @dan-zheng in #36
- Add --eot_line option by @shirayu in #33
- clean up formatting after 129e66a by @austinvhuang in #58
- Warning fixes: unused member, cast, unused function by @copybara-service in #61
- CLI args + README improvements + cleanup by @austinvhuang in #66
- Fix for Android's 32-bit off_t. Fixes #62 by @copybara-service in #63
- Add DEVELOPERS notes on using gemma as a library by @austinvhuang in #71
- Add clang-tidy, fix narrowing issues, fix constness by @enum-class in #65
- Support Bazel builds. Fixes #16 by @copybara-service in #75
- Add instructions to download from Hugging Face Hub by @osanseviero in #74
- Separate KV cache from GemmaImpl by @ufownl in #81
- Avoid fadvise on older Android. Fixes #84 by @copybara-service in #85
- use hwy/simd for RMSNorm(f, bf, f) calculation by @enum-class in #78
- Use highway simd for SquaredL2 calculation by @enum-class in #77
- Detect and print build type. Refs #88 by @copybara-service in #92
- libgemma API refactor - decouple from interactive repl demo specifics, add hello world example using libgemma by @austinvhuang in #82
- Additional cleanup after libgemma refactor #82 by @austinvhuang in #87
- Use bf16-rounded sqrt for scaling embeddings to match Gemma by @copybara-service in #93
- Remove unused ascii banner string by @copybara-service in #96
- Allow changing k parameter of SampleTopK as a compiler flag by @ufownl in #97
- Add missing log that point to a failed Generation by @zeerd in #98
New Contributors
- @austinvhuang made their first contribution in #24
- @eltociear made their first contribution in #22
- @shirayu made their first contribution in #32
- @dcoles made their first contribution in #6
- @traversaro made their first contribution in #3
- @dan-zheng made their first contribution in #35
- @kishida made their first contribution in #34
- @copybara-service made their first contribution in #61
- @enum-class made their first contribution in #65
- @osanseviero made their first contribution in #74
- @ufownl made their first contribution in #81
- @zeerd made their first contribution in #98
Full Changelog: v0.1.0...v0.1.1