Reduce memory usage in dataset loading and item cache #1252
zhenghaoz merged 8 commits into gorse-io:master
Conversation
Previously, the `LoadDataFromDatabase` function stored all items twice:
1. In the `items` variable for sorting and partitioning
2. In `dataSet.items` via `dataSet.AddItem()`

This optimization removes the duplicate storage by:
- Collecting only itemIds (strings) instead of full `data.Item` objects
- Sorting and partitioning itemIds directly
- Retrieving full item data from `dataSet.GetItems()` when needed

Memory savings: Each `data.Item` contains `Labels`, `Categories`, `Timestamp`, and other fields that can occupy hundreds of bytes. For 1M items, this optimization saves approximately 100-200 MB of memory during dataset loading.
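The collect-sort-partition pattern described above can be sketched in plain Go. `splitRanges` here is a hypothetical stand-in for gorse's internal `parallel.Split` helper; the real code sorts and partitions the same way but feeds the ranges to feedback scans.

```go
package main

import (
	"fmt"
	"sort"
)

// splitRanges partitions sorted IDs into k roughly equal chunks.
// Hypothetical stand-in for gorse's parallel.Split.
func splitRanges(ids []string, k int) [][]string {
	chunks := make([][]string, 0, k)
	size := (len(ids) + k - 1) / k
	for i := 0; i < len(ids); i += size {
		end := i + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[i:end])
	}
	return chunks
}

func main() {
	// Collect only item IDs (strings) instead of full data.Item structs.
	itemIds := []string{"c", "a", "d", "b"}
	sort.Strings(itemIds) // sort the IDs directly; no full items needed
	for i, chunk := range splitRanges(itemIds, 2) {
		fmt.Println(i, chunk) // 0 [a b], then 1 [c d]
	}
}
```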
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
##           master   #1252      +/-   ##
==========================================
- Coverage   72.90%  70.42%   -2.49%
==========================================
  Files          91      91
  Lines       16729   19708    +2979
==========================================
+ Hits        12197   13879    +1682
- Misses       3287    4615    +1328
+ Partials     1245    1214      -31
Pull request overview
This PR aims to reduce peak memory usage during Master.LoadDataFromDatabase by avoiding storing the full data.Item slice twice while still supporting item sorting/partitioning for feedback scans.
Changes:
- Replace in-memory `[]data.Item` accumulation with `[]string` (itemIds) collected during item loading.
- Sort and partition itemIds with `sort.Strings` + `parallel.Split` and update feedback scan ranges accordingly.
- Update non-personalized recommender ingestion to look up items from `dataSet.GetItems()` by item ID.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
logics/item_to_item.go:444
`convertLabelsToID` recursively converts nested `[]any` elements, but `flatten` (used by the tags-based recommenders) doesn't handle `[]any` at all. As a result, any tag IDs nested inside arrays will be silently ignored (and the recursive conversion work in `convertLabelsToID` is effectively wasted). Consider extending `flatten` to traverse `[]any` (and/or `[]string` if expected), or constraining `convertLabelsToID` to only produce structures that `flatten` can consume.
func flatten(o any, tSet mapset.Set[dataset.ID]) {
switch typed := o.(type) {
case dataset.ID:
tSet.Add(typed)
return
case []dataset.ID:
tSet.Append(typed...)
return
case map[string]any:
for _, v := range typed {
flatten(v, tSet)
}
}
}
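One way to address the review comment is to add a `[]any` case that recurses, mirroring the `map[string]any` branch. The sketch below is self-contained, so it substitutes a plain `map[ID]struct{}` for the `mapset.Set[dataset.ID]` used in gorse; only the added `[]any` case is the point.

```go
package main

import "fmt"

// ID stands in for dataset.ID in this self-contained sketch.
type ID int32

// flatten extended with a []any case so nested tag IDs are collected
// instead of silently ignored (the issue the review comment raises).
func flatten(o any, tSet map[ID]struct{}) {
	switch typed := o.(type) {
	case ID:
		tSet[typed] = struct{}{}
	case []ID:
		for _, id := range typed {
			tSet[id] = struct{}{}
		}
	case []any:
		// New case: recurse into nested arrays.
		for _, v := range typed {
			flatten(v, tSet)
		}
	case map[string]any:
		for _, v := range typed {
			flatten(v, tSet)
		}
	}
}

func main() {
	set := make(map[ID]struct{})
	// IDs nested inside []any are now reached.
	flatten(map[string]any{"tags": []any{ID(1), []ID{2, 3}}}, set)
	fmt.Println(len(set)) // 3
}
```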
…d index references and optimizing data handling
…ing item slice and releasing memory
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
master/tasks.go:359
- The PR description states that duplicate in-memory storage of items during loading is removed by collecting only item IDs and sorting/partitioning them. However, this code still appends full `data.Item` structs into `items` (and also stores the same items in `dataSet` via `AddItem`), so the duplicate storage and peak memory overhead remain. To achieve the intended savings, switch `items` to a `[]string` (or `[]int32` indices) collected from `batchItems`, sort/split that slice, and fetch full items from `dataSet.GetItems()` only when calling non-personalized recommenders.
items := make([]data.Item, 0, estimatedNumItems)
itemLabelCount := make(map[string]int)
itemLabelFirst := make(map[string]int32)
itemLabelIndex := dataset.NewMapIndex()
itemLabels := make([][]lo.Tuple2[int32, float32], 0, estimatedNumItems)
itemEmbeddingIndexer := dataset.NewMapIndex()
itemEmbeddingDimension := make([]map[int]int, 0)
itemEmbeddings := make([][][]uint16, 0, estimatedNumItems)
start = time.Now()
itemChan, errChan := database.GetItemStream(newCtx, batchSize, itemTimeLimit)
for batchItems := range itemChan {
	items = append(items, batchItems...)
	for _, item := range batchItems {
		dataSet.AddItem(item)
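The fix the review asks for — keeping a single authoritative copy inside the dataset and only a lightweight ID slice alongside it — can be sketched as follows. The `Item`, `DataSet`, and `load` types here are simplified stand-ins for gorse's `data.Item` and the real streaming loop, not the actual API.

```go
package main

import "fmt"

// Simplified stand-ins for data.Item and the dataset (assumptions).
type Item struct{ ItemId string }

type DataSet struct{ items []Item }

func (d *DataSet) AddItem(i Item)    { d.items = append(d.items, i) }
func (d *DataSet) GetItems() []Item  { return d.items }

// load stores each full item once in the dataset and keeps only its
// string ID for later sorting/partitioning, avoiding the duplicate slice.
func load(batches [][]Item) (*DataSet, []string) {
	dataSet := &DataSet{}
	itemIds := make([]string, 0)
	for _, batch := range batches {
		for _, item := range batch {
			dataSet.AddItem(item)                  // single authoritative copy
			itemIds = append(itemIds, item.ItemId) // lightweight ID only
		}
	}
	return dataSet, itemIds
}

func main() {
	ds, ids := load([][]Item{{{ItemId: "a"}, {ItemId: "b"}}, {{ItemId: "c"}}})
	fmt.Println(len(ds.GetItems()), len(ids)) // 3 3
}
```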
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Problem
During dataset loading in `LoadDataFromDatabase`, all items were stored twice:
1. In the `items` variable (full `data.Item` objects) for sorting and partitioning
2. In `dataSet.items` via `dataSet.AddItem()`

This caused significant memory overhead, especially for large datasets.
Solution
This PR removes the duplicate storage by:
- Collecting only itemIds (strings) instead of full `data.Item` objects
- Sorting and partitioning itemIds directly using `sort.Strings()`
- Retrieving full item data from `dataSet.GetItems()` when needed for non-personalized recommenders

Memory Savings
Each `data.Item` contains `Labels`, `Categories`, `Timestamp`, `Comment`, and other fields. A typical item can occupy 100-200 bytes. For a dataset with 1 million items:
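As a rough sanity check on the 100-200 MB figure, the arithmetic is just item count times the duplicated per-item footprint. A minimal sketch (the `savingsMB` helper is hypothetical, for illustration only):

```go
package main

import "fmt"

// savingsMB estimates memory saved by dropping the duplicate item slice:
// numItems copies of roughly bytesPerItem each, expressed in MiB.
func savingsMB(numItems, bytesPerItem int) float64 {
	return float64(numItems*bytesPerItem) / (1 << 20)
}

func main() {
	// 1M items at the midpoint estimate of 150 bytes each.
	fmt.Printf("%.0f MB\n", savingsMB(1_000_000, 150)) // 143 MB
}
```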
Changes
- `master/tasks.go`: Changed `var items []data.Item` to `var itemIds []string`
- Replaced `items = append(items, batchItems...)` with `itemIds = append(itemIds, item.ItemId)` in the item loading loop
- Changed `sort.Slice(items, ...)` to `sort.Strings(itemIds)`
- Renamed `itemGroups` to `itemIdGroups`
- Updated `recommender.Push()` calls to use `dataSet.GetItems()[itemIdx]`

Testing