Reduce memory usage in dataset loading and item cache #1252

Merged
zhenghaoz merged 8 commits into gorse-io:master from zhangzhenghao:optimize-memory-load-dataset
Apr 22, 2026

Conversation

@zhangzhenghao
Contributor

Problem

During dataset loading in LoadDataFromDatabase, all items were stored twice:

  1. In the items variable (full data.Item objects) for sorting and partitioning
  2. In dataSet.items via dataSet.AddItem()

This caused significant memory overhead, especially for large datasets.

Solution

This PR removes the duplicate storage by:

  • Collecting only itemIds (strings) instead of full data.Item objects
  • Sorting and partitioning itemIds directly using sort.Strings()
  • Retrieving full item data from dataSet.GetItems() when needed for non-personalized recommenders
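The steps above can be sketched as a small, self-contained Go program. This is an illustration, not the PR's actual code: `split` here is a hypothetical stand-in for the `parallel.Split` helper the review mentions, and the batch loop stands in for `database.GetItemStream`.

```go
package main

import (
	"fmt"
	"sort"
)

// split partitions ids into n contiguous groups, mimicking what a
// parallel.Split-style helper would do over the sorted ID slice.
func split(ids []string, n int) [][]string {
	groups := make([][]string, n)
	for i := range groups {
		lo := i * len(ids) / n
		hi := (i + 1) * len(ids) / n
		groups[i] = ids[lo:hi]
	}
	return groups
}

func main() {
	// Instead of accumulating full data.Item structs, keep only the IDs.
	var itemIds []string
	for _, batch := range [][]string{{"b", "a"}, {"d", "c"}} { // stand-in for streamed batches
		itemIds = append(itemIds, batch...)
	}
	sort.Strings(itemIds) // replaces sort.Slice over []data.Item
	groups := split(itemIds, 2)
	fmt.Println(groups) // [[a b] [c d]]
}
```

Full item data is then fetched from the dataset's own storage only at the point a non-personalized recommender needs it, so the IDs are the only extra allocation held across sorting and partitioning.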

Memory Savings

Each data.Item contains Labels, Categories, Timestamp, Comment, and other fields. A typical item can occupy 100-200 bytes.

For a dataset with 1 million items:

  • Before: ~200MB duplicate storage (items variable + dataSet.items)
  • After: ~8MB for itemIds (string array) + ~200MB for dataSet.items
  • Saved: ~200MB peak memory during loading
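A back-of-envelope check of the estimate above, using the upper end of the stated 100-200 byte per-item size; the constants are illustrative, not measured:

```go
package main

import "fmt"

func main() {
	const numItems = 1_000_000
	const bytesPerItem = 200 // upper end of the 100-200 byte per-item estimate
	// The duplicate copy the PR eliminates costs roughly numItems * bytesPerItem
	// at peak, since it existed only for the duration of sorting/partitioning.
	savedMB := numItems * bytesPerItem / 1_000_000
	fmt.Printf("peak saving: ~%d MB\n", savedMB)
}
```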

Changes

  • master/tasks.go: Changed var items []data.Item to var itemIds []string
  • Removed items = append(items, batchItems...)
  • Added itemIds = append(itemIds, item.ItemId) in the item loading loop
  • Changed sort.Slice(items, ...) to sort.Strings(itemIds)
  • Updated all references from itemGroups to itemIdGroups
  • Modified recommender.Push() calls to use dataSet.GetItems()[itemIdx]

Testing

  • Code changes are straightforward and maintain the same logic flow
  • CI tests will verify compilation and correctness

@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.42%. Comparing base (bf4dd35) to head (f122110).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1252      +/-   ##
==========================================
- Coverage   72.90%   70.42%   -2.49%     
==========================================
  Files          91       91              
  Lines       16729    19708    +2979     
==========================================
+ Hits        12197    13879    +1682     
- Misses       3287     4615    +1328     
+ Partials     1245     1214      -31     


Contributor

Copilot AI left a comment

Pull request overview

This PR aims to reduce peak memory usage during Master.LoadDataFromDatabase by avoiding storing the full data.Item slice twice while still supporting item sorting/partitioning for feedback scans.

Changes:

  • Replace in-memory []data.Item accumulation with []string (itemIds) collected during item loading.
  • Sort and partition itemIds with sort.Strings + parallel.Split and update feedback scan ranges accordingly.
  • Update non-personalized recommender ingestion to look up items from dataSet.GetItems() by item ID.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

logics/item_to_item.go:444

  • convertLabelsToID recursively converts nested []any elements, but flatten (used by the tags-based recommenders) doesn't handle []any at all. As a result, any tag IDs nested inside arrays will be silently ignored (and the recursive conversion work in convertLabelsToID is effectively wasted). Consider extending flatten to traverse []any (and/or []string if expected), or constraining convertLabelsToID to only produce structures that flatten can consume.
func flatten(o any, tSet mapset.Set[dataset.ID]) {
	switch typed := o.(type) {
	case dataset.ID:
		tSet.Add(typed)
		return
	case []dataset.ID:
		tSet.Append(typed...)
		return
	case map[string]any:
		for _, v := range typed {
			flatten(v, tSet)
		}
	}
}
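A minimal sketch of the extension the comment suggests: adding a `[]any` case so IDs nested inside generic arrays are traversed instead of silently dropped. To keep the sketch self-contained, `ID` stands in for `dataset.ID` and a plain map replaces the `mapset.Set` used in the real code.

```go
package main

import "fmt"

// ID stands in for dataset.ID.
type ID int32

// flatten traverses nested label structures and collects every ID it finds.
// The []any case is the addition proposed by the review comment above.
func flatten(o any, set map[ID]struct{}) {
	switch typed := o.(type) {
	case ID:
		set[typed] = struct{}{}
	case []ID:
		for _, id := range typed {
			set[id] = struct{}{}
		}
	case []any: // previously ignored: IDs nested inside generic arrays
		for _, v := range typed {
			flatten(v, set)
		}
	case map[string]any:
		for _, v := range typed {
			flatten(v, set)
		}
	}
}

func main() {
	set := make(map[ID]struct{})
	flatten(map[string]any{"tags": []any{ID(1), []ID{2, 3}}}, set)
	fmt.Println(len(set)) // 3
}
```

With the `[]any` case present, the recursive output of `convertLabelsToID` and the traversal in `flatten` handle the same shapes, which is the consistency the comment asks for.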


@zhenghaoz changed the title from "Optimize memory usage in LoadDataset: remove duplicate item storage" to "Remove duplicate items on data loading" on Apr 22, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

master/tasks.go:359

  • The PR description states that duplicate in-memory storage of items during loading is removed by collecting only item IDs and sorting/partitioning them. However, this code still appends full data.Item structs into items (and also stores the same items in dataSet via AddItem), so the duplicate storage and peak memory overhead remain. To achieve the intended savings, switch items to a []string (or []int32 indices) collected from batchItems, sort/split that slice, and fetch full items from dataSet.GetItems() only when calling non-personalized recommenders.
	items := make([]data.Item, 0, estimatedNumItems)
	itemLabelCount := make(map[string]int)
	itemLabelFirst := make(map[string]int32)
	itemLabelIndex := dataset.NewMapIndex()
	itemLabels := make([][]lo.Tuple2[int32, float32], 0, estimatedNumItems)
	itemEmbeddingIndexer := dataset.NewMapIndex()
	itemEmbeddingDimension := make([]map[int]int, 0)
	itemEmbeddings := make([][][]uint16, 0, estimatedNumItems)
	start = time.Now()
	itemChan, errChan := database.GetItemStream(newCtx, batchSize, itemTimeLimit)
	for batchItems := range itemChan {
		items = append(items, batchItems...)
		for _, item := range batchItems {
			dataSet.AddItem(item)


zhenghaoz and others added 2 commits April 22, 2026 23:04
@zhenghaoz changed the title from "Remove duplicate items on data loading" to "Reduce memory usage in dataset loading and item cach" on Apr 22, 2026
@zhenghaoz changed the title from "Reduce memory usage in dataset loading and item cach" to "Reduce memory usage in dataset loading and item cache" on Apr 22, 2026
@zhenghaoz zhenghaoz merged commit 6822196 into gorse-io:master Apr 22, 2026
11 of 13 checks passed
3 participants