Reduce memory usage in dataset loading and item cache #1252
zhenghaoz merged 8 commits into gorse-io:master
Conversation
Previously, the `LoadDataFromDatabase` function stored all items twice:
1. In the `items` variable for sorting and partitioning
2. In `dataSet.items` via `dataSet.AddItem()`

This optimization removes the duplicate storage by:
- Collecting only itemIds (strings) instead of full `data.Item` objects
- Sorting and partitioning itemIds directly
- Retrieving full item data from `dataSet.GetItems()` when needed

Memory savings: Each `data.Item` contains `Labels`, `Categories`, `Timestamp`, and other fields that can occupy hundreds of bytes. For 1M items, this optimization saves approximately 100-200 MB of memory during dataset loading.
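The collect-sort-partition pattern described above can be sketched in plain Go. `splitRanges` here is a hypothetical stand-in for gorse's internal `parallel.Split` helper; the real code sorts and partitions the same way but feeds the ranges to feedback scans.

```go
package main

import (
	"fmt"
	"sort"
)

// splitRanges partitions sorted IDs into k roughly equal chunks.
// Hypothetical stand-in for gorse's parallel.Split.
func splitRanges(ids []string, k int) [][]string {
	chunks := make([][]string, 0, k)
	size := (len(ids) + k - 1) / k
	for i := 0; i < len(ids); i += size {
		end := i + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[i:end])
	}
	return chunks
}

func main() {
	// Collect only item IDs (strings) instead of full data.Item structs.
	itemIds := []string{"c", "a", "d", "b"}
	sort.Strings(itemIds) // sort the IDs directly; no full items needed
	for i, chunk := range splitRanges(itemIds, 2) {
		fmt.Println(i, chunk) // 0 [a b], then 1 [c d]
	}
}
```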
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
##           master   #1252      +/-   ##
==========================================
- Coverage   72.90%  70.42%   -2.49%
==========================================
  Files          91      91
  Lines       16729   19708    +2979
==========================================
+ Hits        12197   13879    +1682
- Misses       3287    4615    +1328
+ Partials     1245    1214      -31
Pull request overview
This PR aims to reduce peak memory usage during Master.LoadDataFromDatabase by avoiding storing the full data.Item slice twice while still supporting item sorting/partitioning for feedback scans.
Changes:
- Replace in-memory `[]data.Item` accumulation with `[]string` (itemIds) collected during item loading.
- Sort and partition itemIds with `sort.Strings` + `parallel.Split` and update feedback scan ranges accordingly.
- Update non-personalized recommender ingestion to look up items from `dataSet.GetItems()` by item ID.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
logics/item_to_item.go:444
`convertLabelsToID` recursively converts nested `[]any` elements, but `flatten` (used by the tags-based recommenders) doesn't handle `[]any` at all. As a result, any tag IDs nested inside arrays will be silently ignored (and the recursive conversion work in `convertLabelsToID` is effectively wasted). Consider extending `flatten` to traverse `[]any` (and/or `[]string` if expected), or constraining `convertLabelsToID` to only produce structures that `flatten` can consume.
func flatten(o any, tSet mapset.Set[dataset.ID]) {
switch typed := o.(type) {
case dataset.ID:
tSet.Add(typed)
return
case []dataset.ID:
tSet.Append(typed...)
return
case map[string]any:
for _, v := range typed {
flatten(v, tSet)
}
}
}
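One way to address the review comment is to add a `[]any` case that recurses, mirroring the `map[string]any` branch. The sketch below is self-contained, so it substitutes a plain `map[ID]struct{}` for the `mapset.Set[dataset.ID]` used in gorse; only the added `[]any` case is the point.

```go
package main

import "fmt"

// ID stands in for dataset.ID in this self-contained sketch.
type ID int32

// flatten extended with a []any case so nested tag IDs are collected
// instead of silently ignored (the issue the review comment raises).
func flatten(o any, tSet map[ID]struct{}) {
	switch typed := o.(type) {
	case ID:
		tSet[typed] = struct{}{}
	case []ID:
		for _, id := range typed {
			tSet[id] = struct{}{}
		}
	case []any:
		// New case: recurse into nested arrays.
		for _, v := range typed {
			flatten(v, tSet)
		}
	case map[string]any:
		for _, v := range typed {
			flatten(v, tSet)
		}
	}
}

func main() {
	set := make(map[ID]struct{})
	// IDs nested inside []any are now reached.
	flatten(map[string]any{"tags": []any{ID(1), []ID{2, 3}}}, set)
	fmt.Println(len(set)) // 3
}
```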
…d index references and optimizing data handling
…ing item slice and releasing memory
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
master/tasks.go:359
- The PR description states that duplicate in-memory storage of items during loading is removed by collecting only item IDs and sorting/partitioning them. However, this code still appends full `data.Item` structs into `items` (and also stores the same items in `dataSet` via `AddItem`), so the duplicate storage and peak memory overhead remain. To achieve the intended savings, switch `items` to a `[]string` (or `[]int32` indices) collected from `batchItems`, sort/split that slice, and fetch full items from `dataSet.GetItems()` only when calling non-personalized recommenders.
items := make([]data.Item, 0, estimatedNumItems)
itemLabelCount := make(map[string]int)
itemLabelFirst := make(map[string]int32)
itemLabelIndex := dataset.NewMapIndex()
itemLabels := make([][]lo.Tuple2[int32, float32], 0, estimatedNumItems)
itemEmbeddingIndexer := dataset.NewMapIndex()
itemEmbeddingDimension := make([]map[int]int, 0)
itemEmbeddings := make([][][]uint16, 0, estimatedNumItems)
start = time.Now()
itemChan, errChan := database.GetItemStream(newCtx, batchSize, itemTimeLimit)
for batchItems := range itemChan {
	items = append(items, batchItems...)
	for _, item := range batchItems {
		dataSet.AddItem(item)
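The fix the review asks for — keeping a single authoritative copy inside the dataset and only a lightweight ID slice alongside it — can be sketched as follows. The `Item`, `DataSet`, and `load` types here are simplified stand-ins for gorse's `data.Item` and the real streaming loop, not the actual API.

```go
package main

import "fmt"

// Simplified stand-ins for data.Item and the dataset (assumptions).
type Item struct{ ItemId string }

type DataSet struct{ items []Item }

func (d *DataSet) AddItem(i Item)    { d.items = append(d.items, i) }
func (d *DataSet) GetItems() []Item  { return d.items }

// load stores each full item once in the dataset and keeps only its
// string ID for later sorting/partitioning, avoiding the duplicate slice.
func load(batches [][]Item) (*DataSet, []string) {
	dataSet := &DataSet{}
	itemIds := make([]string, 0)
	for _, batch := range batches {
		for _, item := range batch {
			dataSet.AddItem(item)                  // single authoritative copy
			itemIds = append(itemIds, item.ItemId) // lightweight ID only
		}
	}
	return dataSet, itemIds
}

func main() {
	ds, ids := load([][]Item{{{ItemId: "a"}, {ItemId: "b"}}, {{ItemId: "c"}}})
	fmt.Println(len(ds.GetItems()), len(ids)) // 3 3
}
```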
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Problem
During dataset loading in `LoadDataFromDatabase`, all items were stored twice:
1. In the `items` variable (full `data.Item` objects) for sorting and partitioning
2. In `dataSet.items` via `dataSet.AddItem()`

This caused significant memory overhead, especially for large datasets.
Solution
This PR removes the duplicate storage by:
- Collecting only itemIds (strings) instead of full `data.Item` objects
- Sorting and partitioning itemIds directly using `sort.Strings()`
- Retrieving full item data from `dataSet.GetItems()` when needed for non-personalized recommenders

Memory Savings
Each `data.Item` contains `Labels`, `Categories`, `Timestamp`, `Comment`, and other fields. A typical item can occupy 100-200 bytes. For a dataset with 1 million items:
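As a rough sanity check on the 100-200 MB figure, the arithmetic is just item count times the duplicated per-item footprint. A minimal sketch (the `savingsMB` helper is hypothetical, for illustration only):

```go
package main

import "fmt"

// savingsMB estimates memory saved by dropping the duplicate item slice:
// numItems copies of roughly bytesPerItem each, expressed in MiB.
func savingsMB(numItems, bytesPerItem int) float64 {
	return float64(numItems*bytesPerItem) / (1 << 20)
}

func main() {
	// 1M items at the midpoint estimate of 150 bytes each.
	fmt.Printf("%.0f MB\n", savingsMB(1_000_000, 150)) // 143 MB
}
```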
Changes
- `master/tasks.go`: Changed `var items []data.Item` to `var itemIds []string`
- Replaced `items = append(items, batchItems...)` with `itemIds = append(itemIds, item.ItemId)` in the item loading loop
- Changed `sort.Slice(items, ...)` to `sort.Strings(itemIds)`
- Renamed `itemGroups` to `itemIdGroups`
- Updated `recommender.Push()` calls to use `dataSet.GetItems()[itemIdx]`

Testing