Create LakeFS commits RDD directly without using an input format #9657
Conversation
Garbage collection (and every other caller) uses LakeFSContext.newRDD to create
the "ranges RDD". Creating it explicitly with Spark operators lets Spark
parallelize reading all metaranges and ranges.
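For orientation, here is a rough sketch (not the PR's exact code) of what "creating the RDD explicitly with Spark operators" looks like; `rangePaths` and `readRange` are hypothetical stand-ins for the real helpers in LakeFSContext:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical sketch: build the ranges RDD directly instead of going through a
// Hadoop InputFormat.  `rangePaths` (ranges discovered from the metaranges) and
// `readRange` (opens one range file and yields its entries) are stand-ins.
def rangesRDD[T: ClassTag](sc: SparkContext, rangePaths: Seq[String], parallelism: Int)(
    readRange: String => Iterator[T]
): RDD[T] =
  sc.parallelize(rangePaths, parallelism)             // spread range IDs across many tasks
    .mapPartitions(paths => paths.flatMap(readRange)) // each task reads its own ranges
```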
## How much faster?
I have a small repo with many small commits. I enabled GC for it. Here are
summaries from two sample mark-only runs.
### Direct RDD (this code)
Runtime: 2m33s
```json
{
"run_id": "g4uk6erfnfus73frbnqg",
"success": true,
"first_slice": "g5adr8f5pvec73cpia80",
"start_time": "2025-11-10T10:34:37.245361091Z",
"cutoff_time": "2025-11-10T04:34:37.243Z",
"num_deleted_objects": 147942
}
```
### File format RDD (previous code)
Runtime: 3m52s
```json
{
"run_id": "g4uinaarakss73aoeel0",
"success": true,
"first_slice": "g5adr8f5pvec73cpia80",
"start_time": "2025-11-10T12:15:11.097697745Z",
"cutoff_time": "2025-11-10T06:15:11.096Z",
"num_deleted_objects": 147942
}
```
### Summary
- The same number of objects were marked for deletion.
- The _same_ objects were marked for deletion on both.
- The new code takes about 0.65× the time of the old code (2m33s vs. 3m52s).
Pull Request Overview
This pull request refactors the LakeFSContext.newRDD method to directly process ranges using Spark RDD operations instead of using Hadoop's InputFormat API. The changes aim to simplify the data loading pipeline by bypassing the InputFormat layer.
Key changes:
- Replaces `sc.newAPIHadoopRDD` with direct RDD operations using `mapPartitions`
- Makes the `Range` class public and serializable for use across Spark operations
- Introduces direct file handling and SSTableReader creation in RDD transformations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| clients/spark/src/main/scala/io/treeverse/clients/LakeFSInputFormat.scala | Makes Range class public and serializable to support cross-partition serialization in Spark RDDs |
| clients/spark/src/main/scala/io/treeverse/clients/LakeFSContext.scala | Refactors newRDD to use direct RDD operations instead of InputFormat, processing ranges and entries through mapPartitions transformations |
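As a side note on why the `Range` change is needed: anything Spark ships to executors inside a closure, or shuffles between partitions, must be serializable, otherwise the job fails at runtime with a "Task not serializable" error. A minimal illustration (the field names are made up, not the actual `Range` definition):

```scala
// Illustration only; the real Range is defined in LakeFSInputFormat.scala.
// A plain class used inside RDD transformations must extend Serializable,
// or Spark throws NotSerializableException when it ships tasks to executors.
class Range(val id: String, val estimatedSize: Long) extends Serializable
```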
```scala
val localFile = File.createTempFile("lakefs.", ".range")
fs.copyToLocalFile(false, path, new Path(localFile.getAbsolutePath), true)
val companion = Entry.messageCompanion
val sstableReader = new SSTableReader(localFile.getAbsolutePath, companion, true)
// TODO(ariels): Do we need to validate that this reader is good? Assume _not_, this is
// not InputFormat code so it should have slightly nicer error reports.
sstableReader.newIterator().map((entry) => (entry.key, new WithIdentifier(entry.id, entry.message, range.id)))
```
Copilot AI commented on Nov 10, 2025
The temporary file and SSTableReader are not properly cleaned up, leading to resource leaks. The TODO comment acknowledges this issue. Similar to the pattern used in EntryRecordReader (lines 95-98 in LakeFSInputFormat.scala), you should register a task completion listener to delete the temporary file:
```scala
val localFile = File.createTempFile("lakefs.", ".range")
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => localFile.delete()))
```

Additionally, the sstableReader should be closed when the task completes. Consider adding:

```scala
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => sstableReader.close()))
```
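For context only, a hedged sketch of how both cleanups might sit together in the per-partition code, reusing the names from the diff above (not necessarily how the fix was ultimately applied):

```scala
// Sketch: register cleanup of both the temporary copy and the reader on task completion.
val localFile = File.createTempFile("lakefs.", ".range")
fs.copyToLocalFile(false, path, new Path(localFile.getAbsolutePath), true)
val sstableReader = new SSTableReader(localFile.getAbsolutePath, Entry.messageCompanion, true)
Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit] { _ =>
  sstableReader.close() // close the reader first...
  localFile.delete()    // ...then remove the local copy it was reading
})
sstableReader.newIterator().map(entry => (entry.key, new WithIdentifier(entry.id, entry.message, range.id)))
```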
Wow, nice one Copilot!
Plan before pulling:
All errors during closing are _logged_ but do not fail the task: these are readonly objects, so bad closes can do no more than leak (on "reasonable" systems). Flagged by **Copilot**, hurrah for verifiable actionable suggestions!
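A minimal sketch of that "log, don't fail" close pattern, assuming an SLF4J logger (the client's actual logging setup may differ):

```scala
import org.slf4j.LoggerFactory

object CloseQuietly {
  private val logger = LoggerFactory.getLogger(getClass)

  // Sketch: close a read-only resource; a failed close can only leak, so log and continue.
  def apply(what: String)(close: => Unit): Unit =
    try close
    catch {
      case e: Exception => logger.warn(s"failed to close $what (ignored)", e)
    }
}

// e.g. in a task completion listener:
//   CloseQuietly("sstableReader") { sstableReader.close() }
```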
Read objects in parallel:
- from directory listing;
- from commits.

Default parallelism is no good for either of these, because it is based on # of CPUs - and we want a _lot_ more. New configuration option `lakefs.job.range_read_parallelism` configures this parallelism.
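A hedged sketch of how such a knob can be read and applied; `spark.hadoop.*` properties landing in the Hadoop configuration is standard Spark behaviour, but the fallback default shown here is only an assumption:

```scala
import org.apache.spark.SparkContext

// Sketch: read the parallelism knob from the Hadoop configuration (Spark copies
// spark.hadoop.* properties there) and use it explicitly, because defaultParallelism
// tracks the number of cores -- far too low for reading many small ranges.
def rangeReadParallelism(sc: SparkContext): Int =
  sc.hadoopConfiguration.getInt("lakefs.job.range_read_parallelism",
                                sc.defaultParallelism) // assumed fallback

// val ranges = sc.parallelize(rangePaths, rangeReadParallelism(sc))
```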
```diff
   .except(committedDF)
   .except(uncommittedDF)
-  .cache()
+  .persist(StorageLevel.MEMORY_AND_DISK)
```
Can you explain the difference between `persist` and `cache`?
Sure. (Linking to PySpark docs for a new version because it's easiest to find these online. But it's been like this for... ever.) Man says:
> Persist this RDD with the default storage level (`MEMORY_ONLY`).
So it only works for small RDDs. But we mostly care about large RDDs.
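A toy sketch of the distinction for RDDs (where `cache()` really is just `persist(StorageLevel.MEMORY_ONLY)`), with `MEMORY_AND_DISK` spilling partitions to local disk instead of dropping and recomputing them:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Toy data only: cache() == persist(MEMORY_ONLY), so partitions that don't fit in
// memory are dropped and recomputed; MEMORY_AND_DISK spills them to disk instead.
def demo(sc: SparkContext): Unit = {
  val small = sc.parallelize(1 to 1000).cache() // persist(StorageLevel.MEMORY_ONLY)
  val large = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)
  println((small.count(), large.count()))       // materialize both
}
```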
Personally I think that if you have "persist", and you name a shortcut to it "cache", then you have
arielshaqed
left a comment
PTAL. @N-o-Z - you're probably it!
N-o-Z
left a comment
LGTM
Thanks! Pulling. I believe the failing test (which is not required anyway) failed for spurious flakiness - it is on the UI and not on lakeFSFS.
Avoid input format
Parallelize object listing
The never-ending run of #9649 manages to finish listing, but still does not end (ran for 11 hours). That's because it also lists objects - and in practice Spark did not parallelize this. Explicitly parallelize it.
Results
I can finish the mark portion of the run in 2 hours (and a few seconds change) by configuring `--conf spark.hadoop.lakefs.job.range_read_parallelism=256` and running on a smaller but still fairly large EMR serverless cluster (500 vCPUs, memory and disk like they were going out of fashion).

Closes #9649.
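For completeness, the same setting expressed programmatically on the driver (a sketch; the `spark.hadoop.` prefix is what routes the property into the Hadoop configuration that the job reads):

```scala
import org.apache.spark.SparkConf

// Equivalent to the --conf flag above, set on the SparkConf before the session starts.
val conf = new SparkConf()
  .set("spark.hadoop.lakefs.job.range_read_parallelism", "256")
```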