What
When Spark GC works on committed data, it calls lakeFS prepare GC commits.¹ That gives a CSV file, which the driver reads. It then produces the set of all ranges which appear in any of the commits in that CSV file.
Unfortunately, this appears to happen in driver memory, using Java arrays rather than Scala streaming, with no Spark in sight. That (probably) causes timeouts on a large customer repository, and it also wastes driver memory.
Instead, parallelize this part in Spark as well!
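For illustration only, here is a minimal sketch of how the prepared-commits CSV could be read through Spark so that the commit IDs start out distributed rather than sitting in driver memory; the path, column name, and session setup are assumptions, not the actual layout lakeFS produces:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: load the prepared-commits CSV as a distributed Dataset
// instead of reading it entirely on the driver.
val spark = SparkSession.builder().appName("lakefs-gc-ranges").getOrCreate()
import spark.implicits._

// The path and the "commit_id" column name are placeholders, not the real layout.
val commitIDsDS = spark.read
  .option("header", "true")
  .csv("s3a://storage-namespace/_lakefs/prepared-gc-commits.csv")
  .select($"commit_id")
  .as[String]
```

From there, the per-commit range expansion shown under Details could run as a distributed transformation instead of a driver-side loop.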
Details
val ranges = commitIDs
  .flatMap(commitID => {
    val metaRangeURL = apiClient.getMetaRangeURL(repoName, commitID)
    if (metaRangeURL == "") {
      // a commit with no meta range is an empty commit.
      // this only happens for the first commit in the repository.
      None
    } else {
      val rangesReader = metarangeReaderGetter(job.getConfiguration, metaRangeURL, true)
      read(rangesReader).map(rd => new Range(new String(rd.id), rd.message.estimatedSize))
    }
  })
  .toSet

This code is strange: commitIDs is a Java array, which we get from the Hadoop configuration with the getStrings method. So the result of flatMap is also a Java array, computed in the driver, and flattened to a set only after it has all been generated.
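One possible shape for the fix, keeping the per-commit logic above but letting Spark fan it out over executors. This is only a sketch: it reuses apiClient, repoName, metarangeReaderGetter, read, and Range from the snippet, assumes they can be used (or recreated) on executors and that Range is serializable with sensible equality, and uses hadoopConfValue as a stand-in for a Hadoop configuration shipped to executors (e.g. a broadcast serializable wrapper), since job.getConfiguration itself is not serializable:

```scala
// Sketch: distribute the per-commit work instead of flat-mapping a Java array on the driver.
// commitIDs is still the Array[String] taken from the Hadoop configuration with getStrings.
val ranges = spark.sparkContext
  .parallelize(commitIDs.toSeq)
  .flatMap { commitID =>
    // apiClient and metarangeReaderGetter are assumed to be usable on executors here.
    val metaRangeURL = apiClient.getMetaRangeURL(repoName, commitID)
    if (metaRangeURL == "") {
      // a commit with no meta range is an empty commit
      // (this only happens for the first commit in the repository)
      Seq.empty[Range]
    } else {
      // hadoopConfValue: placeholder for a configuration made available on executors
      val rangesReader = metarangeReaderGetter(hadoopConfValue, metaRangeURL, true)
      read(rangesReader)
        .map(rd => new Range(new String(rd.id), rd.message.estimatedSize))
        .toSeq
    }
  }
  .distinct() // replaces the driver-side .toSet with a distributed deduplication
```

With this shape the driver never materializes the full list of ranges; it only sees whatever is collected or written out at the end.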
Footnotes
1. That itself can be slow, see Synchronous prepare GC commits API causes timeouts for large repositories #9648, but that is a different issue!