MAPREDUCE-7508. FileInputFormat can throw ArrayIndexOutOfBoundsException because of some concurrent execution. #7859


Open
wants to merge 2 commits into
base: trunk

liangyu-1


Description of PR

As described in MAPREDUCE-7508:

When Spark scans for files, it uses the FileInputFormat.getSplits() method to split each file. The first step in getSplits is to retrieve the file's length. If the file length is not zero, the next step is to get the block locations array for that file. However, if two upstream programs rapidly create and write to the same file (i.e., the file is overwritten or appended to almost simultaneously), a race condition may occur:

The file's length is already non-zero,

but calling getFileBlockLocations() returns an empty array because the file is being overwritten or is not yet fully written.

When this happens, subsequent logic in getSplits (such as accessing the last element of the block locations array) will throw an ArrayIndexOutOfBoundsException because the block locations array is unexpectedly empty.
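
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Hadoop's actual code: `Block` and `EmptyBlocksDemo.blockIndex` are hypothetical stand-ins, modeled loosely on a lookup that falls back to the last array element when no block contains the requested offset. With an empty array, that fallback indexes position -1.

```java
// Simplified stand-in for a block location; NOT Hadoop's BlockLocation class.
final class Block {
    final long offset;
    final long length;

    Block(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }
}

final class EmptyBlocksDemo {
    // Hypothetical lookup: returns the index of the block that contains the offset.
    static int blockIndex(Block[] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i].offset <= offset && offset < blocks[i].offset + blocks[i].length) {
                return i;
            }
        }
        // The last element is read to build the error message.
        // For an empty array this is index -1, which throws before the message is built.
        Block last = blocks[blocks.length - 1];
        long lastByte = last.offset + last.length - 1;
        throw new IllegalArgumentException(
            "Offset " + offset + " is outside of file (0.." + lastByte + ")");
    }

    public static void main(String[] args) {
        long length = 128L;             // assumed stale, non-zero length from an earlier call
        Block[] blocks = new Block[0];  // assumed empty result from a later call
        try {
            blockIndex(blocks, 0L);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("length=" + length + " but no blocks: "
                + e.getClass().getSimpleName());
        }
    }
}
```

The point of the sketch is that the exception comes from the error-reporting path, so it only surfaces when the length and the block list disagree.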

How was this patch tested?

I rebuilt the project and ran it on our cluster; Spark did not throw any exceptions.

For code changes:

If the blkLocations array is empty, the loop continues to the next iteration, so it no longer tries to find the last block location of this file.

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?
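
The guard described under "For code changes" can be sketched as follows. This is a simplified model under stated assumptions, not the patched method: `FileEntry`, `planSplits`, and the `path:offset+size` string format are hypothetical, and the real split-size logic is reduced to fixed-size chunks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a listed file; not a Hadoop type.
final class FileEntry {
    final String path;
    final long length;     // length reported when the file was listed
    final int blockCount;  // number of block locations returned afterwards

    FileEntry(String path, long length, int blockCount) {
        this.path = path;
        this.length = length;
        this.blockCount = blockCount;
    }
}

final class SkipEmptyBlocksDemo {
    // Plans fixed-size splits, skipping files whose block list is empty
    // even though their reported length is non-zero.
    static List<String> planSplits(List<FileEntry> files, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<String> splits = new ArrayList<>();
        for (FileEntry file : files) {
            if (file.length != 0) {
                if (file.blockCount == 0) {
                    continue; // the guard: nothing to index into, move to the next file
                }
                long offset = 0;
                while (offset < file.length) {
                    long size = Math.min(splitSize, file.length - offset);
                    splits.add(file.path + ":" + offset + "+" + size);
                    offset += size;
                }
            } else {
                splits.add(file.path + ":0+0"); // empty file, empty split
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<FileEntry> files = Arrays.asList(
            new FileEntry("a", 10L, 1),   // consistent metadata
            new FileEntry("b", 10L, 0),   // stale length, no blocks
            new FileEntry("c", 0L, 0));   // genuinely empty
        System.out.println(planSplits(files, 4L));
    }
}
```

In this model the inconsistent file `b` contributes no splits at all, which matches the "continue to the next iteration" behavior described in the PR text.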

@yangjiandan

yangjiandan commented Aug 8, 2025

LGTM!
The submitted patch clearly addresses a rare but potentially hazardous issue where FileInputFormat could throw an ArrayIndexOutOfBoundsException under concurrent execution. It's a focused and well-implemented fix and seems necessary to improve the stability of the HDFS component. @slfan1989 Could you help take a look at this PR as well?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 44m 46s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 39s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 43s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 34s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 26s trunk passed
+1 💚 shadedclient 41m 19s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 32s the patch passed
+1 💚 compile 0m 35s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 35s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 31s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 31s the patch passed
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 javadoc 0m 21s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 20s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 23s the patch passed
+1 💚 shadedclient 41m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 8m 50s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
148m 39s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/2/artifact/out/Dockerfile
GITHUB PR #7859
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 0d28eceec116 5.15.0-144-generic #157-Ubuntu SMP Mon Jun 16 07:33:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / d2ba001
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/2/testReport/
Max. process+thread count 1078 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@@ -369,6 +369,9 @@ public InputSplit[] getSplits(JobConf job, int numSplits)
} else {
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (blkLocations.length == 0){

Thanks for your contribution. I just wonder how we can hit length != 0 while blkLocations.length == 0. Is there some corner case that leads to inconsistent metadata in the NameNode?


@Hexiaoqiao thanks for the review.

It is a rare corner case that only happens when Spark Streaming monitors a path where the upstream system sometimes starts two identical tasks that attempt to create and write to the same HDFS file simultaneously. This can lead to conflicts where a file is created and written to twice in quick succession.

When Spark scans for files, it uses the FileInputFormat.getSplits() method to split the file. The first step in getSplits is to retrieve the file's length. If the file length is not zero, the next step is to get the block locations array for that file. However, if the two upstream programs rapidly create and write to the same file (i.e., the file is overwritten or appended to almost simultaneously), a race condition may occur:

The file's length is already non-zero, but calling getFileBlockLocations() returns an empty array because the file is being overwritten or is not yet fully written.


Thanks, got it. Makes sense to me. +1 to check in.
BTW, the root cause here is that listStatus is invoked to get the file status and a separate interface, getFileBlockLocations, is invoked to get the block locations, but the file has changed between these two steps, right? If so, would it be more appropriate to use only blkLocations as the condition at L364, rather than file.length, which may not be the correct value here? Thanks again.
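
The two-step read described in this comment can be illustrated with a deterministic, single-threaded simulation. Everything here is an assumption made for illustration: `FakeFileSystem` is a hypothetical in-memory stand-in, and the idea that re-creating a file briefly leaves it with no blocks is a model of the reported behavior, not a statement about HDFS internals.

```java
// Hypothetical in-memory stand-in for the file system; not a Hadoop class.
final class FakeFileSystem {
    private long length = 128L;
    private int blockCount = 1;

    long fileLength() {          // plays the role of the status lookup
        return length;
    }

    int blockLocationCount() {   // plays the role of the block-location lookup
        return blockCount;
    }

    void recreateFile() {        // a second writer re-creates the file
        length = 0L;
        blockCount = 0;
    }
}

final class StaleLengthDemo {
    // Reads the length, lets a writer run, then reads the block count.
    // Returns { observedLength, observedBlockCount }.
    static long[] observe(FakeFileSystem fs, Runnable concurrentWriter) {
        long length = fs.fileLength();          // step 1: length is captured here
        concurrentWriter.run();                 // the file changes between the two steps
        long blocks = fs.blockLocationCount();  // step 2: sees the newer state
        return new long[] { length, blocks };
    }

    public static void main(String[] args) {
        FakeFileSystem fs = new FakeFileSystem();
        long[] seen = observe(fs, fs::recreateFile);
        System.out.println("length=" + seen[0] + ", blocks=" + seen[1]);
    }
}
```

Because the two reads are not atomic, the caller ends up holding a length from the old file and a block list from the new one.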



Thanks for your review. Using only blkLocations as the condition at L364 is a good solution: if the blkLocations array is empty, the file can be treated as empty as well, so this will not affect the later operations.
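
One possible reading of this suggestion is sketched below. It only compares the two branch conditions; `SplitConditionDemo` and its method names are hypothetical, and how the real method would handle the non-splitting branch is not shown in this conversation.

```java
final class SplitConditionDemo {
    // Length-driven condition: trusts the length captured by the earlier listing.
    static boolean splitByLength(long statusLength, int blockCount) {
        return statusLength != 0;
    }

    // Block-driven condition: trusts only the block list fetched afterwards.
    static boolean splitByBlocks(long statusLength, int blockCount) {
        return blockCount != 0;
    }

    public static void main(String[] args) {
        // Each row is { statusLength, blockCount }.
        long[][] cases = {
            { 128L, 1L },  // consistent, non-empty file
            { 0L, 0L },    // consistent, empty file
            { 128L, 0L },  // stale length, empty block list
        };
        for (long[] c : cases) {
            int blocks = (int) c[1];
            System.out.println("length=" + c[0] + " blocks=" + blocks
                + " byLength=" + splitByLength(c[0], blocks)
                + " byBlocks=" + splitByBlocks(c[0], blocks));
        }
    }
}
```

The two conditions agree whenever the metadata is consistent and differ only in the stale case, where the block-driven check avoids entering the splitting branch with an empty array.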

@liangyu-1 liangyu-1 requested a review from Hexiaoqiao August 22, 2025 09:54