[FLINK-39778][s3] Recoverable writer silently loses the in-flight tail on resume by Samrat002 · Pull Request #28268 · apache/flink

Samrat002 · 2026-05-27T13:49:13Z

What is the purpose of the change

NativeS3RecoverableWriter.recover() silently discarded the sub-part-size tail that persist() had durably uploaded to S3 as a side object. After a crash-and-restore cycle, any bytes written since the last full-part boundary were permanently lost, violating Flink's exactly-once guarantee.

This patch fixes the data loss by downloading the side object during recover() and seeding the resumed output stream with those bytes before accepting further writes.

Brief change log

recover() in NativeS3RecoverableWriter now downloads the incomplete-tail side object and seeds the resumed stream with those bytes before accepting new writes. A downloadIncompleteTail() helper validates the length and cleans up the local file on failure. NativeS3RecoverableFsDataOutputStream gains a resume constructor that opens the seed file in append mode so position accounting is correct from the start.
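
The change log above can be pictured with a minimal, self-contained sketch. It is not the PR's code: `TailRecoverySketch`, the in-memory `STORE` map standing in for S3, and the `ResumedStream` class are hypothetical names invented for illustration. Only the behavior mirrors the description: download the tail, validate its length, delete the local file on failure, and open the seed file in append mode so the reported position starts past the seeded bytes.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the recover path described above; not the Flink classes. */
class TailRecoverySketch {

    /** Toy object store (key -> bytes) standing in for S3. Assumption for illustration. */
    static final Map<String, byte[]> STORE = new HashMap<>();

    /** Downloads the tail object to a temp file and validates its length. */
    static File downloadIncompleteTail(String key, long expectedLength) throws IOException {
        File target = File.createTempFile("incomplete-tail-", ".tmp");
        try {
            byte[] data = STORE.get(key);
            if (data == null) {
                throw new IOException("Incomplete-tail object " + key + " not found");
            }
            Files.write(target.toPath(), data);
            if (target.length() != expectedLength) {
                throw new IOException(
                        "Incomplete-tail object " + key + " has length " + target.length()
                                + ", expected " + expectedLength);
            }
            return target;
        } catch (IOException e) {
            target.delete(); // best-effort cleanup of the local file on failure
            throw e;
        }
    }

    /** Resumed stream: appends to the seed file and tracks the logical position. */
    static final class ResumedStream implements AutoCloseable {
        private final FileOutputStream out;
        private long pos;

        ResumedStream(long numBytesInParts, File seed) throws IOException {
            this.out = new FileOutputStream(seed, true); // append mode keeps the seeded bytes
            this.pos = numBytesInParts + seed.length();  // position starts after parts + tail
        }

        void write(byte[] bytes) throws IOException {
            out.write(bytes);
            pos += bytes.length;
        }

        long getPos() {
            return pos;
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        STORE.put("tail-0", "abc".getBytes(StandardCharsets.UTF_8));
        File seed = downloadIncompleteTail("tail-0", 3);
        try (ResumedStream stream = new ResumedStream(10L, seed)) {
            stream.write("de".getBytes(StandardCharsets.UTF_8));
            System.out.println("position = " + stream.getPos()); // 10 + 3 + 2 = 15
        }
        Files.deleteIfExists(seed.toPath());
    }
}
```

Opening the seed file without append mode would truncate it, which is the same kind of silent tail loss the PR sets out to fix.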

Verifying this change

Added unit tests that reproduce the bug and verify the fix.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no) no
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no) no
The serializers: (yes / no / don't know) no
The runtime per-record code paths (performance sensitive): (yes / no / don't know) no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know) no
The S3 file system connector: (yes / no / don't know) yes

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)

flinkbot · 2026-05-27T13:55:19Z

CI report:

407ebc4 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

Izeren

I haven't reviewed all the tests in detail yet, but I would first like to better understand the logic around partially uploaded subparts. My initial impression from the FLIP was that we would store incomplete parts inline in state and treat part upload as an atomic operation. Did we choose not to do that? I ask because I am not sure about the reliability of incomplete parts. Are they subject to the same lifecycle policies as incomplete MPUs, or to a different policy?

Izeren · 2026-05-29T16:33:52Z

+                    s3AccessHelper.getObject(s3recoverable.incompleteObjectName(), target);
+            if (downloaded != s3recoverable.incompleteObjectLength()) {
+                throw new IOException(
+                        "Incomplete-tail object "


This exception doesn't say what the implications are. Does it mean that the state is corrupted and can't be recovered unless the object on S3 is restored? If so, it would be useful to explain that. It would help both the on-call engineer and anyone classifying such errors (retriable/non-retriable).
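
As a rough illustration of this suggestion, the sketch below builds a length-mismatch exception whose message states a consequence and a next step. The class and method names are hypothetical. The stated implication (that the restore keeps failing until the object is restored) is an assumption taken from the question above, which the thread does not confirm.

```java
import java.io.IOException;

/** Hypothetical helper; the wording of the implication is an unconfirmed assumption. */
class TailLengthMismatchSketch {

    static IOException lengthMismatch(String objectName, long expected, long actual) {
        return new IOException(
                "Incomplete-tail object " + objectName
                        + " has length " + actual
                        + " but the recoverable expects " + expected + ". "
                        // Assumed implication, not confirmed in the review thread:
                        + "The side object no longer matches the checkpoint metadata, so "
                        + "restoring from this checkpoint is expected to keep failing until "
                        + "the object is restored. Treat this error as non-retriable.");
    }

    public static void main(String[] args) {
        System.out.println(lengthMismatch("bucket/path/part.tail", 5L, 3L).getMessage());
    }
}
```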

Izeren · 2026-05-29T16:49:12Z

+ * <p><b>Thread safety:</b> not thread-safe. Use a single thread per instance, matching the
+ * single-thread invariant of the production {@link NativeS3RecoverableFsDataOutputStream}.
+ */
+public final class InMemoryNativeS3Operations extends NativeS3ObjectOperations {


If this is meant to be used as a test harness for FileSystem testing (as a replacement for LocalStack), then arguably it should have tests of its own too.

Samrat002 · 2026-06-04T13:14:34Z

I haven't reviewed all the tests in detail yet, but I would first like to better understand the logic around partially uploaded subparts. My initial impression from the FLIP was that we would store incomplete parts inline in state and treat part upload as an atomic operation. Did we choose not to do that? I ask because I am not sure about the reliability of incomplete parts. Are they subject to the same lifecycle policies as incomplete MPUs, or to a different policy?

Yes, you are right. That was initially discussed. What we observe at production scale is that users don't really set lifecycle policies. Billions of MPUs accumulate, leading to high cost.

Izeren · 2026-06-05T17:14:48Z

Yes, you are right. That was initially discussed. What we observe at production scale is that users don't really set lifecycle policies. Billions of MPUs accumulate, leading to high cost.

Could you please elaborate on how storing subparts in the state is linked to the billing problem? Don't aborted MPUs introduce the same dangling S3 objects?

My general question was more about why we store subparts as separate tail files on S3 to resume from. Are they as good as inline Flink state in terms of data corruption risk?

Izeren · 2026-06-05T17:20:03Z

+    @TempDir java.nio.file.Path tmp;
+
+    @Test
+    void persistThenRecoverPreservesTailBytes() throws Exception {


Should we also test that data written after the last persist is discarded on recovery? Otherwise we are not guaranteeing exactly-once.

And probably test that the 2nd recovery attempt can still read the side part?
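
The two properties asked for here can be sketched against a toy model. Nothing below is the PR's test code: `RecoverySemanticsSketch`, its `Writer`, `Recoverable`, and the in-memory `SIDE_OBJECTS` map are hypothetical. The model also assumes that recovery leaves the side object in place, which is what the second-recovery check depends on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy model of persist/recover semantics; all names are hypothetical. */
class RecoverySemanticsSketch {

    /** In-memory stand-in for the S3 side objects. */
    static final Map<String, byte[]> SIDE_OBJECTS = new HashMap<>();

    /** What persist() hands back: where the tail lives and how long it is. */
    static final class Recoverable {
        final String tailKey;
        final int tailLength;

        Recoverable(String tailKey, int tailLength) {
            this.tailKey = tailKey;
            this.tailLength = tailLength;
        }
    }

    static final class Writer {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final String tailKey;

        Writer(String tailKey) {
            this.tailKey = tailKey;
        }

        void write(String s) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            buffer.write(bytes, 0, bytes.length);
        }

        /** Durably stores everything written so far as a side object. */
        Recoverable persist() {
            byte[] snapshot = buffer.toByteArray();
            SIDE_OBJECTS.put(tailKey, snapshot);
            return new Recoverable(tailKey, snapshot.length);
        }

        String contents() {
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Seeds a new writer from the side object; assumed not to delete the object. */
    static Writer recover(Recoverable r) {
        byte[] tail = SIDE_OBJECTS.get(r.tailKey);
        if (tail == null || tail.length != r.tailLength) {
            throw new IllegalStateException("Tail " + r.tailKey + " is missing or has the wrong length");
        }
        Writer resumed = new Writer(r.tailKey + "-resumed");
        resumed.buffer.write(tail, 0, tail.length);
        return resumed;
    }

    public static void main(String[] args) {
        Writer writer = new Writer("tail-0");
        writer.write("abc");
        Recoverable checkpoint = writer.persist();
        writer.write("xyz"); // written after the last persist, must not survive recovery

        System.out.println(recover(checkpoint).contents()); // abc
        System.out.println(recover(checkpoint).contents()); // abc (second attempt)
    }
}
```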

Samrat002 · 2026-06-05T17:29:33Z

Yes, you are right. That was initially discussed. What we observe at production scale is that users don't really set lifecycle policies. Billions of MPUs accumulate, leading to high cost.

Could you please elaborate on how storing subparts in the state is linked to the billing problem? Don't aborted MPUs introduce the same dangling S3 objects?

My general question was more about why we store subparts as separate tail files on S3 to resume from. Are they as good as inline Flink state in terms of data corruption risk?

My bad, I misunderstood and correlated different things.

Two reasons I went with S3 objects over inlining in state:
1. Checkpoint cost. Tails can be up to the part size (5 MiB or more, often larger). Inlining one per writer per checkpoint inflates the checkpoint payload that goes through the JM/state backend. At scale that's a real cost compared with a single S3 PUT.
2. Durability is the same. The state backend is usually S3 too, so a tail object gives us the same 11 nines of durability either way. State doesn't strengthen the guarantee, it just shifts where the bytes live.

On lifecycle: tail objects live under deterministic keys that we own, and deletion is driven by our commit/abort/recovery path, not by a bucket lifecycle policy. So cleanup is as predictable as state GC, without the checkpoint-size hit.
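
To put a rough number on the checkpoint-cost argument, here is a back-of-the-envelope calculation. The figure of 1,000 concurrent writers is an assumption chosen purely for illustration. The 5 MiB tail size comes from the comment above, and the helper name is hypothetical.

```java
/** Worked arithmetic only; the writer count is an illustrative assumption. */
class InlineTailCostSketch {

    static final long MIB = 1024L * 1024L;

    /** Worst-case bytes added to one checkpoint if every writer inlines its tail. */
    static long inlinedBytesPerCheckpoint(int writers, long tailBytes) {
        return Math.multiplyExact((long) writers, tailBytes);
    }

    public static void main(String[] args) {
        long bytes = inlinedBytesPerCheckpoint(1_000, 5 * MIB);
        System.out.println(bytes / MIB + " MiB per checkpoint"); // 5000 MiB per checkpoint
    }
}
```

Under those assumptions, inlining adds about 5,000 MiB to every checkpoint, whereas the side-object approach keeps only an object name and a length per writer in state.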

rkhachatryan · 2026-06-08T21:05:11Z

+                    s3recoverable.numBytesInParts(),
+                    incompleteTail,
+                    incompleteTailLength);
+        } catch (Throwable t) {


Should we use Exception here instead of Throwable?
I doubt that we want this block to be executed in case of VM errors for example.
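
A minimal sketch of the suggested narrowing follows, with hypothetical names (`CleanupOnFailureSketch`, `SeedOpener`, `openOrCleanUp`) that do not come from the PR. Catching `Exception` still covers `IOException` and runtime exceptions, while VM-level `Error`s such as `OutOfMemoryError` propagate without the cleanup block running.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical illustration of cleanup on Exception rather than Throwable. */
class CleanupOnFailureSketch {

    /** Opens a stream over the downloaded seed file; stands in for the resume constructor. */
    interface SeedOpener {
        OutputStream open(File seed) throws IOException;
    }

    static OutputStream openOrCleanUp(File seed, SeedOpener opener) throws IOException {
        try {
            return opener.open(seed);
        } catch (Exception e) {
            // Reached for IOException and RuntimeException, but not for Error subclasses.
            if (!seed.delete() && seed.exists()) {
                e.addSuppressed(new IOException("Could not delete " + seed));
            }
            throw e; // precise rethrow: only IOException or an unchecked exception
        }
    }

    public static void main(String[] args) throws IOException {
        File seed = File.createTempFile("seed-", ".tmp");
        try {
            openOrCleanUp(seed, f -> {
                throw new IOException("simulated failure");
            });
        } catch (IOException expected) {
            System.out.println("seed still exists: " + seed.exists()); // false
        }
    }
}
```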

rkhachatryan · 2026-06-08T21:06:31Z

+            incompleteTail = downloadIncompleteTail(s3recoverable);
+            incompleteTailLength = s3recoverable.incompleteObjectLength();


Why do we need to pass around incompleteTailLength in addition to incompleteTail?

Per my understanding, at this point it's a local file with fixed known length. So just the file/name should be enough, shouldn't it?

rkhachatryan · 2026-06-08T21:09:51Z

+        File incompleteTail = null;
+        long incompleteTailLength = 0L;
+        if (s3recoverable.incompleteObjectName() != null) {


Should we validate inside the if branch that the length > 0?

rkhachatryan · 2026-06-08T21:10:56Z

This constructor (with null/0L) seems to be unused now.

rkhachatryan · 2026-06-08T21:12:32Z

+    @TempDir java.nio.file.Path tmp;
+
+    @Test
+    void persistThenRecoverPreservesTailBytes() throws Exception {


And probably test that the 2nd recovery attempt can still read the side part?

Samrat002 · 2026-06-09T17:01:10Z

@rkhachatryan I've pushed changes that address the review comments. PTAL when you have time.

rkhachatryan

LGTM

Izeren reviewed May 29, 2026

github-actions Bot added the community-reviewed label (PR has been reviewed by the community) May 30, 2026

Samrat002 force-pushed the FLINK-39778 branch from 4eaff5d to 89ffc7b Compare June 3, 2026 18:18

Samrat002 requested a review from Izeren June 3, 2026 18:19

Samrat002 force-pushed the FLINK-39778 branch from 89ffc7b to 606ffa4 Compare June 4, 2026 13:48

Izeren reviewed Jun 5, 2026

rkhachatryan reviewed Jun 8, 2026

Samrat002 added 3 commits June 9, 2026 21:54

[FLINK-39778][s3] Recoverable writer silently loses the in-flight tail on resume

f41a092

Address to review comments

f642115

Address to review comments

407ebc4

Samrat002 force-pushed the FLINK-39778 branch from 606ffa4 to 407ebc4 Compare June 9, 2026 16:57

Samrat002 requested review from Izeren and rkhachatryan June 9, 2026 17:01

rkhachatryan approved these changes Jun 15, 2026

rkhachatryan merged commit 7577493 into apache:master Jun 15, 2026
