[FLINK-39778][s3] Recoverable writer silently loses the in-flight tail on resume #28268

Merged
rkhachatryan merged 3 commits into apache:master from Samrat002:FLINK-39778 on Jun 15, 2026

Samrat002 commented May 27, 2026

Contributor

What is the purpose of the change

NativeS3RecoverableWriter.recover() silently discarded the sub-part-size tail that persist() had durably uploaded to S3 as a side object. After a crash-and-restore cycle, any bytes written since the last full-part boundary were permanently lost, violating Flink's exactly-once guarantee.

This patch fixes the data loss by downloading the side object during recover() and seeding the resumed output stream with those bytes before accepting further writes.

Brief change log

  • recover() in NativeS3RecoverableWriter now downloads the incomplete-tail side object and seeds the resumed stream with those bytes before accepting new writes.
  • A downloadIncompleteTail() helper validates the length and cleans up the local file on failure.
  • NativeS3RecoverableFsDataOutputStream gains a resume constructor that opens the seed file in append mode so position accounting is correct from the start.

Verifying this change

Unit tests reproduce the bug and verify the fix.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no) no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no) no
  • The serializers: (yes / no / don't know) no
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know) no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know) no
  • The S3 file system connector: (yes / no / don't know) yes

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

flinkbot commented May 27, 2026

Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Izeren left a comment

Contributor

I haven't reviewed all the tests in detail yet, but I would first like to better understand the logic around partially uploaded subparts. My initial impression from the FLIP was that we would store incomplete parts inline in state and treat part upload as an atomic operation. Did we choose not to do that? I ask because I am not sure about the reliability of incomplete parts. Are they subject to the same lifecycle policies as incomplete MPUs, or to some different policy?

s3AccessHelper.getObject(s3recoverable.incompleteObjectName(), target);
if (downloaded != s3recoverable.incompleteObjectLength()) {
throw new IOException(
"Incomplete-tail object "

Contributor

This exception doesn't say what the implications are. Does it mean that the state is corrupted and can't be recovered unless the object on S3 is restored? If so, it would be useful to explain that. It would help both the on-call engineer and anyone classifying such errors (retriable/non-retriable).
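
A minimal sketch of what a more descriptive message could look like. The helper name `TailLengthCheck`, the message wording, and the claim that a retry will not help are assumptions made for illustration; they are not taken from the PR.

```java
import java.io.IOException;

/** Hypothetical helper, not part of the PR. */
final class TailLengthCheck {

    private TailLengthCheck() {}

    /**
     * Throws if the downloaded tail does not match the length recorded in the recoverable.
     * The message spells out the implication and a suggested classification (an assumption).
     */
    static void verify(String objectName, long expectedLength, long downloadedLength)
            throws IOException {
        if (downloadedLength != expectedLength) {
            throw new IOException(
                    "Incomplete-tail object '" + objectName + "' has length " + downloadedLength
                            + " but the checkpointed recoverable expects " + expectedLength
                            + ". The tail cannot be restored from this object, so resuming"
                            + " would lose or corrupt data. Retrying is unlikely to help"
                            + " unless the object is restored on S3.");
        }
    }
}
```

Whether such a mismatch should surface as a plain IOException or as a dedicated non-retriable exception type is a design choice this sketch leaves open.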

* <p><b>Thread safety:</b> not thread-safe. Use a single thread per instance, matching the
* single-thread invariant of the production {@link NativeS3RecoverableFsDataOutputStream}.
*/
public final class InMemoryNativeS3Operations extends NativeS3ObjectOperations {

Contributor

If this is meant to be used as a test harness for FileSystem testing (as a replacement for LocalStack), it would arguably be good to have tests for it too.

github-actions bot added the community-reviewed label (PR has been reviewed by the community) May 30, 2026
Samrat002 requested a review from Izeren June 3, 2026 18:19
@Samrat002

Contributor Author

> I haven't reviewed all the tests in detail yet, but I would first like to better understand the logic around partially uploaded subparts. My initial impression from the FLIP was that we would store incomplete parts inline in state and treat part upload as an atomic operation. Did we choose not to do that? I ask because I am not sure about the reliability of incomplete parts. Are they subject to the same lifecycle policies as incomplete MPUs, or to some different policy?

Yes, you are right; that was discussed initially. What we observe at production scale is that users don't really set lifecycle policies, so billions of MPUs accumulate, leading to high cost.

Izeren commented Jun 5, 2026

Contributor

> Yes, you are right; that was discussed initially. What we observe at production scale is that users don't really set lifecycle policies, so billions of MPUs accumulate, leading to high cost.

Could you please elaborate on how storing subparts in the state is linked to the billing problem? Don't aborted MPUs introduce the same dangling S3 objects?

My general question was more about why we store subparts as separate tail files on S3 to resume from. Are they as good as inline Flink state in terms of data corruption risk?

@TempDir java.nio.file.Path tmp;

@Test
void persistThenRecoverPreservesTailBytes() throws Exception {

Contributor

Should we also test that data written after the last persist is discarded on recovery? Otherwise we are not guaranteeing exactly-once.

Contributor

And probably test that the 2nd recovery attempt can still read the side part?
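
The two properties raised in these comments can be illustrated with a toy in-memory model. This is only a sketch: `TailRecoveryModel`, `Store`, `Recoverable`, and `Stream` are hypothetical stand-ins and do not mirror the PR's classes or the S3 API. It assumes that recovery reads the side object without deleting it.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of persist/recover; all names here are hypothetical. */
final class TailRecoveryModel {

    /** Stand-in for S3: object name -> bytes. */
    static final class Store {
        final Map<String, byte[]> objects = new HashMap<>();
    }

    /** What a checkpoint would hold: name and length of the persisted tail. */
    static final class Recoverable {
        final String tailName;
        final int tailLength;

        Recoverable(String tailName, int tailLength) {
            this.tailName = tailName;
            this.tailLength = tailLength;
        }
    }

    static final class Stream {
        private final Store store;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        Stream(Store store, byte[] seed) {
            this.store = store;
            buffer.write(seed, 0, seed.length);
        }

        void write(byte[] data) {
            buffer.write(data, 0, data.length);
        }

        /** Uploads the current tail as a side object and returns a handle to it. */
        Recoverable persist(String tailName) {
            byte[] tail = buffer.toByteArray();
            store.objects.put(tailName, tail);
            return new Recoverable(tailName, tail.length);
        }

        byte[] contents() {
            return buffer.toByteArray();
        }
    }

    /** Seeds a new stream from the side object and leaves the object in place. */
    static Stream recover(Store store, Recoverable r) {
        byte[] tail = store.objects.get(r.tailName);
        if (tail == null || tail.length != r.tailLength) {
            throw new IllegalStateException("tail object missing or wrong length: " + r.tailName);
        }
        return new Stream(store, tail);
    }
}
```

A test against this model would write some bytes, call `persist`, write more bytes, and then call `recover` twice with the same `Recoverable`. Both recovered streams should contain only the bytes written before `persist`: the later bytes are discarded, and the second recovery still finds the side object.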

@Samrat002

Contributor Author

> > Yes, you are right; that was discussed initially. What we observe at production scale is that users don't really set lifecycle policies, so billions of MPUs accumulate, leading to high cost.

> Could you please elaborate on how storing subparts in the state is linked to the billing problem? Don't aborted MPUs introduce the same dangling S3 objects?

> My general question was more about why we store subparts as separate tail files on S3 to resume from. Are they as good as inline Flink state in terms of data corruption risk?

My bad, I misunderstood and conflated different things.

Two reasons I went with S3 objects over inlining in state:
  1. Checkpoint cost. A tail can be almost a full part in size, and parts are 5 MiB or more, often larger. Inlining a tail per writer per checkpoint inflates the checkpoint payload going through the JM/state backend. At scale that's a real cost compared with a single S3 PUT.
  2. Durability is the same. The state backend is usually S3 too, so a tail object gets the same eleven-nines durability either way. Inlining in state doesn't strengthen the guarantee; it just shifts where the bytes live.

On lifecycle, tail objects live under deterministic keys we own, and deletion is driven by our commit/abort/recovery path, not by a bucket lifecycle policy. So cleanup is as predictable as state GC, without the checkpoint-size hit.
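
The checkpoint-cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the writer count, tail size, and reference size are hypothetical inputs, not measurements from this PR or from any deployment.

```java
/** Back-of-the-envelope estimate; every number used in main is a hypothetical input. */
final class CheckpointPayloadEstimate {

    private CheckpointPayloadEstimate() {}

    /** Bytes added to one checkpoint when each writer contributes bytesPerWriter. */
    static long perCheckpointBytes(int writers, long bytesPerWriter) {
        return Math.multiplyExact((long) writers, bytesPerWriter);
    }

    public static void main(String[] args) {
        int writers = 1_000;                    // hypothetical number of open writers
        long tailBytes = 4L * 1024 * 1024;      // hypothetical 4 MiB tail, inlined in state
        long referenceBytes = 256L;             // hypothetical size of an object name plus length

        System.out.println("inlined:    " + perCheckpointBytes(writers, tailBytes));
        System.out.println("referenced: " + perCheckpointBytes(writers, referenceBytes));
    }
}
```

Under these assumed inputs the inlined variant adds 4,194,304,000 bytes to each checkpoint, while the referenced variant adds 256,000 bytes; the tail bytes themselves still have to be uploaded once, as a single PUT per writer.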

s3recoverable.numBytesInParts(),
incompleteTail,
incompleteTailLength);
} catch (Throwable t) {

Contributor

Should we use Exception here instead of Throwable?
I doubt that we want this block to be executed in case of VM errors for example.
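
A sketch of the narrower catch suggested here, under the assumption that the cleanup in question is deleting the downloaded local tail file. `TailCleanupSketch` and `StreamFactory` are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/** Hypothetical sketch of cleanup-on-failure; names do not come from the PR. */
final class TailCleanupSketch {

    private TailCleanupSketch() {}

    /** Stand-in for the step that builds the resumed stream from the seed file. */
    interface StreamFactory<T> {
        T open(File seed) throws IOException;
    }

    static <T> T openOrCleanUp(File seed, StreamFactory<T> factory) throws IOException {
        try {
            return factory.open(seed);
        } catch (Exception e) {
            // Runs for IOException and RuntimeException, but not for Errors such as
            // OutOfMemoryError, which propagate without this block touching the file.
            try {
                Files.deleteIfExists(seed.toPath());
            } catch (IOException cleanupFailure) {
                e.addSuppressed(cleanupFailure);
            }
            throw e;
        }
    }
}
```

The trade-off is that a seed file may be left behind when an Error is thrown; whether that is acceptable depends on how temporary files are otherwise cleaned up.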

Comment on lines +95 to +96
incompleteTail = downloadIncompleteTail(s3recoverable);
incompleteTailLength = s3recoverable.incompleteObjectLength();

Contributor

Why do we need to pass around incompleteTailLength in addition to incompleteTail?

Per my understanding, at this point it's a local file with fixed known length. So just the file/name should be enough, shouldn't it?
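
A sketch of the alternative raised in this comment: a resumed stream that takes only the local seed file and derives the tail length from it. `ResumedTailStream` is a hypothetical class, and the real constructor in the PR may be structured differently.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical resumed stream; not the class from the PR. */
final class ResumedTailStream implements AutoCloseable {

    private final OutputStream out;
    private long position;

    /**
     * @param bytesInParts bytes already uploaded as complete parts
     * @param seed local copy of the incomplete tail; its length is read from the file itself
     */
    ResumedTailStream(long bytesInParts, File seed) throws IOException {
        this.position = bytesInParts + seed.length();
        // The second argument 'true' opens the file in append mode, keeping the seeded bytes.
        this.out = new FileOutputStream(seed, true);
    }

    void write(byte[] data) throws IOException {
        out.write(data);
        position += data.length;
    }

    long getPos() {
        return position;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

One possible reason to keep an explicit length parameter anyway is to cross-check the local file against the length recorded in the checkpoint; that is a design consideration, not something this thread confirms.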

Comment on lines +92 to +94
File incompleteTail = null;
long incompleteTailLength = 0L;
if (s3recoverable.incompleteObjectName() != null) {

Contributor

Should we validate inside the if branch that the length > 0?

Contributor

This constructor (with null/0L) seems to be unused now.

@Samrat002

Contributor Author

@rkhachatryan I've pushed changes that address the review comments. PTAL when you have time.

Samrat002 requested review from Izeren and rkhachatryan June 9, 2026 17:01

rkhachatryan left a comment

Contributor

LGTM

rkhachatryan merged commit 7577493 into apache:master Jun 15, 2026