[FLINK-39778][s3] Recoverable writer silently loses the in-flight tail on resume#28268
Izeren
left a comment
I haven't reviewed all the tests in detail yet, but I would first like to better understand the logic around partially uploaded subparts. My initial impression from the FLIP was that we would store incomplete parts inline in state and treat a part upload as an atomic operation. Did we choose not to do that? I ask because I am not sure about the reliability of incomplete parts. Are they subject to the same lifecycle policies as incomplete MPUs, or to a different policy?
| s3AccessHelper.getObject(s3recoverable.incompleteObjectName(), target); | ||
| if (downloaded != s3recoverable.incompleteObjectLength()) { | ||
| throw new IOException( | ||
| "Incomplete-tail object " |
This exception doesn't explain the implications. Does it mean that the state is corrupted and can't be recovered unless the object on S3 is restored? If so, it would be useful to say so. That would help the on-call engineer and also make it easier to classify such errors correctly (retriable/non-retriable).
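For illustration, a more explicit message could look like the sketch below. The class name, method name, and message wording are hypothetical and not taken from the PR, and the claim in the message that retrying will not help is an assumption that the author would need to confirm.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

final class TailLengthCheck {

    /**
     * Hypothetical helper: compares the downloaded tail file against the length
     * recorded in the recoverable and fails with an actionable message.
     */
    static void verifyTailLength(String objectKey, long expected, File downloaded)
            throws IOException {
        long actual = downloaded.length();
        if (actual != expected) {
            // Remove the local copy so a later attempt does not reuse a bad file.
            Files.deleteIfExists(downloaded.toPath());
            // Assumption: a length mismatch means the side object no longer matches
            // the checkpoint, so the message says what happened and what to do.
            throw new IOException(
                    "Incomplete-tail object '" + objectKey + "': downloaded " + actual
                            + " bytes, expected " + expected + " bytes. The side object"
                            + " no longer matches the checkpointed state; retrying will"
                            + " not help unless the object is restored on S3.");
        }
    }
}
```

Spelling out the expected and actual lengths plus the recovery condition lets a failure classifier key on the message without reading the source.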
| * <p><b>Thread safety:</b> not thread-safe. Use a single thread per instance, matching the | ||
| * single-thread invariant of the production {@link NativeS3RecoverableFsDataOutputStream}. | ||
| */ | ||
| public final class InMemoryNativeS3Operations extends NativeS3ObjectOperations { |
If this is meant to be used as a test harness for FileSystem testing (as a replacement for LocalStack), then arguably it should have tests of its own too.
Yes, you are right. That was initially discussed. What we observe at production scale is that users don't really set lifecycle policies, so billions of MPUs accumulate, leading to high cost.
Could you please elaborate on how storing subparts in state is linked to the billing problem? Don't aborted MPUs introduce the same dangling S3 objects? My general question was more about why we store subparts as separate tail files on S3 to resume from. Are they as good as inline Flink state in terms of data corruption risk?
| @TempDir java.nio.file.Path tmp; | ||
|
|
||
| @Test | ||
| void persistThenRecoverPreservesTailBytes() throws Exception { |
Should we also test that data written after the last persist is discarded on recovery? Otherwise we are not guaranteeing exactly-once.
And probably test that the 2nd recovery attempt can still read the side part?
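To make the two requested properties concrete, here is a toy in-memory model. It is not the PR's writer or test harness: `ToyTailStore` and all of its methods are hypothetical stand-ins, and a plain map plays the role of S3.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

final class ToyTailStore {
    // Stand-in for S3: object key -> persisted tail bytes.
    private final Map<String, byte[]> objects = new HashMap<>();
    // Stand-in for the in-flight tail held by the output stream.
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int persistCount;

    void write(byte[] b) {
        buffer.write(b, 0, b.length);
    }

    /** Stores a copy of the current tail as a side object and returns its key. */
    String persist() {
        String key = "tail-" + persistCount++;
        objects.put(key, buffer.toByteArray());
        return key;
    }

    /** Simulates crash and restore: drops in-flight bytes and reloads the side object. */
    void recover(String key) {
        byte[] tail = objects.get(key);
        if (tail == null) {
            throw new IllegalStateException("Missing tail object " + key);
        }
        buffer.reset();
        buffer.write(tail, 0, tail.length);
    }

    byte[] contents() {
        return buffer.toByteArray();
    }
}
```

A test against the real writer would follow the same shape: write, persist, write more, recover and assert that only the persisted bytes are present, then recover from the same recoverable a second time and assert the same result.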
My bad, I misunderstood and conflated different things. Two reasons I went with S3 objects over inlining in state: on lifecycle, tail objects live under deterministic keys we own, and deletion is driven by our commit/abort/recovery path, not by a bucket lifecycle policy. So cleanup is as predictable as state GC, without the checkpoint-size hit.
| s3recoverable.numBytesInParts(), | ||
| incompleteTail, | ||
| incompleteTailLength); | ||
| } catch (Throwable t) { |
Should we use Exception here instead of Throwable?
I doubt we want this block to run on VM errors, for example.
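A minimal sketch of what that could look like, assuming the catch block exists only to delete the downloaded tail file. `CleanupOnFailure` and `openOrCleanUp` are hypothetical names, not code from the PR.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.concurrent.Callable;

final class CleanupOnFailure {

    /**
     * Runs the given open action. On an Exception the local tail file is deleted
     * and the failure is rethrown as an IOException. Errors such as
     * OutOfMemoryError are not caught and propagate without running the cleanup.
     */
    static <T> T openOrCleanUp(File tail, Callable<T> open) throws IOException {
        try {
            return open.call();
        } catch (Exception e) {
            if (tail != null) {
                Files.deleteIfExists(tail.toPath());
            }
            if (e instanceof IOException) {
                throw (IOException) e;
            }
            throw new IOException("Failed to resume stream", e);
        }
    }
}
```

The trade-off is that the temporary file is left behind when an Error is thrown, which is usually acceptable because the JVM is unlikely to be in a state where cleanup is reliable.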
| incompleteTail = downloadIncompleteTail(s3recoverable); | ||
| incompleteTailLength = s3recoverable.incompleteObjectLength(); |
Why do we need to pass around incompleteTailLength in addition to incompleteTail?
As I understand it, at this point it's a local file with a fixed, known length, so just the file (or its name) should be enough, shouldn't it?
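As a sketch of that suggestion, a resume constructor could derive the tail length from the file itself and open it in append mode. `SeededTailStream` is a hypothetical simplification and does not mirror the real `NativeS3RecoverableFsDataOutputStream` signature.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class SeededTailStream implements AutoCloseable {
    private final OutputStream out;
    private long pos;

    /**
     * @param bytesInCompletedParts bytes already uploaded as full parts
     * @param seed local file holding the downloaded incomplete tail
     */
    SeededTailStream(long bytesInCompletedParts, File seed) throws IOException {
        // The length comes from the file, so no separate length argument is needed.
        this.pos = bytesInCompletedParts + seed.length();
        // Append mode keeps the seeded bytes and continues writing after them.
        this.out = new FileOutputStream(seed, true);
    }

    void write(byte[] b) throws IOException {
        out.write(b);
        pos += b.length;
    }

    long getPos() {
        return pos;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

With a single source of truth for the length, the constructor cannot be called with a file and a length that disagree.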
| File incompleteTail = null; | ||
| long incompleteTailLength = 0L; | ||
| if (s3recoverable.incompleteObjectName() != null) { |
Should we validate inside the if branch that the length > 0?
This constructor (with null/0L) seems to be unused now.
@rkhachatryan I've pushed changes addressing the review comments. PTAL whenever you have time.
What is the purpose of the change
NativeS3RecoverableWriter.recover() silently discarded the sub-part-size tail that persist() had durably uploaded to S3 as a side object. After a crash-and-restore cycle, any bytes written since the last full-part boundary were permanently lost, violating Flink's exactly-once guarantee.
This patch fixes the data loss by downloading the side object during recover() and seeding the resumed output stream with those bytes before accepting further writes.
Brief change log
recover() in NativeS3RecoverableWriter now downloads the incomplete-tail side object and seeds the resumed stream with those bytes before accepting new writes.
A downloadIncompleteTail() helper validates the length and cleans up the local file on failure.
NativeS3RecoverableFsDataOutputStream gains a resume constructor that opens the seed file in append mode so position accounting is correct from the start.
Verifying this change
Unit tests reproduce the bug and verify the fix.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no) no
Documentation
Was generative AI tooling used to co-author this PR?