fix: scope large binary storage and cleanup by execution id by kunwp1 · Pull Request #5280 · apache/texera

kunwp1 · 2026-05-28T19:54:51Z

What changes were proposed in this PR?

Large binaries were stored in the shared texera-large-binaries bucket under flat keys objects/{timestamp}/{uuid} with no execution id, and clearExecutionResources(eid) deleted all of them via LargeBinaryManager.deleteAllObjects(). Any cleanup for one execution therefore erased every other execution's (and user's) large binaries.

This PR namespaces every large binary by its execution id and scopes deletion:

- Object keys are now objects/{eid}/{uuid} on both the JVM and Python workers.
- The execution id is carried to workers via a new InitializeExecutorRequest.executionId proto field, injected by the system at executor init. The user-facing largebinary() / new LargeBinary() APIs are unchanged.
- Cleanup uses the new LargeBinaryManager.deleteByExecution(eid), a prefix delete of objects/{eid}/. The JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both.
- deleteAllObjects() is removed.

Pre-existing objects under the old objects/{timestamp}/... scheme are left untouched.
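
The key scheme and scoped delete described above can be sketched as follows. This is only an illustration under stated assumptions: `InMemoryBucket`, `create_key`, and `delete_by_execution` are hypothetical names, and a plain dict stands in for the shared bucket. It is not the PR's actual implementation.

```python
import uuid
from typing import Dict, Optional


class InMemoryBucket:
    """Hypothetical stand-in for the shared bucket: maps object key -> bytes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}


def create_key(execution_id: Optional[int]) -> str:
    """Build an execution-scoped key, failing fast when no execution id is set."""
    if execution_id is None:
        raise RuntimeError("no execution id set for large binary creation")
    return f"objects/{execution_id}/{uuid.uuid4()}"


def delete_by_execution(bucket: InMemoryBucket, execution_id: int) -> int:
    """Delete every object under objects/{eid}/ and return how many were removed."""
    # The trailing slash matters: without it, eid 1 would also match eid 12.
    prefix = f"objects/{execution_id}/"
    doomed = [key for key in bucket.objects if key.startswith(prefix)]
    for key in doomed:
        del bucket.objects[key]
    return len(doomed)
```

In this sketch, deleting execution 1 leaves keys under objects/12/ alone because the prefix includes the trailing slash.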

Any related issues, documentation, discussions?

Closes #4123.

How was this PR tested?

Requires running ./bin/python-proto-gen.sh

Import the following JSON file to create two workflows (you can configure the source operator to use any file you have), run them, and check that each execution creates 6 objects and that cleaning up one execution does not remove the other execution's large binary objects.
Large.Binary.Python (1).json

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude Sonnet 4.6

Also update existing call site in RegionExecutionCoordinator to pass None for the new field (required because ScalaPB no_default_values_in_constructor is true).


betterproto returns an empty (falsy) ExecutionIdentity for an unset executionId field rather than None, so the previous `is not None` check always passed and an unset id would silently produce objects/0/... Use truthiness so unset -> None -> create() raises, matching the JVM invariant. Also moves a stray mid-file `import re` to the top.
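
The difference between the two checks can be illustrated with a small stand-in. `FakeExecutionIdentity` below is a hypothetical dataclass that only mimics the falsy-when-unset behaviour described in the commit message; it is not the generated betterproto class, and the helper names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeExecutionIdentity:
    """Hypothetical stand-in for the generated message; falsy when id is the default 0."""

    id: int = 0

    def __bool__(self) -> bool:
        return self.id != 0


def eid_with_none_check(msg: FakeExecutionIdentity) -> Optional[int]:
    # Old check: an unset message is still an object, so this always takes the first branch.
    return msg.id if msg is not None else None


def eid_with_truthiness(msg: FakeExecutionIdentity) -> Optional[int]:
    # Fixed check: an unset (falsy) message maps to None, so a later create() can fail fast.
    return msg.id if msg else None
```

With an unset message, the first helper yields 0 (which would lead to objects/0/...), while the second yields None.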

codecov-commenter · 2026-05-28T19:57:46Z

Codecov Report

❌ Patch coverage is 89.70588% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.30%. Comparing base (251a845) to head (a12bc7c).

Files with missing lines	Patch %	Lines
...pache/texera/service/util/LargeBinaryManager.scala	60.00%	4 Missing and 2 partials ⚠️
...rg/apache/texera/web/service/WorkflowService.scala	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5280      +/-   ##
============================================
- Coverage     51.14%   50.30%   -0.85%     
+ Complexity     2416     2415       -1     
============================================
  Files          1054     1052       -2     
  Lines         40918    40799     -119     
  Branches       4381     4355      -26     
============================================
- Hits          20929    20522     -407     
- Misses        18765    19052     +287     
- Partials       1224     1225       +1

Flag	Coverage Δ		*Carryforward flag
access-control-service	`41.89% <ø> (ø)`
agent-service	`33.76% <ø> (ø)`		Carriedforward from d26bff6
amber	`52.01% <61.11%> (+0.03%)`	⬆️
computing-unit-managing-service	`1.38% <ø> (ø)`
config-service	`54.68% <ø> (ø)`
file-service	`38.42% <ø> (ø)`
frontend	`43.20% <ø> (-2.21%)`	⬇️	Carriedforward from d26bff6
pyamber	`90.83% <100.00%> (+0.03%)`	⬆️
python	`90.72% <ø> (-0.08%)`	⬇️	Carriedforward from d26bff6
workflow-compiling-service	`58.39% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.


Move the per-execution id out of StorageConfig (which holds only static system configuration sourced from storage.conf) into a dedicated module-level holder in large_binary_manager (set_current_execution_id), mirroring the JVM LargeBinaryManager. The Python init handler sets it via that API.

Add get_current_execution_id() and route create() and the tests through it instead of reading the module-level _current_execution_id directly, keeping the holder's access encapsulated.

kunwp1 · 2026-05-28T22:43:48Z

/request-review @Xiao-zhen-Liu

Could you review this PR, since you are an engine expert?

chenlica · 2026-06-01T06:51:29Z

@Xiao-zhen-Liu Please review the PR as requested.

Address review feedback: replace the module-level globals (_s3_client, DEFAULT_BUCKET, _current_execution_id) and free functions with a LargeBinaryManager class holding state as instance attributes, exposed as a single shared per-worker singleton. No more `global` statements; mirrors the JVM `object LargeBinaryManager`. Consumers import the singleton, so call sites are unchanged. Update the stream/type tests to patch the singleton instance.

The pure create() logic (execution-scoped key + fail-fast when no context is set) was only exercised by the MinIO-backed LargeBinaryManagerSpec. Move those two assertions into LargeBinaryManagerUnitSpec so they run without Docker and count toward coverage; the MinIO spec keeps the isolation test that genuinely needs a live S3 endpoint. deleteByExecution's success and swallow branches were already covered by the unit spec.

Yicong-Huang · 2026-06-02T00:03:57Z

+# Shared singleton for the worker process. Consumers import this instance:
+#   from pytexera.storage.large_binary_manager import large_binary_manager
+large_binary_manager = LargeBinaryManager()


This is still a module-level global variable. Please avoid this as well.

Please create a singleton class if that is what is needed.

Yicong-Huang · 2026-06-02

OK, I read more on this:

https://www.thepythoncodingstack.com/p/creating-a-singleton-class-in-python

There are two ways to create a singleton:

1. A full public singleton class that uses __new__ to guard instance creation. Call sites import the class, and it behaves as a singleton.

2. A module-level variable that holds a single instance of the class, with call sites importing only that instance (what your current implementation does). This is OK for internal APIs and sometimes preferable in Python. But we need to make sure call sites do not import the class. In other words, we need to make the LargeBinaryManager class private (class _LargeBinaryManager) and expose only the instance.

I would prefer the first one myself, but since this is used internally, either way is fine. You can decide which way to go.

kunwp1 · 2026-06-02

I also lean toward using __new__ because it's much simpler. I made the changes. Can you check whether they are correct?

Yicong-Huang · 2026-06-02

Thanks, it looks correct to me. It is always better to add a simple test case to guard it.

kunwp1 · 2026-06-02

I added a guard as well.

Address review feedback: remove the module-level singleton instance (a module global) and instead guard single-instance creation in the class via __new__, so LargeBinaryManager() always returns the same shared instance. Callers import and use the class directly; no module-level instance is exposed.

Per review, add a simple test asserting LargeBinaryManager() always returns the same shared instance and that state set through one handle is visible through another.
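
A minimal sketch of the __new__-guarded singleton and the invariant the test checks, assuming a simplified class. `LargeBinaryManagerSketch` is a hypothetical name and holds only the execution id; the real class also manages the S3 client and bucket, which are omitted here.

```python
from typing import Optional


class LargeBinaryManagerSketch:
    """Hypothetical, simplified singleton: every call returns the same instance."""

    _instance: Optional["LargeBinaryManagerSketch"] = None

    def __new__(cls) -> "LargeBinaryManagerSketch":
        if cls._instance is None:
            instance = super().__new__(cls)
            # Initialise state here rather than in __init__, because __init__
            # would run on every LargeBinaryManagerSketch() call and reset it.
            instance._execution_id = None
            cls._instance = instance
        return cls._instance

    def set_current_execution_id(self, execution_id: Optional[int]) -> None:
        self._execution_id = execution_id

    def get_current_execution_id(self) -> Optional[int]:
        return self._execution_id


# State set through one handle is visible through another.
first = LargeBinaryManagerSketch()
second = LargeBinaryManagerSketch()
first.set_current_execution_id(42)
assert first is second
assert second.get_current_execution_id() == 42
```

This sketch is not thread-safe on first creation; whether that matters depends on how the worker process constructs the manager, which the PR text does not specify.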

Xiao-zhen-Liu · 2026-06-02T22:11:44Z

@kunwp1 Can you update the PR description to include the source data of the test workflow?

kunwp1 · 2026-06-02T23:16:06Z

> @kunwp1 Can you update the PR description to include the source data of the test workflow?

I didn't include the source data because it is too large (more than 2 GB) to attach to the description. You can configure the source operator with any file you have. I just updated the description to include this information.

kunwp1 added 9 commits May 28, 2026 10:56

feat: eid-scoped large binary create + per-execution delete (apache#4123)

dd8883b

feat: add executionId to InitializeExecutorRequest (apache#4123)

d0d6fd9

feat: send executionId to workers on executor init (apache#4123)

add9b95

feat: set large-binary execution context on JVM worker init (apache#4123)

decab59

fix: scope execution cleanup to the execution's large binaries (apache#4123)

d236504

fix: scope deleteWorkflow large-binary cleanup to its executions (apache#4123)

51e49d1

refactor: remove unused bucket-wide deleteAllObjects (apache#4123)

5114610

feat: eid-scoped large binary create on Python worker (apache#4123)

9845360

github-actions Bot assigned kunwp1 May 28, 2026

github-actions Bot added engine fix pyamber common labels May 28, 2026

kunwp1 force-pushed the fix/large-binary-eid-lifecycle branch from 6a0709e to 94c2804 Compare May 28, 2026 20:09

kunwp1 mentioned this pull request May 28, 2026

S3StorageClient.deleteDirectory only deletes the first ≤1000 objects per prefix #5281

Open

kunwp1 added 3 commits May 28, 2026 13:29

refactor: read large-binary execution id through a getter (apache#4123)

f051176

Format and refactoring

d885c2b

github-actions Bot requested a review from Xiao-zhen-Liu May 28, 2026 22:43

kunwp1 and others added 2 commits May 28, 2026 15:45

Merge branch 'main' into fix/large-binary-eid-lifecycle

8e1ebfb

Polish comments

3330bf5

Yicong-Huang reviewed Jun 1, 2026

Comment thread amber/src/main/python/pytexera/storage/large_binary_manager.py Outdated

Yicong-Huang reviewed Jun 1, 2026

Comment thread amber/src/main/python/pytexera/storage/large_binary_manager.py Outdated

kunwp1 added 3 commits June 1, 2026 12:08

Format

9e78542

Yicong-Huang reviewed Jun 2, 2026

kunwp1 and others added 4 commits June 1, 2026 17:19

test: guard the LargeBinaryManager singleton invariant (apache#4123)

aff0ab4

Merge branch 'main' into fix/large-binary-eid-lifecycle

d79e5c9

Merge branch 'main' into fix/large-binary-eid-lifecycle

a12bc7c

kunwp1 wants to merge 21 commits into apache:main from kunwp1:fix/large-binary-eid-lifecycle

5 participants
