
fix: scope large binary storage and cleanup by execution id#5280

Open
kunwp1 wants to merge 21 commits into
apache:mainfrom
kunwp1:fix/large-binary-eid-lifecycle


@kunwp1
Contributor

@kunwp1 kunwp1 commented May 28, 2026

What changes were proposed in this PR?

Large binaries were stored in the shared texera-large-binaries bucket under flat keys objects/{timestamp}/{uuid} with no execution id, and clearExecutionResources(eid) deleted all of them via LargeBinaryManager.deleteAllObjects(). Any cleanup for one execution therefore erased every other execution's (and user's) large binaries.

This PR namespaces every large binary by its execution id and scopes deletion:

  • Object keys are now objects/{eid}/{uuid} on both the JVM and Python workers.
  • The execution id is carried to workers via a new InitializeExecutorRequest.executionId proto field, injected by the system at executor init. The user-facing largebinary() / new LargeBinary() APIs are unchanged.
  • Cleanup uses the new LargeBinaryManager.deleteByExecution(eid) (prefix delete of objects/{eid}/). Both JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both.
  • deleteAllObjects() is removed.

Pre-existing objects under the old objects/{timestamp}/... scheme are left untouched.
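
The key scheme and the prefix-scoped delete described above can be sketched as follows. This is a minimal illustration only: the in-memory `bucket` dict stands in for the shared S3 bucket, and `create_key` and `delete_by_execution` are hypothetical names, not the PR's actual functions.

```python
import uuid

# Hypothetical in-memory stand-in for the shared bucket: key -> bytes.
bucket: dict[str, bytes] = {}


def create_key(eid: int) -> str:
    """Build an execution-scoped key of the form objects/{eid}/{uuid}."""
    return f"objects/{eid}/{uuid.uuid4()}"


def delete_by_execution(eid: int) -> int:
    """Delete every object under objects/{eid}/ and return how many were removed."""
    prefix = f"objects/{eid}/"  # trailing slash keeps eid 1 from matching eid 10
    doomed = [key for key in bucket if key.startswith(prefix)]
    for key in doomed:
        del bucket[key]
    return len(doomed)


# Two executions write to the same bucket.
for eid in (1, 10):
    for _ in range(3):
        bucket[create_key(eid)] = b"payload"

removed = delete_by_execution(1)
remaining = sorted(bucket)
```

Cleaning up execution 1 leaves the objects of execution 10 in place, which is the isolation property the PR is after.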

Any related issues, documentation, discussions?

Closes #4123.

How was this PR tested?

Requires running ./bin/python-proto-gen.sh

Import the following JSON file to create two workflows (you can configure the source operator to use any files you have), run them, and check that each execution creates 6 objects and that neither execution removes the other execution's large binary objects.
Large.Binary.Python (1).json

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude Sonnet 4.6

kunwp1 added 9 commits May 28, 2026 10:56
Also update existing call site in RegionExecutionCoordinator to pass
None for the new field (required because ScalaPB no_default_values_in_constructor is true).
…he#4123)

betterproto returns an empty (falsy) ExecutionIdentity for an unset
executionId field rather than None, so the previous `is not None` check
never triggered and an unset id would silently produce objects/0/...
Use truthiness so unset -> None -> create() raises, matching the JVM
invariant. Also moves a stray mid-file `import re` to the top.
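
The difference between the `is not None` check and the truthiness check can be sketched as below. `FakeExecutionIdentity` is a hypothetical stand-in that only imitates the falsy-when-empty behaviour the commit message describes; it is not the generated betterproto class, and `resolve_execution_id` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeExecutionIdentity:
    """Hypothetical stand-in for a generated message; falsy when its id is the default 0."""

    id: int = 0

    def __bool__(self) -> bool:
        return self.id != 0


def resolve_execution_id(field: FakeExecutionIdentity) -> Optional[int]:
    """Return the id only when the field was actually set."""
    # `field is not None` would be True even for an unset, empty message.
    return field.id if field else None


unset = resolve_execution_id(FakeExecutionIdentity())
set_value = resolve_execution_id(FakeExecutionIdentity(id=42))
```

With this stand-in, an unset field resolves to None, so a caller can fail fast instead of writing under objects/0/.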
@codecov-commenter

codecov-commenter commented May 28, 2026

Codecov Report

❌ Patch coverage is 89.70588% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.30%. Comparing base (251a845) to head (a12bc7c).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pache/texera/service/util/LargeBinaryManager.scala | 60.00% | 4 Missing and 2 partials ⚠️ |
| ...rg/apache/texera/web/service/WorkflowService.scala | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5280      +/-   ##
============================================
- Coverage     51.14%   50.30%   -0.85%     
+ Complexity     2416     2415       -1     
============================================
  Files          1054     1052       -2     
  Lines         40918    40799     -119     
  Branches       4381     4355      -26     
============================================
- Hits          20929    20522     -407     
- Misses        18765    19052     +287     
- Partials       1224     1225       +1     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 41.89% <ø> (ø) | |
| agent-service | 33.76% <ø> (ø) | Carriedforward from d26bff6 |
| amber | 52.01% <61.11%> (+0.03%) ⬆️ | |
| computing-unit-managing-service | 1.38% <ø> (ø) | |
| config-service | 54.68% <ø> (ø) | |
| file-service | 38.42% <ø> (ø) | |
| frontend | 43.20% <ø> (-2.21%) ⬇️ | Carriedforward from d26bff6 |
| pyamber | 90.83% <100.00%> (+0.03%) ⬆️ | |
| python | 90.72% <ø> (-0.08%) ⬇️ | Carriedforward from d26bff6 |
| workflow-compiling-service | 58.39% <ø> (ø) | |

*This pull request uses carry forward flags. Click here to find out more.


kunwp1 added 3 commits May 28, 2026 13:29
…apache#4123)

Move the per-execution id out of StorageConfig (which holds only static
system configuration sourced from storage.conf) into a dedicated module-level
holder in large_binary_manager (set_current_execution_id), mirroring the JVM
LargeBinaryManager. The Python init handler sets it via that API.
Add get_current_execution_id() and route create() and the tests through it
instead of reading the module-level _current_execution_id directly, keeping
the holder's access encapsulated.
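
A rough sketch of the holder described in these two commits is shown below. The accessor names come from the commit messages; everything else is assumed for illustration: `create()` here returns only a key string, and `RuntimeError` is a guess at the fail-fast exception type.

```python
import uuid
from typing import Optional

# Module-level holder, kept private; callers go through the accessors below.
_current_execution_id: Optional[int] = None


def set_current_execution_id(eid: Optional[int]) -> None:
    global _current_execution_id
    _current_execution_id = eid


def get_current_execution_id() -> Optional[int]:
    return _current_execution_id


def create() -> str:
    """Return a new execution-scoped key, failing fast when no execution is set."""
    eid = get_current_execution_id()
    if eid is None:
        raise RuntimeError("no execution id set; cannot create a large binary key")
    return f"objects/{eid}/{uuid.uuid4()}"
```

Routing reads through `get_current_execution_id()` means tests and `create()` never touch the private variable directly.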
@kunwp1
Contributor Author

kunwp1 commented May 28, 2026

/request-review @Xiao-zhen-Liu

Could you review this PR, since you are an engine expert?

@github-actions github-actions Bot requested a review from Xiao-zhen-Liu May 28, 2026 22:43
Comment thread amber/src/main/python/pytexera/storage/large_binary_manager.py Outdated
Comment thread amber/src/main/python/pytexera/storage/large_binary_manager.py Outdated
@chenlica
Contributor

chenlica commented Jun 1, 2026

@Xiao-zhen-Liu Please review the PR as requested.

kunwp1 added 3 commits June 1, 2026 12:08
apache#4123)

Address review feedback: replace the module-level globals (_s3_client,
DEFAULT_BUCKET, _current_execution_id) and free functions with a
LargeBinaryManager class holding state as instance attributes, exposed as a
single shared per-worker singleton. No more `global` statements; mirrors the
JVM `object LargeBinaryManager`. Consumers import the singleton, so call sites
are unchanged. Update the stream/type tests to patch the singleton instance.
…pache#4123)

The pure create() logic (execution-scoped key + fail-fast when no context is
set) was only exercised by the MinIO-backed LargeBinaryManagerSpec. Move those
two assertions into LargeBinaryManagerUnitSpec so they run without Docker and
count toward coverage; the MinIO spec keeps the isolation test that genuinely
needs a live S3 endpoint. deleteByExecution's success and swallow branches were
already covered by the unit spec.
Comment on lines +111 to +113
# Shared singleton for the worker process. Consumers import this instance:
# from pytexera.storage.large_binary_manager import large_binary_manager
large_binary_manager = LargeBinaryManager()
Contributor

@Yicong-Huang Yicong-Huang Jun 2, 2026

This is still a global variable at the module level. Please avoid this as well.

If a singleton is what you need, please create a singleton class.

Contributor

@Yicong-Huang Yicong-Huang Jun 2, 2026

ok I read more on this,

https://www.thepythoncodingstack.com/p/creating-a-singleton-class-in-python

There are two ways to create a singleton:

  1. A full public singleton class that uses __new__ to guard instance creation. Call sites import the class, and it behaves as a singleton.
  2. A module-level variable that holds a single instance of the class, with call sites importing only that instance (the way your current implementation does it). This is OK for internal APIs and sometimes preferable in Python. But we need to make sure call sites do not import the class. In other words, we need to make class LargeBinaryManager private (class _LargeBinaryManager) and expose only the instance.

I would prefer the first one myself, but since this is used internally, either way is fine. You can decide which way to go.

Contributor Author

I also lean toward using __new__ because it's much simpler. Made the changes. Can you check if it's correct?

Contributor

Thanks, it looks correct to me. It is always better to add a simple test case to guard it.

Contributor Author

I added a guard as well

kunwp1 and others added 4 commits June 1, 2026 17:19
…pache#4123)

Address review feedback: remove the module-level singleton instance (a module
global) and instead guard single-instance creation in the class via __new__, so
LargeBinaryManager() always returns the same shared instance. Callers import and
use the class directly; no module-level instance is exposed.
Per review, add a simple test asserting LargeBinaryManager() always returns the
same shared instance and that state set through one handle is visible through
another.
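
The `__new__` guard and the accompanying check might be sketched like this. It is an illustration under assumptions rather than the PR's code: `execution_id` is a hypothetical attribute, and state is initialised inside `__new__` because an `__init__` method would run again on every `LargeBinaryManager()` call and reset it.

```python
from typing import Optional


class LargeBinaryManager:
    """Sketch of a singleton: every call to the constructor returns one shared instance."""

    _instance: Optional["LargeBinaryManager"] = None

    def __new__(cls) -> "LargeBinaryManager":
        if cls._instance is None:
            instance = super().__new__(cls)
            # Initialise state here, not in __init__, which would rerun on every call.
            instance.execution_id = None
            cls._instance = instance
        return cls._instance


first = LargeBinaryManager()
second = LargeBinaryManager()
first.execution_id = 42
```

Both handles refer to the same object, so the id set through `first` is visible through `second`. This sketch takes no lock, so it assumes the first construction does not race across threads.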
@Xiao-zhen-Liu
Contributor

@kunwp1 Can you update the PR description to include the source data of the test workflow?

@kunwp1
Contributor Author

kunwp1 commented Jun 2, 2026

> @kunwp1 Can you update the PR description to include the source data of the test workflow?

I didn't include the source data because it is too large (more than 2 GB) to attach to the description. You can configure the source operator with any file you have. I just updated the description to say so.

Development

Successfully merging this pull request may close these issues.

Finish the Life Cycle of Large Binaries

5 participants