fix: scope large binary storage and cleanup by execution id #5280
kunwp1 wants to merge 21 commits
Also update existing call site in RegionExecutionCoordinator to pass None for the new field (required because ScalaPB no_default_values_in_constructor is true).
…he#4123) betterproto returns an empty (falsy) ExecutionIdentity for an unset executionId field rather than None, so the previous `is not None` check never triggered and an unset id would silently produce objects/0/... Use truthiness so unset -> None -> create() raises, matching the JVM invariant. Also moves a stray mid-file `import re` to the top.
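A minimal sketch of the pitfall this commit describes. `FakeExecutionIdentity` is a hypothetical stand-in for the generated message (it only imitates the falsy-when-empty behaviour mentioned above), and both `resolve_*` helpers are illustrative names, not the PR's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeExecutionIdentity:
    """Hypothetical stand-in for a generated message; empty means id == 0."""
    id: int = 0

    def __bool__(self) -> bool:
        # Imitates the behaviour described above: an unset message is falsy.
        return self.id != 0


def resolve_with_none_check(identity: FakeExecutionIdentity) -> Optional[int]:
    # Old check: an unset field is an empty object, not None, so this always passes.
    return identity.id if identity is not None else None


def resolve_with_truthiness(identity: FakeExecutionIdentity) -> Optional[int]:
    # New check: an empty (falsy) message maps to None so callers can fail fast.
    return identity.id if identity else None


unset = FakeExecutionIdentity()
print(resolve_with_none_check(unset))                        # 0
print(resolve_with_truthiness(unset))                        # None
print(resolve_with_truthiness(FakeExecutionIdentity(id=7)))  # 7
```

With the old check the unset id leaks through as 0, which is how a key like `objects/0/...` could be produced silently.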
Codecov Report
@@             Coverage Diff              @@
##               main    #5280      +/-   ##
============================================
- Coverage     51.14%   50.30%   -0.85%
+ Complexity     2416     2415       -1
============================================
  Files          1054     1052       -2
  Lines         40918    40799     -119
  Branches       4381     4355      -26
============================================
- Hits          20929    20522     -407
- Misses        18765    19052     +287
- Partials       1224     1225       +1
6a0709e to 94c2804
…apache#4123) Move the per-execution id out of StorageConfig (which holds only static system configuration sourced from storage.conf) into a dedicated module-level holder in large_binary_manager (set_current_execution_id), mirroring the JVM LargeBinaryManager. The Python init handler sets it via that API.
Add get_current_execution_id() and route create() and the tests through it instead of reading the module-level _current_execution_id directly, keeping the holder's access encapsulated.
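A rough sketch of the encapsulated holder these two commits describe, assuming an integer execution id. The accessor names come from the commit messages; the body of `create()`, its return value, and the exception type are assumptions made for illustration only.

```python
import uuid
from typing import Optional

# Module-level holder; only the accessors below should touch it.
_current_execution_id: Optional[int] = None


def set_current_execution_id(execution_id: Optional[int]) -> None:
    global _current_execution_id
    _current_execution_id = execution_id


def get_current_execution_id() -> Optional[int]:
    return _current_execution_id


def create() -> str:
    """Return an execution-scoped object key; fail fast without a context."""
    execution_id = get_current_execution_id()
    if execution_id is None:
        # Assumed exception type; the PR only states that create() raises.
        raise RuntimeError("no execution id set; cannot create a large binary")
    return f"objects/{execution_id}/{uuid.uuid4()}"
```

Routing `create()` and the tests through `get_current_execution_id()` means the storage of the id can change later without touching callers.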
/request-review @Xiao-zhen-Liu Can you review this PR because you are an engine expert?
@Xiao-zhen-Liu Please review the PR as requested.
apache#4123) Address review feedback: replace the module-level globals (_s3_client, DEFAULT_BUCKET, _current_execution_id) and free functions with a LargeBinaryManager class holding state as instance attributes, exposed as a single shared per-worker singleton. No more `global` statements; mirrors the JVM `object LargeBinaryManager`. Consumers import the singleton, so call sites are unchanged. Update the stream/type tests to patch the singleton instance.
…pache#4123) The pure create() logic (execution-scoped key + fail-fast when no context is set) was only exercised by the MinIO-backed LargeBinaryManagerSpec. Move those two assertions into LargeBinaryManagerUnitSpec so they run without Docker and count toward coverage; the MinIO spec keeps the isolation test that genuinely needs a live S3 endpoint. deleteByExecution's success and swallow branches were already covered by the unit spec.
# Shared singleton for the worker process. Consumers import this instance:
#     from pytexera.storage.large_binary_manager import large_binary_manager
large_binary_manager = LargeBinaryManager()
This is still a module-level global variable. Please avoid this as well.
If a singleton is what's needed, please create a singleton class.
OK, I read more on this:
https://www.thepythoncodingstack.com/p/creating-a-singleton-class-in-python
There are two ways to create a singleton:
- A full public singleton class with `__new__` to guard creation of the class. Call sites import the class, and it is used as a singleton.
- A module-level variable that holds a single instance of the class, with call sites importing only that instance (the way your current implementation does it). This is OK for internal APIs and sometimes preferable in Python. But we need to make sure call sites do not import the class. In other words, we need to make `class LargeBinaryManager` private (`class _LargeBinaryManager`) and expose only the instance.
I would prefer the first one myself, but since this is used internally, either way is fine. You can decide which way to go.
I also lean toward using __new__ because it's much simpler. Made the changes. Can you check if it's correct?
Thanks, it looks correct to me. It is always better to add a simple test case to guard it.
I added a guard as well
…pache#4123) Address review feedback: remove the module-level singleton instance (a module global) and instead guard single-instance creation in the class via __new__, so LargeBinaryManager() always returns the same shared instance. Callers import and use the class directly; no module-level instance is exposed.
Per review, add a simple test asserting LargeBinaryManager() always returns the same shared instance and that state set through one handle is visible through another.
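A minimal sketch of the `__new__`-guarded singleton and the guard test discussed in this thread. The attribute name `current_execution_id` is an assumption for illustration, and the sketch makes no attempt at thread safety.

```python
from typing import Optional


class LargeBinaryManager:
    _instance: Optional["LargeBinaryManager"] = None

    def __new__(cls) -> "LargeBinaryManager":
        if cls._instance is None:
            instance = super().__new__(cls)
            # Initialise state here, not in __init__, which would rerun on every call.
            instance.current_execution_id = None
            cls._instance = instance
        return cls._instance


first = LargeBinaryManager()
second = LargeBinaryManager()
first.current_execution_id = 5
print(first is second)              # True
print(second.current_execution_id)  # 5
```

Because every call to `LargeBinaryManager()` returns the same object, call sites can import the class directly and no module-level instance needs to be exposed.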
@kunwp1 Can you update the PR description to include the source data of the test workflow?
I didn't include the source data because it is too large (more than 2 GB) to attach to the description. You can configure the source operator with any file you have. I just updated the description to include this information.
What changes were proposed in this PR?
Large binaries were stored in the shared `texera-large-binaries` bucket under flat keys `objects/{timestamp}/{uuid}` with no execution id, and `clearExecutionResources(eid)` deleted all of them via `LargeBinaryManager.deleteAllObjects()`. Any cleanup for one execution therefore erased every other execution's (and user's) large binaries.
This PR namespaces every large binary by its execution id and scopes deletion:
- Keys are now `objects/{eid}/{uuid}` on both the JVM and Python workers.
- The execution id travels in the `InitializeExecutorRequest.executionId` proto field, injected by the system at executor init. The user-facing `largebinary()` / `new LargeBinary()` APIs are unchanged.
- Cleanup calls `LargeBinaryManager.deleteByExecution(eid)` (prefix delete of `objects/{eid}/`). Both JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both. `deleteAllObjects()` is removed.
Pre-existing objects under the old `objects/{timestamp}/...` scheme are left untouched.
Any related issues, documentation, discussions?
Closes #4123.
How was this PR tested?
- Requires running `./bin/python-proto-gen.sh`.
- Import the following JSON file to create two workflows (you can configure the source operator to use any kind of file you have), run them, and check that each execution creates 6 objects and that one execution doesn't remove the other execution's large binary objects.
Large.Binary.Python (1).json
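The isolation property being checked here can be illustrated with a small in-memory sketch. `InMemoryBucket` is a hypothetical stand-in for the shared bucket, not the real S3 client or the PR's `deleteByExecution`; it only shows why a prefix delete on `objects/{eid}/` leaves other executions alone.

```python
from typing import Dict, List


class InMemoryBucket:
    """Hypothetical stand-in for the shared bucket; maps object key -> bytes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def list_keys(self, prefix: str) -> List[str]:
        return sorted(k for k in self.objects if k.startswith(prefix))

    def delete_by_execution(self, eid: int) -> int:
        # The trailing slash keeps eid 1 from matching keys of eid 10, 11, ...
        prefix = f"objects/{eid}/"
        doomed = self.list_keys(prefix)
        for key in doomed:
            del self.objects[key]
        return len(doomed)


bucket = InMemoryBucket()
for eid in (1, 10):
    for n in range(6):
        bucket.put(f"objects/{eid}/blob-{n}", b"payload")

removed = bucket.delete_by_execution(1)
print(removed)                               # 6
print(len(bucket.list_keys("objects/10/")))  # 6
```

The execution ids 1 and 10 are chosen deliberately: without the trailing slash in the prefix, deleting execution 1 would also remove execution 10's objects.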
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude Sonnet 4.6