Design: interactive grid for the operator result pane #5395

tanishqgandhi1908 · 2026-06-06T06:02:53Z

Design conversation for #5394.

Making sort, filter, and row search work on the full dataset

The frontend only ever holds a small slice of the data: whatever pages the user has scrolled through. If sort and filter were evaluated only on the rows currently in browser memory, the user would silently get wrong results on any dataset larger than what has been loaded. To make them meaningful, the filter / sort / row-search criteria need to be evaluated on the backend, where the full dataset lives.

Operator results are already stored as Iceberg / Parquet files. Iceberg has two relevant capabilities for this:

- It can skip entire data files during a scan by comparing the filter against per-file min/max statistics it stores alongside the data.
- It can push remaining row-level predicates into the Parquet reader, so only matching rows are decoded.
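
To make the first capability concrete, here is a minimal sketch of min/max file skipping for a single `col > threshold` predicate. It is a toy model, not Iceberg's API: the `FileStats` shape, the file names, and the statistics are made up for illustration, and real Iceberg evaluates arbitrary expressions against the statistics in its manifests.

```typescript
// Per-file column statistics, loosely modelled on what Iceberg keeps per data file.
// The shape and the values are hypothetical.
interface FileStats {
  path: string;
  min: number; // smallest value of the filtered column in this file
  max: number; // largest value of the filtered column in this file
}

// A file can be skipped for `col > threshold` when even its largest value fails the test.
function filesToScanForGreaterThan(files: FileStats[], threshold: number): string[] {
  return files.filter(f => f.max > threshold).map(f => f.path);
}

const files: FileStats[] = [
  { path: "part-0.parquet", min: 0.1, max: 0.6 },
  { path: "part-1.parquet", min: 1.0, max: 1.8 },
  { path: "part-2.parquet", min: 1.4, max: 2.5 },
];

// Only files whose [min, max] range can contain a value > 1.5 are read.
const scanned = filesToScanForGreaterThan(files, 1.5);
// scanned is ["part-1.parquet", "part-2.parquet"]
```

Note that `part-1` and `part-2` overlap in range, so both must be read; the less the per-file ranges overlap, the more files a predicate can skip.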

The proposal is to surface these capabilities by extending the existing WebSocket pagination protocol with optional filter / sort / row-search fields, and adding methods to the storage abstraction that execute them through Iceberg:

- ResultPaginationRequest gains optional filters, sorts, and rowSearch fields. Requests without these fields take the same code path as today.
- VirtualDocument gains getRangeWithQuery and countWithQuery methods, defaulted to safe fallbacks so non-Iceberg document types continue to work unchanged.
- A new IcebergPredicateBuilder translates the wire-format ColumnFilter objects into Iceberg Expressions, with type-aware value parsing per column type so we don't silently mis-coerce strings into numbers.
- IcebergDocument implements both new methods. Operators Iceberg supports natively (eq, ne, lt, le, gt, ge, startsWith, isNull, isNotNull, in) are pushed down. contains and endsWith aren't pushdown-capable, so they're evaluated in memory over the iterator returned by the scan. rowSearch compiles to a multi-column contains and runs as a residual.
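
As a rough illustration of the wire format and the pushdown / residual split described above, the sketch below models the request in TypeScript. The operator names come from the list above; everything else (the field names `operatorId`, `pageIndex`, `pageSize`, `column`, `op`, `value`, `descending`, and the helper `splitFilters`) is assumed for illustration and may not match the actual Texera types.

```typescript
// Operator names are taken from the proposal; the TypeScript shapes are hypothetical.
type FilterOp =
  | "eq" | "ne" | "lt" | "le" | "gt" | "ge"
  | "startsWith" | "isNull" | "isNotNull" | "in"
  | "contains" | "endsWith";

interface ColumnFilter {
  column: string;
  op: FilterOp;
  value?: string; // raw wire value, parsed per column type on the backend
}

interface SortSpec {
  column: string;
  descending: boolean;
}

// The three new fields are optional, so requests that omit them look like today's requests.
interface ResultPaginationRequest {
  operatorId: string;
  pageIndex: number;
  pageSize: number;
  filters?: ColumnFilter[];
  sorts?: SortSpec[];
  rowSearch?: string;
}

// Operators that cannot be pushed into the scan and must run in memory afterwards.
const RESIDUAL_OPS: FilterOp[] = ["contains", "endsWith"];

function splitFilters(filters: ColumnFilter[]): { pushdown: ColumnFilter[]; residual: ColumnFilter[] } {
  const pushdown: ColumnFilter[] = [];
  const residual: ColumnFilter[] = [];
  for (const f of filters) {
    if (RESIDUAL_OPS.indexOf(f.op) !== -1) {
      residual.push(f);
    } else {
      pushdown.push(f);
    }
  }
  return { pushdown, residual };
}

const exampleRequest: ResultPaginationRequest = {
  operatorId: "op-1",
  pageIndex: 0,
  pageSize: 6,
  filters: [
    { column: "PetalWidthCm", op: "gt", value: "1.4" },
    { column: "Species", op: "contains", value: "versi" },
  ],
};

// The gt filter is pushdown-capable; the contains filter becomes a residual.
const split = splitFilters(exampleRequest.filters ?? []);
```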

Sort is the one exception. Iceberg has no ORDER BY pushdown, so a sort is necessarily executed in JVM memory over the filtered iterator. To prevent that from OOM-ing the backend on large filtered sets, sort is capped at a configurable row threshold (storage.result.sort.max-rows, default 100k). When the matched count exceeds the cap, rows are returned in scan order with a sortSkipped flag in the response, and the frontend shows a banner explaining how to narrow the filter to enable sorting.
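
The cap behaviour can be sketched as follows. This is a simplified model that works on an array already in memory; the function and type names are hypothetical, and an actual implementation would presumably count matches from the scan iterator before materializing them.

```typescript
interface SortedPage<T> {
  rows: T[];
  sortSkipped: boolean; // true when the cap was exceeded and rows are in scan order
}

// Sorts only when the matched set fits under the cap; otherwise keeps scan order and flags it.
function sortWithCap<T>(matched: T[], compare: (a: T, b: T) => number, maxRows: number): SortedPage<T> {
  if (matched.length > maxRows) {
    return { rows: matched, sortSkipped: true };
  }
  // Copy before sorting so the caller's scan-order array is not mutated.
  return { rows: matched.slice().sort(compare), sortSkipped: false };
}

const byValue = (a: number, b: number): number => a - b;

const underCap = sortWithCap([3, 1, 2], byValue, 100000);
// underCap.rows is [1, 2, 3], underCap.sortSkipped is false

const overCap = sortWithCap([3, 1, 2], byValue, 2);
// overCap.rows is [3, 1, 2] (scan order), overCap.sortSkipped is true
```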

Architectural notes

- Frontend memory stays bounded: ag-grid virtualization keeps the DOM at ~20–30 row nodes regardless of dataset size.
- The existing pagination cache in OperatorPaginationResultService is populated on response, so revisiting a page requires no WebSocket round-trip.
- Wire format stays backward-compatible. columnOffset / columnLimit / columnSearch are kept on ResultPaginationRequest with their defaults; the new frontend simply stops setting them because column virtualization makes the column pager obsolete. New fields are skipped when empty so the no-query path is byte-identical to today's payload.
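
One way to get the "skipped when empty" property is to attach each query field only when it carries something. The sketch below assumes a JSON payload and a reduced, hypothetical request shape (`PageRequest`, `ViewQuery`, `buildPageRequest` are illustrative names, and the existing column fields are left out for brevity).

```typescript
// Hypothetical minimal request shape; the real request has more fields.
interface PageRequest {
  operatorId: string;
  pageIndex: number;
  pageSize: number;
  filters?: { column: string; op: string; value?: string }[];
  sorts?: { column: string; descending: boolean }[];
  rowSearch?: string;
}

// Current view state of the grid; empty arrays / empty string mean "no query".
interface ViewQuery {
  filters: { column: string; op: string; value?: string }[];
  sorts: { column: string; descending: boolean }[];
  rowSearch: string;
}

// Attach a query field only when it carries something, so an empty query adds no keys.
function buildPageRequest(operatorId: string, pageIndex: number, pageSize: number, query: ViewQuery): PageRequest {
  const req: PageRequest = { operatorId, pageIndex, pageSize };
  if (query.filters.length > 0) req.filters = query.filters;
  if (query.sorts.length > 0) req.sorts = query.sorts;
  if (query.rowSearch.trim() !== "") req.rowSearch = query.rowSearch;
  return req;
}

const legacyPayload = JSON.stringify({ operatorId: "op-1", pageIndex: 0, pageSize: 6 });
const emptyQueryPayload = JSON.stringify(
  buildPageRequest("op-1", 0, 6, { filters: [], sorts: [], rowSearch: "" })
);
// emptyQueryPayload === legacyPayload
```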

Reference implementation

The hackathon prototype — #5099 — has all of this working end-to-end. It's there for reference.

Happy to discuss more on this!!

chenlica (Collaborator) · 2026-06-09T07:14:26Z

Similar to my comments in #5394, please include diagrams.

tanishqgandhi1908 (Author) · 2026-06-12T16:24:12Z

Please find the diagrams and architecture for the proposed design below.

Default View of Proposed result pane

Default view. The new ag-grid-based result pane showing the full 150-row Iris dataset across 25 pages. Each column header preserves the existing Min / Max / Non-Null statistics inline via a custom header component (same data as today's stats row, restructured for the new grid). The ↕ icon on each header indicates the column is sortable; it brightens to a blue caret when active. Pagination at the bottom uses auto-fit page size — the page size follows the panel's height.

Full-data row search

Row search across all columns. Typing into the "Search rows…" box runs a debounced substring match against every string-coercible column on the backend. Here 1.4 matches 20 rows out of 150 (3 pages of results) and the header stats stay anchored to the full dataset for context. The query is sent to the backend and evaluated as a multi-column residual filter over the Iceberg scan, so it covers the full dataset, not just the rows currently in browser memory.
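
A minimal sketch of the multi-column substring match, assuming a case-sensitive search and rows represented as plain objects (both are assumptions; the thread does not specify either). The three sample rows are illustrative only.

```typescript
type CellValue = string | number | boolean | null;
type ResultRow = { [column: string]: CellValue };

// True when any non-null cell, rendered as a string, contains the search text.
function rowMatchesSearch(row: ResultRow, search: string): boolean {
  for (const column of Object.keys(row)) {
    const cell = row[column];
    if (cell !== null && String(cell).indexOf(search) !== -1) {
      return true;
    }
  }
  return false;
}

const irisSample: ResultRow[] = [
  { Id: 1, PetalLengthCm: 1.4, PetalWidthCm: 0.2, Species: "Iris-setosa" },
  { Id: 51, PetalLengthCm: 4.7, PetalWidthCm: 1.4, Species: "Iris-versicolor" },
  { Id: 101, PetalLengthCm: 6.0, PetalWidthCm: 2.5, Species: "Iris-virginica" },
];

// "1.4" is found in PetalLengthCm of the first row and PetalWidthCm of the second.
const searchHits = irisSample.filter(r => rowMatchesSearch(r, "1.4"));
// searchHits contains the rows with Id 1 and Id 51
```

Because the match can land in any column, per-file min/max statistics cannot rule files out, which is why this runs as a residual over the scanned rows.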

Per-column filter for a string column

Per-column filter (string column). Each column header has its own filter funnel. Because the underlying schema marks Species as a string, ag-grid picks the text-filter UI: Contains / Starts with / Ends with / Equals, with AND/OR composition. Here Species contains "Iris-versi" narrows 150 rows down to 50. The filter predicate is sent to the backend as part of the extended ResultPaginationRequest, evaluated there against the Iceberg-backed result (contains runs as an in-memory residual rather than a pushed-down predicate), and returned as a paginated subset.

Per-column filter for a numeric column

Per-column filter (numeric column). Same affordance, schema-aware behavior. Because PetalWidthCm is a numeric column, the filter UI offers numeric operators (Equals / Does not equal / Greater than / Less than / Between / Blank / Not blank) instead of string ones. This routes through IcebergPredicateBuilder with type-aware parsing so values are never silently mis-coerced (e.g. on an integer column "30" is read as the integer 30, not the string "30").
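
Type-aware parsing could look roughly like the sketch below. The column type names and the function `parseFilterValue` are hypothetical, and this is not the IcebergPredicateBuilder code; the point is only that the raw wire string is validated against the column type and rejected, rather than coerced, when it does not fit.

```typescript
// Hypothetical subset of column types, for illustration.
type ColumnType = "integer" | "double" | "string";

// Parses a raw wire value according to the column's declared type and rejects
// input that does not fit instead of silently coercing it.
function parseFilterValue(raw: string, columnType: ColumnType): number | string {
  if (columnType === "string") {
    return raw;
  }
  const text = raw.trim();
  const pattern = columnType === "integer" ? /^[+-]?\d+$/ : /^[+-]?(\d+\.?\d*|\.\d+)$/;
  if (!pattern.test(text)) {
    throw new Error(`"${raw}" is not a valid ${columnType} value`);
  }
  return Number(text);
}

const asInteger = parseFilterValue("30", "integer"); // the number 30
const asString = parseFilterValue("30", "string");   // the string "30"
const asDouble = parseFilterValue("1.4", "double");  // the number 1.4
// parseFilterValue("3.5", "integer") and parseFilterValue("abc", "double") both throw
```

The explicit pattern check matters because `Number("")` is `0` in JavaScript, so a bare `Number(raw)` would quietly turn an empty filter box into a comparison against zero.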

Architecture

flowchart TB
    subgraph Browser["Browser"]
        UI["ag-grid<br/>column filter · sort · row-search UI"]
        SVC["OperatorPaginationResultService<br/>per-page LRU cache"]
        UI -->|"getRows(startRow, endRow,<br/>filterModel, sortModel)"| SVC
    end

    subgraph Backend["Texera Backend (JVM)"]
        ERS["ExecutionResultService<br/>routes by query presence"]
        VD["VirtualDocument<br/>getRangeWithQuery · countWithQuery"]
        ID["IcebergDocument<br/>pushdown + residual + sort"]
        PB["IcebergPredicateBuilder<br/>ColumnFilter → Iceberg Expression"]
        ERS --> VD
        VD --> ID
        ID --> PB
    end

    DISK[("Iceberg / Parquet on disk")]

    SVC -. "WebSocket: ResultPaginationRequest<br/>filters? · sorts? · rowSearch?" .-> ERS
    ID -->|"newScan().filter(expr).select(cols)<br/>file pruning + reader-level predicate pushdown"| DISK
    DISK -. "matching rows only" .-> ID

Read top-to-bottom. ag-grid's IDatasource translates user interactions into a ResultPaginationRequest carrying the optional filters / sorts / rowSearch fields. ExecutionResultService routes between the no-query fast path (existing getRange) and the new getRangeWithQuery. IcebergDocument uses IcebergPredicateBuilder to turn wire-format filters into Iceberg Expressions, lets Iceberg prune entire files via min/max stats and push predicates into the Parquet reader, then evaluates residual ops (contains / endsWith / rowSearch) and sort in JVM memory before slicing the requested page.
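
The backend half of that flow can be approximated in a few lines. In this sketch an in-memory array stands in for the Iceberg table, `pushed` stands in for the predicate Iceberg applies during the scan, and `residual` for the checks done afterwards in JVM memory; the function signature and row shape are assumptions, not the proposed VirtualDocument API.

```typescript
type QueryRow = { [column: string]: string | number | null };

interface PageResult {
  rows: QueryRow[];
  totalMatched: number; // what countWithQuery would report for the same query
}

// Filter during the "scan", apply residuals, optionally sort, then slice the page.
function getRangeWithQuery(
  table: QueryRow[],
  pushed: (row: QueryRow) => boolean,
  residual: (row: QueryRow) => boolean,
  compare: ((a: QueryRow, b: QueryRow) => number) | null,
  from: number,
  until: number
): PageResult {
  const scannedRows = table.filter(pushed);
  const matched = scannedRows.filter(residual);
  const ordered = compare ? matched.slice().sort(compare) : matched;
  return { rows: ordered.slice(from, until), totalMatched: matched.length };
}

const petalWidth = (row: QueryRow): number => Number(row["PetalWidthCm"]);

const sampleTable: QueryRow[] = [
  { Id: 1, PetalWidthCm: 0.2, Species: "Iris-setosa" },
  { Id: 51, PetalWidthCm: 1.4, Species: "Iris-versicolor" },
  { Id: 52, PetalWidthCm: 1.5, Species: "Iris-versicolor" },
  { Id: 101, PetalWidthCm: 2.5, Species: "Iris-virginica" },
];

const firstPage = getRangeWithQuery(
  sampleTable,
  row => petalWidth(row) > 1.0,                              // pushdown-capable: gt
  row => String(row["Species"]).indexOf("versi") !== -1,     // residual: contains
  (a, b) => petalWidth(b) - petalWidth(a),                   // sort by width, descending
  0,
  1
);
// firstPage.totalMatched is 2; firstPage.rows holds only the row with Id 52
```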

chenlica (Collaborator) · 2026-06-13T07:05:03Z

Thanks. @mengw15 Please chime in to provide comments.

mengw15 (Collaborator) · 2026-06-13T08:17:37Z

Thanks for putting this together - a few comments / questions:

1. Sort can't be pushed down (Iceberg has no order-by), and the residual filters (contains / endsWith / row-search) can't be pruned by file stats either. Even the pushdownable ops (=, <, >, in, startsWith) only skip files when the data is clustered by that column; operator results are written in arrival order, so min/max ranges overlap and pruning is usually weak. On a large output this can mean scanning most of the table. Do we have a sense of the actual latency and compute cost there, i.e. how long does a user wait for a sort or row-search to come back? And if a user accidentally sorts/filters a huge result, is there a way to cancel an in-flight query (or a timeout) so the panel doesn't hang?
2. View vs. dataflow semantics. We're a workflow system, so I'm assuming the filter/sort here only changes what's shown in the panel, and the data passed to the downstream operator is still the full, unfiltered output. If so, could this mislead users into thinking they've filtered the actual data?
3. Persistence of the query state. Are the filter/sort (and their results) persisted? If a user sets a filter, switches away, and re-opens the operator, do they get the filtered view back, or does it re-scan and re-filter from scratch?
4. Overlap with the Filter / Selection operators. We already have Filter and Selection operators. For a dataflow system, the more intuitive way to persist a filter is an operator: its output flows downstream (which also addresses the second point) and stays semantically consistent with the rest of the system; and if the operator-result cache (cc @Xiao-zhen-Liu) is enabled, the cost should be comparable since the upstream is cached. Curious how you're drawing that line.

tanishqgandhi1908 (Author) · 2026-06-13T18:51:42Z

Thanks @mengw15,

I agree that full-result backend filtering/search/sort raises concerns, especially for large outputs. Iceberg predicate pushdown can help in some cases, but because operator results are written in arrival order, pruning may be weak.

To make the feature safer and easier to review, I would like to split the work into phases.

First PR: interactive result grid UI only
- Replace the current table with ag-grid.
- Improve column virtualization, resizing, reordering, and hide/show columns.
- Row inspector docks inline below the grid (replaces today's popup modal).
- Pagination with auto-fit page size.
- Sort and per-column filter wired but page-local: they operate on rows currently in ag-grid's cache, with a visible "view only" banner so they don't read as workflow-level filters. (Or hide the UI entirely in PR 1 if the team prefers; open to either.)
- Bottom-docked result panel (replaces the draggable floating popup).

Follow-up design: full-result querying
- Add benchmarks on realistic result sizes.
- Add timeout / cancellation behavior.
- Add scan/sort limits.
- Clearly mark filters as "view only".
- Possibly explicit Apply behavior instead of querying on every keystroke.
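
For the follow-up design, the timeout / cancellation / scan-limit items could share one guarded scan loop. The sketch below is only one possible shape, with hypothetical names throughout (`ScanLimits`, `boundedScan`, the status strings); the clock and the cancellation check are injected so the behaviour can be exercised deterministically.

```typescript
interface ScanLimits {
  maxRowsScanned: number;     // hard cap on rows examined
  deadlineMs: number;         // absolute time after which the scan gives up
  now: () => number;          // injected clock, e.g. Date.now in production
  isCancelled: () => boolean; // flipped when the client abandons the request
}

type ScanStatus = "complete" | "limit" | "timeout" | "cancelled";

// Scans rows in order, stopping early with a status when any limit is hit.
function boundedScan<T>(
  rows: T[],
  matches: (row: T) => boolean,
  limits: ScanLimits
): { matched: T[]; status: ScanStatus } {
  const matched: T[] = [];
  let scannedCount = 0;
  for (const row of rows) {
    if (limits.isCancelled()) return { matched, status: "cancelled" };
    if (limits.now() > limits.deadlineMs) return { matched, status: "timeout" };
    if (scannedCount >= limits.maxRowsScanned) return { matched, status: "limit" };
    scannedCount++;
    if (matches(row)) matched.push(row);
  }
  return { matched, status: "complete" };
}

const numbers = [1, 2, 3, 4, 5, 6];
const isEven = (n: number): boolean => n % 2 === 0;
const relaxed: ScanLimits = { maxRowsScanned: 100, deadlineMs: 1000, now: () => 0, isCancelled: () => false };

const fullScan = boundedScan(numbers, isEven, relaxed);
// fullScan.status is "complete", fullScan.matched is [2, 4, 6]

const cappedScan = boundedScan(numbers, isEven, { ...relaxed, maxRowsScanned: 3 });
// cappedScan.status is "limit", cappedScan.matched is [2]
```

Returning the partial matches together with a status would let the frontend show what was found so far alongside a banner, similar to the sortSkipped idea.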

To answer the points above:

1. Cost, latency, and cancellation: You are right that sort, row search, contains, and endsWith can be expensive on large outputs. Sort is especially risky because it has to happen in backend memory.
2. View vs. dataflow semantics: Yes, this is intended to affect only the result-pane view. The downstream operator still receives the full operator output. I can make this explicit in the UI. I will label these controls as view/search controls, not workflow transformations, and add a visible "view only" banner near active filters so users do not confuse this with the Filter / Selection operators.
3. Persistence of query state: Not persisted as workflow semantics. At most we keep per-operator query state transiently in the browser session, so switching away and back restores the same view. A page reload starts from the unfiltered result.
4. Relationship with Filter / Selection operators: I see the line as:
   - result-pane filters/search/sort: temporary exploration and debugging;
   - Filter / Selection operators: persistent workflow transformations whose outputs flow downstream.

Do this split and these details sound reasonable? If so, I can narrow the implementation to the interactive result grid first and leave backend full-result filtering/search/sort for a separate design and follow-up PR. I am also open to other suggestions. Thanks!

chenlica (Collaborator) · 2026-06-13T19:44:19Z

@tanishqgandhi1908 If the details are hard to discuss online, we can do an offline discussion and report the results here.

mengw15 (Collaborator) · 2026-06-13T20:14:53Z

I agree that an offline meeting would be helpful.

My main concern is that I am not fully convinced these parts need to be changed. For example, why do we need to replace the current result table with ag-grid? Also, the floating result panel is an intentional design choice, so I am not sure why we should refactor it into a static/bottom-docked panel. I also believe Texera's frontend currently depends on NG-ZORRO, so I am not sure whether we want to introduce a second table dependency. cc @aglinxinyuan

For the fourth point, my opinion is that Texera is a data-analysis workflow platform. If users want to see the result of selection/filter/sort as part of the analysis, they should add the corresponding operator to the workflow, especially when the cost is comparable. Result-panel filtering/sorting can be useful for temporary inspection, but I would prefer to keep the current design. This is just my current opinion.

tanishqgandhi1908 (Author) · 2026-06-13T21:03:23Z

Oh, got it. Sure, let's connect offline and discuss.

aglinxinyuan (Collaborator) · 2026-06-13T21:37:18Z

> I agree that an offline meeting would be helpful.
>
> My main concern is that I am not fully convinced these parts need to be changed. For example, why do we need to replace the current result table with ag-grid? Also, the floating result panel is an intentional design choice, so I am not sure why we should refactor it into a static/bottom-docked panel. I also believe Texera's frontend currently depends on NG-ZORRO, so I am not sure whether we want to introduce a second table dependency. cc @aglinxinyuan
>
> For the fourth point, my opinion is that Texera is a data-analysis workflow platform. If users want to see the result of selection/filter/sort as part of the analysis, they should add the corresponding operator to the workflow, especially when the cost is comparable. Result-panel filtering/sorting can be useful for temporary inspection, but I would prefer to keep the current design. This is just my current opinion.

It's very obvious that the result panel should be a floating panel. It's OK to introduce a new table library if NG-ZORRO cannot provide the features we need.
