pub struct TopK { /* private fields */ }
Expand description
Global TopK
§Background
“Top K” is a common query optimization used for queries such as “find the top 3 customers by revenue”. The (simplified) SQL for such a query might be:
SELECT customer_id, revenue FROM 'sales.csv' ORDER BY revenue DESC limit 3;
The simple plan would be:
> explain SELECT customer_id, revenue FROM sales ORDER BY revenue DESC limit 3;
+--------------+----------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------+
| logical_plan | Limit: 3 |
| | Sort: revenue DESC NULLS FIRST |
| | Projection: customer_id, revenue |
| | TableScan: sales |
+--------------+----------------------------------------+
While this plan produces the correct answer, it will fully sorts the input before discarding everything other than the top 3 elements.
The same answer can be produced by simply keeping track of the top K=3 elements, reducing the total amount of required buffer memory.
§Partial Sort Optimization
This implementation additionally optimizes queries where the input is already partially sorted by a common prefix of the requested ordering. Once the top K heap is full, if subsequent rows are guaranteed to be strictly greater (in sort order) on this prefix than the largest row currently stored, the operator safely terminates early.
§Example
For input sorted by (day DESC)
, but not by timestamp
, a query such as:
SELECT day, timestamp FROM sensor ORDER BY day DESC, timestamp DESC LIMIT 10;
can terminate scanning early once sufficient rows from the latest days have been collected, skipping older data.
§Structure
This operator tracks the top K items using a TopKHeap
.
Implementations§
Source§impl TopK
impl TopK
Sourcepub fn try_new(
partition_id: usize,
schema: Arc<Schema>,
common_sort_prefix: LexOrdering,
expr: LexOrdering,
k: usize,
batch_size: usize,
runtime: Arc<RuntimeEnv>,
metrics: &ExecutionPlanMetricsSet,
) -> Result<TopK, DataFusionError>
pub fn try_new( partition_id: usize, schema: Arc<Schema>, common_sort_prefix: LexOrdering, expr: LexOrdering, k: usize, batch_size: usize, runtime: Arc<RuntimeEnv>, metrics: &ExecutionPlanMetricsSet, ) -> Result<TopK, DataFusionError>
Create a new TopK
that stores the top k
values, as
defined by the sort expressions in expr
.
Sourcepub fn insert_batch(
&mut self,
batch: RecordBatch,
) -> Result<(), DataFusionError>
pub fn insert_batch( &mut self, batch: RecordBatch, ) -> Result<(), DataFusionError>
Insert batch
, remembering if any of its values are among
the top k seen so far.
Sourcepub fn emit(
self,
) -> Result<Pin<Box<dyn RecordBatchStream<Item = Result<RecordBatch, DataFusionError>> + Send>>, DataFusionError>
pub fn emit( self, ) -> Result<Pin<Box<dyn RecordBatchStream<Item = Result<RecordBatch, DataFusionError>> + Send>>, DataFusionError>
Returns the top k results broken into batch_size
RecordBatch
es, consuming the heap
Auto Trait Implementations§
impl Freeze for TopK
impl !RefUnwindSafe for TopK
impl Send for TopK
impl Sync for TopK
impl Unpin for TopK
impl !UnwindSafe for TopK
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self
into a Left
variant of Either<Self, Self>
if into_left
is true
.
Converts self
into a Right
variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self
into a Left
variant of Either<Self, Self>
if into_left(&self)
returns true
.
Converts self
into a Right
variant of Either<Self, Self>
otherwise. Read more