Struct datafusion::physical_plan::TopK
source · pub struct TopK { /* private fields */ }
Expand description
Global TopK
§Background
“Top K” is a common query optimization used for queries such as “find the top 3 customers by revenue”. The (simplified) SQL for such a query might be:
SELECT customer_id, revenue FROM 'sales.csv' ORDER BY revenue DESC limit 3;
The simple plan would be:
> explain SELECT customer_id, revenue FROM sales ORDER BY revenue DESC limit 3;
+--------------+----------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------+
| logical_plan | Limit: 3 |
| | Sort: revenue DESC NULLS FIRST |
| | Projection: customer_id, revenue |
| | TableScan: sales |
+--------------+----------------------------------------+
While this plan produces the correct answer, it will fully sorts the input before discarding everything other than the top 3 elements.
The same answer can be produced by simply keeping track of the top K=3 elements, reducing the total amount of required buffer memory.
§Structure
This operator tracks the top K items using a TopKHeap
.
Implementations§
source§impl TopK
impl TopK
sourcepub fn try_new(
partition_id: usize,
schema: Arc<Schema>,
expr: Vec<PhysicalSortExpr>,
k: usize,
batch_size: usize,
runtime: Arc<RuntimeEnv>,
metrics: &ExecutionPlanMetricsSet,
partition: usize
) -> Result<TopK, DataFusionError>
pub fn try_new( partition_id: usize, schema: Arc<Schema>, expr: Vec<PhysicalSortExpr>, k: usize, batch_size: usize, runtime: Arc<RuntimeEnv>, metrics: &ExecutionPlanMetricsSet, partition: usize ) -> Result<TopK, DataFusionError>
Create a new TopK
that stores the top k
values, as
defined by the sort expressions in expr
.
sourcepub fn insert_batch(
&mut self,
batch: RecordBatch
) -> Result<(), DataFusionError>
pub fn insert_batch( &mut self, batch: RecordBatch ) -> Result<(), DataFusionError>
Insert batch
, remembering if any of its values are among
the top k seen so far.
sourcepub fn emit(
self
) -> Result<Pin<Box<dyn RecordBatchStream<Item = Result<RecordBatch, DataFusionError>> + Send>>, DataFusionError>
pub fn emit( self ) -> Result<Pin<Box<dyn RecordBatchStream<Item = Result<RecordBatch, DataFusionError>> + Send>>, DataFusionError>
Returns the top k results broken into batch_size
RecordBatch
es, consuming the heap