Blog | Velox

From flaky Axiom CI to a Velox bug fix: a cross-repo debugging story

· 9 min read
Masha Basmanova
Software Engineer @ Meta

TL;DR

When adding macOS CI to Axiom, set operation tests kept failing intermittently — but only in macOS debug CI. Linux CI (debug and release) passed consistently. Local runs always passed. The root cause turned out to be a bug in Velox — a dependency managed as a Git submodule. This post describes the process of debugging a CI-only failure when the bug lives in a different repository.

The problem

Axiom is a set of open-source composable libraries for building SQL and dataframe processing front ends on top of Velox. It integrates Velox as a Git submodule.

After adding macOS CI builds (#1168), we couldn't get the test suite to pass reliably. Various set operation tests (SqlTest.set_*) failed intermittently in macOS debug CI — different tests on different runs, all involving EXCEPT ALL or INTERSECT ALL. Linux debug and release CI passed consistently. Running the same tests locally — same build type, same data, same configuration — always passed.

Why CI-only failures are hard to debug

The obvious approach — reproduce locally, attach a debugger, step through the code — doesn't work. You are limited to CI runs, and each run takes about 10 minutes end-to-end: push the change, wait for CI to pick it up, build (~8 minutes even with ccache), run tests, check results. That's 3-4 iterations per hour at best. Every CI run has to be designed to extract maximum signal.

Step 1: Make CI runs count

The first change was to restructure CI to focus on the problem:

  • Disable unrelated builds. Linux debug, Linux release, and macOS release were all passing. Disabling them reduced the CI cycle time.
  • Run only SqlTest.set_* tests 5 times. The tests were flaky — sometimes passing, sometimes failing. Running just the set operation tests (not the full suite) 5 times increased the chances of triggering the failure together with the debug logging while keeping the run fast. Without this, a CI run could pass by luck, wasting 10 minutes with no useful signal.
  • Run the full test suite after. The 5x set-test loop ran first. If it passed, the full suite ran to check for regressions.

Step 2: Iterate on Velox changes using Axiom CI

The failing tests all involved EXCEPT ALL and INTERSECT ALL queries. Axiom's SQL parser and optimizer translate these into Velox execution plans using counting joins — a Velox-specific join type that, unlike regular semi/anti joins that check whether a key exists, tracks how many times each key appears on the build side. Since the implementation lives in Velox, the bug had to be there too.

But the tests were in Axiom. Axiom uses Velox as a Git submodule pointing to a specific commit on facebookincubator/velox. To add debug logging to Velox and test it in Axiom CI, we needed Axiom to build from a modified Velox.

The key insight is that you don't have to land changes to Velox first. You can point Axiom's submodule at a fork branch and let CI validate end-to-end.

The workflow:

  1. Fork Velox and push changes to a branch. Create a branch on your fork (e.g., mbasmanova/velox) with the debug logging or fix.
  2. Point Axiom's submodule at the fork. Two changes are needed:
    • Update .gitmodules to use the fork URL:

          [submodule "velox"]
              path = velox
              url = https://github.com/mbasmanova/velox.git

    • Update the submodule to the desired commit:

          git -C velox checkout <commit>
          git add velox .gitmodules
  3. Push and let CI build. Axiom CI checks out submodules recursively, so it picks up the modified Velox automatically.

One subtlety: the Velox PR and the Axiom submodule both pushed to branches on the same fork (mbasmanova/velox). Initially we used the same branch name for both, which caused force-pushes from one repo to overwrite the other. The fix was to use separate branches — fix-counting-join-merge for the Velox PR and fix-counting-join-merge-axiom for the Axiom submodule.

Step 3: Form a hypothesis, then add targeted logging

With limited CI iterations, random logging is wasteful. Each round of logging must be designed to prove or disprove one or more hypotheses.

The hypothesis was:

When multiple build drivers process overlapping keys, each driver builds its own hash table with per-key counts. When these tables are merged, duplicate keys are dropped without summing their counts.

To test this, we added logging at three points in the counting join lifecycle, logging full input/output data per driver including the driver ID:

  1. Build input — log each batch of rows received by each build driver.
  2. Probe input — log each batch of probe rows.
  3. Probe output — log the rows emitted by each probe driver.

By comparing build inputs across drivers (to see which keys each driver processed) with probe outputs (to see the final results), we could determine whether the merged hash table had correct per-key counts.

Before pushing to CI, we ran the tests locally. The bug didn't reproduce locally, but the logging code paths still executed. This validated that the log output was readable, the format was correct, and the information needed to confirm or reject the hypothesis would be present.

One CI run with this logging confirmed the hypothesis: multiple build drivers processed overlapping keys, but the probe output showed counts as if only one driver's keys were present — the merge had dropped duplicate key counts.

Step 4: The fix

The actual bug was straightforward once identified. Velox's HashTable has three code paths for inserting rows during hash table merge:

  • arrayPushRow for array-based hash mode (small key domains)
  • buildFullProbe with normalized-key comparison (medium key domains)
  • buildFullProbe with full key comparison (complex types)

All three paths handled duplicate keys correctly for regular joins (linking rows via a "next" pointer), but for counting joins (which use a count instead of a next pointer), duplicates were silently dropped.

The fix adds an addCount method to sum counts when a duplicate key is found during merge. Since Axiom's build does not compile Velox tests, we built Velox standalone to develop and run the test.
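The shape of the bug and the fix can be illustrated with a simplified model. This is not the actual Velox HashTable code; the types and the mergeBuggy/mergeFixed helpers below are hypothetical stand-ins for the three insert paths:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical simplified model of merging per-driver hash tables for a
// counting join. Each driver's table maps a key to how many times it was
// seen on the build side.
using CountingTable = std::unordered_map<int64_t, int64_t>;

// Buggy merge: for a duplicate key, 'insert' is a no-op, so the
// second driver's count is silently dropped.
CountingTable mergeBuggy(const std::vector<CountingTable>& tables) {
  CountingTable merged;
  for (const auto& t : tables) {
    for (const auto& [key, count] : t) {
      merged.insert({key, count});
    }
  }
  return merged;
}

// Fixed merge: sum counts when a duplicate key is found, which is the
// role the addCount method plays in the actual fix.
CountingTable mergeFixed(const std::vector<CountingTable>& tables) {
  CountingTable merged;
  for (const auto& t : tables) {
    for (const auto& [key, count] : t) {
      merged[key] += count;
    }
  }
  return merged;
}
```

With two drivers that both saw key 1 (counts 2 and 3), the buggy merge reports 2 while the fixed merge reports 5, which is exactly the symptom the logging exposed: probe output counts as if only one driver's keys were present.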

Designing a test that exercises all three code paths was non-trivial. The hash table mode is chosen automatically based on key types, number of distinct values, and value ranges. Our first test attempts only hit array mode because the key domain was small. We had to study the mode selection logic in decideHashMode() to find key configurations that force each mode. This analysis was time-consuming enough that we documented the hash modes (#16953) to spare other developers the same exercise. The key configurations we found:

  • Small integers {1, 2} → array mode
  • Two integer columns with 1500 distinct values each (combined cardinality exceeds the 2M array threshold) → normalized-key mode
  • ARRAY(INTEGER) keys (complex types don't support value IDs) → hash mode

We verified each variant independently by reverting the fix for one code path at a time and confirming the test failed.

Step 5: Landing the fix

With the fix validated in Axiom CI, we created two PRs:

  • Velox PR (#16949): The fix, test, documentation updates, and a new hashtable.hashMode runtime stat.
  • Axiom PR (#1175): Update the Velox submodule and stop skipping set operation tests in macOS debug CI (they were excluded via GTEST_FILTER="-SqlTest.set_*" as a workaround).

After the Velox PR merges, the Axiom submodule will be updated to point back to facebookincubator/velox.

Why only macOS debug?

Honestly, we don't fully know.

The bug requires splits with overlapping keys to land on different build drivers. The test uses 4 drivers and 3 data splits (one per input vector). Velox assigns splits to drivers on demand — whichever driver is ready first grabs the next split. If all splits happen to be processed by the same driver, there is no merge and no bug.

Note that this is not a race condition. Given the same split-to-driver assignment, the result is deterministic and reproducibly wrong. The non-determinism is purely in which driver grabs which split, which depends on thread scheduling.

What we can't explain is why macOS debug CI triggers this while the same macOS debug build locally does not. Same OS, same build type, same compiler, same test data, same number of drivers. Yet the thread scheduling differs enough to change split distribution. Differences in CPU, load, or runtime environment on the CI runner may play a role.

If you have insights into what could cause such a difference in thread scheduling between local and CI macOS environments, please comment on axiom#1170.

Takeaways

  • Don't dismiss "flaky" tests. A test that sometimes passes and sometimes fails is not necessarily a bad test. In this case, the test was correct — the production code was buggy. The flakiness was the only signal that something was wrong. We initially excluded the set tests via GTEST_FILTER to land the macOS CI work, but came back to investigate right away. Leaving them disabled would have hidden a real bug in Velox affecting all users of EXCEPT ALL and INTERSECT ALL.
  • Design each CI run for maximum signal. When you get 3-4 iterations per hour, you can't afford exploratory logging. Form a hypothesis first, design logging to confirm or reject it, validate the logging locally, then push.
  • Run flaky tests multiple times. Not to distinguish flaky from broken, but to ensure you reliably trigger the failure together with your debug logging in a single CI run.
  • Use downstream CI to iterate on dependency fixes. When a bug lives in a dependency, you can debug and validate fixes without landing anything. Point the submodule at a fork branch and let downstream CI run end-to-end. See Step 2 above for details.

Adaptive Per-Function CPU Time Tracking

· 6 min read
Rajeev Singh
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Context

Velox evaluates SQL expressions as trees of functions. A query like if(array_gte(a, b), multiply(x, y), 0) compiles into a tree where each node processes an entire vector of rows at a time. When a query runs slowly, the first question usually is: which function is consuming the most CPU? Is it the expensive array comparison, or the cheap arithmetic called millions of times? This problem is even more prominent in use cases like training data loading, when very long and deeply nested expression trees are common, and jobs may run for many hours, or days; in such cases, the CPU usage of even seemingly short-lived functions may add up to substantial overhead. Without a detailed per-function CPU usage breakdown, you may be left guessing — or worse, optimizing the wrong thing.

Velox supports CPU usage tracking through the expression.track_cpu_usage query config. When enabled, every function invocation is wrapped in a CpuWallTimer — an RAII guard that makes the clock_gettime() system call at entry and exit, recording the CPU and wall time spent in each function. This gives precise, per-function cost attribution across the entire expression tree.
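A minimal sketch of what such an RAII guard looks like, assuming POSIX clock_gettime; the names and structure here are simplified and differ from the real CpuWallTimer:

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Accumulated CPU and wall time for one function in the expression tree.
struct CpuTiming {
  int64_t cpuNanos{0};
  int64_t wallNanos{0};
};

// RAII guard: reads the clocks on construction and destruction and adds
// the difference to the referenced CpuTiming.
class ScopedCpuWallTimer {
 public:
  explicit ScopedCpuWallTimer(CpuTiming& timing) : timing_(timing) {
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpuStart_);
    clock_gettime(CLOCK_MONOTONIC, &wallStart_);
  }
  ~ScopedCpuWallTimer() {
    timespec cpuEnd, wallEnd;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpuEnd);
    clock_gettime(CLOCK_MONOTONIC, &wallEnd);
    timing_.cpuNanos += diff(cpuStart_, cpuEnd);
    timing_.wallNanos += diff(wallStart_, wallEnd);
  }

 private:
  static int64_t diff(const timespec& a, const timespec& b) {
    return (b.tv_sec - a.tv_sec) * 1'000'000'000LL + (b.tv_nsec - a.tv_nsec);
  }
  CpuTiming& timing_;
  timespec cpuStart_, wallStart_;
};
```

Wrapping a function invocation is then just declaring the guard in the scope of the call; the two clock_gettime pairs per scope are the fixed cost discussed below.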

The Overhead Problem

That precision comes at a price. Each CpuWallTimer invocation performs a heap allocation, two clock_gettime() syscalls, and a heap deallocation, roughly 5μs of fixed overhead depending on the platform. For a function like array_gte(), which spends milliseconds per vector iterating over array elements, 5μs is negligible noise. But for a function like multiply, whose per-row work is a single instruction, the timer takes longer than the work itself:

| Vector Size | No Tracking | Full Tracking | Overhead |
| --- | --- | --- | --- |
| 100 rows | 2.13ms | 5.89ms | 2.8x |
| 1,000 rows | 300μs | 668μs | 2.2x |
| 10,000 rows | 193μs | 233μs | 1.2x |

Benchmark for multiply(a, b) on double columns. Times are per 10K iterations.

This overhead forced production systems to disable per-function CPU tracking entirely, losing granular visibility into CPU consumption. Selective per-function tracking (expression.track_cpu_usage_for_functions) was later added to Velox to let operators opt in specific functions, but even for tracked functions the overhead remained: you just chose which functions paid the cost.

Why Uniform Sampling Falls Short

The obvious next step is to sample: measure every Nth vector instead of every one, and extrapolate. A uniform rate applied to all functions would be simple — one counter, one config key. But a single rate cannot serve both cheap and expensive functions.

Consider array_gte, where the timer overhead is less than 0.1% of execution time. Sampling in this case throws away measurement precision for no benefit — the function can afford full tracking. Now consider multiply on small vectors, where the timer is 2500% of the function cost. Even a 1/1000 rate may not be aggressive enough. A rate that works for one function is either inaccurate or insufficient for another.

Adaptive Per-Function Sampling

When a query starts running, adaptive sampling does not yet know which functions are cheap and which are expensive. So it watches.

The first vector through each function is a warmup — caches and branch predictors settle, and the system measures the cost of the CpuWallTimer itself by creating and destroying a dummy one. Over the next 5 vectors, each function runs with a lightweight stopwatch instead, quietly accumulating how long it actually takes without any timer overhead.

By the sixth vector, the system has what it needs: the ratio of timer overhead to function cost. If the ratio is within a configurable threshold (default 1%), the function switches to full tracking: every future vector gets a CpuWallTimer, with zero information loss. If the timer dominates, the function switches to sampling at a rate that bounds overhead to the threshold:

| Function | Per-Vector Cost | Timer Overhead | Overhead Ratio | Decision |
| --- | --- | --- | --- | --- |
| multiply(a,b), 100 rows | ~0.2μs | ~5μs | 2500% | Sample 1/2500 |
| multiply(a,b), 10K rows | ~20μs | ~5μs | 25% | Sample 1/25 |
| array_gte(a,b), 100 rows | ~3ms | ~5μs | 0.17% | Always track |
| array_gte(a,b), 10K rows | ~12ms | ~5μs | 0.04% | Always track |

Expensive functions get exact tracking. Cheap functions sample aggressively. No per-function configuration needed — just a single "max acceptable overhead" knob.

When stats are read, sampled functions have their CPU and wall time scaled up by the ratio of total vectors to timed vectors. To handle short queries where the sampling counter might never fire, the first post-calibration vector is always timed, guaranteeing at least one measurement to extrapolate from.
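The calibration decision and the read-time extrapolation can be sketched as follows. The function names and the rounding choice here are assumptions for illustration, not the actual Velox implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Outcome of calibration for one function: either time every vector, or
// time 1 out of sampleEveryN vectors.
struct SamplingDecision {
  bool alwaysTrack;
  int64_t sampleEveryN; // 1 when alwaysTrack.
};

// Given the measured timer overhead and the average per-vector cost of a
// function, choose a tracking mode that bounds overhead to maxOverheadPct.
SamplingDecision decide(
    double timerOverheadNanos,
    double perVectorCostNanos,
    double maxOverheadPct) {
  const double ratioPct = 100.0 * timerOverheadNanos / perVectorCostNanos;
  if (ratioPct <= maxOverheadPct) {
    return {true, 1};
  }
  // Timing 1 in N vectors amortizes the timer to timerOverhead/N per
  // vector; choose N so this equals maxOverheadPct of the function cost.
  const int64_t n = static_cast<int64_t>(ratioPct / maxOverheadPct + 0.5);
  return {false, std::max<int64_t>(n, 2)};
}

// At stats-read time, sampled times are scaled by the ratio of total
// vectors to timed vectors.
double extrapolate(double timedNanos, int64_t totalVectors, int64_t timedVectors) {
  return timedNanos * static_cast<double>(totalVectors) / timedVectors;
}
```

With the numbers from the table above, a ~5μs timer against a ~0.2μs multiply lands on roughly 1-in-2500 sampling, while array_gte's sub-1% ratio selects full tracking.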

Results

So does it work? We benchmarked adaptive sampling against both no tracking and full tracking, across cheap and expensive functions, at different vector sizes, and with different adaptive rates (1% and 0.5%). Results are presented below:

| Benchmark | No Tracking | Full Tracking | Adaptive 1% | Adaptive 0.5% |
| --- | --- | --- | --- | --- |
| multiply 100 rows | 2.13ms | 5.86ms (36%) | 2.19ms (97%) | 2.16ms (98%) |
| multiply 1K rows | 299μs | 671μs (44%) | 304μs (98%) | 302μs (99%) |
| multiply 10K rows | 193μs | 224μs (86%) | 165μs (117%) | 193μs (100%) |
| array_gte 100 rows | 188ms | 193ms (98%) | 189ms (100%) | 187ms (100%) |
| array_gte 1K rows | 32ms | 32ms (100%) | 32ms (100%) | 32ms (100%) |
| array_gte 10K rows | 11ms | 11ms (100%) | 11ms (100%) | 11ms (100%) |

(Percentages are relative to "No Tracking" — higher is better)

For multiply on 100-row vectors, overhead drops from 64% to under 3%. For array_gte, overhead was already negligible and remains so — the function is fully tracked with near zero sampling.

The natural concern with sampling is accuracy. We ran a correctness benchmark comparing the extrapolated CPU time from adaptive sampling against the true value from full tracking over 1,000 vectors:

| Function | 100 rows | 1,000 rows | 10,000 rows |
| --- | --- | --- | --- |
| multiply | 108.8% | 102.4% | 103.7% |
| array_gte | 99.8% | 100.3% | 99.6% |

Estimates land within 1–9% of the true value. For expensive functions that end up in always track mode, accuracy is exact — every vector is timed.

Configuration

Two config properties control adaptive sampling:

| Config | Type | Default | Description |
| --- | --- | --- | --- |
| expression.adaptive_cpu_sampling | bool | false | Enable adaptive per-function sampling |
| expression.adaptive_cpu_sampling_max_overhead_pct | double | 1.0 | Target max overhead percentage |

Adaptive sampling interacts cleanly with existing tracking modes. expression.track_cpu_usage (global tracking for all functions) takes highest priority and always wins. expression.track_cpu_usage_for_functions (selective per-function tracking) overrides adaptive for named functions. Adaptive sampling applies to everything else.

Takeaway

Adaptive per-function CPU time tracking eliminates the tradeoff between observability and performance. Every function gets the best tracking mode for its cost profile — cheap functions run at near-native speed with minimal overhead, expensive functions get exact tracking at zero additional cost. The system is self-tuning and requires no per-function configuration.

Accelerating Unicode string processing with SIMD in Velox

· 8 min read
Ping Liu
Software Engineer
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.
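The core trick can be sketched portably with 8-byte words instead of SIMD registers. This is an illustration of the block-skip and bitmask-count idea only; the actual Velox implementation uses wider vector registers and differs in details:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Count UTF-8 characters by counting character-start bytes. A byte begins
// a character iff it is not a continuation byte, i.e. (b & 0xC0) != 0x80.
size_t countUtf8Chars(const std::string& s) {
  size_t count = 0;
  size_t i = 0;
  const size_t n = s.size();
  while (i + 8 <= n) {
    uint64_t block;
    std::memcpy(&block, s.data() + i, 8);
    if ((block & 0x8080808080808080ULL) == 0) {
      // Pure-ASCII block: every byte starts a character, skip in one step.
      count += 8;
    } else {
      // A byte is a continuation iff its top bits are 10xxxxxx: bit 7 set
      // and bit 6 clear. (block << 1) moves each byte's bit 6 into the
      // bit 7 slot, so this mask has bit 7 set exactly for continuations.
      uint64_t cont = block & ~(block << 1) & 0x8080808080808080ULL;
      count += 8 - __builtin_popcountll(cont);
    }
    i += 8;
  }
  // Scalar tail for the last < 8 bytes.
  for (; i < n; ++i) {
    count += (static_cast<uint8_t>(s[i]) & 0xC0) != 0x80;
  }
  return count;
}
```

The capped variants in Velox stop once the character or byte cap is reached, but the per-block logic is the same: one comparison for ASCII blocks, bitmask arithmetic plus a popcount for mixed blocks.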

The hidden traps of regex in LIKE and split

· 6 min read
Masha Basmanova
Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.
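A toy illustration of the surprise, using std::regex rather than the regex engines Presto or Spark actually use: a delimiter like "." is a metacharacter when treated as a regular expression, so splitting on it matches every character instead of the literal dot:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Split a string on a regular-expression delimiter. Submatch -1 selects
// the non-matched parts between delimiter matches.
std::vector<std::string> splitByRegex(
    const std::string& s,
    const std::string& delim) {
  std::regex re(delim);
  std::sregex_token_iterator it(s.begin(), s.end(), re, -1);
  std::sregex_token_iterator end;
  return {it, end};
}
```

Splitting "a.b.c" on the escaped delimiter "\\." yields {"a", "b", "c"}, while the unescaped "." matches every character and yields only empty tokens, which is the kind of silent wrong result described above.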

velox::StringView API Changes and Best Practices

· 5 min read
Pedro Pedreira
Software Engineer @ Meta

Context

Strings are ubiquitous in large-scale analytic query processing. From identifiers, names, labels, and structured data (like JSON/XML) to plain descriptive text, such as a product description or the contents of this very blog post, there is hardly a SQL query that does not require manipulating string data.

This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs involved and some recent changes made to them, and presents best practices for string usage throughout Velox's codebase.

Task Barrier: Efficient Task Reuse and Streaming Checkpoints in Velox

· 4 min read
Xiaoxuan Meng
Software Engineer @ Meta
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

TL;DR

Velox Task Barriers provide a synchronization mechanism that not only enables efficient task reuse, important for workloads such as AI training data loading, but also delivers the strict sequencing and checkpointing semantics required for streaming workloads.

By injecting a barrier split, users guarantee that no subsequent data is processed until the entire DAG is flushed and the synchronization signal is unblocked. This capability serves two critical patterns:

  1. Task Reuse: Eliminates the overhead of repeated task initialization and teardown by safely reconfiguring warm tasks for new queries. This is a recurring pattern in AI training data loading workloads.

  2. Streaming Processing: Enables continuous data handling with consistent checkpoints, allowing stateful operators to maintain context across batches without service interruption.

See the Task Barrier Developer Guide for implementation details.

Why Sort is row-based in Velox — A Quantitative Assessment

· 8 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example, where our implementation first materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence supporting it. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve; in some cases it was even slower. We present the two variants and discuss why one is counter-intuitively faster than the other.
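The two variants can be sketched with a toy model of one key column and one payload column; the real operator works on Velox vectors and a row container, but the contrast is the same:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Toy columnar batch: one integer sort key and one string payload column.
struct Columns {
  std::vector<int> keys;
  std::vector<std::string> payload;
};

// Row-based: materialize (key, payload) rows, sort the rows, split back
// into columns. This is the strategy the OrderBy operator uses.
Columns sortRowBased(const Columns& in) {
  std::vector<std::pair<int, std::string>> rows;
  for (size_t i = 0; i < in.keys.size(); ++i) {
    rows.emplace_back(in.keys[i], in.payload[i]);
  }
  std::stable_sort(rows.begin(), rows.end(),
      [](const auto& a, const auto& b) { return a.first < b.first; });
  Columns out;
  for (auto& [k, p] : rows) {
    out.keys.push_back(k);
    out.payload.push_back(std::move(p));
  }
  return out;
}

// Key-only: sort an index permutation by key, then gather payloads. The
// payload column is never materialized into rows.
Columns sortKeyOnly(const Columns& in) {
  std::vector<size_t> idx(in.keys.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::stable_sort(idx.begin(), idx.end(),
      [&](size_t a, size_t b) { return in.keys[a] < in.keys[b]; });
  Columns out;
  for (size_t i : idx) {
    out.keys.push_back(in.keys[i]);
    out.payload.push_back(in.payload[i]);
  }
  return out;
}
```

Both produce identical output; the question the article examines is which one is faster end-to-end once cache behavior and gather costs are accounted for.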

Multi-Round Lazy Start Merge

· 6 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Background

Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions—for example, merging hourly partitions into daily ones—while ensuring that all rows sharing the same primary key are stored consecutively. Training data is typically partitioned and bucketed by primary key, with rows sharing the same key stored consecutively, so merging across partitions essentially becomes a multi-way merge problem.

Normally, Apache Spark can be used for this sort-merge requirement, for example via CLUSTER BY. However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale. Because we typically apply bucketing and ordering by key when preparing training data in production, Spark can eliminate the shuffle when merging training data from multiple hourly partitions. However, each Spark task can only read the files planned from various partitions within a split sequentially, placing them into the sorter and spilling as needed. Only after all files have been read does Spark perform a sort-merge of the spilled files. This process produces a large number of small spill files, which further degrades efficiency.
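Stripped of storage and spilling concerns, the underlying operation is a classic k-way merge. A minimal sketch, for illustration only and not the Velox merge operator:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Merge k sorted runs into one sorted output with a min-heap, so that
// equal keys from different runs end up consecutive in the output.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
  // (value, run index, offset within run); compared lexicographically,
  // so the heap orders by value first.
  using Entry = std::tuple<int, size_t, size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (size_t r = 0; r < runs.size(); ++r) {
    if (!runs[r].empty()) {
      heap.emplace(runs[r][0], r, 0);
    }
  }
  std::vector<int> out;
  while (!heap.empty()) {
    auto [v, r, i] = heap.top();
    heap.pop();
    out.push_back(v);
    // Advance within the run the popped value came from.
    if (i + 1 < runs[r].size()) {
      heap.emplace(runs[r][i + 1], r, i + 1);
    }
  }
  return out;
}
```

At PB scale the interesting problems are around this loop rather than in it: when each run is opened, how many are merged per round, and how spill files are kept large, which is what the lazy-start, multi-round design addresses.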

Velox switches to C++20 standard

· 3 min read
Christian Zentgraf
Software Engineer @ IBM

Background

The C++ standard used in a project determines what built-in functionality developers can rely on to ease development and extend the language's capabilities.

Since its inception in August 2021, Velox has used the C++17 standard.