[fix](function) Improve numerical robustness of cosine_distance / cosine_similarity#62840
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
88e74e5 to
3e3f967
Compare
| template <typename... Args> \ | ||
| static std::shared_ptr<TypeName> create_shared(Args&&... args) { \ | ||
| return std::make_shared<TypeName>(std::forward<Args>(args)...); \ | ||
| return std::shared_ptr<TypeName>(new TypeName(std::forward<Args>(args)...)); \ |
|
|
||
| private: | ||
| std::atomic<Version> current_version; | ||
| doris::atomic_shared_ptr<const T> current_version; |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Avoid redundant square root calculations in cosine_distance and cosine_similarity while preserving the existing semantics for non-zero vectors. Empty arrays and zero vectors keep their previous return values, and pointer preconditions are asserted instead of silently mapping unexpected internal states to user-visible distance values.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check for be/src/exprs/function/array/function_array_distance.cpp. Full build and unit/regression tests were not run because the current worktree is not initialized (.worktree_initialized and thirdparty/installed are missing).
- Behavior changed: No
- Does this need documentation: No
d4d459c to
3b84e7a
Compare
- Replace float norm `sqrt(squared_x * squared_y)` with double-precision intermediate to prevent float overflow for large-magnitude vectors (e.g. 1e19 per element: squared product overflows float to +inf, yielding a wrong cosine of 0/NaN instead of 1). - Clamp the final cosine to [-1, 1] before computing distance to prevent tiny negative cosine_distance values caused by floating-point rounding (e.g. identical vectors could yield cosine=1.0000001 -> distance=-1e-7). - Add cosine_distance unit tests covering identical/orthogonal/opposite/zero vectors and a known-value case. - Add large-magnitude regression test (1e19 elements) that would have caught the original overflow bug.
Reformat overlong static_cast<void>(...) call to comply with 100-column limit.
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
TPC-H: Total hot run time: 31302 ms |
TPC-DS: Total hot run time: 172371 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
After switching to double-precision norm in cosine_distance/cosine_similarity, the computed values are slightly more accurate (closer to the mathematical truth). Update the regression .out expectations accordingly: - cosine_distance([1,2,3],[3,5,7]): 0.002585053 -> 0.002585113 (true: ~0.0025851140) - cosine_distance([1.0,2.0,3.0],[4.0,5.0,6.0]): 0.02536809 -> 0.02536815 - cosine_similarity([0.001,0.002],[0.003,0.004]): 0.9838699 -> 0.98387 - cosine_similarity + cosine_distance roundtrip: 0.9999999534439087 -> 1.0 (clamp guarantees sim + dist = 1.0 by construction) - table row 3 cosine_similarity: 0.9746319 -> 0.9746318
|
run buildall |
TPC-H: Total hot run time: 31573 ms |
TPC-DS: Total hot run time: 172529 ms |
…tability-for-cosine-distance/similarity
After apache#63491 typed _null_map as ColumnUInt8::WrappedPtr, get_null_map_column_ptr() now returns ColumnUInt8::MutablePtr directly, making assert_cast<ColumnUInt8*>(...) a same-type cast which is rejected by the static_assert added in apache#63059. Drop the wrapper.
|
run buildall |
TPC-H: Total hot run time: 31704 ms |
TPC-DS: Total hot run time: 172035 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found one blocking correctness issue in the numerical-stability fix. Existing review threads about factory_creator.h and multi_version.h were reviewed and are not duplicated here.
Critical checkpoint conclusions:
- Goal and tests: The PR aims to make cosine_distance/cosine_similarity numerically robust for large finite FLOAT inputs and adds BE unit coverage, but the implementation still fails for valid larger values because multiplication/accumulation remains in float before the double norm is used.
- Scope and clarity: The change is focused, but the fix is incomplete in the main data path.
- Concurrency and lifecycle: No new concurrency, locking, static initialization, or non-obvious lifecycle risks found.
- Configuration and compatibility: No new config, persistence, storage format, or FE-BE protocol compatibility concerns found.
- Parallel paths: The same accumulation issue exists in both cosine_distance and cosine_similarity; the inline comment covers both.
- Error handling and special conditions: Existing zero/empty handling is preserved; no ignored Status issue found.
- Test coverage: Tests cover the product-of-squared-norm overflow case, but not overflow during the per-element float multiplication/accumulation itself.
- Observability, transactionality, memory: Not applicable to this PR.
- Performance: Widening the accumulators before multiplication is the relevant correctness fix; no separate performance blocker found.
User focus: No additional user-provided review focus was specified.
|
|
||
| DCHECK(x != nullptr && y != nullptr); | ||
|
|
||
| float dot_prod = 0; |
There was a problem hiding this comment.
This still overflows before the new double norm is computed because dot_prod, squared_x, and squared_y are accumulated as float and the products are evaluated as float. For a valid finite FLOAT input such as cosine_similarity([1e20], [1e20]), x[i] * x[i] and x[i] * y[i] become +inf; then dot_prod / norm is inf / inf, producing NaN (std::clamp does not sanitize NaN). The expected result for parallel vectors is still 1.0 similarity / 0.0 distance. Please widen the accumulators and the per-element products to double before accumulation in both cosine functions, and add a test that covers this overflow-at-accumulation case.
There was a problem hiding this comment.
Thanks for the catch. The per-element overflow only triggers when |x[i]| > sqrt(FLT_MAX) ≈ 1.84e19, which is even more pathological than the outer-product overflow this PR addresses. Widening accumulators to double would halve SIMD throughput (~5-20% end-to-end regression in vector search workloads) while only covering a stricter subset of overflow scenarios that have never been observed in real embeddings. We prefer to keep the current scope; happy to file a follow-up if a real-world case is reported.
What problem does this PR solve?
Two defensive hardening fixes in CosineDistance::distance and CosineSimilarity::distance to guarantee correct results across the full range of valid float inputs.
Fix 1: Use double-precision intermediate when computing the norm
Before:
return 1 - dot_prod / sqrt(squared_x * squared_y);
After:
const double norm = std::sqrt(static_cast(squared_x) * static_cast(squared_y));
Why: squared_x * squared_y is a float multiplication. When squared_x and squared_y are both large (e.g. input elements around 1e19), the product exceeds FLT_MAX (~3.4e38) and overflows to +inf. Then sqrt(+inf) = +inf and dot_prod / +inf = 0, so two parallel vectors silently get cosine_distance = 1.0 (should be 0.0) — a wrong result with no warning.
For typical L2-normalized embedding vectors this never triggers. But cosine_distance accepts arbitrary float* arrays, not just normalized embeddings, so the function should be safe for any finite float input. The cost is two static_cast ops; double's range (~1.8e308) cannot overflow on any finite float input.
For non-overflow inputs the result is bit-for-bit equivalent (verified by existing tests, which match the same static_cast(34.0 / std::sqrt(14.0 * 83.0)) formula).
Fix 2: Clamp cosine to [-1, 1]
After:
return std::clamp(static_cast(dot_prod / norm), -1.0f, 1.0f);
Why: Float rounding can make the computed cosine slightly exceed 1.0 for identical (or near-identical) vectors. For example with x = y = (0.1f, 0.2f, 0.3f), accumulation rounding can yield cosine = 1.0000001, then 1.0f - cosine = -1e-7 — a negative cosine_distance that violates the metric contract d >= 0 and may break downstream code (DCHECK(distance >= 0), threshold filters, distance aggregation).
std::clamp is a one-op guarantee that costs nothing for in-range values.
BE UT 编译失败修复(独立于本 PR 余弦修复)
问题:merge 最新 master 后,BE UT 在 functions_geo_test.cpp:375 编译失败:
error: static assertion failed: assert_cast is redundant for the same type
'!std::is_same_v<ColumnVector<TYPE_BOOLEAN>, ColumnVector<TYPE_BOOLEAN>>'
根因:master 上三个 PR 叠加副作用:
#63491 把 _null_map 改为强类型 ColumnUInt8::WrappedPtr
#63059 加 static_assert 拒绝 same-type assert_cast
#63049(5/26 刚合入)新写的 geo 测试还按老接口加了冗余 cast
修复:去掉冗余包装
get_null_map_column_ptr() 现在直接返回 ColumnUInt8::MutablePtr,->insert_value() 语义不变。
影响:一行改动,仅修复编译报错,不涉及测试语义和余弦相关代码。
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)