Skip to content

fix: free a column family name immediately on drop (ghost tables)#647

Open
sleekmountaincat wants to merge 1 commit into
mainfrom
fix/evict-dropped-cf-on-reopen
Open

fix: free a column family name immediately on drop (ghost tables)#647
sleekmountaincat wants to merge 1 commit into
mainfrom
fix/evict-dropped-cf-on-reopen

Conversation

@sleekmountaincat

@sleekmountaincat sleekmountaincat commented Jun 11, 2026

Copy link
Copy Markdown

Summary

Dropping a column family now unconditionally removes its by-name entry from the descriptor's registry map, so a subsequent open with the same name creates a fresh column family instead of reusing the dangling dropped handle. A new columnsMutex on DBDescriptor guards all concurrent accessors of the columns map, and flush()/close() now snapshot-and-pin the descriptors they iterate.

Purpose

Root cause of Harper's "ghost table" failures (deleted tables resurrecting after restart, Invalid column family specified in write batch poisoning a whole database env, wedged table creation). Harper opens every table's column family from multiple worker threads, so the old tryUnregisterColumnFamily refcount gate (use_count <= 2) never fired: the dropped entry stayed registered, a same-name recreate got the dead handle, and one failed write contaminated the env's shared write path until process restart. Reproduced from scratch on a live 2-node Fabric cluster (harper-pro 5.0.27) with a plain table: drop + same-name recreate without a restart breaks the env every time.

Where to focus review

  1. Intentional semantics change: drop goes from deferred-while-referenced to immediate-name-free. Instances still holding the dropped column family can read it until they close; what's gone is the old "reopen by name resurrects it" behavior. The previous two-instance drop test encoded the deferred contract and was replaced with a regression test for the new one (test/drop.test.ts). This needs an explicit maintainer call.
  2. Lock ordering: databasesMutex before columnsMutex everywhere both are held (DBRegistry::OpenDB, close() under PurgeAll); the drop path takes only columnsMutex, so it cannot interact with registry-lock holders that wait on in-flight operations at shutdown.
  3. flush() pinning (db_descriptor.cpp): runs on libuv worker threads; it now copies the shared_ptrs under the mutex so a concurrent drop/close cannot destroy a handle mid-Flush. Previously this was an unsynchronized iteration that the new every-drop erase would have made a routine data race.
  4. The last assertion in the new drop test documents that a write through a still-held dropped handle fails AND contaminates the env's write path; it must stay the final assertion.

Validation

  • Full vitest suite: 32 files, 459 passed, 1 skipped, 0 failed (macOS arm64 build; tsc, oxlint, oxfmt clean).
  • End-to-end against live harper-pro 5.0.27: this fix backported to v1.4.2 (Backport fix/evict-dropped-cf-on-reopen to v1.4.x (for 1.4.3) #648), built in-container, swapped into the running server. Drop + same-name recreate + insert works; rapid drop/recreate cycles work; no poison, no wedge; clean drops stay dropped across restart.
  • Harper-side regression tests exist in HarperFast/harper (branch fix/ghost-table-drop-atomicity, unitTests/resources/dropTableGhost.test.js): they fail on the current binding with the incident error (Could not access column family N) and pass with this fix.

Coordination

Sibling backport PR targets v1.4.x for a 1.4.3 patch release (harper-pro 5.0.x pins ^1.4.2). The harper core PR depends on a release containing this fix.

Generated by an AI agent (Claude, investigation + fix); reviewed via adversarial agent review and a cross-model (Antigravity) review whose findings (columns-map data race, flush use-after-free, lock granularity) are addressed in this diff.

🤖 Generated with Claude Code

Root cause of the harper "ghost table" failure. When a column family was
dropped while other DBHandles still referenced it (Harper opens every table
from multiple worker threads), tryUnregisterColumnFamily left the entry in
DBDescriptor::columns (use_count > 2; versions before v2.0.0 had no eviction
at all). A later open-by-name in DBRegistry::OpenDB then reused the dropped
descriptor and handed back a dangling column family handle: every write
failed with "Invalid argument: Invalid column family specified in write
batch", which poisoned the database env, wedged table creation, and made
drop+recreate-same-name impossible without a process restart.

Fix: unconditionally remove the by-name map entry on successful drop
(DBDescriptor::unregisterColumnFamily). Instances still holding the dropped
column family keep the descriptor alive through their shared_ptr and can
continue reading until they close; only the by-name lookup is removed, so a
reopen creates a fresh column family.

Because the erase now fires on every drop, the columns map gets a dedicated
columnsMutex guarding all concurrent accessors: DBRegistry::OpenDB's
copy-check-insert, the columns getter, flush() (runs on libuv worker
threads; now also pins the descriptors so handles cannot die mid-Flush),
close()'s compact loop and clear, and the erase itself. Lock order is
databasesMutex before columnsMutex; the drop path takes only columnsMutex,
avoiding any interaction with registry-lock holders that wait on in-flight
operations (e.g. PurgeAll at shutdown).

This intentionally changes drop semantics from deferred-while-referenced to
immediate: the old "drop a column family with two database instances" test
encoded the deferred contract and has been replaced with a regression test
for the new contract. The new test also documents that a write attempted
through a still-held dropped handle fails and contaminates the env's shared
write path (the poison this fix prevents callers from triggering through
reopen-by-name).

Removes DBDescriptor::tryUnregisterColumnFamily (its conditional refcount
heuristic was the bug; Drop/DropSync were its only callers).

Validated: native build clean (macOS arm64), full vitest suite 459 passed /
1 skipped / 0 failed; tsc, oxlint, and oxfmt clean (pre-commit hook bypassed
only because the local pnpm shim is broken; its checks were run manually).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions

Copy link
Copy Markdown
Contributor

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 20.27K ops/sec 49.33 47.34 588.68 0.124 101,364
🥈 rocksdb 2 11.82K ops/sec 84.62 82.47 22,429.915 0.887 59,087

getSync() > sequential keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 23.02K ops/sec 43.44 41.58 1,703.842 0.139 115,100
🥈 rocksdb 2 11.70K ops/sec 85.44 83.45 2,598.857 0.113 58,522

ranges.bench.ts

getRange() > small range (100 records, 50 range)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 24.49K ops/sec 40.83 35.61 2,198.263 0.294 122,448
🥈 rocksdb 2 16.60K ops/sec 60.25 53.35 1,062.466 0.125 82,982

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 501.49 ops/sec 1,994.038 75.65 48,956.084 9.59 1,004
🥈 lmdb 2 26.74 ops/sec 37,397.894 436.977 1,154,503.223 135.023 64.00

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 32.42K ops/sec 30.84 13.94 377.57 0.205 162,116
🥈 lmdb 2 444.54 ops/sec 2,249.502 204.166 13,227.496 1.24 2,226

Transaction log > read one entry from random position from log with 1000 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 733.21K ops/sec 1.36 1.15 5,234.623 0.222 3,666,067
🥈 lmdb 2 472.74K ops/sec 2.12 1.13 9,095.804 0.519 2,363,698

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 836.59 ops/sec 1,195.33 1,031.113 2,565.561 0.418 1,674
🥈 lmdb 2 1.18 ops/sec 850,034.027 794,870.403 887,063.028 2.35 10.00

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 20.62K ops/sec 48.50 30.46 425.228 0.487 41,235
🥈 lmdb 2 808.76 ops/sec 1,236.464 303.613 15,449.019 5.49 1,618

Results from commit c2e927b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants