Faster, memory-efficient joins when joining a large table to a small lookup table.
bloomjoin helps when:
- Large table joined to small table (10:1 ratio or more)
- Low overlap between join keys (<25%)
| n_x | n_y | overlap | speed | memory |
|---|---|---|---|---|
| 1,000,000 | 10,000 | 1% | 2.0x | 2.2x |
| 1,000,000 | 10,000 | 5% | 1.6x | 2.0x |
| 500,000 | 5,000 | 2% | 1.7x | 1.9x |
| 200,000 | 20,000 | 5% | 1.2x | 1.2x |
Values > 1 mean bloomjoin is faster / uses less memory than dplyr.
devtools::install_github("gojiplus/bloomjoin")library(bloomjoin)
result <- bloom_join(large_df, small_lookup, by = "id")Same syntax as dplyr. Supports type = "inner", "left", "right", "semi", "anti".
- Similar-sized tables: dplyr is faster
- High overlap (>50%): no benefit from pre-filtering
| n_x | n_y | overlap | speed | memory |
|---|---|---|---|---|
| 100,000 | 100,000 | 10% | 0.4x | 0.5x |
| 100,000 | 100,000 | 50% | 0.4x | 0.4x |
Values < 1 mean dplyr is faster.
- Build a Bloom filter from the smaller table's keys
- Pre-filter the larger table to remove non-matching rows
- Run the actual join on the reduced dataset
Bloom filters have no false negatives, so no matches are lost.
MIT