Pairwise Haplotype-Resolved Alignment, Yield Afterward
The confluence of the two great streams.
General-purpose pairwise sequence aligner for bacterial genomics. Short reads, long reads, contigs - Phraya aligns all your data. Produces rich alignment superpositions with deferred filtering for SNP calling, in-silico typing, classification, and other downstream analyses. Zero binary dependencies. Native Rust implementation with SIMD optimization (AVX2/NEON).
Phase 1 MVP in development. Architecture revision completed 2026-05-27. See issue #58 for PRD.
cargo install --git https://github.com/CFSAN-Biostatistics/phraya --locked phraya
This installs the phraya binary using Rust's portable SIMD path. On ARM64 (Graviton, Apple Silicon), NEON is always active. On x86-64, a scalar fallback is used, so portable builds run at approximately 40–60% of the speed of a native SIMD build.
For full AVX2 acceleration on x86-64:
RUSTFLAGS="-C target-cpu=native" cargo install --git https://github.com/CFSAN-Biostatistics/phraya --locked phraya
Requires Rust 1.75+. No external binary dependencies (BWA, minimap2, samtools, htslib).
Most aligners force you to choose filtering parameters (mapping quality, coverage thresholds, multi-mapping behavior) before seeing results. Wrong assumptions mean expensive re-alignment.
Phraya separates alignment computation from filtering decisions:
- Plan: Analyze inputs, detect use case, build k-mer evidence index
- Align: Execute alignments, store rich metadata (multi-mapping, CIGAR, coverage)
- Filter: Experiment with different parameters without re-aligning
Cache alignment results. Try different filters. Reuse for multiple downstream analyses.
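The idea of filtering cached results under several parameter sets can be sketched in a few lines of Rust. This is a minimal illustration, not the phraya-filter API: the `Observation` and `Thresholds` types, their fields, and `apply_filter` are hypothetical names chosen to mirror the CLI flags shown in this README.

```rust
/// Hypothetical, simplified stand-in for one cached variant observation.
#[derive(Debug, Clone, PartialEq)]
struct Observation {
    position: u32,
    coverage: u32,
    mapq: u8,
    multi_map_fraction: f64,
}

/// Hypothetical threshold set, named after the CLI flags in this README.
struct Thresholds {
    min_coverage: u32,
    min_mapq: u8,
    max_multi_map_fraction: f64,
}

/// Apply thresholds to cached observations. The cache is only borrowed,
/// so it can be filtered again with different thresholds.
fn apply_filter<'a>(cache: &'a [Observation], t: &Thresholds) -> Vec<&'a Observation> {
    cache
        .iter()
        .filter(|o| {
            o.coverage >= t.min_coverage
                && o.mapq >= t.min_mapq
                && o.multi_map_fraction <= t.max_multi_map_fraction
        })
        .collect()
}

/// Small made-up cache used only for illustration.
fn example_cache() -> Vec<Observation> {
    vec![
        Observation { position: 100, coverage: 12, mapq: 60, multi_map_fraction: 0.0 },
        Observation { position: 250, coverage: 7, mapq: 40, multi_map_fraction: 0.1 },
        Observation { position: 900, coverage: 25, mapq: 35, multi_map_fraction: 0.6 },
    ]
}
```

With the sample cache, a strict set (coverage ≥ 10, MAPQ ≥ 30) keeps positions 100 and 900, while a lenient set (coverage ≥ 5, multi-map fraction ≤ 0.3) keeps 100 and 250. Both results come from the same cached observations, with no alignment step repeated.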
# 1. Create alignment plan (detects use case, builds k-mer evidence)
phraya plan --inputs reads/*.fastq --reference ref.fasta --output cohort.phrayaplan
# 2. Extract task list for parallel execution
phraya plan-tasks cohort.phrayaplan > tasks.tsv
cat tasks.tsv | parallel --colsep '\t' phraya align cohort.phrayaplan {1} {2}
# 3. Merge results from multiple samples
phraya merge sample_*.phraya --output cohort_merged.phraya
# 4. Filter with different parameters (no re-alignment needed)
phraya filter cohort_merged.phraya --min-coverage 10 --min-mapq 30 --format vcf > variants.vcf
phraya filter cohort_merged.phraya --min-coverage 5 --max-multi-map-fraction 0.3 --format tsv > variants.tsv
Phraya automatically detects your workflow:
- Case 2 (main use case): N reads + reference → traditional BWA-like alignment
- Case 3 (key innovation): M contigs + N reads, no reference → selects centroid, aligns all to it
- Case 4: M contigs ± reference → minimap2-like contig alignment
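Centroid selection in Case 3 can be pictured as choosing the sequence that is, on average, closest to all the others. The sketch below assumes one plausible criterion (smallest summed Jaccard distance over plain k-mer sets); this README does not state which distance or sketch Phraya actually uses, and the function names are illustrative.

```rust
use std::collections::HashSet;

/// Set of k-mers in an ASCII DNA string (plain k-mers, not minimizers, for brevity).
fn kmer_set(seq: &str, k: usize) -> HashSet<&str> {
    if k == 0 || seq.len() < k {
        return HashSet::new();
    }
    (0..=seq.len() - k).map(|i| &seq[i..i + k]).collect()
}

/// Jaccard distance: 1 - |A ∩ B| / |A ∪ B| (0.0 when both sets are empty).
fn jaccard_distance<'s>(a: &HashSet<&'s str>, b: &HashSet<&'s str>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 { 0.0 } else { 1.0 - inter / union }
}

/// Index of the sequence with the smallest summed distance to all sequences.
/// Ties resolve to the earliest index; an empty input yields None.
fn select_centroid(seqs: &[&str], k: usize) -> Option<usize> {
    let sets: Vec<HashSet<&str>> = seqs.iter().map(|&s| kmer_set(s, k)).collect();
    let mut best: Option<(usize, f64)> = None;
    for i in 0..sets.len() {
        let total: f64 = sets
            .iter()
            .map(|other| jaccard_distance(&sets[i], other))
            .sum();
        if best.map_or(true, |(_, d)| total < d) {
            best = Some((i, total));
        }
    }
    best.map(|(i, _)| i)
}
```

For the toy inputs `AAAACCCC`, `AAAACCGG`, and `TTAACCGG` with k = 4, the middle sequence shares three k-mers with each neighbour and is selected as the centroid.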
- Multi-mapping storage: Tracks alternative alignment positions (score ratio ≥ 0.95). Filter ambiguous variants post-hoc.
- Evidence-informed: K-mer uniqueness and variation hotspots computed before alignment.
- Rich metadata: Every variant observation includes CIGAR, mapping quality, edit distance, local coverage (±50bp), all alleles, provenance.
- Coverage tracks: Quantized to nearest 5, RLE-compressed, full reference length.
- Mergeable format: Combine samples with order-independent merge preserving provenance.
- Library-first filtering: The phraya-filter crate exposes a public API for custom tools.
- Parallel-ready: Plan files emit task lists for GNU Parallel, SLURM, WDL, Nextflow.
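The multi-mapping rule (keep alternatives whose score ratio to the best hit is at least 0.95) can be sketched as follows. The `Candidate` type and `retain_alternatives` are hypothetical, and the README does not define exactly which score enters the ratio, so a generic integer alignment score is assumed here.

```rust
/// Hypothetical candidate alignment of one read.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    target_pos: u32,
    score: u32,
}

/// Threshold quoted in this README.
const SCORE_RATIO_THRESHOLD: f64 = 0.95;

/// Keep the best candidate plus every alternative with score / best >= threshold.
/// Output is ordered by descending score, then ascending position.
fn retain_alternatives(mut cands: Vec<Candidate>) -> Vec<Candidate> {
    let best = match cands.iter().map(|c| c.score).max() {
        Some(b) if b > 0 => b,
        _ => return Vec::new(), // no candidates, or nothing aligned with a positive score
    };
    cands.retain(|c| c.score as f64 / best as f64 >= SCORE_RATIO_THRESHOLD);
    cands.sort_by(|a, b| {
        b.score
            .cmp(&a.score)
            .then(a.target_pos.cmp(&b.target_pos))
    });
    cands
}
```

Given candidates scoring 100, 97, and 90, the first two are retained (ratios 1.00 and 0.97) and the third is dropped (ratio 0.90). Storing the near-best alternatives is what allows ambiguity to be assessed later at filter time.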
Workspace with 5 crates:
- phraya-core: Core types (Sequence, VariantObservation, EvidenceLayer, CoverageTrack, MinimizerSketch), errors, k-mer sketching (via simd-minimizers), centroid selection, k-mer uniqueness
- phraya-io: FASTA/FASTQ/BAM/CRAM parsing, .phrayaplan/.phraya/.phraya.queries formats (MessagePack + zstd)
- phraya-align: WFA extension, SIMD diagonal fill (SSE4.2/NEON), seeding from minimizer sketches
- phraya-filter: Filtering library (threshold/expression/preset), output formatters (VCF/TSV/phraya)
- phraya (phraya-cli/): Binary CLI (plan/plan-tasks/align/filter subcommands)
- .phrayaplan (v2): Plan file (k-mer evidence + task list). Read-only during alignment. Binary MessagePack + zstd.
- .phraya: Position index (variant observations + coverage track). Mergeable. Binary MessagePack + zstd.
- .phraya.queries: Query index (multi-mapping alternatives per read). Sidecar file. Binary MessagePack + zstd.
Download the tarball for your platform from GitHub Releases:
| Tarball | OS | Arch | SIMD | Use when |
|---|---|---|---|---|
| phraya-*-x86_64-linux-gnu-native.tar.gz | Linux | x86_64 | AVX2 | Modern x86_64 Linux (≥2013 CPUs: Haswell/Excavator or newer) |
| phraya-*-x86_64-linux-gnu-portable.tar.gz | Linux | x86_64 | SSE4.2 | Any x86_64 Linux; broadest compatibility |
| phraya-*-aarch64-linux-gnu.tar.gz | Linux | ARM64 | NEON | AWS Graviton, Ampere Altra, ARM servers |
| phraya-*-x86_64-darwin.tar.gz | macOS | Intel | AVX2 | Intel Mac |
| phraya-*-aarch64-darwin.tar.gz | macOS | Apple Silicon | NEON | M1/M2/M3/M4 Mac |
tar xzf phraya-*-x86_64-linux-gnu-native.tar.gz
./phraya --version
Portable vs native (x86_64 Linux): The native build uses AVX2 via -C target-cpu=x86-64-v3 and is ~2× faster for k-mer sketching thanks to the simd-minimizers AVX2 path. Use it on any CPU from ~2013 onward. If it exits with Illegal instruction, fall back to the portable build (SSE4.2 baseline, which runs on Intel CPUs since ~2008 and AMD CPUs since ~2011).
ARM builds (Linux ARM64 and Apple Silicon) always use NEON — there is no portable/native split because NEON is mandatory on AArch64 and always available.
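Rather than waiting for an Illegal instruction crash, the choice between builds can be made by querying the CPU. The sketch below uses the standard library's runtime feature detection; the helper names and the returned labels are illustrative and are not part of Phraya.

```rust
/// True when the running CPU reports AVX2. Only meaningful on x86-64.
#[cfg(target_arch = "x86_64")]
fn cpu_has_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// On other architectures AVX2 does not exist, so report false.
#[cfg(not(target_arch = "x86_64"))]
fn cpu_has_avx2() -> bool {
    false
}

/// Map (architecture, AVX2 support) to the build flavors described in this README.
fn recommended_build(arch: &str, has_avx2: bool) -> &'static str {
    match (arch, has_avx2) {
        ("x86_64", true) => "native (AVX2)",
        ("x86_64", false) => "portable (SSE4.2)",
        ("aarch64", _) => "aarch64 (NEON)",
        _ => "no prebuilt binary listed",
    }
}

/// Recommendation for the machine this code is running on.
fn recommended_for_this_machine() -> &'static str {
    recommended_build(std::env::consts::ARCH, cpu_has_avx2())
}
```

Note that x86-64-v3 implies more than AVX2 (for example BMI2 and FMA), so an AVX2 check is a close proxy rather than an exact test for the native build's requirements.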
# Pull the latest image (amd64 and arm64 supported)
docker pull ghcr.io/cfsan-biostatistics/phraya:latest
# Verify installation
docker run --rm ghcr.io/cfsan-biostatistics/phraya:latest --version
# Run with your data (mount current directory as /data)
docker run --rm -v $(pwd):/data ghcr.io/cfsan-biostatistics/phraya:latest \
plan --inputs /data/reads/*.fastq --reference /data/ref.fasta --output /data/cohort.phrayaplan
docker run --rm -v $(pwd):/data ghcr.io/cfsan-biostatistics/phraya:latest \
align /data/cohort.phrayaplan query_id target_id
docker run --rm -v $(pwd):/data ghcr.io/cfsan-biostatistics/phraya:latest \
filter /data/cohort.phraya --min-coverage 10 --min-mapq 30 --format vcf > variants.vcf

| Tag | Description |
|---|---|
| latest | Most recent release |
| v1.2.3 | Exact version |
| v1.2 | Latest patch for minor version |
The Docker image is built with the SSE4.2 baseline (-C target-feature=+sse4.2) rather than -C target-cpu=native. This ensures the image runs on any modern x86-64 CPU but does not use AVX2 acceleration for k-mer sketching.
For HPC workloads where you control the hardware, building from source with -C target-cpu=native will enable AVX2 (x86-64) or NEON (ARM64) and improve sketching throughput:
RUSTFLAGS="-C target-cpu=native" cargo build --release
Default build from source:
cargo build --release
Requires Rust 1.75+. No external binary dependencies (BWA, minimap2, samtools).
For best k-mer sketching performance, enable native SIMD:
RUSTFLAGS="-C target-cpu=native" cargo build --release
Without -C target-cpu=native, simd-minimizers falls back to a scalar path and is slower. On ARM64 (Graviton, Apple Silicon), NEON is always enabled.
Phraya depends on simd-minimizers for k-mer sketching and seeding. This library implements SIMD-accelerated canonical minimizers using AVX2 (x86-64) or NEON (ARM64), and is described in:
Ragnar Groot Koerkamp, Igor Martayan. SimdMinimizers: Computing random minimizers, fast. SEA 2025. doi:10.4230/LIPIcs.SEA.2025.20
We use canonical minimizers with default parameters k=21, w=11 (appropriate for bacterial genomics) and ntHash rolling hashes. Sketches are computed once during phraya plan and reused during phraya align, eliminating redundant computation.
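To make the roles of k and w concrete, here is a naive scalar sketch of window minimizers. It is deliberately not the simd-minimizers algorithm: it orders k-mers lexicographically instead of by ntHash, canonicalizes each k-mer independently, and makes no attempt at speed. All function names are illustrative.

```rust
/// Reverse complement of an ASCII DNA string (A, C, G, T; anything else becomes N).
fn revcomp(s: &str) -> String {
    s.bytes()
        .rev()
        .map(|b| match b {
            b'A' => 'T',
            b'C' => 'G',
            b'G' => 'C',
            b'T' => 'A',
            _ => 'N',
        })
        .collect()
}

/// Canonical form: the lexicographically smaller of a k-mer and its reverse complement.
fn canonical(kmer: &str) -> String {
    let rc = revcomp(kmer);
    if rc.as_str() < kmer { rc } else { kmer.to_string() }
}

/// Number of bases spanned by w consecutive k-mers: l = w + k - 1.
fn window_len(k: usize, w: usize) -> usize {
    w + k - 1
}

/// Start positions of the minimizer (leftmost smallest canonical k-mer)
/// of every window of w consecutive k-mers, with consecutive repeats removed.
fn minimizer_positions(seq: &str, k: usize, w: usize) -> Vec<usize> {
    if k == 0 || w == 0 || seq.len() < window_len(k, w) {
        return Vec::new();
    }
    let kmers: Vec<String> = (0..=seq.len() - k)
        .map(|i| canonical(&seq[i..i + k]))
        .collect();
    let mut out: Vec<usize> = Vec::new();
    for start in 0..=kmers.len() - w {
        let mut best = start;
        for i in start + 1..start + w {
            if kmers[i] < kmers[best] {
                best = i;
            }
        }
        if out.last() != Some(&best) {
            out.push(best);
        }
    }
    out
}
```

With the defaults above, `window_len(21, 11)` is 31, so each window covers 31 bases and at least one k-mer is sampled from every such stretch.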
- Score ratio threshold: 0.95 (hard-coded). Stores alternatives within 95% of best identity. Opinionated choice for storage efficiency.
- K-mer parameters: k=21, w=11 (canonical minimizers, standard for bacterial genomes). l = w+k-1 = 31 satisfies the odd-l canonicality requirement of simd-minimizers.
- Coverage quantization: Nearest 5. Enables RLE compression, negligible precision loss for variant calling decisions.
- Sketch reuse: Plan-time sketches stored in .phrayaplan (v2) keyed by sequence ID; alignment reuses them rather than recomputing.
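The coverage quantization decision is easiest to see with a small example: rounding depths to the nearest multiple of 5 turns noisy per-base values into long constant runs, which run-length encoding then collapses. The sketch below assumes a simple (value, run length) pair encoding; the on-disk layout Phraya uses is not described here, and the function names are illustrative.

```rust
/// Round a depth to the nearest multiple of 5 (e.g. 12 -> 10, 13 -> 15).
fn quantize(depth: u32) -> u32 {
    depth.saturating_add(2) / 5 * 5
}

/// Run-length encode a slice as (value, run length) pairs.
fn rle_encode(values: &[u32]) -> Vec<(u32, u32)> {
    let mut runs: Vec<(u32, u32)> = Vec::new();
    for &v in values {
        if let Some(last) = runs.last_mut() {
            if last.0 == v {
                last.1 += 1; // extend the current run
                continue;
            }
        }
        runs.push((v, 1)); // start a new run
    }
    runs
}

/// Expand (value, run length) pairs back into one value per position.
fn rle_decode(runs: &[(u32, u32)]) -> Vec<u32> {
    runs.iter()
        .flat_map(|&(v, n)| std::iter::repeat(v).take(n as usize))
        .collect()
}

/// Quantize per-base depths, then run-length encode the result.
fn compress_coverage(depths: &[u32]) -> Vec<(u32, u32)> {
    let quantized: Vec<u32> = depths.iter().map(|&d| quantize(d)).collect();
    rle_encode(&quantized)
}
```

For the depths `[11, 12, 9, 10, 13, 14, 0, 1, 2]`, nine values compress to three runs: four positions at 10, two at 15, and three at 0. The maximum error introduced per position is 2 reads.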
- Deacon (https://github.com/bede/deacon): General-purpose aligner with flexible post-processing.
Phraya differentiates on:
- Richer intermediate format (more cacheable/reusable)
- More deferred parameters (multi-mapping, coverage computed during alignment, filtered post-hoc)
- Case 3 (contigs + reads without reference via centroid selection)
In scope:
- Cases 2 (reads + ref), 3 (contigs + reads), 4 (contigs only)
- K-mer evidence (uniqueness only)
- Threshold-based filtering
- VCF/TSV/phraya output formats
- Library API (phraya-filter)
Phase 2+:
- Case 1 (read MSA without reference)
- Expression-based filters (--expr)
- Named presets (--preset conservative)
- Variation hotspot estimation
- Python/R bindings
- GPU acceleration
Phraya uses a tag-triggered release workflow. Push a v* tag and all automated channels update within ~15 minutes.
git tag v0.2.0
git push origin v0.2.0

| Channel | What happens | Secret required |
|---|---|---|
| GitHub Releases | 5 prebuilt binaries uploaded (portable + native Linux x86_64, Linux ARM64, macOS Intel, macOS Apple Silicon) | GITHUB_TOKEN (automatic) |
| Docker | Multi-arch image pushed to ghcr.io/cfsan-biostatistics/phraya with :latest + versioned tags | GITHUB_TOKEN (automatic) |
| crates.io | All 5 crates published in dependency order | CARGO_REGISTRY_TOKEN |
Pre-releases (tags containing -rc, -alpha, -beta) skip crates.io publish and do not update the :latest Docker tag.
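The pre-release rule amounts to a substring check on the tag. The sketch below restates it in Rust for clarity only; the real logic lives in the release workflow, and the `ReleaseActions` type and function names are hypothetical.

```rust
/// Which optional release steps run for a given tag (illustrative type).
#[derive(Debug, PartialEq)]
struct ReleaseActions {
    publish_crates_io: bool,
    update_docker_latest: bool,
}

/// A tag counts as a pre-release when it contains -rc, -alpha, or -beta.
fn is_prerelease(tag: &str) -> bool {
    ["-rc", "-alpha", "-beta"].iter().any(|m| tag.contains(*m))
}

/// Pre-releases skip the crates.io publish and leave the :latest Docker tag untouched.
fn actions_for_tag(tag: &str) -> ReleaseActions {
    let stable = !is_prerelease(tag);
    ReleaseActions {
        publish_crates_io: stable,
        update_docker_latest: stable,
    }
}
```

For example, `v0.2.0` triggers both steps, while `v0.3.0-rc1` triggers neither.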
Bioconda (bioconda-recipes repo):
- Fork bioconda/bioconda-recipes
- Update recipes/phraya/meta.yaml: bump version, update sha256 from the GitHub Release SHA256SUMS.txt
- Open PR to bioconda/bioconda-recipes
Homebrew (if using a tap rather than homebrew-core):
- Update Formula/phraya.rb in the tap repo: bump version and sha256
- Test locally: brew install --build-from-source Formula/phraya.rb
- Commit and push; Homebrew users get the update on next brew update
- CARGO_REGISTRY_TOKEN: crates.io API token for publishing. Set in repo Settings → Secrets → Actions.
- GITHUB_TOKEN: Automatically provided by GitHub Actions. No setup needed.
After the workflow completes, verify all channels:
# GitHub Releases: check all 5 binaries exist
gh release view v0.2.0 --json assets --jq '[.assets[].name]'
# Docker
docker pull ghcr.io/cfsan-biostatistics/phraya:v0.2.0
docker run --rm ghcr.io/cfsan-biostatistics/phraya:v0.2.0 --version
# crates.io: package page should show new version
# https://crates.io/crates/phraya

| Platform | Recommended binary | Notes |
|---|---|---|
| Linux x86_64, HPC cluster | phraya-linux-x86_64-native | AVX2, ~2× faster k-mer sketching |
| Linux x86_64, older hardware | phraya-linux-x86_64-portable | SSE4.2 baseline, runs everywhere |
| Linux ARM64 (Graviton, Raspberry Pi) | phraya-linux-aarch64 | NEON, always enabled |
| macOS Intel | phraya-macos-x86_64 | AVX2 |
| macOS Apple Silicon | phraya-macos-aarch64 | NEON |
| Container / unknown CPU | Docker image | Portable SSE4.2 build |
Unlicense. As a work product of the US Government (17 USC 105), Phraya is in the public domain.
See issue #58 for Phase 1 PRD. Implementation contributions welcome.