HenryCWong/DNAS-Bench

DNAS-Bench: Deterministic Nucleic Acid Screener Benchmarking

Python 3.8+ | License: MIT | Paper DOI

DNAS-Bench is a deterministic benchmarking framework for evaluating the robustness of Biosecurity Screening Software (BSS) against adversarial nucleic acid sequence manipulations.


Overview

DNA synthesis providers use commercial and open-source Biosecurity Screening Software (BSS) to detect potentially dangerous synthesis orders. DNAS-Bench stress-tests these screeners by applying a systematic suite of deterministic manipulations to known sequences of concern; it requires neither AI nor access to BSS code.

Key findings from our paper:

  • SeqScreen detects 41.99% of manipulated sequences on average; Commec detects 10.19%
  • Splitting sequences into fragments shorter than 50 bp bypasses most existing safeguards
  • Simple methods (e.g., appending a repeated nucleotide at 1.5× length) perform nearly as well as complex ones — differing by only 0.75 percentage points on average

Manipulation Methods

| Method | Description | Difficulty |
|---|---|---|
| Split | Fragment sequence into non-overlapping subsequences of fixed length | Easy |
| Encapsulate | Add 4 bp random flanking sequences to both ends of each fragment | Easy |
| Cover (GFP) | Append a GFP gene subsequence at 1.5× fragment length to tail | Medium |
| Cover-A | Append a poly-adenine tail at 1.5× fragment length | Easy |
| Cover-Random | Append a random genomic sequence at 1.5× fragment length | Easy |
| Flip | Introduce random point mutations at a specified fraction of positions | Medium |

Quick Start

Prerequisites

No external dependencies — all scripts use the Python standard library only (Python 3.6+).

Step 1 — Split and Encapsulate

split-sequence.py fragments your sequence at a fixed length and produces two FASTA files in one run:

  • -o: the encapsulate dataset, fragments with Golden Gate tails (enzyme recognition site + 4 bp overhangs)
  • --split-output: the splits dataset, payload-only fragments (no tails), used as input for covers and flips

python split-sequence.py \
    -i sequence.fasta \
    -l 100 \
    -o output/encapsulate/fragments_L100.fasta \
    --split-output output/splits/fragments_L100.fasta

Repeat for each desired length (e.g. 50, 100, 150, 200, 250, 300 bp).
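
As a sanity check on the record count to expect in each output file, the arithmetic can be sketched as below. This is a minimal sketch, not code from the repository: the helper name fragment_count is hypothetical, and it assumes that a shorter final remainder is kept as its own fragment, which the script may or may not do. The 753 bp input matches the shortest sequence described in the Dataset section.

```python
def fragment_count(seq_len: int, frag_len: int) -> int:
    """Count non-overlapping fragments of length frag_len.

    Assumption (not confirmed by the repository): a final remainder
    shorter than frag_len is emitted as its own fragment.
    """
    if seq_len < 0 or frag_len <= 0:
        raise ValueError("seq_len must be >= 0 and frag_len must be > 0")
    # Integer ceiling division avoids floating-point rounding on long inputs.
    return -(-seq_len // frag_len)


lengths = [50, 100, 150, 200, 250, 300]
counts = {frag_len: fragment_count(753, frag_len) for frag_len in lengths}
for frag_len, n in counts.items():
    print(f"L={frag_len}: {n} fragments")
```

If the script instead drops or merges the trailing remainder, the counts for lengths that do not divide the sequence evenly will be one lower.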

Step 2 — Covers (append benign sequence)

split_and_append.py takes a payload FASTA from Step 1 and appends a covering sequence at 1.5× the fragment length.

# Cover with a donor FASTA (e.g. GFP)
python split_and_append.py \
    -a output/splits/fragments_L100.fasta \
    --second gfp.fasta --wrap-second \
    --frag-len 100 -o output/covers_gfp/ --single-file

# Cover with poly-A tail
python split_and_append.py \
    -a output/splits/fragments_L100.fasta \
    --poly-a \
    --frag-len 100 -o output/covers_poly_a/ --single-file

# Cover with random sequence
python split_and_append.py \
    -a output/splits/fragments_L100.fasta \
    --random \
    --frag-len 100 -o output/covers_random/ --single-file

Step 3 — Flipped Splits (point mutations)

flip_splits.py takes a directory of FASTA files (e.g. your splits output folder) and mutates a specified percentage of bases in each fragment.

python flip_splits.py \
    output/splits/ \
    output/flipped_splits/ \
    5.0 \
    --seed 42

Repository Structure

DNAS-Bench/
├── split-sequence.py        # Produces encapsulate (tailed) and splits (payload-only) datasets
├── split_and_append.py      # Covers: append donor FASTA, poly-A, or random sequence at 1.5x length
├── flip_splits.py           # Flipped splits: introduce random point mutations across a fragment directory
├── gfp.fasta                # GFP donor sequence used with split_and_append.py --second
└── README.md

Output Format

Running split-sequence.py at length 100 produces the two files under encapsulate/ and splits/; the cover and flip steps populate the remaining directories:

output/
├── encapsulate/
│   └── fragments_L100.fasta            # Golden Gate tails included → feed directly to BSS
├── splits/
│   └── fragments_L100.fasta            # payload only → input for covers and flips
├── covers_gfp/
│   └── <prefix>_combined_fragments_L100.fasta
├── covers_poly_a/
├── covers_random/
└── flipped_splits/

Fragment headers in the encapsulate output follow --name-format:

>{orig}|frag{index}|{len}bp|{left}-{right}

Fragment headers in the splits output follow --split-name-format:

>{orig}|frag{index}|{len}bp|split
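
When post-processing screener output, it can help to recover the fragment metadata from these headers. The following is a minimal sketch that assumes the default formats shown above are in use. The function name parse_header and the returned field names are hypothetical, and {left} and {right} are treated as opaque strings because their meaning is not specified here.

```python
import re
from typing import Dict, Optional

# Patterns mirror the two default header formats documented above.
SPLIT_RE = re.compile(
    r"^>?(?P<orig>.+)\|frag(?P<index>\d+)\|(?P<len>\d+)bp\|split$"
)
ENCAP_RE = re.compile(
    r"^>?(?P<orig>.+)\|frag(?P<index>\d+)\|(?P<len>\d+)bp\|"
    r"(?P<left>[^|]*?)-(?P<right>[^|]*)$"
)


def parse_header(line: str) -> Optional[Dict[str, object]]:
    """Parse a fragment header; return None if it matches neither format."""
    line = line.strip()
    # Try the splits format first so the literal "split" suffix is not
    # mistaken for a left-right pair.
    for kind, pattern in (("split", SPLIT_RE), ("encapsulate", ENCAP_RE)):
        m = pattern.match(line)
        if m:
            fields = m.groupdict()
            return {
                "kind": kind,
                "orig": fields["orig"],
                "index": int(fields["index"]),
                "length": int(fields["len"]),
                "left": fields.get("left"),
                "right": fields.get("right"),
            }
    return None


print(parse_header(">seq1|frag3|100bp|split"))
```

Headers produced with a custom --name-format or --split-name-format would need their own patterns.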

Threat Model

DNAS-Bench models an adversary who:

  1. Obtains a regulated malicious sequence (e.g., from a public database)
  2. Applies deterministic manipulations — no BSS code access or molecular biology expertise required
  3. Submits fragments across multiple thin clients / accounts to evade order-pattern detection
  4. Reassembles fragments in a lab after delivery

We focus empirically on evasion of automated screening (steps 1–3), and explicitly separate this from the downstream biological reconstruction challenge (step 4).


Dataset

We evaluated 11 sequences derived from the HHS and USDA Select Agents and Toxins List, ranging from 753 bp (single toxin gene) to 5.2 Mb (bacterial chromosomal locus). Agent identities are de-identified in public releases.

To request the benchmark dataset, please contact the authors. We share data with BSS developers and researchers working to improve screening pipelines.


Results Summary

| BSS | Mean Detection | Detection at L=300 |
|---|---|---|
| SeqScreen | 41.99% | 49.57% |
| Commec | 10.19% | 28.72% |
| Kraken (baseline) | 40.08% | 49.77% |

Detection collapses for all tools when fragment length falls below 50 bp.
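
For BSS developers who want to tabulate their own screener's results in the same per-length form, a minimal sketch follows. It assumes results are available as (fragment length, flagged) pairs and that detection is counted per fragment; both are assumptions for illustration, since the exact aggregation used in the paper is not described here. The function name detection_rate_by_length is hypothetical, and the sample data is a toy input, not a result from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def detection_rate_by_length(
    results: Iterable[Tuple[int, bool]]
) -> Dict[int, float]:
    """Return the percentage of flagged fragments for each fragment length."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for frag_len, flagged in results:
        totals[frag_len] += 1
        if flagged:
            hits[frag_len] += 1
    # Lengths are sorted so the output reads from shortest to longest.
    return {
        frag_len: 100.0 * hits[frag_len] / totals[frag_len]
        for frag_len in sorted(totals)
    }


# Toy input for illustration only.
toy_results = [(50, False), (50, False), (300, True), (300, False)]
rates = detection_rate_by_length(toy_results)
for frag_len, pct in rates.items():
    print(f"L={frag_len}: {pct:.2f}% detected")
```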


Citation

@inproceedings{wong2025dnasbench,
  title     = {DNAS-Bench: Deterministic Nucleic Acid Screener Benchmarking},
  author    = {Wong, Henry C. and Kohno, Tadayoshi and Nivala, Jeff},
  booktitle = {Workshop on Cybersecurity for Biology (CyberBio)},
  year      = {2026}
}

Ethical Considerations

This framework is intended for BSS developers, biosecurity researchers, and DNA synthesis companies to evaluate and improve screening pipelines.

  • Manipulated benchmark data is available on request only — not open-sourced
  • Agent and toxin identities in public materials are de-identified
  • We have notified and shared results with the developers of SeqScreen and Commec
  • We do not believe executing a real attack from public materials alone is tractable

See ETHICS.md for full discussion.


Contact

Henry C. Wong — University of Washington
For dataset access or collaboration inquiries, please open an issue or reach out directly.
