Skip to content

JMencius/MethQC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

108 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MethQC

License: MIT

Methylation quality control (MethQC)

MethQC is a tool for quality control of DNA methylation detection in long-read sequencing data.

Contact: Jun Mencius

Email: zjmeng22@m.fudan.edu.cn

Introduction

MethQC is a robust, benchmark-free quality control framework designed for DNA methylation detection in long-read sequencing data. It introduces a unified metric (MQI) for both sample- and region-level quality assessment。

MethQC takes sorted and indexed BAM files containing modification tags (MM and ML) as input and generates:

  • An interactive HTML QC report
  • An optional BED file with detailed QC metrics
  • An optional JSON file with filename and MQI

Installation

Currently, MethQC does not support online installation, but standard pip installation will be supported in a future release.

  1. Create and activate a new virtual environment:
conda create -n methqc python=3.8
conda activate methqc
  1. Download the latest release from the Release page and extract the archive. Navigate to the extracted project root directory (which contains the pyproject.toml file) and install using pip:
# cd /path/to/extracted/MethQC
pip install .
  1. Test the installation Verify that the installation was successful by running the following commands.
methqc --version;
methqc_batch --version;

The whole installation will take a few minutes depending on the networking speed.

MethQC parameters

Standard parameters for methqc:

Usage: methqc [OPTIONS]

Options:
  -i, --inbam FILE               Input BAM file for evaluation  [required]
  -o, --output FILE              Output HTML file
  -j, --json FILE                Output JSON file
  -t, --threads INTEGER          Maximum number of parallel threads
  -r, --ref FILE                 Reference FASTA file  [required]
  --region TEXT                  Specific region to QC, example: --region
                                 chr1:10000-20000
  -p, --pattern TEXT             Sequence pattern for modification, IUPAC
                                 standard degenerate bases are also supported
  -d, --index INTEGER            Modification site index in sequence pattern
  -c, --code TEXT                Modification code in BAM file, e.g. C+m, C-m,
                                 m for 5mCG
  -f, --threshold FLOAT          Bases with probability < THRESHOLD are
                                 filtered [default: 0.75]
  --modkit TEXT                  Path to modkit for auto methylation
                                 probability threshold
  --chunk-size INTEGER           Chunk size for parallel BAM reading [default:
                                 50_000_000]
  --max-depth INTEGER            Maximum average sequencing depth
  --low-coverage-cutoff INTEGER  Coverage < low-coverage-cutoff is defined as
                                 low coverage sites [default: 5]
  --low-mapq-cutoff INTEGER      mapping quality < low-mapq-cutoff is defined
                                 as low mapq sites [default: 10]
  --high-sb-cutoff FLOAT         strand bias > high-sb-cutoff is defined as
                                 high strand bias sites [default: 0.2]
  --detail-bed FILE              Output detailed per site information to BED
                                 file
  --nums-only                    Do not generate HTML report, just output
                                 quality control metrics to StdOut
  --quiet                        Do not print working progress to standard
                                 output
  --version                      Show the version and exit.
  --help                         Show this message and exit.

Quick demo

  1. Standard usage (only output HTML report)
methqc -t 32 -i HG002_sorted.bam -r grch38.fa -o HG002.html;
  1. Output MQI value to standard JSON file
methqc -t 32 -i HG002_sorted.bam -r grch38.fa -j HG002.json;
  1. Quality control of specific region
methqc -t 32 -i HG002_sorted.bam -r grch38.fa --region chr1:5000000-7000000 -o HG002_chr1_5000000_7000000.html;
  1. Use modkit estimated probability threshold, details in here
methqc -t 32 -i HG002_sorted.bam --modkit ~/modkit/bin/modkit -r grch38.fa -o HG002.html;
  1. Run QC on specific regions defined in a BED file
methqc_batch -t 32 -i HG002_sorted.bam -r grch38.fa -b regions.bed -o result.bed;
  1. Output per-site information to BED file in additional to an HTML report
methqc -t 32 -i HG002.bam -r grch38.fa -o HG002.html --detail-bed HG002.bed ;
  1. Only print QC metrics to standard output (no HTML report generated)
methqc -t 32 -i HG002_sorted.bam -r grch38.fa --nums-only;

Methqc_batch parameters

If you want to restrict your QC to specific regions defined in a BED file, we provide methqc_batch:

Usage: methqc_batch [OPTIONS]

Options:
  -i, --inbam FILE               Input BAM file for evaluation  [required]
  -o, --output FILE              Output BED file  [required]
  -t, --threads INTEGER          Maximum number of parallel threads
  -r, --ref FILE                 Reference FASTA file  [required]
  -b, --bed FILE                 Path to region bed file  [required]
  -p, --pattern TEXT             Sequence pattern for modification, IUPAC
                                 standard degenerate bases are also supported
  -d, --index INTEGER            Modification site index in sequence pattern
  -c, --code TEXT                Modification code in BAM file, e.g. C+m, C-m,
                                 m for 5mCG
  -f, --threshold FLOAT          Bases with probability < THRESHOLD are
                                 filtered [default: 0.75]
  --modkit TEXT                  Path to modkit for auto methylation
                                 probability threshold
  --chunk-size INTEGER           Chunk size for parallel BAM reading [default:
                                 50_000_000]
  --max-depth INTEGER            Maximum average sequencing depth
  --low-coverage-cutoff INTEGER  Coverage < low-coverage-cutoff is defined as
                                 low coverage sites [default: 5]
  --low-mapq-cutoff INTEGER      mapping quality < low-mapq-cutoff is defined
                                 as low mapq sites [default: 10]
  --high-sb-cutoff FLOAT         strand bias > high-sb-cutoff is defined as
                                 high strand bias sites [default: 0.2]
  --quiet                        Do not print working progress to standard
                                 output
  --version                      Show the version and exit.
  --help                         Show this message and exit.

Methqc_batch quick demo

  1. QC for lists of regions defined in BED file
methqc_batch -t 32 -i HG002_sorted.bam -r grch38.fa -b genes.bed -o results.bed;

Output BED description

Detailed BED file column description

The detailed BED file includes a header, the description of each column is listed below

Column index Name Description Type
1 chrom Chromosome str
2 start 0-based start modification position int
3 end 0-based exclusive end position (start position + 1) int
4 modified base code single letter code for modified base str
5 valid coverage coverage after -f or --probability-threshold filter int
6 modified fraction modified fraction = modified count / valid coverage float
7 modified count modified count int
8 unmodified count unmodified count int
9 fail count count of sites fail in -f or --probability-threshold filter int
10 average mapq average mapping quality float
11 strand bias absolute difference of modification in positive and negative strand float
12 quality status PASS or FAIL str

BED file output by methqc_batch

The detailed BED file includes a header, the description of each column is listed below

Column index Name Description Type
1 chrom Chromosome str
2 start 0-based start modification position int
3 end 0-based exclusive end position (start position + 1) int
4 MQI Methylation Quality Index float
5 Low Mapping Quality Count Low mapping quality site count int
6 High Strand Bias Count High strand bias site count int
7 QC Pass Count QC pass site count int

MethQC HTML report

An showcase example is provided in here. An explanation markdown file is also provided.

Legacy support

  • MethQC also supports BAM files with old Mm and Ml tags.

Known limitation

  • Some old PacBio data sequenced by Sequel I may cause a segmentation fault in BAM reading, details are in here.
  • MethQC is primarily tested on human 5m-CpG

Acknowledgements

MethQC is dependent on the following libraries; we are grateful to all the developers/maintainers:

  • click: Python command line interface
  • pyfastx: Reference FASTA processing
  • pysam: BAM file processing
  • bokeh: Interactive HTML

Issues & Contributions

If you encounter any problems, bugs, or unexpected results while using MethQC, please open an issue in this repository.

We welcome all forms of contributions — whether it’s reporting bugs, suggesting new features, improving documentation, or submitting pull requests.

About

Quality control for DNA methylation detection in long-read sequencing

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors