Methylation quality control (MethQC)
MethQC is a tool for quality control of DNA methylation detection in long-read sequencing data.
Contact: Jun Mencius
Email: zjmeng22@m.fudan.edu.cn
MethQC is a robust, benchmark-free quality control framework designed for DNA methylation detection in long-read sequencing data. It introduces a unified metric (MQI) for both sample- and region-level quality assessment。
MethQC takes sorted and indexed BAM files containing modification tags (MM and ML) as input and generates:
- An interactive HTML QC report
- An optional BED file with detailed QC metrics
- An optional JSON file with filename and MQI
Currently, MethQC does not support online installation, but standard pip installation will be supported in a future release.
- Create and activate a new virtual environment:
conda create -n methqc python=3.8
conda activate methqc
- Download the latest release from the Release page and extract the archive. Navigate to the extracted project root directory (which contains the
pyproject.tomlfile) and install usingpip:
# cd /path/to/extracted/MethQC
pip install .
- Test the installation Verify that the installation was successful by running the following commands.
methqc --version;
methqc_batch --version;
The whole installation will take a few minutes depending on the networking speed.
Standard parameters for methqc:
Usage: methqc [OPTIONS]
Options:
-i, --inbam FILE Input BAM file for evaluation [required]
-o, --output FILE Output HTML file
-j, --json FILE Output JSON file
-t, --threads INTEGER Maximum number of parallel threads
-r, --ref FILE Reference FASTA file [required]
--region TEXT Specific region to QC, example: --region
chr1:10000-20000
-p, --pattern TEXT Sequence pattern for modification, IUPAC
standard degenerate bases are also supported
-d, --index INTEGER Modification site index in sequence pattern
-c, --code TEXT Modification code in BAM file, e.g. C+m, C-m,
m for 5mCG
-f, --threshold FLOAT Bases with probability < THRESHOLD are
filtered [default: 0.75]
--modkit TEXT Path to modkit for auto methylation
probability threshold
--chunk-size INTEGER Chunk size for parallel BAM reading [default:
50_000_000]
--max-depth INTEGER Maximum average sequencing depth
--low-coverage-cutoff INTEGER Coverage < low-coverage-cutoff is defined as
low coverage sites [default: 5]
--low-mapq-cutoff INTEGER mapping quality < low-mapq-cutoff is defined
as low mapq sites [default: 10]
--high-sb-cutoff FLOAT strand bias > high-sb-cutoff is defined as
high strand bias sites [default: 0.2]
--detail-bed FILE Output detailed per site information to BED
file
--nums-only Do not generate HTML report, just output
quality control metrics to StdOut
--quiet Do not print working progress to standard
output
--version Show the version and exit.
--help Show this message and exit.
- Standard usage (only output HTML report)
methqc -t 32 -i HG002_sorted.bam -r grch38.fa -o HG002.html;
- Output
MQIvalue to standard JSON file
methqc -t 32 -i HG002_sorted.bam -r grch38.fa -j HG002.json;
- Quality control of specific region
methqc -t 32 -i HG002_sorted.bam -r grch38.fa --region chr1:5000000-7000000 -o HG002_chr1_5000000_7000000.html;
- Use
modkitestimated probability threshold, details in here
methqc -t 32 -i HG002_sorted.bam --modkit ~/modkit/bin/modkit -r grch38.fa -o HG002.html;
- Run QC on specific regions defined in a BED file
methqc_batch -t 32 -i HG002_sorted.bam -r grch38.fa -b regions.bed -o result.bed;
- Output per-site information to BED file in additional to an HTML report
methqc -t 32 -i HG002.bam -r grch38.fa -o HG002.html --detail-bed HG002.bed ;
- Only print QC metrics to standard output (no HTML report generated)
methqc -t 32 -i HG002_sorted.bam -r grch38.fa --nums-only;
If you want to restrict your QC to specific regions defined in a BED file, we provide methqc_batch:
Usage: methqc_batch [OPTIONS]
Options:
-i, --inbam FILE Input BAM file for evaluation [required]
-o, --output FILE Output BED file [required]
-t, --threads INTEGER Maximum number of parallel threads
-r, --ref FILE Reference FASTA file [required]
-b, --bed FILE Path to region bed file [required]
-p, --pattern TEXT Sequence pattern for modification, IUPAC
standard degenerate bases are also supported
-d, --index INTEGER Modification site index in sequence pattern
-c, --code TEXT Modification code in BAM file, e.g. C+m, C-m,
m for 5mCG
-f, --threshold FLOAT Bases with probability < THRESHOLD are
filtered [default: 0.75]
--modkit TEXT Path to modkit for auto methylation
probability threshold
--chunk-size INTEGER Chunk size for parallel BAM reading [default:
50_000_000]
--max-depth INTEGER Maximum average sequencing depth
--low-coverage-cutoff INTEGER Coverage < low-coverage-cutoff is defined as
low coverage sites [default: 5]
--low-mapq-cutoff INTEGER mapping quality < low-mapq-cutoff is defined
as low mapq sites [default: 10]
--high-sb-cutoff FLOAT strand bias > high-sb-cutoff is defined as
high strand bias sites [default: 0.2]
--quiet Do not print working progress to standard
output
--version Show the version and exit.
--help Show this message and exit.
- QC for lists of regions defined in BED file
methqc_batch -t 32 -i HG002_sorted.bam -r grch38.fa -b genes.bed -o results.bed;
The detailed BED file includes a header, the description of each column is listed below
| Column index | Name | Description | Type |
|---|---|---|---|
| 1 | chrom | Chromosome | str |
| 2 | start | 0-based start modification position | int |
| 3 | end | 0-based exclusive end position (start position + 1) | int |
| 4 | modified base code | single letter code for modified base | str |
| 5 | valid coverage | coverage after -f or --probability-threshold filter |
int |
| 6 | modified fraction | modified fraction = modified count / valid coverage | float |
| 7 | modified count | modified count | int |
| 8 | unmodified count | unmodified count | int |
| 9 | fail count | count of sites fail in -f or --probability-threshold filter |
int |
| 10 | average mapq | average mapping quality | float |
| 11 | strand bias | absolute difference of modification in positive and negative strand | float |
| 12 | quality status | PASS or FAIL |
str |
The detailed BED file includes a header, the description of each column is listed below
| Column index | Name | Description | Type |
|---|---|---|---|
| 1 | chrom | Chromosome | str |
| 2 | start | 0-based start modification position | int |
| 3 | end | 0-based exclusive end position (start position + 1) | int |
| 4 | MQI | Methylation Quality Index | float |
| 5 | Low Mapping Quality Count | Low mapping quality site count | int |
| 6 | High Strand Bias Count | High strand bias site count | int |
| 7 | QC Pass Count | QC pass site count | int |
An showcase example is provided in here. An explanation markdown file is also provided.
- MethQC also supports BAM files with old
MmandMltags.
- Some old PacBio data sequenced by
Sequel Imay cause a segmentation fault in BAM reading, details are in here. - MethQC is primarily tested on human 5m-CpG
MethQC is dependent on the following libraries; we are grateful to all the developers/maintainers:
- click: Python command line interface
- pyfastx: Reference FASTA processing
- pysam: BAM file processing
- bokeh: Interactive HTML
If you encounter any problems, bugs, or unexpected results while using MethQC, please open an issue in this repository.
We welcome all forms of contributions — whether it’s reporting bugs, suggesting new features, improving documentation, or submitting pull requests.