feat: add ClustalW .aln format support for sequence alignments #880
haoyu-haoyu wants to merge 3 commits into biotite-dev:main
Thanks for adding the parser 👍. I'll have a look 👀
padix-key
left a comment
I finally found the time for review, thanks for your patience. This looks really good 👍, I just have a few minor comments.
I think you need to run ruff on your branch and rebase it on main, then the CI should pass as well.
"""

__name__ = "biotite.sequence.io.clustal"
__author__ = "Biotite contributors"
You can put your own name here 🙂
continue
# Skip consensus lines (lines that start with whitespace
# and contain only consensus characters)
if line[0] == " " and all(c in _CONSENSUS_CHARS for c in line):
- Completely empty lines would also pass this condition, right?
- Do we need the extra `line[0] == " "` check?
- I assume a regex would be faster than iterating over each character
# (no leading whitespace)
if line[0] != " ":
    parts = line.split()
    if len(parts) >= 2:
In which valid case would we have a non-empty/non-consensus line with fewer than two parts? Wouldn't the else case warrant an exception?
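A minimal sketch of the stricter behaviour raised in this comment. The helper name `split_sequence_line()` is hypothetical and not taken from the PR, and `ValueError` stands in for Biotite's `InvalidFileError` so the snippet runs without Biotite installed:

```python
def split_sequence_line(line):
    """Split a Clustal sequence line into (name, gapped_sequence).

    Hypothetical helper for illustration, not the PR's actual code.
    """
    parts = line.split()
    if len(parts) < 2:
        # ValueError is a stand-in for biotite's InvalidFileError
        raise ValueError(f"Malformed sequence line: {line!r}")
    # An optional third column (cumulative residue count) is ignored
    return parts[0], parts[1]
```

With this shape, a line that carries only a name fails loudly instead of being dropped silently.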
buffer.seek(0)
file3 = clustal.ClustalFile.read(buffer)
alignment3 = clustal.get_alignment(file3)
assert str(alignment) == str(alignment3)
You should be able to compare the Alignment objects directly with each other.
Suggested change:
- assert str(alignment) == str(alignment3)
+ assert alignment == alignment3
Thanks @padix-key! Plan for the follow-up:
Will push and ping when ready.
- Validate sequence count >= 2 in get_alignment() with clear error
- Validate line_length > 0 in set_alignment()
- Clear existing entries before writing in set_alignment() to prevent stale data when reusing a ClustalFile object
- Credit __author__ in the three new clustal modules (__init__.py, convert.py, file.py).
- Replace the two-part consensus-line check in ClustalFile._parse() with a single precompiled regex (_CONSENSUS_LINE = re.compile(r"[ *:.]+")). The old `line[0] == " " and all(c in _CONSENSUS_CHARS for c in line)` was redundant (the all-check already excludes real sequence lines, since those contain alphanumeric characters) and technically also passed empty lines (harmless only because of the preceding blank-line skip). Regex is simpler and doesn't rely on that ordering.
- Raise InvalidFileError on malformed sequence lines (< 2 whitespace-separated parts) instead of silently dropping them. Same exception class already used by the header check.
- Tests: round-trip checks now compare Alignment objects directly via `alignment == alignment2/3` instead of `str(alignment) == str(...)`, per reviewer suggestion.

Rebased onto current main; ruff format + ruff check --fix applied.
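To illustrate the consensus-line change, here is a small sketch comparing the two checks. The pattern is the one quoted in the commit message; the contents of `_CONSENSUS_CHARS`, the function names, and the use of `fullmatch()` are assumptions, since the PR's code is not reproduced here:

```python
import re

# Assumed contents of the old constant (not shown in this conversation)
_CONSENSUS_CHARS = frozenset(" *:.")
# Pattern quoted in the commit message
_CONSENSUS_LINE = re.compile(r"[ *:.]+")


def is_consensus_old(line):
    # Two-part check from the first revision; indexing line[0]
    # requires that blank lines were skipped beforehand
    return line[0] == " " and all(c in _CONSENSUS_CHARS for c in line)


def is_consensus_new(line):
    # fullmatch() demands that every character is a consensus
    # character and, because of "+", that there is at least one
    return _CONSENSUS_LINE.fullmatch(line) is not None
```

Note that `all(...)` over an empty string is vacuously true, which is why the old form depended on blank lines being filtered out earlier, while the `+` quantifier rejects the empty string on its own.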
83b69f9 to 7e6fb6e
Pushed
All 9 tests pass locally. Ready for another look, @padix-key.
Summary
Add support for reading and writing ClustalW `.aln` alignment files, following the existing FASTA alignment I/O pattern.
New files
- `src/biotite/sequence/io/clustal/__init__.py`
- `src/biotite/sequence/io/clustal/file.py`: `ClustalFile` class (TextFile + MutableMapping)
- `src/biotite/sequence/io/clustal/convert.py`: `get_alignment()` / `set_alignment()` conversion functions
- `tests/sequence/data/clustal.aln`
- `tests/sequence/data/clustal_multi.aln`
- `tests/sequence/test_clustal.py`
Implementation details
ClustalFile
- `TextFile` and `MutableMapping` (same pattern as `FastaFile`)
- `clustal_file["seq1"]` returns gapped sequence string
Conversion functions
- `get_alignment(clustal_file, seq_type=None)`: auto-detects nucleotide/protein
- `set_alignment(clustal_file, alignment, seq_names, line_length=60)`: formats into blocks
Edge cases handled
- … → `InvalidFileError`
- … → `InvalidFileError`
- Fewer than 2 sequences → `ValueError` (alignments need ≥ 2 sequences)
- `line_length=0` → `ValueError`
- Reusing a `ClustalFile` object → entries cleared before writing
- Consensus lines (`*`, `:`, `.`) correctly skipped during parsing
Tests
9 tests covering: file reading, multi-block parsing, dict interface, alignment conversion, round-trip consistency (object-level and file I/O-level), name count mismatch error, and explicit sequence type parameter.
Closes #774
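For readers unfamiliar with the block layout, the reading and writing behaviour summarised above can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `parse_clustal()` and `format_clustal()` are hypothetical names, `ValueError` stands in for `InvalidFileError`, and the header text and name-column padding are arbitrary choices.

```python
def parse_clustal(lines):
    """Collect the gapped sequence for each name across all blocks."""
    lines = list(lines)
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("File does not start with a CLUSTAL header")
    entries = {}
    for line in lines[1:]:
        # Blank lines separate blocks; lines that start with a space
        # are treated as consensus lines and skipped
        if not line.strip() or line[0] == " ":
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"Malformed sequence line: {line!r}")
        name, chunk = parts[0], parts[1]
        # Later blocks extend the sequence started in earlier blocks
        entries[name] = entries.get(name, "") + chunk
    return entries


def format_clustal(entries, line_length=60):
    """Write name -> gapped sequence pairs as Clustal-style blocks."""
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    # Illustrative padding: longest name plus four spaces
    width = max(len(name) for name in entries) + 4
    total = max(len(seq) for seq in entries.values())
    # Illustrative header line
    out = ["CLUSTAL W multiple sequence alignment", ""]
    for start in range(0, total, line_length):
        for name, seq in entries.items():
            out.append(name.ljust(width) + seq[start:start + line_length])
        out.append("")
    return out
```

A round trip such as `parse_clustal(format_clustal(entries, line_length=4))` should return the original mapping, which mirrors the object-level round-trip check described under Tests.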