feat: add ClustalW .aln format support for sequence alignments by haoyu-haoyu · Pull Request #880 · biotite-dev/biotite

haoyu-haoyu · 2026-04-05T15:31:25Z

Summary

Add support for reading and writing ClustalW .aln alignment files, following the existing FASTA alignment I/O pattern.

New files

| File | Description |
| --- | --- |
| `src/biotite/sequence/io/clustal/__init__.py` | Package exports |
| `src/biotite/sequence/io/clustal/file.py` | `ClustalFile` class (TextFile + MutableMapping) |
| `src/biotite/sequence/io/clustal/convert.py` | `get_alignment()` / `set_alignment()` conversion functions |
| `tests/sequence/data/clustal.aln` | Single-block test data |
| `tests/sequence/data/clustal_multi.aln` | Multi-block test data |
| `tests/sequence/test_clustal.py` | 9 test cases |

Implementation details

ClustalFile

- Extends TextFile and MutableMapping (same pattern as FastaFile)
- Parses CLUSTAL header, sequence blocks, and consensus lines
- Dict-like interface: clustal_file["seq1"] returns gapped sequence string
- Handles multi-block alignments by concatenating segments per sequence name
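For readers unfamiliar with the format, here is a minimal, self-contained sketch of the multi-block parsing idea described above. It is not the code from this PR: `parse_clustal` is a hypothetical helper, `ValueError` stands in for biotite's `InvalidFileError`, and the example alignment is made up.

```python
import re

# Lines consisting only of spaces and the consensus symbols *, : and .
CONSENSUS_LINE = re.compile(r"[ *:.]+")


def parse_clustal(lines):
    """Return a dict mapping sequence names to gapped sequence strings.

    Hypothetical helper for illustration; ValueError stands in for the
    InvalidFileError used by the actual implementation.
    """
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("File does not start with a CLUSTAL header")
    entries = {}
    for line in lines[1:]:
        # Blank lines separate blocks
        if not line.strip():
            continue
        # Consensus lines carry no sequence data
        if CONSENSUS_LINE.fullmatch(line):
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"Malformed sequence line: {line!r}")
        name, segment = parts[0], parts[1]
        # Segments of the same name from later blocks are appended
        entries[name] = entries.get(name, "") + segment
    return entries


example = """CLUSTAL W (1.83) multiple sequence alignment

seq1      ACGT-ACG
seq2      ACGTTACG
          **** ***

seq1      TTGA
seq2      TT-A
          ** *
""".splitlines()

sequences = parse_clustal(example)
print(sequences["seq1"])
print(sequences["seq2"])
```

Because Python dicts preserve insertion order, the sequences come back in the order of their first appearance in the file.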

Conversion functions

- `get_alignment(clustal_file, seq_type=None)`: auto-detects nucleotide/protein
- `set_alignment(clustal_file, alignment, seq_names, line_length=60)`: formats into blocks
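The block formatting can be illustrated with a short sketch that works on plain strings. It rests on assumptions: `format_clustal` is a hypothetical function, the header text and the name padding are illustrative choices, and consensus lines are omitted. It is not the `set_alignment()` implementation.

```python
def format_clustal(names, gapped_seqs, line_length=60):
    """Split equally long gapped sequences into CLUSTAL-style blocks.

    Hypothetical helper; header text and padding are illustrative.
    """
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    if len(names) != len(gapped_seqs):
        raise ValueError("Number of names and sequences must match")
    # Pad names to a common width so segments line up in columns
    width = max(len(name) for name in names) + 6
    total = len(gapped_seqs[0])
    lines = ["CLUSTAL W multiple sequence alignment", ""]
    for start in range(0, total, line_length):
        for name, seq in zip(names, gapped_seqs):
            lines.append(name.ljust(width) + seq[start : start + line_length])
        # Blank line between blocks
        lines.append("")
    return lines


blocks = format_clustal(["a", "b"], ["ACGTAC", "AC-TAC"], line_length=4)
print("\n".join(blocks))
```

With `line_length=4`, the six-column alignment is split into one block of four columns and one of two, each listing every sequence once.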

Edge cases handled

- Empty file → InvalidFileError
- Missing CLUSTAL header → InvalidFileError
- Single sequence → clear ValueError (alignments need ≥ 2 sequences)
- line_length=0 → ValueError
- Reusing a ClustalFile object → entries cleared before writing
- Consensus lines (with *, :, .) correctly skipped during parsing

Tests

9 tests covering: file reading, multi-block parsing, dict interface, alignment conversion, round-trip consistency (object-level and file I/O-level), name count mismatch error, and explicit sequence type parameter.

Closes #774

padix-key · 2026-04-08T13:02:38Z

Thanks for adding the parser 👍. I'll have a look 👀

codspeed-hq · 2026-04-08T13:48:40Z

Merging this PR will not alter performance

✅ 98 untouched benchmarks

Comparing haoyu-haoyu:feat/clustalw-aln-format (83b69f9) with main (00d5b98)

padix-key

I finally found the time for review, thanks for your patience. This looks really good 👍, I just have a few minor comments.

I think you need to run ruff on your branch and rebase it on main, then the CI should pass as well.

padix-key · 2026-04-08T13:03:53Z

+"""
+
+__name__ = "biotite.sequence.io.clustal"
+__author__ = "Biotite contributors"


You can put your own name here 🙂

padix-key · 2026-04-18T13:17:44Z

+# information.
+
+__name__ = "biotite.sequence.io.clustal"
+__author__ = "Biotite contributors"


padix-key · 2026-04-18T13:17:56Z

+# information.
+
+__name__ = "biotite.sequence.io.clustal"
+__author__ = "Biotite contributors"


padix-key · 2026-04-18T13:26:12Z

+                continue
+            # Skip consensus lines (lines that start with whitespace
+            # and contain only consensus characters)
+            if line[0] == " " and all(c in _CONSENSUS_CHARS for c in line):


Completely empty lines would also pass this condition, right?

Do we need the extra line[0] == " " check?

I assume a regex would be faster than iterating over each character
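The first question can be checked in isolation. The sketch below assumes `_CONSENSUS_CHARS` holds the space character plus `*`, `:` and `.` (its definition is not shown in the diff) and compares the per-character check with a `fullmatch` on the character class `[ *:.]+`. It makes no claim about speed.

```python
import re

# Assumed definition; the actual constant is not visible in the diff
_CONSENSUS_CHARS = frozenset(" *:.")
CONSENSUS_LINE = re.compile(r"[ *:.]+")


def is_consensus_all(line):
    """Per-character check, without the extra line[0] == " " test."""
    return all(c in _CONSENSUS_CHARS for c in line)


def is_consensus_regex(line):
    """Regex check; the + quantifier requires at least one character."""
    return CONSENSUS_LINE.fullmatch(line) is not None


for line in ["", "   **:. *", "seq1   ACGT"]:
    print(repr(line), is_consensus_all(line), is_consensus_regex(line))
```

`all()` over an empty iterable is `True`, so the empty string passes the per-character check, while the `+` quantifier makes the regex reject it. In the original condition, evaluating `line[0]` on an empty string would raise an `IndexError`, so that check only works if blank lines are skipped beforehand.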

padix-key · 2026-04-18T13:27:57Z

+            # (no leading whitespace)
+            if line[0] != " ":
+                parts = line.split()
+                if len(parts) >= 2:


In which valid case would we have a non-empty/non-consensus line with fewer than two parts? Wouldn't the else case warrant an exception?

padix-key · 2026-04-20T07:31:51Z

+    buffer.seek(0)
+    file3 = clustal.ClustalFile.read(buffer)
+    alignment3 = clustal.get_alignment(file3)
+    assert str(alignment) == str(alignment3)


You should be able to compare the Alignment objects directly with each other.

Suggested change

-    assert str(alignment) == str(alignment3)
+    assert alignment == alignment3

haoyu-haoyu · 2026-04-20T17:15:33Z

Thanks @padix-key! Plan for the follow-up:

__author__ (__init__.py, convert.py, file.py) — changing all three to "haoyu", matching how sibling packages like biotite.sequence.io.fasta sign the author line.
file.py:94 consensus detection — right, line[0] == " " is redundant after the all(c in _CONSENSUS_CHARS for c in line) check, and a completely empty line would also pass the latter (it is only excluded by the preceding blank-line skip, not by this condition itself). Switching to a single compiled re.fullmatch(r"[ *:.]+", line) — faster, reads more directly, and doesn't hinge on the preceding skip.
file.py:101 len(parts) >= 2 — good catch. A non-empty, non-consensus line with fewer than two whitespace-separated parts is a malformed sequence line; replacing the silent drop with an up-front raise InvalidFileError(...) (same exception used by the header check).
test_clustal.py:98 — applying your suggestion. Will also switch test_round_trip at line 76 to the same alignment == alignment2 comparison for consistency.
Rebase onto main + ruff — will do.

Will push and ping when ready.

Closes biotite-dev#774

- Validate sequence count >= 2 in get_alignment() with clear error
- Validate line_length > 0 in set_alignment()
- Clear existing entries before writing in set_alignment() to prevent stale data when reusing a ClustalFile object

- Credit __author__ in the three new clustal modules (__init__.py, convert.py, file.py).
- Replace the two-part consensus-line check in ClustalFile._parse() with a single precompiled regex (_CONSENSUS_LINE = re.compile(r"[ *:.]+")). The old `line[0] == " " and all(c in _CONSENSUS_CHARS for c in line)` was redundant (the all-check already excludes real sequence lines, since those contain alphanumeric characters) and technically also passed empty lines (harmless only because of the preceding blank-line skip). Regex is simpler and doesn't rely on that ordering.
- Raise InvalidFileError on malformed sequence lines (< 2 whitespace-separated parts) instead of silently dropping them. Same exception class already used by the header check.
- Tests: round-trip checks now compare Alignment objects directly via `alignment == alignment2/3` instead of `str(alignment) == str(...)`, per reviewer suggestion.

Rebased onto current main; ruff format + ruff check --fix applied.

haoyu-haoyu · 2026-04-20T17:21:44Z

Pushed 7e6fb6e (force-push, branch was rebased onto current main):

__author__ — set to "haoyu" in __init__.py, convert.py, file.py.
Consensus detection — replaced with a precompiled re.compile(r"[ *:.]+") and fullmatch call. Since TextFile.read() uses splitlines(), trailing \n is already stripped, so the regex matches the full stored line.
Malformed sequence lines — ClustalFile._parse() now raises InvalidFileError("Malformed sequence line, expected a name and a sequence segment separated by whitespace, got '…'") when a non-empty, non-consensus line has fewer than two whitespace-separated parts.
Alignment equality — both test_round_trip and test_write_read_round_trip now compare Alignment objects directly.

All 9 tests pass locally, ruff format and ruff check are clean.

Ready for another look @padix-key.

padix-key reviewed Apr 20, 2026


haoyu-haoyu added 3 commits April 20, 2026 18:16

feat: add ClustalW .aln format support for sequence alignments

4a8e447

Closes biotite-dev#774

haoyu-haoyu force-pushed the feat/clustalw-aln-format branch from 83b69f9 to 7e6fb6e on April 20, 2026 17:21

haoyu-haoyu wants to merge 3 commits into biotite-dev:main from haoyu-haoyu:feat/clustalw-aln-format
