feat: add ClustalW .aln format support for sequence alignments #880

Open
haoyu-haoyu wants to merge 3 commits into biotite-dev:main from haoyu-haoyu:feat/clustalw-aln-format

Conversation

@haoyu-haoyu

Summary

Add support for reading and writing ClustalW .aln alignment files, following the existing FASTA alignment I/O pattern.

New files

| File | Description |
| --- | --- |
| src/biotite/sequence/io/clustal/__init__.py | Package exports |
| src/biotite/sequence/io/clustal/file.py | ClustalFile class (TextFile + MutableMapping) |
| src/biotite/sequence/io/clustal/convert.py | get_alignment() / set_alignment() conversion functions |
| tests/sequence/data/clustal.aln | Single-block test data |
| tests/sequence/data/clustal_multi.aln | Multi-block test data |
| tests/sequence/test_clustal.py | 9 test cases |

Implementation details

ClustalFile

  • Extends TextFile and MutableMapping (same pattern as FastaFile)
  • Parses CLUSTAL header, sequence blocks, and consensus lines
  • Dict-like interface: clustal_file["seq1"] returns gapped sequence string
  • Handles multi-block alignments by concatenating segments per sequence name
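
To illustrate the multi-block handling described above, here is a minimal standalone sketch that concatenates segments per sequence name. It does not use Biotite: `parse_clustal_lines` is a hypothetical helper rather than the PR's `ClustalFile._parse()`, it raises a plain `ValueError` where the PR uses `InvalidFileError`, and it assumes that consensus lines always begin with whitespace.

```python
def parse_clustal_lines(lines):
    """Hypothetical helper: map sequence names to gapped strings."""
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("Missing CLUSTAL header")
    entries = {}
    for line in lines[1:]:
        # Skip blank lines and consensus lines (assumed to start with a space)
        if not line.strip() or line[0] == " ":
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"Malformed sequence line: {line!r}")
        name, segment = parts[0], parts[1]
        # A name seen in an earlier block gets the new segment appended
        entries[name] = entries.get(name, "") + segment
    return entries


example = [
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seq1      ACGT-ACGT",
    "seq2      ACGTTACGT",
    "          **** ****",
    "",
    "seq1      GGA",
    "seq2      G-A",
    "          * *",
]
print(parse_clustal_lines(example))
# {'seq1': 'ACGT-ACGTGGA', 'seq2': 'ACGTTACGTG-A'}
```

Because a plain dict keeps insertion order, the sequences come back in the order they first appear in the file.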

Conversion functions

  • get_alignment(clustal_file, seq_type=None) — auto-detects nucleotide/protein
  • set_alignment(clustal_file, alignment, seq_names, line_length=60) — formats into blocks
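
The block formatting controlled by `line_length` could look roughly like the sketch below. This is an illustration under stated assumptions, not the PR's `set_alignment()`: `format_clustal_blocks`, the header text, and the name-column padding of six spaces are all hypothetical, and no consensus line is written.

```python
def format_clustal_blocks(names, gapped_seqs, line_length=60):
    """Hypothetical helper: split gapped sequences into Clustal-style blocks."""
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    if len(names) != len(gapped_seqs):
        raise ValueError("Number of names and sequences must match")
    # Pad names to a common width so the segments line up (padding is assumed)
    width = max(len(name) for name in names) + 6
    total = len(gapped_seqs[0])
    lines = ["CLUSTAL W multiple sequence alignment", ""]
    for start in range(0, total, line_length):
        for name, seq in zip(names, gapped_seqs):
            lines.append(name.ljust(width) + seq[start : start + line_length])
        # A blank line separates consecutive blocks
        lines.append("")
    return lines


for line in format_clustal_blocks(["a", "b"], ["ACGTAC", "AC-TAC"], line_length=4):
    print(line)
```

With `line_length=4`, the six alignment columns are split into one block of four columns and one block of two.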

Edge cases handled

  • Empty file → InvalidFileError
  • Missing CLUSTAL header → InvalidFileError
  • Single sequence → clear ValueError (alignments need ≥ 2 sequences)
  • line_length=0 → ValueError
  • Reusing a ClustalFile object → entries cleared before writing
  • Consensus lines (with *, :, .) correctly skipped during parsing

Tests

9 tests covering: file reading, multi-block parsing, dict interface, alignment conversion, round-trip consistency (object-level and file I/O-level), name count mismatch error, and explicit sequence type parameter.

Closes #774

@padix-key
Member

Thanks for adding the parser 👍. I'll have a look 👀

@codspeed-hq

codspeed-hq bot commented Apr 8, 2026

Merging this PR will not alter performance

✅ 98 untouched benchmarks


Comparing haoyu-haoyu:feat/clustalw-aln-format (83b69f9) with main (00d5b98)

Member

@padix-key left a comment

I finally found the time for a review, thanks for your patience. This looks really good 👍, I just have a few minor comments.

I think you need to run ruff on your branch and rebase it on main, then the CI should pass as well.

"""

__name__ = "biotite.sequence.io.clustal"
__author__ = "Biotite contributors"
Member

You can put your own name here 🙂

# information.

__name__ = "biotite.sequence.io.clustal"
__author__ = "Biotite contributors"
Member

See above

Comment thread src/biotite/sequence/io/clustal/file.py Outdated
# information.

__name__ = "biotite.sequence.io.clustal"
__author__ = "Biotite contributors"
Member

See above

Comment thread src/biotite/sequence/io/clustal/file.py Outdated
continue
# Skip consensus lines (lines that start with whitespace
# and contain only consensus characters)
if line[0] == " " and all(c in _CONSENSUS_CHARS for c in line):
Member

  • Completely empty lines would also pass this condition, right?
  • Do we need the extra line[0] == " " check?
  • I assume a regex would be faster than iterating over each character
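
The first point can be checked in isolation. The sketch below compares the character-wise check with a regex-based one; the value of `_CONSENSUS_CHARS` is assumed (it is not shown in the diff excerpt), and the pattern is only one possible choice. It makes no claim about relative speed.

```python
import re

_CONSENSUS_CHARS = " *:."  # assumed value, not taken from the PR diff
_CONSENSUS_LINE = re.compile(r"[ *:.]+")


def is_consensus_all(line):
    # all() over an empty iterable is True, so "" passes this check
    return all(c in _CONSENSUS_CHARS for c in line)


def is_consensus_regex(line):
    # "+" requires at least one character, so "" does not match
    return _CONSENSUS_LINE.fullmatch(line) is not None


for line in ["", "   **:. *", "seq1  ACGT"]:
    print(repr(line), is_consensus_all(line), is_consensus_regex(line))
# '' True False
# '   **:. *' True True
# 'seq1  ACGT' False False
```

Note that a line consisting only of spaces matches both checks, so it is treated like a consensus line and skipped.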

Comment thread src/biotite/sequence/io/clustal/file.py Outdated
# (no leading whitespace)
if line[0] != " ":
parts = line.split()
if len(parts) >= 2:
Member

In which valid case would we have a non-empty, non-consensus line with fewer than two parts? Wouldn't the else case warrant an exception?

Comment thread tests/sequence/test_clustal.py Outdated
buffer.seek(0)
file3 = clustal.ClustalFile.read(buffer)
alignment3 = clustal.get_alignment(file3)
assert str(alignment) == str(alignment3)
Member

You should be able to compare the Alignment objects directly with each other.

Suggested change
- assert str(alignment) == str(alignment3)
+ assert alignment == alignment3

@haoyu-haoyu
Author

Thanks @padix-key! Plan for the follow-up:

  1. __author__ (__init__.py, convert.py, file.py) — changing all three to "haoyu", matching how sibling packages like biotite.sequence.io.fasta sign the author line.

  2. file.py:94 consensus detection — right, line[0] == " " is redundant after the all(c in _CONSENSUS_CHARS for c in line) check, and a completely empty line would also pass the latter (it is only excluded by the preceding blank-line skip, not by this condition itself). Switching to a single compiled re.fullmatch(r"[ *:.]+", line) — faster, reads more directly, and doesn't hinge on the preceding skip.

  3. file.py:101 len(parts) >= 2 — good catch. A non-empty, non-consensus line with fewer than two whitespace-separated parts is a malformed sequence line; replacing the silent drop with an up-front raise InvalidFileError(...) (same exception used by the header check).

  4. test_clustal.py:98 — applying your suggestion. Will also switch test_round_trip at line 76 to the same alignment == alignment2 comparison for consistency.

  5. Rebase onto main + ruff — will do.

Will push and ping when ready.

- Validate sequence count >= 2 in get_alignment() with clear error
- Validate line_length > 0 in set_alignment()
- Clear existing entries before writing in set_alignment() to prevent
  stale data when reusing a ClustalFile object
- Credit __author__ in the three new clustal modules (__init__.py,
  convert.py, file.py).
- Replace the two-part consensus-line check in ClustalFile._parse()
  with a single precompiled regex (_CONSENSUS_LINE = re.compile(r"[ *:.]+")).
  The old `line[0] == " " and all(c in _CONSENSUS_CHARS for c in line)`
  was redundant (the all-check already excludes real sequence lines,
  since those contain alphanumeric characters) and technically also
  passed empty lines (harmless only because of the preceding blank-line
  skip). Regex is simpler and doesn't rely on that ordering.
- Raise InvalidFileError on malformed sequence lines (< 2 whitespace-
  separated parts) instead of silently dropping them. Same exception
  class already used by the header check.
- Tests: round-trip checks now compare Alignment objects directly via
  `alignment == alignment2/3` instead of `str(alignment) == str(...)`,
  per reviewer suggestion.

Rebased onto current main; ruff format + ruff check --fix applied.

@haoyu-haoyu force-pushed the feat/clustalw-aln-format branch from 83b69f9 to 7e6fb6e on April 20, 2026 17:21

@haoyu-haoyu
Author

Pushed 7e6fb6e (force-push, branch was rebased onto current main):

  1. __author__ — set to "haoyu" in __init__.py, convert.py, file.py.
  2. Consensus detection — replaced with a precompiled re.compile(r"[ *:.]+") and fullmatch call. Since TextFile.read() uses splitlines(), trailing \n is already stripped, so the regex matches the full stored line.
  3. Malformed sequence lines: ClustalFile._parse() now raises InvalidFileError("Malformed sequence line, expected a name and a sequence segment separated by whitespace, got '…'") when a non-empty, non-consensus line has fewer than two whitespace-separated parts.
  4. Alignment equality — both test_round_trip and test_write_read_round_trip now compare Alignment objects directly.
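
Points 2 and 3 combine into a per-line classification that can be sketched as follows. This is a standalone approximation, not the code pushed in 7e6fb6e: `parse_sequence_line` is a hypothetical helper, the `InvalidFileError` class is a local stand-in so the snippet runs without Biotite, and the error message is shortened.

```python
import re


class InvalidFileError(Exception):
    """Local stand-in for the exception class used by the PR."""


_CONSENSUS_LINE = re.compile(r"[ *:.]+")


def parse_sequence_line(line):
    """Return (name, segment), or None for blank and consensus lines."""
    if len(line) == 0 or _CONSENSUS_LINE.fullmatch(line):
        return None
    parts = line.split()
    if len(parts) < 2:
        # Fail loudly instead of silently dropping the line
        raise InvalidFileError(f"Malformed sequence line: {line!r}")
    return parts[0], parts[1]


print(parse_sequence_line("seq1      ACGT-ACGT"))  # ('seq1', 'ACGT-ACGT')
print(parse_sequence_line("          **** ****"))  # None
```

A line such as `"seq1"` with no sequence segment raises `InvalidFileError` in this sketch.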

All 9 tests pass locally, ruff format and ruff check are clean.

Ready for another look @padix-key.

Development

Successfully merging this pull request may close these issues.

Support ClustalW .aln file formats for sequence alignments

2 participants