TextFileReader: strip leading invisible Unicode characters on first line only #615

Open
niaBaldoni wants to merge 2 commits into TypesettingTools:master from niaBaldoni:strip_leading_invisible_char

@niaBaldoni

Fixes #614

Problem

Subtitle files generated by tools like faster-whisper sometimes contain unexpected invisible Unicode characters (e.g. U+200E LEFT-TO-RIGHT MARK, U+202A LEFT-TO-RIGHT EMBEDDING) at the very start of the file. These cause SRT parsing to fail with the error:

Parsing SRT: Expected subtitle index at line 1

The SRT parser's digit check fails because the invisible character precedes the subtitle index on the first line. Since the characters are invisible in common text editors, the user has no way of knowing what went wrong.
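
To illustrate the failure, here is a minimal sketch of a digit check of this kind. The function `looks_like_srt_index` is a hypothetical stand-in, not the actual parser code: it only shows that the UTF-8 bytes of U+200E (`E2 80 8E`) are not ASCII digits, so a line that looks like `1` in an editor is rejected.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for the parser's index check (not the real code):
// accepts a line only if it is non-empty and every byte is an ASCII digit.
bool looks_like_srt_index(const std::string& line) {
    if (line.empty()) return false;
    for (unsigned char c : line) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}
```

With this sketch, `looks_like_srt_index("1")` is true, while the same line prefixed with the three bytes of U+200E is rejected at the first byte.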

Changes

Previously, the U+FEFF (BOM) check ran on every line of every file. It now sits behind a first_line guard, so it runs only once per file. The guard also strips a broader set of leading invisible Unicode characters before any format parser sees the first line. Invisible characters within subtitle content (such as RTL marks, which may appear in Arabic or Persian subtitles) are intentionally preserved, because the stripping is scoped to the first line only.
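
The guard could be sketched roughly as follows. This is not the code from the PR: `LineCleaner`, `strip_leading_invisible`, and `is_leading_invisible` are hypothetical names, and the character set is an illustrative guess, since the description does not list the exact characters. The sketch assumes UTF-8 input and relies on every code point in its set encoding as three bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative set only (an assumption): the PR text names U+FEFF, U+200E
// and U+202A but does not enumerate the full list.
bool is_leading_invisible(std::uint32_t cp) {
    return cp == 0xFEFF                       // byte order mark
        || (cp >= 0x200B && cp <= 0x200F)     // zero width space .. RLM
        || (cp >= 0x202A && cp <= 0x202E);    // directional embeddings/overrides
}

// Removes matching code points from the front of a UTF-8 line.
// Every code point in the set above is a three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx).
void strip_leading_invisible(std::string& line) {
    std::size_t pos = 0;
    while (line.size() - pos >= 3) {
        const auto b0 = static_cast<unsigned char>(line[pos]);
        const auto b1 = static_cast<unsigned char>(line[pos + 1]);
        const auto b2 = static_cast<unsigned char>(line[pos + 2]);
        if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) break;
        const std::uint32_t cp = (static_cast<std::uint32_t>(b0 & 0x0F) << 12)
                               | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)
                               | static_cast<std::uint32_t>(b2 & 0x3F);
        if (!is_leading_invisible(cp)) break;
        pos += 3;  // skip this invisible character and look at the next one
    }
    line.erase(0, pos);
}

// Hypothetical reader fragment showing the first-line guard.
class LineCleaner {
    bool first_line = true;
public:
    std::string clean(std::string line) {
        if (first_line) {
            first_line = false;            // the guard fires once per file
            strip_leading_invisible(line);
        }
        return line;                       // later lines pass through untouched
    }
};
```

In this sketch, a first line consisting of a BOM, U+200E, and `1` comes back as `1`, while a later line that begins with U+200F is returned unchanged.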

Tests

Added tests/tests/text_file_reader.cpp:

  • strips_bom_on_first_line: existing BOM behaviour is preserved
  • strips_leading_invisible_char: single invisible character at file start is stripped
  • strips_stacked_leading_invisible_chars: multiple stacked invisible characters are all stripped
  • preserves_invisible_chars_in_content: U+200F inside subtitle content is not stripped

Test files are in tests/text_file_reader/.

Development

Successfully merging this pull request may close these issues.

Parsing SRT: Expected subtitle index error caused by invisible Unicode characters at file start
