TextFileReader: strip leading invisible Unicode characters on first line only #615
Open
niaBaldoni wants to merge 2 commits into
Fixes #614
Problem
Subtitle files generated by tools like faster-whisper sometimes contain unexpected invisible Unicode characters (e.g. U+200E LEFT-TO-RIGHT MARK, U+202A LEFT-TO-RIGHT EMBEDDING) at the very start of the file. These cause SRT parsing to fail with an error.
The SRT parser's digit check fails because the invisible character precedes the subtitle index on the first line. Since the characters are invisible in common text editors, the user has no way of knowing what went wrong.
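To illustrate the failure mode, here is a minimal sketch of a digit check tripping over a leading U+200E. The function `looks_like_srt_index` is a hypothetical stand-in written for this example, not the project's actual SRT parser; it only assumes that the parser expects the first line to consist of ASCII digits.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for an SRT index check (not the real parser):
// accepts a line only if it is non-empty and every byte is an ASCII digit.
bool looks_like_srt_index(const std::string& line) {
    if (line.empty()) return false;
    for (unsigned char c : line) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}
```

With this check, `"1"` is accepted, while `"\xE2\x80\x8E" "1"` (U+200E encoded as UTF-8, followed by the index) is rejected at the first byte, even though both lines look identical in most editors.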
Changes
The existing U+FEFF (BOM) check ran on every line of every file. This has been moved into a first_line guard so it only runs once. The guard now also strips a broader set of invisible Unicode characters before any format parser sees the first line. Characters within subtitle content (such as RTL marks, possibly present in Arabic or Persian subtitles) are intentionally preserved, since the stripping is scoped to the first line only.
Tests
Added tests/tests/text_file_reader.cpp:
- strips_bom_on_first_line: existing BOM behaviour is preserved
- strips_leading_invisible_char: single invisible character at file start is stripped
- strips_stacked_leading_invisible_chars: multiple stacked invisible characters are all stripped
- preserves_invisible_chars_in_content: U+200F inside subtitle content is not stripped
Test files in tests/text_file_reader/.
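For reviewers who want a mental model of the behaviour the tests cover, the following is a rough sketch of first-line-only stripping, not the code in this PR. The names `kLeadingInvisible`, `strip_leading_invisible`, and `FirstLineGuard` are hypothetical, the input is assumed to be UTF-8, and the character set shown is an assumed subset; the actual list in the change may differ.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Assumed subset of invisible characters, as UTF-8 byte sequences.
static const char* const kLeadingInvisible[] = {
    "\xEF\xBB\xBF",  // U+FEFF BOM
    "\xE2\x80\x8E",  // U+200E LEFT-TO-RIGHT MARK
    "\xE2\x80\x8F",  // U+200F RIGHT-TO-LEFT MARK
    "\xE2\x80\xAA",  // U+202A LEFT-TO-RIGHT EMBEDDING
};

// Removes every listed sequence found at the start of the line,
// including several stacked in a row; the rest of the line is untouched.
std::string strip_leading_invisible(std::string line) {
    std::size_t pos = 0;
    bool matched = true;
    while (matched) {
        matched = false;
        for (const char* mark : kLeadingInvisible) {
            const std::size_t len = std::strlen(mark);
            if (line.compare(pos, len, mark) == 0) {
                pos += len;
                matched = true;
                break;
            }
        }
    }
    return line.substr(pos);
}

// Hypothetical guard: strips only the first line it is given.
struct FirstLineGuard {
    bool first_line = true;

    std::string process(std::string line) {
        if (first_line) {
            first_line = false;
            return strip_leading_invisible(std::move(line));
        }
        return line;  // later lines pass through unchanged
    }
};
```

In this sketch, a first line of U+200E followed by `1` comes out as `1`, while a later line that begins with U+200F is returned unchanged, which mirrors the intent of strips_leading_invisible_char and preserves_invisible_chars_in_content.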