Parsing error for citations with defendant 'Thompson'

In issue [#3924](https://github.com/freelawproject/courtlistener/issues/3924), we identified a bug in how Eyecite parses citations in which the defendant's last name is 'Thompson'.

For example, for the citation `'Shapiro v. Thompson, 394 U. S. 618'`:

- Expected output: `volume: 394, reporter: 'U.S.', page: '618'`
- Actual output: `volume: None, reporter: 'Thompson', page: '394'`

Other inputs that are parsed incorrectly include `Adams v. Thompson, 560 F. Supp. 894` and `Mozena v. Thompson, 44 A.2d 276`.


I've been using the first example to debug this issue, and noticed that Eyecite identifies two tokens within the input string: ["Thompson's Unreported Cases (TN)"](https://github.com/freelawproject/reporters-db/blob/d3c57f01e97f5e46d6a6c6cd4ebab46aeb5e7b65/reporters_db/data/reporters.json#L25080-L25106) and ["United States Supreme Court Reports."](https://github.com/freelawproject/reporters-db/blob/d3c57f01e97f5e46d6a6c6cd4ebab46aeb5e7b65/reporters_db/data/reporters.json#L25405-L25455). The problem arises because these tokens overlap (both include "394"), and Eyecite's [tokenize](https://github.com/freelawproject/eyecite/blob/3e7836a62e8bd1c6e60eb95a5dc47fded74a9283/eyecite/tokenizers.py#L293-L329) method prioritizes the rightmost token when it encounters overlaps, which leads to this result.
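To make the overlap concrete, here is a minimal standalone sketch that does not use Eyecite at all. The two regexes and the `Span`, `find_spans`, `overlaps`, and `drop_overlaps` names are hypothetical and invented for illustration; Eyecite builds its real patterns from reporters-db, and its actual overlap handling lives in the `tokenize` method linked above. The sketch only shows that both candidate matches claim the characters "394", so whichever one is kept pushes the other out.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int
    end: int
    text: str


TEXT = "Shapiro v. Thompson, 394 U. S. 618"

# Hypothetical patterns for this sketch only; these are NOT Eyecite's regexes.
PATTERNS = {
    "Thompson": r"Thompson,? \d+",
    "U.S.": r"\d+ U\. ?S\. \d+",
}


def find_spans(text):
    """Collect every match of every pattern as a labelled character span."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            spans.append(Span(label, m.start(), m.end(), m.group()))
    return spans


def overlaps(a, b):
    """Two half-open spans overlap when each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def drop_overlaps(spans):
    """Keep spans in the order given, skipping any that overlap a kept span.

    The caller chooses the priority order; this sketch makes no claim about
    the order Eyecite itself uses.
    """
    kept = []
    for span in spans:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return kept


spans = {s.label: s for s in find_spans(TEXT)}
thompson, us = spans["Thompson"], spans["U.S."]

# The characters claimed by both candidate tokens.
shared = TEXT[max(thompson.start, us.start):min(thompson.end, us.end)]

thompson_first = drop_overlaps([thompson, us])
us_first = drop_overlaps([us, thompson])
```

In this sketch `shared` is the string "394". When the 'Thompson' span wins, the 'U.S.' span is discarded and the correct citation can no longer be formed; when the 'U.S.' span wins, the result matches the expected output. Any fix therefore has to change which of the two overlapping tokens is preferred, or stop the 'Thompson' token from matching in this context.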

Parsing error for citations with defendant 'Thompson' #174
