Try to avoid treating censorship as "not visual duplicates (alternate)" #2000

@roachcord3

Currently, the algorithm for detecting visual dupes does not handle cases where small areas of the image are covered with a censor (such as a mosaic, bar, blur, or solid color); it will pretty much always say that a censored and uncensored pair are "not visual duplicates (alternate)" or some variant of that type of negative. After years of using Hydrus and seeing people talk about it in Discord, I believe most users consider censored versions of images to be inferior duplicates rather than alternates. Thus, I'd like to request that the algorithm that detects visual dupes make some attempt to spot censorship and treat it as a new, different case that:

  • Can be selected as an option in the auto-filtering rules
  • Has a directional relationship; e.g., "A is probably a censored version of B" in auto-filtering rules, "this is probably a censored version of the other"/"the other is probably a censored version of this" (with associated red/green text colors) in manual filtering.
  • Is not treated as being on the same axis as the other visual duplicate detection states, so as not to interfere with any existing rules/workflows that rely on that heuristic confidence gradient

For example, if a pair is exactly the same (minus some basic JPEG noise) except for a couple blobs of completely different pixels, those regions of completely different pixels in each respective image could be searched for common visual patterns of censorship; if censorship is detected in one but not the other, then that'd be a positive match for this relationship.
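To make the first half of that idea concrete, here is a rough sketch of how the "blobs of completely different pixels" could be located. It is not how Hydrus does anything today; the function name `diff_regions`, the `tol` and `min_area` defaults, and the plain list-of-rows grayscale representation are all assumptions made up for illustration.

```python
from collections import deque

def diff_regions(img_a, img_b, tol=24, min_area=16):
    """Return inclusive bounding boxes (top, left, bottom, right) of connected
    areas where two same-sized grayscale images differ by more than `tol`.

    Images are lists of rows of ints in 0..255. `tol` is a hypothetical
    allowance for JPEG noise; `min_area` drops tiny specks.
    """
    height, width = len(img_a), len(img_a[0])
    # Mark pixels that differ by more than noise would plausibly explain.
    mask = [[abs(img_a[y][x] - img_b[y][x]) > tol for x in range(width)]
            for y in range(height)]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill one 4-connected blob and track its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right, area = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes
```

If a pair yields only a few small boxes and is otherwise identical within tolerance, those boxes are the only places that would need to be inspected for censorship patterns.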

I believe it's considered not that difficult to provide a good guess as to whether an image contains a mosaic or other large contiguous shapes of the same color, and scanning only the sub-sections of the images where the big obvious differences are should further reduce the runtime. Nevertheless, if it would inevitably add too much to the runtime of the algorithm to be acceptable for the default user experience, then an option to enable it (maybe something in the dupe filter page that says "attempt to detect censorship" which enables this search) would be a good solution, letting advanced users opt in if they're ok with slower visual duplicate detection.
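As a very rough illustration of what such a guess could look like when restricted to one differing region, here is a sketch under heavy assumptions: grayscale values as lists of rows, a mosaic grid aligned to the region's top-left corner, and made-up names (`looks_solid`, `looks_mosaic`, `censored_side`) and thresholds. It only covers solid fills and mosaics; blur would need a different test.

```python
def looks_solid(region, tol=4):
    """True if every pixel in the region is within `tol` of every other."""
    values = [v for row in region for v in row]
    return max(values) - min(values) <= tol

def looks_mosaic(region, block_sizes=(4, 8, 16), tol=4):
    """True if the region tiles into flat square blocks of differing brightness.

    Assumes the block grid starts at the region's top-left corner.
    """
    height, width = len(region), len(region[0])
    for size in block_sizes:
        if height < 2 * size or width < 2 * size:
            continue  # need at least a 2x2 grid of blocks to call it a mosaic
        means, uniform = [], True
        for by in range(0, height - size + 1, size):
            for bx in range(0, width - size + 1, size):
                block = [region[y][x]
                         for y in range(by, by + size)
                         for x in range(bx, bx + size)]
                if max(block) - min(block) > tol:
                    uniform = False
                    break
                means.append(sum(block) // len(block))
            if not uniform:
                break
        # Flat blocks that all share one value are a solid fill, not a mosaic.
        if uniform and max(means) - min(means) > tol:
            return True
    return False

def classify_region(region):
    """Return "solid", "mosaic", or None for a cropped region."""
    if looks_solid(region):
        return "solid"
    if looks_mosaic(region):
        return "mosaic"
    return None

def crop(img, box):
    """Cut an inclusive (top, left, bottom, right) box out of an image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in img[top:bottom + 1]]

def censored_side(img_a, img_b, box):
    """Return "a" or "b" if exactly one image looks censored in `box`, else None."""
    kind_a = classify_region(crop(img_a, box))
    kind_b = classify_region(crop(img_b, box))
    if kind_a and not kind_b:
        return "a"
    if kind_b and not kind_a:
        return "b"
    return None
```

Because only the cropped region is scanned, the cost scales with the size of the differing area rather than the whole image, which is the point of restricting the search in the first place.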

It's pretty common for people to collect freely available images from boorus, imageboards, and social media over time based on the content of the images (e.g., fan art of a certain character), then eventually notice that a particular artist makes stuff they really like, and so start supporting that artist on a paysite... only to be met with 4-6 figures of potential duplicates from a huge back catalog. Speaking at least for my own case, if Hydrus had had this feature since I started using it, I'd have saved hundreds of hours of my life by letting the machine do a good chunk of the work for me; even if the false negative rate was pretty high, it'd still be worth it, because every single censored dupe represents dozens or hundreds of pairs to filter, so even a 25% success rate would still be thousands fewer pairs to go through.
