Topic on Talk:Reading/Readers contributions via Android

Add similarity checks to the app using Perceptual hashing (similarity analysis)

197.218.91.247 (talkcontribs)

Something that would benefit both Commons and editors of every wiki is some form of similarity analysis (perceptual hashing) for media:

While most of these would require some research to evaluate and integrate into an app or even Commons, perceptual hashing for images is already supported by the backend software that Wikimedia currently uses:

If exposed through an API, or even a special page that enhances or supersedes Special:FileDuplicateSearch, this would be a huge gain for curation and for readers.

This can also be exposed in the app, by asking editors whether two images are similar, or even implemented as an image captcha. It can still match images after attacks such as "rotation, skew, contrast adjustment and different compression/format". In some cases it will even identify images that are not duplicates but are still useful. A similar algorithm is blockhash (http://blockhash.io/), which claims to be more efficient in some instances. A non-academic comparison of the two is here:

http://littlesvr.ca/grumble/2015/04/27/perceptual-hash-comparison-phash-vs-blockhash-false-positives/
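
For anyone unfamiliar with how such fingerprints are compared, here is a minimal Python sketch. It is only an illustration under stated assumptions, not pHash or blockhash: it uses the simpler "average hash" idea (shrink the image to a small grid, then set one bit per cell that is brighter than the mean cell) and compares fingerprints by Hamming distance. The function names and the synthetic test images are made up for this example.

```python
def average_hash(pixels, hash_size=8):
    """Return an integer fingerprint for a grayscale image.

    `pixels` is a list of rows of brightness values (0-255) and must be at
    least hash_size x hash_size. The image is shrunk to a hash_size x
    hash_size grid by averaging blocks; each cell then contributes one bit:
    1 if it is brighter than the mean cell, otherwise 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(hash_size):
        y0, y1 = cy * height // hash_size, (cy + 1) * height // hash_size
        for cx in range(hash_size):
            x0, x1 = cx * width // hash_size, (cx + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 images (no real files involved).
gradient = [[x * 16 for x in range(16)] for _ in range(16)]       # dark left, bright right
brighter = [[min(255, v + 10) for v in row] for row in gradient]  # small brightness change
flipped = [row[::-1] for row in gradient]                         # bright left, dark right

same = hamming_distance(average_hash(gradient), average_hash(brighter))    # 0
different = hamming_distance(average_hash(gradient), average_hash(flipped))  # 64
```

A small distance suggests the same picture and a large one suggests different pictures. An average hash like this tolerates brightness tweaks but not rotation; handling that kind of change is what the libraries linked above are said to do.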

Other use cases include surfacing similar images in search results (like Google does), in the media search tool within VisualEditor or the new wikitext editor, on the image file page, or in MediaViewer. The applications are endless.

Even if these tools for the app fail and are eventually removed, this technology can continue being used in other products. Win-win!

197.218.91.247 (talkcontribs)

As far as curation is concerned, some ways to help moderators / editors / admins include:

  • Fingerprinting deleted files (creating an index of deleted files)
    • For readers in the app - quickly get them to determine false positives, and help mark them for deletion
    • Identify duplicates - possibly shown in recent changes or Special:NewFiles
    • Surface this in AbuseFilter - to block or tag images that are proven to be the same, e.g. after rotation
  • Automatically tag them
    • Suggest adding them to categories related to the file they match
    • Tag these as files of interest
    • Adding images to articles - inform readers / editors that a possible duplicate exists, and encourage them to reuse it instead of uploading
  • Make finding duplicates more efficient
    • When determining files to delete, an editor could search for duplicates to make the cleanup more efficient
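
To make the "index of deleted files" idea a bit more concrete, here is a rough Python sketch. Everything in it is hypothetical: the `FingerprintIndex` class, the file names, the 64-bit fingerprint values and the distance threshold of 5 are invented for illustration, and a real deployment would presumably store fingerprints in a database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class FingerprintIndex:
    """Hypothetical in-memory index mapping file names to 64-bit fingerprints."""

    max_distance: int = 5  # assumed threshold; would need tuning on real data
    entries: dict = field(default_factory=dict)

    def add(self, name, fingerprint):
        self.entries[name] = fingerprint

    def find_similar(self, fingerprint):
        """Return (name, distance) pairs within max_distance, closest first."""
        matches = []
        for name, known in self.entries.items():
            distance = bin(fingerprint ^ known).count("1")  # Hamming distance
            if distance <= self.max_distance:
                matches.append((name, distance))
        return sorted(matches, key=lambda pair: pair[1])


# Made-up fingerprints of previously deleted files.
deleted = FingerprintIndex()
deleted.add("Deleted_logo.png", 0x0F0F0F0F0F0F0F0F)
deleted.add("Deleted_map.png", 0xFFFF00000000FFFF)

# A new upload whose fingerprint differs from the first entry by one bit.
upload = 0x0F0F0F0F0F0F0F0E
candidates = deleted.find_similar(upload)  # [("Deleted_logo.png", 1)]
```

A non-empty `candidates` list is the sort of signal that could be shown to app readers for confirmation, or used to tag the upload, rather than acted on automatically.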

Of course, some of these depend on the speed at which it can process new files and match them against the existing fingerprints. But even if it is a scheduled operation running hours or days later, it will still be way better than the status quo.
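
On the matching speed: comparing each new file against every stored fingerprint is a linear scan. One commonly used alternative for Hamming-distance lookups is a BK-tree, sketched below in Python. This is my own assumption about how the matching could be sped up, not something pHash or MediaWiki is known to do, and the class and the tiny 4-bit fingerprints are purely illustrative.

```python
class BKTree:
    """Metric tree over integer fingerprints, using Hamming distance."""

    def __init__(self):
        self.root = None  # each node is [fingerprint, {distance: child_node}]

    @staticmethod
    def _distance(a, b):
        return bin(a ^ b).count("1")

    def add(self, fingerprint):
        if self.root is None:
            self.root = [fingerprint, {}]
            return
        node = self.root
        while True:
            d = self._distance(fingerprint, node[0])
            if d == 0:
                return  # fingerprint already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [fingerprint, {}]
                return
            node = child

    def search(self, fingerprint, max_distance):
        """Return sorted (distance, fingerprint) pairs within max_distance."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self._distance(fingerprint, value)
            if d <= max_distance:
                results.append((d, value))
            # Triangle inequality: only subtrees keyed within
            # [d - max_distance, d + max_distance] can contain matches.
            for key, child in children.items():
                if d - max_distance <= key <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for fp in (0b0000, 0b0001, 0b0011, 0b1111, 0b1000):
    tree.add(fp)

nearby = tree.search(0b0000, 1)  # [(0, 0), (1, 1), (1, 8)]
```

The pruning step lets a lookup skip subtrees that cannot contain a match, so it usually touches far fewer entries than a full scan when the allowed distance is small.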

At least the demo on the pHash site is very fast:

http://www.phash.org/cgi-bin/phash-demo-new.cgi
