
Topic on Help talk:Extension:Wikisource/Wikimedia OCR

OCR images on Commons in specified categories

Prototyperspective (talkcontribs)

Hello, this seems like a very useful tool. I think it has much greater potential and would be used far more if it could run on Commons categories or PetScan category intersections. @Samwilson and others, could you please take a look at this proposal? (now here)

Maybe there is already a tool that scans all images in a category. It could be altered to also accept PetScan results and to add categories based on the text that was found. That would be really useful for many applications.

Prototyperspective (talkcontribs)

I exported the PetScan results and converted them to URLs so they can be opened quickly in new tabs for categorization. I still think a feature to OCR files in a specified category would be very useful. Instead of adding categories directly, I guess the tool could write the OCR text to the file info somehow. One could then build a search query in Special:Search and bulk-categorize the results using Cat-a-lot. For example, files matching something like ocr:", 2016" deepcategory:"Our World in Data maps" (or insource:"|ocr=, 2016") would go into c:Category:2016 maps of the world (except for non-world maps, which can be easily spotted). This is just an example.
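For anyone wanting to repeat the title-to-URL conversion, here is a minimal sketch. It assumes the PetScan export is plain text with one file title per line, which may not match the export format actually used above; the function names are made up for illustration.

```python
from urllib.parse import quote

COMMONS_WIKI = "https://commons.wikimedia.org/wiki/"


def title_to_url(title: str) -> str:
    """Turn a file title into a Commons file page URL."""
    title = title.strip()
    # Assumption: titles may or may not already carry the "File:" prefix.
    if not title.startswith("File:"):
        title = "File:" + title
    # MediaWiki page URLs use underscores in place of spaces.
    return COMMONS_WIKI + quote(title.replace(" ", "_"), safe=":/")


def titles_to_urls(text: str) -> list[str]:
    """Convert a block of titles (one per line) to URLs, skipping blank lines."""
    return [title_to_url(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for url in titles_to_urls("Example map 2016.png\nFile:Another map.svg"):
        print(url)
```

The printed URLs can then be pasted into any "open multiple URLs" browser helper.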

  • Adding a feature to OCR all files in a category using the incategory search operator
  • Adding a feature to write the OCR'd text to the file description
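As a rough idea of what the first step might involve, the sketch below lists the files in one category through the standard MediaWiki API (list=categorymembers with cmtype=file). It only shows the listing part; the OCR call and the write-back to the file description are not covered, and the function names and the injectable fetch callback are illustrative assumptions, not part of the existing tool.

```python
from typing import Callable, Iterator, Optional

API_URL = "https://commons.wikimedia.org/w/api.php"


def category_files_params(category: str, cont: Optional[dict] = None) -> dict:
    """Build query parameters for listing the files in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # e.g. "Category:Our World in Data maps"
        "cmtype": "file",      # files only, no subcategories or pages
        "cmlimit": "max",
        "format": "json",
    }
    if cont:
        # Pass the API's "continue" block back to get the next batch.
        params.update(cont)
    return params


def iter_category_files(category: str, fetch: Callable[[dict], dict]) -> Iterator[str]:
    """Yield file titles, following API continuation until exhausted.

    `fetch` is a hypothetical callback that sends the parameters to
    API_URL and returns the decoded JSON response as a dict.
    """
    cont = None
    while True:
        data = fetch(category_files_params(category, cont))
        for member in data.get("query", {}).get("categorymembers", []):
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
```

With the third-party requests library, fetch could be something like `lambda p: requests.get(API_URL, params=p).json()`. Note that this lists direct members only, so it does not behave like deepcategory, which also searches subcategories.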

@Enhancing999 and Glrx: you may be interested in this since you participated in the discussion. That said, I don't think it's an overly important issue, and having so much OCR'd text on Commons could also cause problems if files show up whenever terms in the ocr field of the Information template(?) are searched for without something like ocr:"search terms". However, since the OCR tool already exists, implementing this may not take much time and could be worth it, or it may be good to track this somewhere else.
