Hi, I imported a few thousand articles into my MediaWiki site. At the time of the import, all the articles had a common category. I then replaced that category with a new one using Pywikibot's replace.py. The problem now is that the previous category still shows that it has pages, although it should not, and in some cases duplicate page names are shown. I ran Manual:rebuildall.php, but it did not solve the issue. What might be the reason for this, and what is the solution?
Topic on Project:Support desk/Flow
Is your wiki publicly accessible so we can take a look?
+1 to "do you have a link?" Duplicate page names in a category sounds super weird and should not be possible.
Also, are you using file cache?
The wiki is publicly available, but it is not publicly editable. Also, the language is not English, which is why I did not mention the link earlier.
The category 'Banglapedia' (http://bn.banglapedia.org/index.php?title=Category:Banglapedia https://www.dropbox.com/sh/txzcb6zdw461xde/JhhIB-58Pc/banglapedia%20issue/list.jpg ) should not list any articles, as I replaced it with the category 'বাংলাপিডিয়া'.
http://bn.banglapedia.org/index.php?title=special:AllPages&from=A page lists all the pages and if you check the following image you can see the duplicate. I marked one duplicate name on the page but there are others too. https://www.dropbox.com/sh/txzcb6zdw461xde/Exe6ponqPd/banglapedia%20issue/list%20duplicate%20names.jpg The marked article is http://bn.banglapedia.org/index.php?title=%E0%A6%85%E0%A6%A8%E0%A7%8D%E0%A6%A4%E0%A7%8D%E0%A6%AF%E0%A7%87%E0%A6%B7%E0%A7%8D%E0%A6%9F%E0%A6%BF%E0%A6%95%E0%A7%8D%E0%A6%B0%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE Page view: https://www.dropbox.com/sh/txzcb6zdw461xde/JqTGzc3x84/banglapedia%20issue/page%20view.jpg Source: https://www.dropbox.com/sh/txzcb6zdw461xde/bvUWvZ00l7/banglapedia%20issue/page_edit%20mode.jpg
This is really strange. For some reason, from Category:Banglapedia I opened the page "অ-নবায়নযোগ্য শক্তি", but the title shown on the page is "অ-নবায়নযোগ্য শক্তি" instead. Searching (CTRL+F) for that title on the other category page does not find it, so the two titles contain different characters.
Furthermore:
- API:Categorymembers of বিষয়শ্রেণী:Banglapedia:
- The original page appears at the first position as <cm pageid="2601" ns="0" title="অ-নবায়নযোগ্য শক্তি" />
- API:Categorymembers of বিষয়শ্রেণী:বাংলাপিডিয়া:
- The page that has a very similar title (transparently redirected from the first) appears at the second position as <cm pageid="11646" ns="0" title="অ-নবায়নযোগ্য শক্তি" /> (different Page ID)
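For anyone who wants to repeat the API:Categorymembers check above, here is a minimal Python sketch that only builds the query URL. The api.php location is an assumption (guessed from the index.php URLs in this thread), and the helper name is made up for illustration; the query parameters are the standard ones for list=categorymembers.

```python
from urllib.parse import urlencode

# Assumed endpoint: api.php next to the index.php seen in the thread's URLs.
API_URL = "http://bn.banglapedia.org/api.php"


def categorymembers_url(category_title, limit=50):
    """Build an API:Categorymembers query URL (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,  # full title, including the namespace prefix
        "cmprop": "ids|title",      # return page IDs and titles
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


url = categorymembers_url("Category:Banglapedia")
print(url)
```

Opening the printed URL for each of the two categories and comparing the pageid values is one way to see whether two look-alike titles are really separate pages.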
Doing API:Revisions for those pages I get 1 revision (as if both pages were the same):
But, doing API:Revisions for those page ID's I get 2 revisions, one with the first category, and other with the second:
At the bottom of the content (and in the revision comment of one of them) it mentions that it was imported. How did you import those pages?
I found the page "অ-নবায়নযোগ্য শক্তি" in both categories using CTRL+F; I am not sure why you missed it. The page IDs might be different, but the pages have the same content. If you update either page, the change is reflected on both.
I used "MwImporter" (http://www.donationcoder.com/Software/Mouser/mwimporter/index.html) to import the content. I imported almost all of it at once; because of some issues a number of pages were not imported, and I imported those (about 1500 pages) last week directly via the MediaWiki API.
How can this issue be resolved?
It's the difference between combining characters ('BENGALI LETTER YA' (U+09AF) followed by 'BENGALI SIGN NUKTA' (U+09BC)) and a precomposed character (just 'BENGALI LETTER YYA' (U+09DF)).
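MediaWiki expects all text to be in Unicode Normalization Form C (NFC). NFC usually prefers precomposed characters, but U+09DF is on Unicode's composition exclusion list, so its NFC form is the sequence U+09AF U+09BC. Everything is supposed to be converted to this form on save, but some import tools may import things incorrectly.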
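To see the difference for yourself, here is a small Python sketch using only the standard unicodedata module. It compares the two spellings as raw strings and after NFC normalization, and prints the code points so you can check which form your Python's Unicode data treats as normalized. The helper names are made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def is_nfc(text):
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


precomposed = "\u09DF"       # BENGALI LETTER YYA
combining = "\u09AF\u09BC"   # BENGALI LETTER YA + BENGALI SIGN NUKTA

print(describe(precomposed))
print(describe(combining))

# The raw strings differ, which is why CTRL+F and title lookups do not match.
print(precomposed == combining)

# After normalization both spellings become the same string.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", combining)
print(nfc_a == nfc_b, describe(nfc_a))
```

Running is_nfc over a list of imported titles would flag the ones that were stored without normalization.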
You can fix this by running the maintenance script cleanupTitles.php. It's also possible to clean up specific titles via the API (e.g. commons:MediaWiki:Invisible_characters_unveiled.js), but I would recommend the maintenance script.
I ran cleanupTitles.php, but it did not solve my issue. It did, however, create a list of subpages under the page 'Broken' (http://bn.banglapedia.org/index.php?title=Broken). Is there any other way to resolve this?