SQL/XML Dumps/Debugging PHP
Appearance
Step 1: Stream of consciousness log of actions:
- Using codesearch, find where the error is coming from.
- Confused: is this getText calling this getText?
- Find TextContent.
- seems like this text is something else entirely, set when this class is constructed
- So who's calling our getText?
- This maintenance script
- Not called from anywhere codesearch can see (makes sense, dumps repo is not indexed)
- clone it, search it, find it, get lucky and find some examples.
- Try to build an example for our revisions, frwiki revisions 331925, 346198, 346206, 296457, 346221
- Which is weird because I remember the addresses being tt:... instead of es:... (putting a pin in that for a sec)
select rev_page, rev_id, content_size, model_name, role_name, content_address from content join slots on slot_content_id = content_id join revision on slot_revision_id = rev_id join content_models on model_id = content_model join slot_roles on role_id = slot_role_id where rev_id in (331925,346198,346206,296457,346221); +----------+--------+--------------+------------+-----------+-----------------------------------+ | rev_page | rev_id | content_size | model_name | role_name | content_address | +----------+--------+--------------+------------+-----------+-----------------------------------+ | 51665 | 296457 | 12265 | wikitext | main | es:DB://cluster5/180153/aafb24... | | 57040 | 331925 | 2905 | wikitext | main | es:DB://cluster5/188443/03d899... | | 64577 | 346198 | 1881 | wikitext | main | es:DB://cluster5/197946/3b205a... | | 64577 | 346206 | 1925 | wikitext | main | es:DB://cluster5/197946/703850... | | 64577 | 346221 | 1946 | wikitext | main | es:DB://cluster5/197946/3f81b7... | +----------+--------+--------------+------------+-----------+-----------------------------------+
es:
was introduced in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948. That was the thing necessitating the bump on the XML schema version in task T365155 as a workaround to mask the symptoms. IIRC, new revisions began gettinges:
upon change 1020948's deployment. Thinking out loud: if there's a correlated increase in the errors presently under investigation now relating to revision velocity, that might be strong a hint - that would mean that the fix to mask symptoms had an additional side effect (or maybe: didn't address another unidentified code path) - recall that the tables had already started to be tainted, and rolling back to the oldtt:
was thought to be too expensive so the idea was to roll forward. It's also entirely possible that this field's convention for the value has nothing to do with the thing being investigated presently, but I'll be interested to hear if turns out to be the thing. ABaso (WMF) (talk) 15:02, 29 October 2024 (UTC)
- now we have the pages to dump, but calling the dump maintenance script is involved. I'd have to create a stub first and mostly understand a lot more code. Switching gears.
- What if I can just call the relevant code, starting from the db fetch and checking sizes manually?
- Erik advised me to use `mwscript shell.php --wiki=frwiki` which on snapshot1012 currently translates to `/usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php shell.php --wiki=frwiki` (I believe this may be changing as we dockerize how we call MW, so it may not work in the near future)
- Revision 331925 has this es:DB:... id in the query results above, but the error message mentions it was trying to fetch text id tt:331925. Looked around the db to understand this, should probably ask someone. For now I guess I can try both in the db fetches and see what happens.
- I'm not sure what this spawning thing's about, but it looks like dumps gets text one of three ways. First it tries a prefetch, if that fails then it uses a spawn or a db call. The spawn basically shells out to maintenance/fetchText.php, so maybe I can try just calling that directly. And for the db call I can use the REPL I opened above.
- `php7.4 /srv/mediawiki/multiversion/MWScript.php fetchText.php --wiki=frwiki` returns -1 when I type in tt:331925 but a revision-looking thing when I use the es:DB:// address.
- Checking the revision-looking thing against what's in a previous dump: `7z x frwiki-20240901-pages-meta-history1.xml-p56454p64216.7z` and then `grep -C100 "<id>331925</id>" frwiki-20240901-pages-meta-history1.xml-p56454p64216`
- The content is almost identical, but the dumps have it encoded so it doesn't clash with XML, which makes sense, I think I remember that transformation happening in XmlDumpWriter.
- So let's try the db call from the shell
- Trying with all the other ids gives the same results. At this point I'm just guessing that for some reason we're using the wrong id, this "tt:" thing seems wrong, it's not related to these revisions anywhere in the db (checked text and content tables). Going to ask.
# /usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php shell.php --wiki=frwiki use MediaWiki\MediaWikiServices; $services = MediaWikiServices::getInstance() $store = $services->getBlobStore() $text = $store->getBlob( "tt:331925" ) # MediaWiki\Storage\BlobAccessException Unable to fetch blob at tt:331925. Use findBadBlobs.php to remedy. $text = $store->getBlob( "es:DB://cluster5/188443/03d8995c4b8b437077225ee9dbeabb11" ) # works, shows the correct text # NOTE: /usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=frwiki -d 3 # you can use this ^ but use ... doesn't work so fully qualify stuff as in MediaWiki\MediaWikiServices::getInstance()
- This maintenance script
- The investigation led us to finding an ongoing database migration that was racing with dumps. Dumps would write the content address into a stub and circle back to it later to find the content in the prefetch or fall back to the database. By the time it got back to it, the migration erased the data from the database. Details on that here.
- Erik wrote this patch, reviewed with Piotr, and I'm trying to test it on snapshot1012.
- One of the commands I found in the logs that seems to have led to these kinds of failures is: `/usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=enwiki --stub=gzip:/mnt/dumpsdata/xmldatadumps/temp/e/enwiki/enwiki-20241001-stub-meta-history8.xml-p2762152p2801961.gz --prefetch=7zip:/mnt/dumpsdata/xmldatadumps/public/enwiki/20240901/enwiki-20240901-pages-meta-history8.xml-p2723140p2762956.7z;/mnt/dumpsdata/xmldatadumps/public/enwiki/20240901/enwiki-20240901-pages-meta-history8.xml-p2762957p2802548.7z --dbgroupdefault=dump --report=1000 --spawn=/usr/bin/php7.4 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/enwiki/20241001/enwiki-20241001-pages-meta-history8.xml-p2762152p2801961.bz2.inprog --full` (line 249982 of /var/log/dumps/fulldumps_04_2014033.log).
- My theory is that this would be a good test because the stub has the bad content address in it. And the new code from Erik's patch should be resilient to that and try the refresh. So, let's see if that pans out.
- Step 0: copy MW code and patch it: [done] `/srv/tmp/php-1.44.0-wmf.1-custom`
- Step 1: figure out if I can copy the multiversion/MWScript.php over to my temp folder where I have the patched version of code (`/srv/tmp/`) or if something else weird is going on
- Step 2: update that command with temp output but keep the stubs.
- Step 3: run the command with the patch and without the patch and compare logs
- Some problems with the above plan, Erik is testing, some details here.
- Deployed the patch and launched Dumps 1. We are 6 days late and the UI reporting dumps progress seems a little misaligned, but no big errors or anything obviously wrong so far.
- [ developing ]