SQL/XML Dumps/Debugging PHP

Step 1: Stream of consciousness log of actions:

  • Using codesearch, find where the error is coming from.
  • Confused: is this getText calling this getText?
  • So who's calling our getText?
    • This maintenance script
    • Try to build an example for our revisions, frwiki revisions 331925, 346198, 346206, 296457, 346221
      •  select rev_page,
                rev_id,
                content_size,
                model_name,
                role_name,
                content_address
           from content
                    join
                slots           on slot_content_id = content_id
                    join
                revision        on slot_revision_id = rev_id
                    join
                content_models  on model_id = content_model
                    join
                slot_roles      on role_id = slot_role_id
          where rev_id in (331925,346198,346206,296457,346221);
          
        +----------+--------+--------------+------------+-----------+-----------------------------------+
        | rev_page | rev_id | content_size | model_name | role_name | content_address                   |
        +----------+--------+--------------+------------+-----------+-----------------------------------+
        |    51665 | 296457 |        12265 | wikitext   | main      | es:DB://cluster5/180153/aafb24... |
        |    57040 | 331925 |         2905 | wikitext   | main      | es:DB://cluster5/188443/03d899... |
        |    64577 | 346198 |         1881 | wikitext   | main      | es:DB://cluster5/197946/3b205a... |
        |    64577 | 346206 |         1925 | wikitext   | main      | es:DB://cluster5/197946/703850... |
        |    64577 | 346221 |         1946 | wikitext   | main      | es:DB://cluster5/197946/3f81b7... |
        +----------+--------+--------------+------------+-----------+-----------------------------------+
        
        Which is weird because I remember the addresses being tt:... instead of es:... (putting a pin in that for a sec)
        • es: was introduced in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948. That was the thing necessitating the bump to the XML schema version in task T365155 as a workaround to mask the symptoms. IIRC, new revisions began getting es: upon change 1020948's deployment. Thinking out loud: if there's an increase in the errors presently under investigation that correlates with revision velocity, that might be a strong hint - it would mean that the fix to mask symptoms had an additional side effect (or maybe: didn't address another unidentified code path) - recall that the tables had already started to be tainted, and rolling back to the old tt: was thought to be too expensive, so the idea was to roll forward. It's also entirely possible that this field's convention for the value has nothing to do with the thing being investigated presently, but I'll be interested to hear if it turns out to be the thing. ABaso (WMF) (talk) 15:02, 29 October 2024 (UTC)
      • now we have the pages to dump, but calling the dump maintenance script is involved: I'd have to create a stub first, and mostly I'd need to understand a lot more code. Switching gears.
      • What if I can just call the relevant code, starting from the db fetch and checking sizes manually?
      • Erik advised me to use `mwscript shell.php --wiki=frwiki` which on snapshot1012 currently translates to `/usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php shell.php --wiki=frwiki` (I believe this may be changing as we dockerize how we call MW, so it may not work in the near future)
        • Revision 331925 has this es:DB:... id in the query results above, but the error message mentions it was trying to fetch text id tt:331925. Looked around the db to understand this; should probably ask someone. For now I guess I can try both in the db fetches and see what happens (quick check sketched below).
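          • A quick check from the shell.php REPL (a sketch; that tt:NNN resolves to text.old_id is my reading of SqlBlobStore's address scheme, so treat the table lookup as an assumption):

              $dbr = MediaWiki\MediaWikiServices::getInstance()
                  ->getDBLoadBalancer()
                  ->getConnection( DB_REPLICA );
              # tt:331925 should resolve to text.old_id = 331925, so look for that row
              $row = $dbr->newSelectQueryBuilder()
                  ->select( [ 'old_id', 'old_flags' ] )
                  ->from( 'text' )
                  ->where( [ 'old_id' => 331925 ] )
                  ->caller( 'debugging-session' )
                  ->fetchRow();
              # false here means there is no tt:331925 text row at all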
        • I'm not sure what this spawning thing's about, but it looks like dumps gets text one of three ways: first it tries a prefetch; if that fails, it uses a spawn or a db call. The spawn basically shells out to maintenance/fetchText.php, so maybe I can try just calling that directly. And for the db call I can use the REPL I opened above. (My mental model of the fallback is sketched below.)
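          • As an illustrative PHP sketch of that fallback order (made-up names, not the actual TextPassDumper code):

              # illustrative sketch only; the helper names here are invented for clarity
              function getTextForAddress( $address ) {
                  $text = tryPrefetch( $address );      # reuse matching text from the previous dump
                  if ( $text === null ) {
                      $text = trySpawn( $address );     # shell out to maintenance/fetchText.php
                  }
                  if ( $text === null ) {
                      $text = fetchFromDb( $address );  # BlobStore::getBlob() against the database
                  }
                  return $text;
              }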
        • `php7.4 /srv/mediawiki/multiversion/MWScript.php fetchText.php --wiki=frwiki` returns -1 when I type in tt:331925 but a revision-looking thing when I use the es:DB:// address.
        • Checking the revision-looking thing against what's in a previous dump: `7z x frwiki-20240901-pages-meta-history1.xml-p56454p64216.7z` and then `grep -C100 "<id>331925</id>" frwiki-20240901-pages-meta-history1.xml-p56454p64216`
          • The content is almost identical, but the dumps have it encoded so it doesn't clash with XML. That makes sense; I think I remember that transformation happening in XmlDumpWriter (minimal sketch below).
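            • Assuming that transformation is essentially XML escaping, the core of it in plain PHP would be something like this (my assumption, not the exact XmlDumpWriter code):

                # escape the raw blob so it can sit inside the dump's <text> element
                $escaped = htmlspecialchars( $text, ENT_XML1 | ENT_QUOTES, 'UTF-8' );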
        • So let's try the db call from the shell
        • # /usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php shell.php --wiki=frwiki
          use MediaWiki\MediaWikiServices;
          
          $services = MediaWikiServices::getInstance();
          $store = $services->getBlobStore();
          
          $text = $store->getBlob( "tt:331925" );
          # throws MediaWiki\Storage\BlobAccessException  Unable to fetch blob at tt:331925. Use findBadBlobs.php to remedy.
          
          $text = $store->getBlob( "es:DB://cluster5/188443/03d8995c4b8b437077225ee9dbeabb11" );
          # works, shows the correct text
          
          # NOTE: /usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=frwiki -d 3
          # you can use this ^ too, but `use` statements don't work there, so fully qualify names, as in MediaWiki\MediaWikiServices::getInstance()
          
          Trying with all the other ids gives the same results (loop sketched below). At this point I'm just guessing that for some reason we're using the wrong id; this "tt:" form seems wrong, and it's not related to these revisions anywhere in the db (checked the text and content tables). Going to ask.
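          
          To try them all, I ran a loop along these lines in the same REPL (a sketch, reusing $store from above and the rev ids from the query results):
          
          # check the (apparently wrong) tt: form for each revision id
          foreach ( [ 331925, 346198, 346206, 296457, 346221 ] as $revId ) {
              try {
                  $store->getBlob( "tt:$revId" );
                  echo "$revId: ok\n";
              } catch ( MediaWiki\Storage\BlobAccessException $e ) {
                  echo "$revId: " . $e->getMessage() . "\n";
              }
          }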
  • The investigation led us to find an ongoing database migration that was racing with dumps. Dumps would write the content address into a stub and circle back to it later to find the content in the prefetch or fall back to the database. By the time it got back to it, the migration had erased the data from the database. Details on that here.
  • Erik wrote this patch, reviewed with Piotr, and I'm trying to test it on snapshot1012.
    • One of the commands I found in the logs that seems to have led to these kinds of failures is: `/usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=enwiki --stub=gzip:/mnt/dumpsdata/xmldatadumps/temp/e/enwiki/enwiki-20241001-stub-meta-history8.xml-p2762152p2801961.gz --prefetch=7zip:/mnt/dumpsdata/xmldatadumps/public/enwiki/20240901/enwiki-20240901-pages-meta-history8.xml-p2723140p2762956.7z;/mnt/dumpsdata/xmldatadumps/public/enwiki/20240901/enwiki-20240901-pages-meta-history8.xml-p2762957p2802548.7z --dbgroupdefault=dump --report=1000 --spawn=/usr/bin/php7.4 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/enwiki/20241001/enwiki-20241001-pages-meta-history8.xml-p2762152p2801961.bz2.inprog --full` (line 249982 of /var/log/dumps/fulldumps_04_2014033.log).
      • My theory is that this would be a good test because the stub has the bad content address in it, and the new code from Erik's patch should be resilient to that and try the refresh. Let's see if that pans out.
      • Step 0: copy MW code and patch it: [done] `/srv/tmp/php-1.44.0-wmf.1-custom`
      • Step 1: figure out if I can copy the multiversion/MWScript.php over to my temp folder where I have the patched version of the code (`/srv/tmp/`), or if something else weird is going on
      • Step 2: update that command with temp output but keep the stubs.
      • Step 3: run the command with the patch and without the patch and compare logs
    • There are some problems with the above plan; Erik is testing, some details here.
    • Deployed the patch and launched Dumps 1. We are 6 days late and the UI reporting dumps progress seems a little misaligned, but no big errors or anything obviously wrong so far.
  • [ developing ]