Jump to content

User:DWalden (WMF)/Test data sampling using SQL

From mediawiki.org

Problem

[edit]
  • I want to come up with a rich set of test data.
  • I already have a test environment with lots of data.
  • Therefore, I don't need to create test data.

Solution

[edit]

Query your SQL database to find data that is already available.

  • Identify what you want to test.
    • This could be an API, HTML form or something else.
    • We will use as an example deleting pages.
  • Identify which variables might affect its behaviour.
    • For example, if deleting a page, two variables which may affect behaviour are number of edits to the page and whether or not it is a file.
    • It will generally work best if you can identify discrete values for these variables.
      • For example, number of edits could be either less than 1000 or 1000 and greater. Whether something is a file is either true or false.
  • Construct the query to list all these variables

For example:

SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000
FROM revision
INNER JOIN page ON rev_page=page_id
GROUP BY page_title;

Which will output something like:

page_title  Is File     COUNT(rev_id) < 1000
----------  ----------  --------------------
Main_Page   0           1
Foobar      0           1                   
File:Foo    1           0
File:Bar    1           0
File:Baz    1           1
  • On a large database, there will likely be lots of pages in each partition. You will likely only want to test a sample from each.
SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id
GROUP BY page_title;

Which will output something like:

page_title  Is File     size     part
----------  ----------  -------  -------
Main_Page   0           1        1
Foobar      0           1        2
File:Foo    1           0        1
File:Bar    1           0        2
File:Baz    1           1        1
  • Notice that the numbering produced by ROW_NUMBER() will reset to 1 with each new partition. You can then select a subset of a partition based on the numbering. For example:
SELECT * FROM
(SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id) AS foo
WHERE foo.part = 1
GROUP BY page_title;

Which will output something like:

page_title  Is File     size     part
----------  ----------  -------  -------
Main_Page   0           1        1
File:Foo    1           0        1
File:Baz    1           1        1
  • You may wish to order the rows within each partition. For example:
    • Number of revisions: ROW_NUMBER() OVER (PARTITION BY "Is File", "size" ORDER BY COUNT(rev_id) DESC)
    • Randomly: ROW_NUMBER() OVER (PARTITION BY "Is File", "size" ORDER BY RAND())

API testing

[edit]

To make this work with the API testing script, name the columns after the API parameters.

For example, the delete API has the parameter title. Therefore:

SELECT page_title AS title FROM
(SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id) AS foo
WHERE foo.part = 1
GROUP BY page_title;

Pass this to the script:

python3 pwb.py all_params_api_db -m delete -q 'SELECT page_title AS title FROM (SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part FROM revision INNER JOIN page ON rev_page=page_id) AS foo WHERE foo.part = 1 GROUP BY page_title' -d ~/core/cache/sqlite/my_wiki.sqlite -lang:en -family:localwiki

Examples

[edit]