User:DWalden (WMF)/Test data sampling using SQL
Appearance
Problem
[edit]- I want to come up with a rich set of test data.
- I already have a test environment with lots of data.
- Therefore, I don't need to create test data.
Solution
[edit]Query your SQL database to find data that is already available.
- Identify what you want to test.
- This could be an API, HTML form or something else.
- We will use as an example deleting pages.
- Identify which variables might affect its behaviour.
- For example, if deleting a page, two variables which may affect behaviour are number of edits to the page and whether or not it is a file.
- It will generally work best if you can identify discrete values for these variables.
- For example, number of edits could be either less than 1000 or 1000 and greater. Whether something is a file is either true or false.
- Construct the query to list all these variables
For example:
SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000
FROM revision
INNER JOIN page ON rev_page=page_id
GROUP BY page_title;
Which will output something like:
page_title Is File COUNT(rev_id) < 1000
---------- ---------- --------------------
Main_Page 0 1
Foobar 0 1
File:Foo 1 0
File:Bar 1 0
File:Baz 1 1
- On a large database, there will likely be lots of pages in each partition. You will likely only want to test a sample from each.
- You can do this using PARTITION BY, for example
SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id
GROUP BY page_title;
Which will output something like:
page_title Is File size part
---------- ---------- ------- -------
Main_Page 0 1 1
Foobar 0 1 2
File:Foo 1 0 1
File:Bar 1 0 2
File:Baz 1 1 1
- Notice that the numbering produced by
ROW_NUMBER()
will reset to 1 with each new partition. You can then select a subset of a partition based on the numbering. For example:
SELECT * FROM
(SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id) AS foo
WHERE foo.part = 1
GROUP BY page_title;
Which will output something like:
page_title Is File size part
---------- ---------- ------- -------
Main_Page 0 1 1
File:Foo 1 0 1
File:Baz 1 1 1
- You may wish to order the rows within each partition. For example:
- Number of revisions:
ROW_NUMBER() OVER (PARTITION BY "Is File", "size" ORDER BY COUNT(rev_id) DESC)
- Randomly:
ROW_NUMBER() OVER (PARTITION BY "Is File", "size" ORDER BY RAND())
- Number of revisions:
API testing
[edit]To make this work with the API testing script, name the columns after the API parameters.
For example, the delete API has the parameter title
. Therefore:
SELECT page_title AS title FROM
(SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id) AS foo
WHERE foo.part = 1
GROUP BY page_title;
Pass this to the script:
python3 pwb.py all_params_api_db -m delete -q 'SELECT page_title AS title FROM (SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part FROM revision INNER JOIN page ON rev_page=page_id) AS foo WHERE foo.part = 1 GROUP BY page_title' -d ~/core/cache/sqlite/my_wiki.sqlite -lang:en -family:localwiki