Analytics/Archive/Editor Engagement Vital Signs/Backfilling
This page is archived! Find new documentation at https://wikitech.wikimedia.org/wiki/Analytics/Vital_Signs
Backfilling
Some benchmarks of how long it took to backfill data for the Rolling Active Editor EEVS metric. The point is to have a ballpark estimate for our backfilling and to make sure future code changes do not blow up these numbers.
Numbers will fluctuate depending on labs DB access but they should not diverge much from these.
The baseline is our master branch as of 2014-06-12, compared against the changes in this patchset: https://gerrit.wikimedia.org/r/#/c/150475/
The Labs DB infrastructure (labsdb1002: dewiki, commons, etc.) was upgraded to MariaDB around the last week of July. All data is on an SSD now.
Config for celery was:
BROKER_URL : redis://localhost:6379/0
CELERY_RESULT_BACKEND : redis://localhost:6379/0
CELERY_TASK_RESULT_EXPIRES : 2592000
CELERY_DISABLE_RATE_LIMITS : True
CELERY_STORE_ERRORS_EVEN_IF_IGNORED : True
CELERYD_CONCURRENCY : 10
CELERYD_TASK_TIME_LIMIT : 3630
CELERYD_TASK_SOFT_TIME_LIMIT : 3600
DEBUG : False
LOG_LEVEL : INFO
MAX_PARALLEL_PER_RUN : 10
MAX_INSTANCES_PER_RECURRENT_REPORT : 365
CELERY_BEAT_DATAFILE : /var/run/wikimetrics/celerybeat_scheduled_tasks
CELERY_BEAT_PIDFILE : /var/run/wikimetrics/celerybeat.pid
CELERYBEAT_SCHEDULE :
    'update-daily-recurring-reports':
        'task' : 'wikimetrics.schedules.daily.recurring_reports'
        # The schedule can be set to 'daily' for a crontab-like daily recurrence
        'schedule' : debug
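As a reading aid for the timing-related settings in the celery config above, the following sketch converts the values expressed in seconds into human-readable durations. It uses only numbers that appear in the config; the dictionary name `settings_in_seconds` is illustrative and is not part of wikimetrics.

```python
from datetime import timedelta

# Values copied from the celery config above; all are expressed in seconds.
settings_in_seconds = {
    "CELERY_TASK_RESULT_EXPIRES": 2592000,
    "CELERYD_TASK_TIME_LIMIT": 3630,
    "CELERYD_TASK_SOFT_TIME_LIMIT": 3600,
}

# Convert each setting into a timedelta so it prints in days / h:mm:ss form.
durations = {
    name: timedelta(seconds=value)
    for name, value in settings_in_seconds.items()
}

for name, duration in durations.items():
    print(f"{name}: {duration}")

# Gap between the soft limit and the hard limit.
grace = (
    durations["CELERYD_TASK_TIME_LIMIT"]
    - durations["CELERYD_TASK_SOFT_TIME_LIMIT"]
)
print(f"grace period: {grace}")
```

In other words, task results are kept for 30 days, and each task gets one hour before the soft limit is reached, with a further 30 seconds before the hard limit. In Celery, the soft limit raises an exception inside the task so it can clean up, while the hard limit terminates the worker process running the task.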
Results with patchset https://gerrit.wikimedia.org/r/#/c/150475/
rowiki
- Backfilling of 3 months of data takes about 3 minutes
- Backfilling of 1 year of data takes about 10 minutes
eswiki
- Backfilling of 3 months of data took 8 minutes
- Backfilling of 5 months of data took 10 minutes
- Backfilling of 1 year of data took 30 minutes
frwiki
- Backfilling of 3 months of data took 12 minutes
Results with master branch (72ac421affa0c90183d9dde743cc79a91525fe12)
rowiki
- Backfilling of 3 months of data takes about 3 minutes
- Backfilling of 1 year of data takes about 6 minutes
frwiki
- Backfilling 3 months: 7 minutes
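Since the stated goal is to make sure code changes do not blow up these numbers, the master and patchset 150475 timings listed above can be compared mechanically. The following is a minimal sketch, not part of wikimetrics: the function name `slowdown_ratios` and the 1.5x `THRESHOLD` are assumptions chosen for illustration, and the timings are the approximate values reported on this page.

```python
# Timings in minutes, copied from the results listed above.
# Keys are (wiki, backfilled period).
master = {
    ("rowiki", "3 months"): 3,
    ("rowiki", "1 year"): 6,
    ("frwiki", "3 months"): 7,
}
patchset_150475 = {
    ("rowiki", "3 months"): 3,
    ("rowiki", "1 year"): 10,
    ("eswiki", "3 months"): 8,
    ("eswiki", "5 months"): 10,
    ("eswiki", "1 year"): 30,
    ("frwiki", "3 months"): 12,
}


def slowdown_ratios(baseline, candidate):
    """Return candidate/baseline for every benchmark measured in both runs."""
    return {
        key: candidate[key] / baseline[key]
        for key in baseline
        if key in candidate
    }


ratios = slowdown_ratios(master, patchset_150475)

# Hypothetical threshold: flag anything more than 1.5x slower than baseline.
THRESHOLD = 1.5
flagged = sorted(key for key, ratio in ratios.items() if ratio > THRESHOLD)

for key in sorted(ratios):
    print(key, round(ratios[key], 2))
print("flagged:", flagged)
```

Only benchmarks measured on both branches are compared, so the eswiki numbers are skipped. Because the timings fluctuate with Labs DB access, any threshold of this kind should be treated as a rough signal rather than a hard pass/fail criterion.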
Results with patchset https://gerrit.wikimedia.org/r/#/c/158630/
rowiki RollingNewActiveEditor
- Backfilling 3 months: 42 seconds
rowiki RollingSurvivingNewActiveEditor
- Backfilling 3 months: 44 seconds
frwiki RollingNewActiveEditor
- Backfilling 3 months: 3.5 minutes
frwiki RollingSurvivingNewActiveEditor
- Backfilling 3 months: 4.5 minutes
RollingRecurringOldActiveEditor, patch: https://gerrit.wikimedia.org/r/#/c/161521/
ruwiki
- Backfilling 1 day: 4 minutes
- Backfilling 1 week: 4 minutes
- Backfilling 1 month: 7 minutes
frwiki
- Backfilling 2 months: 12 minutes
RollingRecurringOldActiveEditor query, patch: https://gerrit.wikimedia.org/r/#/c/161521/
The SELECT below, as written, did not complete (it ran indefinitely):
SELECT anon_1.user_id AS anon_1_user_id,
       IF(SUM(anon_1.count_one) >= %s AND SUM(anon_1.count_two) >= %s, %s, %s) AS `IF_1`
FROM (
    SELECT anon_2.user_id AS user_id,
           anon_2.count_one AS count_one,
           anon_2.count_two AS count_two
    FROM (
        SELECT revision_userindex.rev_user AS user_id,
               SUM(IF(revision_userindex.rev_timestamp <= %s, %s, %s)) AS count_one,
               SUM(IF(revision_userindex.rev_timestamp > %s, %s, %s)) AS count_two
        FROM revision_userindex
        INNER JOIN user ON user.user_id = revision_userindex.rev_user
        INNER JOIN logging ON user.user_id = logging.log_user
        WHERE logging.log_type = %s
          AND logging.log_action = %s
          AND logging.log_timestamp < %s
          AND revision_userindex.rev_timestamp BETWEEN %s AND %s
        GROUP BY revision_userindex.rev_user
        UNION ALL
        SELECT archive.ar_user AS user_id,
               SUM(IF(archive.ar_timestamp <= %s, %s, %s)) AS count_one,
               SUM(IF(archive.ar_timestamp > %s, %s, %s)) AS count_two
        FROM archive
        INNER JOIN user ON user.user_id = archive.ar_user
        INNER JOIN logging ON user.user_id = logging.log_user
        WHERE logging.log_type = %s
          AND logging.log_action = %s
          AND logging.log_timestamp < %s
          AND archive.ar_timestamp BETWEEN %s AND %s
        GROUP BY archive.ar_user
    ) AS anon_2
) AS anon_1
GROUP BY anon_1.user_id
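To illustrate the conditional-sum pattern used in the query above (counting each user's edits in two consecutive windows and checking both counts against a threshold), here is a small self-contained sketch against an in-memory SQLite database. It is a simplification under stated assumptions: the single `revision` table, the sample rows, the window boundaries, and the threshold of 2 edits are all hypothetical; the joins with `user` and `logging` and the `archive` half of the UNION are omitted; and `CASE WHEN` stands in for MySQL's `IF()`. It does not reproduce the performance problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified stand-in for revision_userindex.
conn.execute("CREATE TABLE revision (rev_user INTEGER, rev_timestamp TEXT)")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [
        (1, "20140801000000"), (1, "20140815000000"),  # user 1, first window
        (1, "20140905000000"), (1, "20140920000000"),  # user 1, second window
        (2, "20140910000000"), (2, "20140911000000"),  # user 2, second window only
    ],
)

# count_one: edits up to the middle of the period; count_two: edits after it.
# A user is flagged only if both counts reach the minimum number of edits.
query = """
SELECT rev_user AS user_id,
       CASE WHEN SUM(CASE WHEN rev_timestamp <= :period_middle THEN 1 ELSE 0 END) >= :min_edits
             AND SUM(CASE WHEN rev_timestamp >  :period_middle THEN 1 ELSE 0 END) >= :min_edits
            THEN 1 ELSE 0 END AS is_recurring
FROM revision
WHERE rev_timestamp BETWEEN :period_start AND :period_end
GROUP BY rev_user
ORDER BY rev_user
"""

# Hypothetical parameters; the original query leaves these as %s placeholders.
params = {
    "period_start": "20140801000000",
    "period_middle": "20140831235959",
    "period_end": "20140930235959",
    "min_edits": 2,
}

rows = conn.execute(query, params).fetchall()
print(rows)
```

With this sample data, user 1 edits in both windows and is flagged, while user 2 edits only in the second window and is not. MediaWiki timestamps are fixed-width `YYYYMMDDHHMMSS` strings, which is why plain string comparison orders them correctly here.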
Pages created
The pages created metric was changed to default to all namespaces. Gerrit change: https://gerrit.wikimedia.org/r/#/c/167214/
enwiki
We were able to backfill a month for enwiki in 20 minutes
ruwiki
We were able to backfill a month for ruwiki in 3 minutes