Jump to content

Content translation/analytics/Daily stats

From mediawiki.org

This is knowledge sharing about how to collect daily Content Translation statistics.

Unless otherwise noted, "<TIMESTAMP>" is eight digits: yyyymmdd. For example 20160425 is April 25 2016.

The scripts can be found in the bash/ directory in the limn-language-data Gerrit repo.

crontab configuration

[edit]

The scripts for collecting deletion and suggestions stats run every day using cron. To configure it in your account, run crontab -e and write 00 01 * * * ~/cxscriptsg/scheduled_daily_stats.sh.

Yandex

[edit]
  1. Go to https://translate.yandex.com/developers/stat
  2. Copy the number from the top chart to the Characters column in the Yandex stats spreadsheet.
  3. Copy the number from the bottom chart to the Request column in the Yandex stats spreadsheet.

Daily suggestions accepted

[edit]
  1. Run:
    1. ssh terbium
    2. cat events_<TIMESTAMP>.txt
  2. In the query results table, copy just the numbers from the "count(event_targetLanguage)" column. (In OSX terminal this is easy by selecting with the mouse while holding the Alt key.)
  3. Paste them to a text editor.
  4. Select them in the text editor and copy. (The intermediate step is needed because direct copying from the terminal to a spreadsheet is broken.)
  5. Open the "Suggestions Accepted Daily" tab in the "Suggestions Enablement" spreadsheet.
  6. Find today's date (NB: one day after the <TIMESTAMP>).
  7. Click row 3 and paste.
  8. Scroll down and check whether the list is not too long. The last number must align with the last language. If not:
    1. Delete the pasted numbers (the Undo action usually accomplishes that).
    2. Find the missing language by examining the query result.)
    3. Add a row for that language.
    4. Go back to step "Click row 3 and paste".
  9. The number in the "Per day" row is the number of accepted suggestions per day.

NB: The events_<TIMESTAMP>.txt file is generated by the script events_scheduled.sh. It automatically runs for yesterday. To get the numbers until a particular day, run the following:

  1. ./rprod.sh
  2. Run the following query—it takes about three minutes:
SELECT
	event_targetLanguage,
	count(event_targetLanguage)
FROM
	log.ContentTranslationCTA_11616099 cta
WHERE
	cta.event_cta like 'suggestions%' AND
	cta.timestamp < 20160425000000 -- write the appropriate timestamp here
AND
	cta.event_action = 'accept'
GROUP BY
	event_targetLanguage;

Save, restore, publish errors (s/r/p)

[edit]
  1. Make the terminal window wide. The numbers row can be long and you don't want it to clip.
  2. Run:
    1. ssh terbium
    2. cd cxscriptsg
  3. Just for yesterday:
    1. ./srp.py
      1. NB 1: It is usually fast in the beginning, and slowish in the end. The total time is up to two minutes.
      2. NB 2: The reason "a week ago" is shown for "What was published" is that this number can sometimes change (E.g., it took a few days until the translator actually published.)
  4. For a particular day:
    1. ./sorted_save_events.py <TIMESTAMP>
    2. ./sorted_restore_events.py <TIMESTAMP>
    3. ./sorted_publish_events.py <TIMESTAMP>
    4. ./what_was_published.py <TIMESTAMP>
      1. NB: The last step is can take up to two minutes.
  5. For a range of dates:
    1. for day in $(seq 20160401 20160430); do ./sorted_restore_events.py $day; done
    2. (etc.)
  6. Carefully copy the row of numbers, and paste it to a text editor.
  7. Select them in the text editor and copy. (The intermediate step is needed because direct copying from the terminal to a spreadsheet is broken.)
  8. Open the s/r/p spreadsheet and paste the numbers from the text editor in the appropriate column at the appropriate date.

Deletion stats

[edit]
  1. Run
    1. ssh terbium
  2. For a particular day in the recent past:
    1. cat deletion_<TIMESTAMP>.txt
      1. For older reports see the schedule-archive directory, but it's better to just run the script, because the data may have changed.
  3. For a particular day:
    1. cd cxscriptsg
    2. ./count_deletion.sh <TIMESTAMP>
      1. Default is yesterday.
  4. For a range of dates:
    1. cd cxscriptsg
    2. ./count_deletion_range.sh <TIMESTAMP> <TIMESTAMP>
      1. NB 1: Note that this is very silly and only works for dates within the same month.
      2. NB 2: The default range is from the beginning of the current month until yesterday.
  5. Copy the numbers to the deletion stats spreadsheet.
  6. NB: The data changes practically every day, so the numbers are an approximation. They are less likely to change after two weeks. It's suggested to check not only yesterday, but also a week ago.

Adding a language

[edit]
  1. Add a new row, alphabetically, and put the code there.
  2. Make sure to fill the bottom totals rows by copying from adjacent languages' columns.

Adding a new month tab

[edit]
  1. Create a new spreadsheet tab. Make it the first one, because that's the one you'll need most frequently from now on.
  2. Copy the data from the previous month.
  3. Select all the deletion numbers cells (but not the totals cells) and delete.
  4. Select the whole tab and remove the background color.
  5. Select the Sundays' rows and make their background light gray.
  6. In the weekly total column, write the SUM formula for the week (for the first week this will usually include a few days from the previous month).

Monthly report stats

[edit]
  1. Run
    1. ssh terbium
    2. cd cxscriptsg
    3. ./monthly_stats.sh
  2. That's pretty much it! Just be sure to get the deletion number from the deletion spreadsheet.