
Analytics/Roadmap/PlanningMeetings/2012 Sept 20

From mediawiki.org

Notes for the Team Analytics roadmap planning meeting for 20 Sept 2012, taken by Dave Schoonover.

Need to create a set of Data Release Processes

  • Search log data release contained unacceptable data. How can we prevent this in the future?
    • Diederik on point
    • Conversation with: Team Analytics, robla, Moeller, Legal, Dario, Chris MzMcBride?, Chris Steipp?
    • Who else might have a valued/paranoid perspective? MZMcBride??
  • Process for NEW datasets, plus a smoke check of each data upload before public notice
  • Concrete threads:
    • For any new datastream: "What's the Attack Surface?"
      • Need to spend more time thinking about what sort of privacy exploits are possible
      • Strip all (no matter if we take other steps):
        • URLs? Spam, SEO links, etc
        • Email addresses
        • IP addresses
    • What criterion for k-anonymity are we going to use (if any)? --> Publish behavioral/request data only as aggregates
  • Followup on this release:
    • Disclosure requirements
    • Legal is looking into the impact and our obligations
    • Need to convey clearly to the community what happened and what we're doing
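
The "strip all" and "publish only as aggregates" threads above could be sketched roughly as follows. This is a minimal illustration, not an agreed process: the function names `scrub` and `aggregate`, the regular expressions, and the default threshold `k=5` are all assumptions made for the example. The minimum-count cutoff counts records rather than distinct users, so it is only a stand-in for whatever k-anonymity criterion the team settles on, and IPv6 addresses are not handled.

```python
import re
from collections import Counter

# Hypothetical, deliberately broad patterns (assumptions for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text):
    """Remove email addresses, URLs, and IPv4 addresses from a string."""
    # Emails go first so a host like "www.example.com" inside an address
    # is not consumed by the URL pattern, leaving "user@" behind.
    for pattern in (EMAIL_RE, URL_RE, IPV4_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def aggregate(records, k=5):
    """Count scrubbed values and keep only those seen at least k times.

    Values that are empty after scrubbing are dropped entirely.
    """
    counts = Counter(scrub(record) for record in records)
    return {value: n for value, n in counts.items() if value and n >= k}
```

For instance, `aggregate(["paris"] * 5 + ["rare query"] * 2)` would keep only `"paris"`, since the rarer value falls under the threshold. A real smoke check would likely also sample the output by hand before any public notice.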

Milestone Planning


By Project


Kraken

  • Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
    • Load in sample data sets. [otto] (Sept)
    • Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
    • First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
  • Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
    • WMF Maven parent pom [dsc] (Oct)
  • Puppetize Kraken [otto] (Ongoing)
  • Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
  • Get Storm set up [dsc + otto] (Oct)
    • Start work on ETL topology [dsc] (Oct)
  • Hardware reinstallation -- Depends on Ops [otto] (Oct)
  • Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
    • Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)


Legacy Log Collection

  • Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
  • udp2log filters
    • Update filters for Wikipedia Zero [otto] (Ongoing)
    • Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
    • udp-filter to filter by http status. [otto] (Oct)

WikiStats

  • Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
  • Repair data errors in wikistats, and add process for checking data integrity [ezachte] (Sept)
  • Make wikistats more robust (MoM validations) [ezachte] (Oct)
  • Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)


Ops & Maintenance

  • Access/support requests for stat1, stat1001 [otto] (Ongoing)
  • Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
  • Maintenance of oxygen/emery/locke [otto] (Ongoing)


Data

  • Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
  • Create Data Release Practices Task Force [diederik] (Sept)
  • Start pushing datasets to AWS [diederik] (Oct)
  • Finalize scripts to massively compact dammit.lt data [erik] (Oct)
    • Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)


Limn

  • Bootstrap Dan [dan + dsc] (Sept) [DONE]
  • Refactor charting to use d3 [dan + dsc]
    • Initial Prototype with Options UI (Sept)
    • Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
  • Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
  • Mirror GitHub to Gerrit [dsc] (Sept)
  • Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
  • Coke (make for Coco) task to create symlinks into dataDir from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
    • Coke task to download and set up dummy testing data for ease of development [dsc] (Sept)
  • UI support for remote datasets via proxy [dsc + dan] (Oct)
  • Migrate Dario's dashboards to Limn [dsc] (Sept)
  • Support the Global Dev dashboard [evan] (Ongoing)
  • Support the Gerrit Stats dashboard [diederik] (Ongoing)
  • Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)


By Month


September

  • (Kraken) Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
    • Load in sample data sets. [otto] (Sept)
    • Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
    • First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
  • (Kraken) Puppetize Kraken [otto] (Ongoing)
  • (Legacy Log Collection) Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
  • (Data) Create Data Release Practices Task Force [diederik] (Sept)
  • (Limn) Bootstrap Dan [dan + dsc] (Sept) [DONE]
  • (Limn) Refactor charting to use d3 [dan + dsc]
    • Initial Prototype with Options UI (Sept)
  • (Limn) Mirror GitHub to Gerrit [dsc] (Sept)
  • (Limn) Coke (make for Coco) task to create symlinks into dataDir from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
    • Coke task to download and set up dummy testing data for ease of development [dsc] (Sept)
  • (Limn) Migrate Dario's dashboards to Limn [dsc] (Sept)
  • (Limn) Support the Global Dev dashboard [evan] (Ongoing)
  • (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)


October

  • (Kraken) Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
    • WMF Maven parent pom [dsc] (Oct)
  • (Kraken) Puppetize Kraken [otto] (Ongoing)
  • (Kraken) Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
  • (Kraken) Get Storm set up [dsc + otto] (Oct)
    • Start work on ETL topology [dsc] (Oct)
  • (Kraken) Hardware reinstallation -- Depends on Ops [otto] (Oct)
  • (Kraken) Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
    • Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)
  • (Legacy Log Collection) udp2log filters
    • Update filters for Wikipedia Zero [otto] (Ongoing)
    • Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
    • udp-filter to filter by http status. [otto] (Oct)
  • (WikiStats) Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
  • (WikiStats) Make wikistats more robust (MoM validations) [ezachte] (Oct)
  • (WikiStats) Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)
  • (Ops & Maintenance) Access/support requests for stat1, stat1001 [otto] (Ongoing)
  • (Ops & Maintenance) Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
  • (Ops & Maintenance) Maintenance of oxygen/emery/locke [otto] (Ongoing)
  • (Data) Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
  • (Data) Start pushing datasets to AWS [diederik] (Oct)
  • (Data) Finalize scripts to massively compact dammit.lt data [erik] (Oct)
    • Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)
  • (Limn) Refactor charting to use d3 [dan + dsc]
    • Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
  • (Limn) Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
  • (Limn) Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
  • (Limn) UI support for remote datasets via proxy [dsc + dan] (Oct)
  • (Limn) Support the Global Dev dashboard [evan] (Ongoing)
  • (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)
  • (Limn) Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)


Followups

  • [dsc] Update wiki with project pages for everything on the Roadmap page
    • Each project owner will then update their Project Status for Sept
  • [dsc] Update the Engineering Roadmap wiki page: https://www.mediawiki.org/wiki/Roadmap
  • [dsc] Fill in week-by-week team roadmap without breakout by project

♥


<3 http://art.less.ly/2012/heart-dino.png <3