Analytics/ScrumRetrospective
Status
This is a proposal on how the Analytics Team runs its Scrum Retrospectives. This is not yet a process adopted by the team; these are my personal thoughts.
Types of Retrospectives
I think we need two types of retrospectives:
- The general retrospective, where we identify process improvements, improve team happiness, and focus on making the team more mature.
- The post-mortem retrospective, where we dive deep into a card, an epic, data loss, or a prolonged breakdown of a service -- basically bigger issues that either affected our customers or that we feel we can draw valuable lessons from.
Regardless of the type of retrospective, we need a facilitator who runs the retrospective and does not participate in the content discussion. We can choose case by case who should do this.
The general retrospective
At the end of each sprint, usually after the Sprint Planning, we do our retrospective. To decide what we should discuss, we could use the following approach (a small sketch of the vote tallying appears after the list):
- On an etherpad, each team member lists the topics that he/she feels are worth reflecting on. This should take between 3 and 5 minutes.
- Next, each team member has 2 pluses (+) that can be assigned to any topic; both pluses may be assigned to a single topic.
- The two topics that have the most pluses will be discussed.
- Each topic is timeboxed at 30 minutes.
- In case of a tie, discuss the three topics and adjust the timebox accordingly.
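A minimal sketch of how the plus-vote tallying and tie handling could work; the function, the topic names, and the 60-minute budget are illustrative assumptions, not part of the proposal:

```python
from collections import Counter

def pick_topics(votes, total_minutes=60):
    """Tally plus-votes and pick the topics to discuss.

    votes: one topic name per plus assigned; a member who puts both
           pluses on one topic contributes that topic twice.
    Returns the selected topics and the timebox (minutes) for each.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return [], 0
    # Normally the two most-voted topics are kept; on a tie for second
    # place all tied topics are kept and the timebox shrinks accordingly.
    cutoff = ranked[min(1, len(ranked) - 1)][1]
    selected = [topic for topic, count in ranked if count >= cutoff]
    return selected, total_minutes // len(selected)

votes = ["data quality", "deploys", "data quality", "on-call",
         "deploys", "data quality"]
print(pick_topics(votes))   # (['data quality', 'deploys'], 30)
```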
The post-mortem retrospective AKA The Three R's
Regret
This is the first and easy but important step -- recognize the issue, apologize to the affected customers, and explain what happened. If the explanation of what happened depends on doing the five whys, then just inform the customer(s), acknowledge that an issue is affecting them, and note that action is being taken.
Reason
To uncover the reason, we could use the five whys methodology. The 5 Whys is an iterative question-asking technique used to explore the cause-and-effect relationships underlying a particular problem. Five is a rule of thumb -- you should keep asking why as long as it's giving meaningful answers.
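A minimal sketch of that iterative loop; the seven-question bound, the helper function, and the toy answers (paraphrased from the card #148 tree in Example 1 below) are illustrative, not part of the methodology:

```python
def five_whys(problem, ask):
    """Iteratively ask "why?" until the answers stop being meaningful.

    problem: statement of the issue being analyzed.
    ask:     a callable that, given the current statement, returns the next
             underlying cause, or None when no meaningful answer is left.
    """
    chain = [problem]
    for _ in range(7):          # five is only a rule of thumb, so allow a bit more
        cause = ask(chain[-1])
        if not cause:           # no meaningful answer left -> stop digging
            break
        chain.append(cause)
    return chain

# Toy answers paraphrased from the card #148 analysis in Example 1.
answers = {
    "We are not sure whether the ACL work is done":
        "The RT ticket status conflicts with the Mingle card status",
    "The RT ticket status conflicts with the Mingle card status":
        "There is no automatic way to synchronize the two systems",
    "There is no automatic way to synchronize the two systems":
        "We never considered RT tickets part of our workflow",
}
for step in five_whys("We are not sure whether the ACL work is done", answers.get):
    print(step)
```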
Remedy
After the five whys have been conducted, one or more remedies will become evident. The team should commit to a particular remedy; once that has happened, the post-mortem can be finished.
Example 1
Here are two examples of post-mortems that the Analytics team has conducted in the past. We did a 5 whys analysis of card #148 (Network ACLs); it may be related to card #280.
Planned
- otto implements network ACLs
- Submit RT ticket to ensure it gets reviewed by Mark.
- Mark reviews and approves the implementation
- Mark uses our input regarding which services need to be blocked / available and configures the router
- Done within a week
- High urgency;
- Discontinue use of XXX until implemented as part of MVC
Actual
- otto implemented network ACLs (within 2 weeks)
- Mark collected requirements from Diederik and Ottomata, then nothing happened.
- Still not complete after 8 weeks of waiting (RT ticket open for 3 months)
- Mark picked up the task last week and said he hoped to have it done that day
- Analytics team continues to use Kraken as configured without Network ACLs in place
- Erik M bumped Mark to handle on 2013-04-24; Mark picked it up.
- "Pressure is off due to iptables in place." -Mark, 2013-04-12
- Andrew corrects Mark.
Problems/Issues
- Not sure whether this is done or not; Why are we not sure if this is done or not?
  - Status on the RT ticket seems to conflict with status on Mingle; Why is the status between the two systems in conflict?
    - No automatic way to synchronize between the two systems
      - Categories don't map between systems; Why do the categories not map between the two systems?
        - Never considered RT tickets as part of our workflow; Why did we not consider them?
          - RT is an Ops tool used to interface with other teams in the organization, delineating the boundaries between teams. Why do we need to cross team boundaries to get this work done?
            - Members of our team don't have sufficient training to do the job correctly
            - Members of our team are not authorized to do the work
            - Members of our team do not have the mandate to do the work
            - Higher interdependencies between analytics and ops than for other teams; Why are there higher interdependencies for analytics?
              - Feature completeness for our work in the cluster requires complex infrastructure and configuration over custom software development; Why is our work more reliant on infrastructure/config than that of other teams, like product?
                - Industry solutions to big data have been to throw hardware at the problem; tools we chose, like Hadoop, are built on that assumption. Why did we choose Hadoop?
                  - Hadoop offers a massive storage base for web request data. Why is it necessary to store so much data?
                    - WP generates a lot of data and we need sufficient (unsampled) data to analyze traffic for the patterns we seek
    - We didn't try to incorporate all of our work into Mingle. Why didn't we try to incorporate all of this work?
      - Andrew didn't update the Mingle card as he's responsible to do. Why did Andrew not update the Mingle card's status?
        - No procedure or process for manual synchronization
    - Status on RT is ambiguous
    - Difference in understanding of the requirement
    - Change in sense of urgency to get this done
    - Difference in team understanding of who is responsible for completing the work
Example 2
Notes
- Why didn't we notice that two Kafka producers stopped working?
  - Ganglia continued to report the same ProduceRequestsPerSecond value even when the udp2log Kafka producer stopped working.
  - WebRequestLoss checks reported 0% loss (despite the fact that we are sure we lost data)
    - Why didn't WebRequestLoss report loss?
      - Because WebRequestLoss is not monitoring the data stream that had the loss
        - Which of the data streams are monitored by WebRequestLoss?
          - Just the geocoded mobile stream
  - Kafka monitoring also didn't work as expected -- the theory is that the alert threshold is too low
  - Because the three alert-triggering monitoring tools did not catch this particular scenario
    - Is there a single monitoring system that would catch all scenarios in a basic way?
      - Yes, WebRequestLoss above (see the loss-check sketch after these notes)
- Why did the Kafka producers stop working?
  - We do not know
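The notes above show that only the geocoded mobile stream was covered by a loss check. As a hypothetical sketch (not the actual WebRequestLoss implementation) of what a basic per-stream check could look like, assuming each log line carries the emitting host's name and a per-host sequence number; the host names and values below are made up:

```python
from collections import defaultdict

def percent_loss(lines):
    """Estimate loss per host from per-host sequence numbers.

    lines: iterable of (hostname, sequence_number) pairs sampled from one
           stream. Loss is the gap between the sequence numbers that should
           have arrived and the lines actually seen.
    """
    seen = defaultdict(list)
    for host, seq in lines:
        seen[host].append(seq)
    report = {}
    for host, seqs in seen.items():
        expected = max(seqs) - min(seqs) + 1
        report[host] = 100.0 * (expected - len(seqs)) / expected
    return report

# Made-up sample: cp1001 delivered every line, cp1002 dropped two of five.
sample = [("cp1001", 10), ("cp1001", 11), ("cp1001", 12),
          ("cp1002", 70), ("cp1002", 72), ("cp1002", 74)]
print(percent_loss(sample))   # {'cp1001': 0.0, 'cp1002': 40.0}
```

A check of this shape run against every stream, rather than only the geocoded mobile one, is the kind of coverage the "turn on monitoring for this stream" action item below points at.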
Action Items
- Investigate turning off the un-anonymized stream - ?
  - OR turn on monitoring for this stream
- Investigate failure in Kafka monitoring - AO
- Longer term: how can we abstract the data format from the tools - ?
- Longer term: think about use of IP in data - DvL
References
Some background reading on 5 whys and some cases: