Analytics/Meetings/June webrequest data loss post mortem
Notes
- Why didn't we notice that two Kafka producers stopped working?
- Ganglia continued to report the same ProduceRequestsPerSecond value even when udp2log kafka producer stopped working.
- WebRequestLoss checks reported 0% lost (despite the fact we are sure we lost data)
- Why didn't WebRequestLoss report loss?
- Because WebRequestLoss is not monitoring the data stream that had loss
- Which of the datastreams are monitored by WebRequestLoss?
- Just the geocoded mobile stream
- Kafka monitoring also didn't work as expected; the theory is that the threshold is too low
- Because the three alert-triggering monitoring tools did not catch this particular scenario
- Is there a single monitoring system that would catch all scenarios in a basic way?
- yes, WebRequestLoss above
- Why did the Kafka producers stop working?
- We do not know
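A minimal sketch of one way to flag the flat-lined Ganglia value described in the notes, where ProduceRequestsPerSecond kept reporting the same number after the producer had stopped. It assumes the metric has already been fetched as a list of recent samples. The function name `is_flatlined`, the window size, and the tolerance are hypothetical and not part of Ganglia, Kafka, or any existing check.

```python
from typing import Sequence


def is_flatlined(samples: Sequence[float],
                 min_samples: int = 5,
                 tolerance: float = 0.0) -> bool:
    """Return True when the last `min_samples` readings are (nearly) identical.

    A live rate metric normally jitters from sample to sample, so a run of
    identical values suggests the reporter is repeating a stale reading.
    """
    if len(samples) < min_samples:
        return False  # not enough history to judge
    recent = samples[-min_samples:]
    return max(recent) - min(recent) <= tolerance


# Hypothetical sample values, for illustration only.
healthy = [120.0, 118.5, 121.2, 119.9, 120.4]
stuck = [120.0, 97.3, 97.3, 97.3, 97.3, 97.3]

print(is_flatlined(healthy))
print(is_flatlined(stuck))
```

A check like this would only complement a loss check on the stream itself; a metric that is legitimately constant (for example, zero traffic) would also trip it, so the window and tolerance would need tuning per metric.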
Action Items
- Investigate turning off un-anonymized stream - ?
- OR turn on monitoring for this stream
- Investigate failure in Kafka monitoring - AO
- Longer term: how can we abstract the data format from the tools - ?
- Longer term: think about use of IP in data - DvL