Chico Questions

next session

Is there a reason for collecting fewer metrics about Puppet?
- We have an adapted minimalpuppetagent.py that collects a lot less than the original puppetagent.py (it also adds a _check_sudo method)
- Maybe puppetagent.changes.total, puppetagent.events.failure, puppetagent.events.success, and puppetagent.events.total (all under servers.hostname.) could be useful as well?
- https://diamond.readthedocs.io/en/latest/collectors/PuppetAgentCollector/
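
A rough sketch of how the extra metrics could be derived, assuming the collector works from a nested summary of sections and counters (as in Puppet's last-run summary). The function name, the sample values, and the "puppetagent" prefix handling are made up for illustration; this is not the actual code in puppetagent.py or minimalpuppetagent.py.

```python
def flatten_summary(summary, prefix="puppetagent"):
    """Flatten {section: {name: value}} into {"prefix.section.name": value}.

    Only numeric values are kept, since only those can be published as metrics.
    """
    metrics = {}
    for section, values in summary.items():
        if not isinstance(values, dict):
            continue  # skip anything that is not a section of counters
        for name, value in values.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number:
                metrics[f"{prefix}.{section}.{name}"] = value
    return metrics


# Hypothetical sample data, shaped like a parsed run summary.
sample_summary = {
    "changes": {"total": 2},
    "events": {"failure": 0, "success": 2, "total": 2},
    "version": {"puppet": "4.8.2"},  # non-numeric, so it is dropped
}

metrics = flatten_summary(sample_summary)
for metric_name in sorted(metrics):
    print(metric_name, metrics[metric_name])
```

A minimal collector could then keep only an allow-list of these names instead of publishing everything.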

We can force pods to restart after 30 days, but it sounds like a terrible idea
- Revisit after tools-beta

Icinga is being phased out in favor of Prometheus on production servers
Shinken is used on Labs instances
- My goal is to add alerts for the tools-bastions; it seems this should be done as part of the Icinga/Prometheus task T186552 https://phabricator.wikimedia.org/T186552
  - We already collect CPU and IO data for the tools-bastions (https://tools.wmflabs.org/nagf/?project=tools#h_tools-bastion-03_cpu )
    - I see we can use check_graphite_series_threshold to get the loadavg, like we are doing with iowait (from https://graphite-labs.wikimedia.org/ )
      - There is no total_cpu metric; we need the number of cores to know what to set for warning and critical on loadavg
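
To make the core-count point concrete, here is a minimal sketch of scaling loadavg thresholds by the number of cores. The per-core multipliers (1.0 for warning, 2.0 for critical) and the function name are assumptions picked for illustration, not agreed values, and the core count would still have to come from somewhere other than Graphite.

```python
def loadavg_thresholds(cores, warn_per_core=1.0, crit_per_core=2.0):
    """Return (warning, critical) loadavg thresholds for a host.

    A load average equal to the core count means roughly all cores are busy,
    so thresholds are expressed as multiples of the core count.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * warn_per_core, cores * crit_per_core


# Hypothetical 8-core bastion.
warning, critical = loadavg_thresholds(8)
print(f"warning={warning} critical={critical}")
```

The resulting pair could then be passed as the warning and critical values of the loadavg check.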

Do we have documentation about how to triage tasks and move them around projects and workboards?
- TBD

I am still unfamiliar with the interfaces and common questions, maybe I should create a temp project and go through docs.
- Make a task for a chicotestproject: T187213

Where are things configured?
- Wikitech
- operations-puppet repo
- Horizon

~/git/wmf/puppet cpettet@cair>ls hieradata/labs/tools

toolsadmin.wikimedia.org

Is there something else I should be looking into?
- Let's start slow and I'll try to integrate you into my sort of normal workflow
- Flapping alerts in Shinken
  - host* as a way to get the % of hosts in failure