Wikimedia Apps/Team/RESTBase services for apps/Deployment process
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. This page describes the historical process for deploying Node.js services prior to the WMF's adoption of Kubernetes. For the current service deployment process, see Wikimedia_Product/Wikimedia_Product_Infrastructure_team/Service_deployment_process.
Developer setup for updating deployment repo
Check out the basics for deployment.
We build the deploy repo in Docker, using Docker for Mac or a Linux machine, to ensure that the Node modules have the correct binaries. Before your first build of the deploy repo, run through the setup instructions.
Update and build the deployment repo
Sync the code and deploy repos with current master:
cd ~/code/mcs/mobileapps
git status
git reset --hard origin/master
git checkout .
git clean -fd
git checkout master
git pull
git status
git --no-pager log --decorate -n 1
rm -rf node_modules/
npm ci service-runner
cd ../deploy
git status
git reset --hard origin/master
git checkout .
git clean -fd
git checkout master
git pull
git --no-pager log --decorate -n 1
git submodule update --init
git branch
git status
cd ../mobileapps
If using Docker for Mac, start the Docker daemon by clicking the whale icon in the menu bar. (On Linux it should start automatically.) Then run the tests in Docker and build the new commit for the deploy repo:
./server.js build --deploy-repo --force
And push to Gerrit:
cd ../deploy
git review
You will find the new patch in the deploy repo in Gerrit. Give it C:+2 in Gerrit.
Deploy to Beta Cluster
The steps are similar to #Deploy to Production but use different machines.
To deploy:
ssh deployment-deploy03.deployment-prep.eqiad.wmflabs
(instead of ssh deployment)
Example URL: https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/mobile-html/Dog.
To verify something on the box, you can ssh into deployment-mcs01.deployment-prep.eqiad.wmflabs.
Use #wikimedia-releng (instead of #wikimedia-operations) to see if there are issues. You may want to log manually in this channel using !log until phab:T156079 is resolved. This logs to the Releng team's Server admin log.
More about: Beta Cluster
Deploy to Production
Scan through recent chat in the #wikimedia-operations channel on IRC to make sure there's nothing blocking the deploy.
Optional: Look at deployment logs:
ssh deployment #deployment.eqiad.wmnet, currently points to deploy1001
cd /srv/deployment/mobileapps/deploy/
scap deploy-log
In another terminal start the actual deployment:
ssh deployment
cd /srv/deployment/mobileapps/deploy/
git pull && git submodule update
git log -n 1
scap deploy "`git log --pretty=format:'%s' -n 1`"
The scap deploy command above takes a reason string argument. If this string mentions Phabricator tasks, those tasks will get comments when the deployment starts and when it finishes. For example, if the deployment contains fixes for tasks T123 and T234, you could replace the last command with:
scap deploy "`git log --pretty=format:'%s' -n 1` (T123 T234)"
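The reason string can also be assembled by a small helper. This is a minimal sketch: `build_reason` is a hypothetical function name (not part of scap), and the subject line used in the demonstration is made up.

```shell
# Build a scap deploy reason string of the form "<subject> (T123 T234)".
# build_reason is a hypothetical helper, not part of scap.
build_reason() {
  subject=$1
  shift
  if [ "$#" -eq 0 ]; then
    # No task IDs given: the reason is just the subject.
    printf '%s\n' "$subject"
  else
    # "$*" joins the remaining arguments (the task IDs) with spaces.
    printf '%s (%s)\n' "$subject" "$*"
  fi
}

# Demonstration with a made-up subject; this only prints the string.
build_reason "Update mobileapps" T123 T234
```

On the deploy host it could be combined with the commands above as `scap deploy "$(build_reason "$(git log --pretty=format:'%s' -n 1)" T123 T234)"`.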
The scap deploy command deploys first to the canary server scb2001.codfw.wmnet. In a different terminal you can log in to the canary server and verify that the service responds as expected. Examples:
ssh scb2001
curl localhost:8888/_info/version
curl localhost:8888/en.wikipedia.org/v1/page/mobile-sections/Dog
curl localhost:8888/en.wikipedia.org/v1/feed/announcements | jq .
curl localhost:8888/en.wikipedia.org/v1/page/news | jq .
Once satisfied, press c in the deployment terminal to continue deploying to the other servers without being asked again. You can also press y to be prompted after every group.
The string parameter for the scap deploy command will show up in IRC #wikimedia-operations and in SAL, once at the start and once at the end.
Consider running the following commands from the same directory to check the deployment:
grep '^t\|commit\|user' .git/DEPLOY_HEAD
git --no-pager log --decorate -n 1
In case of issues see how to undo deploy.
See also scap3 and deployment guide for further info.
Consider purging URLs
If the pagelib has changed we should consider purging the pagelib URLs. See Purge Varnish cache below.
Tagging deployments in Git
Production deployments are tracked with git tags in the main mobileapps repo. The most recent commit included in each deployment is given a tag in this format: deploy/YYYY-MM-DD/{short commit hash of the deploy repo} (e.g., deploy/2016-01-12/683d73e).
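To make the tag format concrete, here is a minimal sketch that only composes the tag name. `deploy_tag_name` is a hypothetical helper, the 7-character hash length is an assumption based on the example tag, and the long hash in the demonstration is a made-up placeholder. Creating and signing the tag is left to the repo's own script.

```shell
# Compose a deployment tag name of the form deploy/YYYY-MM-DD/<short hash>.
# deploy_tag_name is a hypothetical helper; it does not create any tag.
deploy_tag_name() {
  deploy_date=$1   # e.g. 2016-01-12
  commit_hash=$2   # full or abbreviated commit hash of the deploy repo
  # Assumption: 7 characters, matching the length used in the example tag.
  short_hash=$(printf '%s' "$commit_hash" | cut -c1-7)
  printf 'deploy/%s/%s\n' "$deploy_date" "$short_hash"
}

# Demonstration with a placeholder hash.
deploy_tag_name 2016-01-12 683d73e0a1b2c3d4
```

In a real checkout the arguments could come from `date +%Y-%m-%d` and `git rev-parse HEAD` run in the deploy repo.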
The mobileapps repo contains a shell script at scripts/tag-deploy.sh that is used to apply these tags. Tags are cryptographically signed, so a GPG signing key is required. See the Git tag setup section for the one-time setup.
Note: First update the source and deploy repos on your machine if you use another machine for tagging!
Then run:
./scripts/tag-deploy.sh
Example:
cd ~/code/mcs/deploy
git checkout master
git pull --rebase origin master
git submodule update --init
cd ~/code/mcs/mobileapps
git pull --rebase origin master
./scripts/tag-deploy.sh
To verify it worked you can do either of these:
- You can fetch the tag from a different clone of the repo.
- A bit later you can see the new tag on GitHub.
Update tasks in Phabricator
Move the tasks in the 'To deploy' column of the Product Infrastructure Kanban board to the 'Sign off' column and add a comment with the deploy tag if not already there.
Monitor log files
A few minutes after the deploy is finished, monitor Logstash for RESTBase and mobileapps.
Troubleshooting & Restarting services
Logs
The service is running on the following machines:
scb1001.eqiad.wmnet
scb1002.eqiad.wmnet
scb1003.eqiad.wmnet
scb1004.eqiad.wmnet
scb2001.codfw.wmnet
scb2002.codfw.wmnet
scb2003.codfw.wmnet
scb2004.codfw.wmnet
scb2005.codfw.wmnet
scb2006.codfw.wmnet
In your first terminal, tail the log file:
tail-mobileapps -f
Alternatively:
tail -2000 /srv/log/mobileapps/main.log \
| grep -v 'Could not find a definition for' | grep -v 'missingtitle' | grep -v 'Page or revision not found' | grep -v '501: unsupported_language'
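The chain of grep -v calls above can be folded into a single grep with several -e patterns. The sketch below is one way to do that: `filter_noise` is a hypothetical helper name, the patterns are the ones from the pipeline above, and the log lines in the demonstration are made up.

```shell
# Drop known-noisy messages from a log stream read on standard input.
# filter_noise is a hypothetical helper name.
filter_noise() {
  # -v inverts the match, -F treats each pattern as a fixed string,
  # and each -e adds one pattern; a line matching any of them is dropped.
  grep -v -F \
    -e 'Could not find a definition for' \
    -e 'missingtitle' \
    -e 'Page or revision not found' \
    -e '501: unsupported_language'
}

# Self-contained demonstration with made-up log lines.
printf '%s\n' \
  'request failed: missingtitle' \
  'worker started' \
  '501: unsupported_language zz' | filter_noise
```

On a service host it would be used as `tail -2000 /srv/log/mobileapps/main.log | filter_noise`.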
Restart from deploy host via scap
From the deploy host, restart the mobileapps service Node.js processes for one host, for example scb2003:
cd /srv/deployment/mobileapps/deploy/
scap deploy --service-restart -l scb2003.codfw.wmnet "Restarting mobileapps on scb2003"
(-l is shorthand for --limit-hosts)
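If several hosts need a restart, the per-host command can be generated rather than typed by hand. This is a dry-run sketch only: `print_restart_commands` is a hypothetical helper that prints commands in the form shown above and never invokes scap itself.

```shell
# Print (do not run) one scap restart command per host given as an argument.
# print_restart_commands is a hypothetical helper.
print_restart_commands() {
  for host in "$@"; do
    # Strip the domain for the log message: scb2003.codfw.wmnet -> scb2003
    short=${host%%.*}
    printf 'scap deploy --service-restart -l %s "Restarting mobileapps on %s"\n' \
      "$host" "$short"
  done
}

print_restart_commands scb2003.codfw.wmnet scb2004.codfw.wmnet
```

Printing first and pasting one line at a time keeps a human in the loop between hosts, which seems safer than restarting everything in one go.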
Restart (directly on machine)
In another terminal restart the mobileapps service Node.js processes:
cd /srv/deployment/mobileapps/deploy/
git log -n 1
ps -ef|grep mobileapps|wc -l
sudo service mobileapps restart
ps -ef|grep mobileapps|wc -l
Simple checks
Check version and run the automatic monitoring check manually:
check-mobileapps
# runs:
/usr/local/lib/nagios/plugins/service_checker 127.0.0.1 http://localhost:8888
Wait 5-10 minutes, watching the log file and #wikimedia-operations for alerts.
Other things to check:
- Uptime of service:
sudo service mobileapps status
- Versions:
curl localhost:8888/_info/version
- If Swagger spec was changed for this deploy:
curl localhost:8888/?spec
- Example command to check an endpoint:
curl localhost:8888/en.wikipedia.org/v1/feed/announcements
# beta cluster:
curl localhost:8888/en.wikipedia.beta.wmflabs.org/v1/feed/announcements
Refresh RESTBase cache
Refresh the aggregated featured feed stored in RESTBase/Cassandra for a single day. Example to run this from the prod cluster:
curl -H 'Cache-Control: no-cache' https://restbase.discovery.wmnet:7443/en.wikipedia.org/v1/feed/featured/2017/01/11
Notes:
- Adjust the date (and wikipedia.org subdomain if necessary).
- Another RESTBase machine could be used, too, but only one is needed to update the entry in Cassandra storage.
- There's still the Varnish cache; see
curl -sI https://en.wikipedia.org/api/rest_v1/feed/featured/2017/01/11 | grep '^cache-control:'
cache-control: s-maxage=300, max-age=60
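The notes above can be sketched as two small helpers: one that builds the refresh URL for a given domain and date, and one that reads the s-maxage value (how long, in seconds, the edge cache may keep serving the old response) from a cache-control header. Both function names are hypothetical; the host, port and path are copied from the example above.

```shell
# Build the internal RESTBase URL for one day's featured feed.
# feed_url is a hypothetical helper.
feed_url() {
  domain=$1   # e.g. en.wikipedia.org
  day=$2      # YYYY/MM/DD
  printf 'https://restbase.discovery.wmnet:7443/%s/v1/feed/featured/%s\n' \
    "$domain" "$day"
}

# Extract the s-maxage value from a cache-control header line on stdin.
# s_maxage is a hypothetical helper; it prints nothing if the directive is absent.
s_maxage() {
  sed -n 's/.*s-maxage=\([0-9][0-9]*\).*/\1/p'
}

# Demonstration using the values from the examples above.
feed_url en.wikipedia.org 2017/01/11
printf 'cache-control: s-maxage=300, max-age=60\n' | s_maxage
```

From the prod cluster the first helper would be used as `curl -H 'Cache-Control: no-cache' "$(feed_url en.wikipedia.org 2017/01/11)"`.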
Purge Varnish cache
See Multicast_HTCP_purging#One-off_purge on Wikitech
From mwmaint1002.eqiad.wmnet (terbium or deployment?). Examples:
echo 'https://meta.wikimedia.org/api/rest_v1/data/css/mobile/base' | mwscript purgeList.php
echo 'https://meta.wikimedia.org/api/rest_v1/data/css/mobile/pcs' | mwscript purgeList.php
echo 'https://meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pcs' | mwscript purgeList.php
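The three URLs share a common prefix, so the list can be generated from the paths alone. A minimal sketch: `pagelib_urls` is a hypothetical helper, and the base URL is taken from the examples above. It only prints URLs and does not purge anything.

```shell
# Print the public REST URL for each path given as an argument.
# pagelib_urls is a hypothetical helper.
pagelib_urls() {
  for path in "$@"; do
    printf 'https://meta.wikimedia.org/api/rest_v1/%s\n' "$path"
  done
}

pagelib_urls data/css/mobile/base data/css/mobile/pcs data/javascript/mobile/pcs
```

Assuming purgeList.php reads one URL per line from standard input (the examples above only show a single URL per call), the output could be piped into `mwscript purgeList.php` in one go; otherwise loop over the lines and call it once per URL.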
Dashboards
Logstash/Kibana
mobileapps, RESTBase (direct), RESTBase (ES), Parsoid
Performance
- mobileapps (old)
- Event bus delays: look at page-edit_delay Parsoid HTML and mobile-rerender-resource-change_delay
- RESTBase backend requests: Parsoid rates
- Public entry point request rates, req/s
Configuration
- If only one of the servers sees load: (also see each host's weight)
- https://noc.wikimedia.org/conf/
Load
Grafana
mobileapps, RESTBase, EventBus
- eqiad: scb1001, scb1002, scb1003, scb1004; Cassandra: enwiki, other
- codfw: scb2001, scb2002, scb2003, scb2004, scb2005, scb2006; Cassandra: enwiki, other
Icinga
Icinga (lower case user name):
- eqiad: Mobileapps LVS eqiad; scb1001, scb1002, scb1003, scb1004
- codfw: Mobileapps LVS codfw; scb2001, scb2002, scb2003, scb2004, scb2005, scb2006
See also
See also Dealing with deploy problems and reverting deploys.