Data Platform Engineering/Data Platform SRE/Status Update/2024-11-29
No update was sent last week, so this update covers 2 weeks!
Airflow migration to k8s
All Airflow webservers have been migrated to k8s. This brings some quality-of-life improvements. Users can now:
- reach their Airflow UI via a public domain (no need for SSH tunnels)
- manage roles and permissions via LDAP group management
- get working links in alert emails
We've migrated the scheduler of our test instance to k8s. We'll need to replicate this work for all production instances, but at this point we are confident that this should work with only minor surprises.
A new automated DAG deployment process (T368033) has been discussed, implemented, documented, and communicated. Merge requests to Airflow DAGs now require formal approval by a peer before being deployed.
- T379267 Migrate the airflow webservers to Kubernetes
- T375729 Create LDAP groups to use for OIDC permission mapping with corresponding airflow DAG Authors groups
- T380591 Migrate the airflow-analytics-test database to Kubernetes
- T380284 Migrate the airflow-analytics-test scheduler to Kubernetes
- T368033 Design a suitable DAG deployment method
- T380733 Change the Airflow email From address so that it refers to the instance name instead of the k8s cluster
- T380727 Restore original behavior for Airflow variable management
Spark version upgrade (in support of Dumps 2.0)
- T380035 Create Spark docker images for version 3.5.3
- T380038 Create a debian package of the spark shuffler for yarn version 3.5.3
- T380039 Create and distribute an assembly file for spark version 3.5.3
Replace Archiva with Gitlab artifact repositories
Migration of the Search clusters to OpenSearch
Operations
We've had issues with disk space and the number of folders, related to changes in how we deploy Refine. The immediate issue has been resolved (big thanks to DC-Ops for reacting quickly and adding disks!). The underlying problem still needs to be addressed and has been communicated to Data Engineering.
- T378854 an-presto1018.eqiad.wmnet: DRAC is down
- T376118 Update druid config to automatically drop unused segments
- T378954 Build bigtop 1.5 packages for bookworm
- T380674 Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS
- T380278 High priority: Disk space expansion on an-launcher1002
- T380566 Increase the number of partitions for the webrequest_frontend topics. (in support of the migration from Varnish to HAProxy)
- T380494 java.io.IOException: Permission denied when trying to access the hadoop cluster
- T374602 Licensing / Legal risk around the use of Miniconda (Legal is approving the use of Miniconda. We are bound by the conda-forge CoC: https://conda-forge.org/community/code-of-conduct/, but there isn't anything unexpected in there)
- T379571 Kernel error Server an-redacteddb1001 may have kernel errors
- T380477 Jupyter/Conda: spawn new server with 'create and use new cloned env' times out
- T365878 Test whether or not CPU performance governor helps Hadoop Performance
- T219507 Create cookbook to reindex into elasticsearch / cirrus (declined, we now have better ways than cookbooks to reindex)
- T379182 ProbeDown - wdqs1015/migrate LDF alerts to Data Platform SRE