Machine Learning

Group:	Technology & Product
Team members:	Aiko Chou, Kevin Bazira, Tobias Klausmann, Ilias Sarantopoulos, (list of former staff and volunteers)
Backlog:	#machine-learning-team
Lead:	Chris Albon


Welcome to the home page of the Wikimedia Foundation's Machine Learning team.

Our team oversees the development and management of machine learning models for end users, as well as the infrastructure needed to design, train, and deploy those models.

Current projects

Machine Learning Model Cards
The Machine Learning Modernization Project
LiftWing - A scalable machine learning model serving infrastructure on Kubernetes using KServe.

For archived projects, see this list.

Contact us

Have a question? Want to talk with the team or with our community of machine learning volunteers? Here are the best ways to connect with us.

Team chat

Discuss machine learning and follow the team's work by joining our public IRC channel, #wikimedia-ml, on irc.libera.chat.

Active workboard

If you have a particular task you want to discuss or work on, join us on our public Phabricator workboard. Visit our workboard

News

20 September 2024
- We will continue integrating the article-country model into LiftWing. The article-country model predicts which countries a given article is relevant to, and is an extension of the article-topic model, which we have used for years.
- We're trying different approaches to building vLLM (a high-throughput, memory-efficient system for serving large language models) and ROCm (AMD's open source software stack that lets applications run compute workloads on AMD GPUs) on Ubuntu. This is part of the work to make production LLMs on LiftWing possible.
- We're currently working on configuring the ML Lab servers. These are for model training.
- Updated the rec-api image deployment model. Deployed the reference need model to production.
23 August 2024
- Following up on a recurring issue reported by the Structured Content team: the MediaDetection API can access the logo-detection endpoint via mwdebug1001.eqiad.wmnet and mwdebug2001.codfw.wmnet, but can't access it on k8s-mwdebug.
- Adding the logo-detection documentation to the API Portal docs.
- Investigating occasional slow requests on LiftWing when using some RevScoring models.
- Continuing the remaining work on the pre-save model. This model is designed to predict vandalism before an edit is saved to Wikipedia (and therefore before it has a revision ID).
- Work continues on the service upgrade to 0.13.
- Initial setup of the installation configuration for the GPU hosts in eqiad.
14 June 2024
- Apologies for the delay in updates; unfortunately, I have COVID.
17 May 2024
- Work continues on the logo detection model. We built an example model server for logo detection that processes base64-encoded image objects instead of image URLs, and sent it to the Structured Content team for their consideration.
- Work continues on the HuggingFace model server.
- General bug fixes and improvements.
10 May 2024
- Work on the logo detection model continues. The question we discussed as a team this week was whether the encoded image would be sent directly to LiftWing. Alternatively, we would receive a URL for the image's location, which LiftWing would then access and load. This matters because it affects the size of the REST payload, particularly for batch requests.
- We are still working on the HuggingFace model server GPU issue (i.e., it does not recognize our AMD GPU). There are several possible reasons why this is happening, but we want it resolved before finalizing our order for this fiscal year.
- A number of miscellaneous bug fixes and improvements.
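To make the 10 May payload-size question concrete, here is a minimal sketch comparing a request body that embeds a base64-encoded image with one that only carries a URL. The JSON field names (`image`, `url`), the example URL, and the 300 kB dummy image are assumptions for illustration, not the actual LiftWing request schema.

```python
import base64
import json


def inline_payload(image_bytes: bytes) -> str:
    """Request body that embeds the image itself, base64-encoded."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded})  # "image" is a hypothetical field name


def url_payload(image_url: str) -> str:
    """Request body that only points at the image."""
    return json.dumps({"url": image_url})  # "url" is a hypothetical field name


# 300 kB of zero bytes standing in for a real image file.
fake_image = bytes(300_000)

# base64 emits 4 output characters for every 3 input bytes,
# so the encoded image is about one third larger than the raw file.
encoded_length = len(base64.b64encode(fake_image))

inline_size = len(inline_payload(fake_image))
url_size = len(url_payload("https://example.org/image.png"))
```

With base64, every image in a batch adds roughly 4/3 of its file size to the request body, while a URL adds only a few dozen bytes. The trade-off is that the server must then fetch the image itself.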
3 May 2024
- Our big Istio refactoring is underway (slides)! This refactoring will let us remove a lot of networking logic from individual model containers. For example, currently, if there were any change to the `discovery.wmnet` endpoint (WMF's internal endpoint for APIs), we would have to update hundreds of individual model containers and redeploy them. This refactoring eliminates that need entirely.
- We deployed AMD's open source software (ROCm) on every k8s node, but we suspect this was unnecessary (and may actually have caused some problems) because PyTorch already ships with a version of ROCm bundled in the library. This work is being prioritized because completing it is a requirement for placing the large GPU order we have planned for next quarter.
- We are preparing a patch that allows the logo detection model server to access external URLs using the internal k8s endpoints. This is part of the changes we had to make to deploy the model.
- We continue to test the HuggingFace model server image on our LiftWing nodes. This work was paused for a week while the engineer attended the Wikimedia Hackathon in Tallinn.
- The Lift Wing caching work has been paused until the Istio refactoring is complete.
26 April 2024
- Reviewing and testing the big patch for the ORES extension. The ORES extension provides a way to see the probability that a particular edit will be reverted, for all edits on the recent changes page of many wikis. The patch integrates the new revert risk model into the extension so that volunteers can use that new model when hunting down potential vandalism.
- We're still making some tweaks to the image processing for the logo detection model, specifically restricting the image processing to trusted domains that host Wikimedia Commons images.
- We have a big Istio refactoring proposal under discussion (Istio is the service mesh for k8s that controls how microservices share data with each other). On Tuesday the team will have a special meeting to discuss the proposed refactoring and decide on the path forward. I'll post the slides next week if people are interested.
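As an illustration of the trusted-domain restriction mentioned in the 26 April update, the sketch below checks an image URL against an allowlist before any download happens. The allowlist contents and the function name are assumptions for illustration; the team's actual list and implementation are not described here.

```python
from urllib.parse import urlparse

# Assumed allowlist for illustration only; the real list is not given in this page.
TRUSTED_IMAGE_HOSTS = {"upload.wikimedia.org"}


def is_trusted_image_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host is exactly on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    # urlparse lower-cases the hostname and strips any port or credentials,
    # so look-alike URLs such as "trusted.host@evil.example" do not match.
    return parsed.hostname in TRUSTED_IMAGE_HOSTS
```

Matching the parsed hostname exactly, rather than searching the URL string for a trusted name, is what keeps look-alike hosts and query-string tricks from passing the check.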
19 April 2024
- The logo detection model is being moved to the experimental namespace. This will let us test the model in a production setting to make sure that it has the performance that we want. This work is being coordinated closely with the Structured Content team to make sure it meets their needs.
- The ML and Research Airflow Pipeline Sprint started this week. This is an effort to see how we can use Airflow pipelines and GPUs on the existing Hadoop cluster to train models.
- Work continues on the Cassandra clusters that will be part of the caching solution.
- Work continues on the Hugging Face model server image. This is an effort that we're working on that will allow us to easily host many of the models that are available on Hugging Face onto Lift Wing directly. This is actually a really interesting project because it's an easy way for the community to experiment with the models that they might want to host on Lift Wing and even propose models that they might want to have on Lift Wing.
- We are working with the data center operations team on the procurement of new machines with GPUs. The current status is that we are working with the vendor to resolve an issue around the availability of a particular server configuration and looking at some alternatives.
12 April 2024
- Chris on vacation. No update this week.
5 April 2024
- Big win for the week: Our HuggingFace Docker image patch has been reviewed and approved. This Docker image allows us to deploy HuggingFace models quickly onto LiftWing, which will speed up our development process going forward.
- Continuing to integrate the logo-detection prototype into a KServe custom model server that will be hosted on LiftWing.
- Working on the revertrisk-multilingual GPU image and ensuring the RRML model is compatible with torch 2.x (e.g., that predictions are still correct, since the model was trained with 1.13).
29 March 2024
- We are still working on the logo detection model for Wikimedia Commons. The current status is that we have confirmed with the product team working on the feature that the model is returning the expected outputs. The next step is to look at input validation and image size limits. The open question we are discussing with the product team is whether resizing of images should be done inside Lift Wing or prior to the image being sent to Lift Wing. Resizing is important because the logo detection model expects an image of a certain size.
- Work / banging our heads continues on the pytorch base image. For those following along, we are working with Service Ops to make a reasonably sized docker image that contains pytorch and ROCm support. If the base image is too big it becomes a problem for our Docker registry and we are trying to be good stewards of that common resource. Turns out it is harder than we thought.
- More work is happening on Lift Wing caching. We are still working out how we want Lift Wing (specifically KServe's Istio) to talk to the Cassandra servers.
- A new version of the Language Agnostic Revert Risk model has been deployed to staging and is currently undergoing load testing.
- More work on the HuggingFace model server integration with Lift Wing. Once we crack this we will be able to deploy most models on HuggingFace quickly.
22 March 2024
- We stood up a Wikimedia community of practice for ML this week. The goal is to provide a space for all the folks around WMF that are working on the technical side of ML to share insights and learn together. Currently there are folks from a number of teams in the community of practice, including ML, Research, Content Translation, and others.
- We are still waiting for our test GPUs (one server with two MI210s) to be installed in the data center. Once we test this configuration works well in our infrastructure (a few days of testing max) we can continue with the full order.
- I am starting work on a white paper that surveys all the work the Wikimedia-verse is doing around AI, including models WMF hosts, advocacy work done by WMF, work by volunteers, etc. If you know people I should talk with, definitely reach out.
- We are really pushing hard on getting caching deployed. With caching, we can take full advantage of the CPUs we have now by pre-caching predictions. The end result for users is that a prediction that might take 500ms would take a fraction of that time. The current status is that our SRE is trying to get Lift Wing to speak to the Cassandra servers.
- Our SLO dashboards need to be fixed. They are giving some wild numbers that are clearly incorrect. Our team is working with folks to figure it out.
- Work on the Logo Detection model continues. The request to host this model comes from the Structured Content team. The goal is to predict logos in Wikimedia Commons because logos account for a significant chunk of files that receive a deletion request.
- We are continuing to try to load the HuggingFace model server onto Lift Wing. When completed this offers the potential to load a model hosted on HuggingFace into Lift Wing quickly and easily, opening a huge new library of models for folks to use.
15 March 2024
- We are working on deploying a model for the Structured Content team that detects potentially copyrighted image uploads on Commons, specifically images with logos. (T358676)
- We are continuing to work on hosting HuggingFace model server on Lift Wing. This would make deploying HuggingFace models super simple.
- We have deployed Dragonfly cache on Lift Wing to help with Docker image sizes.
- Our Cassandra databases for an eventual caching system are in production. There is still more work to do, but it's a good start.
- General updates and bug fixes.
9 March 2024
- Sorry for the update being one day late. I (Chris) attended the Strategy meeting in NYC and am writing this update from the plane back.
- An issue we are facing is that WMF's Docker registry is set up for smaller Docker images (~2 GB). However, the team's Docker images can get pretty big because of ROCm/PyTorch (~6-8 GB). We are working out how to resolve that. There are a number of strategies we can pursue, from optimizing the image layers better to requesting that the maximum Docker image size limit be increased.
- As a partial solution to the above, we installed Dragonfly, which is a peer-to-peer layer between our Kubernetes cluster and the WMF Docker registry. We will also work on some other improvements.
- We are continuing to work on including HuggingFace's prebuilt model server in Lift Wing. This would mean we could quickly deploy any model on HuggingFace with all the optimizations HuggingFace provides (T357986). This isn't done yet, but it would be really nice to have.
- Fixing a reported bug about inconsistent data types for article quality scores on ptwiki. The error was caused by the mixed schema of the responses returned by ORES (T358953).
- We made our server hardware request for the next fiscal year. The short version is: GPUs.
1 March 2024
- GPU order is underway. We are in the process of ordering a series of servers to use for training and inference. Each server will have two MI210 AMD GPUs. Most will be reserved for model inference (specifically, larger models like LLMs), but we will use two servers (4 GPUs) to create a model training environment. This model training environment will start very small and scrappy but will hopefully grow into a place for automated retraining of models and the standardization of model training approaches. As a next step, a single server will be sent to our data center; once it is tested, we will place the full order.
- Work on caching for Lift Wing continues. We are in the process of making a large order of GPUs. However, to optimize our resource use, one of the best strategies available to us is to conduct model inference using our existing CPUs. This is not always possible, for example when the set of possible model inputs is not finite. However, when the possible inputs are finite, we can cache the predictions for those inputs and then serve them to users rapidly with minimal compute. This is similar to the system originally used on ORES.
- The pentesting of Lift Wing continues. The testing is being done by a third party contractor and is examining our vulnerability to malicious code.
- Wikimedia's branding team has come out with some suggestions for the naming of machine learning tools and models. The hope is that our naming is more systematic and less ad-hoc.
- Chris helped organize and attended an event in Bellagio, Italy to craft a research agenda for researchers interested in Wikipedia. That research agenda is available here.
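The caching idea from the 1 March update (pre-compute predictions for a finite set of inputs, then serve them with minimal compute) can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory dictionary stands in for the Cassandra-backed store, and the class name, `toy_model`, and the revision IDs are hypothetical, not part of Lift Wing.

```python
from typing import Callable, Dict, Hashable, Iterable


class PrecomputedPredictionCache:
    """Serve predictions from a cache when the input was scored ahead of time."""

    def __init__(self, predict: Callable[[Hashable], float]) -> None:
        self._predict = predict                  # the (expensive) model call
        self._cache: Dict[Hashable, float] = {}  # stand-in for a real cache store
        self.model_calls = 0                     # how often the model actually ran

    def _run_model(self, key: Hashable) -> float:
        self.model_calls += 1
        return self._predict(key)

    def warm(self, known_inputs: Iterable[Hashable]) -> None:
        """Pre-compute predictions for a finite, known set of inputs."""
        for key in known_inputs:
            if key not in self._cache:
                self._cache[key] = self._run_model(key)

    def get(self, key: Hashable) -> float:
        """Return the cached prediction, falling back to the model on a miss."""
        if key not in self._cache:
            self._cache[key] = self._run_model(key)
        return self._cache[key]


def toy_model(revision_id: int) -> float:
    # Hypothetical stand-in for a real model: a deterministic fake score.
    return (revision_id % 100) / 100.0


cache = PrecomputedPredictionCache(toy_model)
cache.warm([101, 102, 103])  # pre-cache a finite set of inputs
score = cache.get(102)       # served from the cache, no model call
```

Requests for pre-cached inputs never touch the model, which is why this only pays off when the set of likely inputs is finite and known in advance.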