User:Aisha Khatun
About me
I am an ML and NLP enthusiast from Bangladesh. I love working with data and drawing insights from it. I did my Bachelor's in Computer Science and Engineering at Shahjalal University of Science and Technology, Bangladesh, and my Master's in Computer Science at the University of Waterloo, Canada. After graduation, I worked as a Machine Learning Engineer for about a year before joining the Wikimedia Foundation as a Data Analyst and Researcher, taking on several roles along the way.
N.B: This is my personal wiki page.
My work
- I work with the Research Team as a Research Data Scientist (NLP) to develop Copyediting as a structured task. To raise and maintain the standard of Wikipedia articles, it is important to ensure articles are free of typos, spelling mistakes, and grammatical errors. While there are ongoing efforts to automatically detect "commonly misspelled" words on English Wikipedia, most other languages are left behind. We intend to find ways to detect errors in articles across all languages in an automated fashion.
- Previously, I worked with the Search and Analytics team to find ways to scale the Wikidata Query Service by analyzing the queries being made. The analysis results are available in my User:AKhatun subpages and on the Phabricator work board (WDQS Analysis).
- I worked on the Abstract Wikipedia project to identify central Scribunto Modules across all the wikis. This work supports the future creation of a central repository of functions to be used in a language-independent manner. See our work on Phabricator and GitHub.
Contact me
- IRC: tanny411 on Libera
- Personal: website/blog
- LinkedIn: tanny411
- GitHub: tanny411
- Meta: Aisha Khatun
- WMF meta: AKhatun (WMF)
Outreachy Round 21 Internship Work
Overview
I am an Outreachy intern with the Wikimedia Foundation. My internship runs from 1 Dec, 2020 to 2 March, 2021. I am working on an initial step of the Abstract Wikipedia project - a project to make Wikipedia reach millions more people by storing information in a more language-independent manner.
I have blogged throughout my internship period here and will continue to do so. More details about the specifics of my work can be found here:
- My blog: Internship progress
- Our work in GitHub: wikimedia/abstract-wikipedia-data-science/
- Tasks in Phabricator: T263678
Project Description
The Abstract Wikipedia initiative will make it possible to generate Wikipedia articles with a combination of community-authored programming functions on a "wiki of functions" and the data and lexicographic (dictionary, grammar, etc.) knowledge on Wikidata. Today, the way community-authored programming functions are used on different language editions of Wikipedia involves a lot of copying and pasting. If someone wants to calculate a person's age for a biography in their native language, they may need to first go to English Wikipedia, for example, find the community-authored programming function that calculates ages in English, then copy and paste it to their non-English Wikipedia. This process is error-prone, can lead to code duplication, and worse, improvements to functions on one language edition may never make their way to other language editions. Wouldn't it be easier if all of these functions were instead available centrally and people didn't have to go through this manual process? This Outreachy task is about an important first step: finding the different community-authored functions that are out there and helping to prioritize which ones would be good candidates for centralizing for Abstract Wikipedia and its centralized wiki of functions.
Mentor
Task partner
Blog posts
- Internship progress
- Getting started with Outreachy
- Struggle and Grow
- What is Abstract Wikipedia
- Modifying Expectations
- Future Goals
Outreachy Internship Updates
Week 1 (1-7 Dec, 2020)
Created a Wikitech account. Connected Phabricator to Wikitech and MediaWiki. Set up 2FA on everything. Created and set a committed identity on my MediaWiki and Meta-Wiki user pages. Read about Gerrit, set up MediaWiki-Docker, ran the PHP unit tests, and set up git-review following the How to become a MediaWiki hacker page. Joined the required mailing lists and IRC channels (these channels have been so much help). Read the very awesome paper on Abstract Wikipedia, as well as the Wikimedia Engineering Architecture Principles, Movement Strategy, and the WMF values and guiding principles. Started reading Wikipedia @ 20 and Debugging Teams. Wrote my first blog post for people looking to intern through Outreachy and join open source: Blog. Challenges and lessons
Week 2 (8-14 Dec, 2020)
There were a couple of new things I had to read about and understand the workings of this week. Among them were Toolforge, version control in Toolforge, the Grid, cron jobs, the MediaWiki databases, and the MediaWiki API. I created a tool and tested out various things to get comfortable working with this environment. I also experimented with local Python, working in Toolforge through the terminal, and PAWS Jupyter notebooks to figure out which way is better suited for me. I ended up working in PAWS, as it can connect to the databases easily and export my finished code as Python scripts to our GitHub repo. I also tested running jobs on the Grid and set up dummy cron jobs. Poked around the database a little from the terminal, PAWS, and locally. Challenges and lessons
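As a rough illustration of what working with the replicas from PAWS or Toolforge looks like, here is a minimal sketch using the `toolforge` Python library; the wiki name and query are just examples, not the project's actual code.

```python
import toolforge

# Connect to a wiki replica database from PAWS/Toolforge. toolforge.connect()
# returns a pymysql connection to the replica of the given wiki.
conn = toolforge.connect("enwiki")  # example wiki

# Example query: count pages in the Scribunto Module namespace (828).
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM page WHERE page_namespace = 828")
    print(cur.fetchone())
```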
Week 3 (15-21 Dec, 2020)
Started creating scripts to collect the contents of Scribunto Modules across all wikis. Set up the script on the Grid to run every day as a cron job. For now it collects all the contents fresh every day. Since the API can miss some pages, I collected the page list (id, title) from the database as well, to check against the page list collected via the API. Note that the DB does not contain wiki contents; contents are only returned by the API. Next I compared the page lists from the DB and the API and found some inconsistencies in both places. I had a bunch of issues when loading from CSV files due to the various symbols used in the contents of the files (commas, quotes), and also some broken rows due to multiple crons writing to the same file at the same time. These are to be fixed next. Wrote my second blog post about my struggles: Blog. Challenges and lessons
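For context, fetching module contents from a wiki boils down to paging through the MediaWiki Action API with a generator over the Module namespace. The sketch below is a simplified stand-in for the collection script; the endpoint and parameters are illustrative, not the exact code in our repo.

```python
import requests

# Fetch Scribunto module contents (namespace 828) from one wiki via the
# MediaWiki Action API, following 'continue' tokens to page through results.
API_URL = "https://bn.wikipedia.org/w/api.php"  # example wiki

session = requests.Session()
params = {
    "action": "query",
    "format": "json",
    "generator": "allpages",
    "gapnamespace": 828,        # Module: namespace
    "gaplimit": "max",
    "prop": "revisions",
    "rvprop": "content|timestamp",
    "rvslots": "main",
}

modules = {}
while True:
    data = session.get(API_URL, params=params).json()
    for page in data.get("query", {}).get("pages", {}).values():
        revs = page.get("revisions")
        if revs:  # some batches return pages without content yet
            content = revs[0]["slots"]["main"].get("*", "")
            modules[page["pageid"]] = (page["title"], content)
    if "continue" not in data:
        break
    params.update(data["continue"])

print(f"Collected {len(modules)} modules")
```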
Week 4 (22-28 Dec, 2020)
This week I finalised my content fetcher. All code was transitioned to take input from and give output to the database. I cleaned and tested the database transition. Fixed some more memory errors and divided the cron jobs further to take advantage of parallel processing. Because of some large wiki sites (e.g. enwiki, frwiki), some jobs take up to 60 minutes. Cron job re-arrangement was also necessary to fix some of the memory issues, probably caused by the large content of individual pages. Another script was set up to fetch page_ids from the database and fetch their content through the API. Using the database made updating, searching, and deleting the collected information much easier and more robust. Some analysis was done on pages found only through the API or only through the DB. Pages that don't have content or were missed multiple times (due to not having content or not being Scribunto modules) were deleted. Task done here. Finally, started analyzing the database. Work progress here. From the database analysis I will select relevant page information that we will use for data analysis later on. Challenges and lessons
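The switch to database storage essentially means upserting fetched contents into the tool's user database so that re-runs update rows in place instead of duplicating them. A hedged sketch, assuming the `toolforge` library's ToolsDB helper; the database name, table, and columns below are made up for illustration.

```python
import toolforge

# Store fetched module contents in the tool's user database (ToolsDB).
# INSERT ... ON DUPLICATE KEY UPDATE keeps re-runs idempotent: re-fetched
# pages overwrite their old content instead of piling up duplicate rows.
conn = toolforge.toolsdb("s00000__data")  # hypothetical ToolsDB database name

upsert = """
INSERT INTO Scripts (dbname, page_id, title, sourcecode)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    title = VALUES(title),
    sourcecode = VALUES(sourcecode)
"""

rows = [
    ("enwiki", 123, "Module:Example", "-- Lua source here"),  # example row
]

with conn.cursor() as cur:
    cur.executemany(upsert, rows)
conn.commit()
```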
Week 5 (29 Dec, 2020-4 Jan, 2021)
Almost done exploring the database this week (work progress). I explored all the tables and tried to understand what they hold and how the information may be useful to us. Spent some time finding out what pagelinks, langlinks, iwlinks, and templatelinks are and how they differ, to make sure they don't overlap. Details of my work progress are in my internship progress blog. I set up queries to maneuver and fetch data from all the wiki databases, save it to the user database, and set up cron jobs for the same. Unexpectedly, fetching from the databases is taking much longer than fetching data from the API did. Segmentation faults and memory errors had to be solved with `LIMIT OFFSET` queries. This made the queries even slower, but able to fit into memory (chunk_size in the dataframe failed me). Wrote my third blog post introducing Abstract Wikipedia and our work: Blog. Challenges and lessons
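The `LIMIT OFFSET` workaround amounts to pulling a large result set in fixed-size pages so only one chunk is ever in memory at a time. A minimal sketch of the idea; the table, columns, and chunk size are illustrative.

```python
from itertools import islice

import toolforge

def fetch_in_chunks(dbname, query, chunk_size=50_000):
    """Yield rows of `query` in LIMIT/OFFSET chunks so the full result
    never has to fit in memory at once. `query` must not already contain
    its own LIMIT clause."""
    conn = toolforge.connect(dbname)
    offset = 0
    while True:
        paged = f"{query} LIMIT {chunk_size} OFFSET {offset}"
        with conn.cursor() as cur:
            cur.execute(paged)
            rows = cur.fetchall()
        if not rows:
            break
        yield from rows
        offset += chunk_size

# Example: stream templatelinks rows pointing at modules without loading
# the whole table into memory.
query = "SELECT tl_from, tl_title FROM templatelinks WHERE tl_namespace = 828"
for row in islice(fetch_in_chunks("enwiki", query), 5):
    print(row)
```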
Week 6 (5-11 Jan, 2021)
Finalized collecting data from the databases. Incorporated my mentor's suggestions about query optimization and decided not to use iwlinks for now, as that query was taking multiple days to run and thus slowing down Toolforge for other purposes. Introduced better exception handling and error reporting, and edited the code so that queries are retried a few times before failing. Also set up 'fetch missed contents' scripts. Finished collecting pageviews of pages that transclude modules using the PHP and REST APIs. Both threw errors of their own; I ended up using the PHP API. Challenges and lessons
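The retry logic is conceptually simple: wrap a query or API call, catch failures, wait, and try again a few times before giving up. Below is a sketch of that pattern, here wrapped around a pageviews fetch via the Wikimedia REST Pageviews endpoint purely as an example (the project ultimately settled on the PHP API, and the function names here are hypothetical).

```python
import logging
import time

import requests

def with_retries(func, attempts=3, wait=60, *args, **kwargs):
    """Run func(*args, **kwargs), retrying a few times with a pause
    between attempts before letting the exception propagate."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(wait)

def get_pageviews(project, title, start, end):
    """Sum daily pageviews for one article from the REST Pageviews API."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/all-agents/{title}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers={"User-Agent": "pageview-example"})
    resp.raise_for_status()
    return sum(item["views"] for item in resp.json()["items"])

# Retry up to 3 times, waiting 30 seconds between attempts.
views = with_retries(get_pageviews, 3, 30,
                     "en.wikipedia", "Earth", "2020110100", "2021010100")
print(views)
```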
Week 7 (12-18 Jan, 2021)
This week I started data analysis of the various numeric data in the user database tables. There were lots of nulls, which I analysed by looking into the respective wiki database tables; I concluded that certain columns in our user database need to be modified to have a default value of 0. The next most important observation is that the data is HIGHLY skewed. The only way to visualize anything is to plot on a log scale. So I viewed the data in small intervals, and for each column I tried to find some very basic initial heuristic to identify which modules are important. As for pageviews, the script is still running to fetch pageviews for all data over 60 days. Since it is taking multiple days to run, I changed the script to fetch weekly instead of daily totals for later runs, but I think I might have to change it to fetch monthly instead. Wrote my fourth blog post about Modifying Expectations: Blog. Challenges and lessons
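To make the skew concrete: a plain histogram of something like transclusion counts collapses into one tall bar near zero, while log-spaced bins on log axes reveal the long tail. A small illustration with synthetic data (the real columns come from the user database, not from this generated sample):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative stand-in for a heavily skewed column such as the number of
# pages a module is transcluded in: most values tiny, a few enormous.
rng = np.random.default_rng(0)
transcluded_in = rng.pareto(a=1.2, size=300_000).astype(int)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# A linear-scale histogram is unreadable: almost everything lands in one bin.
ax1.hist(transcluded_in, bins=50)
ax1.set_title("linear scale (unreadable)")

# Log-spaced bins on log axes show the long tail clearly.
bins = np.logspace(0, np.log10(transcluded_in.max() + 1), 50)
ax2.hist(transcluded_in + 1, bins=bins)  # +1 so zeros fit on a log axis
ax2.set_xscale("log")
ax2.set_yscale("log")
ax2.set_title("log scale")

plt.tight_layout()
plt.show()
```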
Week 8 (19-25 Jan, 2021)
Continued with data analysis for the numeric columns. Took some time to analyse transcluded_in and transclusions in more depth. For each column I tried to find some heuristic values to determine which modules may be more important. Did some source code analysis as well. Page protection seems to have some new values that I couldn't understand at first. I learned from an answer on IRC that recent pages have page protection in terms of user rights, i.e. a user with `X` rights can edit/move pages with `X` edit/move protection. Read up on this and tried to understand it more. Concluded that this isn't universal across wikis, so I stuck to the old page protection values, as those are the majority. Challenges and lessons
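Picking those per-column heuristic values mostly came down to looking at the upper quantiles of each distribution and choosing a cutoff. A toy sketch of that step with synthetic data; the 0.99 cutoff is arbitrary, not the value actually used.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the per-module numeric columns being analysed.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "transcluded_in": rng.pareto(1.2, 10_000).astype(int),
    "transclusions": rng.pareto(1.5, 10_000).astype(int),
})

# Inspect the upper quantiles of each column to pick a rough cutoff above
# which a module is treated as "important" for that feature.
print(df.quantile([0.5, 0.9, 0.99, 0.999]))

cutoffs = df.quantile(0.99)
df["maybe_important"] = (df["transcluded_in"] >= cutoffs["transcluded_in"]) | (
    df["transclusions"] >= cutoffs["transclusions"]
)
print(df["maybe_important"].mean())
```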
Week 9 (26 Jan-1 Feb, 2021)
Closed the db-fetching task in Phabricator (T270492) after fixing bug T272822. The cause of the `mysql connection lost` errors was that we were running the queries on the `web` cluster instead of the `analytics` cluster, which is what the toolforge Python library defaults to. Tested by running all the scripts; it seems to work now. Finished the data analysis and created a PDF summary, which I shared with others. Merged the db-fetcher into the develop branch and incorporated feedback from Jade. Wrote my 5th blog post on my Future Goals. Challenges and lessons
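In practice the fix was just pointing the library at the analytics replicas. A sketch, assuming the `cluster` keyword of the `toolforge` library used above:

```python
import toolforge

# Long-running analysis queries belong on the analytics replica cluster;
# the library's default ('web') is what was behind the dropped connections.
conn = toolforge.connect("enwiki", cluster="analytics")

with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM page WHERE page_namespace = 828")
    print(cur.fetchone())
```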
Weeks 10-11 (2-15 Feb, 2021)
Started applying the findings from the data analysis to find important modules. As evident from my data analysis, the data we have is heavily skewed. We want pages with more transclusions, more page links, etc. to be counted as important modules, but the number of such modules is very low and gets lost in the 99.99999...th percentile. To fix this, I changed the distribution slightly and regarded the-percentile-a-value-is-in as the score. See more details of how I did that in the Phabricator task T272003. Finally, a module score is calculated as a weighted sum of the feature scores. The weights can be altered by the user, for example to prioritize the number of lang links over transclusions.
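In other words, each feature's raw value is replaced by its percentile rank, which squashes the extreme skew into a 0-1 scale, and the final module score is a user-weighted sum of those ranks. A small sketch of that scoring scheme; the column names, weights, and synthetic data are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative module features; the real ones include transclusions,
# page links, language links, pageviews, etc.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "transcluded_in": rng.pareto(1.2, 100_000),
    "pagelinks": rng.pareto(1.5, 100_000),
    "langlinks": rng.poisson(0.3, 100_000),
})

# Score each feature as the percentile its value sits in (rank / n),
# flattening the heavy skew into a 0-1 scale.
feature_scores = df.rank(pct=True)

# Weighted sum of feature scores; weights are user-tunable, e.g. to
# prioritize language links over transclusions.
weights = {"transcluded_in": 0.4, "pagelinks": 0.3, "langlinks": 0.3}
df["module_score"] = sum(feature_scores[col] * w for col, w in weights.items())

print(df.nlargest(5, "module_score"))
```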
Weeks 12-13 (16 Feb-2 Mar, 2021)
After our weekly discussion, Jade and I decided to split our tasks once again. I continued working on the similarity analysis that Jade had started, while Jade started building the web interface. Week 12 was a hectic week. I was buried in tons of experiments and had to come up with a way to find modules that are similar to each other. Jade had started by using Levenshtein distances as features and performing DBSCAN clustering on them. This approach was a bit problematic in that Levenshtein distance is too slow to compute and takes `n x n` memory to create the distance matrix, which wouldn't be feasible with our ~300k modules. To fix this I started fresh and looked for other ways to build features. After a lot of experiments (see details in Phabricator task T270827) I decided to go with FastText word embeddings as features and the OPTICS clustering algorithm. Next, I dealt with the high number of noise points that the algorithm detects by tuning it a bit, creating some pseudo-clusters from the noise, and finding ways to relate the clusters themselves. All of this is documented in a PDF uploaded to the Phabricator task.
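At a high level the pipeline embeds each module's source with FastText, averages the token vectors into one vector per module, and clusters those vectors with OPTICS, treating label -1 as noise to be post-processed. The toy sketch below shows the shape of that pipeline only; the real tokenization, model parameters, and noise handling are in the Phabricator task.

```python
import numpy as np
from gensim.models import FastText
from sklearn.cluster import OPTICS

# Toy corpus standing in for tokenized Lua source of Scribunto modules.
modules = [
    "local p = {} function p.age(frame) return frame.args[1] end return p",
    "local p = {} function p.age(frame) return frame.args.birth end return p",
    "local cite = {} function cite.web(frame) return '<ref>' end return cite",
]
tokenized = [m.split() for m in modules]

# Train a small FastText model and represent each module as the mean of
# its token vectors (illustrative choices, not the production settings).
ft = FastText(sentences=tokenized, vector_size=32, window=5, min_count=1, epochs=20)
embeddings = np.array([
    np.mean([ft.wv[tok] for tok in toks], axis=0) for toks in tokenized
])

# Density-based clustering with OPTICS; label -1 marks noise points, which
# is what needed extra handling (pseudo-clusters) at full scale.
labels = OPTICS(min_samples=2, metric="cosine").fit_predict(embeddings)
print(labels)
```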
Final Update
All of our work is now accessible through the web interface of our tool: abstract-wiki-ds.toolforge.org. You can select some or all wiki projects, some or all languages, and give your own weights (or use the defaults) to generate a list of 'important modules' based on their scores. Click on any module to get a list of modules that are similar to it. Now users can easily start the process of merging modules and move towards a more language-independent Wikipedia - Abstract Wikipedia!