There is an extremely relevant example of performance measurement from a Youtube engineer that illustrates how the statistics being collected are garbage.
It's worth reading the link, but in short: Youtube pages were bloated and inefficient. A base page was 1.2 megabytes and required many dozens of requests before you could even begin viewing a video. The engineer optimized the base page down to under 0.1 megabytes and just 14 requests. When he tested it, it was much, much faster. So he sent it live. A week later he checked the performance stats. He was shocked to discover that average load times had gotten WORSE! Nothing made sense. He was going crazy trying to figure out why. Then a colleague came up with the answer:
For much of the world, the original (slow) Youtube page was absolutely unusable. The metrics collected for the original page only included people who were already getting acceptable performance. When the lightweight version was released, news spread like wildfire across Southeast Asia, South America, Africa, Siberia, and elsewhere that Youtube WORKED now! The averages got worse because they now included vast numbers of people who couldn't use the slow version at all.
In this case the 2017Editor is the slow version, and the normal wikitext editor is the fast version. Collecting this kind of bulk data and comparing the results rests on the assumption that the sample populations are actually comparable. However, as illustrated above, that assumption isn't true. Anyone who found the 2017Editor unusable has stopped using it. The data for the 2017Editor is going to be skewed towards people who get better-than-average performance, and grossly skewed towards people who only edit small pages. Anyone who tries to edit a large page with the 2017Editor is going to turn the damn thing off.
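For anyone who wants to see the selection effect in action rather than take it on faith, here is a minimal sketch with entirely made-up numbers (70% of users on fast connections, 30% on very slow ones, a fixed patience threshold). It is not based on any real Youtube or editor data; it only shows how an average computed over "people who didn't give up" can get worse even when every single user gets faster:

<syntaxhighlight lang="python">
import random

random.seed(0)

# Invented population: 70% fast connections, 30% very slow ones.
# Each value is "seconds per unit of page weight" for one user.
fast_conn = [random.uniform(0.5, 2.0) for _ in range(70_000)]
slow_conn = [random.uniform(50.0, 200.0) for _ in range(30_000)]
users = fast_conn + slow_conn

BLOATED_PAGE = 10.0   # relative weight of the heavy page
LIGHT_PAGE = 1.0      # relative weight of the optimized page
PATIENCE = 300.0      # seconds before a user gives up; abandoned loads never reach the stats

def measured_mean(page_weight):
    """Average load time among users who actually waited for the page to finish."""
    times = [u * page_weight for u in users]
    completed = [t for t in times if t <= PATIENCE]
    return sum(completed) / len(completed), len(completed)

for name, weight in [("bloated", BLOATED_PAGE), ("optimized", LIGHT_PAGE)]:
    mean, n = measured_mean(weight)
    print(f"{name:9s}: {n:6d} measured loads, mean {mean:6.1f} s")

# The optimized page is 10x faster for every single user, yet its
# *measured* average is worse, because the bloated page silently
# excluded everyone on a bad connection from the sample.
</syntaxhighlight>

With these numbers the bloated page reports roughly a 12-second average over 70,000 completed loads, while the optimized page reports roughly a 38-second average over all 100,000, purely because the bad-connection users only show up in the second sample.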