The following patch for the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works (it removes noindex pages from the sitemap file) only for the behavior switch magic word (__NOINDEX__). It does not remove pages marked 'noindex' via LocalSettings.php from the generated sitemap file.
I think there should be a solution to this; otherwise Wikipedia would have trouble excluding talk pages from its sitemap, which I don't think it does: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing
Now, the wiki in question is noindex by default, because LocalSettings.php contains: $wgDefaultRobotPolicies = true; Pages that are meant to be indexed have Template:INDEX added to them. The desired solution is therefore to generate a sitemap only for pages that contain __INDEX__ or Template:INDEX, or whose HTML output indicates 'index'.
```diff
diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php
@@ -305,15 +305,27 @@
 	 * @return IResultWrapper
 	 */
 	private function getPageRes( $namespace ) {
-		return $this->dbr->select( 'page',
+		return $this->dbr->select(
+			[ 'page', 'page_props' ],
 			[
 				'page_namespace',
 				'page_title',
 				'page_touched',
-				'page_is_redirect'
+				'page_is_redirect',
+				'pp_propname',
 			],
 			[ 'page_namespace' => $namespace ],
-			__METHOD__
+			__METHOD__,
+			[],
+			[
+				'page_props' => [
+					'LEFT JOIN',
+					[
+						'page_id = pp_page',
+						'pp_propname' => 'noindex'
+					]
+				]
+			]
 		);
 	}
 
@@ -335,7 +347,13 @@
 			$fns = $contLang->getFormattedNsText( $namespace );
 			$this->output( "$namespace ($fns)\n" );
 			$skippedRedirects = 0; // Number of redirects skipped for that namespace
+			$skippedNoindex = 0; // Number of pages with switch for that NS
 			foreach ( $res as $row ) {
+				if ( $row->pp_propname === 'noindex' ) {
+					$skippedNoindex++;
+					continue;
+				}
+
 				if ( $this->skipRedirects && $row->page_is_redirect ) {
 					$skippedRedirects++;
 					continue;
@@ -380,6 +398,10 @@
 				}
 			}
 
+			if ( $skippedNoindex > 0 ) {
+				$this->output( " skipped $skippedNoindex page(s) with switch\n" );
+			}
+
 			if ( $this->skipRedirects && $skippedRedirects > 0 ) {
 				$this->output( " skipped $skippedRedirects redirect(s)\n" );
 			}
```
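
For the inverse case described above (a wiki that is noindex by default, where only pages carrying the index switch should be listed), the per-row check could perhaps be turned around. The following is only a minimal standalone sketch, not part of the Gerrit change. The function name `shouldListInSitemap` and the `$defaultNoindex` flag are hypothetical. It assumes the switch is recorded in page_props under the property name 'index' (the counterpart of the 'noindex' value the patch already relies on), that Template:INDEX simply transcludes that switch, and that the join condition is widened to fetch both values, for example `'pp_propname' => [ 'index', 'noindex' ]`.

```php
<?php
/**
 * Hypothetical helper (not part of MediaWiki): decide whether a row from the
 * page / page_props LEFT JOIN belongs in the sitemap.
 *
 * @param string|null $propName pp_propname from the join: 'index', 'noindex',
 *  or null when the page carries neither switch
 * @param bool $defaultNoindex Assumed flag: true when the wiki-wide robot
 *  policy configured in LocalSettings.php is noindex
 * @return bool
 */
function shouldListInSitemap( ?string $propName, bool $defaultNoindex ): bool {
	if ( $propName === 'noindex' ) {
		// An explicit noindex switch always excludes the page.
		return false;
	}
	if ( $propName === 'index' ) {
		// An explicit index switch always includes the page.
		return true;
	}
	// No switch on the page: fall back to the wiki-wide default.
	return !$defaultNoindex;
}

// Made-up example rows standing in for the database result.
$rows = [
	[ 'title' => 'Main_Page', 'prop' => 'index' ],
	[ 'title' => 'Draft', 'prop' => null ],
	[ 'title' => 'Secret', 'prop' => 'noindex' ],
];

$listed = [];
foreach ( $rows as $row ) {
	if ( shouldListInSitemap( $row['prop'], true ) ) {
		$listed[] = $row['title'];
	}
}

// On a noindex-by-default wiki only the explicitly indexed page is listed.
print implode( "\n", $listed ) . "\n"; // prints: Main_Page
```

In generateSitemap.php such a check would take the place of the `$row->pp_propname === 'noindex'` test inside the `foreach`. Note that this still does not cover per-namespace or per-article robot policies set in LocalSettings.php, and a page carrying both switches would produce two joined rows that need separate handling.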