The following patch for the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works only for the behavior switch magic word (__NOINDEX__). It does not remove from the sitemap pages that are marked 'noindex' via LocalSettings.php.
I think there must be a solution to this, because otherwise Wikipedia would have a problem excluding talk pages from its sitemap, and I don't think it does: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing
Now, the wiki in question is noindex by default, because of $wgDefaultRobotPolicies = true; in LocalSettings.php. Only pages that are meant to be indexed have {{INDEX}} added to them. Thus the desired solution is to generate a sitemap only for pages that contain __INDEX__ or {{INDEX}}, or that indicate 'index' in the HTML output of the page.
diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php
@@ -305,15 +305,27 @@
 	 * @return IResultWrapper
 	 */
 	private function getPageRes( $namespace ) {
-		return $this->dbr->select( 'page',
+		return $this->dbr->select(
+			[ 'page', 'page_props' ],
 			[
 				'page_namespace',
 				'page_title',
 				'page_touched',
-				'page_is_redirect'
+				'page_is_redirect',
+				'pp_propname',
 			],
 			[ 'page_namespace' => $namespace ],
-			__METHOD__
+			__METHOD__,
+			[],
+			[
+				'page_props' => [
+					'LEFT JOIN',
+					[
+						'page_id = pp_page',
+						'pp_propname' => 'noindex'
+					]
+				]
+			]
 		);
 	}
 
@@ -335,7 +347,13 @@
 			$fns = $contLang->getFormattedNsText( $namespace );
 			$this->output( "$namespace ($fns)\n" );
 			$skippedRedirects = 0; // Number of redirects skipped for that namespace
+			$skippedNoindex = 0; // Number of pages with __NOINDEX__ switch for that NS
 			foreach ( $res as $row ) {
+				if ( $row->pp_propname === 'noindex' ) {
+					$skippedNoindex++;
+					continue;
+				}
+
 				if ( $this->skipRedirects && $row->page_is_redirect ) {
 					$skippedRedirects++;
 					continue;
@@ -380,6 +398,10 @@
 				}
 			}
 
+			if ( $skippedNoindex > 0 ) {
+				$this->output( "  skipped $skippedNoindex page(s) with __NOINDEX__ switch\n" );
+			}
+
 			if ( $this->skipRedirects && $skippedRedirects > 0 ) {
 				$this->output( "  skipped $skippedRedirects redirect(s)\n" );
 			}
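
For the noindex-by-default case described above, the per-row check would probably have to be inverted: instead of skipping rows that carry the 'noindex' property, keep only rows that carry 'index' whenever the site default is noindex. The following is only a minimal sketch of that decision in plain PHP, outside of MediaWiki. The function name shouldListInSitemap and both of its parameters are hypothetical and not part of generateSitemap.php; the sketch assumes that __INDEX__ is stored in page_props under the name 'index', in the same way the patch relies on 'noindex', and it assumes that an explicit __NOINDEX__ should win if a page somehow carries both switches.

```php
<?php
// Hypothetical helper, NOT part of generateSitemap.php.
// $defaultPolicy: site-wide robots policy string, e.g. 'noindex,nofollow'.
// $pageProps:     pp_propname values found in page_props for one page.
function shouldListInSitemap( string $defaultPolicy, array $pageProps ): bool {
	// Split 'noindex,nofollow' into [ 'noindex', 'nofollow' ].
	$defaults = array_map( 'trim', explode( ',', strtolower( $defaultPolicy ) ) );

	// Assumption: an explicit __NOINDEX__ always wins.
	if ( in_array( 'noindex', $pageProps, true ) ) {
		return false;
	}

	// An explicit __INDEX__ overrides a noindex default.
	if ( in_array( 'index', $pageProps, true ) ) {
		return true;
	}

	// No per-page switch: fall back to the site-wide default.
	return !in_array( 'noindex', $defaults, true );
}

// Wiki that is noindex by default:
var_dump( shouldListInSitemap( 'noindex,nofollow', [ 'index' ] ) ); // bool(true)
var_dump( shouldListInSitemap( 'noindex,nofollow', [] ) );          // bool(false)
// Wiki that is indexed by default, page carries __NOINDEX__:
var_dump( shouldListInSitemap( 'index,follow', [ 'noindex' ] ) );   // bool(false)
```

To feed such a check from the patched script, the LEFT JOIN condition would presumably have to match 'index' as well as 'noindex'. That could return more than one row per page, so the rows would need to be grouped by page before deciding. Namespace-level or per-article policies set in LocalSettings.php are not covered by this sketch.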