Topic on Manual talk:GenerateSitemap.php

generateSitemap.php should remove __NOINDEX__ pages added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php

Goodman Andrew (talk | contribs)

The following patch to the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works (it removes noindex pages from the sitemap file) only for the behavior switch magic word (__NOINDEX__). It does not remove pages marked 'noindex' via LocalSettings.php from the generated sitemap file.

I think there might be a solution to this because, if there weren't, Wikipedia would have a problem excluding talk pages from its sitemap, which I think it doesn't: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing

Now, the wiki in question is noindex by default, because $wgDefaultRobotPolicies = true; is set in LocalSettings.php. Pages that are to be indexed have Template:INDEX added to them. Thus the desired solution is to generate the sitemap only for pages that contain __INDEX__ or Template:INDEX, or that indicate 'index' in the HTML output of the page.
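For reference, a sketch of how a default-noindex setup is usually written in LocalSettings.php. As far as the MediaWiki manual documents them, the settings are named $wgDefaultRobotPolicy (a string) and $wgNamespaceRobotPolicies (an array keyed by namespace); the values below are illustrative assumptions, not the actual configuration of the wiki in question.

```php
// LocalSettings.php fragment (illustrative values only)

// Assumed site-wide default: tell crawlers not to index any page.
$wgDefaultRobotPolicy = 'noindex,nofollow';

// Assumed per-namespace override: NS_HELP is only an example of a
// namespace that could be opened up to indexing.
$wgNamespaceRobotPolicies = [
	NS_HELP => 'index,follow',
];
```

With a setup like this, a page outside the overridden namespaces is only indexed if it opts in, for example through __INDEX__.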

```diff
diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php
@@ -305,15 +305,27 @@
 	 * @return IResultWrapper
 	 */
 	private function getPageRes( $namespace ) {
-		return $this->dbr->select( 'page',
+		return $this->dbr->select(
+			[ 'page', 'page_props' ],
 			[
 				'page_namespace',
 				'page_title',
 				'page_touched',
-				'page_is_redirect'
+				'page_is_redirect',
+				'pp_propname',
 			],
 			[ 'page_namespace' => $namespace ],
-			__METHOD__
+			__METHOD__,
+			[],
+			[
+				'page_props' => [
+					'LEFT JOIN',
+					[
+						'page_id = pp_page',
+						'pp_propname' => 'noindex'
+					]
+				]
+			]
 		);
 	}
@@ -335,7 +347,13 @@
 			$fns = $contLang->getFormattedNsText( $namespace );
 			$this->output( "$namespace ($fns)\n" );
 			$skippedRedirects = 0; // Number of redirects skipped for that namespace
+			$skippedNoindex = 0; // Number of pages with switch for that NS
 			foreach ( $res as $row ) {
+				if ( $row->pp_propname === 'noindex' ) {
+					$skippedNoindex++;
+					continue;
+				}
+
 				if ( $this->skipRedirects && $row->page_is_redirect ) {
 					$skippedRedirects++;
 					continue;
@@ -380,6 +398,10 @@
 				}
 			}
 
+			if ( $skippedNoindex > 0 ) {
+				$this->output( " skipped $skippedNoindex page(s) with switch\n" );
+			}
+
 			if ( $this->skipRedirects && $skippedRedirects > 0 ) {
 				$this->output( "  skipped $skippedRedirects redirect(s)\n" );
 			}
```
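The patch above only looks at the 'noindex' page property. A minimal sketch of the kind of check that would also respect the robot policies from LocalSettings.php might look like the following. The function name shouldListInSitemap and its parameters are hypothetical and not part of generateSitemap.php; $defaultPolicy and $namespacePolicies stand in for the wiki's default and per-namespace robot policy settings, and the precedence rules are a simplification of what MediaWiki itself does when it renders the robots meta tag.

```php
<?php
/**
 * Hypothetical helper (not part of MediaWiki): decide whether a page
 * belongs in the sitemap.
 *
 * @param int $namespace Namespace ID of the page
 * @param string[] $pageProps pp_propname values found for the page
 * @param string $defaultPolicy Site-wide policy, e.g. 'noindex,nofollow'
 * @param array<int,string> $namespacePolicies Per-namespace overrides
 */
function shouldListInSitemap(
	int $namespace,
	array $pageProps,
	string $defaultPolicy,
	array $namespacePolicies
): bool {
	// Assumption: a page-level __INDEX__ / __NOINDEX__ switch wins
	// over whatever the configuration says.
	if ( in_array( 'index', $pageProps, true ) ) {
		return true;
	}
	if ( in_array( 'noindex', $pageProps, true ) ) {
		return false;
	}

	// Otherwise use the namespace policy, falling back to the default.
	$policy = $namespacePolicies[$namespace] ?? $defaultPolicy;
	$directives = array_map( 'trim', explode( ',', strtolower( $policy ) ) );

	return !in_array( 'noindex', $directives, true );
}

// Default-noindex wiki: only pages that opt in are listed.
var_dump( shouldListInSitemap( 0, [ 'index' ], 'noindex,nofollow', [] ) );
var_dump( shouldListInSitemap( 0, [], 'noindex,nofollow', [] ) );
```

In the script itself, this would mean joining page_props for both the 'index' and 'noindex' properties and passing the configured policies into the loop, rather than testing pp_propname against 'noindex' alone.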

Goodman Andrew (talk | contribs)

How does one skip redirects, or namespace redirects that are added via LocalSettings.php, during sitemap generation?