Topic on Project:Support desk

Disallowing /w/index.php? in robots.txt - will it stop crawling the whole site?

Summary by Biologically

Explains how to use robots.txt for MediaWiki sites (and others) installed in the web root directory (in this case public_html).

Biologically (talkcontribs)

If I write this in the robots.txt file -

Disallow: /w/index.php?

Will it stop crawling of the whole site? In other words, will it cause Google to drop the site from its index?

Bawolff (talkcontribs)

Depends on your URL setup. Also, you need to specify a user-agent.

If you have pages listed as /wiki/page_name_here or /w/index.php/page_name_here, then the answer is no.

If you really want noindexing, you should probably just do

User-agent: *
Disallow: /

(or, if there are other things on your domain, Disallow: /w/)
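
To see why the answer depends on the URL setup, here is a minimal sketch of how a crawler might evaluate a rule. It assumes plain prefix matching with the longest matching rule winning (the behaviour Google documents and RFC 9309 describes) and ignores the * and $ wildcards. The function name is_allowed and the sample paths are made up for illustration.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs, e.g. ("disallow", "/w/").
    The longest matching prefix wins; on a tie, "allow" wins.
    Wildcards (* and $) are deliberately not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        if len(prefix) > best_len or (len(prefix) == best_len and directive == "allow"):
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


# The rule from the question, as it would apply to every user-agent.
rules = [("disallow", "/w/index.php?")]

for path in ("/wiki/Main_Page",
             "/w/index.php/Main_Page",
             "/w/index.php?title=Main_Page&action=edit"):
    print(path, "->", "allowed" if is_allowed(rules, path) else "blocked")
```

Only the last path starts with the disallowed prefix, so only query-string URLs are affected. Real crawlers may differ in details, so treat this as an approximation.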

Biologically (talkcontribs)

Thank you so much for explaining. My site is installed in the web root folder (public_html in Apache), and my short URLs have the structure http://site_name.com/all/page_name_here, where I use "all" in place of the "wiki" in your example.

I also don't want to no-index the site completely. So, although my site is in the "/" directory (web root, or public_html), I probably can't use the following (can you please confirm?) -

User-agent: *
Disallow: /

as compared to -

User-agent: *
Disallow: /w/

which most wikis use because they install the site in the /w/ directory.

So, using -

User-agent: *
Disallow: /index.php?

Would it completely block my site from being crawled?

Bawolff (talkcontribs)

If you just want to allow your /all/ directory, you can do something like

User-agent: *
Disallow: /
Allow: /all/

Blocking /index.php? should block all non-normal page views (normal page views are still possible via /index.php/page_here).
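
As a rough check of both suggestions, the sketch below parses a robots.txt snippet and tests a few paths. It assumes a single "User-agent: *" group, plain prefix matching, and longest-match precedence with Allow winning ties (as in Google's documentation and RFC 9309). The helper names parse_rules and can_crawl and the page names are hypothetical. Python's built-in urllib.robotparser is avoided on purpose because it applies the first matching rule rather than the longest one.

```python
def parse_rules(robots_txt):
    """Collect (directive, prefix) pairs; assumes one 'User-agent: *' group."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value:
            rules.append((field, value))
    return rules


def can_crawl(rules, path):
    """Longest matching prefix wins; 'allow' wins a tie; no match means allowed."""
    best = ("allow", "")
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best[1])
        tie_allow = len(prefix) == len(best[1]) and directive == "allow"
        if longer or tie_allow:
            best = (directive, prefix)
    return best[0] == "allow"


# Variant 1: block everything except the short-URL directory.
only_all = parse_rules("User-agent: *\nDisallow: /\nAllow: /all/\n")

# Variant 2: block only query-string views under /index.php.
no_query = parse_rules("User-agent: *\nDisallow: /index.php?\n")

for path in ("/all/Main_Page",
             "/index.php/Main_Page",
             "/index.php?title=Main_Page&action=history"):
    print(path, can_crawl(only_all, path), can_crawl(no_query, path))
```

Under these assumptions, neither variant blocks /all/ pages, so the site as a whole stays crawlable. The first variant also blocks /index.php/page_here, while the second leaves it open.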

Biologically (talkcontribs)

Thank you.