
User:OrenBochman/Bugs

From mediawiki.org

Bugs Fix Plan for Search

Bug Id & Name | Testing | Classification | Comments
32655 Improving search for templates | search for {{Authority control}}, {{Authority | indexing, ui, ranking | see options below

Specifying This Behaviour

  1. use case 1: readers who do not want to see templates in their search results.
  2. use case 2: editors who want to find a template to use (knowing its name).
  3. use case 3: editors who want to find a suitable template in a category.
  4. use case 4: template developers who want to find all the pages where a template is used.
  5. use case 5: template developers who want to find all templates that use a template.
  6. use case 6: template developers who want to find all templates in a template category.
  7. use case 7: admins who want to find all pages using a template.
  8. use case 8: admins who want to find all pages using a template with certain parameter values.
  9. use case 9: admins who want to find all pages using non-existent templates.
  10. use case 10: users who want to find all pages containing arbitrary code.

Open Questions

  • are there any more use cases?
  • how common are these situations?
  • what is the current practice for the above use cases?
    1. use case 2: Special:WhatLinksHere.
    2. use case 3: look at the templates category.
  • should the search results differentiate between templates that exist and templates that don't?
  • what about transclusion from outside the template namespace?
    • when templates do not contain template syntax, should they be shown?
    • when a template is not in the template namespace (say, in a user's namespace), how can we know it is a template?

Analysis


Here are some possible approaches to implementing this feature.

  1. Option 1: Quick and Dirty
    1. storing the page's raw, unexpanded source in a field named source
    2. querying with a litralStringQuery and a litralStringPrefixQuery.
    3. it would roughly double the index size, adding about one WFTU[1] per wiki.
    4. it requires no UI change, just extra syntax and documentation:
      1. source:text → to search for text in the wiki source
      2. source:"text" → to search for the exact text in the wiki source
      3. source:text* → to search the wiki source for words starting with text
      4. source:{{text}} → to search the wiki source for the template call {{text}}
      5. source:{{text*}} → to search the wiki source for calls to templates whose names start with text
    5. it may require its own ranking.
  2. Option 2: Elegant
    1. indexing and storing the page's parsed source in a parsedSourceTreeField
    2. querying with a sourceSearchQuery to search the source
    3. it would increase the index size by about one WFTU.
    4. it could require a UI change.
    5. it could require its own ranking.
  3. Option 3: Efficient
    1. indexing the page's parsed source in a flat parsedSourceField
    2. querying with a sourceSearchQuery, which would provide markup search capability.
    3. it would increase the index size by about log(WFTU) (this is a guess).
    4. it could require a UI change.
    5. it could require its own ranking.
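
The extra syntax in Option 1 could be recognised before the query reaches the index. The following is a minimal sketch, assuming hypothetical names (SourceQuerySketch, Kind, Parsed, classify) that are not part of the existing search code. It only classifies the five source: forms listed under Option 1 and does not build any Lucene query.

```java
import java.util.Optional;

/** Hypothetical sketch: classify the source: forms listed under Option 1. */
class SourceQuerySketch {

    /** Query kinds implied by the five syntax examples (the names are assumptions). */
    enum Kind { TERM, EXACT, PREFIX, TEMPLATE, TEMPLATE_PREFIX }

    /** A classified query: its kind plus the bare text to look up. */
    static final class Parsed {
        final Kind kind;
        final String text;

        Parsed(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Returns empty when the input does not start with the source: keyword. */
    static Optional<Parsed> classify(String query) {
        String keyword = "source:";
        if (!query.startsWith(keyword)) {
            return Optional.empty();
        }
        String body = query.substring(keyword.length()).trim();
        // source:"text" -> exact match, quotes stripped
        if (body.length() >= 2 && body.startsWith("\"") && body.endsWith("\"")) {
            return Optional.of(new Parsed(Kind.EXACT, body.substring(1, body.length() - 1)));
        }
        // source:{{text}} or source:{{text*}} -> template call, braces stripped
        if (body.length() >= 4 && body.startsWith("{{") && body.endsWith("}}")) {
            String name = body.substring(2, body.length() - 2).trim();
            if (name.endsWith("*")) {
                return Optional.of(new Parsed(Kind.TEMPLATE_PREFIX, name.substring(0, name.length() - 1)));
            }
            return Optional.of(new Parsed(Kind.TEMPLATE, name));
        }
        // source:text* -> prefix match, star stripped
        if (body.endsWith("*")) {
            return Optional.of(new Parsed(Kind.PREFIX, body.substring(0, body.length() - 1)));
        }
        // source:text -> plain term
        return Optional.of(new Parsed(Kind.TERM, body));
    }
}
```

Keeping the classification separate from query construction would let the same front end serve whichever of the three options is eventually chosen.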

Option 1 will likely be inefficient. To index wiki code effectively, a (Java) parser for wiki code would be required. The parser must be able to process and tag:

  • templates
  • template parameters
  • magic words
  • parser functions
  • extensions
  • comments
  • nowiki
  • includeonly
  • noinclude
    1. I have been doing some work on writing a preprocessor, but the work is far from over; once completed, it could do this task.
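
To make the tagging requirement concrete, here is a minimal sketch of the simplest item in the list: finding template calls in raw wikitext. It is a hypothetical illustration (the class TemplateScanSketch and its method are invented here and are not the preprocessor mentioned in this section). It deliberately ignores triple-brace template parameters, parser functions, comments, nowiki, includeonly and noinclude, all of which a real parser would have to handle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: list the names of {{...}} calls found in raw wikitext. */
class TemplateScanSketch {

    /**
     * Returns template names in the order their closing braces appear,
     * so a nested template is reported before the template that contains it.
     */
    static List<String> templateNames(String wikitext) {
        List<String> names = new ArrayList<>();
        Deque<Integer> openPositions = new ArrayDeque<>(); // index just after each unmatched "{{"
        int i = 0;
        while (i + 1 < wikitext.length()) {
            if (wikitext.startsWith("{{", i)) {
                openPositions.push(i + 2);
                i += 2;
            } else if (wikitext.startsWith("}}", i) && !openPositions.isEmpty()) {
                String inner = wikitext.substring(openPositions.pop(), i);
                // The name is everything before the first parameter separator.
                int bar = inner.indexOf('|');
                String name = (bar >= 0 ? inner.substring(0, bar) : inner).trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
                i += 2;
            } else {
                i++;
            }
        }
        return names;
    }
}
```

Even this toy version shows why Option 1 is limited: a literal string match on the source cannot tell a template call from the same text inside a comment or a nowiki block, whereas a tagging parser can.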

Ranking & User Interface

  • it is possible to avoid a UI change by adding new search syntax.
  • if the source search feature functions as a stand-alone application, its ranking will need just a little tweaking.
  • if it must be integrated with general search, it will require a more significant effort involving:
    • specification.
    • design.
    • implementation.
Bug Id & Name | Testing | Classification | Comments
23629 incorrect UTF-8 processing on output of page and section titles | search for א | Render Results

Specifying This Behaviour


Highlighted text in search results is sometimes corrupt when showing multibyte characters.

Open Questions

  • where is this behaviour taking place?
    • (analyzer) during indexing
    • (analyzer) during retrieval
    • (highlighter) during result rendering
    • later in PHP

Analysis

  • investigate by unit testing
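
One unit test worth writing first checks whether a snippet is ever cut at a byte offset rather than at a character offset. The sketch below is an assumption about one possible cause, not a diagnosis of this bug. It shows that cutting the UTF-8 bytes of א in the middle produces the replacement character U+FFFD, which matches the kind of corruption described above. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: how cutting text at a byte offset corrupts multibyte characters. */
class Utf8CutSketch {

    /** Decodes only the first byteCount bytes of the UTF-8 form of text. */
    static String cutByBytes(String text, int byteCount) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int length = Math.min(byteCount, bytes.length);
        // A truncated multibyte sequence is decoded as U+FFFD.
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    /** Cuts after the given number of code points, which never splits a character. */
    static String cutByCodePoints(String text, int count) {
        int total = text.codePointCount(0, text.length());
        int end = text.offsetByCodePoints(0, Math.min(count, total));
        return text.substring(0, end);
    }
}
```

Here cutByBytes("א", 1) returns a single U+FFFD, while cutByCodePoints("א", 1) returns the letter intact. Running the same comparison at each stage (analyzer, highlighter, PHP output) would help answer where the corruption takes place.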


Bug Id & Name | Testing | Classification | Comments
Bug 20173 - Lucene Search update script fails while downloading DTD | search for א | Render Results

Specifying This Behaviour


When running the update script, the DTD download fails with "Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd".

This is the explanation of the 503 response code given on w3.org (HTTP/1.1, RFC 2616, section 10.5.4):

10.5.4 503 Service Unavailable

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

Note: The existence of the 503 status code does not imply that a server must use it when becoming overloaded. Some servers may wish to simply refuse the connection.
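
Since a 503 is defined as temporary, the update script could wait and retry instead of failing. The following is a minimal sketch of the delay calculation only, assuming hypothetical names (RetryAfterSketch, delaySeconds) and assuming the header carries a number of seconds. The HTTP-date form of Retry-After is not handled and falls back to the default.

```java
/** Hypothetical sketch: decide how long to wait after a 503 response. */
class RetryAfterSketch {

    /**
     * Returns the delay in seconds taken from a Retry-After header value,
     * or defaultSeconds when the header is missing, negative, or not a plain number.
     */
    static long delaySeconds(String retryAfter, long defaultSeconds) {
        if (retryAfter == null) {
            return defaultSeconds;
        }
        try {
            long seconds = Long.parseLong(retryAfter.trim());
            return seconds >= 0 ? seconds : defaultSeconds;
        } catch (NumberFormatException e) {
            // Most likely an HTTP-date, which this sketch does not parse.
            return defaultSeconds;
        }
    }
}
```

With HttpURLConnection, the header value would come from getHeaderField("Retry-After"), which returns null when the header is absent.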

Open Questions

  • how to reproduce the error?

Analysis


Looking at the stack trace, the error occurs at:

  • org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:64) called by
    • org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64) called by
      • org.wikimedia.lsearch.oai.IncrementalUpdater line:191
  • workarounds
    1. use commons-httpclient instead of HttpURLConnection (open question: how to tell Xerces to use it?)
    2. try to clear the proxy
  // set the proxy only for the duration of the request
  System.setProperty("http.proxyHost", proxyHost);
  System.setProperty("http.proxyPort", proxyPort);
  // ... some code ...
  // clear the proxy settings afterwards
  System.clearProperty("http.proxyHost");
  System.clearProperty("http.proxyPort");
  • testing
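
Another workaround sometimes used for this class of failure, not listed above and untested against lsearch, is to stop the XML parser from fetching the DTD at all by installing an org.xml.sax.EntityResolver. The sketch below uses only JDK classes. The class name LocalDtdSketch is hypothetical, and whether OAIParser exposes a place to set a resolver is an assumption. Answering with an empty DTD means named entities such as &nbsp; would no longer resolve, so serving a local copy of the DTD from the resolver may be preferable.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch: parse XML without downloading the DTD named in its DOCTYPE. */
class LocalDtdSketch {

    static Document parseWithoutDtdFetch(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external entity request with an empty document,
            // so no HTTP request to www.w3.org is ever made.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

A SAX-based parser accepts the same resolver through XMLReader.setEntityResolver.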

multithreading


http://phplens.com/phpeverywhere/?q=node/view/254

missing pages


Debugging the page ID of a missing main page (namespace 4 is the Project namespace):

SELECT page_id
  FROM `page`
 WHERE page_namespace = 4
   AND page_title = 'Main_Page'
 LIMIT 0, 30;

Debugging the page ID of a missing category page (namespace 14 is the Category namespace):

SELECT page_id
  FROM `page`
 WHERE page_namespace = 14
   AND page_title = 'Latin_nouns'
 LIMIT 0, 30;


SQL schema


https://secure.wikimedia.org/wikipedia/mediawiki/wiki/File:MediaWiki_database_schema_1-17_%28r82044%29.png

References

  1. WFTU: wiki full-text unit, i.e. the size of all the text in a wiki.