User:OrenBochman/Bugs
Bug Fix Plan for Search
Bugzilla Links

Bug Id & Name | Testing | Classification | Comments |
---|---|---|---|
32655 Improving search for templates | search for {{Authority control}}, {{Authority | indexing, ui, ranking | see options below |
Specifying This Behaviour
- use case 1: readers who do not want to see templates in their search results.
- use case 2: editors who want to find a template to use (knowing its name).
- use case 3: editors who want to find a suitable template in a category.
- use case 4: template developers who want to find all the pages where a template is used.
- use case 5: template developers who want to find all templates that use a template.
- use case 6: template developers who want to find all templates in a template category.
- use case 7: admins who want to find all pages using a template.
- use case 8: admins who want to find all pages using a template with certain parameter values.
- use case 9: admins who want to find all pages using nonexistent templates.
- use case 10: users who want to find all pages containing arbitrary code.
Open Questions
- are there more use cases?
- how common are these situations?
- what is the current practice for the above use cases?
  - use case 2: Special:WhatLinksHere.
  - use case 3: look at the template's category.
- should the search results differentiate between templates that exist and templates that don't?
- what about transclusion from outside the template namespace?
- when templates do not contain template syntax, should they be shown?
- when a template is not in the template namespace (say, in a user's), how can we know it is a template?
Analysis
Here are some possible approaches to implementing this feature.
- Option 1: Quick and Dirty
  - store the raw, unexpanded page source in a `source` field.
  - query with a `literalStringQuery` and `literalStringPrefixQuery`.
  - it will roughly double the index size: an extra WFTU[1] per wiki.
  - it requires no UI change, just extra syntax plus documentation:
    - `source:text` → search for wiki source text
    - `source:"text"` → search for exact wiki source text
    - `source:text*` → prefix search for wiki source text
    - `source:{{text}}` → search for a template invocation in wiki source
    - `source:{{text*}}` → prefix search for a template invocation in wiki source
  - it may require its own ranking.
- Option 2: Elegant
  - index and store the page's parsed source in a `parsedSourceTreeField`.
  - query with a `sourceSearchQuery` to search the source.
  - it would increase the index size by one WFTU.
  - it could require a UI change.
  - it could require its own ranking.
- Option 3: Efficient
  - index the page's parsed source in a flat `parsedSourceField`.
  - query using a `sourceSearchQuery`, which would provide markup search capability.
  - it would increase the index by log(WFTU) (this is a guess).
  - it could require a UI change.
  - it could require its own ranking.
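As an illustration, the `source:` syntax proposed in Option 1 could be classified by a small front-end parser before the actual Lucene queries are built. The class and query names above (`literalStringQuery`, `literalStringPrefixQuery`) are placeholders, so this sketch just returns a query kind:

```java
public class SourceQueryParser {
    enum Kind { EXACT_PHRASE, PREFIX, TERM }

    /** Classifies the term following a "source:" prefix, per the proposed syntax. */
    static Kind classify(String query) {
        if (!query.startsWith("source:"))
            throw new IllegalArgumentException("not a source: query");
        String term = query.substring("source:".length());
        if (term.length() >= 2 && term.startsWith("\"") && term.endsWith("\""))
            return Kind.EXACT_PHRASE;   // source:"text" -> literal phrase
        if (term.endsWith("*") || term.endsWith("*}}"))
            return Kind.PREFIX;         // source:text* or source:{{text*}}
        return Kind.TERM;               // source:text or source:{{text}}
    }

    public static void main(String[] args) {
        System.out.println(classify("source:{{Authority control}}")); // TERM
        System.out.println(classify("source:Auth*"));                 // PREFIX
    }
}
```

A real implementation would also need to escape Lucene metacharacters inside the literal term.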
Option 1 will likely be inefficient. To effectively index wiki code, a (Java) parser for wiki code would be required. The parser must be able to process and tag:
- templates
- template parameters
- magic words
- parser functions
- extensions
- comments
- nowiki
- includeonly
- noinclude
- I have been doing some work on writing a preprocessor, but the work is far from over; it could be completed to do this task.
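To illustrate the kind of tagging the parser must do, here is a minimal regex sketch that strips comments and extracts template names. It is illustrative only, not the preprocessor mentioned above: it ignores nesting, parser functions, nowiki, and includeonly/noinclude, all of which a real parser must handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch: tag template invocations in raw wiki source. */
public class WikiSourceTagger {
    // a non-nested {{Name|...}} call: group 1 is the template name
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{([^{}|]+)[^{}]*\\}\\}");
    private static final Pattern COMMENT  = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the names of (non-nested) template invocations in the source. */
    public static List<String> templateNames(String source) {
        // strip comments first so commented-out templates are not indexed
        String stripped = COMMENT.matcher(source).replaceAll("");
        List<String> names = new ArrayList<>();
        Matcher m = TEMPLATE.matcher(stripped);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String src = "text {{Authority control}} <!-- {{Hidden}} --> {{cite web|url=x}}";
        System.out.println(templateNames(src)); // [Authority control, cite web]
    }
}
```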
Ranking & User Interface
- it is possible to avoid a UI change by adding a new search syntax.
- if the source search feature functions as a stand-alone application, its ranking will need just a little tweaking.
- if it is necessary to integrate it with general search, it will require a more significant effort involving:
- specification.
- design.
- implementation.
Bug Id & Name | Testing | Classification | Comments |
---|---|---|---|
23629 incorrect UTF-8 processing on output of page and section titles | search for א | Render Results | |
Specifying This Behaviour
Highlighted text in search results is sometimes corrupted when showing multibyte characters.
Open Questions
- where is this behaviour taking place?
- (analyzer) during indexing
- (analyzer) during retrieval
- (highlighter) during result rendering
- later in php
Analysis
- investigate by unit testing
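One failure mode worth unit testing is an excerpt cut at a byte offset instead of a character offset, which corrupts multibyte UTF-8 text. The helper below is a hypothetical reproduction of the bug pattern, not the actual highlighter code:

```java
import java.nio.charset.StandardCharsets;

public class HighlightTruncationTest {
    /** Naive excerpt that truncates at a byte offset -- the suspected bug pattern. */
    static String naiveExcerpt(String text, int maxBytes) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(maxBytes, utf8.length);
        // decoding a partial multibyte sequence yields U+FFFD replacement chars
        return new String(utf8, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "א" is 2 bytes in UTF-8; cutting after 1 byte corrupts it
        System.out.println(naiveExcerpt("א", 1).contains("\uFFFD")); // true
    }
}
```

A fix truncates on character (or better, codepoint) boundaries before encoding.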
Bug Id & Name | Testing | Classification | Comments |
---|---|---|---|
Bug 20173 - Lucene Search update script fails while downloading DTD | search for א | Render Results | |
Specifying This Behaviour
When running the update script, the DTD download fails with "Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd".
This is the explanation given on w3.org for the 503 response code:
10.5.4 503 Service Unavailable
The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
Note: The existence of the 503 status code does not imply that a server must use it when becoming overloaded. Some servers may wish to simply refuse the connection.
Open Questions
- how to reproduce the error?
Analysis
Looking at the stack trace, the error occurs in:
- org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:64), called by
- org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64), called by
- org.wikimedia.lsearch.oai.IncrementalUpdater, line 191
- workarounds
  - use `commons-httpclient` instead of `HttpUrlConnection` -- how to tell Xerces?
  - try to clear the proxy:

    System.setProperty("http.proxyHost", proxyHost);
    System.setProperty("http.proxyPort", proxyPort);
    // ... some code ...
    System.clearProperty("http.proxyHost");
    System.clearProperty("http.proxyPort");

- testing
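Another workaround sketch: retry the download after a delay, honoring a Retry-After header as the spec quoted above suggests. This helper only computes the delay; the enclosing retry loop and HTTP call are assumed, not taken from the update script:

```java
public class RetryAfter {
    /** Delay before retrying a 503: use the Retry-After header
     *  (delay-seconds form) if present, else a default backoff. */
    static long retryDelayMillis(String retryAfterHeader, long defaultMillis) {
        if (retryAfterHeader == null) return defaultMillis;
        try {
            return Long.parseLong(retryAfterHeader.trim()) * 1000L;
        } catch (NumberFormatException e) {
            // Retry-After may also be an HTTP-date; this sketch
            // falls back to the default in that case
            return defaultMillis;
        }
    }

    public static void main(String[] args) {
        System.out.println(retryDelayMillis("120", 5000L)); // 120000
        System.out.println(retryDelayMillis(null, 5000L));  // 5000
    }
}
```

A cleaner long-term fix is to stop fetching the DTD from w3.org at all, e.g. by resolving it from a local copy via an EntityResolver.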
Extension:DumpHTML extension
[edit]multithreading
[edit]http://phplens.com/phpeverywhere/?q=node/view/254
missing pages
debugging the page id of a missing main page:

SELECT page_id
FROM `page`
WHERE page_namespace = 4
AND page_title = 'Main_Page'
LIMIT 0, 30
debugging the page id of a missing category page:

SELECT page_id
FROM `page`
WHERE page_namespace = 14
AND page_title = 'Latin_nouns'
LIMIT 0, 30
SQL schema
[edit]References
- ↑ WFTU is a "wiki full-text unit": the size of all the text in a wiki.