Topic on Extension talk:SolrStore

summary line wrong

David Mason (talkcontribs)

hi,

We're now trying SolrStore with MW 1.19/Solr 4.1/SMW 1.8. It all seems to work, but the summary line is wrong:

Relevance: 27.0% - 2 KB (19 words) - 22:28, 16 May 2013

The search word is in the title, so relevance should be higher? The article is more than 19 words (not sure about KB size), and the date is incorrect since the article was last modified on the 15th. Is there a known fix?

thanks!

SBachenberg (talkcontribs)

Hi David, nice to hear that it's working with Solr 4.1; we haven't tested it yet. The relevance is a bit tricky, because Solr generates a score for each result based on TF-IDF. Normally you cannot convert a TF-IDF score cleanly into a percentage, but the default MediaWiki search form wants a relevance in percent. We often see relevance values far above 100%, so please do not take it as accurate.
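
One way to get a bounded number to display is to express each score relative to the best hit in the current result set. The following is only a minimal sketch of that idea, not what SolrStore does; the function name relativeRelevance and the sample scores are made up for illustration.

```php
<?php
// Hypothetical helper, not part of SolrStore: scale raw Solr scores so that
// the best hit in the current result set maps to 100%.
function relativeRelevance( array $scores ) {
	// Guard against an empty result set before calling max().
	$max = $scores ? max( $scores ) : 0.0;
	$percentages = array();
	foreach ( $scores as $key => $score ) {
		// An all-zero result set yields 0% everywhere instead of dividing by zero.
		$percentages[$key] = $max > 0 ? round( $score / $max * 100, 1 ) : 0.0;
	}
	return $percentages;
}

// Made-up raw scores for three hits.
$relative = relativeRelevance( array( 2.7, 1.35, 0.27 ) );
print_r( $relative );
```

A value computed this way only ranks hits within one query; it says nothing about how good the best hit is in absolute terms.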

For the last modified date you have to do 2 things:

  1. Look at your Solr search result XML and find the actual field name of your modification date. The problem here is that it depends on the language you are using. In an English wiki it should be "Modification date_dt", in German it's "Zuletzt geändert_dt".
  2. Go into SolrStore/templates/SolrSearchTemplate_Standart.php line 81 and change the comparison from if ( $docd[ 'name' ] == 'Zuletzt geändert_dt'){ to the field name for your language. For English that is: if ( $docd[ 'name' ] == 'Modification date_dt'){
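
Instead of editing the comparison by hand for each language, the two steps above could be sketched as a small lookup. This is an illustration only: the function modificationDateField, the language keys and the English fallback are assumptions, and only the two field names come from the examples above.

```php
<?php
// Hypothetical lookup, not SolrStore code: field names taken from the two
// examples above, keyed by wiki content language code.
function modificationDateField( $langCode ) {
	$fields = array(
		'en' => 'Modification date_dt',
		'de' => 'Zuletzt geändert_dt',
	);
	// Falling back to English for unlisted languages is an assumption.
	return isset( $fields[$langCode] ) ? $fields[$langCode] : $fields['en'];
}

// Stand-in for the attributes of one date element from the Solr result XML.
$docd = array( 'name' => 'Modification date_dt' );
$isModificationDate = ( $docd['name'] == modificationDateField( 'en' ) );
```

For any other language, the field name would still have to be read from your own Solr result XML, as described in step 1.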

EDIT: I just uploaded a fix for the English language to SourceForge: http://sourceforge.net/projects/smwsolrstore/files/SolrStore_0.8.1.zip/download

If you have any other problems, just ask.

Kghbln (talkcontribs)

Heiya Simon,

it would be nice to have the commit for the new version in Gerrit as well. That way all the translation updates would move into this version, too.

Cheers

David Mason (talkcontribs)

Hi,

Thanks for the translation fix, the date is correct now. However, the file size is still incorrect. For example, it shows "212 B (16 words)" for a page that is 752 words, 4534 bytes. As it's different for each entry I presume it's not a translation problem.

How should the "relevance" score be interpreted? Is the sorting correct? I don't want to put something in front of the users that's confusing.

SBachenberg (talkcontribs)

Hi, the score is a correct TF-IDF score: the higher the score the better, and the sorting is also correct.

I'll look into this bytes/words problem. I currently have no idea where the problem is, but I'll answer you as soon as possible.

One thing you should know about the extension is that we currently don't support searching in selected namespaces. You can only search in all namespaces, but you can exclude some namespaces in your LocalSettings.php with the parameter $wgSolrOmitNS.

The default is: $wgSolrOmitNS = array('102' );

You should hide your advanced search options so nobody gets confused. The CSS for that is: .mw-search-formheader div.search-types, #mw-searchoptions{ display: none; }

David Mason (talkcontribs)

Thanks very much for your diligence! Let me know if I can help.

SBachenberg (talkcontribs)

Sorry that it took so long, but I was a bit busy the last few days.

Could you please change the code in the file /SolrStore/Templates/SolrSearchTemplate_Standart.php at line 33 to this:

// get size, namespace, word count, date from XML:
foreach ( $xml->arr as $doc ) {
	switch ( $doc[ 'name' ] ) {
		case 'text':
			$textsnip = '';
			$textsnipvar = 0;
			foreach ( $doc->str as $inner ) {
				$textsnipvar++;
				if ( $textsnipvar >= 4 && $textsnipvar <= $snipmax ) {
					$textsnip .= ' ' . $inner;
				}
				$textsnip = substr( $textsnip, 0, $textlenght );
			}
			$this->mDate = $doc->date;
			break;
		case 'wikitext':
			$this->mSize = strlen( $doc->str );
			$this->mWordCount = count( $doc->str );
			$textsnipy = "";
			$textsnipy = $doc->str;
			$textsnipy = str_replace( '{', '', $textsnipy );
			$textsnipy = str_replace( '}', '', $textsnipy );
			$textsnipy = str_replace( '|', '', $textsnipy );
			$textsnipy = str_replace( '=', ' ', $textsnipy );
			$textsnipy = substr( $textsnipy, 0, $textlenght );
			break;
	}
}

I will upload a fix to SourceForge later.

David Mason (talkcontribs)

Hi again,

I tried this fix; it changes the output, but it's still not correct, unless there is something unusual about how it handles text in SMW templates, where most of our text is located. I also had to comment out code references to $nsText.

SBachenberg (talkcontribs)

Hi, I thought that would solve the problem.

Let me tell you a bit about how we handle the wikitext. We store the wikitext in the Solr field "wikitext" and each SMW attribute in its own field. We also have a field called "text", in which we save all fields combined. Before the patch we used "text" for the calculation; now we have changed it to "wikitext", which should be the right field for that purpose.

All of these fields can be customized through Solr itself, and that's where the problem must be. Could you please have a look at your Solr schema.xml? In line 953 there should be something like this:

<field name="wikitext" type="text_general" indexed="true" stored="true" multiValued="true"/>

This defines "wikitext" with the Solr field type "text_general", which I thought would be the right one, but I never thought about counting words and bytes. Could you please change it to "string", because "text_general" uses analyzers, a tokenizer and a handful of filters. All these things manipulate the original text, which leads to the miscalculation.

The only big problem is that you have to restart your Solr after altering the schema.xml and also have to re-index your wiki, so that the new field definition can take effect.


Please tell me if it works, because re-indexing our SofisWiki takes up to 3 days and you will probably be faster :-)

David Mason (talkcontribs)

Hi again,

I've changed that line and restarted solr, then I ran SMW_refreshData (?) . It's not clear how to refresh the data so I ran SemanticMediaWiki/maintenance/SMW_refreshData.php and also maintenance/runJobs.php. During the former I saw lots of this:

PHP Notice: Array to string conversion in /var/www/mw/extensions/SemanticMediaWiki/includes/storage/SQLStore/SMW_SQLStore3_Writers.php on line 383

Unfortunately now the result looks the same as it did previously, and I see results like this "2 KB (1 word)" — that's some word!

If it helps I could set up an isolated instance for you to connect to directly?

SBachenberg (talkcontribs)

Hi David, that sounds nice, but I think it would be enough if you could send me an XML result from your Solr. Then I can find out why the stored data is not counted correctly.

The way you re-indexed was absolutely right, but the error you get is not from SolrStore. That's a known SMW error: https://bugzilla.wikimedia.org/show_bug.cgi?id=42321

SBachenberg (talkcontribs)

Hi David, I may have found the error. Could you please change

$this->mWordCount = count( $doc->str );

to

$this->mWordCount = str_word_count( $doc->str );

Sorry that fixing this is taking so long; I didn't write this "Template" part. That's all the work of my colleague Sascha Schüller, but he currently has no ambition to fix it.
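
The difference between the two calls can be seen in a small, self-contained sketch. The XML fragment is made up and only shaped like a multi-valued field in a Solr result; it assumes the SimpleXML extension, which the template's $xml->arr loop suggests is in use.

```php
<?php
// Made-up fragment shaped like one multi-valued field in a Solr XML result.
$doc = simplexml_load_string(
	'<arr name="wikitext"><str>Solr indexes the wiki text here</str></arr>'
);

// count() on a SimpleXML node list counts elements, not words.
$elements = count( $doc->str );

// Casting to string yields the text of the first <str> element.
$words = str_word_count( (string)$doc->str );
$bytes = strlen( (string)$doc->str );
```

Here count() returns the number of str elements, which would explain a result like "1 word". Note also that str_word_count only treats runs of letters (plus ' and -) as words, so digits and most punctuation act as separators and its total can differ from other word counters.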

David Mason (talkcontribs)

Hi,

I think that is better; it is higher than the "wc" word count, but that may be down to what it considers "words." I will run this by the users with some caveats.

Thanks again!

I'd like to talk about ways to extend this project, for example to support 'classes' without a php-coded template.

And it could also support uploaded documents (Word, PDF) since it's based on Solr.

Are these being considered?

SBachenberg (talkcontribs)

Hi David,

your ideas sound really good, but I'm not a good "Extension Developer", because I have almost no idea how MediaWiki works internally. But maybe you have the knowledge that I lack. I would also like to change the way the field-based search templates get defined. Writing them into LocalSettings.php is so uncool. It would be much nicer if I could define them with Semantic Forms.


So if you want to extend this project, you can do it on your own or we can do it together. Feel free to ask me anything about this extension.

David Mason (talkcontribs)

Yes, I'm absolutely interested in working on this, though I'm swamped for the next week. Can we meet online next week to talk about it?
