Filtering Text Field Searches

For more detailed information about analyzers, filters and tokenizers, see the following link:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Every text field uses the com.ptc.solr.analysis.PTCWordDelimiterFilterFactory filter. This filter splits words into subwords and performs optional transformations on subword groups. Words are split into subwords using the following rules:

Rule	Example
Split on intra-word delimiters (by default, all non alpha-numeric characters).	"Wi-Fi" splits into "Wi" and "Fi".
Split on case transitions.	"TransAM" splits into "Trans" and "AM".
Leading and trailing intra-word delimiters on each subword are ignored.	"__hello---there, 'dude'" splits into "hello", "there", and "dude".
Trailing "'s" characters are removed for each subword. This step is not performed in a separate filter because of possible subword combinations.	"O'Neil's" splits into "O" and "Neil".

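The splitting rules in the table above can be approximated in a few lines of Python. This is only an illustrative sketch, not the PTC or Solr implementation: the function name split_subwords is hypothetical, the sketch applies the default rules only, and it ignores the protected-character customization. It requires Python 3.7 or later, because it splits on a zero-width match.

```python
import re

def split_subwords(token):
    """Approximate the default subword-splitting rules (illustrative only)."""
    # Remove a possessive "'s" that is not followed by a letter or digit.
    token = re.sub(r"'[sS](?![A-Za-z0-9])", "", token)
    subwords = []
    # Split on runs of non-alphanumeric characters. Leading and trailing
    # delimiters produce empty pieces, which are skipped.
    for piece in re.split(r"[^A-Za-z0-9]+", token):
        if not piece:
            continue
        # Split on lower-to-upper case transitions, for example "TransAM".
        subwords.extend(re.split(r"(?<=[a-z])(?=[A-Z])", piece))
    return subwords

print(split_subwords("Wi-Fi"))                    # ['Wi', 'Fi']
print(split_subwords("TransAM"))                  # ['Trans', 'AM']
print(split_subwords("__hello---there, 'dude'"))  # ['hello', 'there', 'dude']
print(split_subwords("O'Neil's"))                 # ['O', 'Neil']
```
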
This filter is a customized copy of solr.WordDelimiterFilter, which ships with Solr. The customization protects the following characters: ".", "-", and "_".

Splitting is affected by the following parameters:

• generateWordParts=1

Parts of words are generated: "whistle-blower" = "whistle" "blower"

• generateNumberParts=1

Number subwords are generated: "500-42" = "500" "42"

• catenateWords=1

Maximum runs of word parts are catenated: "re-confirm" = "reconfirm"

• catenateNumbers=1

Maximum runs of number parts are catenated: "500-42" = "50042"

• catenateAll=1

All subword parts are catenated: "wi-fi-4000" = "wifi4000"

• splitOnCaseChange=1

Split on case transitions: "PowerShot" = "Power" "Shot"

• preserveOriginal=1

Includes original words in subwords: "500-42" = "500" "42" "500-42"
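
The effect of these parameters can be sketched with a small Python model. It is a simplified illustration under stated assumptions, not the filter's actual code: the function delimiter_tokens and its keyword names are hypothetical, a part counts as a number only if it consists entirely of digits, splitOnCaseChange is left out, and the order of the returned tokens is a choice made for this sketch.

```python
import re
from itertools import groupby

def delimiter_tokens(token, *, generate_word_parts=False,
                     generate_number_parts=False, catenate_words=False,
                     catenate_numbers=False, catenate_all=False,
                     preserve_original=False):
    """Model the splitting parameters on one token (illustrative only)."""
    # Split on delimiters; all-digit parts are numbers, the rest are words.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = []

    def emit(candidate):
        # Keep each generated token once, in first-seen order.
        if candidate and candidate not in out:
            out.append(candidate)

    for part in parts:
        if part.isdigit():
            if generate_number_parts:
                emit(part)
        elif generate_word_parts:
            emit(part)

    # Catenate maximal runs of adjacent word parts or adjacent number parts.
    for is_number, run in groupby(parts, key=str.isdigit):
        joined = "".join(run)
        if is_number and catenate_numbers:
            emit(joined)
        elif not is_number and catenate_words:
            emit(joined)

    if catenate_all:
        emit("".join(parts))
    if preserve_original:
        emit(token)
    return out

print(delimiter_tokens("re-confirm", catenate_words=True))   # ['reconfirm']
print(delimiter_tokens("wi-fi-4000", catenate_all=True))     # ['wifi4000']
print(delimiter_tokens("500-42", generate_number_parts=True,
                       preserve_original=True))              # ['500', '42', '500-42']
```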

The com.ptc.solr.analysis.PTCSpecialCharacterFilterFactory filter is also used. This filter creates sub-tokens for tokens that end with PTC protected special characters. Currently there are only three protected special characters:

• dot or period (.)

• dash (-)

• underscore (_)

Sub-tokens are created with the following rules:

Rule	Example
Tokens ending with a period (.)	"dot." = "dot.", "dot"
Tokens ending with a dash (-)	"dash-" = "dash-", "dash"
Tokens ending with an underscore (_)	"under_" = "under_", "under"
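
As a rough illustration of the sub-token rules above, the following Python sketch emits the original token and, when it ends with a protected character, a second token without that character. The function name special_character_subtokens is hypothetical, and the handling of tokens with several trailing protected characters (only one is stripped here) is an assumption that the rules above do not cover.

```python
PROTECTED_CHARACTERS = ".-_"  # period, dash, underscore

def special_character_subtokens(token):
    """Return the token plus a sub-token without one trailing protected character."""
    if len(token) > 1 and token[-1] in PROTECTED_CHARACTERS:
        return [token, token[:-1]]
    return [token]

print(special_character_subtokens("dot."))    # ['dot.', 'dot']
print(special_character_subtokens("dash-"))   # ['dash-', 'dash']
print(special_character_subtokens("under_"))  # ['under_', 'under']
print(special_character_subtokens("plain"))   # ['plain']
```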

Maintain the same order of tokenizers at indexing time and at query time. For a given word, the tokens generated at query time should match the tokens generated at indexing time.

Stop Words

The words listed in $solr-home\wblib\conf\stopwords.txt are not indexed. These should be words that a user would not enter in a meaningful search, for example, "if" or "not". To include these words in searches, remove them from stopwords.txt.

For English, the text field is used, and it is configured with the StopFilterFactory filter.
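
The effect of a stop word list can be sketched in Python as follows. This is an illustration only, not Solr's StopFilterFactory: the function names are hypothetical, the sample list is invented, and two points are assumptions: that the file holds one word per line with "#" marking comment lines, and that matching ignores case.

```python
def load_stop_words(text):
    """Parse stop words from text: one word per line, '#' lines are comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

sample_file = "# sample stop words\nif\nnot\n"  # hypothetical file content
stop_words = load_stop_words(sample_file)
print(remove_stop_words(["check", "if", "valve", "not", "open"], stop_words))
# ['check', 'valve', 'open']
```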

Synonyms

The synonym entries in $solr-home\wblib\conf\synonyms.txt ensure that searching on one word can find records with synonymous words. You can edit this file to enter or remove synonyms.

The SynonymFilterFactory filter is configured for English text fields.
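
Synonym expansion can be sketched in Python as follows. This is a simplified illustration, not Solr's SynonymFilterFactory: the function names and the sample entries are hypothetical, and the sketch assumes that each line of the file is a comma-separated group of equivalent words. It skips "#" comment lines and any explicit "=>" mappings.

```python
def load_synonyms(text):
    """Build a lookup table from comma-separated groups of equivalent words."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and explicit "a => b" mappings.
        if not line or line.startswith("#") or "=>" in line:
            continue
        group = [w.strip().lower() for w in line.split(",") if w.strip()]
        for word in group:
            table.setdefault(word, set()).update(group)
    return table

def expand(term, table):
    """Return the term together with its synonyms, sorted alphabetically."""
    key = term.lower()
    return sorted(table.get(key, {key}))

sample_file = "# sample synonyms\nbolt, screw, fastener\n"  # hypothetical entries
table = load_synonyms(sample_file)
print(expand("screw", table))   # ['bolt', 'fastener', 'screw']
print(expand("gasket", table))  # ['gasket']
```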