Transforming Free-Form Text for Analysis

Overview

Beginning with Analytics Server 9.2, the TEXT opType allows free-form text to be used in analytics operations, including training, scoring, signals, profiles, and clustering. Before free-form text can be used in these analyses, it must be converted into continuous data. The analytics operations handle this conversion automatically. The procedure below describes each step involved in transforming English-language text into analysis-ready numeric data.

The TEXT opType is currently designed to handle English text only. Non-English text is untested: it might appear to work, but it can yield unpredictable results.

Transformation Procedure

1. Text data is converted to all lowercase and split into tokens (words). Tokens are created by splitting the text string at whitespace characters. Punctuation is removed if it is located at the beginning or end of the text entry or adjacent to a whitespace character. Punctuation inside a token, such as the apostrophe in dog’s, is kept.

Examples:

• Dog, DOG, dog are considered the same word.

• dog, dogs, dog’s are considered to be different words.

• As a result of this tokenization step, a text entry of “The quick brown fox jumps over the lazy dog.” becomes the following list of words: the, quick, brown, fox, jumps, over, the, lazy, dog
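The tokenization rules in step 1 can be illustrated with a minimal Python sketch. It is not the Analytics Server implementation. The function name tokenize is hypothetical, and the sketch assumes that the punctuation set is the ASCII set in Python's string.punctuation, which this page does not specify.

```python
import string


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split at whitespace, strip punctuation at token edges."""
    tokens = []
    for raw in text.lower().split():
        # Assumption: "punctuation" means the ASCII characters in string.punctuation.
        # Only leading and trailing punctuation is removed, so dog's stays intact.
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that consisted only of punctuation
            tokens.append(token)
    return tokens


print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```

Because tokens are split at whitespace first, stripping punctuation from the edges of each token covers both cases named in step 1: punctuation at the start or end of the entry, and punctuation next to whitespace.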

2. A set of English-language stop words is removed. Stop words are common words that do not add much information. The set of stop words is predefined and is not user-configurable.

Example:

• Stop words include: a, an, the, i, she, he, her, him, it, and, but, if, of, at, for

• As a result of this step, word list 1 below becomes word list 2:

1. the, quick, brown, fox, jumps, over, the, lazy, dog

2. quick, brown, fox, jumps, lazy, dog
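Stop word removal amounts to a simple filter, sketched below in Python. The names STOP_WORDS and remove_stop_words are hypothetical. The set contains only the stop words listed above plus "over", which is added as an assumption because the worked example removes it. The full predefined set used by Analytics Server is not listed on this page.

```python
# Illustrative subset only; the product's predefined stop word set is larger.
# "over" is an assumption inferred from the worked example above.
STOP_WORDS = frozenset({
    "a", "an", "the", "i", "she", "he", "her", "him",
    "it", "and", "but", "if", "of", "at", "for", "over",
})


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that is a stop word, keeping the original order."""
    return [token for token in tokens if token not in STOP_WORDS]


words = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(remove_stop_words(words))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```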

3. N-grams are formed from the words that remain after stop word removal. These are words or phrases that become the units available for counting. Depending on the value of the maxNGramSize parameter, n-grams can range in size from n=1 up to n=maxNGramSize. A 1-gram is an individual word and a 2-gram is a phrase of two consecutive words.

Examples:

• 1-grams: quick, brown, fox, jumps, lazy, dog

• 2-grams: quick brown, brown fox, fox jumps, jumps lazy, lazy dog

• 3-grams: quick brown fox, brown fox jumps, fox jumps lazy, jumps lazy dog
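N-gram formation is a sliding window over the remaining words. The following Python sketch illustrates it. The function name build_ngrams and its max_ngram_size argument are hypothetical stand-ins for the behavior controlled by maxNGramSize.

```python
def build_ngrams(tokens: list[str], max_ngram_size: int) -> dict[int, list[str]]:
    """Return a mapping from n to the list of n-grams, for n = 1..max_ngram_size."""
    ngrams = {}
    for n in range(1, max_ngram_size + 1):
        # Slide a window of n consecutive tokens across the list.
        # If the list has fewer than n tokens, the result for n is empty.
        ngrams[n] = [
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        ]
    return ngrams


result = build_ngrams(["quick", "brown", "fox", "jumps", "lazy", "dog"], 2)
print(result[2])
# ['quick brown', 'brown fox', 'fox jumps', 'jumps lazy', 'lazy dog']
```

Note that "jumps lazy" is a 2-gram even though the original sentence had "over the" between the two words, because the window runs over the list produced by the previous step.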

4. The vocabulary is formed as follows:

◦ The total occurrences of each n-gram are counted across all the rows.

◦ The vocabulary is limited to n-gram terms that occur in at least a minimum proportion of rows, based on your minDocumentFrequency. If minDocumentFrequency is 0.5, then n-grams are included only if they appear in at least half of the rows.

◦ Based on your maxVocabSize, the vocabulary is further limited by selecting only the most frequently appearing terms in each n-gram category. If maxVocabSize is 5, then only the 5 most frequent 1-gram terms, the 5 most frequent 2-gram terms, and so on up to the 5 most frequent terms of maxNGramSize length are selected for inclusion in the vocabulary.
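The three vocabulary rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's algorithm: the function name build_vocabulary is hypothetical, terms are ranked by total occurrences, and ties are broken alphabetically. This page does not say how ties are resolved.

```python
from collections import Counter


def build_vocabulary(rows, max_ngram_size, min_document_frequency, max_vocab_size):
    """rows: one token list per row (already lowercased, stop words removed)."""
    vocabulary = []
    for n in range(1, max_ngram_size + 1):
        total_counts = Counter()  # occurrences of each n-gram across all rows
        row_counts = Counter()    # number of rows that contain each n-gram
        for tokens in rows:
            grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            total_counts.update(grams)
            row_counts.update(set(grams))
        # Keep terms that appear in at least the required fraction of rows.
        min_rows = min_document_frequency * len(rows)
        eligible = [g for g in total_counts if row_counts[g] >= min_rows]
        # Assumption: rank by total occurrences, ties broken alphabetically.
        eligible.sort(key=lambda g: (-total_counts[g], g))
        # The size limit applies separately to each n-gram length.
        vocabulary.extend(eligible[:max_vocab_size])
    return vocabulary


rows = [
    ["quick", "brown", "fox", "jumps", "lazy", "dog"],
    ["red", "fox", "jumps", "brown", "fox"],
]
print(build_vocabulary(rows, 2, 1.0, 5))
# ['fox', 'brown', 'jumps', 'brown fox', 'fox jumps']
```

With a min_document_frequency of 1.0, only terms present in both rows survive. Lowering it to 0.5 admits terms that occur in a single row, and the per-length size limit then decides which of them are kept.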

5. The free-form text field is transformed into multiple integer-valued fields in the dataset, corresponding to the number of occurrences of each vocabulary term.

Example:

• If the vocabulary for the dataset includes: quick, brown, fox, brown fox, red, red fox

Then a row containing the text entry: The quick brown fox jumps over the lazy dog.

Will now contain 6 new fields with the following counts: 1, 1, 1, 1, 0, 0

• A row containing the text entry: The red fox jumps over the brown fox.

Will now contain fields with the following counts: 0, 1, 2, 1, 1, 1
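The counts in this example can be reproduced with a short Python sketch. The function name count_terms is hypothetical. Its input is the word list for a row after lowercasing and stop word removal, and each vocabulary term is matched as a run of consecutive words.

```python
def count_terms(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Return one integer count per vocabulary term for a single row."""
    counts = []
    for term in vocabulary:
        words = term.split()
        n = len(words)
        # Count every position where the term matches n consecutive tokens.
        matches = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words
        )
        counts.append(matches)
    return counts


vocabulary = ["quick", "brown", "fox", "brown fox", "red", "red fox"]

# "The quick brown fox jumps over the lazy dog." after steps 1 and 2
print(count_terms(["quick", "brown", "fox", "jumps", "lazy", "dog"], vocabulary))
# [1, 1, 1, 1, 0, 0]

# "The red fox jumps over the brown fox." after steps 1 and 2
print(count_terms(["red", "fox", "jumps", "brown", "fox"], vocabulary))
# [0, 1, 2, 1, 1, 1]
```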

6. The resulting n-gram count fields are continuous data that can be used in all of the analytics algorithms, including training, signals, profiles, and clusters. If one or more n-gram count fields are selected as model inputs during training, the transformation required to compute those n-gram counts is stored in the PMML and performed during scoring as well.

For prescriptive scoring, free-form text fields cannot be used as lever features. If a model has text inputs, predictive scoring is supported only without important fields. Predictive scoring with important fields is not supported.

Note:

• In the context of the maxAllowedFields parameter for model training configuration, each resulting n-gram count field is treated as an individual field, the same as any other feature included for training the model.

• N-gram count fields are not displayed separately in Analytics Builder.
