OpType DataType Combinations

Overview

The opType and dataType metadata parameters work together to describe what type of data a field contains and how that data can be used. Both parameters are required, and their values must be entered in uppercase letters. Only certain combinations of the two are valid. The chart below shows which combinations are valid or invalid.
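As a rough illustration of these rules, the following Python sketch checks one field's metadata before it is submitted. The dictionary layout and the check_field helper are hypothetical, and the uppercase spelling INFORMATIONAL is an assumption. The lookup table is deliberately partial: it contains only the pairings that the Data Types section of this page states explicitly, so STRING and INTEGER are left unchecked.

```python
# Partial table: only the pairings stated explicitly in the Data Types section.
# STRING and INTEGER are omitted because their valid op types are not listed there.
# The spelling "INFORMATIONAL" is an assumption.
STATED_OP_TYPES = {
    "BOOLEAN": {"BOOLEAN", "INFORMATIONAL"},
    "DOUBLE": {"CONTINUOUS", "BOOLEAN", "ENTITY_ID", "INFORMATIONAL"},
    "LONG": {"CONTINUOUS", "BOOLEAN", "ENTITY_ID", "TEMPORAL", "INFORMATIONAL"},
    "DATETIME": {"TEMPORAL", "INFORMATIONAL"},
}


def check_field(field):
    """Return a list of problems found in one field's metadata (hypothetical dict layout)."""
    problems = []
    # Both parameters are required and must be upper case.
    for key in ("opType", "dataType"):
        value = field.get(key)
        if not value:
            problems.append(f"{key} is required")
        elif value != value.upper():
            problems.append(f"{key} must be upper case: {value!r}")
    if problems:
        return problems
    # Data types missing from the partial table are not checked at all.
    allowed = STATED_OP_TYPES.get(field["dataType"])
    if allowed is not None and field["opType"] not in allowed:
        problems.append(
            f"{field['opType']}/{field['dataType']} is not a stated combination"
        )
    return problems
```

For example, check_field({"opType": "CONTINUOUS", "dataType": "DOUBLE"}) returns an empty list, while a CONTINUOUS/DATETIME pairing is reported as a problem.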

Data Types

• STRING – Selecting the String data type for numeric data can lead to undesirable results.

• BOOLEAN – Can only be used with Boolean and Informational op types.

• DOUBLE – Can be used with Continuous, Boolean, Entity ID, and Informational op types.

• INTEGER – Selecting the Integer data type for a Continuous goal does not mean that the scores output during training will also be integers. Because the validation process cannot accept an integer for a Continuous goal, the data type is converted internally from Integer to Double. In the resulting PMML output, scores are reported as more accurate floating point numbers.

• LONG – Can be used with Continuous, Boolean, Entity ID, Temporal, and Informational op types.

• DATETIME – Can only be used with Temporal and Informational op types to display readable date and time information. This data is automatically converted to epoch time for use in time series analytics. The following chart shows the valid formats for this data type.

Syntax	Examples*
ISO8601 Standard: See the Oracle Date Time Format site.	2019-09-18T10:02; 2019-09-18T10:02:00; 2019-09-18T10:02:00.000; 2019-09-18T10:02:00.000+00:00; 2019-09-18T10:02:00.000Z
Custom excluding T: yyyy-MM-dd HH:mm[:ss][.SSS][.SS][XXX][X]	2019-09-18 10:02; 2019-09-18 10:02:00; 2019-09-18 10:02:00.000; 2019-09-18 10:02:00.000Z
Custom with / and excluding T: MM/dd/yyyy HH:mm[:ss][.SSS][.SS][XXX][X]	09/18/2019 10:02; 09/18/2019 10:02:00; 09/18/2019 10:02:00.000; 09/18/2019 10:02:00.002Z

* Month and day designations must contain two digits. Correct date = 09/18/2019. Incorrect date = 9/18/2019.
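To get a feel for how these formats map to epoch time, the following Python sketch parses the three syntaxes and returns milliseconds since the Unix epoch. It is not the conversion that the product performs. The to_epoch_millis helper is hypothetical, the choice of milliseconds is an assumption, values without a time zone are assumed to be UTC, and the two-digit fraction variant ([.SS]) is not handled. The regular expressions enforce the two-digit month and day rule from the footnote.

```python
import re
from datetime import datetime, timedelta, timezone

# Shared time-of-day part: HH:mm, optional :ss, optional .SSS, optional zone.
_TIME = (
    r"(?P<hour>\d{2}):(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2}))?"
    r"(?:\.(?P<millis>\d{3}))?"
    r"(?P<zone>Z|[+-]\d{2}(?::?\d{2})?)?"
)
_PATTERNS = [
    # ISO 8601 with T, or the custom variant with a space instead of T.
    re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})[T ]" + _TIME),
    # Custom variant with slashes: MM/dd/yyyy.
    re.compile(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4}) " + _TIME),
]
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text):
    """Parse one DATETIME value and return milliseconds since the Unix epoch."""
    for pattern in _PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            break
    else:
        raise ValueError(f"unsupported DATETIME format: {text!r}")
    parts = match.groupdict()
    zone = parts["zone"]
    if zone in (None, "Z"):
        offset = timedelta(0)  # assumption: a missing zone means UTC
    else:
        sign = 1 if zone[0] == "+" else -1
        digits = zone[1:].replace(":", "")
        offset = sign * timedelta(hours=int(digits[:2]), minutes=int(digits[2:] or 0))
    moment = datetime(
        int(parts["year"]), int(parts["month"]), int(parts["day"]),
        int(parts["hour"]), int(parts["minute"]), int(parts["second"] or 0),
        int(parts["millis"] or 0) * 1000,  # milliseconds to microseconds
        tzinfo=timezone(offset),
    )
    # Integer division of timedeltas avoids floating point rounding.
    return (moment - _EPOCH) // timedelta(milliseconds=1)
```

A value such as 9/18/2019 10:02 raises ValueError in this sketch because the month has only one digit.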

Op Types

• CONTINUOUS – A numeric field (DOUBLE, INTEGER, or LONG dataType) that can include data of any value across the provided range. Data in this opType usually falls between the min and max values.

When a goal field is defined as a Continuous opType with a dataType of Integer, scores output in PMML format during training will probably NOT be integers. During the validation process, the system cannot accept an integer for a Continuous goal. Internally, all Continuous/Integer fields are converted to Continuous/Double fields. In the resulting PMML output, scores are reported as more accurate floating point numbers.

• CATEGORICAL – A field with a finite set of values, where each value is treated independently and there is no ordering. Only the provided values are acceptable unless no values are provided. In that case, values are calculated from all observed entries within the dataset.

• BOOLEAN – A true or false field, typically used as a flag.

• ORDINAL – A field with a finite set of values that have an inherent ordering. Only the provided values are acceptable and they must be provided in the correct order.

• ENTITY_ID – For time series data, this identifier allows for differentiation between entities (machines, lines, assets). For example, if a dataset contains entries for three different machines, each machine should have its own entity ID. Multiple entries with the same time stamp and Entity ID can cause errors during model training.

Beginning in Analytics Server 9.2, the ENTITY_ID column is optional when preparing time series data. The entity is inferred when the data is used to train a model. The resulting PMML model will contain the following field to indicate that the entity is defaulted:

• TEMPORAL – For time series data, this time value field indicates the expected time between adjacent rows of data (provided in the timeSamplingInterval).

• TEXT – A field that contains free-form text information such as Descriptions, Comments, or Notes. Text analysis techniques are used to extract information from the free text field for use in training, scoring, signals, profiles, and cluster jobs. For more information about how text information is transformed into analysis-ready information, see Transforming Free-Form Text for Analysis.

To limit the performance impact on training, signals, clusters, and profiles jobs that include free text data, the create jobs for these services can be configured with the following parameters:

◦ maxNGramSize – The maximum size of the text units to count. A value of 1 indicates that every word is counted. A value of 2 indicates that phrases of two consecutive words are counted, in addition to individual words. Default = 1.

◦ maxVocabSize – The maximum number of words or phrases for each n-gram size that can be included in the vocabulary. Default = 1000 for each n-gram size.

◦ minDocumentFrequency – A threshold filter to count only words or phrases that appear with a minimum level of frequency across rows. Values can range from 0.0 (inclusive) to 1.0 (exclusive). A value of 0 indicates that every word or phrase is counted with no filtering. A value of 0.1 indicates that words and phrases are only counted if they appear in at least 10% of rows in the dataset. Default = 0.
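The interaction of these three parameters can be sketched in Python as follows. This is an illustration only, not the text analysis that the product runs. The build_vocabulary helper is hypothetical, whitespace tokenization with lowercasing is an assumption, and so is the choice to keep the most widely used n-grams when the vocabulary limit is reached.

```python
from collections import Counter


def build_vocabulary(rows, max_ngram_size=1, max_vocab_size=1000,
                     min_document_frequency=0.0):
    """Return {n: [ngram, ...]} with up to max_vocab_size entries per n-gram size."""
    vocabulary = {}
    for n in range(1, max_ngram_size + 1):
        doc_counts = Counter()
        for row in rows:
            words = row.lower().split()  # assumption: whitespace tokenization
            ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
            doc_counts.update(ngrams)  # a row counts each n-gram at most once
        # Keep n-grams that appear in at least the required share of rows.
        threshold = min_document_frequency * len(rows)
        kept = [(gram, count) for gram, count in doc_counts.items() if count >= threshold]
        # Assumption: when the limit applies, the most widespread n-grams win.
        kept.sort(key=lambda item: (-item[1], item[0]))
        vocabulary[n] = [gram for gram, _ in kept[:max_vocab_size]]
    return vocabulary
```

With max_ngram_size=2, the result holds one vocabulary for single words and a separate one for two-word phrases, each capped at max_vocab_size entries.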

Free-form text data cannot include line breaks within text entries.
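Two of the data constraints in this list lend themselves to a quick check before upload: rows that share both an entity ID and a time stamp, and text entries that contain line breaks. The Python sketch below assumes that rows are available as dictionaries. The helper names and the field names passed to them are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows, entity_field, time_field):
    """Return the (entity, time stamp) pairs that occur in more than one row."""
    counts = Counter((row[entity_field], row[time_field]) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


def find_line_breaks(rows, text_field):
    """Return the indexes of rows whose text entry contains a line break."""
    return [
        index for index, row in enumerate(rows)
        if "\n" in row[text_field] or "\r" in row[text_field]
    ]
```

An empty result from both functions means that neither problem was found in the rows that were checked.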
