Key Analytics Infotables

In ThingWorx Analytics Server, two key infotables supply dataset and metadata information to a job request. Regardless of how the request is submitted (via a REST call, a mashup, or a service), the dataset and metadata must be provided in the form of an infotable.

An infotable is an instance of a data shape that includes data. Two commonly used infotables are the following:

• AnalyticsDatasetRef – This infotable references a specific dataset, including a URI and the data format (both will vary depending on whether the data is a stored dataset or in-place data provided with a request). The data field in AnalyticsDatasetRef can be based on any data shape, as long as it is flat (does not contain any nested infotables) and it matches the corresponding metadata infotable.

• AnalyticsDatasetMetadata – This infotable is a data shape containing the machine learning characteristics that describe a dataset, including fieldName, dataType, opType, and more.

Metadata only needs to be provided if you are accessing data that is not already in the Data microservice, that is, the data URI does not use the scheme dataset://.

The data field in the AnalyticsDatasetRef infotable can be based on any Data Shape as long as it is flat (contains no nested infotables) and it matches the corresponding metadata. Each field of data must be defined according to the structure established in the metadata infotable. For additional information about creating and working with Data Shapes, see the Data Shapes section of the ThingWorx Foundation Help Center.
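The two requirements above (flat data, and fields that match the metadata) can be pictured with a small sketch. It is not ThingWorx code: it assumes the data is modeled as a list of Python dictionaries and the metadata as a list of entries keyed by fieldName, and the helper name find_shape_problems is hypothetical.

```python
from typing import Any, Dict, List

# Python types treated as "nested" in this sketch (an assumption; in
# ThingWorx the concern is an infotable nested inside a field).
NESTED_TYPES = (dict, list, tuple, set)


def find_shape_problems(rows: List[Dict[str, Any]],
                        metadata: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems: List[str] = []
    expected = {entry["fieldName"] for entry in metadata}
    for index, row in enumerate(rows):
        for name in sorted(row):
            # A nested value would make the data shape non-flat.
            if isinstance(row[name], NESTED_TYPES):
                problems.append(f"row {index}: field '{name}' is nested")
        for name in sorted(expected - set(row)):
            problems.append(f"row {index}: missing field '{name}'")
        for name in sorted(set(row) - expected):
            problems.append(f"row {index}: field '{name}' has no metadata entry")
    return problems


# Illustrative values only.
metadata = [
    {"fieldName": "temperature", "opType": "CONTINUOUS"},
    {"fieldName": "machineId", "opType": "CATEGORICAL"},
]
rows = [
    {"temperature": 71.5, "machineId": "press-1"},
    {"temperature": 70.2, "machineId": "press-2", "history": [1, 2]},
    {"temperature": 69.9},
]

for problem in find_shape_problems(rows, metadata):
    print(problem)
```

In this sample, the first row passes, the second row is flagged for a nested field that also lacks a metadata entry, and the third row is flagged for a missing field.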

• datasetUri – A string that points to the location of the data you want to include in the request. Data can be accessed from a variety of locations. The syntax of this field includes two components separated by a colon. On the left side of the colon is a scheme, which indicates the source of the data (such as a stored dataset or in-place data provided directly to a request). On the right side of the colon is a path, which provides the specific location. The possible schemes include:

◦ thingworx:// – Points to a ThingWorx file repository, such as AnalyticsUploadStorage, where data can be uploaded and stored.

◦ thingworxs:// – Functions the same as the thingworx:// scheme but is used when ThingWorx Foundation server is accessed over SSL (https).

◦ file:/ – Points to the ThingWorx Analytics Server file system where data can be loaded directly and accessed in place. Accessing data from the file system is useful for small datasets with rapidly changing data.

◦ body:/ – Points to in-place data that can be supplied directly as part of an API request body. This method is useful for scoring or model evaluation jobs. Data supplied this way is not stored.

• format – A string that indicates the storage format of the data. Supported values include:

• filter – A string that contains the conditions of an SQL WHERE clause, describing the characteristics of the data that should be included. It has the effect of removing rows of data from the dataset (it does not remove columns).

• exclusions – A list of strings that remove specific fields from the dataset. It has the effect of removing columns from each row in the dataset (it does not remove rows).

• data – An untyped infotable that must be provided in order to pass data as part of a request body. It can accept any infotable that includes only primitive base type fields (STRING, INTEGER, NUMBER).

• metadata – A string that points to the metadata infotable and must be provided in order to pass metadata as part of a request body.
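As an illustration of the datasetUri syntax described above, the following sketch splits a URI at the first colon into a scheme and a path, and applies the rule that metadata is required unless the scheme is dataset. The helper names and the sample URIs are hypothetical placeholders, not part of the ThingWorx API, and the paths shown are not meant to reflect real repository layouts.

```python
from typing import Tuple

# Schemes named in this section; this sketch rejects anything else.
KNOWN_SCHEMES = ("dataset", "thingworx", "thingworxs", "file", "body")


def split_dataset_uri(uri: str) -> Tuple[str, str]:
    """Split a datasetUri at the first colon into (scheme, path)."""
    scheme, colon, path = uri.partition(":")
    if not colon:
        raise ValueError(f"no colon in datasetUri: {uri!r}")
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return scheme, path


def needs_metadata(uri: str) -> bool:
    """Metadata is required unless the URI uses the dataset scheme."""
    return split_dataset_uri(uri)[0] != "dataset"


# Placeholder URIs for illustration only.
for uri in ("dataset://example", "file:/tmp/example.csv", "body:/"):
    scheme, path = split_dataset_uri(uri)
    print(scheme, path, needs_metadata(uri))
```

Splitting at the first colon only keeps any later colons inside the path, which matters for paths that themselves contain a colon.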

The AnalyticsDatasetMetadata infotable is a static data shape containing the machine learning characteristics that describe a dataset. It includes the following fields:

• fieldName – A string that provides the field name for a column of data.

• dataType – A string that indicates the format of the data in this field. Acceptable values include:

• opType – A string that indicates how the data behaves. Acceptable values include:

◦ CONTINUOUS – A numeric field (DOUBLE, INTEGER, or LONG dataType) that can include data of any value across the provided range. Data in this opType usually falls between the min and max values.

When a goal field is defined as a Continuous opType with a dataType of Integer, scores output in PMML format during training will probably NOT be integers. During the validation process, the system cannot accept an integer for a Continuous goal. Internally, all Continuous/Integer fields are converted to Continuous/Double fields. In the resulting PMML output, scores are reported as more accurate floating point numbers.

◦ CATEGORICAL – A field with a finite set of values, where each value is treated independently and there is no ordering. Only the provided values are acceptable unless no values are provided. In that case, values are calculated from all observed entries within the dataset.

◦ ORDINAL – A field with a finite set of values that have an inherent ordering. Only the provided values are acceptable and they must be provided in the correct order.

◦ TEMPORAL – For time series data, this time value field indicates the expected time between adjacent rows of data (provided in the timeSamplingInterval).

◦ ENTITY_ID – For time series data, this identifier allows for differentiation between entities. For example, if a dataset contains entries for three different machines, each machine should have its own entity ID.

• min – For continuous values, this field represents the lowest expected value.

• max – For continuous values, this field represents the highest expected value.

• values – For ordinal and categorical values, this field contains a list of possible values. For ordinal, the values must be listed in the correct order. For categorical, the order of values does not matter.

• timeSamplingInterval – For time series datasets, this value indicates the time interval between adjacent rows of data. If the dataset does not adhere to the specified interval, an error will occur.

• isStatic – For time series datasets, this flag can be used to indicate that a field should not change over time. When set to true, this field will not undergo any time series transformations. Setting this flag where appropriate will improve performance.
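To show how the min, max, values, and timeSamplingInterval fields constrain data, here is a rough sketch that checks single values against a metadata entry and checks a time column against a sampling interval. It models metadata entries as Python dictionaries and time values as plain numbers; both are assumptions made for illustration, and the function names are hypothetical. It does not reproduce the validation that ThingWorx Analytics Server performs.

```python
from typing import Any, Dict, List, Optional


def check_value(entry: Dict[str, Any], value: Any) -> Optional[str]:
    """Return a message if the value conflicts with the entry, else None."""
    op_type = entry["opType"]
    name = entry["fieldName"]
    if op_type == "CONTINUOUS":
        low, high = entry.get("min"), entry.get("max")
        # min and max are expected bounds, so this is reported, not enforced.
        if (low is not None and value < low) or (high is not None and value > high):
            return f"{name}: {value} is outside the expected range"
    elif op_type in ("CATEGORICAL", "ORDINAL"):
        allowed = entry.get("values") or []
        # A categorical field with no listed values accepts observed values.
        if op_type == "CATEGORICAL" and not allowed:
            return None
        if value not in allowed:
            return f"{name}: {value!r} is not one of the provided values"
    return None


def follows_interval(times: List[float], interval: float) -> bool:
    """True when every pair of adjacent time values differs by the interval."""
    return all(later - earlier == interval
               for earlier, later in zip(times, times[1:]))


# Illustrative entries only.
temperature = {"fieldName": "temperature", "opType": "CONTINUOUS",
               "min": 0, "max": 100}
severity = {"fieldName": "severity", "opType": "ORDINAL",
            "values": ["low", "medium", "high"]}

print(check_value(temperature, 150))
print(check_value(severity, "medium"))
print(follows_interval([0, 10, 20], 10))
```

The ordinal list is written in its intended order (low, medium, high), since the order carries meaning for that opType; for a categorical field the same list could be given in any order.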