Key Analytics Infotables
In ThingWorx Analytics Server, two key infotables provide dataset and metadata information to a job request. Regardless of how the request is submitted (via a REST call, a mashup, or a service), the dataset and metadata must be provided in the form of an infotable.
An infotable is an instance of a data shape that includes data. Two commonly used infotables are the following:
• AnalyticsDatasetRef – This infotable references a specific dataset, including a URI and the data format (both will vary depending on whether the data is a stored dataset or in-place data provided with a request).
• AnalyticsDatasetMetadata – This infotable is a data shape containing the machine learning characteristics that describe a dataset, including fieldName, dataType, opType, and more.
Note: Metadata only needs to be provided if you are accessing data that is not already in the appropriate microservice, that is, when the data URI does not use the dataset:/ scheme.
AnalyticsDatasetRef
The AnalyticsDatasetRef contains the following fields:
• datasetUri – A string that points to the location of the data you want to include in the request. Data can be accessed from a variety of locations. The syntax of this field includes two components separated by a colon. On the left side of the colon is a scheme, which indicates the source of the data (such as a stored dataset or in-place data provided directly to a request). On the right side of the colon is a path, which provides the specific location. The possible schemes include:
◦ thingworx:// – Points to a ThingWorx file repository, such as AnalyticsUploadStorage, where data can be uploaded and stored.
◦ thingworxs:// – Functions the same as the thingworx:// scheme but is used when ThingWorx Foundation server is accessed over SSL (https).
◦ file:/ – Points to the ThingWorx Analytics Server file system where data can be loaded directly and accessed in place. Accessing data from the file system is useful for small datasets with rapidly changing data.
◦ body:/ – Points to in-place data that can be supplied directly as part of an API request body. This method is useful for scoring or model evaluation jobs. Data supplied this way is not stored.
◦ dataset:/ – Points to a dataset created and stored in a microservice.
• format – A string that indicates the storage format of the data. Supported values include:
◦ csv – For use with the thingworx://, thingworxs://, body:/ or file:/ schemes
◦ parquet – For use with the dataset:/ scheme
Note: Format values must be specified in lower case.
• filter – A string that contains the conditions of an SQL WHERE clause, describing the data that should be included. It removes rows of data from the dataset (it does not remove columns).
• exclusions – A list of strings naming specific fields to remove from the dataset. It removes columns from each row in the dataset (it does not remove rows).
• data – An untyped infotable that must be provided in order to pass data as part of a request body. A data infotable is used only when the datasetUri parameter is set to body:/.
The data parameter can accept any infotable that includes only primitive base type fields (STRING, INTEGER, NUMBER). It must not contain additional nested infotables and it must match the corresponding metadata infotable. The data infotable must include at least the following:
◦ A Data Shape that defines each column of the data, including field names and base types. For additional information about creating and working with Data Shapes, see the Data Shapes section of the ThingWorx Foundation Help Center.
◦ Rows of actual data. A row entry must be specified for every record of data.
For a sample JavaScript service that creates a data infotable as part of an AnalyticsDatasetRef table, see Sample Javascript Service.
• metadata – A string that points to the metadata infotable and must be provided in order to pass metadata as part of a request body.
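The field rules above can be sketched in plain JavaScript. This is a minimal illustration, not a ThingWorx service: it uses ordinary objects instead of real infotables, and the helper names (parseDatasetUri, needsMetadata, buildDatasetRef), the sample fields (pressure, status), and the dataShape/rows layout of the data object are assumptions made for the example. It checks that the scheme and format are paired as listed, that format is lower case, and that a data infotable is supplied only with body:/.

```javascript
// Plain-object sketch; helper names are hypothetical, not ThingWorx APIs.

// Storage format expected for each datasetUri scheme, per the lists above.
const FORMAT_BY_SCHEME = new Map([
  ["thingworx", "csv"],
  ["thingworxs", "csv"],
  ["file", "csv"],
  ["body", "csv"],
  ["dataset", "parquet"],
]);

// Split a datasetUri at the first colon: scheme on the left, path on the right.
function parseDatasetUri(datasetUri) {
  const idx = datasetUri.indexOf(":");
  if (idx <= 0) {
    throw new Error("Expected <scheme>:<path>, got: " + datasetUri);
  }
  return { scheme: datasetUri.slice(0, idx), path: datasetUri.slice(idx + 1) };
}

// Metadata is needed unless the data already lives in a microservice dataset.
function needsMetadata(datasetUri) {
  return parseDatasetUri(datasetUri).scheme !== "dataset";
}

// Assemble a reference object and check the scheme/format/data rules.
function buildDatasetRef(fields) {
  const { datasetUri, format, data } = fields;
  const { scheme } = parseDatasetUri(datasetUri);
  if (!FORMAT_BY_SCHEME.has(scheme)) {
    throw new Error("Unknown scheme: " + scheme);
  }
  if (format !== FORMAT_BY_SCHEME.get(scheme)) {
    throw new Error("Scheme " + scheme + " expects format " + FORMAT_BY_SCHEME.get(scheme));
  }
  if ((scheme === "body") !== (data !== undefined)) {
    throw new Error("The data infotable is used only with body:/");
  }
  return { ...fields };
}

// Assumed JSON layout of an infotable: a data shape plus one entry per row.
const data = {
  dataShape: {
    fieldDefinitions: {
      pressure: { name: "pressure", baseType: "NUMBER" },
      status: { name: "status", baseType: "STRING" },
    },
  },
  rows: [
    { pressure: 101.3, status: "ok" },
    { pressure: 98.7, status: "warning" },
  ],
};

const ref = buildDatasetRef({ datasetUri: "body:/", format: "csv", data });
console.log(ref.datasetUri, needsMetadata(ref.datasetUri));
```

In a real service, the data object would be an actual infotable built from a Data Shape. The checks here only mirror the rules stated in this section and do not replace server-side validation.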
AnalyticsDatasetMetadata
This infotable is a static data shape containing the machine learning characteristics that describe a dataset. It includes the following fields:
• fieldName – A string that provides the field name for a column of data.
• dataType – A string that indicates the format of the data in this field. Acceptable values include:
◦ STRING
◦ DOUBLE
◦ BOOLEAN
◦ INTEGER
◦ LONG
• opType – A string that indicates how the data behaves. Acceptable values include:
◦ CONTINUOUS – A numeric field (DOUBLE, INTEGER, or LONG dataType) that can include data of any value across the provided range. Data in this opType usually falls between the min and max values.
Note: When a goal field is defined with a CONTINUOUS opType and an INTEGER dataType, scores output in PMML format during training will probably not be integers. During the validation process, the system cannot accept an integer for a CONTINUOUS goal. Internally, all CONTINUOUS/INTEGER fields are converted to CONTINUOUS/DOUBLE fields. In the resulting PMML output, scores are reported as more accurate floating-point numbers.
◦ CATEGORICAL – A field with a finite set of values, where each value is treated independently and there is no ordering. Only the provided values are acceptable unless no values are provided. In that case, values are calculated from all observed entries within the dataset.
◦ ORDINAL – A field with a finite set of values that have an inherent ordering. Only the provided values are acceptable and they must be provided in the correct order.
◦ BOOLEAN – A true or false field, typically used as a flag.
◦ TEMPORAL – For time series data, this field holds the time value. The expected time between adjacent rows of data is provided in the timeSamplingInterval.
◦ ENTITY_ID – For time series data, this identifier allows for differentiation between entities. For example, if a dataset contains entries for three different machines, each machine should have its own entity ID.
• min – For continuous values, this field represents the lowest expected value. This is an informational field.
Note: When metadata is submitted in a JSON file, both the min and the max values are nested in a range parameter. When metadata is queried in ThingWorx, however, this infotable flattens these parameters, so min and max values are returned without the range field.
• max – For continuous values, this field represents the highest expected value. This is an informational field.
• values – For ordinal and categorical values, this field contains a list of possible values. For ordinal, the values must be listed in the correct order. For categorical, the order of values doesn’t matter.
• timeSamplingInterval – For time series datasets, this value indicates the time interval between adjacent rows of data. If the dataset does not adhere to the specified interval, an error will occur.
• isStatic – For time series datasets, this flag can be used to indicate that a field should not change over time. When set to true, this field will not undergo any time series transformations. Setting this flag where appropriate will improve performance.
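The metadata fields above can be illustrated with a short plain-JavaScript sketch. It is not a ThingWorx API: the helper names (toMetadataRow, checkMetadataRow) and the sample fields (pressure, grade) are hypothetical, and the checks cover only rules stated in this section. The first helper flattens a JSON-file style entry, where min and max are nested under range, into the flat form described for the infotable. The second reports an unknown dataType or opType, a CONTINUOUS field with a non-numeric dataType, and an ORDINAL field without a values list.

```javascript
// Plain-object sketch; helper names and sample fields are hypothetical.
const DATA_TYPES = new Set(["STRING", "DOUBLE", "BOOLEAN", "INTEGER", "LONG"]);
const OP_TYPES = new Set([
  "CONTINUOUS", "CATEGORICAL", "ORDINAL", "BOOLEAN", "TEMPORAL", "ENTITY_ID",
]);
const NUMERIC_TYPES = new Set(["DOUBLE", "INTEGER", "LONG"]);

// Flatten a JSON-file style entry (min/max nested under range) into a row.
function toMetadataRow(entry) {
  const { range, ...row } = entry;
  if (range !== undefined) {
    row.min = range.min;
    row.max = range.max;
  }
  return row;
}

// Return a list of problems found in one metadata row (empty when none).
function checkMetadataRow(row) {
  const problems = [];
  if (!DATA_TYPES.has(row.dataType)) {
    problems.push("unknown dataType: " + row.dataType);
  }
  if (!OP_TYPES.has(row.opType)) {
    problems.push("unknown opType: " + row.opType);
  }
  if (row.opType === "CONTINUOUS" && !NUMERIC_TYPES.has(row.dataType)) {
    problems.push("CONTINUOUS requires DOUBLE, INTEGER, or LONG");
  }
  if (row.opType === "ORDINAL" && !(Array.isArray(row.values) && row.values.length > 0)) {
    problems.push("ORDINAL requires an ordered values list");
  }
  return problems;
}

// Two sample entries in the nested JSON-file form, flattened into rows.
const rows = [
  { fieldName: "pressure", dataType: "DOUBLE", opType: "CONTINUOUS", range: { min: 0, max: 250 } },
  { fieldName: "grade", dataType: "STRING", opType: "ORDINAL", values: ["low", "medium", "high"] },
].map(toMetadataRow);

for (const row of rows) {
  console.log(row.fieldName, checkMetadataRow(row));
}
```

Returning a list of problems instead of throwing on the first one makes it easier to report every issue in a metadata table at once. The order of entries in values is kept as given, since ordinal values must be listed in the correct order.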