Handling Missing Dataset Values
Overview
When a dataset is missing values in some of its fields, training, scoring, or other analytics operations might be prevented from running successfully. Beginning in Analytics Server 9.3, metadata parameters can be configured to instruct the system how to handle missing values automatically for each field in the dataset. Options for handling missing values range from leaving them as they are to replacing them with a specific value or imputing a value based on other values in the same field. For more information, see
Missing Value Metadata Parameters below.
This functionality is supported in all analytics operations, including signals, profiles, clusters, training, validation, and scoring. Missing value treatments can be applied to data of the following opTypes: continuous, Boolean, categorical, ordinal, or text.
|
|
Currently, missing value treatments are only supported for string dataTypes. For example, if you assign a missing value treatment to a field that contains data where opType = continuous, but dataType = integer, an error will occur.
|
Missing Value Metadata Parameters
To define the appropriate treatment of missing values, use the following parameters in the JSON metadata for each field where values are missing.
• missingValueTreatment – Indicates how a missing dataset value should be handled for a specific field. This parameter is required in order to impute missing values. The following chart lists the valid options.
|
Option
|
Description
|
|
AS_IS
|
Keep the missing values as they are. This strategy is not acceptable in all missing value scenarios. When creating a dataset, the AS_IS missing value treatment cannot be used with data of opTypes continuous, Boolean, categorical, ordinal, or text. If these opTypes contain data with null values, other analytics jobs will fail. This option is the default when missing values are detected.
|
|
AS_VALUE
|
Replace the missing values with the value specified by the missingValueReplacement parameter.
|
|
AS_MEAN
|
For continuous fields only. Replace the missing value with the mean of the available values for the field.
|
|
AS_MEDIAN
|
For continuous fields only. Replace the missing value with the median of the available values for the field.
|
|
AS_LAST
|
For time series data only. Replace each missing value with the last-known value, of the same entity, for that field. The first record for each entity must be populated because there is no last-know record to carry forward for the first record. Values can be imputed forward for up to 100 records. An analytics job using a field with more than 100 consecutive missing values is likely to fail.
|
• missingValueReplacement – Indicates what value should replace all missing values for a specific field. This parameter is required when the value for missingValueTreatment is AS_VALUE. It cannot be used in other cases.
• missingValueIndicators – Indicates a list of values that should be interpreted as missing for a specific field. By default, null entries are considered missing. This parameter is optional.
|
|
Indicators must be of the same dataType as the field.
|
Notes for Using Missing Value Parameters
Keep in mind the following information when deciding how to handle missing values.
• Missing value treatments are defined when the dataset is created, but the treatments are applied only when analytics jobs are run using the dataset.
• Missing values can be imputed for continuous, Boolean, categorical, ordinal, or text fields. If a missing value treatment is requested for entity, temporal, or informational fields, an error may occur or the treatment request will be ignored.
• During training jobs, missing values are imputed only on the training dataset, without including the validation holdout data. For example, if the AS_MEAN option is used, the mean is calculated only on the training data. The mean from the training data is also used to impute missing values during scoring and validation jobs.
• A null object is always considered missing, even when it is not included in the missingValueIndicators list. (Not applicable to the string “null”.)
• When creating a dataset, the AS_IS missing value treatment cannot be used with data of opTypes continuous, Boolean, categorical, ordinal, or text. If these opTypes contain data with null values, other analytics jobs will fail.
• When using fields with missing values for filtering purposes, some special handling might be required. For example, it may not be enough to define a filter to look for pressure > 0.0. You might need to add to the filter as follows: pressure IS NOT NULL AND pressure >0.0.
• Distribution queries do not support missing value treatments for time series datasets.
• Because of the distribution query limitations on time series data, missing value treatments are not supported for any time series data in Analytics Builder.
• When the filter is applied on a numeric data with missing value, the filter query result may not behave as expected. If you want to keep the rows with missing values and exclude other numeric values in a dataset, it is recommended that instead of creating an exclude filter, you must create a filter that includes all other values except the values you want to exclude. This is because exclude filter will exclude the missing value by default.
Example Metadata with Missing Value Parameters
The following metadata snippet includes parameters to define missing value treatment for several fields.
[
{
"fieldName": "Continuous",
"opType": "CONTINUOUS",
"dataType": "STRING",
"missingValueTreatment": "AS_VALUE",
"missingValueReplacement": "0.5"
},
{
"fieldName": "Boolean",
"opType": "BOOLEAN",
"dataType": "STRING",
"missingValueTreatment": "AS_VALUE",
"missingValueReplacement": "TRUE"
},
{
"fieldName": "Ordinal",
"opType": "ORDINAL",
"dataType": "STRING",
"values": ["S", "M", "L"],
"missingValueTreatment": "AS_VALUE",
"missingValueReplacement": "M"
},
{
"fieldName": "Text",
"opType": "TEXT",
"dataType": "STRING",
"missingValueIndicators": ["NA","N/A"],
"missingValueTreatment": "AS_VALUE",
"missingValueReplacement": ""
},
{
"fieldName": "goal",
"opType": "CONTINUOUS",
"dataType": "STRING",
"missingValueIndicators": ["NaN"],
"missingValueTreatment": "AS_MEAN"
}
]