Creating New Predictive Models
Prerequisites
Before you can create a new predictive model, you must have at least one dataset available.
Overview
In the Analytics Builder, creating a new model can be as simple or as complex as you want. The simple technique only requires naming the model and selecting a dataset and a goal variable. However, more complex options are available, ranging from filtering the data included in the model to configuring an ensemble set of learners for training the model.
To Create a Model
1. Click MODELS in the left navigation panel to make sure the Models list page is open.
2. At the top of the Models page, click New.
The New Predictive Model dialog box opens.
3. Enter a unique Model Name.
4. On the Data Selection tab, specify the data you want to train the model against by providing the following information:
Dataset: Select from the datasets available on your instance of ThingWorx Analytics.
If the dataset you select contains temporal information, a check box is displayed to the right of the field. To create a Time Series model, make sure the box is checked.
Goal: Select from the available goals in the dataset.
Filter: Select an existing filter to narrow the data. The filter conditions are displayed in the Filter Details table. Alternatively, click Create Filter to create a new filter. (For more information, see Filtering Data for a New Model.)
Excluded Features from Model: Click Exclude Features to select features that will be excluded from the data used to train the model. (Optional. For more information, see Exclude Features from a New Model.)
5. Click on the Advanced Model Configuration tab. (It only becomes available when the Data Selection tab is complete.) This tab is optional and can be used to customize model settings and learner techniques used during model training.
Note: The Reset Configuration button returns all of the settings on this tab (both model settings and learner techniques) to their default values.
6. Configure the following Model Settings (optional):
Validation Holdout %: The percentage of data that is held back and used to validate the newly trained predictive model. Enter a value from 1 to 100. The default value is 20%.
Max Fields: Limits the model to a specific number of fields during training. The fields are selected based on mutual information scores. The number of fields can affect runtime.
Redundancy Filter: Select this check box to enable redundancy filtering, which ranks features by the information gain they provide. The Redundancy Filter first calculates the mutual information between each feature and the goal variable. It then iteratively ranks the features, in combination with previously selected features, according to the amount of information gain they provide. Features that provide more information gain are ranked higher. During training, this ranking is used to improve feature selection for the predictive model. The number of features indicated by Max Fields is selected from the top of the ranking.
7. Configure a Sampling Strategy that will be applied globally to all of the learning techniques you configure (optional and only for use with Boolean goal fields). Select values for the three parameters below, which work together to define a strategy for balancing the goal outcomes in your training data.
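The ranking idea behind the Redundancy Filter can be illustrated with a minimal sketch. This is not the ThingWorx Analytics implementation, only an illustration under stated assumptions: features are discrete, the function names `mutual_information` and `rank_features` are hypothetical, and "information gain in combination with previously selected features" is approximated by the joint mutual information between the selected features plus one candidate and the goal.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def rank_features(features, goal, max_fields):
    """Greedy ranking (illustrative assumption, not the product algorithm).

    features: dict mapping feature name -> list of discrete values.
    At each step, pick the feature that, combined with the features already
    selected, shares the most information with the goal. Stop at max_fields.
    """
    selected = []
    remaining = list(features)
    joint = [()] * len(goal)  # combined value of the selected features, per record
    while remaining and len(selected) < max_fields:

        def gain(name):
            combined = [j + (v,) for j, v in zip(joint, features[name])]
            return mutual_information(combined, goal)

        best = max(remaining, key=gain)
        joint = [j + (v,) for j, v in zip(joint, features[best])]
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" duplicates "a", so it adds no new information once "a" is chosen.
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
goal = [a & b for a, b in zip(features["a"], features["b"])]
top_two = rank_features(features, goal, max_fields=2)
```

In this toy dataset the duplicate feature is left out of the top two, because after one copy is selected the other contributes no additional information about the goal. Joint estimates like this become unreliable when many features are combined on a small dataset, so treat the sketch as a way to understand the ranking, not as a replacement for it.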
Note: The sampling strategy applies only to training data, not to validation data.
Strategy: Options include:
None: No upsampling or downsampling will be applied. (Default)
Upsample: Will increase the number of records in the training data that match the selected Value (True or False). The number of records will increase by the specified Factor. For example, if you select a factor of 2, the number of records matching the provided value will double. Upsampling is useful when the number of records with the value you want is sparse and the dataset overall is not very large.
Downsample: Will decrease the number of records in the training data that match the selected Value (True or False). The number of records is scaled by the specified Factor. For example, if you select a factor of 0.4, approximately 40% of the records matching the provided value are retained. Downsampling is useful when your dataset is very large.
Value: Select the value that you want to upsample or downsample. Options include True or False.
Factor: Enter a factor by which you want to upsample or downsample the selected Value. The factor must meet different criteria, depending on which Strategy you selected:
For Upsample, the factor must be an integer greater than 1. There is no upper limit.
For Downsample, the factor must be between 0 and 1, exclusive, in order to produce meaningful results. (A factor of 0 would remove all the records with the selected Value and a factor of 1 would remove none of them.) When the downsampling factor is applied, each record with the selected Value has an equal, random chance of being eliminated, so the resulting count is approximate. You might need to experiment with different factors to achieve the desired result.
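The effect of the three sampling parameters can be sketched as follows. This is an illustration only, not the product's code: the function `apply_sampling`, the record layout (a list of dictionaries), and the `goal_key` and `seed` arguments are hypothetical. For downsampling, the sketch assumes the factor is the probability that a matching record is kept, which agrees with the statement that a factor of 0 would remove all matching records and a factor of 1 would remove none.

```python
import random


def apply_sampling(records, goal_key, strategy, value, factor=None, seed=None):
    """Rebalance training records for a Boolean goal (illustrative sketch).

    records:  list of dicts, each holding a Boolean goal under goal_key.
    strategy: "none", "upsample", or "downsample".
    value:    the goal value (True or False) to upsample or downsample.
    """
    rng = random.Random(seed)
    if strategy == "none":
        return list(records)
    if strategy == "upsample":
        if not isinstance(factor, int) or factor < 2:
            raise ValueError("upsample factor must be an integer of at least 2")
        out = []
        for record in records:
            # Matching records are repeated `factor` times; a factor of 2 doubles them.
            out.extend([record] * (factor if record[goal_key] == value else 1))
        return out
    if strategy == "downsample":
        if not 0 < factor < 1:
            raise ValueError("downsample factor must be between 0 and 1, exclusive")
        # Assumption: each matching record is kept with probability `factor`.
        return [
            record
            for record in records
            if record[goal_key] != value or rng.random() < factor
        ]
    raise ValueError(f"unknown strategy: {strategy}")


# Imbalanced toy training set: 10 True records and 90 False records.
training = [{"goal": True}] * 10 + [{"goal": False}] * 90
upsampled = apply_sampling(training, "goal", "upsample", True, 2)
downsampled = apply_sampling(training, "goal", "downsample", False, 0.4, seed=7)
```

Because downsampling is random, the number of records that remain varies from run to run, which is why some experimentation with the factor may be needed.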
8. Provide values for the following time series parameters. These parameters display only when the dataset contains temporal data.
Lookback Size – Enter a value to define the number of historical data points to be used when predicting each value in the time series. Any value greater than 1 is acceptable, but a power of 2 is generally used (2, 4, 8, 16). Larger values affect performance because more historical data is used for predictions.
Lookahead – Enter a value to define the number of time steps ahead to predict. An acceptable value can be an integer equal to or greater than 0. The default is 1. If Use Goal History is disabled, Lookahead can be set to 0, which is the standard behavior when goal history is not in use. However, do not set Lookahead to 0 when Use Goal History is enabled.
Use Goal History – When enabled (the default), previous goal values are used to predict a future goal value. Disable this option to train a model on time series data when the goal variable will not be provided as input during scoring. In some scenarios, it’s not possible to know the value of the goal variable during predictive scoring, either because the value of the goal feature is not observable during scoring or because measuring it is physically or financially difficult. The goal value is still required during training, but during scoring, these models can be used to predict the value of a time series goal without providing recent values for the goal field.
Note: Previously, a parameter called Virtual Sensor was available when training a model via the Training microservice. It worked as the inverse of the Use Goal History parameter and had to be enabled in order to train a model that would not use goal history during scoring. For backwards compatibility, the Virtual Sensor parameter is maintained in the Training microservice, even though Use Goal History has replaced it when creating models in Analytics Builder.
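To see how the three time series parameters interact, consider the following minimal sketch of how training rows could be assembled from a series. It is an assumption about the general windowing idea, not a description of how ThingWorx Analytics builds its training data; the function `build_windows` and its single-feature input are hypothetical.

```python
def build_windows(feature, goal, lookback, lookahead=1, use_goal_history=True):
    """Turn a time series into (inputs, target) training rows (illustrative sketch).

    feature, goal: equal-length lists ordered by time.
    lookback:      number of historical points given to the model.
    lookahead:     number of time steps between the last input and the target.
    """
    if lookback < 2:
        raise ValueError("lookback must be greater than 1")
    if lookahead < 0:
        raise ValueError("lookahead must be 0 or greater")
    if use_goal_history and lookahead == 0:
        # The window would contain the very goal value being predicted.
        raise ValueError("lookahead cannot be 0 when goal history is used")
    rows = []
    for t in range(lookback - 1 + lookahead, len(goal)):
        end = t - lookahead          # last time step visible to the model
        start = end - lookback + 1   # first time step in the window
        inputs = list(feature[start:end + 1])
        if use_goal_history:
            inputs += goal[start:end + 1]  # previous goal values become inputs
        rows.append((inputs, goal[t]))
    return rows


feature = [10, 11, 12, 13, 14, 15]
goal = [0, 1, 2, 3, 4, 5]
with_history = build_windows(feature, goal, lookback=2, lookahead=1)
without_history = build_windows(
    feature, goal, lookback=2, lookahead=0, use_goal_history=False
)
```

In this sketch, a lookahead of 0 with goal history enabled is rejected because the input window would include the target itself, which is one way to understand why that combination should be avoided. With goal history disabled, only the feature values are needed at scoring time.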
9. Configure the Learning Techniques (optional):
Add Learner: Click to add a learner to the list of learners used to generate a predictive model. You will be prompted to select and configure the new learning technique. For information about adding a learner, see Add a Learner to a New Model. For more information about each learner type, see Learners and Ensemble Techniques.
Remove Learners: Select a learner from the list below and click this button to remove the learner from the model generation process.
Ensemble Technique: The predictive model is generated using a combination of learner techniques to achieve the best results and minimize prediction errors. The following options are available for working with an ensemble of learners:
Average – Each learner scores each record separately and the scores are averaged.
Best – Only the learner that performed best during training is used for scoring.
Elite Average – The best learners during training are selected as elite learners, then they each score records separately and their scores are averaged.
Majority Vote – For Boolean goals only. Each learner scores each record separately and the scores are tallied. The score with the largest tally is selected.
Soloist – Includes only a single learner.
Comparison Metric: Determines how results from different learners are compared. Available options include:
Pearsons – a linear correlation between the features and the goal variable
RMSE – Root Mean Square Error is a measurement of the difference between the values predicted by the model and the values actually observed
ROC – Receiver Operating Characteristics area is a measurement of the model's ability to correctly classify predictions as true or false. For use with Boolean goals only.
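The ensemble techniques and the role of a comparison metric can be sketched in a few lines. This is an illustration under stated assumptions, not the product's implementation: the functions `rmse` and `combine`, the `errors` dictionary (a per-learner RMSE where lower is better), and the `elite_count` parameter are all hypothetical. A Soloist ensemble corresponds to passing scores from a single learner.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    return sqrt(mean((p - o) ** 2 for p, o in zip(predicted, observed)))


def combine(scores, technique, errors=None, elite_count=2):
    """Combine per-learner scores into one score per record (illustrative sketch).

    scores: dict mapping learner name -> list of scores, one per record.
    errors: dict mapping learner name -> RMSE measured for that learner
            (assumed comparison metric; lower is better).
    """
    names = list(scores)
    if technique == "average":
        chosen = names
    elif technique == "best":
        chosen = [min(names, key=errors.get)]
    elif technique == "elite_average":
        # elite_count is a hypothetical knob for how many learners count as elite.
        chosen = sorted(names, key=errors.get)[:elite_count]
    elif technique == "majority_vote":
        # Boolean goals only: the most frequent vote wins for each record.
        return [
            Counter(votes).most_common(1)[0][0]
            for votes in zip(*(scores[n] for n in names))
        ]
    else:
        raise ValueError(f"unknown technique: {technique}")
    return [mean(column) for column in zip(*(scores[n] for n in chosen))]


scores = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 9.0]}
errors = {"a": 0.5, "b": 0.2, "c": 0.9}
averaged = combine(scores, "average")
best = combine(scores, "best", errors)
elite = combine(scores, "elite_average", errors, elite_count=2)
votes = {"a": [True, False], "b": [True, True], "c": [False, False]}
majority = combine(votes, "majority_vote")
```

Using an odd number of learners with Majority Vote avoids ties in this sketch. For metrics where higher is better, such as ROC area, the selection in `best` and `elite_average` would need to be reversed.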
10. Click Submit to begin generating the new model.