Identify New Signals without a Model
To identify new signals without a model:
1. On the Signals list page, click the New button at the top of the signals list.
A dialog box opens with signals options you can configure.
2. In the dialog box, enter or select values for the following options:
◦ Signal Name – Enter a name for the new signal.
◦ Signal Description – Enter an optional description for the new set of signals. Text length is limited to 2000 characters.
◦ Redundancy Filter – Select to enable redundancy filtering, which ranks features by the information gain they provide. The Redundancy Filter first calculates the mutual information between each feature and the goal variable. It then iteratively ranks the features, in combination with previously selected features, according to the amount of information gain they provide. Features that provide more information gain are ranked higher.
◦ Data from Existing Dataset – Provide the following information about the dataset you want to identify signals from:
▪ Dataset – Select the dataset from which to identify signals.
▪ Goal – Select a goal variable from the selected dataset.
▪ Filter – Optionally, select a filter to apply to the dataset. Alternatively, click the Create Filter button to define a new filter for the dataset and then apply it to the signals identification process. For more information, see Create a Data Filter.
◦ Upload New Data – Click to select new data to upload for the signals, instead of using the Data from Existing Dataset option. A New Dataset dialog box opens and you will be prompted to upload a JSON metadata file and a CSV file containing data.
◦ Exclude Features – Click to select specific features of the dataset to be excluded from the signals. A dialog box opens where you can select features to add to or remove from an exclusion list.
◦ Timeseries Data Only – Provide the following information when identifying signals on time series data. These parameters display only when the dataset contains temporal data.
▪ Lookback Size – Enter a value to define the number of historical data points to be used when predicting each value in the time series. Any value greater than 1 is acceptable, but a power of 2 (2, 4, 8, 16) is generally used. Larger values affect performance because more historical data is used for each prediction.
Note: Unlike the training process, where a value of 0 indicates auto-windowing, a value of 0 in the Signals process will result in the data being treated as though it was not time series data.
▪ Lookahead Size – Enter a value to define the number of time steps ahead to predict. Acceptable values are integers greater than or equal to 0. The default is 1. If Use Goal History is disabled, Lookahead can be set to 0, which is the standard behavior when goal history is not in use. However, do not set Lookahead to 0 when Use Goal History is enabled.
▪ Use Goal History – When this option is enabled (the default), previous goal values are used to predict a future goal value. When it is disabled, you can train a model on time series data for which the goal variable will not be provided as input during scoring. In some scenarios, it’s not possible to know the value of the goal variable during predictive scoring, either because the value of the goal feature is not observable during scoring or because measuring it is physically or financially difficult. The goal value is still required during training, but during scoring, these models can be used to predict the value of a time series goal without providing recent values for the goal field.
◦ Text Data Only – Provide the following information when identifying signals on text data. These parameters display only when the dataset contains free-form text data of OpType TEXT.
▪ Max N-gram Size – The maximum size of the text units to count. A value of 1 indicates that every word is counted. A value of 2 indicates that phrases of two consecutive words are counted, in addition to individual words. Default = 1.
▪ Max Vocabulary Size – The maximum number of words or phrases for each n-gram size that can be included in the vocabulary. Default = 1000 for each n-gram size.
▪ Min Document Freq – A threshold filter to count only words or phrases that appear with a minimum level of frequency across rows. Values can range from 0.0 (inclusive) to 1.0 (exclusive). A value of 0 indicates that every word or phrase is counted with no filtering. A value of 0.1 indicates that words and phrases are only counted if they appear in at least 10% of rows in the dataset. Default = 0.
3. Click Submit.
The dialog box closes and the signals identification process starts. The new signals job appears in the list at the top of the Signals list page. The State column shows the status of the signals identification process. When the job is complete, the new signals are displayed in the bottom section of the page.
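The ranking idea behind the Redundancy Filter option can be illustrated with a small, self-contained Python sketch. It is not the product's implementation: it assumes discrete feature values, measures information gain as the increase in mutual information between the combined selected features and the goal, and uses hypothetical names (mutual_information, rank_features) chosen only for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X; Y), in bits, of two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def rank_features(features, goal):
    """Greedily rank features by the information they add about the goal.

    features: dict mapping a feature name to a list of discrete values
    goal: list of discrete goal values, one per row
    Returns a list of (name, gain_in_bits) tuples in rank order.
    """
    remaining = dict(features)
    joint = [()] * len(goal)  # combined values of the features selected so far
    current = 0.0             # I(selected features; goal) so far
    ranking = []
    while remaining:
        best_name, best_mi, best_joint = None, -1.0, None
        for name, values in remaining.items():
            # Combine the candidate with the already-selected features row by row.
            candidate = [j + (v,) for j, v in zip(joint, values)]
            mi = mutual_information(candidate, goal)
            if mi > best_mi:  # ties keep the earlier feature
                best_name, best_mi, best_joint = name, mi, candidate
        ranking.append((best_name, best_mi - current))
        joint, current = best_joint, best_mi
        del remaining[best_name]
    return ranking


# Illustrative data: the goal is fully determined by b1 and b2 together,
# and b1_copy duplicates b1.
features = {
    "b1":      [0, 0, 1, 1, 0, 0, 1, 1],
    "b1_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b2":      [0, 1, 0, 1, 0, 1, 0, 1],
}
goal = [0, 1, 2, 3, 0, 1, 2, 3]

for name, gain in rank_features(features, goal):
    print(f"{name}: {gain:.2f} bits")
```

In this example, b1_copy carries as much information about the goal on its own as b1 does, but it ranks last because it adds nothing once b1 has been selected. Continuous features would need to be binned before a sketch like this could be applied.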
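The interaction of the three Text Data Only options can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's text processing: the function name build_vocabulary, the lowercase whitespace tokenization, and the choice to keep the most frequent n-grams when the vocabulary limit is reached are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(rows, max_ngram_size=1, max_vocabulary_size=1000,
                     min_document_freq=0.0):
    """Return {n: [ngram, ...]} with at most max_vocabulary_size entries per n-gram size."""
    vocabulary = {}
    for n in range(1, max_ngram_size + 1):
        total_counts = Counter()  # occurrences of each n-gram across all rows
        doc_counts = Counter()    # number of rows that contain each n-gram
        for row in rows:
            tokens = row.lower().split()  # assumed tokenization
            ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            total_counts.update(ngrams)
            doc_counts.update(set(ngrams))
        # Keep n-grams that appear in at least the required fraction of rows.
        kept = [g for g in total_counts
                if doc_counts[g] / len(rows) >= min_document_freq]
        # Assumed tie-breaking: most frequent first, then alphabetical.
        kept.sort(key=lambda g: (-total_counts[g], g))
        vocabulary[n] = kept[:max_vocabulary_size]
    return vocabulary


rows = [
    "the pump failed",
    "the pump restarted",
    "valve failed",
    "the valve opened",
]

vocab = build_vocabulary(rows, max_ngram_size=2, max_vocabulary_size=3,
                         min_document_freq=0.5)
print(vocab)
```

With these settings, only words and two-word phrases that appear in at least half of the rows are kept, and each n-gram size is capped separately at three entries, mirroring the "for each n-gram size" wording of the Max Vocabulary Size option.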