Time series forecasting is the process of using a model to generate predictions (forecasts) for future events based on known past events. The time series modeling environment described here takes the form of a plugin tab in Weka's graphical "Explorer" user interface and can be installed via the package manager.
The environment works by removing the temporal ordering of individual input examples and encoding the time dependency via additional input fields. These fields are sometimes referred to as "lagged" variables. Various other fields are also computed automatically to allow the algorithms to model trends and seasonality. After the data has been transformed, any of Weka's regression algorithms can be applied to learn a model.
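The lagged-variable transformation described above can be illustrated with a minimal Python sketch. This is not Weka's implementation: the function name `make_lagged_rows`, the `max_lag` parameter and the toy values are assumptions made for illustration only. Each output row is an ordinary (inputs, target) example that a standard regression learner could consume.

```python
def make_lagged_rows(series, max_lag):
    """Turn a univariate series into (inputs, target) rows.

    Each row holds the max_lag previous values as inputs (lag 1 first)
    and the current value as the target. Hypothetical helper, not a
    Weka API.
    """
    rows = []
    for t in range(max_lag, len(series)):
        lags = [series[t - k] for k in range(1, max_lag + 1)]
        rows.append((lags, series[t]))
    return rows


# Toy values chosen for illustration only.
series = [112, 118, 132, 129, 121, 135]
rows = make_lagged_rows(series, max_lag=2)
for lags, target in rows:
    print(lags, "->", target)
```

Once the data is in this form, the order of the rows no longer matters, because each row carries its own recent history.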
An obvious choice is to apply multiple linear regression, but any method capable of predicting a continuous target can be applied, including powerful non-linear methods such as support vector machines for regression and model trees (decision trees with linear regression functions at the leaves).
The above-mentioned "core" time series modeling environment is available as open-source free software in the CE version of Weka.
There is also a plugin step for Pentaho Data Integration (PDI) that allows models that have been exported from the time series modeling environment to be loaded and used to make future forecasts as part of an ETL transformation. The perspective and step plugins for PDI are part of the enterprise edition.
Once installed via the package manager, the time series modeling environment can be found in a new tab in Weka's Explorer GUI.
Data is brought into the environment in the normal manner by loading from a file, URL or database via the Preprocess panel of the Explorer. The environment has both basic and advanced configuration options. These are described in the following sections.
The basic configuration panel is shown in the screenshot below. In this example, the sample data set "airline" included in the package has been loaded into the Explorer. This data is a publicly available benchmark data set that has one series of data, the passenger numbers. Aside from the passenger numbers, the data also includes a date time stamp. The basic configuration panel automatically selects the single target series and the "Date" time stamp field.
In the Parameters section of the GUI (top right-hand side), the user can enter the number of time steps to forecast beyond the end of the supplied data. Below the time stamp drop-down box, there is a drop-down box for specifying the periodicity of the data.
If the data has a time stamp, and the time stamp is a date, then the system can automatically detect the periodicity of the data. Below this there are check boxes that allow the user to have the system compute confidence intervals for its predictions and perform an evaluation of performance on the training data. More details of all these options are given in subsequent sections. The following screenshot shows the results of forecasting 24 months beyond the end of the data.
At the top left of the basic configuration panel is an area that allows the user to select which target field(s) in the data they wish to forecast.
The system can model multiple target fields jointly in order to capture dependencies between them. Because of this, modeling several series together can give different results for each series than modeling them individually. When there is only a single target in the data, the system selects it automatically. When there are potentially multiple targets, the user must select them manually.
The screenshot below shows some results on another benchmark data set. In this case the data is monthly sales (in litres per month) of Australian wines. There are six categories of wine in the data, and sales were recorded on a monthly basis. In this example, two series have been modeled simultaneously.
At the top right of the basic configuration panel is an area with several simple parameters that control the behavior of the forecasting algorithm.
The first parameter is the number of time steps to forecast. This controls how many time steps into the future the forecaster will produce predictions for. The default is 1, i.e., a single one-step-ahead prediction. The units correspond to the periodicity of the data, if known. For example, with data recorded on a daily basis the time units are days.
Next is the Time stamp drop-down box. This allows the user to select which field in the data, if any, holds the time stamp. If there is a date field in the data then the system selects this automatically.
The user may select the time stamp manually, and will need to do so if the time stamp is a non-date numeric field, because the system can't distinguish this from a potential target field.
Underneath the Time stamp drop-down box is a drop-down box that allows the user to specify the Periodicity of the data. Periodicity is used to set reasonable defaults for the creation of lagged variables (covered below in the Advanced Configuration section).
In the case where the time stamp is a date, Periodicity is also used to create a default set of fields derived from the date. Below the Periodicity drop-down box is a field that allows the user to specify time periods that should not count as a time stamp increment with respect to the modeling, forecasting and visualization process.
For example, consider daily trading data for a given stock. The market is closed for trading over the weekend and on public holidays, so these time periods do not count as an increment, and the difference between, for example, market close on Friday and on the following Monday is one time unit, not three. The heuristic used to automatically detect periodicity can't cope with these "holes" in the data, so the user must specify a periodicity to use and supply the time periods that are not to be considered as increments in the Skip list text field.
The Skip list field can accept strings such as "weekend", "sat", "tuesday", "mar" and "october"; specific dates, with an optional formatting string such as "yyyy-MM-dd"; and integers that are interpreted differently depending on the specified periodicity. For daily data an integer is interpreted as the day of the year; for hourly data it is the hour of the day; and for monthly data it is the month of the year.
If all dates in the list have the same format, then the format only has to be specified once, for the first date in the list, and it then becomes the default format for subsequent dates in the list.
The following screenshots show an example for the "appleStocks" data found in the sample-data directory of the package. This file contains daily high, low, opening and closing data for Apple computer stocks from January 3rd to August 10th. The data was taken from Yahoo Finance. A five-day forecast for the daily closing value has been set, with a maximum lag of 10 configured (see "Lag creation" in Section 3).
Note that it is important to enter dates for public holidays and any other dates that do not count as increments that will occur during the future time period that is being forecasted.
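The effect of a skip list on daily data can be sketched in a few lines of Python. This is only an illustration of the idea that skipped days do not count as time stamp increments; it is not how the package parses or applies its Skip list. The function `count_increments` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def count_increments(start, end, skip_weekends=True, skip_dates=()):
    """Count daily time-stamp increments from start to end.

    Days that fall on a weekend (when skip_weekends is True) or that
    appear in skip_dates are not counted. Hypothetical helper for
    illustration only.
    """
    skipped = set(skip_dates)
    steps = 0
    day = start
    while day < end:
        day += timedelta(days=1)
        is_weekend = day.weekday() >= 5  # Saturday is 5, Sunday is 6
        if (skip_weekends and is_weekend) or day in skipped:
            continue
        steps += 1
    return steps


friday = date(2011, 1, 7)
monday = date(2011, 1, 10)
print(count_increments(friday, monday))                       # weekend skipped
print(count_increments(friday, monday, skip_weekends=False))  # every day counted
```

With weekends skipped, Friday to Monday counts as a single increment; without the skip list it counts as three. Adding a date to `skip_dates` removes that day in the same way, which is why holidays inside the forecast period need to be listed as well.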
Below the Time stamp drop-down box is a check box and text field with which the user can opt to have the system compute confidence bounds on the predictions that it makes. The system uses predictions made for the known target values in the training data to set the confidence bounds.
Note that the confidence intervals are computed for each step-ahead level independently.
By default, the system is set up to learn the forecasting model and generate a forecast beyond the end of the training data. Selecting the Perform evaluation check box tells the system to perform an evaluation of the forecaster using the training data.
That is, once the forecaster has been trained on the data, it is then applied to make a forecast at each time point in order by stepping through the data. These predictions are collected and summarized, using various metrics, for each future time step forecasted. This allows the user to see, to a certain degree, how forecasts further out in time compare to those closer in time. The following screenshot shows the default evaluation on the Australian wine training data for the "Fortified" and "Dry-white" targets.
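The stepping-through evaluation described above can be sketched as follows. This is a simplified illustration, not the package's evaluation code: the stand-in forecaster simply repeats the last observed value, mean absolute error is used as the only metric, and the names `naive_forecast` and `evaluate_by_horizon` are hypothetical.

```python
def naive_forecast(history, steps):
    """Stand-in forecaster: repeat the last observed value."""
    return [history[-1]] * steps


def evaluate_by_horizon(series, steps, forecaster=naive_forecast, min_history=1):
    """Mean absolute error for each step-ahead level.

    At every time point t the forecaster sees series[:t] and predicts
    the next `steps` values; errors are grouped by how far ahead each
    prediction was made.
    """
    errors = {h: [] for h in range(1, steps + 1)}
    for t in range(min_history, len(series)):
        preds = forecaster(series[:t], steps)
        for h, pred in enumerate(preds, start=1):
            idx = t + h - 1
            if idx < len(series):  # only score where the true value is known
                errors[h].append(abs(series[idx] - pred))
    return {h: sum(e) / len(e) for h, e in errors.items() if e}


# Toy values chosen for illustration only.
series = [10, 12, 11, 15, 14]
mae = evaluate_by_horizon(series, steps=2)
for horizon, value in sorted(mae.items()):
    print(f"{horizon}-step-ahead MAE: {value:.3f}")
```

Grouping errors by step-ahead level is what makes it possible to compare forecasts further out in time with those closer in time.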
Output generated by settings available from the basic configuration panel includes the training evaluation (shown in the previous screenshot) and graphs of forecasted values beyond the end of the training data (as shown in Section 3).
There are more options for output available in the advanced configuration panel, discussed in the next section. The next screenshot shows the model learned on the airline data. By default, the time series environment is configured to learn a linear model: a linear support vector machine, to be precise. Full control over the underlying model learned and its parameters is available in the advanced configuration panel.
Results of time series analysis are saved into a Result list on the lower left-hand side of the display. All textual output and graphs associated with an analysis run are stored with their respective entry in the list. Also stored in the list is the forecasting model itself. It is important to realize that, when saving a model, the model that gets saved is the one that is built on the training data corresponding to that entry in the history list.
If an evaluation is performed where some of the data is held out as a separate test set (see below in Section 3), the saved model will have been trained on only part of the available data.
The advanced configuration panel gives the user full control over a number of aspects of the forecasting analysis. These include the choice of underlying model and parameters, creation of lagged variables, creation of variables derived from a date time stamp, specification of "overlay" data, evaluation options and control over what output is created.
Each of these has a dedicated sub-panel in the advanced configuration and is discussed in the following sections. The Base learner panel provides control over which Weka learning algorithm is used to model the time series.
It also allows the user to configure parameters specific to the learning algorithm selected. By default, the analysis environment is configured to use a linear support vector machine for regression (Weka's SMOreg). This can easily be changed by pressing the Choose button and selecting another algorithm capable of predicting a numeric quantity. Adjusting the individual parameters of the selected learning algorithm can be accomplished by clicking on the options panel, found immediately to the right of the Choose button.
Doing so brings up an options dialog for the learning algorithm.
The Lag creation panel allows the user to control and manipulate how lagged variables are created. Lagged variables are the main mechanism by which the relationship between past and current values of a series can be captured by propositional learning algorithms.
They create a "window" or "snapshot" over a time period. Essentially, the number of lagged variables created determines the size of the window. For example, if you had monthly sales data then including lags up to 12 time steps into the past would make sense; for hourly data, you might want lags up to 24 time steps.
The left-hand side of the lag creation panel has an area called lag length that contains controls for setting and fine-tuning lag lengths.
At the top of this area there is an Adjust for variance check box, which allows the user to opt to have the system compensate for variance in the data. It does this by taking the log of each target before creating lagged variables and building the model. This can be useful if the variance (how much the data jumps around) increases or decreases over the course of time.
Adjusting for variance may, or may not, improve performance. Below the adjust for variance check box is a Use custom lag lengths check box. This allows the user to alter the default lag lengths that are set by the basic configuration panel.
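The variance adjustment mentioned above, taking the log of each target before lagging and modeling, can be illustrated with a small Python sketch. It assumes strictly positive target values and that forecasts are mapped back with the exponential; the source does not describe how the package handles either point, so both are assumptions, as are the helper names `to_log` and `from_log`.

```python
import math


def to_log(series):
    """Log-transform a strictly positive series (assumption: no zeros or negatives)."""
    return [math.log(v) for v in series]


def from_log(values):
    """Map values on the log scale back to the original scale."""
    return [math.exp(v) for v in values]


# Toy series whose jumps grow with its level (it doubles each step).
series = [100.0, 200.0, 400.0, 800.0]
logged = to_log(series)

raw_steps = [b - a for a, b in zip(series, series[1:])]
log_steps = [b - a for a, b in zip(logged, logged[1:])]
print(raw_steps)  # step sizes grow on the original scale
print(log_steps)  # step sizes are constant on the log scale
```

On the original scale the step sizes grow with the level of the series, while on the log scale they are constant, which is the kind of changing variance the check box is meant to compensate for.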