Singular Spectrum Analysis Demo

3. TOOLKIT DEMONSTRATION

Singular-Spectrum Analysis

Selecting the `SSA' button from the Analysis menu on the toolbar launches the following window (shows its state after pressing Default button, see below):

singular spectrum analysis

Having selected the data vector to be analyzed (here `soi', the SOI time-series), and the value of sampling interval, the main SSA options to be specified are the Window Length, the type of Significance , and Covariance Estimator. The Default button is provided as a guide to select input parameters based on the length of the data time series.
The number of SSA Components specifies how many leading components in terms of captured variance will be retained for future analyses. After computation more information about these retained components can be obtained by using Advanced options. Results of SSA are stored in matrix with a name specified in Spectrum field. In addition, T-EOFs and T-PCs (see below) are stored in matrices with names obtained by prefixing "eof_" and "pc_" to a Spectrum name, and can be accessed in Data I/O tool. If results from several SSA runs have been stored in different matrices, the parameters used in a particular SSA run will be restored in GUI by simply selecting correspondent matrix from a Spectrum pop-up list.

Window Length

As a rule of thumb, the window length M should be chosen to be longer than number of data points in the oscillatory periods under investigation, and shorter than number of data points in the spells of an intermittent oscillation. In general, the window length be less than about N/5 where N is the number of points in the time series. Robustness of results to M is an important test of their validity. The choice of window length sets the dimension of the lag autocorrelation matrix to be constructed and diagonalized by SSA, and thus determines the computational burden of the application. Larger values of M correspond to higher spectral resolution, although there is no direct equivalence between them.

We will set the Window Length to 60, which is a good choice for out time series (690 data points with a one month sampling rate) and the oscillatory periods (2 and 4 years) under investigation.

Covariance Estimation

The particular method for estimating the lag-autocovariance matrix that is decomposed (diagonalized) by SSA is chosen by selecting either `Burg' , `VG' (Vautard &Ghil), or `BK ' (Broomhead &King) from the `Covariance' button on the main SSA control panel.

NB: The Burg covariance matrix estimation option is not supported by Monte-Carlo significance test which defaults to Vautard-Ghil if Burg is selected on the main SSA panel.

We used Burg covariance method for analyzing SOI time series.

Significance tests

There are three choices for Significance test

Heuristic
Chi-Squared
Monte Carlo

described below. If Heuristic option is selected, the eigenspectrum is displayed in order of eigenvalue rank. In the remaining two cases, the spectrum of eigenvalues is organized against the dominant frequency associated with the respective SSA eigenvector (T-EOF).

Heuristic

It constructs a set of ad-hoc error bars that are based on the estimated decorrelation time of the time series according to:

d&lambda_i= (2k&tau/N)^1/2 &lambda_i

where &tau is a typical decorrelation time for the time series, k is a user supplied Decorrelation weight between 1 and 2 (1.5 by default), and &lambda_i is the i-th eigenvalue in the spectrum. The Toolkit estimates &tau to be the inverse of the logarithm of the lag-one autocorrelation of the time series.

After we click Compute and Plot with Heuristic option selected, a following plot will display the eigenvalue spectrum of the specified SSA. The units of abscissa are SSA component number (eigenvalue rank), and with the variance contributed by each SSA component on the ordinate.

singular spectral analysis

Since the window length was set to 60, SSA decomposes the time series into 60 components and thus 60 eigenvalues are plotted. The significance of the various components can be judged qualitatively by noting which components contribute significantly more variance relative to the noise background. The latter in turn assumed to include the components that lie in the flattish tail of the eigenvalue spectrum, i.e. components from about 10 to 60. The leading 10 components in a previous figure lie above a distinct break in the eigenvalue spectrum, and thus may be of physical significance.

In particular, we are interested in the leading four components which form two pairs of nearly equal eigenvalues. As discussed in the SSA theory section, pair of nearly equal eigenvalues in SSA is one that characterizing it as an oscillation. However, the eigenvalues are subject to numerical and sampling errors, and mere pairing of eigenvalues is not enough to guarantee that an oscillation has been identified. In the eigenvalue plot above, the error bars show an ad hoc range of the estimation errors. Any of the eigenvalues with significantly overlapping error bars could represent an "oscillatory pair'. Also eigenvalues with error bars that overlap significantly with the error bars of the noise part of the spectrum must also be suspected of being part of that noise.

Pairing Criteria

The Toolkit provides three pairing criteria to identify oscillatory pairs and trend components in clusters of eigenvalues whose ad-hoc error bars overlap; however, it is not necessary for the 'Heuristic' test to be selected on the main SSA panel. The results of these tests are can be accessed in Log tab. These tests can be activated concurrently using checkboxes in Advanced panel:

`Same Frequency' -- the T-EOFs associated with potential pairs or clusters are subjected to a simple Fourier transform to identify the their dominant frequency. A pair (or cluster) is identified when the associated T-EOFs have the same dominant frequency, within a fraction of the SSA bandwidth (1/M).
`Strong FFT' --the same Fourier transform is used to determine how much of a given signal the potential oscillatory pair accounts for at their dominant frequency. This variance fraction must exceed 95% for a pair (or cluster) to be identified.
`Do trend Test' --Two tests that help to identify trending and low-frequency SSA components:
- Kendall's tau nonparametric trend tests on the T-PCs (and RCs if present)
- Counts of the numbers of zero crossings within the T-EOFs.

For the SOI example, the log shows that pair 1-2 of SSA components passes both `Same Frequency' and `Strong FFT' tests for being oscillation, while the 5th component is indicated as a trend (see here for more on SSA detrending):

Monte Carlo SSA

The Monte Carlo SSA (MCSSA) tests significance against a red noise null-hypothesis (see also "Small signal" example of this guide). A large ensemble of red-noise surrogate time series are generated, each with the same length and same expected lag-1 autocorrelation as the time series to be tested (but see note under 'Included EOFs' below).

Two fixed T-EOF bases are used using Basis popup menu in Advanced Options:

Data: The data, together with many AR(1) noise realizations are projected onto the EOFs of the data covariance matrix
AR(1): The data, together with many AR(1) noise realizations are projected onto the EOFs of the expected covariance matrix of pure noise, plus any selected "signal" EOFs of the data covariance matrix which are specified as "Included EOFs" in the null-hypothesis, with the rest as for pure noise, after orthogonalizing. This is called the "null-hypothesis basis", and it is either pure noise, or a "hybrid basis" of noise plus certain data EOFs. The noise covariance matrix is computed analytically from the estimated lag-1 autocorrelation coefficient of the test time series.

hybrid basis

`Included EOFs': List of "signal" EOFs (by rank or set of space-delimited integers) to include in the noise null-hypothesis, from which the noise surrogates and null-hypothesis EOFs are computed. Note that irrespective of basis (data or null-hyp. EOFs), the characteristics of the noise, and thus the error bars are modified by including data EOFs. In fact, the contribution of the latter to the lag-1 autocorrelation of the data time series are removed, when they are included in the null-hypothesis If no data EOFs are specified, the null-hypothesis is pure AR(1) noise.

`Confidence': in percentiles, e.g. 95 will mean 95% confidence level with respect to red noise null-hypothesis.

`No. Surrogates:' Number of Monte Carlo noise realizations.

Chi-squared

This is the default test, which is a computationally fast approximation of MCSSA approach.

Advanced Options

The Advanced Options panel is used to set the preferences for the particular significance test, as well as a table containing information about retained SSA components (the number specified in Components field, see above): captured variance, and rank or dominant frequency, based on which Significance test has been used. The table below shows Advanced panel after performing SSA analysis on SOI index and is discussed in more details below.

SSA

By selecting rows of the table, user can plot and inspect correspondent T-EOFs or T-PCs and perform reconstruction. In addition, SSA Prediction and Gap-filling can be performed.

By selecting first two rows in SSA components table and clicking Plot EOFs, we can verify that the two leading EOFs are indeed in phase-quadrature, i.e. one leading the other with ~T/4 lead, where T is their common period ~40 months.

Empirical orthogonal functions (EOFs)

To test against a pure red-noise null-hypothesis, we choose Chi-Squared as a significance test (leaving the Included EOFs field blank). After clicking the Compute and Plot buttons, and adjusting the X-axis limits in Graph Controls window, the following plot is obtained:

significance test

Here the variance is plotted against the dominant frequencies of 'data' EOFs on x-axis, and describe projection of data and null-hypothesis (NH) surrogates onto Data EOFs basis, which are essentially SSA eigenvalues and plotted as dots. The low-frequency part of the spectrum has been zoomed in by using Graph Controls. The dominant frequencies of EOFs are calculated with a 0.001/dt resolution in the Nyquist interval from 0 to 0.5/dt. The error bars represent 95% of variance (with settings of Confidence levels such as in Advanced Options) we expect to find in the state-space direction defined by a particular EOF from the 'basis', when analyzing a segment of 'NH' noise. Using 'Included EOFS' option we can construct a composite 'NH' as described above, but it is a pure red-noise here.

For the Data Eofs basis we observe that eigenvalues corresponding to EOFS 1-2 and 3-4 appear almost superimposed on each other at frequencies equal to ~ 0.02 and 0.034 cycle/month, and lie outside the null-hypothesis error bars. Thus they are relatively unlikely ( at the 95% level set in Confidence levels of Advanced Options) to be merely due to selected null-hypothesis process, and represent two significant oscillations. The results of the Chi-squared test should always be checked using the Monte Carlo approach, which is also essential with more complex noise models.

One can also choose 'basis' EOFs to be computed from the expected covariance matrix of the 'Null-hypothesis' by selecting AR(1) from the basis popup menu in Advanced Options. Projection of data does not give SSA eigenvalues here, but the interpretation of the graph is the same. Note how the EOFs of the pure noise NH are almost regularly spaced. The NH basis avoids artificial variance-compression inherited in SSA and therefore it has a lower probability of false-positive results, i.e. identifying the noise components as significant. After doing calculations again the resulting spectrum is shown below:

significance test

The NH basis also confirms the significance of the identified significant pairs, and we conclude that they correspond to the QB and QQ components of El Niño.

Reconstruction and Prediction.

(Note: In a demo version, SSA Reconstruction/Prediction is available only for data of example projects! For custom data this feature is enabled after activation with a purchased Serial No.)

The SSA components table on the Advanced panel allows the reconstruction and prediction of the original time series from selected SSA components, as well as plotting associated T-EOFs and T-PCs by using Plot EOFs and Plot PCs buttons. The name of vector with reconstruction/prediction is specified by the user in Result field. Individual RC components are stored in matrix with name obtained by prefixing "mat_" to Result name, and can be accessed in Data I/O tool. The Result field is shared with the SSA gap-filling feature.

If results from several reconstructions have been stored in different vectors, the parameters for a particular reconstruction will be restored in GUI by simply selecting correspondent vector from Result pop-up list. For prediction user need to specify lead time in Lead field, and order of AR model to advance in time selected SSA components; cross-validation is available as well for finding optimal parameters for prediction, see SSA prediction for more details. Checking "filter out" box will filter out the selected components from the original data (prediction feature is disabled then); this can be quite useful for detrending.

Note: By applying SSA cross-validation with Lead=0 allows to test Reconstruction feature with a fixed window size on a particular interval of the time series. The SSA is applied many times to a shortened time series, thus simulating "real-time" environment. See Finance folder in the Examples for applying this feature to the Forex USD/EUR time series.

Selecting the 1-4 table rows of identified 'significant' components , and after clicking Compute in Reconstruction/Prediction box, followed by Plot, next figure will be obtained (after adjusting its parameters in Graph Controls window):

By using Info button of Vectors table in Data I/O panel, user can see the total variance captured, as well as the individual RCs used for the Result vector:

Here the reconstructed signal is plotted against original time series. If this SSA-filtered RC series is now analyzed by MEM, or MTM Coherence, the frequency spectrum of the RCs is much simpler than that of the complete time series, since most of the "noise" has been removed.We leave as an exercise for the user to repeat SSA analysis with 11 SSA Components, and try to identify pair 10-11 with the oscillatory signal. Hint: this is related to the peak in MTM Spectrum at f~0.18 cycles/month.