3. TOOLKIT DEMONSTRATION

-Multivariate Gap-filling by MSSA

(Note: In the demo version, MSSA Gap-Filling is available only for data in the example projects. In a licensed copy this feature is enabled after activation with a purchased Serial No.)

A novel, iterative form of MSSA is used to analyze multivariate datasets with uneven sampling or missing observations. Gaps are filled in by exploiting spatio-temporal correlations in the dataset. File data with "NaN" values (case insensitive) are treated as missing. The gap-filling feature is available in Advanced options of the MSSA/PCA panel. For univariate datasets, please see SSA gap-filling.

The user selects the data in the Data pop-up menu of the MSSA/PCA tool, and then specifies the PCA or MSSA method to fill the gaps. PCA uses only spatial correlations between the channels, through a few leading EOFs up to the number specified on the MSSA/PCA panel. MSSA additionally utilizes temporal correlations; the user needs to specify the MSSA window size and the number of MSSA components for fill-in. Gap-filling is then done by clicking Compute in the Gap-filling box of Advanced options.
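The PCA variant of the idea can be illustrated outside the GUI. The following is a minimal sketch, assuming NumPy and a (time x channel) array with NaN marking the gaps; the function name pca_gap_fill and its parameters are hypothetical, and this is not the Toolkit's own implementation. Gaps start at the channel mean and are repeatedly replaced by a reconstruction from the leading EOFs until the filled values stop changing.

```python
import numpy as np

def pca_gap_fill(data, n_eofs, n_iter=200, tol=1e-8):
    """Fill NaN entries of a (time x channel) array using leading EOFs.

    Hypothetical helper for illustration; not the kSpectra implementation.
    """
    x = np.array(data, dtype=float)
    missing = np.isnan(x)
    if not missing.any():
        return x
    # Center each channel on the mean of its available points.
    mean = np.nanmean(x, axis=0)
    x = x - mean
    x[missing] = 0.0  # first guess: gaps equal the channel mean
    for _ in range(n_iter):
        # Truncated SVD keeps only the n_eofs leading spatial patterns.
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        recon = (u[:, :n_eofs] * s[:n_eofs]) @ vt[:n_eofs]
        change = np.max(np.abs(recon[missing] - x[missing]))
        x[missing] = recon[missing]  # update gaps only; keep observed data
        if change < tol:
            break
    return x + mean
```

An MSSA version would, under the same assumptions, apply this kind of iteration to a matrix of lagged copies of the channels, so that temporal correlations also contribute to the fill-in.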

The filled-in data is stored under the name specified in the Result box. By clicking Plot, the user can compare the gappy and filled-in datasets, provided the column (spatial channel) or row (time channel) 1-D plot option has been selected. Alternatively, a 2-D time-space plot can be created with the Contour or Fill option.

When plotting missing data, the user can select an option in Preferences to connect all the available points through gaps:

The number of MSSA components to use depends on the dataset, and in particular on the amount of noise present. The main idea is to discard the higher-ranked components, which correspond to noise. If the CVL error box is checked in Gap-filling options, a number of cross-validation experiments (set in Preferences) is performed: in each, a small portion of the existing points is flagged at random as missing, and the rms error of the filled-in data is calculated. The optimum number of components corresponds to the minimum of this error averaged over all cross-validation sets. The error can be plotted with the Plot CVL button. The random seed for choosing the cross-validation points can be changed in Preferences, as can the convergence criterion for missing values. The user can repeat such cross-validation experiments for different MSSA Window values in order to find optimum parameters for gap-filling. In addition, the range of values of the filled-in data can be constrained by setting optional Max and Min limits. The percentage of the dataset variance used to fill the gaps is written to Log. If results from several gap-filling calculations have been stored in different matrices, the parameters used (including Preferences) are restored in the GUI by simply selecting the corresponding matrix from the Result pop-up list.
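The cross-validation procedure described above can be sketched as follows. This is an illustration only, assuming NumPy; the names cvl_error, best_n_components and fill_fn, as well as the default number of experiments and hold-out fraction, are assumptions and do not correspond to Toolkit settings. Any gap-filling routine with the signature fill_fn(data_with_nans, n_components) can be plugged in.

```python
import numpy as np

def cvl_error(data, fill_fn, n_components, n_experiments=5,
              fraction=0.05, seed=0):
    """Mean RMS error of fill_fn over random hold-out experiments.

    fill_fn(data_with_nans, n_components) must return a filled copy.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    rows, cols = np.nonzero(~np.isnan(data))  # existing points only
    n_hold = max(1, int(fraction * rows.size))
    errors = []
    for _ in range(n_experiments):
        pick = rng.choice(rows.size, size=n_hold, replace=False)
        r, c = rows[pick], cols[pick]
        trial = data.copy()
        trial[r, c] = np.nan  # flag held-out points as missing
        filled = fill_fn(trial, n_components)
        errors.append(np.sqrt(np.mean((filled[r, c] - data[r, c]) ** 2)))
    return float(np.mean(errors))

def best_n_components(data, fill_fn, candidates, **kwargs):
    """Return the candidate with the smallest mean cross-validation error."""
    scores = {k: cvl_error(data, fill_fn, k, **kwargs) for k in candidates}
    return min(scores, key=scores.get), scores
```

Because the same seed is reused for every candidate, all candidates are scored on identical hold-out sets, which keeps the comparison between component counts fair.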

Here we demonstrate the Toolkit's gap-filling capabilities on synthetic time series, following the Examples/Multivariate Gap Filling folder of the kSpectra distribution.

First, we demonstrate MSSA and PCA gap filling of noisy multivariate data containing a quasi-periodic oscillatory spatio-temporal pattern. The synthetic test series, consisting of 20 spatial channels, each 100 data points long, represents a low-frequency oscillation with a period of T=40 units. This oscillation is modulated in both amplitude and phase with a period of T=120, and is contaminated by large-amplitude white noise.
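A series with these properties could be generated along the following lines. This is only a sketch, assuming NumPy: the channel count, length and the two periods come from the description above, while the modulation depth, the half-sine spatial profile, the noise level and the function name synthetic_series are assumptions and are not taken from the example project.

```python
import numpy as np

def synthetic_series(n_time=100, n_chan=20, period=40.0, mod_period=120.0,
                     noise_std=1.0, seed=0):
    """Return (noisy, signal), each of shape (n_time, n_chan).

    Illustrative stand-in for the example data, not the actual file.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_time)[:, None]   # time index, as a column
    s = np.arange(n_chan)[None, :]   # channel index, as a row
    # Amplitude and phase both vary slowly with period mod_period
    # (modulation depth of 0.5 is an assumption).
    amp = 1.0 + 0.5 * np.sin(2 * np.pi * t / mod_period)
    phase = 0.5 * np.sin(2 * np.pi * t / mod_period)
    # Assumed spatial shape: half a sine wave across the channels.
    profile = np.sin(np.pi * (s + 0.5) / n_chan)
    signal = amp * np.cos(2 * np.pi * t / period + phase) * profile
    noise = noise_std * rng.standard_normal((n_time, n_chan))
    return signal + noise, signal
```

Returning the noise-free signal alongside the noisy data makes it possible to judge a gap-filling result against a known reference.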

At a fixed time, the pattern represents a standing wave in space. About 50% of the data have been removed at random and filled in with MSSA and PCA, and the results are compared with the original (full) dataset:

Cross-validation shows which MSSA/PCA parameters are best for gap-filling:

If the smooth box is checked in Gap-filling options, then Result will be the estimated smooth component of the dataset at all points, including those where data is available. Otherwise, Result will take the values of the existing data, and the missing values will be filled in with the smooth component. So we check the smooth box, and with MSSA Window equal to 5, 'BK' Covariance and 4 MSSA Components we obtain:

This result can be compared with a 'true smooth' component from the full dataset:

By going to Log we can see that the MSSA reconstruction used to fill the gaps captured ~69% of the dataset variance:

Second, we apply MSSA/PCA gap filling to the global dataset of monthly SSTs from the International Research Institute for Climate and Society (IRI) for 1950--2004, from 30S-60N, on a 10x10 grid, with 648 data points in each of 237 spatial channels, for a total of 153,576 data points (sstnan). We have randomly removed about 70% of the data.

To find optimal parameters for gap-filling, we use cross-validation with the PCA and MSSA methods. The figure below shows that the error is much smaller when using MSSA with Reduced Covariance, Window equal to 645, and 80 components. We choose the Reduced option because N'=N-M+1=645 implies a window of M=4, and N' is smaller than M*L (4*237=948). With the Reduced option, the covariance matrix is of size N'xN', rather than M*LxM*L.
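The matrix sizes quoted above can be checked with a short sketch, assuming NumPy. The helper trajectory_matrix is hypothetical, the column ordering of the lagged copies is an arbitrary choice that does not affect the shapes, and the array of zeros is only a placeholder with the dimensions of the SST example (N=648, L=237), not the data itself.

```python
import numpy as np

def trajectory_matrix(data, window):
    """Stack `window` lagged copies of every channel side by side.

    data has shape (N, L); the result has shape (N - window + 1, window * L).
    """
    n_time, n_chan = data.shape
    n_rows = n_time - window + 1  # N' = N - M + 1
    lagged = [data[lag:lag + n_rows, :] for lag in range(window)]
    return np.hstack(lagged)

# Placeholder values with the SST example's dimensions (N = 648, L = 237).
data = np.zeros((648, 237))
traj = trajectory_matrix(data, window=4)

full_cov_shape = (traj.T @ traj).shape     # (M*L, M*L)
reduced_cov_shape = (traj @ traj.T).shape  # (N', N')
print(traj.shape, full_cov_shape, reduced_cov_shape)
```

With M=4 the lagged matrix has 645 rows and 948 columns, so the product taken over the shorter dimension yields the smaller 645x645 matrix.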

Next, we apply MSSA gap-filling with Window equal to 645, Reduced covariance and 80 MSSA components: