Matlab Toolbox for Dimensionality Reduction
The Matlab Toolbox for Dimensionality Reduction contains Matlab implementations of 34 techniques for dimensionality reduction and metric learning. A large number of these implementations were developed from scratch, whereas others are improved versions of software that was already available on the Web. The implementations in the toolbox are conservative in their use of memory. The toolbox is available for download here.
Please note I am no longer actively maintaining this toolbox. Your mileage may vary!
Currently, the Matlab Toolbox for Dimensionality Reduction contains the following techniques:
- Principal Component Analysis (PCA)
- Probabilistic PCA
- Factor Analysis (FA)
- Classical multidimensional scaling (MDS)
- Sammon mapping
- Linear Discriminant Analysis (LDA)
- Isomap
- Landmark Isomap
- Local Linear Embedding (LLE)
- Laplacian Eigenmaps
- Hessian LLE
- Local Tangent Space Alignment (LTSA)
- Conformal Eigenmaps (extension of LLE)
- Maximum Variance Unfolding (extension of LLE)
- Landmark MVU (LandmarkMVU)
- Fast Maximum Variance Unfolding (FastMVU)
- Kernel PCA
- Generalized Discriminant Analysis (GDA)
- Diffusion maps
- Neighborhood Preserving Embedding (NPE)
- Locality Preserving Projection (LPP)
- Linear Local Tangent Space Alignment (LLTSA)
- Stochastic Proximity Embedding (SPE)
- Deep autoencoders (using denoising autoencoder pretraining)
- Local Linear Coordination (LLC)
- Manifold charting
- Coordinated Factor Analysis (CFA)
- Gaussian Process Latent Variable Model (GPLVM)
- Stochastic Neighbor Embedding (SNE)
- Symmetric SNE
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Neighborhood Components Analysis (NCA)
- Maximally Collapsing Metric Learning (MCML)
- Large-Margin Nearest Neighbor (LMNN)
In addition to the techniques for dimensionality reduction, the toolbox contains implementations of 6 techniques for intrinsic dimensionality estimation, as well as functions for out-of-sample extension, prewhitening of data, and the generation of toy datasets.
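The intrinsic dimensionality estimators are available through the intrinsic_dim.m function. A minimal sketch (the toy dataset and the choice of the 'MLE' estimator are illustrative):

```matlab
% Generate a toy helix dataset that ships with the toolbox.
[X, labels] = generate_data('helix', 2000);

% Estimate the intrinsic dimensionality with the
% maximum-likelihood estimator ('MLE').
no_dims = round(intrinsic_dim(X, 'MLE'));
disp(['Estimated intrinsic dimensionality: ' num2str(no_dims)]);
```

The resulting estimate can be passed on as the target dimensionality for compute_mapping.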
The toolbox provides easy access to all these implementations. Basically, the only command you need to execute is:

```matlab
[mapped_data, mapping] = compute_mapping(data, method, # of dimensions, parameters)
```
The function assumes that the dimensions are the columns of the data matrix and that the instances are the rows. The function also accepts PRTools datasets. Information on how to specify parameters for certain techniques can be obtained by typing help compute_mapping at the Matlab prompt. For more instructions on how to install and use the toolbox, please read the Readme.txt file that is included in the toolbox.
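A minimal end-to-end sketch of this call (assuming the toolbox is on the Matlab path; the Swiss-roll dataset is generated with the toolbox's own generate_data function, and PCA is just an illustrative choice of method):

```matlab
% Generate a toy Swiss-roll dataset: 1000 instances (rows),
% 3 dimensions (columns), plus a manifold coordinate per point.
[X, labels] = generate_data('swiss', 1000);

% Reduce to 2 dimensions with PCA; 'mapping' stores the learned model.
[mappedX, mapping] = compute_mapping(X, 'PCA', 2);

% Plot the embedding, colored by the manifold coordinate.
scatter(mappedX(:,1), mappedX(:,2), 5, labels);
```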
You are free to use, modify, or redistribute this software in any way you want, but only for non-commercial purposes. Use of the toolbox is at your own risk: the author is not responsible for any damage resulting from errors in the software. I would appreciate it if you refer to the toolbox or its author in your papers.
For more information on the techniques implemented in the toolbox, we refer to the following publications:
- L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality Reduction: A Comparative Review. Tilburg University Technical Report, TiCC-TR 2009-005, 2009. PDF
- L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. PDF [Supplemental material] [Talk]
When using the toolbox, the code quits, saying that some function could not be found?
Nine out of ten times, such errors are the result of forgetting to add the toolbox to the Matlab path. You can add the toolbox to the Matlab path by typing addpath(genpath('installation_folder/drtoolbox')). Another probable cause is a naming conflict with another toolbox (e.g., another toolbox with a PCA function). You can investigate such errors using Matlab's which function. If Matlab complains that it cannot find the bsxfun function, your Matlab version is likely very outdated. You may try using this code as a surrogate.
Next to reducing the dimensionality of my data, Isomap/LLE/Laplacian Eigenmaps/LTSA also reduced the number of data points? Where did these points go?
You may observe this behavior in most techniques that are based on neighborhood graphs. Isomap/LLE/Laplacian Eigenmaps/LTSA can only embed data that gives rise to a connected neighborhood graph. If the neighborhood graph is not connected, the implementations embed only the largest connected component of the neighborhood graph. You can obtain the indices of the embedded data points from mapping.conn_comp (where mapping is the second output of the compute_mapping function). If you really need to have all your data points embedded, don't use a manifold learner.
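A short sketch of how to match the embedded points back to their labels using mapping.conn_comp (the dataset and the neighborhood size of 7 are illustrative):

```matlab
[X, labels] = generate_data('swiss', 1000);

% Isomap with 7 nearest neighbors; if the neighborhood graph is
% disconnected, only its largest connected component is embedded.
[mappedX, mapping] = compute_mapping(X, 'Isomap', 2, 7);

% Keep only the labels of the points that were actually embedded.
labels = labels(mapping.conn_comp);
scatter(mappedX(:,1), mappedX(:,2), 5, labels);
```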
How do I provide label information to the supervised techniques/metric learners?
You should provide label information to supervised techniques (LDA, NCA, MCML, and LMNN) by setting the elements of the first column of the data matrix to the label of the corresponding data point. The labels must be numeric. For embedding test data, use the out_of_sample.m function without specifying the test labels.
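A minimal sketch of the label-in-first-column convention (the two-class labeling of the toy data is purely illustrative):

```matlab
[X, labels] = generate_data('swiss', 1000);

% Supervised techniques require numeric labels; construct two
% toy classes by thresholding the manifold coordinate.
labels = double(labels > median(labels));

% Prepend the labels as the first column of the data matrix
% and run a supervised technique (here: LDA).
[mappedX, mapping] = compute_mapping([labels X], 'LDA', 1);
```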
How do I project low-dimensional data back into the data space?
Back-projection can only be implemented for linear techniques, for autoencoders, and for the GPLVM. For some of these models, the toolbox implements back-projection via the reconstruct_data.m function.
Which techniques support an exact out-of-sample extension?
Only parametric dimensionality reduction techniques, i.e., techniques that learn an explicit function between the data space and the low-dimensional latent space, support exact out-of-sample extensions. All linear techniques (PCA, LDA, NCA, MCML, LPP, and NPE) support exact out-of-sample extensions, and so do autoencoders. Spectral techniques such as Isomap, LLE, and Laplacian Eigenmaps support out-of-sample extensions via the Nyström approximation. The out-of-sample extensions can be used via the out_of_sample.m function.
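A short sketch of the out-of-sample workflow (the train/test split and the choice of PCA are illustrative; PCA is linear, so its extension is exact):

```matlab
[X, labels] = generate_data('swiss', 1200);

% Hold out the last 200 points as "test" data.
train = X(1:1000, :);
test  = X(1001:end, :);

% Learn a PCA mapping on the training data...
[mappedX, mapping] = compute_mapping(train, 'PCA', 2);

% ...and embed the held-out points with the learned linear mapping.
mapped_test = out_of_sample(test, mapping);
```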
Which technique should I use to visualize high-dimensional data in a scatter plot?
t-SNE typically is very good at visualizing data. Manifold learners often perform disappointingly for data visualization due to a problem in their covariance constraint. Parametric techniques are typically not well suited for visualization, because they constrain the mapping between the data and the visualization.
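A minimal t-SNE visualization sketch (the toy dataset is illustrative; for t-SNE no mapping struct is needed, since it does not learn an out-of-sample mapping):

```matlab
[X, labels] = generate_data('swiss', 1000);

% Embed into 2 dimensions with t-SNE for visualization.
mappedX = compute_mapping(X, 'tSNE', 2);

% Scatter plot of the embedding, colored by the manifold coordinate.
scatter(mappedX(:,1), mappedX(:,2), 5, labels);
```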