|
Projection methods in Excel
Contents
IntroductionThe manual describes application of software Chemometrics Add-In that is a specially designed add-in for Microsoft Excel. Chemometrics Add-In is aimed to data analysis based on the following projection methods: Principal Component Analysis (PCA), Projection on Latent Structures (PLS1 and PLS2). All calculations are performed inside an active Excel workbook by means of the special user-defined worksheet functions in the same manner as it is done by standard math functions, e.g. MMULT. It is supposed that a reader has an experience in working within Excel environment and familiar with basic principals of array calculations by means of standard worksheet functions. Notation and methods of matrix algebra are intensively used in the manual. Readers who have lack of skills in this subject are highly recommended to study or at least get acquainted with the pertinent literature. The manual is not a textbook on multivariate data analysis. Only basic formulae essential for understanding of Chemometrics Add-In implementation are described here. A reader poorly acquainted with these methods is asked to address to the special literature The manual is illustrated by the examples placed in Projection.xls workbook, which accompanies the manual. Important information on Projection.xls application References to the examples are placed in the text of the manual as Excel objects. All examples are of abstract modeling type but their background is tightly connected with real-world cases. 1. Projection methods1.1. Concept and NotationProjection methods are widely used for multivariate data analysis especially in chemometrics. They are applied both to a one-block data X (classification, e.g. PCA) and to a double-block data as X and Y (calibration, e.g. PCR and PLS). Let us consider the (I×J) data matrix X, where I is the number of objects (rows) and J is the number of independent variables (columns). Ordinary number of variables is rather high (J>>1). We can simultaneously analyze the (I×K) matrix Y, where I is the same number of objects and K is the number of responses. The essence of the projection techniques is considerable reduction of the data dimensionality both for blocks X and Y. The are many reviews of the projection methods and the reader is referred to them for more details. 1.2. PCAPrincipal Component Analysis (PCA) is the oldest projection method. This method uses new formal (or latent) variables ta (a=1,…A), which are linear combination of the original variables xj (j=1,…J). i.e. ta=pa1x1+… paJxJ , or in the matrix notation In this equation T is called the score matrix, or scores. Its dimension is (I×A). Matrix P is called the loading matrix, or loadings, and it has dimension (A×J). E is the residual (I×J) matrix.
Fig. 2 PCA graphic representation New variables ta are often called principal components (PC) therefore the method is called Principal Component Analysis (PCA). The number of columns ta in matrix T, and pa in matrix P, is equal to number A that is called the number of principal components. It defines the projection model complexity. Value of A is certainly less than the number of variables J and the number of objects I. Score and loading matrices have the following properties
The recurrent algorithm NIPALS, is often used for calculation of PCA scores and loadings. NIPALS extracts one principal component at a time. At first, the original matrix X is pre-processed (at least it is centered; see section 1.4) and thus it turns into matrix E0, a=0. Afterwards the following algorithm is applied
As soon as component a is calculated, the following substitution is done: ta=t, and pa=p. For calculation of the next principal component the new residuals are calculated in accordance with the following expression Ea+1 = Ea – t pt and the same algorithm is repeated with a+1 substituted for a. After principal component space is constructed, the new objects Xnew can be projected onto this space. In other words the score matrix Tnew can be calculated. In PCA this is easily done by the expression Tnew = Xnew P Naturally Xnew matrix for the new objects should be pre-processed in the same way as the training matrix X, which was used for the PCA decomposition 1.3. PLSProjection on Latent Structures (PLS) can be considered as a
generalization of PCA. In PLS the decompositions of matrices X and Y
are conducted simultaneously Projection is built in order to maximize correlation between corresponding vectors of X-scores ta and Y-scores ua.
Fig. 3 PLS graphic representation If Y block consists of several responses (i.e. K>1), two types of projections can be built; they are PLS1 and PLS2. In the first case the projection space is built separately for each response variable yk . Scores T (U) and loadings P (W, Q) depend on the response in use. Such an approach is called PLS1. For the PLS2 method one common space is built for all responses. The following expressions show the properties of PLS projection matrices
The recurrent algorithm is used for calculation of PLS scores and loadings. This algorithm calculates one PLS component at a time. For PLS1 the algorithm is as follows. The initial matrix X and vector y are preprocessed (at least they are centered; see section 1.4) and thus they turn into matrix E0 and vector f0, a=0. Afterwards the following steps are performed.
After current (a-th) PLS1 component has been calculated the following substitutions are done: ta=t, pa=p, wa=w, ua=u, and qa=q. For calculation of the next component new residuals should be calculated in accordance with the following expressions Ea+1 = Ea – t pt and fa+1 = fa – qt. The same algorithm is repeated for a+1 substituted for a. For PLS2 the algorithm is as follows. At first the initial matrices X and Y are preprocessed (at least they are centered; see section 1.4) and they turn into matrices E0 and F0, a=0. Afterwards the following steps are performed.
After current (a-th) PLS2 component has been calculated the following substitutions are done: ta=t, pa=p, wa=w, ua=u, and qa=q. For calculation of the next component, new residuals should be calculated in accordance with the following expressions Ea+1 = Ea – t pt and Fa+1 = Fa – tqt . Afterwards the same algorithm is repeated with a+1 substituted for a. Calculation of PLS scores Tnew and Unew for new objects (Xnew , Ynew) is more complicated than in PCA. For PLS1, matrix Xnew and vector ynew are pre-processed in the same way as in the calibration dataset, and thus they turn into matrix E0 and vector f0, a=0. Further, the following algorithm for components a=1,…, A is applied
After calculation of the current PLS1 component (a-th), new residuals are calculated as Ea+1 = Ea – tpt and fa+1 = fa – qt . The same procedure is then applied after substituting index a+1 for a. In PLS2 new matrices Xnew and Ynew are pre-processed in the same manner as training data and they turn into matrices E0 and F0, a=0. For each PLS2 component a=1,…, A , the following algorithm is applied
After calculation of the a-th PLS2 component new residuals are calculated as Ea+1 = Ea – t pt and Fa+1 = Fa – tqt . After substituting index a+1 for a, the same procedure is used. 1.4. Data pre-processingIt is worthy of mentioning that the PCA and PLS methods do not take into consideration the free term in the course of X and Y decomposition. This could be seen from Eq (1) and (2). It is initially supposed that all columns in matrices X и Y have zero mean values, i.e.
This condition can be easily satisfied by data centering. Centering means subtraction of matrix M from the original matrix X, i.e. Centering is a column-wise operation. For each vector xj the mean value is calculated as
In that case M=(m11,..., mJ1), where 1 is the vector of identities with dimension I. Centering is always an obligatory procedure before application of the projection methods. The second simplest pre-processing technique is scaling. Scaling is not as necessary as centering. In comparison with centering, scaling does not change the data structure but simply modifies the weights in different parts of data. The most widely used scaling is a column-wise one. This can be expressed as a right multiplication of matrix X by matrix W, i.e.
Matrix W is the diagonal matrix with dimension J×J. As a rule, diagonal elements wjj are equal to the inverse values of standard deviations, i.e.
calculated for each column xj. Row-wise scaling (also called normalization) is a left multiplication of matrix X by a diagonal matrix W, i.e.
In this case dimension of W is I×I, and its elements wii are ordinary the inverse values of standard deviations calculated for each row xit Combination of centering and column-wise scaling
is called autoscaling. Data scaling is often used to compensate contribution of various variables to a model (i.e. in hyphenated methods such as LC-MS), to take into account the heterogeneous errors, or in the case when different data blocks should be processed in one model. Scaling can also be seen as a method for numerical calculations stabilization. At the same time scaling should be used with a caution as such a pre-processing could essentially change the results of quality analysis. Each pre-processing (centering, scaling, etc) is firstly applied to the calibration data set. This set is used for calculation of mj and dj, which afterwards are applied for the pre-processing both of the training and test sets. 1.5. SoftwareProjection methods are implemented in various software packages. Short review of some popular programs is presented in this section. End-users mainly apply special packages. Among them are the following: SIMCA developed by Umetrics, and a similar software the Unscrambler developed by Camo. Among the most popular packages is also PLS ToolBox developed by Eigenvector Research, Inc. First two packages work in a stand alone mode, whilst the latter works in the MatLab environment. Many projection methods are included in the general statistical packages such as Statistica, SPSS , and others. There is one shareware program. This is Multivariate Analysis Add-in for Excel developed in Bristol Center for Chemometrics headed by prof. R. Brereton 1.6. Didactic exampleFile Projection.xls is used for illustration of the Chemometrics add-in facilities. The performance of all the above-mentioned methods is illustrated with a simulated data set (X, Y), which should be centered but not scaled. File Projection.xls includes the following worksheets:
2. Application of Chemometrics Add-In2.1. InstallationTo start working with Chemometrics Add-In, the software should be properly installed. The setup package consists of two files: Chemometrics.dll and Chemometrics.xla. The setup package with Demo version of Chemometrics Add-In can be downloaded here. This version is not limited in functionality, but it can be used with "our own" files only. These files (Calibration.xls, Excel.xls, Projections.xls, etc) may be modified in any way, but the changes cannot be saved. Full version may be granted by demand. The setup files can be placed manually in accordance with the instruction given in worksheet Install . Afterwards Chemometrics Add-In should be installed as any other Excel add-in. Chemometrics Add-In can be installed automatically with the help of button Install in worksheet Install
Download files Chemometrics.dll and Chemometrics.xla to some folder in your computer, e.g. If Chemometrics Add-In has already been installed on the computer, the procedure will be terminated. In other case, after confirmation of the intention, the following dialog appears.
Here you should select the folder where Chemometrics setup package is located. If you point out a wrong folder, the following message appears
and you should find a proper folder on your computer. In the course of installation various information boxes could appear, e.g.
It is recommended to confirm all such queries. If installation succeeded, you will see the message box
in which all operations are listed. If installation failed or cancelled, the following message is displayed
After successful completion of installation, no other Add-Ins related operations are required, and you can start to explore Chemometrics Add-In immediately without reloading Excel. In some cases, after successful Add-in installation you will see "########" or "#VALUE!" in cells with chemometric functions. Press the key for immediate recalculation of the whole Projection.xls workbook 2.2. General informationChemometrics Add-In takes all input information from an active Excel workbook. This information should be placed directly in the areas of worksheets (data X and Y). A user may organize one's working space in the most convenient way, to place all data onto one worksheet, or share them between several worksheets, or even use the worksheets in various workbooks. The results are output as the arrays in the worksheets. Software does not have any limitations on the input arrays' dimension, i.e. the number of objects (I), the number of variables (J), and the number of responses (K). The input data size is limited by the computer memory supported by the operation system in use and also by limitations of Excel software Chemometrics Add-In includes several
functions, which can be used as standard worksheet functions. For this
purpose one can use Insert Function
Returned values are arrays, therefore the functions
should be entered as array
formula, i.e. applying Function arguments can be numbers, names, arrays, or references containing numbers. To input an array formula first it is necessary to
select the area on the worksheet, which corresponds to the dimension of
the output array. If the selected area is larger than the output array,
the superfluous cells are filled with the Number
of principal/PLS components Centering
and/or scaling
If any argument 3. PCA3.1. ScoresPCAPerforms decomposition of matrix X by using the principal component analysis (PCA), and then returns an array that presents the score values Tnew calculated for matrix Xnew. Syntax
Remarks
Example Fig. 4 Example of input for function ScoresPCA 3.2. LoadingsPCAPerforms decomposition of matrix X by using the principal component method, and then returns an array that presents the loading values P Syntax
Remarks
Example Fig. 5 Example of input for function LoadingsPCA 4. PLS4.1. ScoresPLSPerforms decomposition of matrices X and Y by using the method of projection on latent structure (PLS), and then returns an array that presents the PLS score values Tnew calculated for matrices Xnew. Syntax
Remarks
Example Fig.6 Example of output for function ScoresPLS 4.2. UScoresPLSPerforms decomposition of matrices X and Y by using the method of projection on latent structure (PLS), and then returns an array that presents the PLS score values Unew calculated for matrices Xnew and Ynew. Syntax
Remarks
Example Fig.7 Example of output for function UScoresPLS 4.3 LoadingsPLSPerforms decomposition of matrices X and Y by using the method of projection on latent structure (PLS), and then returns an array that presents the loadings values P Syntax
Remarks
Example Fig. 8 Example of input for function LoadingsPLS 4.4 WLoadingsPLSPerforms decomposition of matrices X and Y by using the method of projection on latent structure (PLS), and then returns an array that presents the loading weights values W Syntax
Remarks
Example Fig. 9 Example of input for function WLoadingsPLS 4.5 QLoadingsPLSPerforms decomposition of matrices X and Y by using the method of projection on latent structure (PLS), and then returns an array that presents the loadings values Q Syntax
Remarks
Example Fig. 10 Example of input for function QLoadingsPLS 5. PLS25.1. ScoresPLS2Performs decomposition of matrices X and Y by using the method of projection on latent structure (PLS2), and then returns an array that presents the PLS2 score values Tnew calculated for matrices Xnew. Syntax
Remarks
Example Fig.11 Example of output for function ScoresPLS2 5.2. UScoresPLS2Performs decomposition of matrices X and Y by using the method of projection on latent structure (PLS2), and then returns an array that presents the PLS score values Unew calculated for matrices Xnew and Ynew. Syntax
Remarks
Example Fig.12 Example of output for function UScoresPLS2 5.3 LoadingsPLS2Performs decomposition of matrices X and Y by using the method of projection on latent structure (PLS2), and then returns an array that presents the loadings values P Syntax
Remarks
Example Fig. 13 Example of output for function LoadingsPLS2 5.4 WLoadingsPLS2Performs decomposition of matrices X and Y by using the method of projection on latent structure (PLS2), and then returns an array that presents the loadings weights values W Syntax
Remarks
Example Fig. 14 Example of input for function WLoadingsPLS2 5.5 QLoadingsPLS2Performs decomposition of matrices X and Y by using the method of projection on latent structure (PLS2), and then returns an array that presents the loadings values Q Syntax
Remarks
Example Fig. 15 Example of input for function QLoadingsPLS2 6. Additional Matrix Functions6.1. MIdentReturns a square identity matrix Syntax
Remarks
Example Fig. 16 Example of input/output for function MIdent 6.2. MIdentD2Returns two-diagonal rectangular matrix of zeros and units Syntax
Remarks
Example Fig. 17 Example of input/output for function MIdentD2 6.3. MCutRowsReturns a matrix with cut out rows Syntax
Remarks
Example Fig. 18 Example of input/output for function MCutRows 6.4. MTraceReturns the matrix trace of an array Syntax
Remarks
Example Fig. 19 Example of input/output for function MTrace
ConclusionChemometrics Add-In provides possibilities for calculation of various scores and loadings for projection methods using original data. Furthermore, applying these results, typical graphic representation used in multivariate data analysis can be obtained, i.e. scores plots, loading plots, and others. Such an approach also provides possibilities for building calibration models, calculation of root mean square error of calibration (RMSEC) and prediction (RMSEP), and etc. Examples are demonstrated in files
|