As of version 9.0, it is possible to perform **PCA** of **NMR** data sets directly from within the **Mnova** user interface without having to resort to third-party applications. The basic PCA functionality has been covered previously in this blog (see Chemometrics under Mnova 9 – PCA); in this entry we discuss some practical aspects in more detail, particularly the different binning, filtering and scaling options.

What follows has been kindly written by Silvia Mari (project leader of the PCA module) and Isaac Iglesias, who programmed this module in Mnova.

## Introduction

Matrix generation from an array of NMR spectra is the core step in chemometric analysis. This procedure involves several options that the user must choose. In this entry we focus on the practical aspects of preparing the matrix from NMR data. Broadly speaking, we can consider three main issues:

- Choice of binning method: Sum vs Peak
- Filtering or not filtering?
- Choice of Scaling strategy

## Choice of binning method: Sum vs Peak

When dealing with high-resolution NMR spectra it is generally impractical to work with all the data points of the spectra, which usually number 32K or more. The most common strategy for reducing the number of variables consists of dividing each spectrum into a defined number of regions, the so-called *bins*. Several binning strategies are available today, from regular binning, where bins have a fixed width, to more sophisticated strategies such as Gaussian or dynamic adaptive binning [2]. Even in these cases, when dealing with particularly crowded spectra, shifts of peaks close to bin boundaries can cause dramatic quantitative changes in adjacent bins. Peak deconvolution strategies can help to solve this problem. Generally speaking, a deconvolved peak is a mathematical entity characterized by a chemical shift (frequency), an intensity and a half-height line width. The integral of a peak can be derived automatically from the intensity and the line width once a peak shape (e.g. Lorentzian) is assumed. For this reason, binning a spectrum of deconvolved peaks virtually eliminates the problem of bin boundaries, as illustrated in figure 1.

**Figure 1** – Binning real peaks versus binning deconvolved peaks
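The difference between the two binning approaches can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the Mnova implementation: the function names, the synthetic peak and the bin edges are all hypothetical, the line shape is assumed to be a pure Lorentzian (whose area is π/2 × height × full width at half maximum), and the ppm axis is assumed to be evenly spaced and ascending.

```python
import numpy as np

def lorentzian(x, center, height, fwhm):
    """Lorentzian line shape with the given peak height and full width at half maximum."""
    hwhm = fwhm / 2.0
    return height * hwhm**2 / ((x - center) ** 2 + hwhm**2)

def lorentzian_area(height, fwhm):
    """Analytical integral of a Lorentzian line: (pi / 2) * height * FWHM."""
    return 0.5 * np.pi * height * fwhm

def sum_binning(ppm, intensity, edges):
    """Sum method: integrate the real spectrum inside each bin (rectangle rule)."""
    step = ppm[1] - ppm[0]  # assumes an evenly spaced, ascending ppm axis
    totals, _ = np.histogram(ppm, bins=edges, weights=intensity)
    return totals * step

def peak_binning(shifts, areas, edges):
    """Peak method: add each deconvolved peak's area to the bin holding its shift."""
    totals, _ = np.histogram(shifts, bins=edges, weights=areas)
    return totals

# Hypothetical example: one peak 0.002 ppm away from the bin boundary at 1.03 ppm.
edges = np.array([1.00, 1.03, 1.06])  # two bins, each 0.03 ppm wide
ppm = np.linspace(1.00, 1.06, 6001)
center, height, fwhm = 1.028, 100.0, 0.01

by_sum = sum_binning(ppm, lorentzian(ppm, center, height, fwhm), edges)
by_peak = peak_binning([center], [lorentzian_area(height, fwhm)], edges)
print("Sum :", by_sum)   # the peak's area is split between the two bins
print("Peak:", by_peak)  # the whole area is assigned to the first bin
```

With the Sum method the tail of the peak leaks into the neighbouring bin, so a small shift changes both bins; with the Peak method the full area goes to the single bin that contains the peak position.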

When dealing with an array of NMR spectra, regular binning with **b** bins over a stack of **s** spectra generates a **b** x **s** matrix (see figure 2). It is not possible, however, to generate a similar matrix directly from the deconvolved peaks (peak list), since the number and position of peaks vary from spectrum to spectrum.

**Figure 2** – Matrix generation from regular binning or peak list.

To overcome this problem there are two main strategies: (1) provide algorithms for peak alignment across the spectra series, together with strategies for dealing with missing peaks, in order to end up with the same number of peaks and the same peak positions for all the spectra; (2) perform binning over the peak table.

In the PCA module available in Mnova, we adopt the second solution. The user can decide in the binning options whether to use regular binning (Sum) or binning over deconvolved peaks (Peak). An example of the improved classification is shown qualitatively in figure 3, which compares score plots obtained with the Sum method (panel A) and with the Peak method (panel B).

**Figure 3** – Score plots obtained using the same bin width of 0.03 ppm; in both cases data were normalized by the sum and Pareto scaled. In panel A bins were obtained directly by integration of the real spectra; in panel B bins were obtained by binning the corresponding peak list obtained after global spectral deconvolution.
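As a rough sketch of the second strategy (binning over the peak table), the following Python snippet builds a matrix from peak lists that contain different numbers of peaks. The function name, the peak lists and the ppm range are hypothetical, and the matrix is stored with one row per spectrum and one column per bin, which is simply the transpose of the **b** x **s** layout described above.

```python
import numpy as np

def peak_table_matrix(peak_lists, ppm_min, ppm_max, bin_width):
    """Bin each spectrum's peak list onto a common grid of fixed-width bins.

    peak_lists holds one (shifts, areas) pair per spectrum. The result has
    one row per spectrum and one column per bin. Peaks lying outside
    [ppm_min, ppm_max] are ignored.
    """
    n_bins = int(round((ppm_max - ppm_min) / bin_width))
    edges = np.linspace(ppm_min, ppm_max, n_bins + 1)
    rows = []
    for shifts, areas in peak_lists:
        row, _ = np.histogram(shifts, bins=edges, weights=areas)
        rows.append(row)
    return np.vstack(rows), edges

# Hypothetical peak lists: the spectra have 2, 3 and 1 deconvolved peaks.
peak_lists = [
    ([0.010, 0.050], [1.0, 2.0]),
    ([0.012, 0.048, 0.100], [1.5, 2.5, 0.5]),
    ([0.075], [3.0]),
]
matrix, edges = peak_table_matrix(peak_lists, 0.0, 0.12, 0.03)
print(matrix)
```

Because every spectrum is projected onto the same bin grid, the rows always have the same length, and a bin with no peak simply receives a value of zero. No peak alignment or missing-peak handling is needed.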

## Filtering or not filtering?

When the bin width is reduced to approach the spectral resolution, and the number of variables therefore increases, it is generally necessary to introduce filtering strategies in order to remove those variables that do not show significant changes. There are established filtering strategies that are commonly applied to genomics data and that can also be used successfully for NMR-based data [1]. In the PCA module we have implemented five filtering options, namely:

- Standard Deviation
- Median Absolute Deviation
- Interquartile Range
- Mean Value
- Median Value

In the first three cases a fixed fraction (default 10%) of the bins is discarded (e.g. if the matrix is composed of 100 bins, 10 bins are discarded) and the selection is based on the filter method chosen. In the case of Mean Value or Median Value, the user is asked to enter a cut-off value for the mean or the median, and only bins whose mean or median is lower than this value are discarded. The following figure illustrates the difference in clustering capability with and without filtering. Finally, it is worth noting that NMR data very often contain regions that should be discarded; these can be included in the so-called blind regions, which are not taken into account in the principal component calculation.

**Figure 4** – Score plots obtained using the same bin width of 0.01 ppm; in both cases data were normalized by the sum and Pareto scaled. In panel A no filter was applied; in panel B a filtering strategy based on Mean Value was applied, with a cut-off value of 100.
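The five filters can be sketched as follows in Python. This is a minimal illustration and not the Mnova code: the function names are hypothetical, the matrix is assumed to hold spectra in rows and bins in columns, and for the three dispersion-based filters it is assumed that the discarded fraction consists of the bins with the lowest dispersion (the text above only states that the selection depends on the chosen method).

```python
import numpy as np

def dispersion_filter(X, method="sd", fraction=0.10):
    """Discard the given fraction of bins (columns) with the lowest dispersion."""
    if method == "sd":        # Standard Deviation
        score = X.std(axis=0, ddof=1)
    elif method == "mad":     # Median Absolute Deviation
        med = np.median(X, axis=0)
        score = np.median(np.abs(X - med), axis=0)
    elif method == "iqr":     # Interquartile Range
        q75, q25 = np.percentile(X, [75, 25], axis=0)
        score = q75 - q25
    else:
        raise ValueError(f"unknown method: {method}")
    n_drop = int(round(fraction * X.shape[1]))
    order = np.argsort(score, kind="stable")  # lowest dispersion first
    keep = np.sort(order[n_drop:])            # keep the original bin order
    return X[:, keep], keep

def threshold_filter(X, cutoff, statistic="mean"):
    """Discard bins whose mean (or median) across spectra is below the cut-off."""
    stat = X.mean(axis=0) if statistic == "mean" else np.median(X, axis=0)
    keep = np.flatnonzero(stat >= cutoff)
    return X[:, keep], keep

# Hypothetical data: 20 spectra, 100 bins with increasing variability.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100)) * np.linspace(0.1, 10.0, 100)
filtered, kept = dispersion_filter(X, method="sd", fraction=0.10)
print(filtered.shape)
```

Both functions also return the indices of the retained bins, so that loadings can later be mapped back to the original chemical shift positions.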

## Choice of Scaling strategy

Scaling is an operation that is performed on the variables (columns) of the matrix. The scaling strategy depends on the one hand on the biological information we wish to extract, and on the other hand on the data analysis method chosen (in our case PCA). As a first step, the so-called *Centering* is generally applied in every analysis. With *Centering* all bin values fluctuate around zero instead of around the mean of each bin; *Centering* is therefore a method that adjusts for differences in the offset between high- and low-abundance compounds. Several scaling methods are available in the literature [3], and centering is generally applied in combination with them. Scaling strategies can be divided into two subclasses: methods that use a measure of data dispersion (such as the standard deviation) as the scaling factor, and methods that use a size measure (such as the mean). For the first group Mnova includes Auto, Pareto and Vast scaling. For the second group Range and Level scaling are available. Generally speaking, the first group is normally preferred for PCA. Figure 5 compares the score plots of a PCA that used Pareto scaling (panel A) with those of a PCA that used Level scaling (panel B).

**Figure 5** – Score plots obtained using the same bin width of 0.05 ppm and normalization by the sum. In panel A Pareto scaling was applied; in panel B Level scaling was applied.
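For reference, the scaling methods named above can be written compactly in Python using the definitions given by van den Berg et al. [3]. This is a sketch under assumptions: the function names are hypothetical, spectra are assumed to be in rows and bins in columns, and whether Mnova uses exactly these formulas (for example the sample standard deviation) is not confirmed here. The small `pca_scores` helper uses a plain singular value decomposition only to show how the scaled matrix feeds into the score plot.

```python
import numpy as np

def scale(X, method="pareto"):
    """Center each bin (column) and apply the chosen scaling factor."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)  # sample standard deviation of each bin
    centered = X - mean
    if method == "center":
        return centered
    if method == "auto":        # unit variance for every bin
        return centered / sd
    if method == "pareto":      # square root of the standard deviation
        return centered / np.sqrt(sd)
    if method == "vast":        # autoscaling weighted by mean / sd
        return (centered / sd) * (mean / sd)
    if method == "range":       # difference between maximum and minimum
        return centered / (X.max(axis=0) - X.min(axis=0))
    if method == "level":       # mean of each bin
        return centered / mean
    raise ValueError(f"unknown method: {method}")

def pca_scores(Z, n_components=2):
    """Scores of a column-centered matrix Z from its singular value decomposition."""
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Hypothetical data: 12 spectra, 5 bins with strictly positive values.
rng = np.random.default_rng(1)
X = rng.uniform(1.0, 10.0, size=(12, 5))
scores_pareto = pca_scores(scale(X, "pareto"))
scores_level = pca_scores(scale(X, "level"))
print(scores_pareto.shape, scores_level.shape)
```

Bins with zero dispersion or zero mean would cause a division by zero in this sketch, so in practice they need to be filtered out or handled before scaling.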

## Conclusions

We have focused on some very practical aspects of PCA analysis. However, it is always necessary to consider how good our experimental design was. Quoting Stanley Deming [4] in his 1986 overview of chemometrics: “*Chemometrics is primarily concerned with the acquisition of data and the extraction of useful information from that data*” and again: “*In a given situation, it is far better to err on the side of too many pieces of experimental data. If too few data are available, one might not be able to make any conclusion, and the whole set of experiments will have been wasted*”.

## Acknowledgments

We are grateful to Dr. Giovanna Musco and Dr. Jose Garcia-Manteiga for providing the dataset used for testing purposes.

## References

[1] Amber J Hackstadt, Filtering for increased power for microarray data analysis. BMC Bioinformatics 2009, 10:11

[2] Paul E. Anderson, Metabolomics, Volume 7, Issue 2, pp 179-190 (2010)

[3] Robert A van den Berg, Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 2006, 7:142

[4] Stanley N. Deming, Chemometrics: an Overview. Clin. Chem. 32/9, 1702-1706 (1986)
