As of version 9.0, it is possible to perform PCA of NMR data sets directly from within the Mnova user interface without having to resort to third-party applications. The basic PCA functionality has been covered previously in this blog (see Chemometrics under Mnova 9 – PCA); in this entry we discuss some more practical aspects in greater detail, particularly the different binning, filtering and scaling options.
What follows has been kindly written by Silvia Mari (project leader of the PCA module) and Isaac Iglesias, who programmed this module in Mnova.
Introduction
Matrix generation from an array of NMR spectra is the core step in chemometric analysis. This procedure involves several options that the user must choose. In this entry we focus on the practical aspects of matrix preparation from NMR data. Broadly speaking, we can consider three main issues:
- Choice of binning method: Sum vs Peak
- Filtering or not filtering?
- Choice of Scaling strategy
Choice of binning method: Sum vs Peak
When dealing with high-resolution NMR spectra, it is generally impractical to work with all the data points of each spectrum, which usually number 32K or more. The most common strategy for reducing the number of variables is to divide each spectrum into a defined number of regions, the so-called bins. Several binning strategies are available today, from regular binning, where bins have a fixed width, to more sophisticated strategies such as Gaussian or dynamic adaptive binning [2].
Even with these strategies, in particularly crowded spectra small shifts of peaks close to bin boundaries can cause dramatic quantitative changes in adjacent bins. Peak deconvolution can help to solve this problem.
Generally speaking, a deconvolved peak is a mathematical entity characterized by a chemical shift (frequency), an intensity and a line width at half height. Assuming a peak shape (e.g. Lorentzian), the integral of the peak follows directly from its intensity and line width. For this reason, binning a spectrum of deconvolved peaks virtually eliminates the problem of bin boundaries, as illustrated in figure 1.
Figure 1 – Binning real peaks versus binning deconvolved peaks
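The bin-boundary effect can be reproduced with a short Python sketch. It is only an illustration, not the Mnova implementation: the function names, the (shift, height, width) tuple layout and the example peak are assumptions, and a pure Lorentzian line shape is assumed so that the peak area is pi × height × width / 2.

```python
import numpy as np


def lorentzian(x, centre, height, fwhm):
    """Lorentzian line with the given centre, height and full width at half height."""
    hwhm = fwhm / 2.0
    return height * hwhm**2 / ((x - centre) ** 2 + hwhm**2)


def lorentzian_area(height, fwhm):
    """Analytical integral of a Lorentzian: pi * height * fwhm / 2."""
    return np.pi * height * fwhm / 2.0


def sum_binning(ppm, intensity, edges):
    """Regular binning: add up the data points that fall inside each bin."""
    binned, _ = np.histogram(ppm, bins=edges, weights=intensity)
    return binned


def peak_binning(peaks, edges):
    """Peak binning: give each peak's whole area to the bin containing its centre.

    `peaks` is a list of (shift_ppm, height, fwhm_ppm) tuples (hypothetical layout).
    """
    shifts = [p[0] for p in peaks]
    areas = [lorentzian_area(p[1], p[2]) for p in peaks]
    binned, _ = np.histogram(shifts, bins=edges, weights=areas)
    return binned


# One peak sitting just below the boundary between two 0.03 ppm bins.
edges = np.array([0.97, 1.00, 1.03])
peak = (0.999, 1.0, 0.004)
ppm = np.linspace(0.97, 1.03, 601)
spectrum = lorentzian(ppm, *peak)

by_sum = sum_binning(ppm, spectrum, edges)   # intensity leaks into the second bin
by_peak = peak_binning([peak], edges)        # whole area stays in the first bin
```

With regular binning, a sizeable part of the signal lands in the neighbouring bin, and that part changes as soon as the peak drifts slightly. With peak binning the second bin stays empty until the peak centre itself crosses the boundary.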
When dealing with an array of NMR spectra, regular binning into b bins of a stack of s spectra generates a b × s matrix (see figure 2). It is not possible to generate a similar matrix directly from the deconvolved peaks (peak lists), since the number and position of peaks vary from spectrum to spectrum.
Figure 2 – Matrix generation from regular binning or peak list.
To overcome this problem there are two main strategies: (1) provide algorithms for peak alignment across the spectra series, together with strategies for dealing with missing peaks, so as to end up with the same number of peaks at the same positions in all spectra; (2) perform binning over the peak table.
In the PCA module available in Mnova, we adopt the second solution. The user can choose between regular binning (Sum) and binning over deconvolved peaks (Peak) in the binning options. Figure 3 gives a qualitative example of the better classification obtained, showing score plots for binning with the Sum method (panel A) and with the Peak method (panel B).
Figure 3 – Score plots obtained using the same bin width of 0.03 ppm; in both cases data were normalized by the sum and Pareto scaled. In panel A bins were obtained directly by integration of the real spectra; in panel B bins were obtained by binning the corresponding peak list obtained after global spectral deconvolution.
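Binning over the peak table gives every spectrum the same number of variables, however many peaks it contains. The following Python sketch illustrates the idea under stated assumptions: the function names and the (shift, area) peak layout are hypothetical, the matrix is stored with spectra as rows and bins as columns (transpose it for the b × s layout), and "normalized by the sum" is taken to mean dividing each spectrum by its total binned intensity.

```python
import numpy as np


def peak_table_matrix(peak_lists, edges):
    """Bin each spectrum's peak list onto common bin edges.

    `peak_lists` holds one list of (shift_ppm, area) tuples per spectrum
    (hypothetical layout). Returns an array of shape (n_spectra, n_bins).
    """
    edges = np.asarray(edges, dtype=float)
    matrix = np.zeros((len(peak_lists), len(edges) - 1))
    for row, peaks in enumerate(peak_lists):
        if not peaks:
            continue  # a spectrum without peaks keeps an all-zero row
        shifts, areas = zip(*peaks)
        binned, _ = np.histogram(shifts, bins=edges, weights=areas)
        matrix[row] = binned
    return matrix


def normalize_by_sum(matrix):
    """Divide each spectrum (row) by its total intensity; empty rows stay zero."""
    totals = matrix.sum(axis=1, keepdims=True)
    safe = np.where(totals == 0.0, 1.0, totals)
    return matrix / safe


# Two spectra with different numbers of peaks, four bins of 0.03 ppm.
edges = np.linspace(0.0, 0.12, 5)
peak_lists = [
    [(0.01, 2.0), (0.07, 1.0)],
    [(0.02, 1.5), (0.04, 0.5), (0.10, 3.0)],
]
matrix = peak_table_matrix(peak_lists, edges)
normalized = normalize_by_sum(matrix)
```

Both spectra end up as rows of length four even though their peak lists differ in length, so no peak alignment or missing-peak handling is needed.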
Filtering or not filtering?
When the bin width is reduced to approach the spectral resolution, and the number of variables therefore increases, it is generally necessary to introduce filtering strategies to remove those variables that do not show significant changes. There are established filtering strategies that are commonly applied to genomics data and that can also be used successfully for NMR-based data [1].
In the PCA module we have implemented five filtering options, namely:
- Standard Deviation
- Median Absolute Deviation
- Interquartile Range
- Mean Value
- Median Value
In the first three cases a fixed fraction (default 10%) of the bins is discarded (e.g. if the matrix is composed of 100 bins, 10 bins are discarded), and the selection is based on the filter method chosen. In the case of Mean Value or Median Value, the user is asked to input a cut-off value for the mean or the median, and only bins whose mean or median is lower than that value are discarded. Figure 4 illustrates the difference in clustering capability with and without filtering.
Finally, it is worth noting that NMR data very often contain regions that should be discarded; these can be included in the so-called blind regions, which are not taken into account in the principal component calculation.
Figure 4 – Score plots obtained using the same bin width of 0.01 ppm; in both cases data were normalized by the sum and Pareto scaled. In panel A no filter was applied; in panel B a filtering strategy based on Mean Value was applied, with a cut-off value of 100.
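The five filters can be sketched in a few lines of Python. This is an illustration rather than the Mnova code: the function names are hypothetical, the matrix is assumed to have spectra as rows and bins as columns, and for the three dispersion-based filters it is assumed that the bins discarded are those with the lowest dispersion, which is the usual choice in the filtering literature [1].

```python
import numpy as np


def dispersion_filter(matrix, method="sd", fraction=0.10):
    """Discard the given fraction of bins with the lowest dispersion (assumed rule).

    Returns the filtered matrix and the indices of the bins that were kept.
    """
    X = np.asarray(matrix, dtype=float)
    if method == "sd":       # standard deviation
        score = X.std(axis=0, ddof=1)
    elif method == "mad":    # median absolute deviation
        med = np.median(X, axis=0)
        score = np.median(np.abs(X - med), axis=0)
    elif method == "iqr":    # interquartile range
        q75, q25 = np.percentile(X, [75, 25], axis=0)
        score = q75 - q25
    else:
        raise ValueError(f"unknown method: {method!r}")
    n_drop = int(round(fraction * X.shape[1]))
    order = np.argsort(score, kind="stable")   # least variable bins first
    keep = np.sort(order[n_drop:])
    return X[:, keep], keep


def value_filter(matrix, cutoff, statistic="mean"):
    """Discard bins whose mean (or median) over the spectra is below `cutoff`."""
    X = np.asarray(matrix, dtype=float)
    if statistic == "mean":
        stat = X.mean(axis=0)
    elif statistic == "median":
        stat = np.median(X, axis=0)
    else:
        raise ValueError(f"unknown statistic: {statistic!r}")
    keep = np.flatnonzero(stat >= cutoff)
    return X[:, keep], keep


# Three spectra, ten bins; bin j holds the values 0, j + 1 and 2 * (j + 1).
X = np.outer([0.0, 1.0, 2.0], np.arange(1.0, 11.0))
kept, kept_idx = dispersion_filter(X, method="sd", fraction=0.10)
above, above_idx = value_filter(X, cutoff=5.0, statistic="mean")
```

With ten bins and the default 10% fraction, exactly one bin (the least variable one) is removed, while the Mean Value filter removes as many bins as fall below the cut-off.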
Choice of Scaling strategy
Scaling is an operation performed on the variables (columns) of the matrix. The scaling strategy depends on the one hand on the biological information we wish to extract, and on the other hand on the data analysis method chosen (in our case PCA). As a first step, so-called Centering is generally applied in every analysis. With Centering, all bin values fluctuate around zero instead of around the mean of each bin; Centering is therefore a method that adjusts for differences in the offset between high- and low-abundance compounds. Several scaling methods are available in the literature [3], and centering is generally applied in combination with them. Scaling strategies can be divided into two subclasses: methods that use a measure of data dispersion (such as the standard deviation) as the scaling factor, and methods that use a size measure (such as the mean). For the first group Mnova includes the Auto, Pareto and Vast scaling strategies. For the second group Range and Level scaling are available. Generally speaking, for PCA the first group is normally preferred. Figure 5 compares the score plots of a PCA that used Pareto scaling (panel A) with one that used Level scaling (panel B).
Figure 5 – Score plots obtained using the same bin width of 0.05 ppm and normalization by the sum. In panel A Pareto scaling was applied; in panel B Level scaling was applied.
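The scaling options differ only in what each centered column is divided by. The Python sketch below uses the definitions given by van den Berg et al. [3]; whether Mnova uses exactly these formulas is an assumption, as are the function names. The matrix is taken to have spectra as rows and bins as columns, and PCA scores are obtained here from a singular value decomposition of the scaled matrix.

```python
import numpy as np


def scale(matrix, method="pareto"):
    """Center each bin (column) and apply one of the scaling methods of ref. [3].

    Bins with zero standard deviation, range or mean should be filtered out
    first, otherwise the corresponding division yields inf or nan.
    """
    X = np.asarray(matrix, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    centered = X - mean
    if method == "center":
        return centered
    if method == "auto":      # divide by the standard deviation
        return centered / sd
    if method == "pareto":    # divide by the square root of the standard deviation
        return centered / np.sqrt(sd)
    if method == "vast":      # autoscaling weighted by mean / standard deviation
        return centered / sd * (mean / sd)
    if method == "range":     # divide by (max - min)
        return centered / (X.max(axis=0) - X.min(axis=0))
    if method == "level":     # divide by the mean
        return centered / mean
    raise ValueError(f"unknown method: {method!r}")


def pca_scores(scaled, n_components=2):
    """Scores of the first principal components, computed via SVD."""
    U, S, _ = np.linalg.svd(scaled, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


# Three spectra, two bins: a low-intensity bin and a tenfold more intense one.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
auto = scale(X, "auto")
pareto = scale(X, "pareto")
level = scale(X, "level")
scores = pca_scores(pareto)
```

In this toy matrix, Auto and Level scaling make the two bins identical, whereas Pareto scaling leaves the intense bin with a larger, though reduced, weight. This is why Pareto scaling is often described as a compromise between no scaling and autoscaling.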
Conclusions
We have focused on some very practical aspects of PCA. But it is always necessary to think about how good our experimental design was. Quoting Stanley Deming [4] in his 1986 overview of chemometrics: “Chemometrics is primarily concerned with the acquisition of data and the extraction of useful information from that data”, and again: “In a given situation, it is far better to err on the side of too many pieces of experimental data. If too few data are available, one might not be able to make any conclusion, and the whole set of experiments will have been wasted”.
Acknowledgments
We are grateful to Dr. Giovanna Musco and Dr. Jose Garcia-Manteiga for providing the dataset for testing purposes.
References
[1] Amber J Hackstadt, Filtering for increased power for microarray data analysis. BMC Bioinformatics 2009, 10:11
[2] Paul E. Anderson, Metabolomics, Volume 7, Issue 2, pp 179-190 (2010)
[3] Robert A van den Berg, Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 2006, 7:142
[4] Stanley N. Deming, Chemometrics: an Overview. Clin. Chem. 32/9, 1702-1706 (1986)