Thursday, 31 July 2014

PCA and NMR: Practical aspects

As of version 9.0, PCA of NMR data sets can be performed directly within the Mnova user interface, without resorting to third-party applications. The basic PCA functionality has already been covered in this blog (see Chemometrics under Mnova 9 – PCA); in this entry we discuss some practical aspects in more detail, particularly the different binning, filtering and scaling options.

What follows has been kindly written by Silvia Mari (project leader of the PCA module) and Isaac Iglesias, who programmed this module in Mnova.

Introduction

Generating a matrix from an array of NMR spectra is the core step in chemometric analysis. This procedure involves several options that the user must choose. In this entry we focus on the practical aspects of matrix preparation from NMR data. Broadly speaking, we can consider three main issues:
  1. Choice of binning method: Sum vs Peak
  2. Filtering or not filtering?
  3. Choice of Scaling strategy

Choice of binning method: Sum vs Peak


When dealing with high-resolution NMR spectra it is generally impractical to work with every data point of each spectrum, since spectra usually contain on the order of 32K points or more. The most common strategy to reduce the number of variables is to divide each spectrum into a defined number of regions, the so-called bins. Several binning strategies are available today, from regular binning, where bins have a fixed width, to more sophisticated strategies such as Gaussian or dynamic adaptive binning [2]. Even with these, in particularly crowded spectra small shifts of peaks close to bin boundaries can cause dramatic quantitative changes in adjacent bins.

Peak deconvolution can help to solve this problem. Generally speaking, a deconvolved peak is a mathematical entity characterized by a chemical shift (frequency), an intensity and a half-height line width. The integral of a peak can be derived automatically from its intensity and line width once a peak shape (e.g. Lorentzian) is assumed. For this reason, binning a spectrum of deconvolved peaks virtually eliminates the problem of bin boundaries, as illustrated in figure 1.



 Figure 1 – Binning real peaks versus binning deconvolved peaks
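The relation between the intensity, line width and integral of a deconvolved peak can be sketched in a few lines of Python. This is only an illustration under the assumption of a pure Lorentzian shape; the function names are hypothetical and are not part of Mnova. For a Lorentzian of height h and full width at half height w, the integral is (π/2)·h·w, which the snippet compares against a brute-force numerical integration.

```python
import numpy as np

def lorentzian(x, position, height, fwhm):
    """Lorentzian line with the given maximum height and full width at half height."""
    gamma = fwhm / 2.0  # half width at half height
    return height / (1.0 + ((x - position) / gamma) ** 2)

def lorentzian_area(height, fwhm):
    """Analytical integral of the Lorentzian above: pi/2 * height * fwhm."""
    return 0.5 * np.pi * height * fwhm

# Compare the analytical area with a simple numerical integration.
# The range is very wide because Lorentzian tails decay slowly.
x = np.linspace(-5000.0, 5000.0, 2_000_001)
dx = x[1] - x[0]
numeric = np.sum(lorentzian(x, 0.0, height=10.0, fwhm=1.0)) * dx
analytic = lorentzian_area(height=10.0, fwhm=1.0)
print(f"numeric = {numeric:.3f}, analytic = {analytic:.3f}")
```

Because the area follows directly from the fitted parameters, a peak list carries its integrals without any need to integrate the raw data points.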

When dealing with an array of NMR spectra, regular binning with b bins over a stack of s spectra generates a b × s matrix (see figure 2). A similar matrix cannot be generated directly from the deconvolved peaks (peak lists), since the number and position of peaks vary from spectrum to spectrum.



Figure 2 – Matrix generation from regular binning or peak list.

There are two main strategies to overcome this problem: (1) provide algorithms for peak alignment across the series of spectra, together with strategies for dealing with missing peaks, so as to end up with the same number of peaks at the same positions in all spectra; (2) perform binning over the peak table.
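The second strategy, binning over the peak table, can be sketched as follows. This is a minimal NumPy illustration, not the Mnova implementation: the function names, the ppm range and the peak lists are made up for the example. Here the matrix is stored with one row per spectrum and one column per bin, so that the bins are the variables.

```python
import numpy as np

def bin_sum(ppm, intensity, edges):
    """Regular (Sum) binning: add up the spectral points that fall in each bin."""
    binned, _ = np.histogram(ppm, bins=edges, weights=intensity)
    return binned

def bin_peaks(positions, areas, edges):
    """Peak binning: add up the areas of the deconvolved peaks in each bin."""
    binned, _ = np.histogram(positions, bins=edges, weights=areas)
    return binned

# Ten bins of 0.1 ppm over a hypothetical 0-1 ppm window.
edges = np.linspace(0.0, 1.0, 11)

# Two hypothetical peak lists (positions in ppm, areas) of different length.
peak_lists = [
    (np.array([0.12, 0.55]), np.array([2.0, 1.0])),
    (np.array([0.14, 0.31, 0.58]), np.array([1.5, 0.5, 1.0])),
]

# Every spectrum yields a row of the same length, whatever its number of peaks.
X = np.vstack([bin_peaks(pos, area, edges) for pos, area in peak_lists])
print(X.shape)
```

Each peak contributes its whole area to exactly one bin, so the rows have the same length even though the peak lists do not, and no alignment or missing-peak handling is needed.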

In the PCA module available in Mnova, we adopt the second solution. In the binning options, the user can choose between regular binning (Sum) and binning over deconvolved peaks (Peak). Figure 3 gives a qualitative example of the improved classification: it shows the score plots obtained with the Sum method (panel A) and with the Peak method (panel B).



Figure 3 – Score plots obtained using the same bin width of 0.03 ppm; in both cases data were normalized by the sum and Pareto scaled. In panel A bins were obtained directly by integration of the real spectra; in panel B bins were obtained by binning the corresponding peak list obtained after global spectral deconvolution.
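For readers who want to reproduce score plots of this kind outside Mnova, the sketch below shows normalization by the sum followed by a PCA computed via singular value decomposition. It is a generic textbook approach written with NumPy, not the code used in the PCA module, and the function names are hypothetical. Any variable scaling (such as Pareto) would be applied to the normalized matrix before calling `pca_scores`.

```python
import numpy as np

def normalize_by_sum(X):
    """Divide each spectrum (row) by the sum of its bins."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def pca_scores(X, n_components=2):
    """Return scores, loadings and explained-variance fractions of a PCA.

    X has one row per spectrum and one column per bin; columns are
    mean-centred before the decomposition.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S ** 2 / np.sum(S ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical binned matrix: 6 spectra, 8 bins.
rng = np.random.default_rng(0)
X = rng.random((6, 8)) + 0.1
scores, loadings, explained = pca_scores(normalize_by_sum(X))
print(scores.shape, loadings.shape)
```

Plotting the first column of `scores` against the second gives a score plot analogous to the panels above.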

Filtering or not filtering?

When the bin width is reduced to approach the spectral resolution, and the number of variables increases accordingly, it is generally necessary to introduce filtering strategies to remove those variables that do not show significant changes. There are established filtering strategies that are commonly applied to genomics data and that can also be used successfully for NMR-based data [1]. In the PCA module we have implemented five filtering options, namely:
  1. Standard Deviation
  2. Median Absolute Deviation
  3. Interquartile Range
  4. Mean Value
  5. Median Value 


In the first three cases a fixed fraction of the bins (10% by default) is discarded (e.g. if the matrix is composed of 100 bins, 10 bins are discarded), and the selection is based on the filter method chosen. In the case of Mean Value or Median Value, the user is asked to input a cut-off value for the mean or the median; only bins whose mean or median is lower than this value are discarded. The following figure illustrates the difference in clustering capability with and without filtering. Finally, it is worth noting that NMR data very often contain regions that should be discarded by including them in the so-called blind regions; these regions are not taken into account in the principal component calculation.




Figure 4 - Score plots obtained using the same bin width of 0.01 ppm; in both cases data were normalized by the sum and Pareto scaled. In panel A no filter was applied; in panel B a filtering strategy based on Mean Value was applied. A cut-off value of 100 was used.
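The five filtering options described above can be sketched as follows. This is an illustration only, with hypothetical function names, and it rests on one assumption that the text does not state explicitly: for the first three filters, the bins discarded are those with the lowest dispersion across spectra. The matrix has one row per spectrum and one column per bin.

```python
import numpy as np

def dispersion(X, method):
    """Per-bin dispersion across spectra: 'sd', 'mad' or 'iqr'."""
    if method == "sd":
        return X.std(axis=0, ddof=1)
    if method == "mad":
        med = np.median(X, axis=0)
        return np.median(np.abs(X - med), axis=0)
    if method == "iqr":
        q75, q25 = np.percentile(X, [75, 25], axis=0)
        return q75 - q25
    raise ValueError(f"unknown method: {method}")

def keep_by_dispersion(X, method="sd", fraction=0.10):
    """Boolean mask that drops the given fraction of bins.

    Assumption: the bins with the lowest dispersion are the ones dropped.
    """
    score = dispersion(X, method)
    n_drop = int(round(fraction * X.shape[1]))
    order = np.argsort(score, kind="stable")
    keep = np.ones(X.shape[1], dtype=bool)
    keep[order[:n_drop]] = False
    return keep

def keep_by_level(X, method="mean", cutoff=100.0):
    """Boolean mask that drops bins whose mean (or median) is below the cut-off."""
    level = X.mean(axis=0) if method == "mean" else np.median(X, axis=0)
    return level >= cutoff

# Hypothetical matrix: 5 spectra, 10 bins, dispersion increasing with bin index.
X = np.outer(np.arange(5.0), np.arange(1.0, 11.0))
filtered = X[:, keep_by_dispersion(X, "mad", fraction=0.2)]
print(filtered.shape)
```

Returning a mask rather than a reduced matrix makes it easy to keep track of which ppm regions survived the filter.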

Choice of Scaling strategy

Scaling is an operation performed on the variables (columns) of the matrix. The choice of scaling strategy depends on the one hand on the biological information we wish to extract, and on the other hand on the data analysis method chosen (in our case PCA). As a first step, so-called Centering is generally applied in every analysis. With Centering, all bin values fluctuate around zero instead of around the mean of each bin; Centering therefore adjusts for differences in the offset between high- and low-abundance compounds. Several scaling methods are available in the literature [3], and centering is generally applied in combination with them. Scaling strategies can be divided into two subclasses: methods that use a measure of data dispersion (such as the standard deviation) as the scaling factor, and methods that use a size measure (such as the mean). For the first group Mnova includes the Auto, Pareto and Vast scaling strategies; for the second group, Range and Level scaling are available. Generally speaking, the first group is normally preferred for PCA. Figure 5 compares the score plots of a PCA that used Pareto scaling (panel A) with those of a PCA that used Level scaling (panel B).

Figure 5 - Score plots obtained using the same bin width of 0.05 ppm and normalization by the sum. In panel A Pareto scaling was applied; in panel B Level scaling was applied.
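The scaling methods named above can be written compactly using the definitions given in reference [3]. The sketch below is an illustration with a hypothetical function name, not the Mnova code. Each bin (column) is centred on its mean and then divided by a factor that depends on the method: the standard deviation (Auto), its square root (Pareto), the standard deviation times the inverse coefficient of variation (Vast), the range (Range) or the mean (Level).

```python
import numpy as np

def scale(X, method="pareto"):
    """Centre each bin (column) and apply the chosen scaling.

    X has one row per spectrum and one column per bin. Constant or
    zero-mean bins lead to a division by zero and should be filtered
    out beforehand.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    centered = X - mean
    if method == "center":
        return centered
    if method == "auto":      # unit variance for every bin
        return centered / sd
    if method == "pareto":    # square root of the standard deviation
        return centered / np.sqrt(sd)
    if method == "vast":      # auto scaling weighted by mean / sd
        return centered / sd * (mean / sd)
    if method == "range":     # spread between largest and smallest value
        return centered / (X.max(axis=0) - X.min(axis=0))
    if method == "level":     # relative change with respect to the mean
        return centered / mean
    raise ValueError(f"unknown method: {method}")

# Hypothetical matrix: 3 spectra, 2 bins; the second bin is twice as intense.
X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
for name in ("auto", "pareto", "level"):
    print(name, scale(X, name)[:, 1])
```

In this toy example Auto and Level scaling make the two bins identical, whereas Pareto scaling keeps part of the intensity difference between them.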

Conclusions

We have focused on some very practical aspects of PCA. It is, however, always necessary to consider how good our experimental design was. Quoting Stanley Deming [4] in his 1986 overview of chemometrics: “Chemometrics is primarily concerned with the acquisition of data and the extraction of useful information from that data”, and again: “In a given situation, it is far better to err on the side of too many pieces of experimental data. If too few data are available, one might not be able to make any conclusion, and the whole set of experiments will have been wasted”.

Acknowledgments

We are grateful to Dr. Giovanna Musco and Dr. Jose Garcia-Manteiga for providing the dataset for testing purposes.


References

[1] Amber J Hackstadt, Filtering for increased power for microarray data analysis. BMC Bioinformatics 2009, 10:11

[2] Paul E. Anderson, Metabolomics, Volume 7, Issue 2, pp 179-190 (2010)

[3] Robert A van den Berg, Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 2006, 7:142

[4] Stanley N. Deming, Chemometrics: an Overview. Clin. Chem. 32/9, 1702-1706 (1986)

