## Minimum Description Length approach for unsupervised spectral unmixing of multiple interfering gas species

Optics Express, Vol. 19, Issue 15, pp. 13862-13872 (2011)

http://dx.doi.org/10.1364/OE.19.013862


### Abstract

We address an original statistical method for unsupervised identification and concentration estimation of spectrally interfering gas components of unknown nature and number. We show that such spectral unmixing can be efficiently achieved using information criteria derived from the Minimum Description Length (MDL) principle, outperforming standard information criteria such as AICc or BIC. In the context of spectroscopic applications, we also show that the most efficient MDL technique implemented shows good robustness to experimental artifacts.

© 2011 OSA

## 1. Introduction


## 2. Principle of unsupervised spectral unmixing algorithm

### 2.1. Posing of the problem

We consider a measurement vector $\mathbf{X}$ containing intensity measurements on $M$ spectral slits (or wavelengths), not necessarily adjacent. In the presence of absorbing gas species, these spectral measurements reveal specific absorption patterns depending on the nature and concentration of the chemicals encountered by the probe light beam. These spectral absorption patterns are superimposed on the spectral baseline of the active illumination source. The vector $\mathbf{X}$ of the measured intensities is linked to the $K$-dimensional vector of gas concentrations $\mathbf{c} = [c_1, \ldots, c_K]^T$ through the following equation:

$$\mathbf{X} = \left[\mathbf{a}_{0u} \odot \exp\left(-\mathbf{H}_u \mathbf{c}\right)\right] * g,$$

where $\odot$ and $\exp$ act element-wise and $g$ denotes the spectral slit (or laser linewidth) convolution function, which is assumed known in the following. In this equation, $\mathbf{a}_{0u}$ denotes the baseline spectrum, and the $M \times K$ matrix $\mathbf{H}_u = [\mathbf{h}_{u1}, \mathbf{h}_{u2}, \ldots, \mathbf{h}_{uK}]$ contains the unconvolved high-resolution absorption spectra of the $K$ gas species. For the sake of simplicity, we will only consider in this paper the case of small absorption optical depths (i.e., $\mathbf{H}_u \mathbf{c} \ll 1$). Moreover, we assume that the baseline $\mathbf{a}_0$ varies slowly with respect to both the absorption lines and the width of the convolution function. Under these conditions, the measured intensities can be written

$$\mathbf{X} = \mathbf{a}_0 \odot \exp\left(-\mathbf{H}\mathbf{c}\right),$$

where the matrix $\mathbf{H} = [(\mathbf{h}_{u1} * g), (\mathbf{h}_{u2} * g), \ldots, (\mathbf{h}_{uK} * g)]$ contains the convolved absorption spectra of the $K$ gas species, and where the convolved spectral baseline $\mathbf{a}_0$ is assumed known, either from instrumental calibration or from a precise radiometric model of the illumination source. More accurate models involving deconvolution procedures, as well as the influence of a possible resolution mismatch between the instrument and the model, are outside the scope of this paper but could deserve investigation in future work.

The noisy data measured on the $M$ spectral slits, obtained for instance with a dispersive spectrometer or an FTIR spectrometer, will be denoted $\tilde{\mathbf{X}}$ in the remainder of this paper. It is a common procedure to use the logarithm of the measured data so as to obtain a linear regression model of the following form:

$$\tilde{\mathbf{Y}} = \ln \tilde{\mathbf{X}} = \mathbf{b}_0 - \mathbf{H}\mathbf{c} + \mathbf{n},$$

with $\mathbf{b}_0 = \ln \mathbf{a}_0$, and where the $M$-dimensional zero-mean random vector $\mathbf{n}$ models the experimental noise. We assume that the noise contribution to the measured signal $\tilde{\mathbf{Y}}$ can be correctly accounted for with a Gaussian additive model. We also assume independence between the noise affecting two distinct spectral slits, i.e., $\langle n_i n_j \rangle = 0$ if $i \neq j$. For such a linear regression model, the usual estimator is

$$\hat{\mathbf{c}} = \left(\mathbf{H}^T \mathbf{H}\right)^{-1} \mathbf{H}^T \left(\mathbf{b}_0 - \tilde{\mathbf{Y}}\right),$$

which is usually referred to as the Minimum Mean Squared Error (MMSE) estimator since it minimizes the Residual Sum of Squares $RSS = (\tilde{\mathbf{Y}} - \hat{\mathbf{Y}})^T (\tilde{\mathbf{Y}} - \hat{\mathbf{Y}})$, with $\hat{\mathbf{Y}} = \mathbf{b}_0 - \mathbf{H}\hat{\mathbf{c}}$.
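As an illustration of the estimator above, the following minimal sketch computes the least-squares concentrations and the RSS for the log-domain model in which the fitted spectrum is the baseline minus the absorption term. The function name `estimate_concentrations` and the toy data are assumptions made for this example only; `numpy.linalg.lstsq` is used instead of forming the matrix inverse explicitly, which is a numerical convenience rather than something prescribed by the paper.

```python
import numpy as np

def estimate_concentrations(H, y_tilde, b0):
    """Least-squares estimate of c for the model y = b0 - H c + n (hypothetical helper)."""
    # Solve min_c ||(b0 - y) - H c||^2 without inverting H^T H explicitly.
    c_hat, *_ = np.linalg.lstsq(H, b0 - y_tilde, rcond=None)
    y_hat = b0 - H @ c_hat                      # fitted log-spectrum
    residual = y_tilde - y_hat
    rss = float(residual @ residual)            # residual sum of squares
    return c_hat, rss

# Toy, noise-free data (illustrative values, not taken from the paper).
rng = np.random.default_rng(0)
M, K = 50, 3
H = rng.random((M, K))
c_true = np.array([0.2, 0.05, 0.1])
b0 = np.zeros(M)
y_noiseless = b0 - H @ c_true

c_hat, rss = estimate_concentrations(H, y_noiseless, b0)
```

With noise-free data the estimate coincides with the concentrations used to generate the spectrum, which is a quick sanity check on the sign convention.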

### 2.2. Model selection

*Information criteria* [18–20] make it possible to introduce sparsity constraints in the regression model, by selecting the solution (i.e., the regressor matrix $\mathbf{H}$) which minimizes $-\ell_{\tilde{\mathbf{Y}}}(\tilde{\mathbf{Y}}|\mathbf{H}) + \mathcal{C}$, with a different penalization term $\mathcal{C}$ depending on the information criterion considered. It can be noted however that, since the loglikelihood is proportional to the logarithm of the RSS, up to an additive constant independent of the selected regression model [16], this amounts to minimizing $\frac{M}{2} \ln RSS + \mathcal{C}$.

The Akaike Information Criterion (AIC) [18] penalizes the number $K$ of regressors included in the model. In the case of samples of limited size, this penalization term can be refined; the resulting criterion is usually referred to as AICc, and its penalization term, given in [16], will be denoted $\mathcal{C}^{(a)}$ in the following. Another standard choice is the Bayesian Information Criterion (BIC) [19]. The MDL principle, for its part, selects the model that minimizes the description length (expressed for instance in *nats*; 1 *nat* = 1/ln 2 ≈ 1.44 bits) needed both to describe the data given the model and to encode the model itself [12, 16].
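To make the structure of such penalized criteria concrete, here is a small sketch that evaluates the sum of the data term, (M/2) ln RSS, and a penalty. Only the textbook AIC penalty (K) and BIC penalty ((K/2) ln M) are implemented, written in the additive form used above; the AICc and MDL penalties of the paper are not reproduced here, and the function name `penalized_criterion` is a hypothetical choice for this example.

```python
import math

def penalized_criterion(rss, M, K, kind="bic"):
    """Return (M/2) ln RSS + C for a model with K regressors (illustrative helper)."""
    if kind == "aic":
        penalty = K                       # textbook AIC penalty
    elif kind == "bic":
        penalty = 0.5 * K * math.log(M)   # textbook BIC penalty
    else:
        raise ValueError(f"unknown criterion: {kind}")
    return 0.5 * M * math.log(rss) + penalty

# Compare two hypothetical candidate models fitted on M = 400 samples.
small_model = penalized_criterion(rss=1.00, M=400, K=3, kind="bic")
large_model = penalized_criterion(rss=0.98, M=400, K=4, kind="bic")
```

The model with the lower value is preferred; adding a regressor is accepted only if the drop in the data term outweighs the extra penalty.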

**Mixture MDL and *g*-prior (gMDL):** Within the framework of *mixture* MDL [21], a prior distribution is assigned to the parameter vector $\theta$. With a specific choice of the prior distribution (Zellner's *g*-prior), one obtains the so-called *gMDL*, for which the criterion to minimize has a closed-form expression [16] involving

$$F = \frac{(M - K)\left[(\tilde{\mathbf{Y}} - \mathbf{b}_0)^T (\tilde{\mathbf{Y}} - \mathbf{b}_0) - RSS\right]}{K \, RSS},$$

which is the standard $F$-ratio for testing the null model containing the spectral baseline only. The penalization term $\mathcal{C}^{(g)}$ in Eq. (7) is given in [16].
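The F-ratio entering the gMDL criterion can be computed directly from quantities already available after the least-squares fit. The sketch below follows the definition given in the text, with the sum of squares of the data about the baseline in the numerator; the function name `f_ratio` and the numerical values are illustrative assumptions.

```python
import numpy as np

def f_ratio(y_tilde, b0, rss, K):
    """F-ratio against the baseline-only (null) model, as defined in the text."""
    M = y_tilde.size
    d = y_tilde - b0
    total = float(d @ d)          # sum of squares of the data about the baseline
    return (M - K) * (total - rss) / (K * rss)

# Toy values: M = 4 samples, flat zero baseline, one regressor, RSS = 3.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
F_demo = f_ratio(y_demo, np.zeros(4), rss=3.0, K=1)
```

A large value indicates that the selected regressors explain much more of the signal than the baseline alone.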

**Normalized Maximized Likelihood (nMDL):** Lastly, we shall be interested in the recently proposed Normalized Maximized Likelihood form of the MDL [22, 16]. This criterion involves two hyperparameters $a$ and $b$ to be estimated. According to Rissanen's indications [12], the estimator of the hyperparameter $a$ is given by the RSS obtained with the most exhaustive model (i.e., $K_{max}$ regressors included), while the estimator of the hyperparameter $b$ is the RSS obtained with the least exhaustive model (i.e., baseline only).

### 2.3. Stepwise algorithm for unsupervised spectral unmixing

An exhaustive comparison of all candidate models quickly becomes impractical when the number $K_{max}$ of species in the database is large. Instead of resorting to extensive operations research techniques such as *branch & bound*, we implement a stepwise search algorithm for the sake of computational speed. We use a forward stepwise algorithm initialized with the baseline-only model. At each step of the algorithm, the regressor (i.e., the gas species) that most decreases the criterion is included in the model, until any further increase in model complexity leads to an increase of the criterion. An example of iterative model selection is illustrated in Fig. 2. At each step of the algorithm implementing the nMDL criterion, the corresponding reconstructed signal is plotted as a red dashed curve, and can be compared with the signal obtained with the true model (black curve).
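The forward stepwise search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the selection criterion is a BIC-type stand-in, (M/2) ln RSS + (K/2) ln M, rather than the nMDL or gMDL expressions used in the paper, and all function names and the toy data are hypothetical.

```python
import numpy as np

def rss_for(H, y_tilde, b0, idx):
    """RSS of the least-squares fit using only the columns of H listed in idx."""
    if not idx:
        r = y_tilde - b0                          # baseline-only model
    else:
        Hs = H[:, idx]
        c, *_ = np.linalg.lstsq(Hs, b0 - y_tilde, rcond=None)
        r = y_tilde - (b0 - Hs @ c)
    return float(r @ r)

def bic_like(rss, M, K):
    """Stand-in criterion (assumption): (M/2) ln RSS + (K/2) ln M."""
    return 0.5 * M * np.log(rss) + 0.5 * K * np.log(M)

def forward_stepwise(H, y_tilde, b0, criterion=bic_like):
    """Add, one at a time, the species that most lowers the criterion."""
    M, K_max = H.shape
    selected = []
    best = criterion(rss_for(H, y_tilde, b0, selected), M, 0)
    while len(selected) < K_max:
        candidates = [k for k in range(K_max) if k not in selected]
        scores = [criterion(rss_for(H, y_tilde, b0, selected + [k]),
                            M, len(selected) + 1) for k in candidates]
        j = int(np.argmin(scores))
        if scores[j] >= best:                     # no candidate improves: stop
            break
        selected.append(candidates[j])
        best = scores[j]
    return sorted(selected)

# Toy database of 6 random "spectra"; only species 1 and 4 are present.
rng = np.random.default_rng(1)
M, K_max = 200, 6
H = rng.random((M, K_max))
c_true = np.zeros(K_max)
c_true[[1, 4]] = [1.0, 0.5]
b0 = np.zeros(M)
y_tilde = b0 - H @ c_true + rng.normal(0.0, 0.01, size=M)

selected = forward_stepwise(H, y_tilde, b0)
print(selected)
```

Each step costs one least-squares fit per remaining candidate, so the total number of fits grows at most quadratically with the database size, instead of exponentially as in an exhaustive search.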

## 3. Implementation and comparison of MDL-based information criteria

### 3.1. Simulated absorption spectroscopy experiment

The simulated experiment involves $M = 400$ adjacent spectral slits, spanning the range from 3.2 to 3.6 μm, with a simulated instrumental spectral resolution of 2.3 nm (Gaussian slit function). The physical situation considered in this experiment consisted of a spectrally uniform illumination propagated through a gas mixture with 4 components: O₃ (6000 ppm.m), NO₂ (500 ppm.m), CH₄ (70 ppm.m) and H₂CO (30 ppm.m), where the numerical values in parentheses correspond to their respective path-length integrated concentrations.

The spectral database used for the unmixing contains $K_{max} = 16$ gas species, including the 4 gases of the “true” model and 12 spectrally interfering species (such as H₂O, N₂O, NH₃, HCl, etc.) with significant absorption strength within the spectral range considered. The strong spectral overlap of the database species can be checked in Fig. 3, where the absorption spectra of 8 gas species (among the 16 in the spectral database) are plotted. In this figure, the spectra are convolved with a Gaussian kernel to match the spectral resolution of the instrument considered in the simulated experimental data.

$M$ statistically independent realizations of Gaussian random noise with variance $\sigma^2$ were added to the absorption spectra generated over the $M$ spectral slits. Varying the noise variance allowed us to simulate experiments with different values of the Signal to Noise Ratio (SNR), usually defined in the context of additive Gaussian noise as the ratio of the flat baseline value to the noise standard deviation $\sigma$. However, this quantity is poorly suited to assessing the difficulty of the estimation problem considered, since it only depends on the active illumination power, and does not depend on the absorption strength of the gas mixture to be detected. We thus introduce another figure of merit, denoted S-SNR for *spectral* SNR, and defined as:

$$\text{S-SNR} = \frac{\sqrt{\frac{1}{M} (\mathbf{H}\mathbf{c})^T (\mathbf{H}\mathbf{c})}}{\sigma}.$$

In this expression, the numerator can be interpreted as the root mean square of the absorption signal $\mathbf{b}_0 - \mathbf{Y} = \mathbf{H}\mathbf{c}$ from which the nature and concentration of the gas components have to be estimated. An increase of the gas mixture concentrations accentuates the spectral absorption patterns in the measured spectrum, thus leading to an easier identification/estimation. In that case, it can be seen from the above definition that the S-SNR value is correspondingly increased. An example of simulated noisy data is given in Fig. 2(a) for an S-SNR of 6.3 dB.
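The simulation step and the spectral SNR can be sketched in a few lines. The example below assumes that the S-SNR is the root mean square of the absorption signal divided by the noise standard deviation, consistent with the description of the numerator above, and that the Gaussian noise is added in the log domain; the value is returned as a plain ratio because the decibel convention is not stated in this section. Function names and data are hypothetical.

```python
import numpy as np

def spectral_snr(H, c, sigma):
    """RMS of the absorption signal H c divided by the noise standard deviation."""
    signal = H @ c
    return float(np.sqrt(np.mean(signal ** 2)) / sigma)

def simulate_log_spectrum(H, c, b0, sigma, rng):
    """One noisy log-spectrum y = b0 - H c + n with i.i.d. Gaussian noise (assumed model)."""
    return b0 - H @ c + rng.normal(0.0, sigma, size=H.shape[0])

# Toy case: a single flat "spectrum" of unit absorption, concentration 2, sigma 0.5.
H_demo = np.ones((4, 1))
c_demo = np.array([2.0])
b0_demo = np.zeros(4)
ssnr_demo = spectral_snr(H_demo, c_demo, sigma=0.5)
y_clean = simulate_log_spectrum(H_demo, c_demo, b0_demo, 0.0, np.random.default_rng(0))
```

Doubling every concentration doubles the S-SNR at fixed noise level, which matches the remark that stronger absorption makes identification easier.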

### 3.2. Simulation results

The results were obtained from $R = 5 \times 10^{3}$ realizations of the selection/estimation task on statistically independent simulated data. Two situations were considered, according to whether or not the light has undergone absorption by the gas mixture.

… $K = 5$), it is however interesting to note that nMDL appears very efficient at avoiding overfitting, with only 1% of overestimated selections and 0.4% of selections with only $K = 3$ components. This property has already been addressed in Ref. [12] and remains valid in the less favorable situations of low SNRs where nMDL is outperformed by BIC: when the S-SNR is 2 dB, nMDL selects the correct model in only 53.5% of cases, but more than 99% of the remaining selections have an underestimated size ($K = 3$), and the “missing” gas component is always H₂CO. In the context of absorption spectroscopy, this behavior is of interest since it decreases the probability of erroneously detecting an extra gas component, and thus strengthens the confidence in the components selected with nMDL.

A possible strategy would thus be to first use the *gMDL* criterion to test the null hypothesis. If this hypothesis is rejected, the algorithm could switch to the nMDL criterion, which showed the best performance for model selection. In the next subsection, we analyze how a positivity-constrained implementation of the stepwise algorithm influences the previous results.

### 3.3. Influence of a positivity constraint

### 3.4. Influence of outliers

… $N = 20$ independent measurements. Among these $N = 20$ measurements, we included a varying proportion of outliers, corresponding to the simulated noisy absorption spectrum of a single interferent gas species (HCl [80 ppm.m]), whose absorption spectrum is plotted as the green curve in Fig. 3.

## 4. Conclusion


… *lasso* approaches [25, 26].


**OCIS Codes**

(070.4790) Fourier optics and signal processing : Spectrum analysis

(280.1120) Remote sensing and sensors : Air pollution monitoring

(300.0300) Spectroscopy : Spectroscopy

(010.1030) Atmospheric and oceanic optics : Absorption

(010.0280) Atmospheric and oceanic optics : Remote sensing and sensors

**ToC Category:**

Spectroscopy

**History**

Original Manuscript: March 31, 2011

Revised Manuscript: May 6, 2011

Manuscript Accepted: May 9, 2011

Published: July 6, 2011

**Virtual Issues**

Vol. 6, Iss. 8 *Virtual Journal for Biomedical Optics*

**Citation**

Julien Fade, Sidonie Lefebvre, and Nicolas Cézard, "Minimum description length approach for unsupervised spectral unmixing of multiple interfering gas species," Opt. Express **19**, 13862-13872 (2011)

http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-19-15-13862


### References

- P. Weibring, C. Abrahamsson, M. Sjöholm, J. N. Smith, H. Edner, and S. Svanberg, “Multi-component chemical analysis of gas mixtures using a continuously tuneable lidar system,” Appl. Phys. B 79, 525–530 (2004). [CrossRef]
- J. R. Quagliano, P. O. Stoutland, R. R. Petrin, R. K. Sander, R. J. Romero, M. C. Whitehead, C. R. Quick, J. J. Tiee, and L. J. Jolin, “Quantitative chemical identification of four gases in remote infrared (9–11 μm) differential absorption lidar experiments,” Appl. Opt. 36, 1915–1927 (1997). [CrossRef] [PubMed]
- G. Wysocki, R. Lewicki, R. Curl, F. Tittel, L. Diehl, F. Capasso, M. Troccoli, G. Hofler, D. Bour, S. Corzine, R. Maulini, M. Giovannini, and J. Faist, “Widely tunable mode-hop free external cavity quantum cascade lasers for high resolution spectroscopy and chemical sensing,” Appl. Phys. B 92, 305–311 (2008). [CrossRef]
- U. Platt and J. Stutz, Differential Optical Absorption Spectroscopy (Springer, 2008).
- R. A. Hashmonay, R. M. Varma, M. Modrak, R. H. Kagann, and P. D. Sullivan, “Simultaneous measurement of vaporous and aerosolized threats by active open path FTIR,” Unclassified Technical Report ADA449529, Arcadis Geraghty and Miller Research, Triangle Park, NC (2004).
- J. Kasparian, M. Rodriguez, G. Méjean, J. Yu, E. Salmon, H. Wille, R. Bourayou, S. Frey, Y. André, A. Mysyrowicz, R. Sauerbrey, J. Wolf, and L. Wöste, “White-light filaments for atmospheric analysis,” Science 301, 61–64 (2003). [CrossRef] [PubMed]
- D. M. Brown, K. Shi, Z. Liu, and C. R. Philbrick, “Long-path supercontinuum absorption spectroscopy for measurement of atmospheric constituents,” Opt. Express 16, 8457–8471 (2008). [CrossRef] [PubMed]
- P. S. Edwards, A. M. Wyant, D. M. Brown, Z. Liu, and C. R. Philbrick, “Supercontinuum laser sensing of atmospheric constituents,” Proc. SPIE 7323, 73230S (2009). [CrossRef]
- E. R. Warren, “Optimum detection of multiple vapor materials with frequency-agile lidar,” Appl. Opt. 35, 4180–4193 (1996). [CrossRef] [PubMed]
- S. Yin and W. Wang, “Novel algorithm for simultaneously detecting multiple vapor materials with multiple-wavelength differential absorption lidar,” Chin. Opt. Lett. 4, 360–363 (2006).
- J. Fade and N. Cézard, “Supercontinuum lidar absorption spectroscopy for gas detection and concentration estimation,” in Proceedings of the 25th International Laser and Remote-sensing Conference (2010), pp. 798–801.
- J. Rissanen, Information and Complexity in Statistical Modeling (Springer, 2007).
- R. A. Stine, “Model selection using information theory and the MDL principle,” Sociolog. Methods Res. 33, 230–260 (2004). [CrossRef]
- C. D. Giurcaneanu, “Stochastic complexity for the detection of periodically expressed genes,” in Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (2007), pp. 1–4. [CrossRef]
- H. Chen, T. Kirubarajan, Y. Bar-Shalom, and K. R. Pattipati, “MDL approach for multiple low-observable track initiation,” Proc. SPIE 4728, 477–488 (2002). [CrossRef]
- M. Hansen and B. Yu, “Model selection and the principle of minimum description length,” J. Am. Stat. Assoc. 96, 746–774 (2001). [CrossRef]
- C. L. Mallows, “Some comments on $C_p$,” Technometrics 15, 661–675 (1973). [CrossRef]
- H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Autom. Control 19, 716–723 (1974). [CrossRef]
- G. Schwarz, “Estimating the dimension of a model,” Ann. Stat. 6, 461–464 (1978). [CrossRef]
- D. P. Foster and E. I. George, “The risk inflation criterion for multiple regression,” Ann. Stat. 22, 1947–1975 (1994). [CrossRef]
- J. Rissanen, Stochastic Complexity in Statistical Inquiry, Series in Computer Science (World Scientific, 1989), Vol. 15.
- J. Rissanen, “Fisher information and stochastic complexity,” IEEE Trans. Inf. Theory 42, 48–54 (1996). [CrossRef]
- M. Duhant, W. Renard, G. Canat, F. Smektala, J. Troles, P. Bourdon, and C. Planchat, “Improving mid-infrared supercontinuum generation efficiency by pumping a fluoride fiber directly into the anomalous regime at 1995 nm,” in CLEO/Europe and EQEC 2011 Conference Digest, (2011), p. CD9_1. (to be published)
- A. Berrou, M. Raybaut, A. Godard, and M. Lefebvre, “High-resolution photoacoustic and direct absorption spectroscopy of main greenhouse gases by use of a pulsed entangled cavity doubly resonant OPO,” Appl. Phys. B 98, 217–230 (2010). [CrossRef]
- R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
- E. J. Candès and Y. Plan, “Near-ideal model selection by ℓ1 minimization,” Ann. Stat. 37, 2145–2177 (2009). [CrossRef]
