## Next-generation acceleration and code optimization for light transport in turbid media using GPUs

Biomedical Optics Express, Vol. 1, Issue 2, pp. 658-675 (2010)

http://dx.doi.org/10.1364/BOE.1.000658


### Abstract

A highly optimized Monte Carlo (MC) code package for simulating light
transport is developed on the latest graphics processing unit (GPU) built
for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical
optics, the MC method is the gold standard approach for simulating light
transport in biological tissue, both due to its accuracy and its flexibility
in modelling realistic, heterogeneous tissue geometry in 3-D. However, the
widespread use of MC simulations in inverse problems, such as treatment
planning for photodynamic therapy (PDT), is limited by their long computation time. Despite its
parallel nature, optimizing MC code on the GPU has been shown to be a
challenge, particularly when the sharing of simulation result matrices among
many parallel threads demands the frequent use of atomic instructions to
access the slow GPU global memory. This paper proposes an optimization
scheme that utilizes the fast shared memory to resolve the performance
bottleneck caused by atomic access, and discusses numerous other
optimization techniques needed to harness the full potential of the GPU.
Using these techniques, a widely accepted MC code package in biophotonics,
called MCML, was successfully accelerated on a Fermi GPU by approximately
600x compared to a state-of-the-art Intel Core i7 CPU. A skin model
consisting of 7 layers was used as the standard simulation geometry. To
demonstrate the possibility of GPU cluster computing, the same GPU code was
executed on four GPUs, showing a linear improvement in performance with an
increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML,
is compatible with a wide range of graphics cards and is released as an
open-source software in two versions: an optimized version tuned for high
performance and a simplified version for beginners.

© 2010 Optical Society of America

## 1. Introduction

The Monte Carlo code for multi-layered media developed by Wang *et al.* (MCML) [2] has been accepted as a gold standard tool for simulating light propagation in multilayered turbid media. The MCML code package has since been used to validate numerous photon migration modelling schemes and as a starting point for custom-developed MC solutions. Further, Boas *et al.* extended its capabilities by releasing the tMCimg code for time-resolved and steady-state MC-based photon migration in arbitrary, 3-D voxelized media [3].

## 2. Background

### 2.1. MCML

The main simulation output of MCML is the absorption array *A*[*r*][*z*], which stores the photon absorption probability density [cm^{-3}] as a function of radius *r* and depth *z* for the pencil beam (or impulse response). The absorption probability density can be converted into more common quantities, such as the photon fluence (measured in cm^{-2} for the impulse response), by dividing *A*[*r*][*z*] by the local absorption coefficient. Further details of the MCML algorithm are given by Wang *et al.* [2].

### 2.2. Programming Graphics Processing Units

#### 2.2.1. CUDA

A CUDA program is divided into host code and device code: the former is executed as a single *thread* on the CPU, while the latter is executed in parallel by many threads on the GPU. Here, a thread is an independent execution context that can, for example, simulate an individual photon packet in the MCML algorithm. The device code is expressed in the form of a *kernel* function. It is similar to a regular C function, except that it specifies the work of *each* GPU thread, parameterized by a thread index variable. Correspondingly, a kernel invocation is similar to a regular function call, except that it must specify the geometry of a *grid* of threads that executes the kernel, also referred to as the *kernel configuration*. A grid is composed of a number of *thread blocks*, each of which contains a number of threads.

#### 2.2.2. NVIDIA GPU Architecture

The GTX 280 GPU contains 30 *streaming multiprocessors* (SMs), each with 8 *scalar processors* (SPs). Note that the 240 SPs (total) are not 240 independent processors; instead, they are 30 independent processors that can each perform 8 similar computations at a time. From the programmer's perspective, each thread block is assigned to an SM, and each thread within it is further scheduled to execute on one of the SPs. Compared to the GTX 280, the Fermi-based GTX 480 has half the number of SMs, but each SM contains four times the number of SPs, resulting in a total of 480 SPs.

#### 2.2.3. Atomic Instructions

## 3. GPU-accelerated MCML Code (GPU-MCML)

### 3.1. Features

- **Compatibility with a wide range of NVIDIA graphics cards (Fermi and pre-Fermi)**: The GPU-MCML program has been tested on the latest NVIDIA Fermi graphics cards, such as the GTX 480 (Compute Capability 2.0), and is backward compatible with pre-Fermi generations such as the GTX 280 and 8800 GT (Compute Capability 1.1–1.3). This compatibility is achieved by automatically enabling different sets of features based on the Compute Capability of the detected GPU.
- **Multi-GPU execution mode**: The program can be executed on multi-GPU systems by specifying the number of GPUs at runtime, opening up the possibility of GPU cluster computing.
- **Optimized version and simplified version**: Two versions of the GPU-MCML program are released. The optimized version automatically detects the GPU generation and chooses the proper set of architecture-specific optimizations based on the features supported. The simplified version aims for code readability and reusability by removing most optimizations; it is designed to look very similar to the original CPU MCML program to ease the learning curve for novice CUDA programmers.

### 3.2. Implementation Overview

### 3.3. Key Performance Bottleneck

### 3.4. Solution to Performance Bottleneck

To reduce the number of atomic accesses to the *A*[*r*][*z*] array, two memory optimizations were applied:

- **Storing the high-fluence region in shared memory**: The first optimization, illustrated in Fig. 3, is based on the high access rate of the *A*[*r*][*z*] elements near the photon source (or at the origin in the MCML model), which causes significant contention when atomic instructions are used. Therefore, this region of the *A*[*r*][*z*] array is cached in shared memory. Namely, if a photon packet is absorbed inside the high-fluence region, its weight is accumulated atomically into the shared memory copy; otherwise, its weight is added atomically directly into global memory. At the end of the simulation, the values temporarily stored in the shared memory copy are written (flushed) to the master copy in global memory.
- **Caching photon absorption history in registers**: To further reduce the number of atomic accesses (even to shared memory), the recent write history, representing previous absorption events, can be stored privately in registers by each thread. It was observed that consecutive absorption events can happen at nearby, or sometimes the same, locations in the *A*[*r*][*z*] array, depending on the absorption grid geometry and the optical properties of the layers. To save on register usage, the current GPU-MCML code only stores the most recent write history using 2 registers: one for the last memory location and one for the accumulated weight. For each thread, consecutive writes to the same location of the *A*[*r*][*z*] array are accumulated into these registers until a different memory location is computed. Once a different location is detected, the accumulated weight is flushed to shared (or global) memory using an atomicAdd instruction, and the whole process is repeated.

As a further optimization, photon packets absorbed outside the coverage of the detection grid (specified by *dr*, *dz*, *nr*, and *nz*) no longer accumulate their weights at the perimeter of the grid, unlike in the original MCML code. Note that these boundary elements were known to give invalid values in the original MCML code [2]. This optimization does not change the correctness of the simulation, yet it ensures that performance is not degraded if the size of the detection grid is decreased, which would otherwise force photon packets to be absorbed at boundary elements (significantly increasing contention and access latency for these elements in the *A*[*r*][*z*] array).

### 3.5. Alternative Solution: Reflectance/transmittance-only Mode

GPU-MCML also provides a reflectance/transmittance-only mode that skips the recording of absorbed weight in the *A*[*r*][*z*] array. As the major performance bottleneck is caused by the atomic update of the *A*[*r*][*z*] array at almost every step of each photon packet, ignoring the array update (while still reducing the photon packet weight after each scattering event) yields a major performance boost while retaining valid reflectance and transmittance outputs. The maximum number of atomic memory operations is reduced to the total number of photon packets launched (since reflectance/transmittance is recorded atomically only upon exiting the tissue, which happens once per photon packet), and these operations are spread out over the entire simulation. Also, the small number of memory operations compared to arithmetic operations allows the GPU to "hide" the cost of memory operations. Note that this way of recording simulation output is similar to that presented in [14]. The option to ignore the *A*[*r*][*z*] array is enabled by passing a -A flag to GPU-MCML at runtime.

### 3.6. Parallel pseudo-random number generation

### 3.7. Scaling to Multiple GPUs

To utilize multiple GPUs, the simulation is essentially replicated *N* times, where *N* is the number of GPUs, except that each GPU initializes a different set of seeds and multipliers for the random number generator and declares a separate absorption array. This allows the independent simulation of photon packets on multiple GPUs, opening up the possibility of GPU-based cluster computing.

## 4. Performance

### 4.1. Test Platform and Test Cases

### 4.2. Effect of GPU Architecture

### 4.3. Effect of Shared Memory Size

### 4.4. Effect of Grid Geometry

### 4.5. Scaling to Multiple GPUs

### 4.6. Performance of Simplified Version

### 4.7. Performance in Reflectance/transmittance-only Mode

In reflectance/transmittance-only mode, the time to simulate 10^{8} photon packets in the skin model was reduced from 14418 s to 12824 s.

## 5. Validation

The skin model (simulated with 10^{8} photon packets) from Table 1 was used as the test case geometry.

### 5.1. Fluence Distribution

The relative error *E*[*i*_{r}][*i*_{z}] is computed for each voxel using Eq. (1), where *A*_{cpu} is the gold-standard absorption array produced by the CPU-MCML software and *A*_{gpu} contains the corresponding elements produced by the GPU-MCML program.

## 6. Discussion and Conclusions

For instance, Ren *et al.* only achieved a 10x speedup compared to a 2.4-GHz Intel Xeon CPU, despite using a modern GPU (NVIDIA GTX 260 with 192 scalar processors); their simulation involved modelling heterogeneous tissue using triangular meshes [15]. Although an earlier 3-D, voxel-based implementation by Fang and Boas [16] reported a higher speedup (300x on an NVIDIA 8800 GT graphics card with 112 scalar processors compared to a 1.86-GHz Intel Xeon CPU), this was achieved without using atomic instructions, possibly compromising data integrity. The speedup dropped to 75x, a factor of 4 slower, when atomic instructions were used. Note that atomic instructions are required to avoid race conditions [29], which can compromise the stability and accuracy of the simulation results, both of which are crucial to the use of the MC method as a gold standard. Finally, Badal and Badano obtained a 27x speedup when using a modern GTX 295 GPU (utilizing 240 out of 480 scalar processors) to accelerate the simulation of high-energy photons in voxelized media for X-ray imaging [17]. Based on these previous attempts, Shen and Wang developed a tetrahedron-based, inhomogeneous MC simulator utilizing multiple cores in modern CPUs, arguing that the speedup obtained using mid-range GPUs was insufficient to motivate a complicated, GPU-based implementation [24].

Alerstam *et al.* introduced GPU-accelerated MC to the field by demonstrating a massive speedup for simulating time-resolved photon reflectance in a homogeneous semi-infinite medium [14], which requires atomic access to a 1-D histogram array. Lo *et al.* presented both an FPGA-based [11] and a GPU-based solution [18] to simulate steady-state photon migration in multilayered media. Notably, this preliminary GPU implementation demonstrated the importance of optimizing atomic access to the 2-D absorption array. Simultaneously and independently, Alerstam *et al.* presented a more versatile, albeit less optimized, code solving the same problem using GPU hardware [19].

The performance of the reflectance/transmittance-only mode was compared with the codes of Alerstam *et al.* [14] and Carbone *et al.* [30]. Considering the differences in the hardware used, the performance of the codes agrees well. This is an important result, as almost all current methods of evaluating optical properties, including optical tomography methods, are based on either monitoring spatial/temporal distributions of light exiting the boundary of a scattering medium or monitoring one or a few discrete points inside the medium. Both of these cases are covered by this alternative solution. In addition, this result illustrates the strengths and weaknesses of GPU-based calculations: GPUs are good at handling organized (coalesced) memory operations (which are hard to achieve in MC applications), but perform poorly with random memory operations and atomic memory operations. However, when the ratio of arithmetic operations to memory operations is high, the GPU may "hide" the memory latency by using the waiting time to perform calculations for other threads. Consequently, when the internal absorption array is ignored, the simplified/naive GPU-MCML version performs almost as well as the highly optimized code while being much simpler.

## Acknowledgments

## References and links

1. B. Wilson and G. Adam, "A Monte Carlo model for the absorption and flux distributions of light in tissue," Med. Phys. 10, 824 (1983).
2. L. Wang, S. Jacques, and L. Zheng, "MCML - Monte Carlo modeling of light transport in multi-layered tissues," Comput. Meth. Prog. Biol. 47, 131–146 (1995).
3. D. Boas, J. Culver, J. Stott, and A. Dunn, "Three dimensional Monte Carlo code for photon migration through complex heterogeneous media including the adult human head," Opt. Express 10, 159–170 (2002).
4. C. R. Simpson, M. Kohl, M. Essenpreis, and M. Cope, "Near-infrared optical properties of ex vivo human skin and subcutaneous tissues measured using the Monte Carlo inversion technique," Phys. Med. Biol. 43, 2465–2478 (1998).
5. F. Bevilacqua, D. Piguet, P. Marquet, J. D. Gross, B. J. Tromberg, and C. Depeursinge, "In vivo local determination of tissue optical properties: applications to human brain," Appl. Opt. 38, 4939–4950 (1999).
6. G. Palmer and N. Ramanujam, "Monte Carlo-based inverse model for calculating tissue optical properties. Part I: Theory and validation on synthetic phantoms," Appl. Opt. 45, 1062–1071 (2006).
7. C. K. Hayakawa, J. Spanier, F. Bevilacqua, A. K. Dunn, J. S. You, B. J. Tromberg, and V. Venugopalan, "Perturbation Monte Carlo methods to solve inverse photon migration problems in heterogeneous tissues," Opt. Lett. 26, 1335–1337 (2001).
8. E. Alerstam, S. Andersson-Engels, and T. Svensson, "White Monte Carlo for time-resolved photon migration," J. Biomed. Opt. 13, 041304 (2008).
9. A. Custo, D. Boas, D. Tsuzuki, I. Dan, R. Mesquita, B. Fischl, W. Grimson, and W. Wells III, "Anatomical atlas-guided diffuse optical tomography of brain activation," NeuroImage 49 (2010).
10. D. Boas and A. Dale, "Simulation study of magnetic resonance imaging-guided cortically constrained diffuse optical tomography of human brain function," Appl. Opt. 44, 1957–1968 (2005).
11. W. C. Y. Lo, K. Redmond, J. Luu, P. Chow, J. Rose, and L. Lilge, "Hardware acceleration of a Monte Carlo simulation for photodynamic therapy treatment planning," J. Biomed. Opt. 14, 014019 (2009).
12. A. Johansson, J. Axelsson, S. Andersson-Engels, and J. Swartling, "Realtime light dosimetry software tools for interstitial photodynamic therapy of the human prostate," Med. Phys. 34, 4309 (2007).
13. S. Davidson, R. Weersink, M. Haider, M. Gertner, A. Bogaards, D. Giewercer, A. Scherz, M. Sherar, M. Elhilali, and J. Chin, "Treatment planning and dose analysis for interstitial photodynamic therapy of prostate cancer," Phys. Med. Biol. 54, 2293–2313 (2009).
14. E. Alerstam, T. Svensson, and S. Andersson-Engels, "Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration," J. Biomed. Opt. 13, 060504 (2008).
15. N. Ren, J. Liang, X. Qu, J. Li, B. Lu, and J. Tian, "GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues," Opt. Express 18, 6811–6823 (2010).
16. Q. Fang and D. A. Boas, "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units," Opt. Express 17, 20178–20190 (2009).
17. A. Badal and A. Badano, "Monte Carlo simulations in a graphics processing unit," Med. Phys. 36, 4878–4880 (2009).
18. W. C. Y. Lo, T. D. Han, J. Rose, and L. Lilge, "GPU-accelerated Monte Carlo simulation for photodynamic therapy treatment planning," Proc. SPIE 7373 (2009).
19. E. Alerstam, T. Svensson, and S. Andersson-Engels, "CUDAMCML - User manual and implementation notes," http://www.atomic.physics.lu.se/biophotonics/.
20. L. Wang, S. Jacques, and L. Zheng, "CONV - convolution for responses to a finite diameter photon beam incident on multi-layered tissues," Comput. Meth. Prog. Biol. 54, 141–150 (1997).
21. NVIDIA Corporation, "CUDA Programming Guide 3.0" (2010).
22. NVIDIA Corporation, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi" (2010).
23. G. Marsaglia, "Random number generators," J. Mod. Appl. Stat. Meth. 2, 2–13 (2003).
24. H. Shen and G. Wang, "A tetrahedron-based inhomogeneous Monte Carlo optical simulator," Phys. Med. Biol. 55, 947–962 (2010).
25. M. Matsumoto and T. Nishimura, "Mersenne Twister: a 623-dimensionally equidistributed uniform pseudorandom number generator," ACM T. Model. Comput. S. 8, 3–30 (1998).
26. M. Saito and M. Matsumoto, SIMD-oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator (Springer, 2008).
27. I. Meglinsky and S. Matcher, "Modelling the sampling volume for skin blood oxygenation measurements," Med. Biol. Eng. Comput. 39, 44–50 (2001).
28. S. Flock, S. Jacques, B. Wilson, W. Star, and M. van Gemert, "Optical properties of Intralipid: a phantom medium for light propagation studies," Laser. Surg. Med. 12, 510–510 (1992).
29. M. Quinn, Parallel Computing: Theory and Practice (McGraw-Hill, 1994).
30. N. Carbone, H. Di Rocco, D. I. Iriarte, and J. A. Pomarico, "Solution of the direct problem in turbid media with inclusions using Monte Carlo simulations implemented in graphics processing units: new criterion for processing transmittance data," J. Biomed. Opt. 15, 035002 (2010).

**OCIS Codes**

(170.3660) Medical optics and biotechnology : Light propagation in tissues

(170.5280) Medical optics and biotechnology : Photon migration

(170.7050) Medical optics and biotechnology : Turbid media

(290.4210) Scattering : Multiple scattering

**ToC Category:**

Image Reconstruction and Inverse Problems

**History**

Original Manuscript: July 9, 2010

Revised Manuscript: August 10, 2010

Manuscript Accepted: August 11, 2010

Published: August 23, 2010

**Citation**

Erik Alerstam, William Chun Yip Lo, Tianyi David Han, Jonathan Rose, Stefan Andersson-Engels, and Lothar Lilge, "Next-generation acceleration and code optimization for light transport in turbid media using GPUs," Biomed. Opt. Express **1**, 658-675 (2010)

http://www.opticsinfobase.org/boe/abstract.cfm?URI=boe-1-2-658

