OSA's Digital Library

Optics Express

Optics Express

  • Editor: C. Martijn de Sterke
  • Vol. 17, Iss. 21 — Oct. 12, 2009
  • pp: 18543–18555
« Show journal navigation

Fast CGH computation using S-LUT on GPU

Yuechao Pan, Xuewu Xu, Sanjeev Solanki, Xinan Liang, Ridwan Bin Adrian Tanjung, Chiwei Tan, and Tow-Chong Chong  »View Author Affiliations


Optics Express, Vol. 17, Issue 21, pp. 18543-18555 (2009)
http://dx.doi.org/10.1364/OE.17.018543


View Full Text Article

Acrobat PDF (234 KB)





Browse Journals / Lookup Meetings

Browse by Journal and Year


   


Lookup Conference Papers

Close Browse Journals / Lookup Meetings

Article Tools

Share
Citations

Abstract

In computation of full-parallax computer-generated hologram (CGH), balance between speed and memory usage is always the core of algorithm development. To solve the speed problem of coherent ray trace (CRT) algorithm and memory problem of look-up table (LUT) algorithm without sacrificing reconstructed object quality, we develop a novel algorithm with split look-up tables (S-LUT) and implement it on graphics processing unit (GPU). Our results show that S-LUT on GPU has the fastest speed among all the algorithms investigated in this paper, while it still maintaining low memory usage. We also demonstrate high quality objects reconstructed from CGHs computed with S-LUT on GPU. The GPU implementation of our new algorithm may enable real-time and interactive holographic 3D display in the future.

© 2009 OSA

1. Introduction

Holographic 3D display uses diffraction effect from holograms to reconstruct 3D objects. It is a very promising true 3D display technology, which has capability to provide wide color gamut and full depth cues, without using any glasses [1

C. Slinger, C. Cameron, and M. Stanley, “Computer-generated holography as a generic display technology,” Computer 38(8), 46–53 (2005). [CrossRef]

]. However, generation of holograms has heavy computational load [2

C. Slinger, C. Cameron, S. Coomber, R. Miller, D. Payne, A. Smith, M. Smith, M. Stanley, and P. Watson, “Recent developments in computer-generated holography: toward a practical electroholography system for interactive 3D visualization,” Proc. SPIE 5290, 27–41 (2004). [CrossRef]

]. To solve this problem needs fast algorithm and powerful computer hardware.

There are different kinds of algorithms for computer generated hologram (CGH) [2

C. Slinger, C. Cameron, S. Coomber, R. Miller, D. Payne, A. Smith, M. Smith, M. Stanley, and P. Watson, “Recent developments in computer-generated holography: toward a practical electroholography system for interactive 3D visualization,” Proc. SPIE 5290, 27–41 (2004). [CrossRef]

,3

C. D. Cameron, D. A. Pain, M. Stanley, and C. W. Slinger, “Computational challengers of emerging novel true 3D holography displays,” Proc. SPIE 4109, 129–140 (2000). [CrossRef]

]. Fourier Ping-Pong algorithm was used to create Fourier holograms for holographic 3D display [4

J. W. Goodman, “Introduction to Fourier Optics 3rd Edition,” McGraw-Hill College, (Roberts & Co. Publishers, 2005).

,5

Y. Ichioka, M. Izumi, and Y. Suzuki, “Scanning halftone plotter and computer-generated continuous tone hologram,” Appl. Opt. 10(2), 403–411 (1971). [PubMed] [CrossRef] [PubMed]

]. In this algorithm, one hologram-size fast Fourier transform operation is required for each slice of object, thus it is slow for objects with many depth layers. Lucente [6

M. Lucente, “Diffraction-specific fringe computation for electro-holography,” Ph. D.Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1994).

] developed diffraction specific algorithm and significantly reduced CGH computational load. But it is more suitable for horizontal parallax only (HPO) holograms and display system with high horizontal resolution. Coherent ray trace (CRT) algorithm simulates light propagation from all sampling object points to all hologram pixels [7

N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603. [PubMed] [CrossRef] [PubMed]

]. It can achieve very high quality image reconstruction with full parallax, while keeping low memory consumption. However, because of direct many-to-many convolution, CRT algorithm is very time consuming. Look-up table (LUT) was used to reduce the computational load of CRT algorithm by pre-calculating possible results and storing them in a table for in-line look-up [8

M. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2(1), 28–34 (1993). [CrossRef]

]. As the amount of possible results is huge, the size of look-up table is big. Kim et al. [9

S. C. Kim and E. S. Kim, “Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods,” Appl. Opt. 48(6), 1030–1041 (2009). [CrossRef]

] implemented a novel look-up table (N-LUT) method to reduce memory usage but it is still in the order of gigabyte (GB).

All CGH algorithms need to trade off among reconstruction quality, computational speed and memory usage. For the holographic 3D display system developed in Data Storage Institute (DSI) [10

X. W. Xu, S. Solanki, X. A. Liang, S. H. Xu, A. T. Ridwan, Y. C. Pan, F. Farbiz, B. X. Xu, and T. C. Chong, “Computer-generated holography for dynamic display of 3D objects with full parallax,” The International Journal of Virtual Reality 8(2), 33–38 (2009), http://www.ijvr.org/issues/issue2-2009/6.pdf.

], we need to compute full parallax holograms with fast speed on PC, even for big holograms containing complicated 3D object specifications. Thus, we have developed a new algorithm using split look-up tables (S-LUT), which can meet the requirements of high speed, low memory usage and full parallax reconstruction.

Using powerful computer hardware is another way to increase CGH computational speed. In the past, one used super computers / clusters for this task due to its heavy computational load [2

C. Slinger, C. Cameron, S. Coomber, R. Miller, D. Payne, A. Smith, M. Smith, M. Stanley, and P. Watson, “Recent developments in computer-generated holography: toward a practical electroholography system for interactive 3D visualization,” Proc. SPIE 5290, 27–41 (2004). [CrossRef]

,3

C. D. Cameron, D. A. Pain, M. Stanley, and C. W. Slinger, “Computational challengers of emerging novel true 3D holography displays,” Proc. SPIE 4109, 129–140 (2000). [CrossRef]

]. Special-purpose computational chip was also designed to increase the speed [11

T. Ito and T. Shimobaba, “One-unit system for electroholography by use of a special-purpose computational chip with a high-resolution liquid-crystal display toward a three-dimensional television,” Opt. Express 12(9), 1788–1793 (2004), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-12-9-1788. [PubMed] [CrossRef] [PubMed]

]. Recently there is a trend to use GPU for CGH computation [7

N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603. [PubMed] [CrossRef] [PubMed]

,12

C. Petz and M. Magnor, “Fast hologram synthesis for 3D geometry models using graphics hardware,” Proc. SPIE 5005, 266–275 (2003). [CrossRef]

,13

L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express 14(17), 7636–7641 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-17-7636. [PubMed] [CrossRef] [PubMed]

]. In this paper, we implement CGH algorithms on GPU in order to further increase the computational speed. But instead of using GPU’s graphic pipeline / OpenGL [7

N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603. [PubMed] [CrossRef] [PubMed]

,12

C. Petz and M. Magnor, “Fast hologram synthesis for 3D geometry models using graphics hardware,” Proc. SPIE 5005, 266–275 (2003). [CrossRef]

,13

L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express 14(17), 7636–7641 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-17-7636. [PubMed] [CrossRef] [PubMed]

], we use C + + -like parallel GPU application programming interface (API), Compute Unified Device Architecture (CUDA) [14

nVidia, “Compute Unified Device Architecture Programming Guide ver. 2.2”, (nVidia, 2009). http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf

], as CUDA is designed especially for direct GPU computing without limitations by graphic pipeline / shader.

2. Conventional CRT and LUT algorithms

Conventional CRT algorithm directly simulates physical light propagation process from each object point in 3D space to each hologram pixel. Its mathematical formula can be written as in Eq. (1) [7

N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603. [PubMed] [CrossRef] [PubMed]

,13

L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express 14(17), 7636–7641 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-17-7636. [PubMed] [CrossRef] [PubMed]

]:
I ( xh, yh)= j=0 N1 ajcos [ 2πλ ( xh xj)2+ ( yh yj)2+ zj 2] ,
(1)
where I ( xh, yh)is the intensity in the hologram plane at z=0, Nthe number of object points and λ the wavelength. ( xj, yj, zj) is the object point coordinates and aj the intensities.

Algorithm steps of CRT follow Eq. (1), and can be listed as:
  • For each hologram pixel I ( xh, yh) : ( xh=0,1,...,X1; yh=0,1,...,Y1)
  • I ( xh, yh)0 ;
  • For each object point aj ( xj, yj, zj) : ( j=0,1,...,N1)
  • I ( xh, yh)I ( xh, yh)+ ajcos [ 2πλ ( xh xj)2+ ( yh yj)2+ zj 2] ; ①
  • End;
  • End;
where X and Y are horizontal and vertical resolutions of spatial light modulator (SLM) used in display system, respectively.

From the algorithm described above, the most inner loop ① runs XYN times. The total computational complexity is in O ( XYN), where O ( ) is the big O notation. For each run of loop ①, operations include one cosine, one square root, five additions and five multiplications. These operations introduce heavy computational load, especially cosine and square root operations, and make the algorithm run very slow. On the other hand, CRT algorithm has some advantages such as no additional memory requirement other than input 3D object specification and output hologram, as well as high reconstruction quality.

In order to solve slow speed problem of CRT, LUT algorithm [8

M. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2(1), 28–34 (1993). [CrossRef]

,9

S. C. Kim and E. S. Kim, “Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods,” Appl. Opt. 48(6), 1030–1041 (2009). [CrossRef]

] had been developed. In LUT, all possible results of the cosine function in Eq. (1) are off-line pre-computed and stored in a table; later whenever a result is needed for in-line computation, it will be directly read out from the table. Original LUT uses HPO, and reduces CGH computational load by one order [8

M. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2(1), 28–34 (1993). [CrossRef]

] while keeping affordable memory requirement. But using HPO is considered as downgrading of reconstruction quality, as HPO lacks vertical parallax. If using LUT for full parallax hologram computation, memory requirement will dramatically increase.

The mathematical detail of LUT can be described in Eq. (2):
I ( xh, yh)= j=0 N1 ajT ( Δx,Δy, zj) ,
(2)
where Δx= xh xj, Δy= yh yj and T ( Δx,Δy, zj)=cos ( 2πλ Δ x2+Δ y2+ zj 2).

Converted into algorithm, it becomes:

  • //Off-line pre-computation:
  • For each possible ( Δx,Δy, zj) : T ( Δx,Δy, zj)cos ( 2πλ Δ x2+Δ y2+ zj 2) ;
  • //In-line computation:
  • For each hologram pixel I ( xh, yh) : ( xh=0,1,...,X1; yh=0,1,...,Y1)
  • I ( xh, yh)0 ;
  • For each object point aj ( xj, yj, zj) : ( j=0,1,...,N1)
  • I ( xh, yh)I ( xh, yh)+ ajT ( Δx,Δy, zj) ; ②
  • End;
  • End;

Although the most inner loop ② of in-line computation part also runs XYN times. The total computational complexity is also in O ( XYN). Each time only one multiplication and one addition operations are performed. The rest of computational operations are done by the off-line pre-computation in LUT. Therefore, the computational speed of LUT algorithm is much faster than CRT. Considering the memory requirement of LUT, xh values are from 0 to d xhX with stepping d xh, and xj values are from 0 to d xoW with stepping d xo, where d xh is the pixel size of SLM, d xo the sampling interval of object space in horizontal direction and W the number of object space sampling points along horizontal direction. For typical holographic display system, d xod xh, and in our display system W and X are in the same order of magnitude. Thus, the minimum of Δx is d xoW and the maximum is d xhX, and the stepping of Δx is d xh. As a result, the table size in Δx dimension is ( d xoW+d xhX)/ d xh= d xoW/ d xh+X. Similarly, the table size in Δy dimension is d yoH/ d yh+Y, where d yo is the sampling interval of object space, H the number of object space sampling points, d yh the pixel size of SLM, all in vertical direction. In total, the table size is ( d xoW/ d xh+X) ( d yoH/ d yh+Y)D, where D is the number of object space sampling points along depth direction (depth layers). Normally d xh and d yh are in the order of micrometer while d xo and d yo are in the order of sub-millimeter, so the table size can easily boost and uses up all available memory space of a PC. This limits the application of LUT algorithm although it is faster than CRT. However the reconstruction quality of LUT is as good as that of CRT because their computational results are mathematically identical.

3. A novel S-LUT algorithm

In order to solve the speed problem of CRT and the memory problem of LUT without sacrificing reconstruction quality, we propose a new algorithm using look-up tables for horizontal and vertical directions, separately. We name it split look-up table (S-LUT) algorithm.

Recalled from Eq. (1), the resulted hologram can be described as:
I ( xh, yh)= j=0 N1 ajcos ( 2πλ Δ x2+Δ y2+ zj 2) ,
(3)
where Δx= xh xj and Δy= yh yj.

As in our display system, reconstruction size is small as compared to object-hologram distance, i.e. | Δx| zj and | Δy| zj, using Fresnel approximation, Eq. (3) can be approximated as Eq. (4):

I ( xh, yh)= j=0 N1 ajcos { 2πλ [ zj+ 1 2 zj ( Δ x2+Δ y2)]} .
(4)

Because wavelength is much smaller than the object-hologram distance, i.e. λ zj, small change of zj in the order of λ will not affect the reconstruction quality. Therefore zj can be adjusted to the nearest multiple of λ. Under this condition, 2π zj/λ is multiple of 2π, and one more 2π zj/λ can be added into Eq. (4) without changing the value of I ( xh, yh). As a result, Eq. (4) becomes Eq. (5):

I ( xh, yh)= j=0 N1 ajcos [ 2πλ ( zj+ 1 2 zjΔ x2+ zj+ 1 2 zjΔ y2)] .
(5)

Using inverse Fresnel approximation on Eq. (5) results in Eq. (6):

I ( xh, yh)= j=0 N1 ajcos ( 2πλ Δ x2+ zj 2+ 2πλ Δ y2+ zj 2) .
(6)

Rewrite Eq. (6) in complex form as Eq. (7):

Ic ( xh, yh)= j=0 N1 aj e i ( 2πλ Δ x2+ zj 2+ 2πλ Δ y2+ zj 2)= j=0 N1 aj e i ( 2πλ Δ x2+ zj 2)× e i ( 2πλ Δ y2+ zj 2) .
(7)

Let us define horizontal light modulation factor as H(Δx, zj)= e i ( 2πλ Δ x2+ zj 2) and vertical light modulation factor as V(Δy, zj)= e i ( 2πλ Δ y2+ zj 2), then Eq. (7) can be rewritten as Eq. (8):

Ic ( xh, yh)= j=0 N1 ajH ( Δx, zj)×V ( Δy, zj) .
(8)

For n object points falling on the same vertical line ( xj, zj), H ( Δx, zj) is all the same, thus the hologram resulted from these n object points can be calculated by Eq. (9):

Ic ( xh, yh) | ( xj, zj)= j=0 n1 ajH ( Δx, zj)×V ( Δy, zj)=H ( Δx, zj)× j=0 n1 ajV ( Δy, zj) .
(9)

As j=0 n1 ajV ( Δy, zj) is independent of Δx, Eq. (9) can be broken down into two steps as Eq. (10) and Eq. (11):
Step 1:  S ( yh)= j=0 n1 ajV ( Δy, zj) ;
(10)
Step 2:   Ic ( xh, yh) | ( xj, zj)=H ( Δx, zj)×S ( yh) ,
(11)
where S ( yh) is the sum of contribution in vertical direction from those n object points to the hologram Ic ( xh, yh) | ( xj, zj).

The final hologram Ic ( xh, yh) is the sum of all contributions from object points on different vertical lines and can be computed by Eq. (12):

Ic ( xh, yh)= Ic ( xh, yh) | ( xj, zj) .
(12)

Therefore our new algorithm S-LUT can be listed as:

  • //Off-line pre-computation:
  • For each possible ( Δx, zj) : H ( Δx, zj) e i ( 2πλ Δ x2+ zj 2) ;
  • For each possible ( Δy, zj) : V ( Δy, zj) e i ( 2πλ Δ y2+ zj 2) ;
  • //In-line computation:
  • For each hologram pixel I ( xh, yh) : I ( xh, yh)0;
  • For each ( xj, zj) that exists aj ( xj, yj, zj)0 :
  • For each yh ( yh=0,1,...,Y1) : S ( yh)0 ;
  • For each yj with aj ( xj, yj, zj)0 and each yh ( yh=0,1,...,Y1) :
  • S ( yh)S ( yh)+ aj ( xj, yj, zj)×V ( Δy, zj) ; ③
  • End;
  • For each ( xh, yh) : ( xh=0,1,...,X1; yh=0,1,...,Y1)
  • Ic ( xh, yh) Ic ( xh, yh)+S ( yh)×H ( Δx, zj) ; ④
  • End;
  • End;

The list of object points falling on the same vertical line is generated by the program in input processing during in-line computation. The n object points with non-zero intensity are marked down and stored in an array for each vertical line together with respective y coordinates.

In in-line computation of S-LUT algorithm, for a vertical line containing n object points with non-zero intensity, loop ③ runs nY times and loop ④ runs XY times. Therefore, total in-line computational complexity is in O ( nY+XY), and during each step of both loop ③ and loop ④, only one multiplication and one addition operations need to be performed. Similar to memory analysis of LUT, the table size of H ( Δx, zj) in Δx dimension is d xoW/ d xh+X, and that of V ( Δy, zj) in Δy dimension is d yoH/ d yh+Y, while the size in zj dimension is D. As a result, the total size of two tables is ( d xoW/ d xh+X+ d yoH/ d yh+Y)D. Reconstruction quality of S-LUT is as good as CRT, if all conditions, | Δx| zj, | Δy| zj, and zj being multiple of λ, are met.

The comparison in computational complexity, operation and memory usage for CRT, LUT and S-LUT algorithms is listed in Table 1 , for n object points falling on the same vertical line.

Table 1  Complexity, operation and memory usage comparison among algorithms
Algorithm Complexity Operation Memory Usage
CRT O ( nXY) 1 cos, 1 , 5 ×, 5 +0
LUT O ( nXY) 1 ×, 1 + ( d xoW/ d xh+X) ( d yoH/ d yh+Y)D
S-LUT O ( nY+XY) 1 ×, 1 + ( d xoW/ d xh+X+ d yoH/ d yh+Y)D

From Table 1, it can be seen that both computational complexity and memory usage of S-LUT are one order less than that of LUT, i.e. S-LUT runs much faster than LUT, while using much less memory.

4. Implementation of algorithms on GPU

By using CUDA, mathematical operations can be performed on GPU cores in parallel, without mapping them to graphic pipeline. As generally GPU has more cores than CPU (30 computational cores for nVidia GTX285 GPU [15

nVidia, “Specification of GeForce GTX 285”, (nVidia, 2008). http://www.nvidia.com/object/product_geforce_gtx_285_us.html [PubMed]

] and 4 cores for Intel core i7 965 CPU [16

Intel, “Specification of Intel Core i7processor Extreme Edition”, (Intel, 2009). http://www.intel.com/products/processor/corei7ee/specifications.htm [PubMed]

]), parallel performance of GPU is better than that of CPU, although individual core speed of GPU is slower than that of CPU (680 MHz for MSI GTX 285 OC [17], 3.2 GHz for Intel core i7 965 [16

Intel, “Specification of Intel Core i7processor Extreme Edition”, (Intel, 2009). http://www.intel.com/products/processor/corei7ee/specifications.htm [PubMed]

]). Multiple GPUs can be used in a PC, which provides more GPU cores to speed up CGH computation. Another advantage of GPU over CPU is that each GPU has its own memory interface to its dedicated memory chips. As a result, memory for different GPUs can read/write simultaneously. Currently only global memory on GPU is used to store the pre-computed data of H ( Δx, zj) and V ( Δy, zj). In the future, shared memory on GPU may be used for further computational speed optimization. However CPU only has one interface to system main memory and data for CPU cores need to be queued up, which may create computational bottleneck.

GPU also has some disadvantages. High speed computation on GPU is mainly for single precision calculation. Error rate of GPU and its memory is higher than that of CPU and main memory. Therefore, direct computation on GPU without error correction may not be suitable for high precision scientific computation. As hologram has high error tolerance, GPU still can meet the requirement of CGH computation, especially when resulted holograms need to be binarised before they are lunched to SLM for reconstruction.

Using a multi-GPU PC for CGH computation needs to consider how to distribute and balance workload among GPUs. In our implementation, the program will distribute loads to each CUDA-enabled GPU in unit of vertical line in object space, in such a way that computation time on each GPU is as close to each other as possible.

Detailed implementation steps of S-LUT algorithm on GPU are listed below, where m is the number of CUDA-enabled GPUs in the PC:

  • //Off-line pre-computation:
  • For each possible ( Δx, zj) : H ( Δx, zj) e i ( 2πλ Δ x2+ zj 2) ;
  • For each possible ( Δy, zj) : V ( Δy, zj) e i ( 2πλ Δ y2+ zj 2) ;
  • //In-line computation:
  • Distribute load into sets Lk ( k=0,1,...,m1) , elements in Lk are vertical lines ( xj, zj) in object space;
  • For each Lk ( k=0,1,...,m1) , create a thread for GPU k and run in parallel:
  • Allocate GPU memory on GPU k;
  • Transfer tables H ( Δx, zj) , V ( Δy, zj) and Lk to GPU global memory;
  • For each hologram pixel Ic ( xh, yh) | GPU k : Ic ( xh, yh) | GPU k0;
  • For each ( xj, zj) Lk , call CUDA kernel to do the following three loops:
  • For each yh ( yh=0,1,...,Y1) : S ( yh)0 ;
  • For each yj with aj ( xj, yj, zj)0 and each yh ( yh=0,1,...,Y1) :
  • S ( yh)S ( yh)+ aj ( xj, yj, zj)×V ( Δy, zj) ;
  • End;
  • For each ( xh, yh) : ( xh=0,1,...,X1; yh=0,1,...,Y1)
  • Ic ( xh, yh) | GPU k Ic ( xh, yh) | GPU k+S ( yh)×H ( Δx, zj) ;
  • End;
  • End;
  • Transfer Ic ( xh, yh) | GPU k to main memory;
  • Deallocate GPU memory on GPU k;
  • End;
  • Combine all Ic ( xh, yh) | GPU k ( k=0,1,...,m1) to get Ic ( xh, yh) ;

From steps listed above, it is noticed that additional memory needs to be allocated for workload Lk and holograms Ic ( xh, yh) | GPU k of individual GPUs in main memory. This increases memory requirement by WHDm+XYm as compared to that running on CPU.

5. Experimental results

Our computational experiments are done on a PC with specifications listed in Table 2 . Our program uses only single CPU core and total 3 × 30 GPU cores. Some parameters used in CGH computation are listed in Table 3 .

Table 2  Specifications of PC
Item Model Details
CPUIntel Core i7 9651 CPU, 4 cores, 3.2GHz, HT enabled
GPUnVidia GTX 2853 GPUs, 30 cores at 680MHz and 1GB memory each
MemoryOCZ DDR3Tri-channel, 1600MHz, total 12GB
OSWindows Vista64 bit with SP1
IDEVisual Studio 200564 bit
GPU APInVidia CUDAVer. 2.2
Table 3  CGH computation parameters
Parameter Symbol Value
Horizontal resolution of SLM X 1024
Vertical resolution of SLM Y 768
Horizontal pixel size of SLM d xh 13.8 µm
Vertical pixel size of SLM d yh 13.8 µm
Number of horizontal sampling points of object space W 301
Number of vertical sampling points of object space H 301
Horizontal sampling interval of object space d xo 150 µm
Vertical sampling interval of object space d yo 150 µm
Wave length of laser used in reconstruction system λ 655 nm

The time for calculation of H ( Δx, zj) and V ( Δy, zj) is shown in Fig. 1 . It can be seen that the time needed for this off-line calculation is proportional to the number of depth layers.

Fig. 1 Time for calculation of H ( Δx, zj) and V ( Δy, zj)for S-LUT.

The comparison of computational speed among different algorithms implemented on both CPU and GPU is shown in Fig. 2 . The detail of computation time of S-LUT on GPU is shown in Fig. 3 . The object points are randomly distributed in the object space (Table 3) with 21 depth layers.

Fig. 2 Comparison of computational speed among algorithms on CPU and GPU.
Fig. 3 Computation time of S-LUT on GPU.

From CPU computational results shown in Fig. 2, it is clear that S-LUT is much faster than CRT and LUT, especially when the number of object points increases. LUT also shows faster computation as compared with CRT. Running time of S-LUT depends mainly on the number of vertical lines rather than the number of object points in object space as described in Table 1. As a result, running time for S-LUT keeps almost constant when the number of object points goes beyond a certain value, which in our test cases is 10k. Beyond this value, almost all vertical lines in object space have at least one object point and need to be computed for different number of object points above 10k.

From GPU computational results shown in Fig. 2, the speed of CRT on GPU increases drastically as compared with that on CPU. CUDA automatically optimizes the execution of kernel code running on GPUs to cover read / write time of GPU memory. This optimization effect is more obvious for larger number of object points for the cases of CRT and LUT on GPU. Therefore, their computation time does not have linear relationship with the number of object points. From Fig. 2 and Fig. 3, the speed of S-LUT on GPU is about 100 times faster than S-LUT on CPU and CRT on GPU. It is shown that S-LUT is the fastest among all algorithms. However, LUT is even slower than CRT on GPU. This is because LUT reads huge amount of data from large area in memory on GPU, and different GPU cores read data simultaneously, which causes a bottleneck in memory bandwidth. As for S-LUT, there is no such a bottleneck because it only reads small amount of data from small area in memory, as shown in Table 1.

As shown in Fig. 3, S-LUT on GPU achieves 0.3 second running time for less than 1k object points and 0.5 second for more than 10k object points. For the number of object points larger than 40k, S-LUT on GPU is faster than LUT on CPU by at least 700 times.

The comparison of memory usage is shown in Fig. 4 , in which base memory refers to the memory used to store input 3D object specification and final output hologram.

Fig. 4 Comparison of memory usage among algorithms on CPU and GPU.

LUT on both CPU and GPU uses much more memory space than CRT and S-LUT. It reaches 12GB physical memory limitation of our PC at 90 depth layers. With virtual memory support, LUT can handle up to 200 layers by using 27GB memory, but the computation as well as the whole PC slows down, and further increase in the number of depth layers stops the OS from responding. S-LUT on CPU uses very small amount of memory and only needs 1.2GB for 20k depth layers. S-LUT on GPU needs roughly twice of base memory because tables of object points are created to optimize overall performance of algorithm. However it still keeps memory usage at 7.5GB for 10k depth layers, which are far more than enough for normal CGH computation.

In order to test memory limitations, we purposely increase resolutions of SLM and object space to four times in both horizontal and vertical directions, i.e. 4096 × 3072 for hologram and 1200 × 1200 for object space, and the comparison is shown in Fig. 5 . Under this condition, LUT on both CPU and GPU stops at 8 depth layers with 17.5GB memory usage; S-LUT on CPU stops at 1490 depth layers with 474MB memory for tables H ( Δx, zj) and V ( Δy, zj) because at 1490 depth layers base memory occupies available memory space of 8GB. S-LUT on GPU uses 12GB at 1000 depth layers, and stops at 1300 depth layers with 15GB memory usages. This shows that S-LUT on GPU can meet most of application requirements within main memory limitation.

Fig. 5 Comparison of memory usage for high resolution SLM and object space.

Based on the results presented above, S-LUT on GPU has shown the fastest speed while maintaining low memory usage. It is most efficient in addressing the balance between computational speed and memory usage among all the algorithms investigated in this paper.

Figure 6 shows objects reconstructed from CGHs computed by CRT on CPU and S-LUT on GPU with our holographic 3D display system, in which the SLM device has 1920 × 1080 pixels, and its pixel size is 10.8 µm in both horizontal and vertical directions. The pictures were captured by a 2D SLR camera. Random-phase noise reduction technique [18

L. Golan and S. Shoham, “Speckle elimination using shift-averaging in high-rate holographic projection,” Opt. Express 17(3), 1330–1339 (2009), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-17-3-1330. [PubMed] [CrossRef] [PubMed]

] is adopted to improve reconstruction quality for all the objects. By naked eyes or even camera, no difference in reconstruction is observed between above two algorithms for both 2D (resolution chart, 372k object points) and 3D (teapot, 100 depth layers, 37k object points) objects. It is shown that S-LUT algorithm implemented on GPU can preserve full parallax and high reconstruction quality of original CRT algorithm.

Fig. 6 Reconstructed objects: (a) 2D resolution chart by CRT on CPU, (b) 2D resolution chart by S-LUT on GPU, (c) 3D teapot by CRT on CPU, and (d) 3D teapot by S-LUT on GPU.

Figure 7 shows a snapshot from a short video clip of a rotating 3D globe (150 depth layers, 62k object points). It is displayed at video rate by our holographic 3D display system [10

X. W. Xu, S. Solanki, X. A. Liang, S. H. Xu, A. T. Ridwan, Y. C. Pan, F. Farbiz, B. X. Xu, and T. C. Chong, “Computer-generated holography for dynamic display of 3D objects with full parallax,” The International Journal of Virtual Reality 8(2), 33–38 (2009), http://www.ijvr.org/issues/issue2-2009/6.pdf.

]. The holograms used to reconstruct this dynamic 3D globe are computed by S-LUT on GPU.

Fig. 7 (Media 1) Video clip of a rotating 3D globe reconstructed from holograms computed with S-LUT on GPU.

6. Conclusion

We have developed a new algorithm by using split look-up tables (S-LUT) and successfully implemented it on GPU using CUDA. Comparisons among different algorithms implemented on both CPU and GPU have also been done in terms of computational speed, memory usage as well as reconstruction quality. S-LUT on GPU can achieve a speed faster than LUT on CPU by at least 700 times for the number of object points larger than 40k under the specified system specifications. It is also about 100 times faster than CRT on GPU. The memory usage of S-LUT on GPU is much less than that of LUT, which enables fast CGH computation for complicated 3D objects. The reconstruction of objects from holograms computed with S-LUT on GPU has shown full parallax and the same image quality as that with CRT on CPU. S-LUT algorithm implemented on GPU is very promising to meet the requirements of high speed, low memory usage and high quality full-parallax reconstruction for real-time and interactive holographic 3D display in the future.

Acknowledgement

This project is a joint project between DSI and I2R (Institute for Infocomm Research) and is funded by HOME2015 Programme of A*STAR, Singapore. We would like to thank Dr. Xu Baoxi from DSI, Dr. Xu Shuhong and Dr. Farzam Farbiz from I2R for helpful discussions.

References and links

1.

C. Slinger, C. Cameron, and M. Stanley, “Computer-generated holography as a generic display technology,” Computer 38(8), 46–53 (2005). [CrossRef]

2.

C. Slinger, C. Cameron, S. Coomber, R. Miller, D. Payne, A. Smith, M. Smith, M. Stanley, and P. Watson, “Recent developments in computer-generated holography: toward a practical electroholography system for interactive 3D visualization,” Proc. SPIE 5290, 27–41 (2004). [CrossRef]

3.

C. D. Cameron, D. A. Pain, M. Stanley, and C. W. Slinger, “Computational challengers of emerging novel true 3D holography displays,” Proc. SPIE 4109, 129–140 (2000). [CrossRef]

4.

J. W. Goodman, “Introduction to Fourier Optics 3rd Edition,” McGraw-Hill College, (Roberts & Co. Publishers, 2005).

5.

Y. Ichioka, M. Izumi, and Y. Suzuki, “Scanning halftone plotter and computer-generated continuous tone hologram,” Appl. Opt. 10(2), 403–411 (1971). [PubMed] [CrossRef] [PubMed]

6.

M. Lucente, “Diffraction-specific fringe computation for electro-holography,” Ph. D.Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1994).

7.

N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603. [PubMed] [CrossRef] [PubMed]

8.

M. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2(1), 28–34 (1993). [CrossRef]

9.

S. C. Kim and E. S. Kim, “Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods,” Appl. Opt. 48(6), 1030–1041 (2009). [CrossRef]

10.

X. W. Xu, S. Solanki, X. A. Liang, S. H. Xu, A. T. Ridwan, Y. C. Pan, F. Farbiz, B. X. Xu, and T. C. Chong, “Computer-generated holography for dynamic display of 3D objects with full parallax,” The International Journal of Virtual Reality 8(2), 33–38 (2009), http://www.ijvr.org/issues/issue2-2009/6.pdf.

11.

T. Ito and T. Shimobaba, “One-unit system for electroholography by use of a special-purpose computational chip with a high-resolution liquid-crystal display toward a three-dimensional television,” Opt. Express 12(9), 1788–1793 (2004), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-12-9-1788. [PubMed] [CrossRef] [PubMed]

12.

C. Petz and M. Magnor, “Fast hologram synthesis for 3D geometry models using graphics hardware,” Proc. SPIE 5005, 266–275 (2003). [CrossRef]

13.

L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express 14(17), 7636–7641 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-17-7636. [PubMed] [CrossRef] [PubMed]

14.

nVidia, “Compute Unified Device Architecture Programming Guide ver. 2.2”, (nVidia, 2009). http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf

15.

nVidia, “Specification of GeForce GTX 285”, (nVidia, 2008). http://www.nvidia.com/object/product_geforce_gtx_285_us.html [PubMed]

16.

Intel, “Specification of Intel Core i7processor Extreme Edition”, (Intel, 2009). http://www.intel.com/products/processor/corei7ee/specifications.htm [PubMed]

17.

MSI, “Specification of MSI N285GTX-T2D1G-OC”, (MSI, 2008). http://www.msi.com/index.php?func=prodvgaspec&maincat_no=130&cat2_no=136&cat3_no=&prod_no=1726#menu [PubMed]

18.

L. Golan and S. Shoham, “Speckle elimination using shift-averaging in high-rate holographic projection,” Opt. Express 17(3), 1330–1339 (2009), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-17-3-1330. [PubMed] [CrossRef] [PubMed]

OCIS Codes
(090.0090) Holography : Holography
(090.1760) Holography : Computer holography
(090.2870) Holography : Holographic display

ToC Category:
Holography

History
Original Manuscript: July 6, 2009
Revised Manuscript: September 1, 2009
Manuscript Accepted: September 23, 2009
Published: September 30, 2009

Citation
Yuechao Pan, Xuewu Xu, Sanjeev Solanki, Xinan Liang, Ridwan Bin Adrian Tanjung, Chiwei Tan, and Tow-Chong Chong, "Fast CGH computation using S-LUT on GPU," Opt. Express 17, 18543-18555 (2009)
http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-17-21-18543


Sort:  Author  |  Year  |  Journal  |  Reset  

References

  1. C. Slinger, C. Cameron, and M. Stanley, “Computer-generated holography as a generic display technology,” Computer 38(8), 46–53 (2005). [CrossRef]
  2. C. Slinger, C. Cameron, S. Coomber, R. Miller, D. Payne, A. Smith, M. Smith, M. Stanley, and P. Watson, “Recent developments in computer-generated holography: toward a practical electroholography system for interactive 3D visualization,” Proc. SPIE 5290, 27–41 (2004). [CrossRef]
  3. C. D. Cameron, D. A. Pain, M. Stanley, and C. W. Slinger, “Computational challengers of emerging novel true 3D holography displays,” Proc. SPIE 4109, 129–140 (2000). [CrossRef]
  4. J. W. Goodman, “Introduction to Fourier Optics 3rd Edition,” McGraw-Hill College, (Roberts & Co. Publishers, 2005).
  5. Y. Ichioka, M. Izumi, and Y. Suzuki, “Scanning halftone plotter and computer-generated continuous tone hologram,” Appl. Opt. 10(2), 403–411 (1971). [CrossRef] [PubMed]
  6. M. Lucente, “Diffraction-specific fringe computation for electro-holography,” Ph. D. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1994).
  7. N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14(2), 603–608 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-2-603 . [CrossRef] [PubMed]
  8. M. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2(1), 28–34 (1993). [CrossRef]
  9. S. C. Kim and E. S. Kim, “Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods,” Appl. Opt. 48(6), 1030–1041 (2009). [CrossRef]
  10. X. W. Xu, S. Solanki, X. A. Liang, S. H. Xu, A. T. Ridwan, Y. C. Pan, F. Farbiz, B. X. Xu, and T. C. Chong, “Computer-generated holography for dynamic display of 3D objects with full parallax,” The International Journal of Virtual Reality 8(2), 33–38 (2009), http://www.ijvr.org/issues/issue2-2009/6.pdf .
  11. T. Ito and T. Shimobaba, “One-unit system for electroholography by use of a special-purpose computational chip with a high-resolution liquid-crystal display toward a three-dimensional television,” Opt. Express 12(9), 1788–1793 (2004), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-12-9-1788 . [CrossRef] [PubMed]
  12. C. Petz and M. Magnor, “Fast hologram synthesis for 3D geometry models using graphics hardware,” Proc. SPIE 5005, 266–275 (2003). [CrossRef]
  13. L. Ahrenberg, P. Benzie, M. Magnor, and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Opt. Express 14(17), 7636–7641 (2006), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-14-17-7636 . [CrossRef] [PubMed]
  14. nVidia, “Compute Unified Device Architecture Programming Guide ver. 2.2”, (nVidia, 2009). http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
  15. nVidia, “Specification of GeForce GTX 285”, (nVidia, 2008). http://www.nvidia.com/object/product_geforce_gtx_285_us.html [PubMed]
  16. Intel, “Specification of Intel Core i7 processor Extreme Edition”, (Intel, 2009). http://www.intel.com/products/processor/corei7ee/specifications.htm [PubMed]
  17. MSI, “Specification of MSI N285GTX-T2D1G-OC”, (MSI, 2008). http://www.msi.com/index.php?func=prodvgaspec&maincat_no=130&cat2_no=136&cat3_no=&prod_no=1726#menu [PubMed]
  18. L. Golan and S. Shoham, “Speckle elimination using shift-averaging in high-rate holographic projection,” Opt. Express 17(3), 1330–1339 (2009), http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-17-3-1330 . [CrossRef] [PubMed]

Cited By

Alert me when this paper is cited

OSA is able to provide readers links to articles that cite this paper by participating in CrossRef's Cited-By Linking service. CrossRef includes content from more than 3000 publishers and societies. In addition to listing OSA journal articles that cite this paper, citing articles from other participating publishers will also be listed.

Multimedia

Multimedia FilesRecommended Software
» Media 1: AVI (4393 KB)      QuickTime

« Previous Article  |  Next Article »

OSA is a member of CrossRef.

CrossCheck Deposited