OSA's Digital Library

Journal of Optical Communications and Networking

  • Editors: K. Bergman and O. Gerstel
  • Vol. 4, Iss. 11 — Nov. 1, 2012
  • pp: B151–B160

Resilient Optically Connected Memory Systems Using Dynamic Bit-Steering [Invited]

Daniel Brunina, Caroline P. Lai, Dawei Liu, Ajay S. Garg, and Keren Bergman


Journal of Optical Communications and Networking, Vol. 4, Issue 11, pp. B151-B160 (2012)
http://dx.doi.org/10.1364/JOCN.4.00B151


Abstract

Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are primarily designed for density, optimizing memory capacity and throughput, rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. As compared to an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.

© 2012 OSA

OCIS Codes
(200.0200) Optics in computing : Optics in computing
(200.4650) Optics in computing : Optical interconnects

ToC Category:
OFC/NFOEC 2012

History
Original Manuscript: June 1, 2012
Revised Manuscript: September 26, 2012
Manuscript Accepted: September 28, 2012
Published: October 30, 2012

Citation
Daniel Brunina, Caroline P. Lai, Dawei Liu, Ajay S. Garg, and Keren Bergman, "Resilient Optically Connected Memory Systems Using Dynamic Bit-Steering [Invited]," J. Opt. Commun. Netw. 4, B151-B160 (2012)
http://www.opticsinfobase.org/jocn/abstract.cfm?URI=jocn-4-11-B151

