Resilience is becoming an increasingly critical requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are primarily designed for density, optimizing memory capacity and throughput, rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. Compared with an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.
© 2012 OSA
Original Manuscript: June 1, 2012
Revised Manuscript: September 26, 2012
Manuscript Accepted: September 28, 2012
Published: October 30, 2012
Daniel Brunina, Caroline P. Lai, Dawei Liu, Ajay S. Garg, and Keren Bergman, "Resilient Optically Connected Memory Systems Using Dynamic Bit-Steering [Invited]," J. Opt. Commun. Netw. 4, B151-B160 (2012)