Abstract
Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are primarily designed for density, optimizing memory capacity and throughput, rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. Compared with an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.
©2012 Optical Society of America