Approaching System Reliability in the AI Era
Virtual: https://events.vtools.ieee.org/m/485845

Ensuring hardware system reliability is increasingly critical in the evolving AI landscape, particularly within data centers. Drawing upon extensive experience leading reliability initiatives for cutting-edge hardware, this presentation will outline a general methodology for designing reliable complex AI systems. It will emphasize the necessity of a multidisciplinary approach, integrating model-based system engineering, rigorous reliability testing, and continuous system improvements, as exemplified by advancements in liquid cooling and power delivery technologies for high-performance AI processors. The talk will focus on the reliability approach needed for resilience in complex, AI-driven environments.

Speaker(s): Venkata Chivukula