Abstract: As the scale of high-performance computing (HPC) systems continues to expand, the frequency of failures within these systems also increases. The follow-up steps in the event of a failure are ...