FeTOL - Towards Fault Tolerant Massively Parallel Computations on Peta-scale Platforms
FeTOL is a project funded by the German Federal Ministry of Education and Research (BMBF) that investigates fault tolerance for applications on future HPC systems. Duration: 36 months, starting June 2011.
It is well known that for massively parallel computations beyond the Teraflop scale, the combined probability of local hardware and network failures reaches a level that substantially decreases the productivity of HPC systems, because submitted jobs fail even at moderate runtimes. This also holds for sub-Teraflop applications with extreme runtimes, such as MD applications. It is therefore mandatory to create software frameworks that increase the resilience of HPC applications to partial failures of the underlying hardware resources and thus avoid a complete restart of a massively parallel application run.
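The scaling argument can be made concrete with a simple model. The following is only a sketch under assumptions that are not part of the project description: node failures are taken to be independent and exponentially distributed with a per-node mean time between failures $\theta$, and the numbers in the example are purely illustrative.

```latex
% Assumed model: N independent nodes, each with exponential failure times
% and mean time between failures \theta. A job that needs all N nodes
% for a runtime t fails if at least one node fails.
\begin{align}
  P_{\text{survive}}(t) &= \left(e^{-t/\theta}\right)^{N} = e^{-N t/\theta}, \\
  P_{\text{fail}}(t)    &= 1 - e^{-N t/\theta}, \\
  \text{MTBF}_{\text{system}} &= \frac{\theta}{N}.
\end{align}
% Illustrative numbers (assumptions, not measurements):
% \theta = 5\,\text{years} \approx 43800\,\text{h}, \quad N = 10^{4}, \quad t = 24\,\text{h}
\begin{align}
  \frac{N t}{\theta} &= \frac{10^{4} \cdot 24}{43800} \approx 5.48, &
  P_{\text{fail}} &\approx 1 - e^{-5.48} \approx 0.996, &
  \text{MTBF}_{\text{system}} &\approx 4.4\,\text{h}.
\end{align}
```

Under these assumptions a one-day job on ten thousand nodes almost certainly hits a failure, even though each individual node is very reliable. The same expression shows why long runtimes matter at smaller scale: only the product $N t$ enters the exponent.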
Failure of a single process in an MPI job leads to an unrecoverable error condition and aborts the whole job. FeTOL therefore suggests breaking large MPI jobs down into a set of smaller MPI jobs, so-called fibers, which are interconnected by BOND, a framework similar in functionality to MPI. If a node crashes, the local MPI fiber crashes too, but the remaining fibers survive the fault. BOND then re-assigns resources to the failed MPI job and restarts it from an adequate checkpoint. This operation is much cheaper and more resource-efficient than losing the whole job.
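The fiber idea can be illustrated with a small simulation. This is a minimal sketch and not the BOND API: the names `Fiber`, `Supervisor`, `node_crash`, and `recover` are hypothetical, each fiber is reduced to a step counter, and the sketch ignores how surviving fibers would be kept consistent with a restarted one.

```python
from dataclasses import dataclass


@dataclass
class Fiber:
    """Hypothetical stand-in for one small MPI job (a 'fiber')."""
    fiber_id: int
    node: str
    step: int = 0          # work completed so far
    checkpoint: int = 0    # last step saved to persistent storage
    alive: bool = True

    def advance(self, steps: int = 1) -> None:
        # A crashed fiber makes no progress.
        if self.alive:
            self.step += steps

    def save_checkpoint(self) -> None:
        if self.alive:
            self.checkpoint = self.step


class Supervisor:
    """Hypothetical coordinator playing the role the text assigns to BOND."""

    def __init__(self, nodes, spare_nodes):
        self.fibers = [Fiber(i, node) for i, node in enumerate(nodes)]
        self.spares = list(spare_nodes)

    def node_crash(self, node: str) -> None:
        # Only the fiber running on the crashed node is lost.
        for fiber in self.fibers:
            if fiber.node == node:
                fiber.alive = False

    def recover(self) -> list:
        """Give each failed fiber a spare node and roll it back to its checkpoint."""
        restarted = []
        for fiber in self.fibers:
            if not fiber.alive:
                if not self.spares:
                    raise RuntimeError("no spare node available")
                fiber.node = self.spares.pop(0)
                fiber.step = fiber.checkpoint
                fiber.alive = True
                restarted.append(fiber.fiber_id)
        return restarted


def demo() -> Supervisor:
    sup = Supervisor(["n0", "n1", "n2"], ["spare0"])
    for fiber in sup.fibers:
        fiber.advance(5)
        fiber.save_checkpoint()   # checkpoint at step 5
        fiber.advance(3)          # uncheckpointed work up to step 8
    sup.node_crash("n1")
    for fiber in sup.fibers:
        fiber.advance(2)          # survivors continue to step 10
    sup.recover()
    return sup


if __name__ == "__main__":
    for fiber in demo().fibers:
        print(fiber)
```

In this toy run only the fiber on the crashed node loses work, and only the three steps since its last checkpoint. The other fibers keep their progress, which is the cost advantage the paragraph above describes.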
The main contributions of HLRS are:
- improving the resilience and robustness of the InfiniBand network layer in MPI so that MPI jobs can survive transient network errors
- implementation of a high-level, persistent storage mechanism that allows an application to store essential data to be used in case of restart after a failure
The project partners are:
- TU Braunschweig (coordinator)
- HLRS Stuttgart
- VIS, Univ. Stuttgart
- NEC Deutschland GmbH
- Regionales Rechenzentrum Erlangen
- Univ. Duisburg-Essen
Contact:
Dr. José Gracia
Höchstleistungsrechenzentrum Universität Stuttgart
Nobelstraße 19, 70569 Stuttgart, Germany