FeTOL ("Towards Fault Tolerant Massively Parallel Computations on Peta-scale Platforms") is a project funded by the German Federal Ministry of Education and Research (BMBF) that investigates fault tolerance for applications on future HPC systems.
Duration: 36 months, starting June 2011
It is well known that for massively parallel computations beyond the Teraflop scale, the combined probability of local hardware and network failures reaches a level that substantially decreases the productivity of HPC systems, because submitted jobs fail even for moderate runtimes. This also holds for sub-Teraflop applications with extreme runtimes, such as molecular dynamics (MD) applications. It is therefore mandatory to create software frameworks that increase the resilience of HPC applications to partial failures of the underlying hardware resources and thus avoid a complete restart of a massively parallel application run.
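To make the "combined probability" argument concrete, the following minimal sketch estimates the chance that a job loses at least one node during its run. It assumes independent node failures with an exponential lifetime distribution; this model, the function name and all numbers are illustrative assumptions, not FeTOL measurements.

```python
import math


def job_failure_probability(node_mtbf_hours, nodes, runtime_hours):
    """Probability that at least one of `nodes` nodes fails during the run.

    Assumption: nodes fail independently, each with an exponentially
    distributed lifetime whose mean is `node_mtbf_hours`.
    """
    # One node survives the run with probability exp(-t / MTBF);
    # all nodes survive with that probability raised to the power `nodes`.
    exponent = nodes * runtime_hours / node_mtbf_hours
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # Hypothetical node MTBF of 5 years and a 24-hour job.
    mtbf = 5 * 365 * 24
    for n in (1, 100, 10_000):
        p = job_failure_probability(mtbf, n, 24)
        print(f"{n:>6} nodes: failure probability {p:.4f}")
```

Under this model the failure probability depends only on the product of node count and runtime, which is why both very large jobs and very long-running smaller jobs are affected.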
The failure of a single process in an MPI job leads to an unrecoverable error condition and the abort of the whole job. FeTOL therefore suggests breaking down large MPI jobs into a set of smaller MPI jobs, so-called fibers, which are interconnected by BOND, a framework similar in functionality to MPI. If a node crashes, the local MPI fiber crashes too, but the remaining fibers survive the fault. BOND then re-assigns resources to the failed MPI fiber and restarts it from an adequate checkpoint. This operation is much cheaper and more resource-efficient than losing the whole job.
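The fiber idea can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the `Fiber` class, the `run_job` function and the tick-based failure schedule are hypothetical names invented for illustration and do not reflect the actual BOND interface. Communication between fibers is ignored.

```python
from dataclasses import dataclass


@dataclass
class Fiber:
    """One small MPI job (hypothetical model, not the BOND API)."""
    fiber_id: int
    node: str
    step: int = 0        # work completed so far
    checkpoint: int = 0  # last step saved to persistent storage
    restarts: int = 0


def run_job(fibers, spare_nodes, total_steps, checkpoint_every, crashes):
    """Run all fibers to `total_steps` and return the number of ticks used.

    `crashes` maps a tick number to the name of the node failing at that tick.
    """
    spares = list(spare_nodes)
    tick = 0
    while any(f.step < total_steps for f in fibers):
        tick += 1
        failed_node = crashes.get(tick)
        for f in fibers:
            if f.step >= total_steps:
                continue  # this fiber has already finished
            if f.node == failed_node:
                # Only the fiber on the failed node is affected: it gets a
                # spare node and rolls back to its last checkpoint.
                if not spares:
                    raise RuntimeError("no spare node left")
                f.node = spares.pop(0)
                f.step = f.checkpoint
                f.restarts += 1
                continue
            f.step += 1
            if f.step % checkpoint_every == 0:
                f.checkpoint = f.step
    return tick


if __name__ == "__main__":
    job = [Fiber(0, "n0"), Fiber(1, "n1")]
    ticks = run_job(job, ["s0"], total_steps=10, checkpoint_every=4,
                    crashes={7: "n1"})
    print(ticks, [(f.fiber_id, f.node, f.restarts) for f in job])
```

In this toy model the surviving fiber keeps running through the fault, and the failed fiber repeats only the work done since its last checkpoint instead of the whole job being resubmitted.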
The main contributions of HLRS are:
- improving the resilience and robustness of the InfiniBand network layer in MPI so that applications can survive transient network errors
- implementing a high-level, persistent storage mechanism that allows an application to store essential data for use when restarting after a failure
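As a rough illustration of the second contribution, an application-level store for restart data might look like the sketch below. It is not the HLRS implementation: the function names, the JSON format and the directory layout are assumptions chosen for brevity. The only technique shown is writing to a temporary file and renaming it, so that a crash during a save never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(directory, name, state):
    """Atomically store `state` (JSON-serialisable) under `name`."""
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to stable storage
        # os.replace is atomic: readers see the old or the new file, never a mix.
        os.replace(tmp_path, os.path.join(directory, name + ".json"))
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(directory, name):
    """Return the stored state, or None if no checkpoint exists yet."""
    try:
        with open(os.path.join(directory, name + ".json")) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        save_checkpoint(d, "fiber0", {"step": 400, "energy": 1.25})
        print(load_checkpoint(d, "fiber0"))
```

After a restart, the application would call `load_checkpoint` first and fall back to its initial state when the result is `None`.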
The project partners are:
- TU Braunschweig (TUBS, coordinator)
- HLRS Stuttgart (HLRS)
- NEC Deutschland GmbH (NEC)
- Platform Computing GmbH (PCG)
- Regionales Rechenzentrum Erlangen (RRZE)
- Univ. Duisburg-Essen (UDE)
- VISENSO GmbH (VIS)