FeTOL


FeTOL - Towards Fault Tolerant Massively Parallel Computations on Peta-scale Platforms

FeTOL is a research project funded by the German BMBF that investigates fault tolerance for applications on future HPC systems. Duration: 36 months, starting June 2011.

Objective

It is well known that for massively parallel computations beyond the Teraflop scale, the combined probability of local hardware and network failures reaches a level that substantially decreases the productivity of HPC systems, because submitted jobs fail even at moderate runtimes. The same holds for sub-Teraflop applications with extreme runtimes, such as MD applications. It is therefore essential to create software frameworks that make HPC applications more resilient to partial failures of the underlying hardware resources, so that a complete restart of a massively parallel application run can be avoided.
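The scaling argument above can be illustrated with a small calculation. The following is only a sketch under simplifying assumptions that are not part of the project description: node failures are treated as independent, each node fails at a constant rate, and the per-node MTBF of five years is a made-up illustrative figure. The function name `job_failure_probability` is hypothetical.

```python
import math


def job_failure_probability(node_failure_rate_per_hour, nodes, runtime_hours):
    """Probability that at least one of `nodes` nodes fails during the run.

    Assumes independent node failures with a constant failure rate
    (exponential model). Without fault tolerance, a single node failure
    aborts the whole job, so this is also the job failure probability.
    """
    expected_failures = node_failure_rate_per_hour * nodes * runtime_hours
    return 1.0 - math.exp(-expected_failures)


if __name__ == "__main__":
    # Assumed, purely illustrative per-node MTBF of 5 years.
    mtbf_hours = 5 * 365 * 24
    rate = 1.0 / mtbf_hours

    for nodes in (100, 1_000, 10_000, 100_000):
        p = job_failure_probability(rate, nodes, runtime_hours=24)
        print(f"{nodes:>7} nodes, 24 h run: P(job fails) = {p:.3f}")
```

Under these assumptions the failure probability grows with the product of node count and runtime, which is why both very large jobs and very long-running smaller jobs are affected.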

The failure of a single process in an MPI job leads to an unrecoverable error condition and aborts the whole job. FeTOL therefore proposes breaking large MPI jobs down into a set of smaller MPI jobs, so-called fibers, which are interconnected by BOND, a framework similar in functionality to MPI. If a node crashes, the local MPI fiber crashes too, but the remaining fibers survive the fault. BOND then reassigns resources to the failed fiber and restarts it from a suitable checkpoint. This is much cheaper and more resource-efficient than losing the whole job.

The main contributions of HLRS are:

  • improving the resilience and robustness of the InfiniBand network layer in MPI so that jobs can survive transient network errors
  • implementing a high-level, persistent storage mechanism that allows an application to store essential data for use when restarting after a failure
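
The idea behind the second contribution, storing essential data so that a run can resume after a failure, can be sketched generically as follows. This is not the HLRS storage mechanism or its API, which this page does not describe. It is a minimal single-process illustration using a local JSON file, and the names `save_checkpoint`, `load_checkpoint` and `run` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically.

    The data goes to a temporary file in the same directory first and is
    then renamed over the target, so a crash during the write never leaves
    a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the stored state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every=10):
    """Toy computation that resumes from the last checkpoint, if any."""
    state = load_checkpoint(path, {"step": 0, "value": 0})
    while state["step"] < total_steps:
        state["value"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run` again with the same path after an interruption continues from the last stored step rather than from step 0. A real HPC implementation would additionally have to coordinate many processes and write to storage that survives the loss of a node.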

Partners

  • TU Braunschweig (coordinator)
  • HLRS Stuttgart
  • VIS, Univ. Stuttgart
  • NEC Deutschland GmbH
  • Regionales Rechenzentrum Erlangen
  • Univ. Duisburg-Essen

Contact

Dr. José Gracia
Höchstleistungsrechenzentrum Universität Stuttgart
Nobelstraße 19, 70569 Stuttgart, Germany
Phone: +49-711-685-87208
Fax: +49-711-685-65832
E-Mail: gracia@hlrs.de