Node-Level Performance Engineering



This course teaches performance engineering approaches at the compute-node level. "Performance engineering" as we define it is more than employing tools to identify hotspots and bottlenecks. It is about developing a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code that does the actual computational work is executed. Once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of optimizations can often be predicted. We introduce a "holistic" node-level performance engineering strategy, apply it to different algorithms from computational science, and also show how an awareness of the performance features of an application may lead to notable reductions in power consumption.
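
The kind of prediction mentioned above can be illustrated with the Roofline model, which appears later in the program. The following is only a minimal sketch: the machine numbers (peak performance and memory bandwidth) and the example loop are hypothetical and not taken from the course material. It estimates how much a memory-bound loop a(i) = b(i) + s * c(i) could gain from non-temporal stores, which avoid the write-allocate transfer.

```python
def roofline_gflops(peak_gflops, mem_bw_gbytes_per_s, intensity_flop_per_byte):
    """Roofline upper bound: min(peak performance, intensity * memory bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * mem_bw_gbytes_per_s)


# Hypothetical machine parameters (assumptions for illustration only).
PEAK_GFLOPS = 500.0
MEM_BW_GBYTES_PER_S = 50.0

# Loop a(i) = b(i) + s * c(i) in double precision: 2 flops per iteration.
# With write-allocate: load b, load c, write-allocate a, store a = 4 * 8 bytes.
baseline_intensity = 2.0 / 32.0
# With non-temporal stores the write-allocate is skipped: 3 * 8 bytes.
nt_intensity = 2.0 / 24.0

baseline = roofline_gflops(PEAK_GFLOPS, MEM_BW_GBYTES_PER_S, baseline_intensity)
optimized = roofline_gflops(PEAK_GFLOPS, MEM_BW_GBYTES_PER_S, nt_intensity)
speedup = optimized / baseline

print(f"baseline bound:  {baseline:.3f} GFlop/s")
print(f"NT-store bound:  {optimized:.3f} GFlop/s")
print(f"predicted gain:  {speedup:.2f}x")
```

Under these assumptions the loop is memory bound in both variants, so the predicted gain equals the ratio of the data volumes (32 bytes vs. 24 bytes per iteration). A measured speedup that falls well short of this bound would point to a different bottleneck.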

This course provides scientific training in Computational Science and, in addition, fosters scientific exchange among the participants.

Attendees are warmly invited to also join the course User Guided Optimization in High-Level Languages, held on July 08, which targets related topics!

General Information


First day
09:00 - 09:30 local registration
09:30 - 13:00 lectures (with breaks: 10:30-10:45 & 11:45-12:00)
13:00 - 14:00 lunch break
14:00 - 17:00 lectures (with break: 15:10-15:25)

Second day
09:00 - 13:00 lectures (with breaks: 10:15-10:30 & 11:45-12:00)
13:00 - 14:00 lunch break
14:00 - 17:00 lectures (with break: 15:10-15:25)

Detailed Program


  • Intel and AMD x86 architectures
  • ccNUMA
  • Performance modeling & engineering approaches
  • Our Approach

Practical performance analysis

  • The LIKWID tools
  • Typical performance patterns

Microbenchmarks and the memory hierarchy

  • Understanding the memory hierarchy
    • Data transfer between memory levels
    • Write allocate vs. NT stores
    • Modeling of cache hierarchies
    • NUMA effects - anisotropy and asymmetry
  • Contention

Typical node-level software overheads

  • Cost of synchronization
  • Work distribution

Example Problem: The 3D Jacobi solver

  • Core-level optimizations
    • Blocking
    • Non-temporal stores
    • SIMD vectorization (SSE, AVX)
  • Multithreading - contention at different memory hierarchies
  • Temporal Blocking

Example Problem: The Lattice-Boltzmann Method (LBM)

  • Introduction
  • Roofline Model
  • Data layout
  • Non-temporal stores
  • Model for in-cache data & multicore scaling
  • Sparse representation and options for Propagation

Example Problem: Sparse Matrix-Vector Multiplication

  • Data layouts
  • Performance model - CPU vs. GPU
  • Bandwidth reduction

Example Problem: A backprojection algorithm for CT reconstruction

  • The algorithm
  • Naïve analysis
  • Detailed analysis and performance model 
  • Optimizations

Energy & Parallel Scalability

  • Energy consumption of modern processors
  • The energy-to-solution metric
  • Performance engineering == power engineering
  • Case studies

Between modules, there is time for questions and answers!




Dr. habil. Georg Hager
Dr.-Ing. Jan Eitzinger
Prof. Dr. Gerhard Wellein


Academic participants (i.e., members of universities or public research institutions) from Europe or PRACE countries: Please apply through the PATC web page. After registering, you will receive an automated "congratulation" email confirming your successful registration. This email means that you have a guaranteed seat in the course and should organize your travel.
All other participants (not from academia, or from outside Europe): Please apply through this online registration form. Please also use this form if the PATC web page is temporarily unavailable.
The course number is 2015-NLP.

Deadline for Registration

21 June 2015 (extended deadline)


Members of German universities and public research institutes: none
Members of universities and public research institutes within Europe or PRACE: none
Members of other universities and public research institutes: 120 EUR
Others: 400 EUR
(The fee includes food and drink at coffee breaks and will be collected on the first day of the course, cash only.)

Cancelation policy

If you cannot come to the course, please send an email to the organizer as soon as possible. This allows us to accept additional participants from the waiting list. There is no cancelation fee.
NO-SHOW: Registered persons who neither cancel nor show up, without giving any reason, are blocked from all of our workshops for the following year (because it is too expensive to produce unused copies of the slides for them).


Participants must have basic knowledge of programming in Fortran or C.

Course Material

The course material and an updated agenda are available here.
An older version of this course with most of the material (including the audio information) can also be viewed in the ONLINE Parallel Programming Workshop.


HLRS is part of the Gauss Centre for Supercomputing (GCS), which is one of the six PRACE Advanced Training Centres (PATCs) that started in Feb. 2012. The mandate for the PATCs is as follows: "The PRACE Advanced Training Centres will serve as European hubs of advanced, world-class training for researchers working in the computational sciences." (see D3.2.3)
This course is a PATC course, see also the PRACE Training Portal and Events. For participants from public research institutions in PRACE countries, the course fee is sponsored through the PRACE PATC program.

HLRS is also a member of the Baden-Württemberg initiative bwHPC-C5.
This course is also provided within the framework of the bwHPC-C5 user support.


Local Organizer

Rolf Rabenseifner
Phone 0711 685 65530