Effective Bandwidth (b_eff) Benchmark


The algorithm of beff (version 3.6 + bugfix 3.6.0.1)

The effective bandwidth beff measures the accumulated bandwidth of the communication network of parallel and/or distributed computing systems. Several message sizes, communication patterns and methods are used. The algorithm uses an average to take into account that short and long messages are transferred with different bandwidth values in real applications.

Definition of the effective bandwidth beff:

beff = logavg( logavg_(ring patterns)  ( sum_L ( max_mthd ( max_rep ( b(ring pat.,   L, mthd, rep) ) ) ) / 21 ),
               logavg_(random patterns)( sum_L ( max_mthd ( max_rep ( b(random pat., L, mthd, rep) ) ) ) / 21 ) )

with b(pat., L, mthd, rep) denoting the bandwidth measured for one communication pattern, message size L, programming method and repetition, and 21 being the number of different message sizes.
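The structure of this definition can be sketched in C. The following fragment is illustrative only (the identifiers are not those of b_eff.c) and assumes that the maximum bandwidth over all methods and repetitions has already been determined for each pattern and each of the 21 message sizes.

#include <math.h>

/* Logarithmic average of n positive values: exp( (1/n) * sum(log(x_i)) ). */
static double logavg(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += log(x[i]);
    return exp(s / n);
}

/* Illustrative aggregation; the identifiers are assumptions, not b_eff.c names.
 * b_ring[p][l] and b_rand[p][l] hold max_mthd(max_rep(b(pattern p, size l)))
 * for the 21 message sizes.                                                   */
double beff_aggregate(int nring, double b_ring[][21],
                      int nrand, double b_rand[][21])
{
    double sum_ring[nring], sum_rand[nrand];   /* C99 variable-length arrays */

    for (int p = 0; p < nring; p++) {          /* sum over the 21 sizes, /21 */
        double s = 0.0;
        for (int l = 0; l < 21; l++) s += b_ring[p][l];
        sum_ring[p] = s / 21.0;
    }
    for (int p = 0; p < nrand; p++) {
        double s = 0.0;
        for (int l = 0; l < 21; l++) s += b_rand[p][l];
        sum_rand[p] = s / 21.0;
    }
    /* logavg over the ring patterns, logavg over the random patterns,
     * then the logavg of these two values.                            */
    double pair[2] = { logavg(sum_ring, nring), logavg(sum_rand, nrand) };
    return logavg(pair, 2);
}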

Details of the algorithm:

Programming methods:
The communication is programmed with several methods. This allows measuring the effective bandwidth independently of which MPI routines are optimized on a given platform. The maximum bandwidth over the following methods is used (a minimal sketch of the three methods is given after the list):
  1. MPI_Sendrecv
  2. MPI_Alltoallv
  3. non-blocking MPI_Irecv and MPI_Isend with MPI_Waitall.
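A minimal sketch of a single transfer with each of the three methods is given below. It is illustrative only: buffer setup, the repetition loop and the timing done by the real b_eff.c are omitted, and the neighbor ranks left and right are assumed to be given by the chosen pattern.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative: one transfer of L bytes to the neighbors of a ring pattern,
 * programmed with each of the three methods.  b_eff.c times such transfers
 * and keeps the maximum bandwidth over the methods.                         */
void one_step(void *sbuf, void *rbuf, int L, int left, int right,
              MPI_Comm comm)
{
    /* Method 1: MPI_Sendrecv */
    MPI_Sendrecv(sbuf, L, MPI_BYTE, right, 0,
                 rbuf, L, MPI_BYTE, left,  0, comm, MPI_STATUS_IGNORE);

    /* Method 2: MPI_Alltoallv -- only the two neighbors get nonzero counts */
    int size;
    MPI_Comm_size(comm, &size);
    int *scnt = calloc(size, sizeof(int)), *sdsp = calloc(size, sizeof(int));
    int *rcnt = calloc(size, sizeof(int)), *rdsp = calloc(size, sizeof(int));
    scnt[right] = L;
    rcnt[left]  = L;
    MPI_Alltoallv(sbuf, scnt, sdsp, MPI_BYTE,
                  rbuf, rcnt, rdsp, MPI_BYTE, comm);
    free(scnt); free(sdsp); free(rcnt); free(rdsp);

    /* Method 3: non-blocking MPI_Irecv and MPI_Isend with MPI_Waitall */
    MPI_Request rq[2];
    MPI_Irecv(rbuf, L, MPI_BYTE, left,  0, comm, &rq[0]);
    MPI_Isend(sbuf, L, MPI_BYTE, right, 0, comm, &rq[1]);
    MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);
}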
Communication patterns:
To produce a balanced measurement on any network topology, different communication patterns are used; they are based on rings and on random distributions of the processes (see the summary below).
Lmax
On systems with sizeof(int)<64, Lmax must be less than or equal to 128 MB, i.e. Lmax = min(128 MB, (memory per processor)/128).
Further details are described in the technical section.

Summary:

The effective bandwidth is the number of MPI processes, multiplied by the asymptotic bandwidth, multiplied by the ratio of the area under the curve "bandwidth over message length" to the area under the constant asymptotic bandwidth in the same diagram. To measure the bandwidth, several communication patterns are applied; the patterns are based on rings and on random distributions. The logarithmic average over all ring patterns and over all random patterns is computed, and beff is the logarithmic average of these two values. The communication is implemented in three different ways with MPI, and for each single measurement the maximum bandwidth over the three methods is used. For the ratio mentioned above, the bandwidth is plotted over the message length, with the message length values placed equidistantly on the abscissa, i.e. along two logarithmic scales, one from 1 byte to 4 kbyte (12 intervals) and the next from 4 kbyte to Lmax (8 intervals).
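The 21 message sizes implied by this scheme (13 values on the first logarithmic scale from 1 byte to 4 kbyte, plus 8 further values up to Lmax) can be generated as in the following illustrative fragment; the exact rounding used by b_eff.c may differ.

#include <math.h>
#include <stdio.h>

/* Illustrative generation of the 21 message sizes: 12 logarithmic intervals
 * from 1 byte to 4 kbyte (13 values) plus 8 logarithmic intervals from
 * 4 kbyte to Lmax (8 further values).                                      */
int main(void)
{
    double Lmax = 128.0 * 1024 * 1024 / 128;   /* e.g. 128 MB/PE -> Lmax = 1 MB */
    double L[21];
    int n = 0;

    for (int i = 0; i <= 12; i++)              /* 1, 2, 4, ..., 4096 bytes   */
        L[n++] = pow(4096.0, i / 12.0);
    for (int i = 1; i <= 8; i++)               /* 8 values up to Lmax        */
        L[n++] = 4096.0 * pow(Lmax / 4096.0, i / 8.0);

    for (int i = 0; i < n; i++)
        printf("L[%2d] = %.0f bytes\n", i, L[i]);
    return 0;
}

With 128 MB per processor this gives Lmax = 1 MB (as on the T3E below); with 1 GB per processor it gives Lmax = 8 MB, which matches the 8.000 MByte column in the SR8000 tables.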

Background

A first approach by Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2] was based on the bisection bandwidth. Due to several problems, a redesign was done. The redesign tries not to violate the rules defined by Rolf Hempel in [3] and by William Gropp and Ewing Lusk in [4].

Output of the beff Benchmark

Each run of the benchmark on a particular system results in a set of output files. (Default prefix is b_eff.)
  • b_eff.prot: a detailed protocol
  • b_eff.short: a short version of the protocol
  • b_eff.sum: a short overview
  • b_eff.plot: a file with data for plotting
  • b_eff.gps: a script for gnuplot
  • b_eff.tex: a LaTeX source file to create a benchmark report
With only three steps you can create a benchmark report with charts. You need:
  • Gnuplot (version 3.7 or newer)
  • an installation of LaTeX
  • b_eff.plot b_eff.gps b_eff.tex
Commands: gnuplot b_eff.gps
          latex b_eff.tex
          dvips b_eff.dvi
In b_eff.short and b_eff.sum the last line reports, e.g.,
   b_eff = 9709.549 MB/s = 37.928 * 256 PEs with 128 MB/PE on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E
This line reports
  • the effective bandwidth beff of the whole system,
  • the effective bandwidth of each processor (or node),
  • the number of processors (or nodes),
  • the memory of each processor (or node),
  • the output of uname -a.
A full description of the benchmark protocol is available here.

Sourcecode

b_eff.c (version 3.6.0.1)

Benchmarking

If you use this benchmark, please send us back the following information:
  • which compilation command was used,
  • which MPI implementation, version, ... was used,
  • whether you set up a special environment, e.g. UNIX environment variables for compiling or running the benchmark,
  • the command and/or batch script you used to start the benchmark,
  • b_eff.c writes its results on stdout; please attach this output as a gzip'ed attachment.
Additionally -- only for your convenience -- b_eff.c also writes the last summary line on stderr. Some examples of how to compile and start b_eff.c are given in the following sections. In all cases one has to choose the correct memory size value (in MBytes). The syntax for setting the CPP macro MEMORY_PER_PROCESSOR may differ, e.g. with or without a blank after the -D option.
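As an illustration of how such a compile-time macro can be consumed (this is not the actual code of b_eff.c), the memory size in MBytes can be turned into the maximum message length Lmax defined above:

/* Illustrative only: how a memory size given in MBytes at compile time via
 * -DMEMORY_PER_PROCESSOR=<n> can determine the maximum message length
 * Lmax = min(128 MB, (memory per processor)/128) from the definition above. */
#ifndef MEMORY_PER_PROCESSOR
#error "compile with -DMEMORY_PER_PROCESSOR=<MBytes per processor>"
#endif

#define MBYTE (1024L * 1024L)
#define LMAX_BYTES ((MEMORY_PER_PROCESSOR * MBYTE / 128 < 128 * MBYTE) \
                     ? MEMORY_PER_PROCESSOR * MBYTE / 128 : 128 * MBYTE)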

Please send the mail to
rabenseifner@rus.uni-stuttgart.de.

First Results

Distributed Memory Systems

Size and beff values are highlighted if the measurement covers the whole system.

On a Cray T3E-900 with 512+32 processors and 128 MByte/processor

On Nov. 7, 1999, on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E, with 128 MB/PE: The measurements with 2 to 256 PEs were done while another application was running on the first 256 PEs. Currently the 512 PE value must be computed on the basis of former measurements with release 3.1, using the one-dimensional cyclic and the random values. The MPI implementation mpt.1.3.0.2 was used and the environment variable MPI_BUFFER_MAX=4099 was set.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
512 19919.128 38.905 result_3.3_t3e_512.shrt result_3.3_t3e_512.gz
384 15526.600 40.434 result_3.2_t3e_384a.shrt result_3.2_t3e_384a.gz
256 10056.033 39.281 result_3.2_t3e_256a.shrt result_3.2_t3e_256a.gz
192 7871.336 40.997 result_3.2_t3e_192a.shrt result_3.2_t3e_192a.gz
128 5620.345 43.909 result_3.2_t3e_128a.shrt result_3.2_t3e_128a.gz
96 4180.723 43.549 result_3.2_t3e_096a.shrt result_3.2_t3e_096a.gz
64 3158.554 49.352 result_3.2_t3e_064a.shrt result_3.2_t3e_064a.gz
48 2725.891 56.789 result_3.2_t3e_048a.shrt result_3.2_t3e_048a.gz
32 1893.872 59.183 result_3.2_t3e_032a.shrt result_3.2_t3e_032a.gz
24 1522.225 63.426 result_3.2_t3e_024a.shrt result_3.2_t3e_024a.gz
16 1063.217 66.451 result_3.2_t3e_016a.shrt result_3.2_t3e_016a.gz
12 918.109 76.509 result_3.2_t3e_012a.shrt result_3.2_t3e_012a.gz
8 612.815 76.602 result_3.2_t3e_008a.shrt result_3.2_t3e_008a.gz
6 509.359 84.893 result_3.2_t3e_006a.shrt result_3.2_t3e_006a.gz
4 355.045 88.761 result_3.2_t3e_004a.shrt result_3.2_t3e_004a.gz
3 278.898 92.966 result_3.2_t3e_003a.shrt result_3.2_t3e_003a.gz
2 182.989 91.495 result_3.2_t3e_002a.shrt result_3.2_t3e_002a.gz
Used commands: module switch mpt mpt.1.3.0.2 
               cc -o b_eff -DMEMORY_PER_PROCESSOR=128 b_eff.c 
               export MPI_BUFFER_MAX=4099 
               mpirun -np size ./b_eff result_3.2_t3e_size
MPI release:   mpt.1.3.0.2
Execution time < 225 sec 

On a Hitachi SR 8000 with 128 processors on 16 nodes and 1 GByte/processor

On Nov. 8, 2001, on HI-UX/MPP hwwsr8k 03-04 0 SR8000, with 1 GB/PE: The measurements were done with exclusively used PEs.
Because of some problems the MPI-MPP measurements with size>=64 were redone with a new revision of b_eff in May 2002.

size   beff (MByte/s)   beff/size (MByte/s)   bandwidth per PE at Lmax (MByte/s)   PingPong latency (microsec)   PingPong bandwidth (MByte/s)   maximal message length Lmax (MByte)   #nodes * #PEs   summary   full protocol
MPI-MPP
96 6065.118 63.178 207.954 11.113 1281.377 8.000 12 * 8 result_3.5_SR8000_1GB_012nodes_096PEs.short result_3.5_SR8000_1GB_012nodes_096PEs.tar.gz
80 5012.301 62.654 200.951 11.038 1281.284 8.000 10 * 8 result_3.5_SR8000_1GB_010nodes_080PEs.short result_3.5_SR8000_1GB_010nodes_080PEs.tar.gz
64 4194.937 65.546 209.369 11.173 1285.415 8.000 8 * 8 result_3.5_SR8000_1GB_008nodes_064PEs.short result_3.5_SR8000_1GB_008nodes_064PEs.tar.gz
48 3122.585 65.054 199.884 11.043 1198.888 8.000 6 * 8 result_3.5_SR8000_1GB_006nodes_048PEs.short result_3.5_SR8000_1GB_006nodes_048PEs.tar.gz
32 2345.881 73.309 220.027 10.874 1189.626 8.000 4 * 8 result_3.5_SR8000_1GB_004nodes_032PEs.short result_3.5_SR8000_1GB_004nodes_032PEs.tar.gz
MPI + Compas
12 1535.074 127.923 463.716 22.614 799.754 8.000 12 * 1 result_3.5_SR8000_1GB_012nodes_012PEs.short result_3.5_SR8000_1GB_012nodes_012PEs.tar.gz
8 1075.017 134.377 506.474 22.339 806.329 8.000 8 * 1 result_3.5_SR8000_1GB_008nodes_008PEs.short result_3.5_SR8000_1GB_008nodes_008PEs.tar.gz
6 796.021 132.670 483.954 22.362 807.569 8.000 6 * 1 result_3.5_SR8000_1GB_006nodes_006PEs.short result_3.5_SR8000_1GB_006nodes_006PEs.tar.gz
4 605.150 151.288 523.326 22.279 805.586 8.000 4 * 1 result_3.5_SR8000_1GB_004nodes_004PEs.short result_3.5_SR8000_1GB_004nodes_004PEs.tar.gz
Used commands: mpicc -O4 -pvec +Op -noparallel -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c  (MPI-MPP)
               mpicc -O4 -pvec +Op -parallel -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c    (MPI + Compas)
               mpiexec -p multi -N #nodes -n #processes  ./b_eff
MPI release:   P-1811-1113, HI-UX/MPP 
               MPIR_RANK_NO_ROUND=yes
               JOBTYPE=E8S
               was automatically set according to the HLRS defaults with MPI on SR8000
Execution time < 130 sec 

On a Hitachi SR 8000 with 24 processors on 3 nodes and 1 GByte/processor

On Nov. 29, 1999, on HI-UX/MPP himiko 03-00 ad2b0 SR8000, with 1 GB/PE: The measurements were done with exclusively used PEs. The ping-pong measurement is done with the first two MPI processes of each beff configuration. The MPI implementation does not use the topology information given by the beff benchmark program and allocates the process ranks by default in round-robin order. This results in a bad efficiency, see the second part of the table. Two measurements were taken twice, see the ..._b.shrt files. By using the multi-command interface of mpiexec, one can explicitly allocate contiguous intervals of process ranks on each SR8000 node, see the first part of the table.
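The difference between the two placements corresponds to the following rank-to-node mappings (an illustrative fragment with hypothetical helper names; nnodes is the number of nodes, ppn the number of processes per node):

/* Illustrative rank-to-node mappings for the two placements discussed above
 * (nnodes = number of nodes, ppn = processes per node; hypothetical helpers). */
int node_of_rank_round_robin(int rank, int nnodes)   /* default order          */
{
    return rank % nnodes;       /* ranks 0,3,6,... on node 0 when nnodes = 3   */
}

int node_of_rank_contiguous(int rank, int ppn)       /* explicit mpiexec order */
{
    return rank / ppn;          /* ranks 0..ppn-1 on node 0, and so on         */
}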

size   beff (MByte/s)   beff/size (MByte/s)   bandwidth per PE at Lmax (MByte/s)   PingPong latency (microsec)   PingPong bandwidth (MByte/s)   maximal message length Lmax (MByte)   #nodes * #PEs   summary   full protocol
explicitly allocated PEs, i.e. contiguous ranks on each node:
24 1805.675 75.236 400.133 11.728 954.936 8.000 3 * 8 result_3.3_SR8000_1GB_003nodes_024PEs_c.shrt result_3.3_SR8000_1GB_003nodes_024PEs_c.gz
18 1565.703 86.983 427.860 11.525 1202.586 8.000 3 * 6 result_3.3_SR8000_1GB_003nodes_018PEs_c.shrt result_3.3_SR8000_1GB_003nodes_018PEs_c.gz
12 1257.728 104.811 489.445 11.475 1204.480 8.000 3 * 4 result_3.3_SR8000_1GB_003nodes_012PEs_c.shrt result_3.3_SR8000_1GB_003nodes_012PEs_c.gz
6 758.508 126.418 477.788 11.437 1224.976 8.000 3 * 2 result_3.3_SR8000_1GB_003nodes_006PEs_c.shrt result_3.3_SR8000_1GB_003nodes_006PEs_c.gz
3 396.829 132.276 447.107 23.307 791.866 8.000 3 * 1 result_3.3_SR8000_1GB_003nodes_003PEs_c.shrt result_3.3_SR8000_1GB_003nodes_003PEs_c.gz
16 1530.664 95.667 411.060 11.811 969.781 8.000 2 * 8 result_3.3_SR8000_1GB_002nodes_016PEs_c.shrt result_3.3_SR8000_1GB_002nodes_016PEs_c.gz
12 1287.352 107.279 439.721 11.527 1208.742 8.000 2 * 6 result_3.3_SR8000_1GB_002nodes_012PEs_c.shrt result_3.3_SR8000_1GB_002nodes_012PEs_c.gz
8 989.464 123.683 504.567 11.521 1213.191 8.000 2 * 4 result_3.3_SR8000_1GB_002nodes_008PEs_c.shrt result_3.3_SR8000_1GB_002nodes_008PEs_c.gz
6 766.605 127.768 499.667 11.555 1222.560 8.000 2 * 3 result_3.3_SR8000_1GB_002nodes_006PEs_c.shrt result_3.3_SR8000_1GB_002nodes_006PEs_c.gz
4 574.523 143.631 519.596 11.484 1226.043 8.000 2 * 2 result_3.3_SR8000_1GB_002nodes_004PEs_c.shrt result_3.3_SR8000_1GB_002nodes_004PEs_c.gz
2 306.570 153.285 521.074 22.923 799.677 8.000 2 * 1 result_3.3_SR8000_1GB_002nodes_002PEs_c.shrt result_3.3_SR8000_1GB_002nodes_002PEs_c.gz
8 1218.994 152.374 455.575 11.570 916.839 8.000 1 * 8 result_3.3_SR8000_1GB_001nodes_008PEs_c.shrt result_3.3_SR8000_1GB_001nodes_008PEs_c.gz
7 1118.625 159.804 488.660 11.528 1207.508 8.000 1 * 7 result_3.3_SR8000_1GB_001nodes_007PEs_c.shrt result_3.3_SR8000_1GB_001nodes_007PEs_c.gz
6 974.033 162.339 506.776 11.361 1211.698 8.000 1 * 6 result_3.3_SR8000_1GB_001nodes_006PEs_c.shrt result_3.3_SR8000_1GB_001nodes_006PEs_c.gz
5 848.999 169.800 515.719 11.506 1211.176 8.000 1 * 5 result_3.3_SR8000_1GB_001nodes_005PEs_c.shrt result_3.3_SR8000_1GB_001nodes_005PEs_c.gz
4 714.477 178.619 527.187 11.321 1216.537 8.000 1 * 4 result_3.3_SR8000_1GB_001nodes_004PEs_c.shrt result_3.3_SR8000_1GB_001nodes_004PEs_c.gz
3 541.446 180.482 537.551 11.390 1222.115 8.000 1 * 3 result_3.3_SR8000_1GB_001nodes_003PEs_c.shrt result_3.3_SR8000_1GB_001nodes_003PEs_c.gz
2 410.553 205.276 552.597 11.462 1230.266 8.000 1 * 2 result_3.3_SR8000_1GB_001nodes_002PEs_c.shrt result_3.3_SR8000_1GB_001nodes_002PEs_c.gz
default round-robin order, i.e. ranks 0,3,6,... are on node 0, ranks 1,4,7,... on node 1, ranks 2,5,8,... on node 2:
24 915.478 38.145 110.275 23.077 741.535 8.000 3 * 8 result_3.3_SR8000_1GB_003nodes_024PEs.shrt result_3.3_SR8000_1GB_003nodes_024PEs.gz
24 922.392 38.433 110.291 23.302 741.305 8.000 3 * 8 result_3.3_SR8000_1GB_003nodes_024PEs_b.shrt result_3.3_SR8000_1GB_003nodes_024PEs_b.gz
18 895.539 49.752 138.199 23.185 752.172 8.000 3 * 6 result_3.3_SR8000_1GB_003nodes_018PEs.shrt result_3.3_SR8000_1GB_003nodes_018PEs.gz
12 819.624 68.302 221.940 23.075 773.075 8.000 3 * 4 result_3.3_SR8000_1GB_003nodes_012PEs.shrt result_3.3_SR8000_1GB_003nodes_012PEs.gz
6 618.331 103.055 361.906 23.158 785.927 8.000 3 * 2 result_3.3_SR8000_1GB_003nodes_006PEs.shrt result_3.3_SR8000_1GB_003nodes_006PEs.gz
3 429.108 143.036 464.218 22.883 797.131 8.000 3 * 1 result_3.3_SR8000_1GB_003nodes_003PEs.shrt result_3.3_SR8000_1GB_003nodes_003PEs.gz
16 775.710 48.482 115.840 36.781 103.655 8.000 2 * 8 result_3.3_SR8000_1GB_002nodes_016PEs.shrt result_3.3_SR8000_1GB_002nodes_016PEs.gz
16 768.112 48.007 115.286 29.816 128.435 8.000 2 * 8 result_3.3_SR8000_1GB_002nodes_016PEs_b.shrt result_3.3_SR8000_1GB_002nodes_016PEs_b.gz
12 768.140 64.012 158.576 23.544 770.873 8.000 2 * 6 result_3.3_SR8000_1GB_002nodes_012PEs.shrt result_3.3_SR8000_1GB_002nodes_012PEs.gz
8 659.282 82.410 230.119 23.220 784.569 8.000 2 * 4 result_3.3_SR8000_1GB_002nodes_008PEs.shrt result_3.3_SR8000_1GB_002nodes_008PEs.gz
8 680.633 85.079 230.015 23.430 775.896 8.000 2 * 4 result_3.3_SR8000_1GB_002nodes_008PEs_b.shrt result_3.3_SR8000_1GB_002nodes_008PEs_b.gz
6 583.527 97.254 278.203 23.182 789.365 8.000 2 * 3 result_3.3_SR8000_1GB_002nodes_006PEs.shrt result_3.3_SR8000_1GB_002nodes_006PEs.gz
4 495.397 123.849 390.279 23.160 792.499 8.000 2 * 2 result_3.3_SR8000_1GB_002nodes_004PEs.shrt result_3.3_SR8000_1GB_002nodes_004PEs.gz
2 306.623 153.311 522.781 23.004 799.908 8.000 2 * 1 result_3.3_SR8000_1GB_002nodes_002PEs.shrt result_3.3_SR8000_1GB_002nodes_002PEs.gz
8 1245.136 155.642 470.941 11.650 970.791 8.000 1 * 8 result_3.3_SR8000_1GB_001nodes_008PEs.shrt result_3.3_SR8000_1GB_001nodes_008PEs.gz
6 971.215 161.869 505.246 11.563 1213.715 8.000 1 * 6 result_3.3_SR8000_1GB_001nodes_006PEs.shrt result_3.3_SR8000_1GB_001nodes_006PEs.gz
4 706.801 176.700 526.974 11.521 1205.780 8.000 1 * 4 result_3.3_SR8000_1GB_001nodes_004PEs.shrt result_3.3_SR8000_1GB_001nodes_004PEs.gz
2 410.471 205.236 552.259 11.549 1226.577 8.000 1 * 2 result_3.3_SR8000_1GB_001nodes_002PEs.shrt result_3.3_SR8000_1GB_001nodes_002PEs.gz
explicitly allocated PEs, but using special additional options (listed in place of the #nodes * #PEs column):
24 1805.675 75.236 400.133 11.728 954.936 8.000 --- result_3.3_SR8000_1GB_003nodes_024PEs_c.shrt result_3.3_SR8000_1GB_003nodes_024PEs_c.gz
24 1806.033 75.251 381.057 12.003 1014.339 8.000 SS result_3.3_SR8000_1GB_003nodes_024PEs_with_SS.shrt result_3.3_SR8000_1GB_003nodes_024PEs_with_SS.gz
24 1280.068 53.336 353.586 29.933 161.925 8.000 SS, 64 result_3.3_SR8000_1GB_003nodes_024PEs_with_SS_lp64.shrt result_3.3_SR8000_1GB_003nodes_024PEs_with_SS_lp64.gz
24 1225.848 51.077 295.441 27.604 144.019 8.000 64 result_3.3_SR8000_1GB_003nodes_024PEs_with_lp64.shrt result_3.3_SR8000_1GB_003nodes_024PEs_with_lp64.gz
Used commands: mpicc -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               hpstatus
               limit datasize 500000 
 
explicit allocation of contiguous ranks on 3 nodes: 
               mpiexec -p NODE0 -N 1 -n size/3 ./b_eff result_3.3_SR8000_1GB_003nodes_sizePEs_c \
                     : -p NODE1 -N 1 -n size/3 ./b_eff \
                     : -p NODE2 -N 1 -n size/3 ./b_eff
         contiguous ranks on 2 nodes: 
               mpiexec -p NODE0 -N 1 -n size/2 ./b_eff result_3.3_SR8000_1GB_002nodes_sizePEs_c \
                     : -p NODE1 -N 1 -n size/2 ./b_eff
         contiguous ranks on 1 node: 
               mpiexec -p NODE0 -N 1 -n size/1 ./b_eff result_3.3_SR8000_1GB_001nodes_sizePEs_c 

         special additional options
          64: option -lp64, used on mpicc
          SS: environment variable JOBTYPE=SS, set while executing mpiexec 
 
 
default round-robin allocation:  
               mpiexec -p ALL -N nodes -n size ./b_eff result_3.3_SR8000_1GB_nodesnodes_sizePEs
MPI release:   P-1811-1113, HI-UX/MPP 
Execution time < 150 sec 

On a Hitachi SR 2201 with 32+8 processors and 256 MByte/processor

On Nov. 9, 1999, on HI-UX/MPP hitachi 02-03 0 SR2201, with 256 MB/PE: The measurement was done while another application was running on the other 16 PEs. All PEs were used as dedicated PEs.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
16 527.805 32.988 result_3.2_SR2201_016.shrt result_3.2_SR2201_016.gz
8 276.903 34.613 result_3.2_SR2201_008.shrt result_3.2_SR2201_008.gz
4 151.928 37.982 result_3.2_SR2201_004.shrt result_3.2_SR2201_004.gz
2 80.086 40.043 result_3.2_SR2201_002.shrt result_3.2_SR2201_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -n size ./b_eff result_3.2_SR2201_size
MPI release:   P-1811-1112, HI-UX/MPP 02-01 (based on MPICH Version 1.0.11) 

On a SwissTX T1-baby with 6*2 processors and 512 MByte/processor

On Nov. 15, 1999, on OSF1 toneb7 V5.0 910 alpha, with 512 MB/PE: The measurement was done while other applications were running on PEs that weren't used by this benchmark. All PEs were used as dedicated PEs.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
12 ???.??? ??.??? result_3.3_SwissTX1baby_012a.shrt result_3.3_SwissTX1baby_012a.gz
8 97.497 12.187 result_3.3_SwissTX1baby_008a.shrt result_3.3_SwissTX1baby_008a.gz
4 49.394 12.348 result_3.3_SwissTX1baby_004a.shrt result_3.3_SwissTX1baby_004a.gz
2 25.792 12.896 result_3.3_SwissTX1baby_002a.shrt result_3.3_SwissTX1baby_002a.gz
Used commands: tnetcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
               bsub -Is -n size txrun b_eff result_3.3_SwissTX1baby_size
MPI release:   T-NET: 0.17 (see http://service.scs.ch/gb2/fci/revision/)
                           (sysconfig -q tnet Version) 
Execution time < 163 sec 

On IBM SP2

On Nov. 25/26, 1999, two measurements were done on dedicated processors:

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol   platform
128 2241.885 17.515 result_SP2_512MB_128PE.shrt result_SP2_512MB_128PE.gz P2SC (120 MHz) processors with 512 MB each
32 568.227 17.757 result_SP2_256MB_32PE.shrt result_SP2_256MB_32PE.gz POWER2 (77 MHz) processors with 256 MB each
Used commands: mpcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
               poe b_eff result_SP2_512MB_sizePE -procs size
MPI release:   ???
Execution time < 980 sec (very long due to the extremely high Alltoallv latency)

Shared Memory Systems

The beff benchmark is not well-suited for shared memory systems:
  • In most cases, applications need not copy data between the processors. There is no way to measure these savings, although they are an important part of the communication capabilities of the system.
  • MPI communication is not the appropriate programming model for shared memory systems. Data copying with OpenMP is normally at least twice as fast as MPI message passing, in both latency and bandwidth. Therefore the communication bandwidth inside shared memory nodes should be measured by copying data with OpenMP.
On hierarchical systems, OpenMP should be used inside the shared memory nodes and MPI between them. It is currently under discussion to extend the beff benchmark with an OpenMP based memory copy between the processors inside a shared memory node.
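A possible building block for such an extension would be a plain memory-copy bandwidth measurement inside a node, e.g. with OpenMP. The following fragment is only a rough sketch of the idea and is not part of b_eff:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rough sketch (not part of b_eff): aggregate bandwidth of copying data
 * inside one shared memory node with OpenMP threads.                     */
int main(void)
{
    const size_t n = 8u << 20;                 /* 8 MB per thread          */
    const int reps = 10;
    double bytes = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel reduction(+:bytes)
    {
        char *src = malloc(n), *dst = malloc(n);
        memset(src, 1, n);
        for (int r = 0; r < reps; r++) {
            memcpy(dst, src, n);               /* intra-node data movement */
            bytes += (double)n;
        }
        free(src);
        free(dst);
    }
    double t1 = omp_get_wtime();

    printf("aggregate copy bandwidth: %.1f MByte/s\n",
           bytes / (t1 - t0) / 1e6);
    return 0;
}

Such a measurement would capture the intra-node copy bandwidth that the MPI-based patterns cannot see.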

On a NEC SX-4/32 with 32 processors and 256 MByte/processor

On Nov. 9, 1999, on SUPER-UX hwwsx4 9.1 Rev1 SX-4, with 256 MB/PE: The measurement was done on a (dedicated) resource block with 16 processors while other applications were running on the remaining processors (exception: the benchmark on 4 processors was run interactively with time-sharing).

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
16 9670.150 604.384 result_3.2_sx4_016.shrt result_3.2_sx4_016.gz
15 9493.817 632.921 result_3.2_sx4_015.shrt result_3.2_sx4_015.gz
14 9007.233 643.374 result_3.2_sx4_014.shrt result_3.2_sx4_014.gz
13 8301.263 638.559 result_3.2_sx4_013.shrt result_3.2_sx4_013.gz
12 7738.770 644.898 result_3.2_sx4_012.shrt result_3.2_sx4_012.gz
11 7129.367 648.124 result_3.2_sx4_011.shrt result_3.2_sx4_011.gz
10 6401.344 640.134 result_3.2_sx4_010.shrt result_3.2_sx4_010.gz
9 5765.670 640.630 result_3.2_sx4_009.shrt result_3.2_sx4_009.gz
8 5162.575 645.322 result_3.2_sx4_008.shrt result_3.2_sx4_008.gz
7 4535.283 647.898 result_3.2_sx4_007.shrt result_3.2_sx4_007.gz
6 3920.267 653.378 result_3.2_sx4_006.shrt result_3.2_sx4_006.gz
5 3261.534 652.307 result_3.2_sx4_005.shrt result_3.2_sx4_005.gz
4 2622.012 655.503 result_3.2_sx4_004.shrt result_3.2_sx4_004.gz
3 1983.912 661.304 result_3.2_sx4_003.shrt result_3.2_sx4_003.gz
2 1316.320 658.160 result_3.2_sx4_002.shrt result_3.2_sx4_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff.c -lm
               mpirun -np size ./b_eff result_3.2_sx4_size
MPI release:   9.1

On a NEC SX-5/8B with 8 processors and 8 GByte/processor

On Nov. 9, 1999, on SUPER-UX sx5 9.2 k SX-5/8B, preliminary measurements were done with 256 MB/PE and without the non-blocking communication method:

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
4 5439.199 1359.800 result_3.2_sx5_256MB_004.shrt result_3.2_sx5_256MB_004.gz
2 2662.468 1331.234 result_3.2_sx5_256MB_002.shrt result_3.2_sx5_256MB_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff_mthd1+2.c -lm
               mpirun -np size ./b_eff result_3.2_sx5_256MB_size
MPI release:   9.2

On a HP-V 9000/800/V2250 with 8 processors and 1024 MByte/processor

On Nov. 9, 1999, on HP-UX hwwhpv B.11.00 A 9000/800, with 1024 MB/PE: The measurement was done while another application was running, but with reduced priority (nice=39).

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
7 435.041 62.149 result_3.2_hpv_007c.shrt result_3.2_hpv_007c.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -np size ./b_eff result_3.2_hpv_size
MPI release:   HP MPI 1.4 implementation

On a SGI Cray SV1-B/16-8 with 16 processors and 512 MByte/processor

On Nov. 17, 1999, on sn9626 athos 10.0.0.6 eth.3 CRAY SV1, with 512 MB/PE: The measurement was done in time-sharing mode while other applications were running on the system. Looking at the results, one can see that for 2, 4 and 8 nodes the benchmark was scheduled nearly as on dedicated processors. For 12 and 15 nodes one can see that taking the maximum over all methods and repetitions results in reproducible bandwidth values. The measurement on 15 processors was done with looplengthmax reduced to 30 to reduce the total execution time. A measurement on all 16 PEs was not possible due to other applications running on the system with lower priority.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
16 ~ 1487.200 ~ 92.950 extrapolation based on the beff/size column and the lines for 15, 12 and 8 PEs
15 1444.958 96.331 result_3.3_SV1B16_015a.shrt result_3.3_SV1B16_015a.gz
12 1283.318 106.943 result_3.3_SV1B16_012a.shrt result_3.3_SV1B16_012a.gz
8 958.823 119.853 result_3.3_SV1B16_008a.shrt result_3.3_SV1B16_008a.gz
4 626.880 156.720 result_3.3_SV1B16_004a.shrt result_3.3_SV1B16_004a.gz
2 359.071 179.535 result_3.3_SV1B16_002a.shrt result_3.3_SV1B16_002a.gz
Used commands: cc -h taskprivate $LIBCM -o b_eff -D MEMORY_PER_PROCESSOR=512 b_eff.c
               qsub -eo -q nqebatch jobsize
               with jobsize:
                 #!/bin/sh
                 export MPI_GROUP_MAX=64
                 ja
                 mpirun -nt size ~/b_eff ~/result_3.3_SV1B16_size
                 ja -st 
MPI release:   mpt.1.3.0.2
Execution time < 360 sec (expected on a dedicated system)

References:

[1]
Karl Solchenbach: Benchmarking the Balance of Parallel Computers. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[2]
Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer: Pallas Effective Bandwidth Benchmark - source code and sample results ( EFF_BW.tar.gz, 43 KB)
[3]
Rolf Hempel: Basic message passing benchmarks, methodology and pitfalls. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[4]
William Gropp and Ewing Lusk: Reproducible Measurement of MPI Performance Characteristics. In J. Dongarra et al. (eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, proceedings of the 6th European PVM/MPI Users' Group Meeting, EuroPVM/MPI'99, Barcelona, Spain, Sept. 26-29, 1999, LNCS 1697, pp 11-18. (Summary on the web)

Links

Rolf Rabenseifner
Gerrit Schulz