Effective Bandwidth (b_eff) Benchmark


The algorithm of beff (version 3.6 + bugfix 3.6.0.1)

The effective bandwidth beff measures the accumulated bandwidth of the communication network of parallel and/or distributed computing systems. Several message sizes, communication patterns and methods are used. The algorithm uses an average to take into account that short and long messages are transferred with different bandwidth values in real applications.

Definition of the effective bandwidth beff:

beff = logavg( logavg_(ring patterns)  ( sum_L ( max_mthd ( max_rep ( b(ring pat.,   L, mthd, rep) ) ) ) / 21 ),
               logavg_(random patterns)( sum_L ( max_mthd ( max_rep ( b(random pat., L, mthd, rep) ) ) ) / 21 ) )

with b(pat., L, mthd, rep) denoting the bandwidth measured for one communication pattern, message size L, programming method and repetition, and 21 being the number of different message sizes.
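The structure of this definition can be sketched in C. The following fragment is illustrative only (the identifiers are not those of b_eff.c) and assumes that the maximum bandwidth over all methods and repetitions has already been determined for each pattern and each of the 21 message sizes.

#include <math.h>

/* Logarithmic average of n positive values: exp( (1/n) * sum(log(x_i)) ). */
static double logavg(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += log(x[i]);
    return exp(s / n);
}

/* Illustrative aggregation; the identifiers are assumptions, not b_eff.c names.
 * b_ring[p][l] and b_rand[p][l] hold max_mthd(max_rep(b(pattern p, size l)))
 * for the 21 message sizes.                                                   */
double beff_aggregate(int nring, double b_ring[][21],
                      int nrand, double b_rand[][21])
{
    double sum_ring[nring], sum_rand[nrand];   /* C99 variable-length arrays */

    for (int p = 0; p < nring; p++) {          /* sum over the 21 sizes, /21 */
        double s = 0.0;
        for (int l = 0; l < 21; l++) s += b_ring[p][l];
        sum_ring[p] = s / 21.0;
    }
    for (int p = 0; p < nrand; p++) {
        double s = 0.0;
        for (int l = 0; l < 21; l++) s += b_rand[p][l];
        sum_rand[p] = s / 21.0;
    }
    /* logavg over the ring patterns, logavg over the random patterns,
     * then the logavg of these two values.                            */
    double pair[2] = { logavg(sum_ring, nring), logavg(sum_rand, nrand) };
    return logavg(pair, 2);
}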

Details of the algorithm:

Programming methods:
The communication is programmed with several methods. This allows measuring the effective bandwidth independently of which MPI routines are optimized on a given platform. The maximum bandwidth over the following methods is used (a minimal sketch of the three methods is given after the list):
  1. MPI_Sendrecv
  2. MPI_Alltoallv
  3. non-blocking MPI_Irecv and MPI_Isend with MPI_Waitall.
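A minimal sketch of a single transfer with each of the three methods is given below. It is illustrative only: buffer setup, the repetition loop and the timing done by the real b_eff.c are omitted, and the neighbor ranks left and right are assumed to be given by the chosen pattern.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative: one transfer of L bytes to the neighbors of a ring pattern,
 * programmed with each of the three methods.  b_eff.c times such transfers
 * and keeps the maximum bandwidth over the methods.                         */
void one_step(void *sbuf, void *rbuf, int L, int left, int right,
              MPI_Comm comm)
{
    /* Method 1: MPI_Sendrecv */
    MPI_Sendrecv(sbuf, L, MPI_BYTE, right, 0,
                 rbuf, L, MPI_BYTE, left,  0, comm, MPI_STATUS_IGNORE);

    /* Method 2: MPI_Alltoallv -- only the two neighbors get nonzero counts */
    int size;
    MPI_Comm_size(comm, &size);
    int *scnt = calloc(size, sizeof(int)), *sdsp = calloc(size, sizeof(int));
    int *rcnt = calloc(size, sizeof(int)), *rdsp = calloc(size, sizeof(int));
    scnt[right] = L;
    rcnt[left]  = L;
    MPI_Alltoallv(sbuf, scnt, sdsp, MPI_BYTE,
                  rbuf, rcnt, rdsp, MPI_BYTE, comm);
    free(scnt); free(sdsp); free(rcnt); free(rdsp);

    /* Method 3: non-blocking MPI_Irecv and MPI_Isend with MPI_Waitall */
    MPI_Request rq[2];
    MPI_Irecv(rbuf, L, MPI_BYTE, left,  0, comm, &rq[0]);
    MPI_Isend(sbuf, L, MPI_BYTE, right, 0, comm, &rq[1]);
    MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);
}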
Communication patterns:
To produce a balanced measurement on any network topology, different communication patterns are used; they are based on rings and on random distributions of the processes (see the summary below).
Lmax
On systems with sizeof(int)<64, Lmax must be less than or equal to 128 MB, i.e. Lmax = min(128 MB, (memory per processor)/128).
Further details are described in the technical section.

Summary:

The effective bandwidth is the number of MPI processes, multiplied by the asymptotic bandwidth, multiplied by the ratio of the area under the curve "bandwidth over message length" to the area under the constant asymptotic bandwidth in the same diagram. To measure the bandwidth, several communication patterns are applied; the patterns are based on rings and on random distributions. The logarithmic average over all ring patterns and over all random patterns is computed, and beff is the logarithmic average of these two values. The communication is implemented in three different ways with MPI, and for each single measurement the maximum bandwidth over the three methods is used. For the ratio mentioned above, the bandwidth is plotted over the message length, with the message length values placed equidistantly on the abscissa, i.e. along two logarithmic scales, one from 1 byte to 4 kbyte (12 intervals) and the next from 4 kbyte to Lmax (8 intervals).
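The 21 message sizes implied by this scheme (13 values on the first logarithmic scale from 1 byte to 4 kbyte, plus 8 further values up to Lmax) can be generated as in the following illustrative fragment; the exact rounding used by b_eff.c may differ.

#include <math.h>
#include <stdio.h>

/* Illustrative generation of the 21 message sizes: 12 logarithmic intervals
 * from 1 byte to 4 kbyte (13 values) plus 8 logarithmic intervals from
 * 4 kbyte to Lmax (8 further values).                                      */
int main(void)
{
    double Lmax = 128.0 * 1024 * 1024 / 128;   /* e.g. 128 MB/PE -> Lmax = 1 MB */
    double L[21];
    int n = 0;

    for (int i = 0; i <= 12; i++)              /* 1, 2, 4, ..., 4096 bytes   */
        L[n++] = pow(4096.0, i / 12.0);
    for (int i = 1; i <= 8; i++)               /* 8 values up to Lmax        */
        L[n++] = 4096.0 * pow(Lmax / 4096.0, i / 8.0);

    for (int i = 0; i < n; i++)
        printf("L[%2d] = %.0f bytes\n", i, L[i]);
    return 0;
}

With 128 MB per processor this gives Lmax = 1 MB (as on the T3E below); with 1 GB per processor it gives Lmax = 8 MB, which matches the 8.000 MByte column in the SR8000 tables.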

Background

A first approach by Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2] was based on the bisection bandwidth. Due to several problems, a redesign was done. The redesign tries not to violate the rules defined by Rolf Hempel in [3] and by William Gropp and Ewing Lusk in [4].

Output of the beff Benchmark

Each run of the benchmark on a particular system results in a set of output files. (Default prefix is b_eff.)
  • b_eff.prot: a detailed protocol
  • b_eff.short: a short version of the protocol
  • b_eff.sum: a short overview
  • b_eff.plot: a file with data for plotting
  • b_eff.gps: a script for gnuplot
  • b_eff.tex: a LaTeX source file to create a benchmark report
With only three steps you can create a benchmark report with charts. You need:
  • Gnuplot (version 3.7 or newer)
  • an installation of LaTeX
  • b_eff.plot b_eff.gps b_eff.tex
Commands: gnuplot b_eff.gps
          latex b_eff.tex
          dvips b_eff.dvi
In b_eff.short and b_eff.sum the last line reports, e.g.,
   b_eff = 9709.549 MB/s = 37.928 * 256 PEs with 128 MB/PE on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E
This line reports
  • the effective bandwidth beff of the whole system,
  • the effective bandwidth of each processor (or node),
  • the number of processors (or nodes),
  • the memory of each processor (or node),
  • the output of uname -a.
A full description of the benchmark protocol is available here.

Sourcecode

b_eff.c (version 3.6.0.1)

Benchmarking

If you use this benchmark, please send us back the following information:
  • which compilation command was used,
  • which MPI implementation, version, ... was used,
  • whether you set up a special environment, e.g. UNIX environment variables for compiling or running the benchmark,
  • the command and/or batch script you used to start the benchmark,
  • b_eff.c writes its results on stdout; please attach this output as a gzip'ed attachment.
Additionally -- only for your convenience -- b_eff.c also writes the last summary line on stderr. Some examples of how to compile and start b_eff.c are given in the following sections. In all cases one has to choose the correct memory size value (in MBytes). The syntax for setting the CPP macro MEMORY_PER_PROCESSOR may differ, e.g. with or without a blank after the -D option.
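As an illustration of how such a compile-time macro can be consumed (this is not the actual code of b_eff.c), the memory size in MBytes can be turned into the maximum message length Lmax defined above:

/* Illustrative only: how a memory size given in MBytes at compile time via
 * -DMEMORY_PER_PROCESSOR=<n> can determine the maximum message length
 * Lmax = min(128 MB, (memory per processor)/128) from the definition above. */
#ifndef MEMORY_PER_PROCESSOR
#error "compile with -DMEMORY_PER_PROCESSOR=<MBytes per processor>"
#endif

#define MBYTE (1024L * 1024L)
#define LMAX_BYTES ((MEMORY_PER_PROCESSOR * MBYTE / 128 < 128 * MBYTE) \
                     ? MEMORY_PER_PROCESSOR * MBYTE / 128 : 128 * MBYTE)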

Please send the mail to
rabenseifner@rus.uni-stuttgart.de.

First Results

Distributed Memory Systems

Size and beff values are highlighted if the measurement covers the whole system.

On a Cray T3E-900 with 512+32 processors and 128 MByte/processor

On Nov. 7, 1999, on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E, with 128 MB/PE: The measurements with 2 to 256 PEs were done while another application was running on the first 256 PEs. Currently the 512 PE value must be computed on the basis of former measurements with release 3.1, using the one-dimensional cyclic and the random values. The MPI implementation mpt.1.3.0.2 was used and the environment variable MPI_BUFFER_MAX=4099 was set.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
512 19919.128 38.905 result_3.3_t3e_512.shrt result_3.3_t3e_512.gz
384 15526.600 40.434 result_3.2_t3e_384a.shrt result_3.2_t3e_384a.gz
256 10056.033 39.281 result_3.2_t3e_256a.shrt result_3.2_t3e_256a.gz
192 7871.336 40.997 result_3.2_t3e_192a.shrt result_3.2_t3e_192a.gz
128 5620.345 43.909 result_3.2_t3e_128a.shrt result_3.2_t3e_128a.gz
96 4180.723 43.549 result_3.2_t3e_096a.shrt result_3.2_t3e_096a.gz
64 3158.554 49.352 result_3.2_t3e_064a.shrt result_3.2_t3e_064a.gz
48 2725.891 56.789 result_3.2_t3e_048a.shrt result_3.2_t3e_048a.gz
32 1893.872 59.183 result_3.2_t3e_032a.shrt result_3.2_t3e_032a.gz
24 1522.225 63.426 result_3.2_t3e_024a.shrt result_3.2_t3e_024a.gz
16 1063.217 66.451 result_3.2_t3e_016a.shrt result_3.2_t3e_016a.gz
12 918.109 76.509 result_3.2_t3e_012a.shrt result_3.2_t3e_012a.gz
8 612.815 76.602 result_3.2_t3e_008a.shrt result_3.2_t3e_008a.gz
6 509.359 84.893 result_3.2_t3e_006a.shrt result_3.2_t3e_006a.gz
4 355.045 88.761 result_3.2_t3e_004a.shrt result_3.2_t3e_004a.gz
3 278.898 92.966 result_3.2_t3e_003a.shrt result_3.2_t3e_003a.gz
2 182.989 91.495 result_3.2_t3e_002a.shrt result_3.2_t3e_002a.gz
Used commands: module switch mpt mpt.1.3.0.2 
               cc -o b_eff -DMEMORY_PER_PROCESSOR=128 b_eff.c 
               export MPI_BUFFER_MAX=4099 
               mpirun -np size ./b_eff result_3.2_t3e_size
MPI release:   mpt.1.3.0.2
Execution time < 225 sec 

On a Hitachi SR 8000 with 128 processors on 16 nodes and 1 GByte/processor

On Nov. 8, 2001, on HI-UX/MPP hwwsr8k 03-04 0 SR8000, with 1 GB/PE: The measurements were done with exclusively used PEs.
Because of some problems the MPI-MPP measurements with size>=64 were redone with a new revision of b_eff in May 2002.

size   beff (MByte/s)   beff/size (MByte/s)   bandwidth per PE at Lmax (MByte/s)   PingPong latency (microsec)   PingPong bandwidth (MByte/s)   maximal message length Lmax (MByte)   #nodes * #PEs   summary   full protocol
MPI-MPP
96 6065.118 63.178 207.954 11.113 1281.377 8.000 12 * 8 result_3.5_SR8000_1GB_012nodes_096PEs.short result_3.5_SR8000_1GB_012nodes_096PEs.tar.gz
80 5012.301 62.654 200.951 11.038 1281.284 8.000 10 * 8 result_3.5_SR8000_1GB_010nodes_080PEs.short result_3.5_SR8000_1GB_010nodes_080PEs.tar.gz
64 4194.937 65.546 209.369 11.173 1285.415 8.000 8 * 8 result_3.5_SR8000_1GB_008nodes_064PEs.short result_3.5_SR8000_1GB_008nodes_064PEs.tar.gz
48 3122.585 65.054 199.884 11.043 1198.888 8.000 6 * 8 result_3.5_SR8000_1GB_006nodes_048PEs.short result_3.5_SR8000_1GB_006nodes_048PEs.tar.gz
32 2345.881 73.309 220.027 10.874 1189.626 8.000 4 * 8 result_3.5_SR8000_1GB_004nodes_032PEs.short result_3.5_SR8000_1GB_004nodes_032PEs.tar.gz
MPI + Compas
12 1535.074 127.923 463.716 22.614 799.754 8.000 12 * 1 result_3.5_SR8000_1GB_012nodes_012PEs.short result_3.5_SR8000_1GB_012nodes_012PEs.tar.gz
8 1075.017 134.377 506.474 22.339 806.329 8.000 8 * 1 result_3.5_SR8000_1GB_008nodes_008PEs.short result_3.5_SR8000_1GB_008nodes_008PEs.tar.gz
6 796.021 132.670 483.954 22.362 807.569 8.000 6 * 1 result_3.5_SR8000_1GB_006nodes_006PEs.short result_3.5_SR8000_1GB_006nodes_006PEs.tar.gz
4 605.150 151.288 523.326 22.279 805.586 8.000 4 * 1 result_3.5_SR8000_1GB_004nodes_004PEs.short result_3.5_SR8000_1GB_004nodes_004PEs.tar.gz
Used commands: mpicc -O4 -pvec +Op -noparallel -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c  (MPI-MPP)
               mpicc -O4 -pvec +Op -parallel -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c    (MPI + Compas)
               mpiexec -p multi -N #nodes -n #processes  ./b_eff
MPI release:   P-1811-1113, HI-UX/MPP 
               MPIR_RANK_NO_ROUND=yes
               JOBTYPE=E8S
               was automatically set according to the HLRS defaults with MPI on SR8000
Execution time < 130 sec 

On a Hitachi SR 8000 with 24 processors on 3 nodes and 1 GByte/processor

On Nov. 29, 1999, on HI-UX/MPP himiko 03-00 ad2b0 SR8000, with 1 GB/PE: The measurements were done with exclusively used PEs. The ping-pong measurement is done with the first two MPI processes of each beff configuration. The MPI implementation does not use the topology information given by the beff benchmark program and allocates the process ranks by default in round-robin order. This results in a bad efficiency, see the second part of the table. Two measurements were taken twice, see the ..._b.shrt files. By using the multi-command interface of mpiexec, one can explicitly allocate contiguous intervals of process ranks on each SR8000 node, see the first part of the table.
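The difference between the two placements corresponds to the following rank-to-node mappings (an illustrative fragment with hypothetical helper names; nnodes is the number of nodes, ppn the number of processes per node):

/* Illustrative rank-to-node mappings for the two placements discussed above
 * (nnodes = number of nodes, ppn = processes per node; hypothetical helpers). */
int node_of_rank_round_robin(int rank, int nnodes)   /* default order          */
{
    return rank % nnodes;       /* ranks 0,3,6,... on node 0 when nnodes = 3   */
}

int node_of_rank_contiguous(int rank, int ppn)       /* explicit mpiexec order */
{
    return rank / ppn;          /* ranks 0..ppn-1 on node 0, and so on         */
}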

size   beff (MByte/s)   beff/size (MByte/s)   bandwidth per PE at Lmax (MByte/s)   PingPong latency (microsec)   PingPong bandwidth (MByte/s)   maximal message length Lmax (MByte)   #nodes * #PEs   summary   full protocol
explicitly allocated PEs, i.e. contiguous ranks on each node:
24 1805.675 75.236 400.133 11.728 954.936 8.000 3 * 8 result_3.3_SR8000_1GB_003nodes_024PEs_c.shrt result_3.3_SR8000_1GB_003nodes_024PEs_c.gz
18 1565.703 86.983 427.860 11.525 1202.586 8.000 3 * 6 result_3.3_SR8000_1GB_003nodes_018PEs_c.shrt result_3.3_SR8000_1GB_003nodes_018PEs_c.gz
12 1257.728 104.811 489.445 11.475 1204.480 8.000 3 * 4 result_3.3_SR8000_1GB_003nodes_012PEs_c.shrt result_3.3_SR8000_1GB_003nodes_012PEs_c.gz
6 758.508 126.418 477.788 11.437 1224.976 8.000 3 * 2 result_3.3_SR8000_1GB_003nodes_006PEs_c.shrt result_3.3_SR8000_1GB_003nodes_006PEs_c.gz
3 396.829 132.276 447.107 23.307 791.866 8.000 3 * 1 result_3.3_SR8000_1GB_003nodes_003PEs_c.shrt result_3.3_SR8000_1GB_003nodes_003PEs_c.gz
16 1530.664 95.667 411.060 11.811 969.781 8.000 2 * 8 result_3.3_SR8000_1GB_002nodes_016PEs_c.shrt result_3.3_SR8000_1GB_002nodes_016PEs_c.gz
12 1287.352 107.279 439.721 11.527 1208.742 8.000 2 * 6 result_3.3_SR8000_1GB_002nodes_012PEs_c.shrt result_3.3_SR8000_1GB_002nodes_012PEs_c.gz
8 989.464 123.683 504.567 11.521 1213.191 8.000 2 * 4 result_3.3_SR8000_1GB_002nodes_008PEs_c.shrt result_3.3_SR8000_1GB_002nodes_008PEs_c.gz
6 766.605 127.768 499.667 11.555 1222.560 8.000 2 * 3 result_3.3_SR8000_1GB_002nodes_006PEs_c.shrt result_3.3_SR8000_1GB_002nodes_006PEs_c.gz
4 574.523 143.631 519.596 11.484 1226.043 8.000 2 * 2 result_3.3_SR8000_1GB_002nodes_004PEs_c.shrt result_3.3_SR8000_1GB_002nodes_004PEs_c.gz
2 306.570 153.285 521.074 22.923 799.677 8.000 2 * 1 result_3.3_SR8000_1GB_002nodes_002PEs_c.shrt result_3.3_SR8000_1GB_002nodes_002PEs_c.gz
8 1218.994 152.374 455.575 11.570 916.839 8.000 1 * 8 result_3.3_SR8000_1GB_001nodes_008PEs_c.shrt result_3.3_SR8000_1GB_001nodes_008PEs_c.gz
7 1118.625 159.804 488.660 11.528 1207.508 8.000 1 * 7 result_3.3_SR8000_1GB_001nodes_007PEs_c.shrt result_3.3_SR8000_1GB_001nodes_007PEs_c.gz
6 974.033 162.339 506.776 11.361 1211.698 8.000 1 * 6 result_3.3_SR8000_1GB_001nodes_006PEs_c.shrt result_3.3_SR8000_1GB_001nodes_006PEs_c.gz
5 848.999 169.800 515.719 11.506 1211.176 8.000 1 * 5 result_3.3_SR8000_1GB_001nodes_005PEs_c.shrt result_3.3_SR8000_1GB_001nodes_005PEs_c.gz
4 714.477 178.619 527.187 11.321 1216.537 8.000 1 * 4 result_3.3_SR8000_1GB_001nodes_004PEs_c.shrt result_3.3_SR8000_1GB_001nodes_004PEs_c.gz
3 541.446 180.482 537.551 11.390 1222.115 8.000 1 * 3 result_3.3_SR8000_1GB_001nodes_003PEs_c.shrt result_3.3_SR8000_1GB_001nodes_003PEs_c.gz
2 410.553 205.276 552.597 11.462 1230.266 8.000 1 * 2 result_3.3_SR8000_1GB_001nodes_002PEs_c.shrt result_3.3_SR8000_1GB_001nodes_002PEs_c.gz
default round-robin order, i.e. ranks 0,3,6,... are on node 0, ranks 1,4,7,... on node 1, ranks 2,5,8,... on node 2:
24 915.478 38.145 110.275 23.077 741.535 8.000 3 * 8 result_3.3_SR8000_1GB_003nodes_024PEs.shrt result_3.3_SR8000_1GB_003nodes_024PEs.gz
24 922.392 38.433 110.291 23.302 741.305 8.000 3 * 8 result_3.3_SR8000_1GB_003nodes_024PEs_b.shrt result_3.3_SR8000_1GB_003nodes_024PEs_b.gz
18 895.539 49.752 138.199 23.185 752.172 8.000 3 * 6 result_3.3_SR8000_1GB_003nodes_018PEs.shrt result_3.3_SR8000_1GB_003nodes_018PEs.gz
12 819.624 68.302 221.940 23.075 773.075 8.000 3 * 4 result_3.3_SR8000_1GB_003nodes_012PEs.shrt result_3.3_SR8000_1GB_003nodes_012PEs.gz
6 618.331 103.055 361.906 23.158 785.927 8.000 3 * 2 result_3.3_SR8000_1GB_003nodes_006PEs.shrt result_3.3_SR8000_1GB_003nodes_006PEs.gz
3 429.108 143.036 464.218 22.883 797.131 8.000 3 * 1 result_3.3_SR8000_1GB_003nodes_003PEs.shrt result_3.3_SR8000_1GB_003nodes_003PEs.gz
16 775.710 48.482 115.840 36.781 103.655 8.000 2 * 8 result_3.3_SR8000_1GB_002nodes_016PEs.shrt result_3.3_SR8000_1GB_002nodes_016PEs.gz
16 768.112 48.007 115.286 29.816 128.435 8.000 2 * 8 result_3.3_SR8000_1GB_002nodes_016PEs_b.shrt result_3.3_SR8000_1GB_002nodes_016PEs_b.gz
12 768.140 64.012 158.576 23.544 770.873 8.000 2 * 6 result_3.3_SR8000_1GB_002nodes_012PEs.shrt result_3.3_SR8000_1GB_002nodes_012PEs.gz
8 659.282 82.410 230.119 23.220 784.569 8.000 2 * 4 result_3.3_SR8000_1GB_002nodes_008PEs.shrt result_3.3_SR8000_1GB_002nodes_008PEs.gz
8 680.633 85.079 230.015 23.430 775.896 8.000 2 * 4 result_3.3_SR8000_1GB_002nodes_008PEs_b.shrt result_3.3_SR8000_1GB_002nodes_008PEs_b.gz
6 583.527 97.254 278.203 23.182 789.365 8.000 2 * 3 result_3.3_SR8000_1GB_002nodes_006PEs.shrt result_3.3_SR8000_1GB_002nodes_006PEs.gz
4 495.397 123.849 390.279 23.160 792.499 8.000 2 * 2 result_3.3_SR8000_1GB_002nodes_004PEs.shrt result_3.3_SR8000_1GB_002nodes_004PEs.gz
2 306.623 153.311 522.781 23.004 799.908 8.000 2 * 1 result_3.3_SR8000_1GB_002nodes_002PEs.shrt result_3.3_SR8000_1GB_002nodes_002PEs.gz
8 1245.136 155.642 470.941 11.650 970.791 8.000 1 * 8 result_3.3_SR8000_1GB_001nodes_008PEs.shrt result_3.3_SR8000_1GB_001nodes_008PEs.gz
6 971.215 161.869 505.246 11.563 1213.715 8.000 1 * 6 result_3.3_SR8000_1GB_001nodes_006PEs.shrt result_3.3_SR8000_1GB_001nodes_006PEs.gz
4 706.801 176.700 526.974 11.521 1205.780 8.000 1 * 4 result_3.3_SR8000_1GB_001nodes_004PEs.shrt result_3.3_SR8000_1GB_001nodes_004PEs.gz
2 410.471 205.236 552.259 11.549 1226.577 8.000 1 * 2 result_3.3_SR8000_1GB_001nodes_002PEs.shrt result_3.3_SR8000_1GB_001nodes_002PEs.gz
explicitly allocated PEs, but using special additional options (listed in place of the #nodes * #PEs column):
24 1805.675 75.236 400.133 11.728 954.936 8.000 --- result_3.3_SR8000_1GB_003nodes_024PEs_c.shrt result_3.3_SR8000_1GB_003nodes_024PEs_c.gz
24 1806.033 75.251 381.057 12.003 1014.339 8.000 SS result_3.3_SR8000_1GB_003nodes_024PEs_with_SS.shrt result_3.3_SR8000_1GB_003nodes_024PEs_with_SS.gz
24 1280.068 53.336 353.586 29.933 161.925 8.000 SS, 64 result_3.3_SR8000_1GB_003nodes_024PEs_with_SS_lp64.shrt result_3.3_SR8000_1GB_003nodes_024PEs_with_SS_lp64.gz
24 1225.848 51.077 295.441 27.604 144.019 8.000 64 result_3.3_SR8000_1GB_003nodes_024PEs_with_lp64.shrt result_3.3_SR8000_1GB_003nodes_024PEs_with_lp64.gz
Used commands: mpicc -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               hpstatus
               limit datasize 500000 
 
explicit allocation of contiguous ranks on 3 nodes: 
               mpiexec -p NODE0 -N 1 -n size/3 ./b_eff result_3.3_SR8000_1GB_003nodes_sizePEs_c \
                     : -p NODE1 -N 1 -n size/3 ./b_eff \
                     : -p NODE2 -N 1 -n size/3 ./b_eff
         contiguous ranks on 2 nodes: 
               mpiexec -p NODE0 -N 1 -n size/2 ./b_eff result_3.3_SR8000_1GB_002nodes_sizePEs_c \
                     : -p NODE1 -N 1 -n size/2 ./b_eff
         contiguous ranks on 1 node: 
               mpiexec -p NODE0 -N 1 -n size/1 ./b_eff result_3.3_SR8000_1GB_001nodes_sizePEs_c 

         special additional options
          64: option -lp64, used on mpicc
          SS: environment variable JOBTYPE=SS, set while executing mpiexec 
 
 
default round-robin allocation:  
               mpiexec -p ALL -N nodes -n size ./b_eff result_3.3_SR8000_1GB_nodesnodes_sizePEs
MPI release:   P-1811-1113, HI-UX/MPP 
Execution time < 150 sec 

On a Hitachi SR 2201 with 32+8 processors and 256 MByte/processor

On Nov. 9, 1999, on HI-UX/MPP hitachi 02-03 0 SR2201, with 256 MB/PE: The measurement was done while another application was running on the other 16 PEs. All PEs were used as dedicated PEs.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
16 527.805 32.988 result_3.2_SR2201_016.shrt result_3.2_SR2201_016.gz
8 276.903 34.613 result_3.2_SR2201_008.shrt result_3.2_SR2201_008.gz
4 151.928 37.982 result_3.2_SR2201_004.shrt result_3.2_SR2201_004.gz
2 80.086 40.043 result_3.2_SR2201_002.shrt result_3.2_SR2201_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -n size ./b_eff result_3.2_SR2201_size
MPI release:   P-1811-1112, HI-UX/MPP 02-01 (based on MPICH Version 1.0.11) 

On a SwissTX T1-baby with 6*2 processors and 512 MByte/processor

On Nov. 15, 1999, on OSF1 toneb7 V5.0 910 alpha, with 512 MB/PE: The measurement was done while other applications were running on PEs that weren't used by this benchmark. All PEs were used as dedicated PEs.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
12 ???.??? ??.??? result_3.3_SwissTX1baby_012a.shrt result_3.3_SwissTX1baby_012a.gz
8 97.497 12.187 result_3.3_SwissTX1baby_008a.shrt result_3.3_SwissTX1baby_008a.gz
4 49.394 12.348 result_3.3_SwissTX1baby_004a.shrt result_3.3_SwissTX1baby_004a.gz
2 25.792 12.896 result_3.3_SwissTX1baby_002a.shrt result_3.3_SwissTX1baby_002a.gz
Used commands: tnetcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
               bsub -Is -n size txrun b_eff result_3.3_SwissTX1baby_size
MPI release:   T-NET: 0.17 (see http://service.scs.ch/gb2/fci/revision/)
                           (sysconfig -q tnet Version) 
Execution time < 163 sec 

On IBM SP2

On Nov. 25/26, 1999, two measurements were done on dedicated processors:

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol   platform
128 2241.885 17.515 result_SP2_512MB_128PE.shrt result_SP2_512MB_128PE.gz P2SC (120 MHz) processors with 512 MB each
32 568.227 17.757 result_SP2_256MB_32PE.shrt result_SP2_256MB_32PE.gz POWER2 (77 MHz) processors with 256 MB each
Used commands: mpcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
               poe b_eff result_SP2_512MB_sizePE -procs size
MPI release:   ???
Execution time < 980 sec (very long due to the extremely high Alltoallv latency)

Shared Memory Systems

The beff benchmark is not well-suited for shared memory systems:
  • In most cases, applications need not copy data between the processors. There is no way to measure these savings, although they are an important part of the communication capabilities of the system.
  • MPI communication is not the appropriate programming model for shared memory systems. Data copying with OpenMP is normally at least twice as fast as MPI message passing, in both latency and bandwidth. Therefore the communication bandwidth inside shared memory nodes should be measured by copying data with OpenMP.
On hierarchical systems, OpenMP should be used inside the shared memory nodes and MPI between them. It is currently under discussion to extend the beff benchmark with an OpenMP based memory copy between the processors inside a shared memory node.
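A possible building block for such an extension would be a plain memory-copy bandwidth measurement inside a node, e.g. with OpenMP. The following fragment is only a rough sketch of the idea and is not part of b_eff:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rough sketch (not part of b_eff): aggregate bandwidth of copying data
 * inside one shared memory node with OpenMP threads.                     */
int main(void)
{
    const size_t n = 8u << 20;                 /* 8 MB per thread          */
    const int reps = 10;
    double bytes = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel reduction(+:bytes)
    {
        char *src = malloc(n), *dst = malloc(n);
        memset(src, 1, n);
        for (int r = 0; r < reps; r++) {
            memcpy(dst, src, n);               /* intra-node data movement */
            bytes += (double)n;
        }
        free(src);
        free(dst);
    }
    double t1 = omp_get_wtime();

    printf("aggregate copy bandwidth: %.1f MByte/s\n",
           bytes / (t1 - t0) / 1e6);
    return 0;
}

Such a measurement would capture the intra-node copy bandwidth that the MPI-based patterns cannot see.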

On a NEC SX-4/32 with 32 processors and 256 MByte/processor

On Nov. 9, 1999, on SUPER-UX hwwsx4 9.1 Rev1 SX-4, with 256 MB/PE: The measurement was done on a (dedicated) resource block with 16 processors while other applications were running on the remaining processors (exception: the benchmark on 4 processors was run interactively with time-sharing).

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
16 9670.150 604.384 result_3.2_sx4_016.shrt result_3.2_sx4_016.gz
15 9493.817 632.921 result_3.2_sx4_015.shrt result_3.2_sx4_015.gz
14 9007.233 643.374 result_3.2_sx4_014.shrt result_3.2_sx4_014.gz
13 8301.263 638.559 result_3.2_sx4_013.shrt result_3.2_sx4_013.gz
12 7738.770 644.898 result_3.2_sx4_012.shrt result_3.2_sx4_012.gz
11 7129.367 648.124 result_3.2_sx4_011.shrt result_3.2_sx4_011.gz
10 6401.344 640.134 result_3.2_sx4_010.shrt result_3.2_sx4_010.gz
9 5765.670 640.630 result_3.2_sx4_009.shrt result_3.2_sx4_009.gz
8 5162.575 645.322 result_3.2_sx4_008.shrt result_3.2_sx4_008.gz
7 4535.283 647.898 result_3.2_sx4_007.shrt result_3.2_sx4_007.gz
6 3920.267 653.378 result_3.2_sx4_006.shrt result_3.2_sx4_006.gz
5 3261.534 652.307 result_3.2_sx4_005.shrt result_3.2_sx4_005.gz
4 2622.012 655.503 result_3.2_sx4_004.shrt result_3.2_sx4_004.gz
3 1983.912 661.304 result_3.2_sx4_003.shrt result_3.2_sx4_003.gz
2 1316.320 658.160 result_3.2_sx4_002.shrt result_3.2_sx4_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff.c -lm
               mpirun -np size ./b_eff result_3.2_sx4_size
MPI release:   9.1

On a NEC SX-5/8B with 8 processors and 8 GByte/processor

On Nov. 9, 1999, on SUPER-UX sx5 9.2 k SX-5/8B, preliminary measurements were done with 256 MB/PE and without the non-blocking communication method:

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
4 5439.199 1359.800 result_3.2_sx5_256MB_004.shrt result_3.2_sx5_256MB_004.gz
2 2662.468 1331.234 result_3.2_sx5_256MB_002.shrt result_3.2_sx5_256MB_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff_mthd1+2.c -lm
               mpirun -np size ./b_eff result_3.2_sx5_256MB_size
MPI release:   9.2

On a HP-V 9000/800/V2250 with 8 processors and 1024 MByte/processor

On Nov. 9, 1999, on HP-UX hwwhpv B.11.00 A 9000/800, with 1024 MB/PE: The measurement was done while another application was running, but with reduced priority (nice=39).

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
7 435.041 62.149 result_3.2_hpv_007c.shrt result_3.2_hpv_007c.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -np size ./b_eff result_3.2_hpv_size
MPI release:   HP MPI 1.4 implementation

On a SGI Cray SV1-B/16-8 with 16 processors and 512 MByte/processor

On Nov. 17, 1999, on sn9626 athos 10.0.0.6 eth.3 CRAY SV1, with 512 MB/PE: The measurement was done in time-sharing mode while other applications were running on the system. Looking at the results, one can see that for 2, 4 and 8 nodes the benchmark was scheduled nearly as on dedicated processors. For 12 and 15 nodes one can see that taking the maximum over all methods and repetitions results in reproducible bandwidth values. The measurement on 15 processors was done with looplengthmax reduced to 30 to reduce the total execution time. A measurement on all 16 PEs was not possible due to other applications running on the system with lower priority.

size   beff (MByte/s)   beff/size (MByte/s)   summary   full protocol
16 ~ 1487.200 ~ 92.950 extrapolation based on the beff/size column and the lines for 15, 12 and 8 PEs
15 1444.958 96.331 result_3.3_SV1B16_015a.shrt result_3.3_SV1B16_015a.gz
12 1283.318 106.943 result_3.3_SV1B16_012a.shrt result_3.3_SV1B16_012a.gz
8 958.823 119.853 result_3.3_SV1B16_008a.shrt result_3.3_SV1B16_008a.gz
4 626.880 156.720 result_3.3_SV1B16_004a.shrt result_3.3_SV1B16_004a.gz
2 359.071 179.535 result_3.3_SV1B16_002a.shrt result_3.3_SV1B16_002a.gz
Used commands: cc -h taskprivate $LIBCM -o b_eff -D MEMORY_PER_PROCESSOR=512 b_eff.c
               qsub -eo -q nqebatch jobsize
               with jobsize:
                 #!/bin/sh
                 export MPI_GROUP_MAX=64
                 ja
                 mpirun -nt size ~/b_eff ~/result_3.3_SV1B16_size
                 ja -st 
MPI release:   mpt.1.3.0.2
Execution time < 360 sec (expected on a dedicated system)

References:

[1]
Karl Solchenbach: Benchmarking the Balance of Parallel Computers. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[2]
Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer: Pallas Effective Bandwidth Benchmark - source code and sample results ( EFF_BW.tar.gz, 43 KB)
[3]
Rolf Hempel: Basic message passing benchmarks, methodology and pitfalls. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[4]
William Gropp and Ewing Lusk: Reproducible Measurement of MPI Performance Characteristics. In J. Dongarra et al. (eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, proceedings of the 6th European PVM/MPI Users' Group Meeting, EuroPVM/MPI'99, Barcelona, Spain, Sept. 26-29, 1999, LNCS 1697, pp 11-18. (Summary on the web)

Links

Rolf Rabenseifner
Gerrit Schulz