
568K mesh: Globus performance on individual clusters

After the 262K study was completed, the AEDIF code was restructured to remove all unnecessary tables and subroutines, permitting larger meshes with smaller executables than would otherwise be possible.


Table 12: 568K mesh: Globus performance on individual clusters - 16 CPUs, -O3 option

Run type                   Globus      Globus     Globus     Globus
Name of cluster            INRIA-nina  IUSTI      INRIA-pf   CEMEF
Processor speed            2 GHz       2 GHz      1 GHz      1 GHz
LAN speed                  1 Gbps      100 Mbps   100 Mbps   100 Mbps
Cache                      512 KB      512 KB     256 KB     256 KB
RAM/CPU                    1/2 GB      1 GB       1/4 GB     1/4 GB
Executable size            237 MB      237 MB     237 MB     237 MB
Number of processors       16          16         16         16
Total computational time   104.0       94.4       195.6      189.7
Local inter-comm. time     1.8         13.3       12.2       13.4
Global inter-comm. time    51.5        39.1       55.9       81.5
Computational ratio        1.0         0.91       1.9        1.9
Communication/Work         1.0         1.2        0.5        1.0


Shown in Table 12 are the performances on the individual clusters using 16 processors. The performances on the INRIA-pf and CEMEF clusters are quite good (computational ratio < 2). Note that the IUSTI cluster is 20 percent faster than the INRIA-nina cluster, an unexpected result. However, the Communication/Work ratios for the 568K mesh with 16 processors are much larger than for the 262K mesh using 8 processors. Examination of the computational times for the 262K and 568K runs showed that different compile options were used, which explains the differences in the Communication/Work ratios. Therefore, for the same mesh, the more efficient the code (less work per processor), the larger the Communication/Work ratio, since the communication times depend on the LAN speeds, which remain unchanged.
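
As a check on how the tabulated ratios are obtained, the values in Table 12 can be reproduced under the assumption that the work time is the total computational time minus the local and global inter-communication times, and that the computational ratio is normalized by the INRIA-nina total time. The short Python sketch below is an illustration of this bookkeeping only; it is not part of the AEDIF code.

    # Recompute the Table 12 (-O3) ratios from the tabulated times.
    # Assumption (not stated explicitly in the report):
    #   work = total - (local + global inter-comm.),
    #   computational ratio = total / INRIA-nina total.
    clusters = {
        # name:        (total, local inter-comm., global inter-comm.)
        "INRIA-nina": (104.0,  1.8, 51.5),
        "IUSTI":      ( 94.4, 13.3, 39.1),
        "INRIA-pf":   (195.6, 12.2, 55.9),
        "CEMEF":      (189.7, 13.4, 81.5),
    }

    reference_total = clusters["INRIA-nina"][0]

    for name, (total, local, global_) in clusters.items():
        comm = local + global_                # total inter-communication time
        work = total - comm                   # time spent computing
        comp_ratio = total / reference_total  # "Computational ratio" row
        comm_work = comm / work               # "Communication/Work" row
        print(f"{name:10s}  computational ratio {comp_ratio:4.2f}"
              f"  Communication/Work {comm_work:4.2f}")

Run as written, this reproduces the last two rows of Table 12 to within a few percent, which supports the assumed definitions.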


Table 13: 568K mesh: Globus performance on individual clusters - 16 CPUs, -O1 option

Run type                   Globus      Globus     Globus     Globus
Name of cluster            INRIA-nina  IUSTI      INRIA-pf   CEMEF
Processor speed            2 GHz       2 GHz      1 GHz      1 GHz
LAN speed                  1 Gbps      100 Mbps   100 Mbps   100 Mbps
Cache                      512 KB      512 KB     256 KB     256 KB
RAM/CPU                    1/2 GB      1 GB       1/4 GB     1/4 GB
Executable size            871 MB      871 MB     871 MB     871 MB
Number of processors       8-8         8-8        8-8        8-8
Total computational time   547.3       449.9      740.9      1039.8
Local inter-comm. time     2.3         12.8       9.4        12.5
Global inter-comm. time    280.8       178.9      288.7      277.3
Computational ratio        1.00        0.82       1.35       1.90
Communication/Work         1.07        0.74       0.67       0.39


Table 13 shows the performances on the individual clusters using the -O1 option. Comparing Table 12 (-O3 option) with Table 13 (-O1 option) shows that compiling with the -O1 option reduces the Communication/Work ratios.
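
This is consistent with the reasoning above: the communication time is set by the LAN and does not benefit from compiler optimization, so a slower executable (more work per processor) yields a smaller Communication/Work ratio and a faster executable a larger one. The sketch below illustrates the trend with purely hypothetical numbers; the speedup factors and times are not taken from the report.

    # Illustrative model only: hold the communication time fixed (LAN-bound)
    # and scale the per-processor work time to mimic different optimization
    # levels. All numbers below are hypothetical.
    comm_time = 50.0   # communication time, assumed unchanged by compile options
    base_work = 250.0  # work time of a hypothetical unoptimized build

    for speedup in (1.0, 2.0, 5.0):
        work = base_work / speedup
        print(f"work speedup {speedup:3.1f}x  ->"
              f"  Communication/Work = {comm_time / work:.2f}")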


Table 14: 568K mesh: inter-cluster performance - 16 CPUs, -O1 option

Run type                   Globus   Globus    Globus       Globus       Globus    Globus
Name of cluster            nina     nina-pf   nina-iusti   nina-cemef   pf-cemef  iusti-cemef
Processor speed            2 GHz    2/1 GHz   2/2 GHz      2/1 GHz      1/1 GHz   2/1 GHz
Executable size            343 MB   343 MB    343 MB       343 MB       343 MB    343 MB
Number of processors       16       8-8       8-8          8-8          8-8       8-8
Total computational time   547.3    702.3     1207.9       1322.4       1323.4    2041.6
Local inter-comm. time     2.3      10.5      496.3        190.8        181.9     554.0
Global inter-comm. time    280.8    279.5     449.2        411.4        411.4     449.4
Computational ratio        1.00     1.28      2.21         2.42         2.41      3.73
Communication/Work         1.07     0.70      3.61         0.83         0.81      0.97


Shown in Table 14 are some of the inter-cluster performances with 16 processors. It is noted that the local communication times for the nina-cemef and pf-cemef combinations are roughly a factor of two to three smaller than those for the nina-iusti and iusti-cemef combinations. This surprising observation remains unexplained.

