

Analysis of Grid performance

The extremely poor performance of the Globus inter-cluster runs shown in Tables 17-18 is hardware related. Take, for example, the mismatch in the hardware characteristics of the different frontend machines listed below:

IUSTI has a Pentium IV processor at 2 GHz with 1 GB of RAM
CEMEF has a Pentium IV processor at 400 MHz with 256 MB of RAM
INRIA has a dual-Pentium III processor at 1.2 GHz with 1 GB of RAM

The frontends use three different generations of Pentium processors, and one can immediately see a probable reason why the inter-cluster performances involving the CEMEF are poor. Recall that with the VPN approach all message passing goes through the frontend machines. The CEMEF frontend can only receive and send messages at 400 MHz, compared to 2 GHz at the IUSTI and dual 1.2 GHz processors at INRIA. In addition, the available RAM at the CEMEF is 256 MB, compared to 1 GB at both INRIA and the IUSTI. In theory, these mismatches create network bottlenecks whenever inter-cluster applications involve the CEMEF cluster.

The poor performance using 60-64 nina-iusti processors cannot be attributed entirely to the frontend hardware characteristics, as the frontends at INRIA and the IUSTI are roughly equivalent. A possible explanation is that the VPN becomes saturated as the number of processors increases.

Tests using more than 24 processors were limited since only 30 processors are available at the IUSTI and 24 at the CEMEF. Therefore it was not possible to perform numerical experiments varying the number of nina/iusti processors for the 64-partition mesh.

To evaluate the MecaGRID performance for a fixed number of processors, numerical experiments were performed using the 32-partition mesh. The results of these experiments are shown in Table 19, varying the distribution of nina-iusti processors; the total number of nina-iusti processors for each run was 32. Ideally, the CRate and Ts/Hr would remain constant for the different combinations of nina-iusti processors. However, the performance degrades as the number of iusti processors increases, due to larger communication times (T2).

Table 19: 32 partitions: Performance vs. nina-iusti processors (32 CPUs total)
nina  iusti      T1/T2    T/W    CRate  Ts/Hr
  32      0   129/  80   1.65   796.89    278
  30      2   381/ 226   3.29   349.56    122
  28      4   650/ 447   4.02   184.80     65
  24      8  1201/ 924   5.44    94.21     33
  20     12  1288/1025   7.22    88.32     31
  16     16  1590/1281   7.17    70.59     25
  12     20  1492/1184   7.21    76.45     27
   8     24   952/ 781  54.20   129.54     45
   4     28   631/ 441  22.48   223.86     78


The 1.03M-vertices mesh can be computed with 24 processors. Attempts using 16 processors failed because the required buffer size was too large. The buffer size can be changed in the AERO-F parameter statements, but this was not tried.

At this point in the study, the capability to compute the MPI transfer rates between the processors was added. Each partition sends data to and receives data from its neighboring partitions, and the transfer rate is computed by dividing the total amount of data sent/received by the time elapsed between the sends and receives (a minimal sketch of such a measurement is given after Table 20). Table 20 shows the performances using 24 CPUs with the code compiled with the -O3 option. Also shown are the transfer rates computed by Basset [basset] and some ping test results (see APPENDIX G). Ping tests from the CEMEF to the INRIA cluster report times with three-decimal-place accuracy, and the transfer rates computed from them are reasonable; the rates shown are averages over 100 ping tests. Ping tests from INRIA report only one decimal place of accuracy, which is not sufficient to compute transfer rates. The reader is referred to the report of Basset to better understand the effect of hardware on Grid performance. Table 20 shows a significant loss in performance when inter-cluster runs are used.


Table 20: 1.03M mesh: Globus MPI transfer rates using 24 CPUs
      Processor distribution       Transfer rate (Mbps)
nina    pf  cemef  iusti   AEDIF  Basset [basset]   ping    CRate  Ts/Hr  Tcom/Twork
  24     -      -      -   206.1     509.7             -   1920.7    223    1.46
  12    12      -      -    40.1      89.3             -   1309.2    152    1.00
   -    24      -      -    37.4      86.3             -   1094.8    127    1.03
   -     -     24      -    36.6      84.1             -   1062.3    123    0.63
inter-cluster
  12     -     12      -     1.9       7.2          60.3    464.2     54    2.12
   -    12     12      -     1.9       7.2          60.3    492.8     57    2.40
   -     -     12     12     0.6       3.7             -    224.4     26    3.71
  12     -      -     12     0.7       5.0             -    263.1     30    5.17
   8     -      8      8     0.5        -              -    203.1     23    4.47
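The following is a minimal sketch, in C with MPI, of the kind of point-to-point transfer-rate measurement described above: the exchange between two partitions is timed with MPI_Wtime(), the bytes moved are counted, and the byte count is divided by the elapsed time. This is not the actual AERO-F instrumentation; the payload size NBYTES and the rank-pairing scheme are assumptions made only for illustration.

/* Minimal sketch of a point-to-point MPI transfer-rate measurement.
 * Not the AERO-F instrumentation: NBYTES and the rank pairing are
 * illustrative assumptions. Each rank exchanges a buffer with a
 * partner, times the exchange, and reports the rate in Mbps. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (1 << 20)   /* assumed 1 MB payload per exchange */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = malloc(NBYTES);
    char *recvbuf = malloc(NBYTES);
    int partner = rank ^ 1;            /* pair ranks 0-1, 2-3, ... */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (partner < size) {
        /* simultaneous send/receive with the neighboring partition */
        MPI_Sendrecv(sendbuf, NBYTES, MPI_CHAR, partner, 0,
                     recvbuf, NBYTES, MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double elapsed = MPI_Wtime() - t0;

    /* total bytes moved (sent + received) divided by the elapsed time,
       converted to megabits per second */
    double mbps = (2.0 * NBYTES * 8.0) / (elapsed * 1.0e6);
    if (partner < size)
        printf("rank %d <-> rank %d : %.1f Mbps\n", rank, partner, mbps);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

In an actual flow solver the exchanged data are the variables on the partition interfaces rather than a fixed dummy buffer, but the rate computation, bytes divided by elapsed time, is the same.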



Stephen Wornom 2004-09-10