The extremely poor performance of the Globus inter-cluster runs shown in
Tables 17-18
is hardware related. Consider, for example, the mismatch in the hardware characteristics of the different
frontend machines shown below:
IUSTI has a Pentium IV processor at 2 GHz with 1 GB of RAM
CEMEF has a Pentium IV processor at 400 MHz with 256 MB of RAM
INRIA has a dual Pentium III processor at 1.2 GHz with 1 GB of RAM
The frontends use three different generations of Pentium processors, and one can immediately see a probable reason why inter-cluster performance involving the CEMEF is poor. Recall that with the VPN approach all message passing goes through the frontend machines. The CEMEF frontend can only receive and send messages at 400 MHz, compared to 2 GHz at the IUSTI and dual 1.2 GHz processors at INRIA. In addition, the CEMEF frontend has only 256 MB of RAM, compared to 1 GB at both INRIA and the IUSTI. These mismatches can be expected to cause network congestion whenever an inter-cluster application involves the CEMEF cluster.
The poor performance using 60-64 nina-iusti processors cannot be attributed entirely to the frontend hardware, since the frontends at INRIA and the IUSTI are roughly equivalent. A possible explanation is that the VPN becomes saturated as the number of processors increases.
Tests using more than 24 processors were limited, since only 30 processors are available at the IUSTI and 24 at the CEMEF. It was therefore not possible to perform numerical experiments varying the number of nina/iusti processors for the 64-partition mesh.
To evaluate the MecaGRID performance for a fixed number of processors,
numerical experiments were performed using the 32-partition mesh,
varying the number of nina-iusti processors from 8 to 32;
the total number of nina-iusti processors for each run was 32.
The results of these experiments are shown in Table 19.
Ideally, one would like to see the CRate and Ts/Hr remain constant for the different
combinations of nina-iusti processors. In practice, however, the performance
degrades as the number of iusti processors increases, owing to
larger communication times (T2).
The 1.03M-vertex mesh can be computed with 24 processors. Attempts using 16 processors failed because the required message buffer size was too large. The buffer size can be changed in the AERO-F parameter statements, but this was not tried.
At this point in the study, the capability to compute the MPI transfer rates between the processors was added. Each partition sends/receives data from its neighboring partitions, and the transfer rate is computed by dividing the total amount of data sent/received by the time between the sends and receives.

Table 20 shows the performances using 24 CPUs compiled with the -O3 option. Also shown are the transfer rates computed by Basset [#!basset!#] and some ping test results (see APPENDIX G). Ping tests from the CEMEF to the INRIA cluster have three-decimal-place time accuracy, and the computed transfer rates are reasonable; the rates shown are averages over 100 ping tests. Ping tests from INRIA have only one-decimal-place accuracy, which is not sufficient to compute transfer rates. The reader is referred to the report of Basset for a fuller account of the effect of hardware on Grid performance. Table 20 shows a significant loss in performance when inter-cluster runs are used.