

Optimizing MecaGRID calculations

Methods to optimize MecaGRID computations can be described as follows:

  1. Global submittal scripts - Creating a submittal script that accounts for global processor availability should be a priority; this was discussed previously. This approach is not currently being pursued at INRIA, but it is essential before users will accept the MecaGRID. It is a topic of current interest on the Globus users e-mail list, and the Globus Toolkit provides support for it through the Metacomputing Directory Service (MDS).

  2. Load balancing by processor speed (LB-1) - The recent improvements by Lanrivain [rodolphe] to the mesh partitioner developed by Digonnet [hugues] at the CEMEF are notable. Lanrivain and Digonnet created a heterogeneous version that partitions a mesh taking into account the different processor speeds of the different clusters. This approach would work well in the present MecaGRID configuration, where the clusters and the number of processors to be used on each cluster must be specified in advance. Basset [basset] obtained mixed results with this approach: either very good or very poor.

    Combined with the global processor availability method, however, this innovative approach has two disadvantages: 1) the clusters that will be used and the number of processors on each cluster are not known in advance, so the partitioner must be executed in the same run as AEDIF in order to create the partitions that AEDIF will use on the same processors; and 2) Table 22 shows that as much as an hour may be required to partition a large mesh, a major overhead for each AEDIF run.

  3. Load balancing by processor speed (LB-2) - A simpler load-balancing approach was suggested by Alain Dervieux and tested in this study. The idea is that, rather than partitioning the mesh according to processor speed to obtain partitions of different sizes (LB-1), one creates homogeneous (equal-size) mesh partitions and gives more partitions to the faster processors at execution time. This avoids the need to run the mesh partitioner before executing the AERO-F and AEDIF codes. The MecaGRID clusters have either 1 GHz or 2 GHz processors; therefore the INRIA-nina and IUSTI processors would each receive two partitions, and the INRIA-pf and CEMEF processors one partition each. The method does require an interface to write the MyMachine.LINUX file (a sketch of such an interface is given after this list). Another advantage of this approach is that the homogeneous partitions can be sized to fit the minimum RAM available (256 MB on INRIA-pf), thus avoiding swapping in and out of RAM.

    Table 23 and Table 24 show some non-Globus AEDIF results using the LB-2 method for the 1.03M mesh with 32 and 64 partitions. Table 23 shows that LB-2 using 24 processors (8 nina and 16 pf) is approximately as efficient as using 16 nina and 16 pf processors, a saving of 8 processors. Table 24 shows that LB-2 using 48 processors (16 nina and 32 pf) is approximately as efficient as using 64 processors (32 nina and 32 pf) with one processor per partition. The LB-2 run with 48 processors therefore needs 25 percent fewer processors (48 instead of 64) to achieve the same result in approximately the same run time.


    Table 23: Load balancing method LB-2 using 32 partitions

                         INRIA-nina          INRIA-pf
    CPUs/Parts  Method  CPUs  Partitions  CPUs  Partitions    Time
      32/32      STD     16       16       16       16      205 sec
      24/32      LB-2     8       16       16       16      226 sec



    Table 24: Load balancing method LB-2 using 64 partitions

                         INRIA-nina          INRIA-pf
    CPUs/Parts  Method  CPUs  Partitions  CPUs  Partitions    Time
      64/64      STD     32       32       32       32      165 sec
      48/64      LB-2    16       32       32       32      185 sec


  4. Dynamic memory allocation (DMA) - The current version of AEDIF is compiled with F77, so the size of the executable is fixed by the maximum partition size. Therefore, load balancing by processor speed with heterogeneous partitions requires the same RAM for both large and small partitions! To avoid this, dynamic memory allocation should be introduced in future versions of the AEDIF code so that smaller partitions require less RAM. This is most easily accomplished with F90 and, based on the author's experience in F90 programming, would not be difficult to implement (see the second sketch after this list).

  5. Optimizing RAM - For the same processor speed, differences in the available RAM may degrade performance: an IUSTI processor has 1 GB of RAM available while an INRIA-nina processor has less (1/2 GB), yet both have 2 GHz processors. One way to equalize the RAM is as follows: for each nina node requested, use only one CPU of that node, so that the nina processor actually used has 1 GB of RAM available.
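
The interface mentioned in item 3 can be quite small: for each CPU, it writes the host name into the machine file once per partition assigned to that CPU, so a 2 GHz processor appears twice and a 1 GHz processor once. The following Fortran 90 sketch illustrates the idea only; the host names, the CPU counts, and the assumption that MyMachine.LINUX is a plain machine file with one host name per MPI process are illustrative and not taken from the MecaGRID sources.

      ! Hypothetical LB-2 helper: ghz(i) partitions per CPU of cluster i
      ! (2 GHz -> 2 partitions, 1 GHz -> 1), one host name written per
      ! partition.  File format and host names are assumptions.
      program lb2_machinefile
         implicit none
         integer, parameter :: nclusters = 2
         character(len=16)  :: host(nclusters)
         integer            :: ghz(nclusters), ncpu(nclusters)
         integer            :: i, j, k

         host = (/ 'nina            ', 'pf              ' /)
         ghz  = (/ 2, 1 /)        ! processor speeds in GHz
         ncpu = (/ 8, 16 /)       ! CPUs requested on each cluster

         open (10, file='MyMachine.LINUX', status='replace')
         do i = 1, nclusters
            do j = 1, ncpu(i)
               do k = 1, ghz(i)   ! one entry per partition on this CPU
                  write (10,'(a)') trim(host(i))
               end do
            end do
         end do
         close (10)
      end program lb2_machinefile

With the example values above (8 nina CPUs at 2 GHz and 16 pf CPUs at 1 GHz) the file contains 32 entries, which corresponds to the 24-CPU/32-partition case of Table 23.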
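
For item 4, the difference between the two memory models can be sketched as follows; the array name, its dimensions, and the F77 size parameter are purely illustrative and are not taken from AEDIF.

      ! Minimal sketch contrasting F77 static storage with F90 dynamic
      ! memory allocation; names and sizes are illustrative only.
      !
      ! F77 style: every process carries arrays dimensioned for the
      ! LARGEST partition, e.g.
      !       PARAMETER (NSMAX = 200000)
      !       REAL COOR(3,NSMAX)
      ! so large and small partitions need the same RAM.
      !
      ! F90 style: allocate once the local partition size is known, so a
      ! small partition consumes proportionally less RAM.
      program dma_sketch
         implicit none
         real, allocatable :: coor(:,:)   ! e.g. vertex coordinates
         integer :: ns                    ! local partition size

         read (*,*) ns                    ! size of this partition
         allocate (coor(3,ns))
         ! ... read the partition data and run the solver ...
         deallocate (coor)
      end program dma_sketch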


Stephen Wornom 2004-09-10