

Globus performance on a 1.03M-vertex mesh

This section examines MecaGRID performance for a large number of processors using a mesh containing 1.03M vertices.

Tables 17-18 show MecaGRID performance using 60-64 CPUs for the 1.03M mesh, where Ts/Hr is the number of time steps computed per hour and CRate = (number of vertices * nstages)/(computational time)/(number of time steps). The performances are for 10 time steps. The first column indicates the type of run (g = Globus, ng = non-Globus). The second column gives the number of partitions (P) and the number of processors (CPU) used. T1 is the computational time and T2 the communication time. W is the work, W = T1 - T2, and includes the time to write the solution files. Sav is the number of times the solution files were written. The times shown in Tables 17-18 are average CPU times. For the 64/48 entry, a total of 48 processors is used for the 64-partition mesh (see Load balancing by processor speed (LB-2) in Section 12). Table 17 shows that for non-Globus computations on the INRIA clusters one can compute at a rate of approximately 200 time steps per hour with the 1.03M-vertex mesh when the solution files are written every 10th time step. When the solution files are written every two time steps using nina and pf processors, Table 18 shows a Globus computational rate on the order of 50 time steps per hour. When inter-cluster configurations are used, the rate drops to the order of 10 time steps per hour.
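Written out explicitly, the two metrics are as follows; the CRate expression restates the definition above, while the Ts/Hr expression is an assumption consistent with the tabulated values (it uses T1, the computational time in seconds, and nsteps, the number of time steps, here 10):

\[
  \mathrm{CRate} \;=\; \frac{N_{\mathrm{vertices}} \times \mathrm{nstages}}{T_1 \times \mathrm{nsteps}},
  \qquad
  \mathrm{Ts/Hr} \;\approx\; \frac{3600 \times \mathrm{nsteps}}{T_1}.
\]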


Table 17: 1.03M results, Sav = 1

                     Processor distribution
 Typ  P/CPU   nina   pf   cemef  iusti   Sav   T1/T2       CRate  Ts/Hr  T/W
 ng   64/64    32    32     -      -      1    165/108      1870   218   1.9
 ng   64/48    16    32     -      -      1    185/113      1674   195   1.6
 ------------------------------ inter-cluster ------------------------------
 g    60/60    32     -     -     28      1    3523/2452      88    10   2.3
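As a consistency check, assuming nstages = 3 (not stated explicitly in this section, but consistent with the tabulated values), the first non-Globus row of Table 17 gives

\[
  \mathrm{CRate} = \frac{1.03\times 10^{6} \times 3}{165 \times 10} \approx 1870,
  \qquad
  \mathrm{Ts/Hr} \approx \frac{3600 \times 10}{165} \approx 218,
\]

which matches the tabulated entries.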



Table 18: 1.03M results, Sav = 5

                     Processor distribution
 Typ  P/CPU   nina   pf   cemef  iusti   Sav   T1/T2       CRate  Ts/Hr  T/W
 ng   64/64    32    32     -      -      5    681/470       454    53   2.2
 g    64/64    32    32     -      -      5    655/442       472    55   2.1
 g    62/62    32     -     -      -      5    637/417       485    57   1.9
 ------------------------------ inter-cluster ------------------------------
 g    64/64    32    16     -     16      5    2796/1950     111    13   2.3
 g    64/64    16     4     8     24      5    3520/2537      88    10   2.6


Four full Globus production runs (800 time steps) with the 1.03M mesh were attempted using 62 processors (32 nina CPUs + 30 iusti CPUs) and 60 processors (32 nina CPUs + 28 iusti CPUs). Three of the four runs failed when one of the requested CPUs failed to start execution. This problem of failing CPUs has existed for at least six months. It occurs at random; the job remains active, blocking the CPUs, until it is killed. Two of the failed runs blocked the system for nina and IUSTI users for five and eight hours, respectively, before being killed. Failures caused by dying CPUs have also been reported by other Globus users on the Globus users e-mail list, to which all Globus users can subscribe. Globus is evolving software, open source and free: users download it, install it, and test it. Bugs are found and usually reported on the Globus users e-mail lists, often together with fixes the users have found, or they are simply brought to the attention of the Globus Alliance developers, who work to resolve them. It is possible that this problem is solved in newer versions of the Globus software.


