QMIPS Deliverable D10

Interconnection Networks


Introduction to Deliverable D10

Deliverable D10, which comprises reports from Imperial College and CWI, is briefly described here and put into perspective. Many of the reports also relate to deliverable D11 on the distribution of time delays.

We classify interconnection networks (INs) according to the components they connect: processor-memory networks, where the objective is to provide fast access to a coherent, distributed memory, and processor-processor networks, where efficient message passing is required. These two applications are considered in the sections Processor-memory interconnection and Processor-processor interconnection of this report respectively, although no work has been undertaken in QMIPS specifically on the latter. The section Topologies considers the prevalent topologies that provide the desired connectivity with (it is hoped) minimal latency.


Processor-memory interconnection

To make shared-memory parallel computers faster, the latency of memory accesses must be made as small as possible. This is accomplished by the use of caches: very fast, local memories ``close'' to a processor. Caches are relatively small (although often several megabytes), so certain memory accesses have to be remote, i.e. must access main memory over some communication medium. This medium is often a bus, or multi-bus, but may take the form of any of the networks we consider. Memory is increasingly organised into cache hierarchies in which communication is between the first-level cache, the second-level cache and so on, eventually reaching main memory; however, more than two levels is uncommon.
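To make the latency trade-off concrete, the following back-of-the-envelope sketch (in Python) computes the average memory access time of a two-level hierarchy. All hit rates and cycle counts are illustrative assumptions, not measurements from any system discussed in this deliverable.

# Illustrative average memory access time (AMAT) for a two-level
# cache hierarchy; all figures below are assumed, for illustration only.
l1_hit_time  = 1      # cycles
l2_hit_time  = 10     # cycles, paid on an L1 miss
memory_time  = 100    # cycles, paid on an L2 miss (remote access over the IN)
l1_miss_rate = 0.05   # fraction of references that miss in L1
l2_miss_rate = 0.20   # fraction of L1 misses that also miss in L2

amat = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * memory_time)
print(amat)           # 2.5 cycles under these assumptions

Even under these optimistic assumptions the remote-access term contributes most of the overhead, which is why the latency imposed by the interconnection network is the key quantity to model.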

For example, in the DASH computer [5], processors are organised into clusters. Each processor has its own cache and each cluster has its own memory. Intra-cluster memory accesses are by bus, but inter-cluster accesses are via a multi-stage IN (see the section Topologies).

The crucial implementation issue is cache coherency, i.e. ensuring that the value of a (virtual) memory location is the same in all caches, or, equivalently, that the value read by any processor is the one most recently written to that location. Cache coherency may be ensured by any of a number of protocols, each of which generates additional IN traffic and requires additional operations in each memory access. Cache coherency therefore creates a performance overhead that is crucial to the system designer, and hence to the modeller.
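The kind of additional traffic such a protocol generates can be illustrated by a toy write-invalidate scheme with the classical Modified/Shared/Invalid line states. The sketch below is purely illustrative: it is not the SCI or DASH protocol, nor any protocol analysed in the enclosed reports.

# Minimal sketch of an MSI-style write-invalidate protocol for a single
# cache line, counting the IN transactions implied by each local access.
# Illustrative only; not the SCI or DASH protocol.

def access(state, op):
    """Return (new_state, IN transactions) for a local read or write."""
    if op == "read":
        if state == "I":
            return "S", ["read_miss"]          # fetch the line over the IN
        return state, []                        # hit in M or S: no traffic
    if op == "write":
        if state == "M":
            return "M", []                      # hit: no traffic
        if state == "S":
            return "M", ["invalidate_others"]   # upgrade: invalidation traffic
        return "M", ["read_exclusive"]          # miss: fetch with ownership

state, traffic = "I", []
for op in ["read", "write", "read"]:
    state, msgs = access(state, op)
    traffic += msgs
print(state, traffic)   # M ['read_miss', 'invalidate_others']

Each such transaction is a message on the IN and a queueing delay at the memory or directory; this is precisely the overhead that the performance models aim to quantify.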

There are several papers in the literature considering the performance of snooping cache protocols, e.g. [1]. Based partly on this approach, the paper ``A Uniform-memory Access Model of a Distributed Coherent Cache System'' by A.J. Field, P.G. Harrison and N. Lehovetski was presented at the 10th UK Performance Engineering Workshop in Edinburgh, September 1994. It is a development of the work on SCI [3] initiated by Harrison and Baccelli's student, Lehovetski, in the summer of 1993 and presented at the first annual review. The work has since been taken further by Field and Harrison and another report is in preparation. In particular, private memory, shared system variables and shared data are distinguished, leading to many more cache line states. Also the architecture of the nodes on the SCI ring is modelled in more detail.

Processor-processor interconnection

The emphasis in processor-processor interconnection is on distributed systems as opposed to parallel processing (in a ``supercomputer''). The networks in question are therefore local area networks (LANs) such as token rings and Xerox's Ethernet. Many models have been developed for the Ethernet and for other networks with CSMA protocols. An application of increasing importance is distributed databases, which have a coherency problem similar to that of cache hierarchies. A prime concern is to compare quantitatively, for example, the overheads of the various locking policies that allow safe, exclusive access to parts of the database. However, as noted in the Introduction, QMIPS has not addressed these issues specifically.
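For background only, the flavour of such LAN models can be indicated by the binary exponential backoff rule that Ethernet-style CSMA/CD stations apply after a collision; the sketch below is illustrative and does not correspond to any model in this deliverable.

# Binary exponential backoff after the n-th collision in Ethernet-style
# CSMA/CD: wait a random number of slot times drawn uniformly from
# 0 .. 2^min(n,10) - 1. Illustrative only.
import random

def backoff_slots(collision_count):
    k = min(collision_count, 10)
    return random.randrange(2 ** k)

print([backoff_slots(n) for n in range(1, 5)])   # e.g. [1, 0, 5, 9]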


Topologies

Various topologies for interconnection networks have been proposed, implemented and used. These include cubes, meshes and MINs (multi-stage interconnection networks), which are well established, as well as newer designs such as ``fat trees'' [4] (as used in the CM-5) and the accepted IEEE standard SCI [3]. Many papers have been written on the connectivity of such networks and on their complexity. Similar analyses have been undertaken for some ``hybrid'' networks, i.e. subnetworks of one of these types interconnected according to some (other) topology. For example, in DASH, clusters of buses are connected by a MIN.
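The basic size and distance parameters of these topologies follow from standard relations; the small Python sketch below (illustrative only, not drawn from the enclosed reports) computes them for a binary hypercube and for a delta/banyan network built from k x k crossbar switches.

# Standard size/latency parameters of two common topologies.
import math

def hypercube(n_nodes):
    """Binary hypercube with n_nodes = 2^d processors."""
    d = int(math.log2(n_nodes))
    return {"dimension": d, "diameter": d, "links": n_nodes * d // 2}

def banyan(n_ports, k=2):
    """N-port delta/banyan network built from k x k crossbar switches."""
    stages = int(round(math.log(n_ports, k)))
    return {"stages": stages, "switches": stages * n_ports // k}

print(hypercube(64))   # {'dimension': 6, 'diameter': 6, 'links': 192}
print(banyan(16, 2))   # {'stages': 4, 'switches': 32}

The logarithmic number of stages is what makes MINs attractive: each extra stage adds switching delay, but the total number of switches grows only as (N/k) log_k N.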

Given a particular topology, a network is further classified by features that include its routing strategy (centralised or decentralised), whether or not it is synchronised by a global clock, its flow control, its switching mechanism (circuit-switched, packet-switched or a combination) and its buffering. Some of these issues are addressed in [2].
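Decentralised routing in a banyan network built from 2 x 2 switches, for example, needs no global information: each stage simply inspects the next bit of the destination address, most significant first, and routes to the upper or lower output accordingly. A minimal sketch of this self-routing rule (illustrative, not taken from [2]):

# Self-routing (decentralised) in a banyan network of 2x2 switches:
# the switch at stage i sends the packet to output 0 (upper) or 1 (lower)
# according to bit i of the destination address, most significant bit first.
def route(dest, stages):
    """Return the output port chosen at each of the stages."""
    return [(dest >> (stages - 1 - i)) & 1 for i in range(stages)]

print(route(0b1011, 4))   # [1, 0, 1, 1]: lower, upper, lower, lower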

Work on IN topologies has focused on the distribution of the latency, i.e. the end-to-end transmission time. Results on packet-switched, finite-buffered delta networks have been reported by Harrison and Pinto in ``Response time distributions in packet-switched banyan networks'', presented at the 2nd Workshop on Performance Modelling and Evaluation of ATM Networks, Bradford, July 1994. This was based in part on an earlier paper by the same authors, which did not consider transmission times: ``An approximate analysis of asynchronous, packet-switched, buffered Banyan networks with blocking'', Performance Evaluation Journal, 1994, also enclosed.
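What is meant by a latency distribution can be conveyed by a crude Monte Carlo sketch in which each stage contributes one service slot plus a geometrically distributed queueing wait. This only illustrates the quantity of interest; it in no way reproduces the approximate analysis of the enclosed papers, and the parameters are arbitrary assumptions.

# Crude Monte Carlo sketch of end-to-end latency through a multi-stage
# network: each stage costs one service slot plus a geometric queueing wait.
import random

def latency(stages=4, p_wait=0.3):
    total = 0
    for _ in range(stages):
        wait = 0
        while random.random() < p_wait:   # geometric number of waiting slots
            wait += 1
        total += 1 + wait                  # service slot plus queueing delay
    return total

samples = [latency() for _ in range(10000)]
print(sum(samples) / len(samples))         # close to stages * (1 + p/(1-p)), about 5.7 here

A histogram of such samples, rather than just their mean, is the kind of object that the response-time-distribution results characterise analytically.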

The latency in a single crossbar switch has also been considered and we enclose two papers: ``Transmission times in buffered full crossbar communication networks with cyclic arbitration'' by Field and Harrison, in Proc. Int. Conf. on Parallel Processing, 1993, and ``On the determination of the stationary distribution of a symmetric clocked buffered switch'' by Cohen, CWI Research Report BS-R9427, 1994.
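Cyclic (round-robin) arbitration itself is simple to state: each crossbar output keeps a pointer, grants the first requesting input found at or after the pointer, and then advances the pointer past the winner. The sketch below states the rule only; it is not the model analysed in the enclosed papers.

# Cyclic (round-robin) arbitration at one crossbar output.
def arbitrate(requests, pointer):
    """requests[i] is True if input i wants this output; return (grant, new pointer)."""
    n = len(requests)
    for i in range(n):
        candidate = (pointer + i) % n
        if requests[candidate]:
            return candidate, (candidate + 1) % n   # grant and advance the pointer
    return None, pointer                            # no requests this cycle

grant, ptr = arbitrate([False, True, True, False], pointer=2)
print(grant, ptr)   # 2 3: input 2 wins, the pointer moves to 3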


References

[1] A.G. Greenberg, I. Mitrani, L. Rudolph. Analysis of Snooping Caches. Proc. PERFORMANCE '87, Brussels, North-Holland, 1988.

[2] P.G. Harrison. Analytic models for multi-stage interconnection networks. J. Parallel and Distributed Computing, 1991.

[3] IEEE P1596 Working Group. The Scalable Coherent Interface, DRAFT 1.0. IEEE, January 1991.

[4] C.E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Computers, C-34(10), 892-901, October 1985.

[5] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proc. 17th International Symposium on Computer Architecture, 1990.


Contents of the Deliverable

D10-1
An approximate analysis of asynchronous, packet-switched, buffered Banyan networks with blocking, P.G. Harrison, A. de C. Pinto. Performance Evaluation Journal, 1994 (front page).

D10-2
Response time distributions in packet-switched banyan networks, P.G. Harrison, A. de C. Pinto. In Proc. 2nd Workshop on Performance Modelling and Evaluation of ATM Networks, Bradford, July 1994 (front page). Report included with deliverable D11.

D10-3
A Uniform-memory Access Model of a Distributed Coherent Cache System, A.J. Field, P.G. Harrison, N. Lehovetski. In Proc. 10th UK Performance Engineering Workshop, Edinburgh, September 1994 (front page). Report in proceedings of the QMIPS London Workshop, deliverable D20. Also in deliverable D9.

D10-4
Transmission times in buffered full crossbar communication networks with cyclic arbitration, A.J. Field, P.G. Harrison. In Proc. Int. Conf. on Parallel Processing, St. Charles, Illinois, USA, 1993 (front page). Report included with deliverable D11.

D10-5
On the determination of the stationary distribution of a symmetric clocked buffered switch, J.W. Cohen. CWI Research Report BS-R9427, July 1994 (front page).


