Backend Code Optimisation

Sid-Ahmed-Ali Touati

This habilitation thesis was defended on June 30th, 2010 at the University of Versailles Saint-Quentin-en-Yvelines (France). The honorable committee was composed of:

  1. Pr. Jens Knoop, Technical University of Vienna (Austria), Referee
  2. Pr. Jagannathan Ramanujam, Louisiana State University (USA), Referee
  3. Pr. Denis Trystram, Grenoble Institute of Technology (France), Referee
  4. Dr. Christian Bertin, director of the compiler expertise center at STMicroelectronics (France)
  5. Dr. Alain Darte, CNRS senior researcher at École normale supérieure de Lyon (France)
  6. Pr. William Jalby, University of Versailles St-Quentin en Yvelines (France)
  7. Pr. Pascal Sainrat, University of Paul Sabatier (France), committee president

The table of contents of the manuscript can be retrieved here.

Abstract

This manuscript is a synthesis of a full decade of our research effort on low-level code optimisation, intended for integration in a compiler backend or in a semi-automatic optimisation tool. At the backend level, the processor characteristics are known and can be exploited to generate code that uses the underlying hardware more efficiently.

We start the document with a global view of the phase ordering problem in optimising compilation. Hundreds of compilation passes and code optimisation methods exist nowadays, but nobody knows exactly how to combine and order them efficiently. Consequently, a best-effort strategy consists in iterative compilation: the program is compiled and executed repeatedly in order to decide which passes and optimisation parameters to apply. We prove that iterative compilation does not fundamentally simplify the problem, and that using static performance models remains a reasonable choice.
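
As a purely illustrative sketch (not part of the manuscript), the following Python script shows what a naive iterative-compilation driver looks like: it recompiles a hypothetical kernel.c with gcc under several standard flag combinations, times each binary, and keeps the fastest configuration.

    # Toy iterative-compilation driver (hypothetical sketch, not the protocol
    # studied in the manuscript): compile a C file with gcc under several
    # optimisation-flag combinations, run each binary, keep the fastest one.
    import itertools
    import subprocess
    import time

    SOURCE = "kernel.c"          # hypothetical input program
    BASE_FLAGS = ["-O2"]
    CANDIDATE_FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer"]

    def measure(binary, runs=5):
        """Median wall-clock time of `binary` over several runs."""
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run([binary], check=True)
            times.append(time.perf_counter() - start)
        times.sort()
        return times[len(times) // 2]

    best = None
    for r in range(len(CANDIDATE_FLAGS) + 1):
        for combo in itertools.combinations(CANDIDATE_FLAGS, r):
            flags = BASE_FLAGS + list(combo)
            subprocess.run(["gcc", *flags, SOURCE, "-o", "a.out"], check=True)
            t = measure("./a.out")
            if best is None or t < best[0]:
                best = (t, flags)

    print("best flags:", best[1], "median time:", best[0])

Even this tiny driver already needs one full compile-and-run cycle per candidate configuration, which hints at why iterative compilation does not make the combinatorial problem disappear.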

A well-known phase ordering dilemma between register allocation and instruction scheduling has been debated for a long time in the literature. We show how to efficiently decouple register constraints from instruction scheduling by introducing the notion of register saturation (RS). The RS is the maximal register need over all possible schedules of a data dependence graph. We provide formal methods for its efficient computation, which allow us to detect superfluous register constraints; such constraints can then be neglected during instruction scheduling.
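
By way of illustration only, the toy Python code below applies the definition of RS by brute force on a tiny data dependence graph: it enumerates every valid sequential schedule, computes the register need of each, and keeps the maximum. The manuscript's contribution is precisely to avoid such exponential enumeration.

    # Brute-force illustration of register saturation (RS) on a toy graph;
    # hypothetical code, not the efficient algorithms of the manuscript.
    # The DDG maps each value to the list of values that consume it.
    from itertools import permutations

    DDG = {            # toy data dependence graph (value -> consumers)
        "a": ["c", "d"],
        "b": ["c"],
        "c": ["e"],
        "d": ["e"],
        "e": [],
    }

    def is_schedule(order):
        """A valid sequential schedule issues every producer before its consumers."""
        pos = {v: i for i, v in enumerate(order)}
        return all(pos[u] < pos[v] for u in DDG for v in DDG[u])

    def register_need(order):
        """Maximal number of simultaneously live values: a value is live from
        its definition until its last consumer is issued."""
        pos = {v: i for i, v in enumerate(order)}
        last_use = {u: max((pos[v] for v in DDG[u]), default=pos[u]) for u in DDG}
        return max(sum(1 for u in DDG if pos[u] <= t <= last_use[u])
                   for t in range(len(order)))

    # RS = maximal register need over all possible schedules
    # (exponential search, only meaningful on tiny graphs).
    RS = max(register_need(o) for o in permutations(DDG) if is_schedule(o))
    print("register saturation =", RS)   # 4 on this toy graph

If the RS computed this way does not exceed the number of available registers, the register constraints cannot be violated by any schedule, which is what makes them safe to ignore.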

In order to guarantee the absence of spilling before instruction scheduling, we introduce the SIRA framework. It is a graph-theoretical approach that bounds the maximal register need of any subsequent software pipelining while preserving instruction-level parallelism. SIRA models periodic register constraints in the presence of multiple register types, buffers and rotating register files. We provide an efficient heuristic that shows satisfactory results, both as a standalone tool and as a compilation pass integrated inside a real compiler.

SIRA defines a formal relationship between the number of allocated registers, the instruction-level parallelism and the loop unrolling factor. We use this relationship to devise an optimal algorithm that minimises the unrolling factor while preserving instruction-level parallelism and guaranteeing the absence of spilling. As far as we know, this is the first result in the literature proving that code size compaction and code performance are not antagonistic optimisation objectives.
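
The sketch below is a simplified, hypothetical reading of that relationship, not the manuscript's formal statement. It assumes that the pipelined kernel must be unrolled by the least common multiple of the register-reuse circuit weights, and that spare architectural registers may be added to circuits; it then searches by brute force for the distribution minimising the unroll factor.

    # Hypothetical sketch of unroll-factor minimisation, under the simplified
    # assumption that the kernel must be unrolled lcm(mu_1, ..., mu_k) times,
    # where mu_i is the register weight of reuse circuit i, and that unused
    # architectural registers may be added to circuits to lower that lcm.
    from functools import reduce
    from math import gcd
    from itertools import product

    def lcm(values):
        return reduce(lambda a, b: a * b // gcd(a, b), values, 1)

    def minimal_unroll(weights, arch_registers):
        """Brute-force search: raise each circuit weight (never below its
        allocated value) within the register file size, minimising the lcm."""
        spare = arch_registers - sum(weights)
        assert spare >= 0, "allocation already exceeds the register file"
        best = lcm(weights)
        # try every way of distributing the spare registers over the circuits
        for extra in product(range(spare + 1), repeat=len(weights)):
            if sum(extra) <= spare:
                best = min(best, lcm([w + e for w, e in zip(weights, extra)]))
        return best

    # Toy example: reuse circuits of weights 3 and 5 (8 allocated registers),
    # on a machine with 10 architectural registers of that type.
    print(minimal_unroll([3, 5], 10))   # 5 instead of lcm(3, 5) = 15

The toy example already shows the flavour of the result: spending two extra registers shrinks the unrolled code without sacrificing the schedule, so code size and performance need not be traded against each other.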

The interaction between the memory hierarchy and instruction-level parallelism is a crucial issue when we want to hide or tolerate load latencies. First, we demonstrate in practice that superscalar out-of-order processors suffer from a performance bug in their memory disambiguation mechanism. We show that load/store vectorisation solves this problem for regular codes. For irregular codes, we study the combination of low-level data preloading and prefetching, designed for embedded VLIW processors.
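
The toy Python fragment below only illustrates the shape of the transformation on a hypothetical list of memory operations (it is a sketch, not the compiler pass studied in the manuscript): loads are hoisted above the stores they provably do not alias with, so the hardware disambiguator no longer faces interleaved load/store pairs.

    # Toy illustration of load/store vectorisation on a straight-line
    # (unrolled) loop body; hypothetical instruction tuples, not a real IR.
    ORIGINAL = [
        ("load",  "A", 0), ("store", "C", 0),
        ("load",  "A", 1), ("store", "C", 1),
        ("load",  "A", 2), ("store", "C", 2),
    ]

    def may_alias(op1, op2):
        """Conservative test: two accesses may alias only if they touch the same array."""
        return op1[1] == op2[1]

    def vectorise(ops):
        loads, stores = [], []
        for op in ops:
            if op[0] == "load" and not any(may_alias(op, s) for s in stores):
                loads.append(op)        # safe to hoist above the pending stores
            else:
                stores.append(op)
        return loads + stores

    print(vectorise(ORIGINAL))
    # [('load','A',0), ('load','A',1), ('load','A',2),
    #  ('store','C',0), ('store','C',1), ('store','C',2)]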

Finally, with the introduction of multicore processors, we observe that program execution times may be highly variable in practice. In order to improve the reproducibility of experimental results, we design the Speedup-Test, a rigorous statistical protocol. We rely on well-known statistical tests (Shapiro-Wilk, Fisher's F-test, Student's t-test, Kolmogorov-Smirnov, Wilcoxon-Mann-Whitney) to evaluate whether an observed speedup of the average or the median execution time is statistically significant.
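
The fragment below is a simplified sketch in the spirit of that protocol (hypothetical code using SciPy, not the published procedure): it checks normality with Shapiro-Wilk, then tests a speedup of the mean with Student's t-test, or falls back to Wilcoxon-Mann-Whitney for a speedup of the median.

    # Simplified sketch in the spirit of the Speedup-Test (hypothetical code,
    # not the published protocol): given execution-time samples of a base and
    # an optimised program, decide whether the observed speedup is significant
    # at risk level alpha.
    from scipy import stats

    def speedup_is_significant(base_times, new_times, alpha=0.05):
        # Speedup of the average execution time: if both samples look normal
        # (Shapiro-Wilk), compare means with Student's t-test (Welch variant).
        normal = (stats.shapiro(base_times).pvalue > alpha and
                  stats.shapiro(new_times).pvalue > alpha)
        if normal:
            pvalue = stats.ttest_ind(base_times, new_times,
                                     equal_var=False, alternative="greater").pvalue
            return ("mean", pvalue < alpha)
        # Otherwise, speedup of the median execution time: use the
        # Wilcoxon-Mann-Whitney rank test.
        pvalue = stats.mannwhitneyu(base_times, new_times,
                                    alternative="greater").pvalue
        return ("median", pvalue < alpha)

    base = [10.2, 10.4, 10.1, 10.6, 10.3, 10.5, 10.2, 10.4]
    new  = [ 9.1,  9.3,  9.0,  9.4,  9.2,  9.5,  9.1,  9.2]
    print(speedup_is_significant(base, new))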

Keywords

Instruction-Level Parallelism, Instruction Scheduling, Register Allocation, Register Saturation, Software Pipelining, Linear Programming, Integer Linear Programming, Memory Hierarchy, Program Performance Evaluation, Compilation, Code Optimisation

