The GNU Instruction Scheduler

Michael D. Tiemann
June 16, 1989 (Updated July, 1989)

Abstract

The GNU C compiler is a mostly portable, optimizing compiler written by Richard Stallman and others to provide the GNU project with a free C compiler. The GNU Instruction Scheduler is a new pass written by the author for the GNU C compiler. This report begins with a brief overview of the architecture of the GNU C compiler, followed by a description of list-based instruction scheduling. Design goals of the scheduler and accomplishments to date are presented, followed by a section on work yet to be done. This report concludes with an evaluation of the scheduler, its limitations and its strengths.

1 The Architecture of the GNU C Compiler

The GNU C compiler exists to provide the GNU project and others with a free C compiler. Of the roughly 65,000 lines of code implementing GNU C, about 15,000 lines of code implement the front-end, and the remaining 50,000 lines of code implement the back end. The compiler is surprisingly portable: most machine descriptions (with optimizations) can be written in 2000 lines of the GNU machine description format, a combination of Register Transfer Language (RTL) and C, plus a 1000 line (more or less) header file. There is a little more to the story, but the main point is that with a ratio better than 15:1 (machine independent code to machine dependent code), the compiler is quite portable.

Just as the back end of the compiler was designed to be portable across different register-based machine platforms, the front-end language processor was designed to support different languages. Both the front-end and back-end achieve their power by implementing a set of powerful primitives. For example, there's only one looping construct in in the front-end; this construct is called the loop statement. When a C while or for statement is read in, it is represented as a combination of a loop statement and few other statement types perhaps put around the loop statement. In the case of a for, where there is an initialization, it is represented as a compound of initialize followed by loop. The end tests that are built in in a while or for statements are represented using one exit the loop construct. Such organization makes life easier for the programmer writing a new front-end, but may make life more difficult for the programmer writing an optimizer to perform optimizations based on elaborate loop analysis.

The GNU C compiler, like most optimizing compilers, runs in many passes. The front end builds the syntax tree bottom up, using an LALR parser built by Bison [Corbett85]. When expression syntax grows to statement syntax, the statement is translated from its tree representation to its RTL representation. RTL representation looks like lisp lists; RTL objects are self-describing, and can be printed, copied, and traversed in much the same way that lisp lists are. The RTL representation for instructions is the aggregation of a Unique Identifier (UID), pointers to previous and next RTL instruction objects (making the insn list a doubly linked list, hence efficiently supporting insertion and deletion of other insns), the instruction pattern, the instruction code (a memoized value which associates the instruction object with a specific instruction template in the machine description), the instruction's logical links (pointers to instructions which feed this instruction via dataflow), and register notes (a list of assertions about the properties of the registers used by this RTL instruction object).

Once the primitive instructions (insns) have been emitted in response to requests from the front end, the passes of the back end go to work. The first pass is jump optimization, which is powerful and fast. When first run, its purpose is to remove dead code, so that the compiler has less code it must slog through. After initial jump optimization has been performed, common subexpressions are eliminated. Common subexpression elimination (CSE) is performed for each extended basic block (a sequence of code with a unique entry point, but possibly many exits points). If after performing common subexpression elimination, branch statements have been affected, jump optimization is run again, to remove any newly discovered dead code. CSE is only half-smart about memory: it knows to distinguish scalar variables from aggregates, and unknown addresses from either, but its ability to understand memory references is limited. A consequence is that often writes to independent memory locations tend to destroy valid common subexpressions.

Having simplified the code with jump and CSE optimizations, loop optimizations are performed. Loop invariant code motion, loop inversion, induction variable elimination, strength reduction, and use of special machine constructs (such as decrement and branch) are applied heuristically. Loop optimizations not performed include loop peeling, loop fission and loop fusion. Loop unrolling, perhaps most important for superscalar machines, is also not implemented. Loop optimizations may have again affected the branching behavior of the program; if so, jump optimization is again performed.

As discussed previously, all the different kinds of loops provided by e.g. C are mapped to a single kind of RTL-level looping structure. This makes some kinds of loop-based optimizations harder to do. For example, to get better performance from a superscalar machine, the compiler should unroll loops which do not have inter-iteration dependencies to achieve some degree of software pipelining. To compute such dependencies, it is very helpful to know the precise nature of the loop: its beginning, its body, its iteration step code, and its termination condition. As things stand right now, the loop optimizer would have to derive all that information itself from the RTL - a non-trivial task.

Common subexpression elimination and loop optimizations can take place without knowledge about basic blocks. The remaining optimizations depend on basic-block level data-flow analysis. A pass computes the basic blocks, initializing vectors which describe characteristics of basic blocks (live registers on entry) characteristics of registers within and across basic blocks (number of times set, number of times referenced, live across how many instructions, live in which basic blocks, number of calls crossed), and characteristics of insns (what registers die, what insns a data-flow sources for this insn). Data-flow analysis computes live variables, and hence can detect what become dead stores, which can then be eliminated.

Having computed such detailed information, the compiler is prepared to perform Frazier-Davidson style code optimization. The essence of this technique is to look for two or three insns related by data-flow, and to see whether the insns can be rewritten as a single insn which results from performing algebraic substitution.

The following example shows how the machine description supports an accumulate-multiply instruction. A complete description of the machine description syntax and semantics are beyond the scope of this paper, but the interested reader is encouraged to consult Using and Porting GNU CC [Stallman89] for more information. If the machine description contains e.g. the following patter:

;; FPA Add and multiply
(define_insn ""
  [(set (match_op:DF 0 "register_op" "=x,y,y")
        (mult:DF (plus:DF (match_op:DF 1 "register_op""=xH,y,y")
                          (match_op:DF 2 "general_op" "x,y,rmF"))
                 (match_op:DF 3 "general_op" "xH,rmF,y")))]
  "TARGET_FPA"
  "*

  if (which_alternative == 0)
    return \"fpam%.d %2,%w1,%w3,%0\";
  return \"fpam%.d %x2,%1,%x3,%0\";
")

Then when the combiner comes across these two (possibly non-adjacent) patterns:

(insn 78 77 79
  (set (reg:DF 99)
       (plus:DF (reg:DF 88)
                (reg:DF 89))) -1 (nil)
  (expr_list:REG_DEAD (reg 89) (nil)))

(...)

(insn 92 91 93
  (set (reg:DF 100)
       (mult:DF (reg:DF 99)
                (reg:DF 80))) -1 (insn_list 78 (nil))
  (nil))

it can substitute the source of the set for pseudoregister 99 in insn 78 for the use of pseudoregister 99 in insn 92, leading to:

(note 78 77 79 "" NOTE_INSN_DELETED)

(...)

(insn 92 91 93
  (set (reg:DF 100)
       (mult:DF (plus:DF (reg:DF 88)
                         (reg:DF 89))
                (reg:DF 80))) -1 (insn_list 78 (nil))
  (expr_list:REG_DEAD (reg 89) (nil)))

This technique can be thought of as performing peephole optimization in data-flow space. A much later pass handles peephole optimization in the conventional manner, by attempting to combine physically adjacent instructions.

After instruction selection has been performed, the compiler goes about allocating registers. Register allocation is performed in two passes. The first pass allocates registers within basic blocks. Only registers local to a single basic block, and which do not cross calls are allocated in this pass. The algorithm is efficient, and does a good, though limited, job. Register targeting is performed, which means that if the machine needs a particular register to contain a computed value, the register allocator tries to make the computed value wind up in the desired register. Its strategy for doing so is simple however, and it sometimes fails, resulting in register shuffling.

After basic block register allocation has been performed, the compiler allocates the remaining registers by building a conflict matrix, and proceeds to allocate registers by graph coloring. Heuristics use a priority function to determine which quantities should be register allocated and which quantities should go to memory. The priority function assigns quantities priorities computed from the value of the number of times a register is used times the logarithm of the number of insns across which the register is live, divided by the number of insns across which the register is live. Quantities in loops have a weight which makes them more likely to be register allocated.

It is possible that during global register allocation, temporary registers are needed to compute an intermediate result. The register allocator does not know how to find a global register which may happen to be free to perform this computation, and all local register may have been allocated by the previous phase. In such a case, a local register must be spilled. This means throwing away all quantities previously held by that register within that basic block; the quantities thus spilled will have to be computed again using different registers.

Once all registers have been allocated, the compiler can perform branch scheduling. Not all machines have delayed branches; for machines that do not, this pass is not run. The branch scheduler walks through basic blocks looking for unfilled delay slots. When one is found, it searches the basic block for instructions which could be put in its delay slot. The branch scheduler is quite general. It was initially designed to support a machine with zero to three delay slots, with or without nullification. Adapting it to schedule only one delay slot is straightforward.

A final jump optimization combined with peephole optimization is run. During this jump optimization phase, cross jump optimizations are performed. The peephole optimizer runs concurrently, and recognizes machine idioms which are difficult or impossible to represent using data-flow relationships.

Finally, assembly code is produced. At this time, redundant conditional tests are removed, and other machine-specific optimizations can be performed by the assembly output routines themselves.

2 List-based Instruction Scheduling

Instruction scheduling is a technique to exploit instruction level parallelism exposed by pipelined machines. The idea behind list-based scheduling is to simultaneously make the critical path of computation run as quickly as possible while maximizing parallel instruction execution. Scheduling only the critical path of a computation (critical path scheduling) does not account for instructions which are not immediately on the critical path, but were they to execute, would allow more instructions to be executed in parallel. Scheduling to achieve maximum parallelism is also not good, because it can very quickly use more registers than most machines have available, and because the schedule does not take into account an important performance constraint: the critical path. A later example details these problems. List scheduling is a simple way to balance these two approaches, and usually outperform both.

The algorithm for list scheduling is as follows: starting with a basic block, compute instruction priorities. An instruction with priority P1 must be scheduled to execute later than any instruction with priority P2 if P1 > P2.

An instruction I1 which depends on an instruction I2 cannot execute before I2. By giving I1 a larger priority number than I2, and always placing larger-prioritied instructions later in the schedule than smaller-prioritied ones, we preserved correct execution order. Instructions which take more time to compute are also given larger priority numbers. This has the effect of exposing the critical path: the instruction with the largest priority number is that last instruction on the critical path. After computing all priorities, the scheduler marks all instructions on which no instructions depend ready. These instructions are then sorted by priority order. The last instruction in the basic block is always on the ready list, but it need not always have the highest priority number.

The algorithm then starts scheduling by issuing the instruction with the highest priority order. This procedure will build the instruction block from last instruction to first. Each time we schedule an instruction, we check to see whether instructions on which it depended now have no more dependers. For each such instruction, we can mark it ready, and add it to the ready list in priority order. When all instructions have been scheduled, the algorithm terminates, and the next block, if there is one, can be scheduled.

In a vacuum, the implementation of instruction scheduling is quite straightforward, amounting to no more than performing a reverse topological sort on instruction dependencies (or a topological sort on anti-dependencies, depending on one's point of view), adding weights to break ties between some topologically equivalent instructions. The first implementation of list scheduling for the GNU C compiler was essentially this algorithm, but the results were not very good: on a matrix multiply program in which the inner multiply loop had been unrolled four times reduced execution speed by nearly a factor of two! What went wrong?

3 Smarter List Scheduling

The problem with the list scheduling algorithm presented in the previous section is that it is too simple. For example, when an add instruction can be scheduled, two multiply instructions on which it depends may become ready . They may have identical priorities, which means that both multiplications will be performed very close to one another, which is desirable, if the multiply unit is pipelined. Each of these multiply instructions may depend on some memory read instructions, which may all have the same priority. All the memory read instructions will then be scheduled very close to one another. Now each memory read may depend on instructions which compute memory addresses, all of which may have the same priorties. If this loop is unrolled four times, the schedule then becomes: compute eight memory addresses, perform eight memory reads, perform 4 multiplies, perform two addition operations, and finally perform one addition operation. If the machine has enough registers and functional units, this is a great schedule. If not, then register spills and interlocks take their toll.

To handle this problem, the priority function for certain instructions was modified as follows. When the scheduler chooses an instruction in which a register dies, the scheduler becomes sensitive to the instruction where that register was born. As soon as that instruction becomes ready (i.e., it has no further dependers), its priority is increased so that it will be the next instruction scheduled. The effect of this modification is to reduce register lives without greatly affecting parallelism.

Unfortunately, though this modification halved the number of register spills introduced by the scheduler, it did not eliminate all of them. This was because when several instructions' birthing registers became ready simultaneously, they all had the same high priority, and hence could still interfere with one another. The solution to this problem was to introduce a gradient on the priorities: the first birthing instruction got highest priority, and subsequent birthing instructions got lower priorities. This caused the scheduler to generate the schedule compute, load, compute, load, ... instead of compute, compute, ..., load, load, ..., and all spills were eliminated.

4 Design Goals

As has been stated before, the GNU C compiler is a mostly portable compiler. To maintain that property, it was essential to design the instruction scheduler to be as machine-independent as possible. The GNU C compiler can generate optimized code which retains symboltable and debugging information; this is a very nice feature of that compiler, and one which should be compromised as little as possible in the presence of instruction scheduling. Finally, the GNU C compiler generates optimized code very quickly. For example, it can generate better code than industry-standard A T&T compilers, yet do so about three times faster. While compilation time should be a secondary criterion when evaluating optimizing compilers, it would be nice to keep the compiler fast.

The GNU C compiler makes a number of simplifying assumptions which inhibit instruction scheduling freedom. For example, it is assumed that any instruction which tests condition codes immediately follows the instruction which set them. Similarly, it is assumed that before register allocation has been performed, machine registers (not pseudo registers) are only used for specific purposes, and the compiler takes shortcuts based on such assumptions. For example, when performing a subroutine call on a machine which passes arguments in registers, it is assumed that no instructions between the loading of the parameter registers and the use of the function return value intervene.

A final design goal of the instruction scheduler was to design it to be easily understood and improved upon. If it is to gain wide acceptance and use across many machines, its design must be simple and robust. For example, it should be possible to bootstrap the compiler with instruction scheduling enabled in a relatively short amount of time. It should also be cooperative with other passes of the compiler. A scheduler which only just works is not good enough; it must be sufficiently general and robust that others can modify other passes of the compiler, and their modifications will work, or will fail to work in easily understood, or at least documented ways.

5 Implementation

The optimizations performed by an instruction scheduler are necessarily machine specific. Therefore, to design a machine-independent instruction scheduler, it was necessary to abstract away as much of the machine as possible. This was done by using a program to build encapsulations of machine-dependent information that could be carried by instructions. The essential information used by the instruction scheduler presented here is the number of cycles an instruction expects to execute, and the resources used by the instruction during its execution. At such a level of abstraction, the scheduler is sufficiently machine-independent.

Preserving debugging information is a hard problem, especially when instructions are moved across binding contours, e.g.,

    Original Code                    Scheduled Code 

{                                {
  double i = 3.14;                 double i(1) = 3.14;
  int b;                           int b(1);

    {                                {
      int i = foo ();                  int i(2) = foo ();
      int *p;                          int *p(1);
      b *= i;                          i(1)*= 2;
      *p -= i;                         b(1) *= i(2);
     }                               }
   b -= 2;                         *p(1)-= i(2);
   i *= 2;                         b(1) -= 2;
}                                }

The problem is, how can the debugger maintain correspondence between the binding contours of the program and the instructions in the program if instructions can move freely across them? I do not have a good answer for this, but I have preserved all debugging information when instructions do not move across binding contours. It should be pointed out that most binding contours appear at program control points, such as loops and conditionals. Since instructions are only scheduled within a basic block, the distance an instruction can travel is limited to remain within the control construct, and hence the binding contour.

The instruction scheduler that I implemented is currently one of the faster passes of the compiler. Mysteriously, the compiler is sometimes faster when performing instruction scheduling then when not. From profiling information, it appears that scheduled instructions utilizes registers in a way that simplifies the task of local register allocation. Time spent scheduling is often offset by time saved in register allocation. For such a fast instruction scheduler, the performance is pretty good. On a Sun 4/110, where loads and stores take three cycles when an interlock must be used, or two cycles if not, a large application (GNU CC itself) shows a speedup of 8%. Assuming that loads and stores constitute 25% of the instruction mix [Hennessey88], the maximum performance improvement achievable from just scheduling loads and stores is 0.25 (frequency) * 0.50 (speedup) = 0.125, or 12.5%. On the Sun 4/110, floating point operations are not pipelined, so there is no benefit to be gained there. Thus 8% is not bad, and shows that good instruction schedulers can be made fast.

The instruction scheduler was implemented to write out data dependency information to assist the compiler writer in analyzing its decisions. Instructions which are ready to execute, instruction priorities, stalls, and successfully filled delays slots are all printed prior to dumping the function's RTL. It is hoped that this information will be helpful to people who are interested in experimenting with, or measuring the performance of, the priority function.

Alas, the instruction scheduler and the branch scheduler do not yet work together. The main problem is caused by the fact that in order to get the branch scheduler to use instructions which come after the branch, data-flow information is needed. Unfortunately, the GNU C compiler does not keep its data flow information up to date as the compiler progresses. Specifically, optimizations such as cross-jumping change completely the landscape of the code, altering basic blocks in ways not accounted for. I was able to reconstruct most of the data needed to still do a decent job filling branch delay slots, but in so doing, introduced new data dependencies which violated assumptions made by the instruction scheduler. The compiler has been bootstrapped with instruction scheduling and branch scheduling as mutually exclusive options, but it does not yet work with both options enabled at once.

6 Performance Analysis

It is interesting to compare the speedup gained by instruction scheduling as compared with that gained by branch scheduling. While many people do not make the distinction between the two, there is an important difference which should be recognized. Instruction scheduling has to do with data dependencies, while branch scheduling has to do with control dependencies. List scheduling is an algorithm which attacks the problem of the many alternatives possible due to data parallelism. For scalar and superscalar machines, control parallelism is converted to data parallelism (by e.g. loop unrolling), but there still are slots to fill in control transfer instructions. Algorithms for filling these slots can be much more aggressive about finding instructions to fill these slots than one can be with data-parallel instructions, especially if the control constructs can annul speculative computation. The GNU C compiler had a primitive branch scheduler which worked by finding instructions which had delay slots, then looking for an earlier instructions which could be place in the delay slot. This simple strategy always improves code when it applies; the problem is that it does not apply with great frequency.

I extended the branch scheduler by adding a second pass to fill slots left previously unfilled (by the original branch scheduler) with instructions taken either from the target or following the branch. Strategies and tradeoffs for eager branch scheduling can be found in [Hennessey88]. After making this work (which was harder than anticipated), I compared the performance of an application (the compiler itself) compiled with instruction scheduling versus that same application compiled with branch scheduling.

Under test, branch scheduling showed an 18% performance improvement, while instruction scheduling showed only about 8%. Why such a large difference if there are about as many branch delays to fill as there are load and store delays? Without precise measurements, only an approximate answer can be given. First, the branch scheduler reduces the cost of a branch instruction from two cycles to one, where the instruction scheduler reduces the cost of loads and stores from three cycles to two. Second, many memory delay slots may be filled just by luck, say about one fourth the time. Such accounting would make an perfectly efficient instruction scheduler give a performance improvement of 14%, versus a perfectly efficient branch scheduler a performance improvement of 18%. The branch scheduler often separates loads from uses across the branch instruction, which is pure fortitude. Therefore, in the estimated scheduler performance, the cost of memory operations is somewhat reduced. Also, the compiler has some ability to fill branch delay slots using peephole optimizations, independent of the branch scheduler, hence in the second table, branch delay is estimated to be on average 1.9 cycles per instruction.

                         Perfect Schedulers
              CPI unscheduled  i-scheduled  b-scheduled

25 load/store     2.8 ~/insn   2.0 ~/insn   2.7 ~/insn
25 branch         1.9 ~/insn   1.9 ~/insn   1.0 ~/insn
50 ALU            1.0 ~/insn   1.0 ~/insn   1.0 ~/insn
100 insns         1.68 ~/insn  1.48 ~/insn  1.42 ~/insn

Performance improvement: 1.00         1.14         1.18

                         Estimated Schedulers
              CPI unscheduled  i-scheduled  b-scheduled

25 load/store     2.8 ~/insn   2.3 ~/insn   2.7 ~/insn
25 branch         1.9 ~/insn   1.9 ~/insn   1.2 ~/insn
50 ALU            1.0 ~/insn   1.0 ~/insn   1.0 ~/insn
100 insns         1.68 ~/insn  1.55 ~/insn  1.47 ~/insn

Performance improvement: 1.00         1.08         1.14

If the analysis is accurate, the best performance improvement one can expect to see on a Sun4/110 using GNU C's scheduling optimizers would be 34%; using the estimates in the second table, a 23% performance improvement can be expected using both schedulers.

7 Extensions and Future Work

Before talking about future work, some past work should be presented. That work should give the reader some context and understanding of how to approach doing future work should s/he be so inclined.

There are two major aspects of the instruction scheduler: the work done for the scheduler and the work done by the scheduler. The work done for the scheduler involved modifying the data flow analysis pass of the compiler to compute dependency information. Initially, inter-instruction dependencies were only recorded between the setting of a register and its first use. This had to be modified so that all uses depended on the preceding set, and that following sets depended on the preceding set as well. Hard registers and function calls complicated matters further: dependencies had to be added to preserve call order, and to ensure that parameter passing for one function was not interleaved with parameter passing for another. Adjustments to the stack pointer rounded out the kind of operations about which the scheduler had to be cautious.

Live register information is often represented in compilers by bitvectors, and GNU C is no exception. The total space consumed by bits is the number of basic blocks times the number of pseudoregisters used by the function. Originally, flow analysis only needed to allocate from the heap bits to record the registers live on entry to a basic block. It could allocate the bits for registers live on exit from a basic block from the stack. When performing instruction scheduling, both must be allocated from the heap, because accurate live register information must be maintained after scheduling for register allocation.

The additional overhead of adding all these dependencies and extra bitvectors slows down flow analysis, already one of the slowest passes in the compiler, by about 10%.

Collecting the dependencies was a straightforward task, but keeping the information accurate was not so simple. When two instructions are combined to make a new instruction, flow information must be propagated from the instructions disappearing to the new instruction. Consider a multiply instruction I1 and an add instruction I2 becoming a multiply-accumulate instruction. If a third instruction, I3, had to execute after I1, how will it know that it must now execute after I2? To keep the scheduler's memory utilization within reasonable limits, I utilized transitive dependencies whenever possible. This meant that after the combiner ran, the dependencies could be in shambles. It is not desirable to punt and re-run flow analysis when interesting instruction reorganizations are performed, since this would have to be done twice - both before and after instruction scheduling, slowing the compiler by an additional 40%. Instead, the combiner and the scheduler were modified to incrementally fix most dependency information that they broke. The additional time spent by the combiner doing these fixups is virtually unmeasureable. For the scheduler, the additional time is also quite small, though the solution is much trickier.

The scheduler has found the new instructions made by the combiner based on dependencies pointing to deleted instructions. In many cases the deleted instruction became part of the current instruction, hence no dependency is needed at all. The remaining cases involve searching for the merged instruction. The search uses a table to record all dependency information derived while searching for some particular dependence. As the table is built, searches become shorter.

The instruction scheduler runs before register allocation. This gives the scheduler much more flexibility in moving instructions around, but it can lead to worse registers allocation. For example, nothing prevents the scheduler from moving instructions across calls. Such code motion may be detrimental if moving the instruction causes a register not previously live across a call to become so. Running the scheduler before register allocation can also result in incomplete instruction scheduling: if a quantity fails to be register allocated (i.e. spilled to memory), the memory access may affect the schedule. It would be interesting to design an incremental scheduler which could take care of such problems without rescheduling the whole block.

To save time, the scheduler does nothing if the instructions are already in their priority order. The scheduler could look for other blocks with which to merge in order to improve schedule of both blocks. Selecting the blocks with which to merge, and setting thresholds on code duplication both probably need good heuristics. Merging would also mean extending the incremental life analysis performed by the scheduler.

The scheduler has a very simple model of memory accesses. Three functions, true_dependence, anti_dependence, and output_dependence a predicate functions which return the truth value of the relationship between two memory references for true-, anti-, and output-dependencies respectively. They are currently implemented by simply returning TRUE for all references. Making these functions return more accurate answers would give the scheduler more freedom to schedule, since their would be fewer dependencies acting as constraints.

The loop optimization pass is run well before the scheduling pass, thus the scheduler has no input on how loops should be unrolled. It would be interesting to see how one can reconcile the independence of these two interdependent passes.

Some machines do not have primitive support for basic algebraic operations such as multiplication. When this is the case, the compiler writer must make the tradeoff of exposing the machine's weakness and hiding the actual machine implementation. For example, for a machine without multiplication, the compiler can implement multiplication by constants as a sequence of shifts and additions. If multiplication is transformed into shifts and adds too early, the algebraic simplifiers may miss some simplifications, whereas if it is transformed too late, the sequence of instructions will not be available for instruction scheduling.

To make good use of instruction scheduling, basic blocks should be large enough to expose a good deal of instruction level parallelism. This generally means performing procedure integration and loop unrolling. The mechanics of implementing these methods in GNU C is quite simple; in fact, procedure integration is already done. The main difficulty is deciding which procedure to integrate, or deciding which loop to unroll and how many times to do so. Good loop unrolling is probably the next important optimization that should be implemented in the GNU C compiler.

For the basic blocks where there is a fair amount of instruction level parallelism, there is still about a 50-50 chance that the scheduler will build a worse schedule rather than a better one. With the many registers available on a Sparc (due to its use of register windows), the down side is often not seen, but here is an example which exposes it:

static void
decode (shorts, low, hi)
     short *shorts; int *low;
{
  *low = (shorts[3] << 24)
          | (shorts[2] << 16)
          | (shorts[1] << 8)
          | shorts[0];
}

The GNU C compiler will, on its own, generate nice code for this, while the scheduler will try to do too much in parallel, and get lost:

ldsh [%o0+6],%o3         add %sp,-88,%sp
sll %o3,24,%o3           st %l0,[%sp+80+0]
ldsh [%o0+4],%o4         ldsh [%o0+2],%o4
sll %o4,16,%o4           ldsh [%o0],%o5
or %o3,%o4,%o3           sll %o4,8,%o4
ldsh [%o0+2],%o4         ldsh [%o0+4],%l0
sll %o4,8,%o4            ldsh [%o0+6],%o3
or %o3,%o4,%o3           sll %l0,16,%l0
ldsh [%o0],%o4           sll %o3,24,%o3
or %o3,%o4,%o3           or %o3,%l0,%o3
st %o3,[%o1]             or %o3,%o4,%o3
retl                     or %o3,%o5,%o3
nop                      st %o3,[%o1]
                         ld [%sp+80+0],%l0
                         retl
                         add %sp,88,%sp

The unscheduled code has stalls due to interlocks, but the scheduled code executes an extra load, an extra store, and a stack manipulation operation, resulting in slower overall code. It would be nice to see future work improve the priority assignment function so that the scheduler always makes at least code-improving transformations.

Scheduling across basic blocks is another way to achieve more parallelism. This technique was not attempted for this project, but should extend from this work in a straightforward manner. As with loop unrolling and procedure integration, cross-basic-block scheduling can result in code explosion. The heuristics of GNU C compiler perform their optimizations based only on the code being compiled. While they tend to do a good job, a number of these optimizations could benefit from profiling information. In particular, profile information should be used to guide cross-basic-block instruction scheduling.

The instruction scheduler should be modified to work after register allocation. Register spills become loads and stores, and can change the characteristics of the instruction stream enough to warrant rescheduling. It would be trivial to run the scheduler after register allocation, just to take care of the cases of register spills. What would be more interesting is to modify the scheduler and/or the register to note the cause of the spill. If the spill occurred because a register's life was wantonly extended (perhaps across a call or two), the scheduler should attempt to shorten that register's life when next it builds its schedule. Live-range splitting is not implemented in GNU C, but it would be logical to have the live-range splitter and the scheduler coordinate their actions.

8 Conclusion

The difficult work of interfacing an instruction scheduler to the GNU C compiler has been performed. The compiler has been successfully bootstrapped using the instruction scheduler presented here, and an 8% performance improvement was observed. The compiler has also been successfully bootstrapped using a branch scheduler, resulting in a 20% performance improvement. Work on making the two cooperate is almost complete; when this is accomplished (expected in the next few days) the opportunity will exist for others to make scheduling-related improvements, such as those suggested in the previous section, without having to worry about too many other compiler bookkeeping details.

The GNU C compiler has been a willing partner in this undertaking. It showed fine sportsmanship in permitting itself to be bootstrapped with the two schedulers presented here. It was a disappointment that it did not work with both schedulers enabled, but it is expected to do so soon.

Bibliography

Corbett85: Corbett, Robert Paul, Static Semantics in Compiler Error Recovery, June 1985, Report No. UCB/CSD 85/251.

Stallman89: Stallman, Richard M., Using and Porting GNU CC, June 1989, Free Software Foundation, Cambridge MA, 02139.

Hennessey88: Hennessey, John L., EE 282 Class Notes, September-December 1989, Stanford University, Stanford CA, 94305.

Fisher81: Fisher, J. A., Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers C-30(7): 478-490, 1981.

Appendix

Here are the numbers of GCC compiled without GCC instruction scheduling:

        Cold                             Warm

time in parse: 4.750000          time in parse: 4.490000
time in integration: 0.000000    time in integration: 0.000000
time in jump: 0.970000           time in jump: 1.000000
time in cse: 4.600000            time in cse: 4.310000
time in loop: 1.030000           time in loop: 1.050000
time in flow: 5.140000           time in flow: 5.560000
time in combine: 1.470000        time in combine: 1.430000
time in sched: 0.940000          time in sched: 0.890000
time in local-alloc: 3.520000    time in local-alloc: 3.400000
time in global-alloc: 3.880000   time in global-alloc: 3.870000
time in dbranch: 0.000000        time in dbranch: 0.000000
time in final: 1.910000          time in final: 1.830000
time in varconst: 0.060000       time in varconst: 0.030000
time in symout: 0.000000         time in symout: 0.000000
time in dump: 0.000000           time in dump: 0.000000
28.1u 0.5s 0:32 88% 0+1232k 20+21io 89pf+0w
27.8u 0.4s 0:30 91% 0+1240k 3+18io 4pf+0w

With GCC instruction scheduling:

time in parse: 4.470000          time in parse: 4.500000
time in integration: 0.000000    time in integration: 0.000000
time in jump: 0.830000           time in jump: 0.850000
time in cse: 4.080000            time in cse: 4.090000
time in loop: 0.920000           time in loop: 0.940000
time in flow: 5.510000           time in flow: 4.720000
time in combine: 1.420000        time in combine: 1.370000
time in sched: 0.860000          time in sched: 0.920000
time in local-alloc: 3.290000    time in local-alloc: 2.990000
time in global-alloc: 3.650000   time in global-alloc: 3.630000
time in dbranch: 0.000000        time in dbranch: 0.000000
time in final: 1.800000          time in final: 1.810000
time in varconst: 0.010000       time in varconst: 0.040000
time in symout: 0.000000         time in symout: 0.000000
time in dump: 0.000000           time in dump: 0.000000
26.6u 0.5s 0:30 88% 0+1256k 0+18io 6pf+0w
25.8u 0.3s 0:28 93% 0+1256k 0+18io 0pf+0w

Performance difference: 27.8 / 25.8 = 7.8%

With GCC branch scheduling: time in parse:

time in parse: 4.580000          time in parse: 4.100000
time in integration: 0.000000    time in integration: 0.000000
time in jump: 0.860000           time in jump: 0.740000
time in cse: 3.980000            time in cse: 4.070000
time in loop: 0.940000           time in loop: 0.840000
time in flow: 5.630000           time in flow: 4.250000
time in combine: 1.340000        time in combine: 1.420000
time in sched: 0.000000          time in sched: 0.000000
time in local-alloc: 2.980000    time in local-alloc: 2.920000
time in global-alloc: 3.430000   time in global-alloc: 3.490000
time in dbranch: 0.000000        time in dbranch: 0.000000
time in final: 1.880000          time in final: 1.770000
time in varconst: 0.020000       time in varconst: 0.020000
time in symout: 0.000000         time in symout: 0.000000
time in dump: 0.000000           time in dump: 0.000000
25.4u 0.5s 0:33 77% 0+1184k 20+21io 89pf+0w
23.5u 0.4s 0:26 89% 0+1184k 0+19io 0pf+0w

Performance difference: 27.8 / 23.5 = 18.2%

Updated Results

The paper was written under strict time constraints. The instruction and branch schedulers were successfully merged in July 1989. At that time, performance improvement was approximately 25% for the GNU C benchmark. The following shows the performance comparison.

With GCC branch and instruction scheduling:

time in parse: 4.370000          time in parse: 4.270000
time in integration: 0.000000    time in integration: 0.000000
time in jump: 0.930000           time in jump: 0.900000
time in cse: 3.900000            time in cse: 3.760000
time in loop: 0.860000           time in loop: 0.840000
time in flow: 4.060000           time in flow: 3.960000
time in combine: 1.260000        time in combine: 1.320000
time in sched: 0.000000          time in sched: 0.000000
time in local-alloc: 2.820000    time in local-alloc: 2.810000
time in global-alloc: 3.350000   time in global-alloc: 3.440000
time in dbranch: 0.000000        time in dbranch: 0.000000
time in final: 1.650000          time in final: 1.670000
time in varconst: 0.040000       time in varconst: 0.050000
time in symout: 0.000000         time in symout: 0.000000
time in dump: 0.000000           time in dump: 0.000000
23.0u 0.5s 0:27 86% 0+1176k 22+22io 90pf+0w
22.9u 0.4s 0:24 96% 0+1184k 1+19io 0pf+0w

Performance difference: 27.8 / 22.9 = 21.3%

GCC Compiled by Sun CC (-O2, SunOS 4.0.1 distribution)

time in parse: 5.910000          time in parse: 5.100000
time in integration: 0.000000    time in integration: 0.000000
time in jump: 0.940000           time in jump: 0.870000
time in cse: 5.130000            time in cse: 5.300000
time in loop: 1.000000           time in loop: 1.010000
time in flow: 4.620000           time in flow: 4.620000
time in combine: 1.380000        time in combine: 1.310000
time in sched: 0.000000          time in sched: 0.000000
time in local-alloc: 4.640000    time in local-alloc: 5.070000
time in global-alloc: 3.720000   time in global-alloc: 3.700000
time in dbranch: 0.000000        time in dbranch: 0.000000
time in final: 1.980000          time in final: 1.850000
time in varconst: 0.040000       time in varconst: 0.050000
time in symout: 0.000000         time in symout: 0.000000
time in dump: 0.000000           time in dump: 0.000000
29.2u 0.5s 0:32 90% 0+1416k 6+21io 88pf+0w
28.8u 0.3s 0:30 96% 0+1424k 1+19io 0pf+0w

Performance difference: 28.8 / 22.9 = 25.7%