Just as the back end of the compiler was designed to be portable across different register-based machine platforms, the front-end language processor was designed to support different languages. Both the front-end and back-end achieve their power by implementing a set of powerful primitives. For example, there's only one looping construct in in the front-end; this construct is called the loop statement. When a C while or for statement is read in, it is represented as a combination of a loop statement and few other statement types perhaps put around the loop statement. In the case of a for, where there is an initialization, it is represented as a compound of initialize followed by loop. The end tests that are built in in a while or for statements are represented using one exit the loop construct. Such organization makes life easier for the programmer writing a new front-end, but may make life more difficult for the programmer writing an optimizer to perform optimizations based on elaborate loop analysis.
The GNU C compiler, like most optimizing compilers, runs in many passes. The front end builds the syntax tree bottom up, using an LALR parser built by Bison [Corbett85]. When expression syntax grows to statement syntax, the statement is translated from its tree representation to its RTL representation. RTL representation looks like lisp lists; RTL objects are self-describing, and can be printed, copied, and traversed in much the same way that lisp lists are. The RTL representation for instructions is the aggregation of a Unique Identifier (UID), pointers to previous and next RTL instruction objects (making the insn list a doubly linked list, hence efficiently supporting insertion and deletion of other insns), the instruction pattern, the instruction code (a memoized value which associates the instruction object with a specific instruction template in the machine description), the instruction's logical links (pointers to instructions which feed this instruction via dataflow), and register notes (a list of assertions about the properties of the registers used by this RTL instruction object).
Once the primitive instructions (insns) have been emitted in response to requests from the front end, the passes of the back end go to work. The first pass is jump optimization, which is powerful and fast. When first run, its purpose is to remove dead code, so that the compiler has less code it must slog through. After initial jump optimization has been performed, common subexpressions are eliminated. Common subexpression elimination (CSE) is performed for each extended basic block (a sequence of code with a unique entry point, but possibly many exits points). If after performing common subexpression elimination, branch statements have been affected, jump optimization is run again, to remove any newly discovered dead code. CSE is only half-smart about memory: it knows to distinguish scalar variables from aggregates, and unknown addresses from either, but its ability to understand memory references is limited. A consequence is that often writes to independent memory locations tend to destroy valid common subexpressions.
Having simplified the code with jump and CSE optimizations, loop optimizations are performed. Loop invariant code motion, loop inversion, induction variable elimination, strength reduction, and use of special machine constructs (such as decrement and branch) are applied heuristically. Loop optimizations not performed include loop peeling, loop fission and loop fusion. Loop unrolling, perhaps most important for superscalar machines, is also not implemented. Loop optimizations may have again affected the branching behavior of the program; if so, jump optimization is again performed.
As discussed previously, all the different kinds of loops provided by e.g. C are mapped to a single kind of RTL-level looping structure. This makes some kinds of loop-based optimizations harder to do. For example, to get better performance from a superscalar machine, the compiler should unroll loops which do not have inter-iteration dependencies to achieve some degree of software pipelining. To compute such dependencies, it is very helpful to know the precise nature of the loop: its beginning, its body, its iteration step code, and its termination condition. As things stand right now, the loop optimizer would have to derive all that information itself from the RTL - a non-trivial task.
Common subexpression elimination and loop optimizations can take place without knowledge about basic blocks. The remaining optimizations depend on basic-block level data-flow analysis. A pass computes the basic blocks, initializing vectors which describe characteristics of basic blocks (live registers on entry) characteristics of registers within and across basic blocks (number of times set, number of times referenced, live across how many instructions, live in which basic blocks, number of calls crossed), and characteristics of insns (what registers die, what insns a data-flow sources for this insn). Data-flow analysis computes live variables, and hence can detect what become dead stores, which can then be eliminated.
Having computed such detailed information, the compiler is prepared to perform Frazier-Davidson style code optimization. The essence of this technique is to look for two or three insns related by data-flow, and to see whether the insns can be rewritten as a single insn which results from performing algebraic substitution.
The following example shows how the machine description supports an accumulate-multiply instruction. A complete description of the machine description syntax and semantics are beyond the scope of this paper, but the interested reader is encouraged to consult Using and Porting GNU CC [Stallman89] for more information. If the machine description contains e.g. the following patter:
;; FPA Add and multiply
(define_insn ""
[(set (match_op:DF 0 "register_op" "=x,y,y")
(mult:DF (plus:DF (match_op:DF 1 "register_op""=xH,y,y")
(match_op:DF 2 "general_op" "x,y,rmF"))
(match_op:DF 3 "general_op" "xH,rmF,y")))]
"TARGET_FPA"
"*
if (which_alternative == 0)
return \"fpam%.d %2,%w1,%w3,%0\";
return \"fpam%.d %x2,%1,%x3,%0\";
")
Then when the combiner comes across these two (possibly non-adjacent) patterns:
(insn 78 77 79
(set (reg:DF 99)
(plus:DF (reg:DF 88)
(reg:DF 89))) -1 (nil)
(expr_list:REG_DEAD (reg 89) (nil)))
(...)
(insn 92 91 93
(set (reg:DF 100)
(mult:DF (reg:DF 99)
(reg:DF 80))) -1 (insn_list 78 (nil))
(nil))
it can substitute the source of the set for pseudoregister 99 in insn 78 for the use of pseudoregister 99 in insn 92, leading to:
(note 78 77 79 "" NOTE_INSN_DELETED)
(...)
(insn 92 91 93
(set (reg:DF 100)
(mult:DF (plus:DF (reg:DF 88)
(reg:DF 89))
(reg:DF 80))) -1 (insn_list 78 (nil))
(expr_list:REG_DEAD (reg 89) (nil)))
This technique can be thought of as performing peephole optimization in data-flow space. A much later pass handles peephole optimization in the conventional manner, by attempting to combine physically adjacent instructions.
After instruction selection has been performed, the compiler goes about allocating registers. Register allocation is performed in two passes. The first pass allocates registers within basic blocks. Only registers local to a single basic block, and which do not cross calls are allocated in this pass. The algorithm is efficient, and does a good, though limited, job. Register targeting is performed, which means that if the machine needs a particular register to contain a computed value, the register allocator tries to make the computed value wind up in the desired register. Its strategy for doing so is simple however, and it sometimes fails, resulting in register shuffling.
After basic block register allocation has been performed, the compiler allocates the remaining registers by building a conflict matrix, and proceeds to allocate registers by graph coloring. Heuristics use a priority function to determine which quantities should be register allocated and which quantities should go to memory. The priority function assigns quantities priorities computed from the value of the number of times a register is used times the logarithm of the number of insns across which the register is live, divided by the number of insns across which the register is live. Quantities in loops have a weight which makes them more likely to be register allocated.
It is possible that during global register allocation, temporary registers are needed to compute an intermediate result. The register allocator does not know how to find a global register which may happen to be free to perform this computation, and all local register may have been allocated by the previous phase. In such a case, a local register must be spilled. This means throwing away all quantities previously held by that register within that basic block; the quantities thus spilled will have to be computed again using different registers.
Once all registers have been allocated, the compiler can perform branch scheduling. Not all machines have delayed branches; for machines that do not, this pass is not run. The branch scheduler walks through basic blocks looking for unfilled delay slots. When one is found, it searches the basic block for instructions which could be put in its delay slot. The branch scheduler is quite general. It was initially designed to support a machine with zero to three delay slots, with or without nullification. Adapting it to schedule only one delay slot is straightforward.
A final jump optimization combined with peephole optimization is run. During this jump optimization phase, cross jump optimizations are performed. The peephole optimizer runs concurrently, and recognizes machine idioms which are difficult or impossible to represent using data-flow relationships.
Finally, assembly code is produced. At this time, redundant conditional tests are removed, and other machine-specific optimizations can be performed by the assembly output routines themselves.
The algorithm for list scheduling is as follows: starting with a basic block, compute instruction priorities. An instruction with priority P1 must be scheduled to execute later than any instruction with priority P2 if P1 > P2.
An instruction I1 which depends on an instruction I2 cannot execute before I2. By giving I1 a larger priority number than I2, and always placing larger-prioritied instructions later in the schedule than smaller-prioritied ones, we preserved correct execution order. Instructions which take more time to compute are also given larger priority numbers. This has the effect of exposing the critical path: the instruction with the largest priority number is that last instruction on the critical path. After computing all priorities, the scheduler marks all instructions on which no instructions depend ready. These instructions are then sorted by priority order. The last instruction in the basic block is always on the ready list, but it need not always have the highest priority number.
The algorithm then starts scheduling by issuing the instruction with the highest priority order. This procedure will build the instruction block from last instruction to first. Each time we schedule an instruction, we check to see whether instructions on which it depended now have no more dependers. For each such instruction, we can mark it ready, and add it to the ready list in priority order. When all instructions have been scheduled, the algorithm terminates, and the next block, if there is one, can be scheduled.
In a vacuum, the implementation of instruction scheduling is quite straightforward, amounting to no more than performing a reverse topological sort on instruction dependencies (or a topological sort on anti-dependencies, depending on one's point of view), adding weights to break ties between some topologically equivalent instructions. The first implementation of list scheduling for the GNU C compiler was essentially this algorithm, but the results were not very good: on a matrix multiply program in which the inner multiply loop had been unrolled four times reduced execution speed by nearly a factor of two! What went wrong?
To handle this problem, the priority function for certain instructions was modified as follows. When the scheduler chooses an instruction in which a register dies, the scheduler becomes sensitive to the instruction where that register was born. As soon as that instruction becomes ready (i.e., it has no further dependers), its priority is increased so that it will be the next instruction scheduled. The effect of this modification is to reduce register lives without greatly affecting parallelism.
Unfortunately, though this modification halved the number of register spills introduced by the scheduler, it did not eliminate all of them. This was because when several instructions' birthing registers became ready simultaneously, they all had the same high priority, and hence could still interfere with one another. The solution to this problem was to introduce a gradient on the priorities: the first birthing instruction got highest priority, and subsequent birthing instructions got lower priorities. This caused the scheduler to generate the schedule compute, load, compute, load, ... instead of compute, compute, ..., load, load, ..., and all spills were eliminated.
The GNU C compiler makes a number of simplifying assumptions which inhibit instruction scheduling freedom. For example, it is assumed that any instruction which tests condition codes immediately follows the instruction which set them. Similarly, it is assumed that before register allocation has been performed, machine registers (not pseudo registers) are only used for specific purposes, and the compiler takes shortcuts based on such assumptions. For example, when performing a subroutine call on a machine which passes arguments in registers, it is assumed that no instructions between the loading of the parameter registers and the use of the function return value intervene.
A final design goal of the instruction scheduler was to design it to be easily understood and improved upon. If it is to gain wide acceptance and use across many machines, its design must be simple and robust. For example, it should be possible to bootstrap the compiler with instruction scheduling enabled in a relatively short amount of time. It should also be cooperative with other passes of the compiler. A scheduler which only just works is not good enough; it must be sufficiently general and robust that others can modify other passes of the compiler, and their modifications will work, or will fail to work in easily understood, or at least documented ways.
Preserving debugging information is a hard problem, especially when instructions are moved across binding contours, e.g.,
Original Code Scheduled Code
{ {
double i = 3.14; double i(1) = 3.14;
int b; int b(1);
{ {
int i = foo (); int i(2) = foo ();
int *p; int *p(1);
b *= i; i(1)*= 2;
*p -= i; b(1) *= i(2);
} }
b -= 2; *p(1)-= i(2);
i *= 2; b(1) -= 2;
} }
The problem is, how can the debugger maintain correspondence between the binding contours of the program and the instructions in the program if instructions can move freely across them? I do not have a good answer for this, but I have preserved all debugging information when instructions do not move across binding contours. It should be pointed out that most binding contours appear at program control points, such as loops and conditionals. Since instructions are only scheduled within a basic block, the distance an instruction can travel is limited to remain within the control construct, and hence the binding contour.
The instruction scheduler that I implemented is currently one of the faster passes of the compiler. Mysteriously, the compiler is sometimes faster when performing instruction scheduling then when not. From profiling information, it appears that scheduled instructions utilizes registers in a way that simplifies the task of local register allocation. Time spent scheduling is often offset by time saved in register allocation. For such a fast instruction scheduler, the performance is pretty good. On a Sun 4/110, where loads and stores take three cycles when an interlock must be used, or two cycles if not, a large application (GNU CC itself) shows a speedup of 8%. Assuming that loads and stores constitute 25% of the instruction mix [Hennessey88], the maximum performance improvement achievable from just scheduling loads and stores is 0.25 (frequency) * 0.50 (speedup) = 0.125, or 12.5%. On the Sun 4/110, floating point operations are not pipelined, so there is no benefit to be gained there. Thus 8% is not bad, and shows that good instruction schedulers can be made fast.
The instruction scheduler was implemented to write out data dependency information to assist the compiler writer in analyzing its decisions. Instructions which are ready to execute, instruction priorities, stalls, and successfully filled delays slots are all printed prior to dumping the function's RTL. It is hoped that this information will be helpful to people who are interested in experimenting with, or measuring the performance of, the priority function.
Alas, the instruction scheduler and the branch scheduler do not yet work together. The main problem is caused by the fact that in order to get the branch scheduler to use instructions which come after the branch, data-flow information is needed. Unfortunately, the GNU C compiler does not keep its data flow information up to date as the compiler progresses. Specifically, optimizations such as cross-jumping change completely the landscape of the code, altering basic blocks in ways not accounted for. I was able to reconstruct most of the data needed to still do a decent job filling branch delay slots, but in so doing, introduced new data dependencies which violated assumptions made by the instruction scheduler. The compiler has been bootstrapped with instruction scheduling and branch scheduling as mutually exclusive options, but it does not yet work with both options enabled at once.
I extended the branch scheduler by adding a second pass to fill slots left previously unfilled (by the original branch scheduler) with instructions taken either from the target or following the branch. Strategies and tradeoffs for eager branch scheduling can be found in [Hennessey88]. After making this work (which was harder than anticipated), I compared the performance of an application (the compiler itself) compiled with instruction scheduling versus that same application compiled with branch scheduling.
Under test, branch scheduling showed an 18% performance improvement, while instruction scheduling showed only about 8%. Why such a large difference if there are about as many branch delays to fill as there are load and store delays? Without precise measurements, only an approximate answer can be given. First, the branch scheduler reduces the cost of a branch instruction from two cycles to one, where the instruction scheduler reduces the cost of loads and stores from three cycles to two. Second, many memory delay slots may be filled just by luck, say about one fourth the time. Such accounting would make an perfectly efficient instruction scheduler give a performance improvement of 14%, versus a perfectly efficient branch scheduler a performance improvement of 18%. The branch scheduler often separates loads from uses across the branch instruction, which is pure fortitude. Therefore, in the estimated scheduler performance, the cost of memory operations is somewhat reduced. Also, the compiler has some ability to fill branch delay slots using peephole optimizations, independent of the branch scheduler, hence in the second table, branch delay is estimated to be on average 1.9 cycles per instruction.
Perfect Schedulers
CPI unscheduled i-scheduled b-scheduled
25 load/store 2.8 ~/insn 2.0 ~/insn 2.7 ~/insn
25 branch 1.9 ~/insn 1.9 ~/insn 1.0 ~/insn
50 ALU 1.0 ~/insn 1.0 ~/insn 1.0 ~/insn
100 insns 1.68 ~/insn 1.48 ~/insn 1.42 ~/insn
Performance improvement: 1.00 1.14 1.18
Estimated Schedulers
CPI unscheduled i-scheduled b-scheduled
25 load/store 2.8 ~/insn 2.3 ~/insn 2.7 ~/insn
25 branch 1.9 ~/insn 1.9 ~/insn 1.2 ~/insn
50 ALU 1.0 ~/insn 1.0 ~/insn 1.0 ~/insn
100 insns 1.68 ~/insn 1.55 ~/insn 1.47 ~/insn
Performance improvement: 1.00 1.08 1.14
If the analysis is accurate, the best performance improvement one can expect to see on a Sun4/110 using GNU C's scheduling optimizers would be 34%; using the estimates in the second table, a 23% performance improvement can be expected using both schedulers.
There are two major aspects of the instruction scheduler: the work done for the scheduler and the work done by the scheduler. The work done for the scheduler involved modifying the data flow analysis pass of the compiler to compute dependency information. Initially, inter-instruction dependencies were only recorded between the setting of a register and its first use. This had to be modified so that all uses depended on the preceding set, and that following sets depended on the preceding set as well. Hard registers and function calls complicated matters further: dependencies had to be added to preserve call order, and to ensure that parameter passing for one function was not interleaved with parameter passing for another. Adjustments to the stack pointer rounded out the kind of operations about which the scheduler had to be cautious.
Live register information is often represented in compilers by bitvectors, and GNU C is no exception. The total space consumed by bits is the number of basic blocks times the number of pseudoregisters used by the function. Originally, flow analysis only needed to allocate from the heap bits to record the registers live on entry to a basic block. It could allocate the bits for registers live on exit from a basic block from the stack. When performing instruction scheduling, both must be allocated from the heap, because accurate live register information must be maintained after scheduling for register allocation.
The additional overhead of adding all these dependencies and extra bitvectors slows down flow analysis, already one of the slowest passes in the compiler, by about 10%.
Collecting the dependencies was a straightforward task, but keeping the information accurate was not so simple. When two instructions are combined to make a new instruction, flow information must be propagated from the instructions disappearing to the new instruction. Consider a multiply instruction I1 and an add instruction I2 becoming a multiply-accumulate instruction. If a third instruction, I3, had to execute after I1, how will it know that it must now execute after I2? To keep the scheduler's memory utilization within reasonable limits, I utilized transitive dependencies whenever possible. This meant that after the combiner ran, the dependencies could be in shambles. It is not desirable to punt and re-run flow analysis when interesting instruction reorganizations are performed, since this would have to be done twice - both before and after instruction scheduling, slowing the compiler by an additional 40%. Instead, the combiner and the scheduler were modified to incrementally fix most dependency information that they broke. The additional time spent by the combiner doing these fixups is virtually unmeasureable. For the scheduler, the additional time is also quite small, though the solution is much trickier.
The scheduler has found the new instructions made by the combiner based on dependencies pointing to deleted instructions. In many cases the deleted instruction became part of the current instruction, hence no dependency is needed at all. The remaining cases involve searching for the merged instruction. The search uses a table to record all dependency information derived while searching for some particular dependence. As the table is built, searches become shorter.
The instruction scheduler runs before register allocation. This gives the scheduler much more flexibility in moving instructions around, but it can lead to worse registers allocation. For example, nothing prevents the scheduler from moving instructions across calls. Such code motion may be detrimental if moving the instruction causes a register not previously live across a call to become so. Running the scheduler before register allocation can also result in incomplete instruction scheduling: if a quantity fails to be register allocated (i.e. spilled to memory), the memory access may affect the schedule. It would be interesting to design an incremental scheduler which could take care of such problems without rescheduling the whole block.
To save time, the scheduler does nothing if the instructions are already in their priority order. The scheduler could look for other blocks with which to merge in order to improve schedule of both blocks. Selecting the blocks with which to merge, and setting thresholds on code duplication both probably need good heuristics. Merging would also mean extending the incremental life analysis performed by the scheduler.
The scheduler has a very simple model of memory accesses. Three functions, true_dependence, anti_dependence, and output_dependence a predicate functions which return the truth value of the relationship between two memory references for true-, anti-, and output-dependencies respectively. They are currently implemented by simply returning TRUE for all references. Making these functions return more accurate answers would give the scheduler more freedom to schedule, since their would be fewer dependencies acting as constraints.
The loop optimization pass is run well before the scheduling pass, thus the scheduler has no input on how loops should be unrolled. It would be interesting to see how one can reconcile the independence of these two interdependent passes.
Some machines do not have primitive support for basic algebraic operations such as multiplication. When this is the case, the compiler writer must make the tradeoff of exposing the machine's weakness and hiding the actual machine implementation. For example, for a machine without multiplication, the compiler can implement multiplication by constants as a sequence of shifts and additions. If multiplication is transformed into shifts and adds too early, the algebraic simplifiers may miss some simplifications, whereas if it is transformed too late, the sequence of instructions will not be available for instruction scheduling.
To make good use of instruction scheduling, basic blocks should be large enough to expose a good deal of instruction level parallelism. This generally means performing procedure integration and loop unrolling. The mechanics of implementing these methods in GNU C is quite simple; in fact, procedure integration is already done. The main difficulty is deciding which procedure to integrate, or deciding which loop to unroll and how many times to do so. Good loop unrolling is probably the next important optimization that should be implemented in the GNU C compiler.
For the basic blocks where there is a fair amount of instruction level parallelism, there is still about a 50-50 chance that the scheduler will build a worse schedule rather than a better one. With the many registers available on a Sparc (due to its use of register windows), the down side is often not seen, but here is an example which exposes it:
static void
decode (shorts, low, hi)
short *shorts; int *low;
{
*low = (shorts[3] << 24)
| (shorts[2] << 16)
| (shorts[1] << 8)
| shorts[0];
}
The GNU C compiler will, on its own, generate nice code for this, while the scheduler will try to do too much in parallel, and get lost:
ldsh [%o0+6],%o3 add %sp,-88,%sp
sll %o3,24,%o3 st %l0,[%sp+80+0]
ldsh [%o0+4],%o4 ldsh [%o0+2],%o4
sll %o4,16,%o4 ldsh [%o0],%o5
or %o3,%o4,%o3 sll %o4,8,%o4
ldsh [%o0+2],%o4 ldsh [%o0+4],%l0
sll %o4,8,%o4 ldsh [%o0+6],%o3
or %o3,%o4,%o3 sll %l0,16,%l0
ldsh [%o0],%o4 sll %o3,24,%o3
or %o3,%o4,%o3 or %o3,%l0,%o3
st %o3,[%o1] or %o3,%o4,%o3
retl or %o3,%o5,%o3
nop st %o3,[%o1]
ld [%sp+80+0],%l0
retl
add %sp,88,%sp
The unscheduled code has stalls due to interlocks, but the scheduled code executes an extra load, an extra store, and a stack manipulation operation, resulting in slower overall code. It would be nice to see future work improve the priority assignment function so that the scheduler always makes at least code-improving transformations.
Scheduling across basic blocks is another way to achieve more parallelism. This technique was not attempted for this project, but should extend from this work in a straightforward manner. As with loop unrolling and procedure integration, cross-basic-block scheduling can result in code explosion. The heuristics of GNU C compiler perform their optimizations based only on the code being compiled. While they tend to do a good job, a number of these optimizations could benefit from profiling information. In particular, profile information should be used to guide cross-basic-block instruction scheduling.
The instruction scheduler should be modified to work after register allocation. Register spills become loads and stores, and can change the characteristics of the instruction stream enough to warrant rescheduling. It would be trivial to run the scheduler after register allocation, just to take care of the cases of register spills. What would be more interesting is to modify the scheduler and/or the register to note the cause of the spill. If the spill occurred because a register's life was wantonly extended (perhaps across a call or two), the scheduler should attempt to shorten that register's life when next it builds its schedule. Live-range splitting is not implemented in GNU C, but it would be logical to have the live-range splitter and the scheduler coordinate their actions.
The GNU C compiler has been a willing partner in this undertaking. It showed fine sportsmanship in permitting itself to be bootstrapped with the two schedulers presented here. It was a disappointment that it did not work with both schedulers enabled, but it is expected to do so soon.
Stallman89: Stallman, Richard M., Using and Porting GNU CC, June 1989, Free Software Foundation, Cambridge MA, 02139.
Hennessey88: Hennessey, John L., EE 282 Class Notes, September-December 1989, Stanford University, Stanford CA, 94305.
Fisher81: Fisher, J. A., Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers C-30(7): 478-490, 1981.
Cold Warm
time in parse: 4.750000 time in parse: 4.490000
time in integration: 0.000000 time in integration: 0.000000
time in jump: 0.970000 time in jump: 1.000000
time in cse: 4.600000 time in cse: 4.310000
time in loop: 1.030000 time in loop: 1.050000
time in flow: 5.140000 time in flow: 5.560000
time in combine: 1.470000 time in combine: 1.430000
time in sched: 0.940000 time in sched: 0.890000
time in local-alloc: 3.520000 time in local-alloc: 3.400000
time in global-alloc: 3.880000 time in global-alloc: 3.870000
time in dbranch: 0.000000 time in dbranch: 0.000000
time in final: 1.910000 time in final: 1.830000
time in varconst: 0.060000 time in varconst: 0.030000
time in symout: 0.000000 time in symout: 0.000000
time in dump: 0.000000 time in dump: 0.000000
28.1u 0.5s 0:32 88% 0+1232k 20+21io 89pf+0w
27.8u 0.4s 0:30 91% 0+1240k 3+18io 4pf+0w
With GCC instruction scheduling:
time in parse: 4.470000 time in parse: 4.500000 time in integration: 0.000000 time in integration: 0.000000 time in jump: 0.830000 time in jump: 0.850000 time in cse: 4.080000 time in cse: 4.090000 time in loop: 0.920000 time in loop: 0.940000 time in flow: 5.510000 time in flow: 4.720000 time in combine: 1.420000 time in combine: 1.370000 time in sched: 0.860000 time in sched: 0.920000 time in local-alloc: 3.290000 time in local-alloc: 2.990000 time in global-alloc: 3.650000 time in global-alloc: 3.630000 time in dbranch: 0.000000 time in dbranch: 0.000000 time in final: 1.800000 time in final: 1.810000 time in varconst: 0.010000 time in varconst: 0.040000 time in symout: 0.000000 time in symout: 0.000000 time in dump: 0.000000 time in dump: 0.000000 26.6u 0.5s 0:30 88% 0+1256k 0+18io 6pf+0w 25.8u 0.3s 0:28 93% 0+1256k 0+18io 0pf+0w
Performance difference: 27.8 / 25.8 = 7.8%
With GCC branch scheduling: time in parse:
time in parse: 4.580000 time in parse: 4.100000 time in integration: 0.000000 time in integration: 0.000000 time in jump: 0.860000 time in jump: 0.740000 time in cse: 3.980000 time in cse: 4.070000 time in loop: 0.940000 time in loop: 0.840000 time in flow: 5.630000 time in flow: 4.250000 time in combine: 1.340000 time in combine: 1.420000 time in sched: 0.000000 time in sched: 0.000000 time in local-alloc: 2.980000 time in local-alloc: 2.920000 time in global-alloc: 3.430000 time in global-alloc: 3.490000 time in dbranch: 0.000000 time in dbranch: 0.000000 time in final: 1.880000 time in final: 1.770000 time in varconst: 0.020000 time in varconst: 0.020000 time in symout: 0.000000 time in symout: 0.000000 time in dump: 0.000000 time in dump: 0.000000 25.4u 0.5s 0:33 77% 0+1184k 20+21io 89pf+0w 23.5u 0.4s 0:26 89% 0+1184k 0+19io 0pf+0w
Performance difference: 27.8 / 23.5 = 18.2%
With GCC branch and instruction scheduling:
time in parse: 4.370000 time in parse: 4.270000 time in integration: 0.000000 time in integration: 0.000000 time in jump: 0.930000 time in jump: 0.900000 time in cse: 3.900000 time in cse: 3.760000 time in loop: 0.860000 time in loop: 0.840000 time in flow: 4.060000 time in flow: 3.960000 time in combine: 1.260000 time in combine: 1.320000 time in sched: 0.000000 time in sched: 0.000000 time in local-alloc: 2.820000 time in local-alloc: 2.810000 time in global-alloc: 3.350000 time in global-alloc: 3.440000 time in dbranch: 0.000000 time in dbranch: 0.000000 time in final: 1.650000 time in final: 1.670000 time in varconst: 0.040000 time in varconst: 0.050000 time in symout: 0.000000 time in symout: 0.000000 time in dump: 0.000000 time in dump: 0.000000 23.0u 0.5s 0:27 86% 0+1176k 22+22io 90pf+0w 22.9u 0.4s 0:24 96% 0+1184k 1+19io 0pf+0w
Performance difference: 27.8 / 22.9 = 21.3%
GCC Compiled by Sun CC (-O2, SunOS 4.0.1 distribution)
time in parse: 5.910000 time in parse: 5.100000 time in integration: 0.000000 time in integration: 0.000000 time in jump: 0.940000 time in jump: 0.870000 time in cse: 5.130000 time in cse: 5.300000 time in loop: 1.000000 time in loop: 1.010000 time in flow: 4.620000 time in flow: 4.620000 time in combine: 1.380000 time in combine: 1.310000 time in sched: 0.000000 time in sched: 0.000000 time in local-alloc: 4.640000 time in local-alloc: 5.070000 time in global-alloc: 3.720000 time in global-alloc: 3.700000 time in dbranch: 0.000000 time in dbranch: 0.000000 time in final: 1.980000 time in final: 1.850000 time in varconst: 0.040000 time in varconst: 0.050000 time in symout: 0.000000 time in symout: 0.000000 time in dump: 0.000000 time in dump: 0.000000 29.2u 0.5s 0:32 90% 0+1416k 6+21io 88pf+0w 28.8u 0.3s 0:30 96% 0+1424k 1+19io 0pf+0w
Performance difference: 28.8 / 22.9 = 25.7%