Instruction scheduling: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 00:48, 3 February 2010 edit 130.166.46.96 (talk) →See also ← Previous edit		Latest revision as of 23:01, 5 July 2025 edit undo Avessa (talk \| contribs) 244 edits →Compiler examples: add uops.info
(46 intermediate revisions by 25 users not shown)
Line 1: {{Short description\|Compiler optimization technique}} ~~{{Unreferenced\|date=January 2009}}~~ {{Refimprove\|date=April 2018}} In [[computer science]], '''instruction scheduling''' is a [[compiler optimization]] used to improve [[instruction-level parallelism]], which improves performance on machines with [[instruction pipeline]]s. Put more simply, it tries to do the following without changing the meaning of the code~~, it tries to~~ : * Avoid [[pipeline stall]]s by rearranging the order of instructions.▼ * Avoid illegal or semantically ambiguous operations (typically involving subtle instruction pipeline timing issues or non-interlocked resources.)▼ ▲* Avoid [[pipeline stall]]s by rearranging the order of instructions.<ref name="ColdScheduling_1994"/> ▲* Avoid illegal or semantically ambiguous operations (typically involving subtle instruction pipeline timing issues or non-interlocked resources.). <!-- * order the instructions to avoid duplicated memory access This doesn't belong in instruction scheduling Line 17 ⟶ 19: Technically, there is a fourth type, Read after Read (RAR or "Input"): Both instructions read the same ___location. Input dependence does not constrain the execution order of two statements, but it is useful in scalar replacement of array elements. To make sure we respect the three types of dependencies, we construct a dependency graph, which is a [[directed graph]] where each vertex is an instruction and there is an edge from I<sub>1</sub> to I<sub>2</sub> if I<sub>1</sub> must come before I<sub>2</sub> due to a dependency. If loop-carried dependencies are left out, the dependency graph is a [[directed acyclic graph]]. Then, any [[topological sort]] of this graph is a valid instruction schedule. The edges of the graph are usually labelled with the '''latency''' of the dependence. This is the number of clock cycles that needs to elapse before the pipeline can proceed with the target instruction without stalling. ~~<!-- == Algorithms ==~~ ~~However, we want more than a valid instruction scheduling: we also want one which minimizes delays. ''Write about how it does this...'' -->~~ == Algorithms == The simplest algorithm to find a topological sort is frequently used and is known as ~~'''~~[[list scheduling~~'''~~]]. Conceptually, it repeatedly selects a source of the dependency graph, appends it to the current instruction schedule and removes it from the graph. This may cause other vertices to be sources, which will then also be considered for scheduling. The algorithm terminates if the graph is empty. To arrive at a good schedule, stalls should be prevented. This is determined by the choice of the next instruction to be scheduled. A number of heuristics are in common use: * The processor resources used by the already scheduled instructions are recorded. If a candidate uses a resource that is occupied, its priority will drop. * If a candidate is scheduled closer to its predecessors than the associated latency, its priority will drop. * If a candidate lies on the critical path of the graph, its priority will rise. This heuristic provides some form of look-ahead in an otherwise local decision process. * If choosing a candidate will create many new sources, its priority will rise. This heuristic tends to generate more freedom for the scheduler. == ~~The phase~~Phase order ~~of Instruction Scheduling~~ == Instruction scheduling may be done either before or after [[register allocation]] or both before and after it. The advantage of doing it before register allocation is that this results in maximum parallelism. The disadvantage of doing it before register allocation is that this can result in the register allocator needing to use a number of registers exceeding those available. This will cause spill/fill code to be introduced, which will reduce the performance of the section of code in question. If the architecture being scheduled has instruction sequences that have potentially illegal combinations (due to a lack of instruction interlocks), the instructions must be scheduled after register allocation. This second scheduling pass will also improve the placement of the spill/fill code. If scheduling is only done after register allocation, then there will be false dependencies introduced by the register allocation that will limit the amount of instruction motion possible by the scheduler. == Types ~~of Instruction Scheduling~~ == There are several types of instruction scheduling: #~~Basic~~''Local'' ~~Block~~(''basic ~~Scheduling~~block'') ''scheduling'': instructions can't move across basic block boundaries. #''Global scheduling'': instructions can move across basic block boundaries. #''Modulo ~~Scheduling~~scheduling'': ~~another~~an ~~name~~algorithm for generating [[software pipelining]], which is a ~~form~~way of increasing instruction ~~scheduling~~level ~~that~~parallelism ~~interleaves~~by interleaving different iterations of aan inner [[~~Control_flow~~Control flow#Loops\|loop]]. #''[[Trace scheduling]]'': athe first practical ~~form~~approach offor global scheduling, ~~that~~trace scheduling tries to optimize the control flow path that is executed most often. #''Superblock scheduling'': ana ~~improved~~simplified form of trace scheduling ~~algorithm.~~which Itdoes ~~has~~not ~~single~~attempt ~~entry~~to ~~multi~~merge ~~exits~~control ~~shape~~flow paths at trace "side entrances". Instead, code can be implemented by ~~tail~~more ~~duplication~~than ~~and~~one ~~simplify~~schedule, vastly simplifying the ~~scheduling~~code ~~work~~generator. ==~~See~~ ~~also~~Compiler examples == The [[GNU Compiler Collection]] is one compiler known to perform instruction scheduling, using the {{code\|-march}} (both instruction set and scheduling) or {{code\|-mtune}} (only scheduling) flags. It uses descriptions of instruction latencies and what instructions can be run in parallel (or equivalently, which "port" each use) for each microarchitecture to perform the task. This feature is available to almost all architectures that GCC supports.<ref>{{cite web \|title=x86 Options \|url=https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html \|website=Using the GNU Compiler Collection (GCC)}}</ref> ~~{{Portal\|Computer science}}~~ * [[Code generation (compiler)\|Code generation]]▼ * [[Register allocation]] * [[Instruction selection]]▼ * [[Branch predication]] Until version 12.0.0, the instruction scheduling in [[LLVM]]/Clang could only accept a {{code\|-march}} (called {{code\|target-cpu}} in LLVM parlance) switch for both instruction set and scheduling. Version 12 adds support for {{code\|-mtune}} ({{code\|tune-cpu}}) for x86 only.<ref>{{cite web \|title=⚙ D85384 [X86] Add basic support for -mtune command line option in clang \|url=http://reviews.llvm.org/D85384 \|website=reviews.llvm.org}}</ref> [[Category:Compiler optimizations]]▼ Sources of information on latency and port usage include: * GCC and LLVM; * [[Agner Fog]], who compiles extensive data for the [[x86 architecture]];<ref name="optimize">{{cite web \|title=Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X \|url=https://www.agner.org/optimize/ \|website=Agner Fog}}</ref> * InstLatx64, which uses [[AIDA64]] to collect data on x86 CPUs.<ref>{{cite web \|title=x86, x64 Instruction Latency, Memory Latency and CPUID dumps \|url=http://instlatx64.atw.hu/ \|website=instlatx64.atw.hu}} See also the "Comments" link on the page.</ref> * uops.info, which provides latency, throughput, and port usage information for x86 microarchitectures.<ref>[https://uops.info/ uops.info]</ref><ref>{{Cite conference \|title=uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures \|last1=Abel \|first1=Andreas \|last2=Reineke \|first2=Jan \|date=2019 \|conference=[[International Conference on Architectural Support for Programming Languages and Operating Systems]], Providence, RI, USA, April 13–17, 2019 \|arxiv=1810.04610 \|doi=10.1145/3297858.3304062 \|isbn=978-1-4503-6240-5 \|pages=673-686 \|book-title=ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems \|publication-date=April 2019 \|publication-place=New York, NY, USA \|publisher=[[Association for Computing Machinery\|ACM]] }}</ref> LLVM's {{code\|llvm-exegesis}} should be usable on all machines, especially to gather information on non-x86 ones.<ref>{{cite web \|title=llvm-exegesis - LLVM Machine Instruction Benchmark \|url=https://llvm.org/docs/CommandGuide/llvm-exegesis.html \|website=LLVM 12 Documentation}}</ref> ~~[[ja:命令スケジューリング]]~~ ==See also == ~~{{Portal\|Computer science}}~~ * [[Code generation (compiler)\|Code generation]] * [[Register allocation]] * [[Instruction selection]] * [[Branch predication]] ▲* [[Code generation (compiler)\|Code generation]] ▲* [[Instruction ~~selection~~unit]] * [[Out-of-order execution]] ==References== [[Category:Compiler optimizations]]▼ {{reflist\|refs= <ref name="ColdScheduling_1994">{{cite report \|author-first1=Ching-Long \|author-last1=Su \|author-first2=Chi-Ying \|author-last2=Tsui \|author-first3=Alvin M. \|author-last3=Despain \|url=http://www.scarpaz.com/2100-papers/Power%20Estimation/su94-low%20power%20architecture%20and%20compilation.pdf \|title=Low Power Architecture Design and Compilation Techniques for High-Performance Processors \|date=1994 \|publisher=Advanced Computer Architecture Laboratory \|id=ACAL-TR-94-01}} (''Cold scheduling'')</ref> }} ==Further reading== ~~[[ja:命令スケジューリング]]~~ * {{cite journal \|author-first=Joseph A. \|author-last=Fisher \|author-link=Joseph A. Fisher \|title=Trace Scheduling: A Technique for Global Microcode Compaction \|journal=IEEE Transactions on Computers \|volume=30 \|number=7 \|pages=478–490 \|date=1981 \|doi=10.1109/TC.1981.1675827 \|s2cid=1650655 }} (''[[Trace scheduling]]'') * {{cite journal \|author-first1=Alexandru \|author-last1=Nicolau \|author-first2=Joseph A. \|author-last2=Fisher \|author-link2=Joseph A. Fisher \|title=Measuring the Parallelism Available for Very Long Instruction Word Architectures \|journal=IEEE Transactions on Computers \|volume=33 \|number=11 \|date=1984}} (''Percolation scheduling'') * {{cite journal \|author-first1=David \|author-last1=Bernstein \|author-first2=Michael \|author-last2=Rodeh \|title=Global Instruction Scheduling for Superscalar Machines \|journal=Proceedings of the ACM, SIGPLAN '91 Conference on Programming Language Design and Implementation \|date=June 1991\|url=http://pages.cs.wisc.edu/~fischer/cs701.f06/berstein_rodeh.pdf}} (''Global scheduling'') * {{cite web \|last1=Cordes \|first1=Peter \|title=assembly - Instruction reordering in x86 / x64 asm - performance optimisation with latest CPUs \|url=https://stackoverflow.com/a/45970664 \|website=Stack Overflow}} ▲~~[[Category:~~{{Compiler optimizations]]}} ▲[[Category:Compiler optimizations]]