Program optimization

{{Short description|Improving the efficiency of software}}
{{multiple issues|{{original research|date=September 2016}}
{{essay like|date=July 2017}}
{{Refimprove section|date=February 2018}}|collapsed=|section=}}
 
In [[computer science]], '''program optimization''', '''code optimization''', or '''software optimization''' is the process of modifying a software system to make some aspect of it work more [[algorithmic efficiency|efficiently]] or use fewer resources.<ref>[[Robert Sedgewick (computer scientist)|Robert Sedgewick]], ''Algorithms'', 1984, p. 84.</ref> In general, a [[computer program]] may be optimized so that it executes more rapidly, or to make it capable of operating with less [[Computer data storage|memory storage]] or other resources, or draw less power.
The system can be a single computer program, a collection of computers, or even an entire network such as the Internet. The optimization may aim to reduce the maximum execution time, memory use, bandwidth, or some other resource. These objectives can be mutually exclusive, and require a [[tradeoff]].
 
==Overview==
In [[operations research]], '''optimization''' is the problem of determining the inputs of a function that minimize or maximize its value. Sometimes constraints are imposed on the values that the inputs can take; this problem is known as '''constrained optimization'''.
Although the term "optimization" is derived from "optimum",<ref>{{Cite book |last1=Antoniou |first1=Andreas |url=https://link.springer.com/content/pdf/10.1007/978-1-0716-0843-2.pdf |title=Practical Optimization |last2=Lu |first2=Wu-Sheng |series=Texts in Computer Science |publisher=[[Springer Publishing|Springer]] |year=2021 |edition=2nd |pages=1 |doi=10.1007/978-1-0716-0843-2 |isbn=978-1-0716-0841-8 |language=en}}</ref> achieving a truly optimal system is rare in practice, which is referred to as [[superoptimization]]. Optimization typically focuses on improving a system with respect to a specific quality metric rather than making it universally optimal. This often leads to trade-offs, where enhancing one metric may come at the expense of another. One frequently cited example is the [[space-time tradeoff]], where reducing a program’s execution time can increase its memory consumption. Conversely, in scenarios where memory is limited, engineers might prioritize a slower [[algorithm]] to conserve space. There is rarely a single design that can excel in all situations, requiring [[software engineers|programmers]] to prioritize attributes most relevant to the application at hand. Metrics for software include throughput, [[Frames per second|latency]], [[RAM|volatile memory usage]], [[Disk storage|persistent storage]], [[internet usage]], [[energy consumption]], and hardware [[wear and tear]]. The most common metric is speed.
 
Furthermore, achieving absolute optimization often demands disproportionate effort relative to the benefits gained. Consequently, optimization processes usually slow once sufficient improvements are achieved. Fortunately, significant gains often occur early in the optimization process, making it practical to stop before reaching [[diminishing returns]].
In [[computer programming]], '''optimization''' usually more specifically means modifying code and its compilation settings for a given [[computer architecture]] to produce more efficient software. Some authorities, particularly in academic environments, use '''code optimization''' or '''code improvement''' to mean producing the most effective and efficient code possible, preferring "code improvement" to avoid confusion with performance tuning; '''performance tuning''' is then taken to include both code improvement and the scalability now often needed with network-based computing and large-scale software projects.
 
Usually, the term "optimization" presumes that the system retains the same functionality. However, often a crucial optimization is to solve only the actual problem, removing useless code.

==Levels of optimization==
Optimization can occur at a number of levels. Typically the higher levels have greater impact, and are harder to change later on in a project, requiring significant changes or a complete rewrite if they need to be changed. Thus optimization can typically proceed via refinement from higher to lower, with initial gains being larger and achieved with less work, and later gains being smaller and requiring more work. However, in some cases overall performance depends on performance of very low-level portions of a program, and small changes at a late stage or early consideration of low-level details can have outsized impact. Typically some consideration is given to efficiency throughout a project{{snd}} though this varies significantly{{snd}} but major optimization is often considered a refinement to be done late, if ever. On longer-running projects there are typically cycles of optimization, where improving one area reveals limitations in another, and these are typically curtailed when performance is acceptable or gains become too small or costly. Best practices for optimization during iterative development cycles include continuous monitoring for performance issues coupled with regular performance testing.<ref>{{cite web |title= Performance Optimization in Software Development: Speeding Up Your Applications|url=https://senlainc.com/blog/performance-optimization-in-software-development/#best-practices-for-performance-optimization |access-date=12 July 2025}}</ref><ref>{{cite web |author=Agrawal, Amit |title= Maximizing Efficiency: Implementing a Performance Monitoring System |url=https://www.developers.dev/tech-talk/implement-a-system-for-monitoring-application.html |access-date=12 July 2025}}</ref>
 
As performance is part of the specification of a program{{snd}} a program that is unusably slow is not fit for purpose: a video game with 60&nbsp;Hz (frames-per-second) is acceptable, but 6 frames-per-second is unacceptably choppy{{snd}} performance is a consideration from the start, to ensure that the system is able to deliver sufficient performance, and early prototypes need to have roughly acceptable performance for there to be confidence that the final system will (with optimization) achieve acceptable performance. This is sometimes omitted in the belief that optimization can always be done later, resulting in prototype systems that are far too slow{{snd}} often by an [[order of magnitude]] or more{{snd}} and systems that ultimately are failures because they architecturally cannot achieve their performance goals, such as the [[Intel 432]] (1981); or ones that take years of work to achieve acceptable performance, such as Java (1995), which achieved performance comparable with native code only with [[HotSpot (virtual machine)|HotSpot]] (1999).<ref>{{cite web |author=Düppe, Ingo |title= Hitchhiker’s Guide to Java Performance: The Past, the Present, and the Future |url=https://javapro.io/2025/04/07/hitchhikers-guide-to-java-performance |access-date=12 July 2025}}</ref> The degree to which performance changes between prototype and production system, and how amenable it is to optimization, can be a significant source of uncertainty and risk.
 
===Design level===
Typical problems have such a large number of possibilities that a programming organization can only afford a "good enough" solution.
At the highest level, the design may be optimized to make best use of the available resources, given goals, constraints, and expected use/load. The architectural design of a system overwhelmingly affects its performance. For example, a system that is network latency-bound (where network latency is the main constraint on overall performance) would be optimized to minimize network trips, ideally making a single request (or no requests, as in a [[push protocol]]) rather than multiple roundtrips. Choice of design depends on the goals: when designing a [[compiler]], if fast compilation is the key priority, a [[one-pass compiler]] is faster than a [[multi-pass compiler]] (assuming same work), but if speed of output code is the goal, a slower multi-pass compiler fulfills the goal better, even though it takes longer itself. Choice of platform and programming language occur at this level, and changing them frequently requires a complete rewrite, though a modular system may allow rewrite of only some component{{snd}} for example, for a Python program one may rewrite performance-critical sections in C. In a distributed system, choice of architecture ([[client-server]], [[peer-to-peer]], etc.) occurs at the design level, and may be difficult to change, particularly if all components cannot be replaced in sync (e.g., old clients).
 
===Algorithms and data structures===
Given an overall design, a good choice of [[algorithmic efficiency|efficient algorithms]] and [[data structure]]s, and efficient implementation of these algorithms and data structures comes next. After design, the choice of [[algorithm]]s and data structures affects efficiency more than any other aspect of the program. Generally data structures are more difficult to change than algorithms, as a data structure assumption and its performance assumptions are used throughout the program, though this can be minimized by the use of [[abstract data type]]s in function definitions, and keeping the concrete data structure definitions restricted to a few places. Changes in data structures mapped to a database may require schema migration and other complex software or infrastructure changes.<ref>{{cite web |author=Mullins, Craig S. |title=The Impact of Change on Database Structures |url=https://www.dbta.com/Columns/DBA-Corner/The-Impact-of-Change-on-Database-Structures-101931.aspx |access-date=12 July 2025}}</ref>
 
For algorithms, this primarily consists of ensuring that algorithms are constant O(1), logarithmic O(log ''n''), linear O(''n''), or in some cases log-linear O(''n'' log ''n'') in the input (both in space and time). Algorithms with quadratic complexity O(''n''<sup>2</sup>) fail to scale, and even linear algorithms cause problems if repeatedly called, and are typically replaced with constant or logarithmic if possible.
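
The difference between growth rates shows up even in routine code. The following sketch (the function names and the assumption that <code>dst</code> is large enough are purely illustrative) joins strings in two ways: the first is accidentally quadratic, because <code>strcat</code> rescans the destination on every call, while the second remembers where the string ends and stays linear.

<syntaxhighlight lang="c">
#include <string.h>

/* Accidentally quadratic: strcat() rescans dst from the start on every
   call, so joining n short strings does O(n^2) character reads.
   (dst is assumed to be large enough to hold the result.) */
void join_slow(char *dst, const char *const *parts, int n) {
    dst[0] = '\0';
    for (int i = 0; i < n; ++i)
        strcat(dst, parts[i]);
}

/* Linear: keep a pointer to the current end instead of rescanning. */
void join_fast(char *dst, const char *const *parts, int n) {
    char *end = dst;
    for (int i = 0; i < n; ++i) {
        size_t len = strlen(parts[i]);
        memcpy(end, parts[i], len);
        end += len;
    }
    *end = '\0';
}
</syntaxhighlight>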
 
Beyond asymptotic order of growth, the constant factors matter: an asymptotically slower algorithm may be faster or smaller (because simpler) than an asymptotically faster algorithm when both are faced with small input, which is often the case in practice. Often a [[hybrid algorithm]] will provide the best performance, because this tradeoff changes with size.
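
A common way to realize such a [[hybrid algorithm]] is a recursive sort that switches to a simpler method below a cutoff. The following sketch (function names and the particular threshold are assumptions for illustration, not tuned values) combines quicksort with insertion sort for small subranges.

<syntaxhighlight lang="c">
/* Hybrid quicksort: insertion sort has higher asymptotic cost but lower
   constant factors, so it wins on small subranges.  THRESHOLD is a
   tuning parameter chosen here purely for illustration. */
#define THRESHOLD 16

static void insertion_sort(int *a, int lo, int hi) {
    for (int i = lo + 1; i <= hi; ++i) {
        int key = a[i], j = i - 1;
        while (j >= lo && a[j] > key) {
            a[j + 1] = a[j];
            --j;
        }
        a[j + 1] = key;
    }
}

void hybrid_sort(int *a, int lo, int hi) {     /* sorts a[lo..hi] */
    if (hi - lo + 1 <= THRESHOLD) {
        insertion_sort(a, lo, hi);
        return;
    }
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {                           /* Hoare-style partition */
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            ++i; --j;
        }
    }
    if (lo < j) hybrid_sort(a, lo, j);
    if (i < hi) hybrid_sort(a, i, hi);
}
</syntaxhighlight>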
 
A general technique to improve performance is to avoid work. A good example is the use of a [[fast path]] for common cases, improving performance by avoiding unnecessary work. For example, using a simple text layout algorithm for Latin text, only switching to a complex layout algorithm for complex scripts, such as [[Devanagari]]. Another important technique is caching, particularly [[memoization]], which avoids redundant computations. Because of the importance of caching, there are often many levels of caching in a system, which can cause problems from memory use, and correctness issues from stale caches.
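
A minimal sketch of [[memoization]], using the deliberately simple Fibonacci example (the table size and names are illustrative): results are stored the first time they are computed, trading a small amount of memory for the elimination of repeated work.

<syntaxhighlight lang="c">
#include <stdint.h>

#define MAXN 94                 /* fib(93) is the largest value that fits in a uint64_t */

static uint64_t memo[MAXN];
static int known[MAXN];

/* Caller must ensure 0 <= n < MAXN.  Without the table, the naive
   recursion repeats work exponentially; with it, each value is
   computed exactly once. */
uint64_t fib(int n) {
    if (n < 2)
        return (uint64_t)n;
    if (!known[n]) {
        memo[n] = fib(n - 1) + fib(n - 2);
        known[n] = 1;
    }
    return memo[n];
}
</syntaxhighlight>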
 
===Source code level===
Beyond general algorithms and their implementation on an abstract machine, concrete source code level choices can make a significant difference. For example, on early C compilers, <code>while(1)</code> was slower than <code>for(;;)</code> for an unconditional loop, because <code>while(1)</code> evaluated 1 and then had a conditional jump which tested if it was true, while <code>for (;;)</code> had an unconditional jump. Some optimizations (such as this one) can nowadays be performed by [[optimizing compiler]]s. This depends on the source language, the target machine language, and the compiler, and can be difficult to understand or predict, and changes over time; this is a key place where understanding of compilers and machine code can improve performance. [[Loop-invariant code motion]] and [[return value optimization]] are examples of optimizations that reduce the need for auxiliary variables and can even result in faster performance by avoiding round-about optimizations.
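
For example, a hand-applied form of [[loop-invariant code motion]] might look like the following sketch (illustrative only; an optimizing compiler can often, though not always, prove the transformation safe and perform it itself).

<syntaxhighlight lang="c">
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) is re-evaluated on every iteration, even though the
   length does not change inside the loop. */
void upcase_slow(char *s) {
    for (size_t i = 0; i < strlen(s); ++i)
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
}

/* After: the loop-invariant computation is hoisted out of the loop. */
void upcase_fast(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i)
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
}
</syntaxhighlight>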
 
===Build level===
Between the source and compile level, [[Directive (programming)|directives]] and [[Build automation|build flags]] can be used to tune performance options in the source code and compiler respectively, such as using [[preprocessor]] defines to disable unneeded software features, optimizing for specific processor models or hardware capabilities, or predicting [[branch (computer science)|branching]], for instance. Source-based software distribution systems such as [[Berkeley Software Distribution|BSD]]'s [[Ports collection|Ports]] and [[Gentoo Linux|Gentoo]]'s [[Portage (software)|Portage]] can take advantage of this form of optimization.
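
A sketch of how a preprocessor define can act as a build-level switch; the flag name <code>ENABLE_STATS</code> and the surrounding code are hypothetical. When the feature is disabled at build time, the compiler emits no code for it at all.

<syntaxhighlight lang="c">
/* Build with -DENABLE_STATS=1 to include the statistics code;
   the default build omits it entirely. */
#ifndef ENABLE_STATS
#define ENABLE_STATS 0
#endif

#if ENABLE_STATS
static unsigned long bytes_processed;   /* exists only in stats builds */
#endif

void process(const char *buf, unsigned long len) {
#if ENABLE_STATS
    bytes_processed += len;             /* bookkeeping compiled in on demand */
#endif
    /* ... process buf[0..len-1] ... */
    (void)buf;
    (void)len;
}
</syntaxhighlight>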
 
===Compile level===
Use of an [[optimizing compiler]] with optimizations enabled tends to ensure that the [[executable program]] is optimized at least as much as the compiler can reasonably perform. See [[Optimizing compiler]] for more details.
 
===Assembly level===
At the lowest level, writing code using an [[assembly language]], designed for a particular hardware platform can produce the most efficient and compact code if the programmer takes advantage of the full repertoire of [[machine instruction]]s. Many [[operating system]]s used on [[embedded system]]s have been traditionally written in assembler code for this reason. Programs (other than very small programs) are seldom written from start to finish in assembly due to the time and cost involved. Most are compiled down from a high level language to assembly and hand optimized from there. When efficiency and size are less important large parts may be written in a high-level language.
 
With more modern [[optimizing compiler]]s and the greater complexity of recent [[CPU]]s, it is harder to write more efficient code than what the compiler generates, and few projects need this "ultimate" optimization step.
 
Much of the code written today is intended to run on as many machines as possible. As a consequence, programmers and compilers don't always take advantage of the more efficient instructions provided by newer CPUs or quirks of older models. Additionally, assembly code tuned for a particular processor without using such instructions might still be suboptimal on a different processor, expecting a different tuning of the code.
 
Typically today rather than writing in assembly language, programmers will use a [[disassembler]] to analyze the output of a compiler and change the high-level source code so that it can be compiled more efficiently, or understand why it is inefficient.
 
===Run time===
[[Just-in-time compilation|Just-in-time]] compilers can produce customized machine code based on run-time data, at the cost of compilation overhead. This technique dates to the earliest [[regular expression]] engines, and has become widespread with Java HotSpot and V8 for JavaScript. In some cases [[adaptive optimization]] may be able to perform [[run time (program lifecycle phase)|run time]] optimization exceeding the capability of static compilers by dynamically adjusting parameters according to the actual input or other factors.
 
[[Profile-guided optimization]] is an ahead-of-time (AOT) compilation optimization technique based on run time profiles; it is essentially a static "average case" analog of the dynamic technique of adaptive optimization.
 
[[Self-modifying code]] can alter itself in response to run time conditions in order to optimize code; this was more common in assembly language programs.
 
Some [[CPU design]]s can perform some optimizations at run time. Some examples include [[out-of-order execution]], [[speculative execution]], [[instruction pipeline]]s, and [[branch predictor]]s. Compilers can help the program take advantage of these CPU features, for example through [[instruction scheduling]].
 
===Platform dependent and independent optimizations===
Code optimization can be also broadly categorized as [[computer platform|platform]]-dependent and platform-independent techniques. While the latter ones are effective on most or all platforms, platform-dependent techniques use specific properties of one platform, or rely on parameters depending on the single platform or even on the single processor. Writing or producing different versions of the same code for different processors might therefore be needed. For instance, in the case of compile-level optimization, platform-independent techniques are generic techniques (such as [[loop unwinding|loop unrolling]], reduction in function calls, memory efficient routines, reduction in conditions, etc.) that impact most CPU architectures in a similar way. An example of platform-independent optimization has been demonstrated with inner for loops, where it was observed that a loop with an inner for loop performs more computations per unit time than a loop without it or one with an inner while loop.<ref>{{Cite journal|last=Adewumi|first=Tosin P.|date=2018-08-01|title=Inner loop program construct: A faster way for program execution|journal=Open Computer Science|language=en|volume=8|issue=1|pages=115–122|doi=10.1515/comp-2018-0004|doi-access=free}}</ref> Generally, these serve to reduce the total [[instruction path length]] required to complete the program and/or reduce total memory usage during the process. On the other hand, platform-dependent techniques involve instruction scheduling, [[instruction-level parallelism]], data-level parallelism, and cache optimization techniques (i.e., parameters that differ among various platforms); the optimal instruction scheduling might be different even on different processors of the same architecture.
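
As a concrete illustration of one of the generic techniques mentioned above, the following sketch unrolls a summation loop by hand by a factor of four (the function name and unroll factor are arbitrary; in practice this transformation is usually left to the compiler).

<syntaxhighlight lang="c">
/* Manual four-way loop unrolling: fewer loop-condition tests and branches
   per element, at the cost of slightly larger code. */
long sum_array(const int *a, long n) {
    long sum = 0;
    long i = 0;
    for (; i + 4 <= n; i += 4) {        /* main unrolled loop */
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; ++i)                  /* remaining 0-3 elements */
        sum += a[i];
    return sum;
}
</syntaxhighlight>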
 
==Strength reduction==
Computational tasks can be performed in several different ways with varying efficiency. Replacing a computation with an equivalent but less expensive one is known as [[strength reduction]]. For example, consider the following [[C (programming language)|C]] code snippet whose intention is to obtain the sum of all integers from 1 to {{var|N}}:
 
<syntaxhighlight lang="c">
int i, sum = 0;
for (i = 1; i <= N; ++i) {
  sum += i;
}
printf("sum: %d\n", sum);
</syntaxhighlight>
 
This code can (assuming no [[arithmetic overflow]]) be rewritten using a mathematical formula like:
 
<syntaxhighlight lang="c">
int sum = N * (1 + N) / 2;
printf("sum: %d\n", sum);
</syntaxhighlight>
 
The optimization, sometimes performed automatically by an optimizing compiler, is to select a method ([[algorithm]]) that is more computationally efficient, while retaining the same functionality. See [[algorithmic efficiency]] for a discussion of some of these techniques. However, a significant improvement in performance can often be achieved by removing extraneous functionality.
 
Optimization is not always an obvious or intuitive process. In the example above, the "optimized" version might actually be slower than the original version if {{var|N}} were sufficiently small and the particular hardware happens to be much faster at performing addition and [[Loop (computing)#Loops|loop]]ing operations than multiplication and division.
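
In its narrower, classic sense, strength reduction replaces an expensive operation inside a loop with a cheaper equivalent, such as turning a repeated multiplication into a running addition. A hand-written sketch (names are illustrative; compilers routinely apply this transformation to array indexing themselves):

<syntaxhighlight lang="c">
/* Before: the index computation i * stride implies a multiplication on
   every iteration.  After: the offset is carried as a running sum, so
   one addition replaces the multiply. */
void fill_strided(int *a, int n, int stride, int x) {
    int offset = 0;
    for (int i = 0; i < n; ++i) {
        a[offset] = x;        /* was: a[i * stride] = x; */
        offset += stride;
    }
}
</syntaxhighlight>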
 
== Trade-offs ==<!-- [[Pessimization]] redirects here -->
In some cases, however, optimization relies on using more elaborate algorithms, making use of "special cases" and special "tricks" and performing complex trade-offs. A "fully optimized" program might be more difficult to comprehend and hence may contain more [[software bug|faults]] than unoptimized versions. Beyond eliminating obvious antipatterns, some code level optimizations decrease maintainability.
 
Optimization will generally focus on improving just one or two aspects of performance: execution time, memory usage, disk space, bandwidth, power consumption or some other resource. This will usually require a trade-off{{snd}} where one factor is optimized at the expense of others. For example, increasing the size of [[cache (computing)|cache]] improves run time performance, but also increases the memory consumption. Other common trade-offs include code clarity and conciseness.
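
A small sketch of the space-time trade-off (names and table size are illustrative): spending 256 bytes on a lookup table, plus a one-time initialization pass, makes counting the set bits of a byte a single memory access.

<syntaxhighlight lang="c">
#include <stdint.h>

static uint8_t popcount8[256];          /* 256 bytes of extra storage */

/* One-time initialization: compute the bit count of every byte value. */
void init_popcount8(void) {
    for (int v = 0; v < 256; ++v) {
        int bits = 0;
        for (int b = v; b != 0; b >>= 1)
            bits += b & 1;
        popcount8[v] = (uint8_t)bits;
    }
}

/* Four table lookups instead of looping over 32 bits. */
unsigned popcount32(uint32_t x) {
    return popcount8[x & 0xFF] + popcount8[(x >> 8) & 0xFF]
         + popcount8[(x >> 16) & 0xFF] + popcount8[x >> 24];
}
</syntaxhighlight>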
 
There are instances where the programmer performing the optimization must decide to make the software better for some operations but at the cost of making other operations less efficient. These trade-offs may sometimes be of a non-technical nature{{snd}} such as when a competitor has published a [[Benchmark (computing)|benchmark]] result that must be beaten in order to improve commercial success but comes perhaps with the burden of making normal usage of the software less efficient. Such changes are sometimes jokingly referred to as ''pessimizations''.
 
==Bottlenecks==
Optimization may include finding a [[Bottleneck (engineering)|bottleneck]] in a system{{snd}} a component that is the limiting factor on performance. In terms of code, this will often be a [[Hot spot (computer science)|hot spot]]{{snd}} a critical part of the code that is the primary consumer of the needed resource{{snd}} though it can be another factor, such as I/O latency or network bandwidth.
 
In computer science, resource consumption often follows a form of [[power law]] distribution, and the [[Pareto principle]] can be applied to resource optimization by observing that 80% of the resources are typically used by 20% of the operations.<ref>{{cite book | last = Wescott | first = Bob | title = The Every Computer Performance Book, Chapter 3: Useful laws | publisher = [[CreateSpace]] | date = 2013 | isbn = 978-1482657753}}</ref> In software engineering, it is often a better approximation that 90% of the execution time of a computer program is spent executing 10% of the code (known as the 90/10 law in this context).
 
More complex algorithms and data structures perform well with many items, while simple algorithms are more suitable for small amounts of data — the setup, initialization time, and constant factors of the more complex algorithm can outweigh the benefit, and thus a [[hybrid algorithm]] or [[adaptive algorithm]] may be faster than any single algorithm. A performance profiler can be used to narrow down decisions about which functionality fits which conditions.<ref>{{cite web |url=http://www.developforperformance.com/PerformanceProfilingWithAFocus.html#FittingTheSituation |author=Krauss, Kirk J. |title=Performance Profiling with a Focus |access-date=15 August 2017}}</ref>
 
Performance profiling therefore provides not only bottleneck detection but rather a variety of methods for optimization guidance. [[Empirical algorithmics]] is the practice of using empirical methods, typically performance profiling, to study the behavior of algorithms, for developer understanding that may lead to human-planned optimizations. [[Profile-guided optimization]] is the machine-driven use of profiling data as input to an optimizing compiler or interpreter. Some programming languages are associated with tools for profile-guided optimization.<ref>{{cite web |url=https://doc.rust-lang.org/beta/rustc/profile-guided-optimization.html |title=Profile-guided Optimization |access-date=12 July 2025}}</ref> Some performance profiling methods emphasize enhancements based on [[cache (computing)|cache]] utilization.<ref>{{Cite book |last=The Valgrind Developers |url=https://www.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-40/Nice/RuleRefinement/bin/valgrind-3.2.0/docs/html/cl-manual.html#cl-manual.tools |title=Valgrind User Manual |section=5.2.2 |publisher=Network Theory Ltd. |year=2006 |language=en}}</ref> Other benefits of performance profiling may include improved resource management and an enhanced user experience.<ref>{{cite web |author= Kodlekere, Ranjana |title= Performance Profiling: Explained with Stages| url=https://testsigma.com/blog/performance-profiling/#benefits-of-performance-profiling |access-date=12 July 2025}}</ref>
 
In some cases, adding more [[main memory|memory]] can help to make a program run faster. For example, a filtering program will commonly read each line and filter and output that line immediately. This only uses enough memory for one line, but performance is typically poor, due to the latency of each disk read. Caching the result is similarly effective, though also requiring larger memory use.
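
A sketch of trading memory for speed when scanning a file: reading in large chunks rather than one character at a time. The chunk size is arbitrary, and the C standard library already buffers I/O to some extent, so the measured gain varies by platform.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)                  /* 1 MiB working buffer */

/* Counts newline characters; returns -1 if the buffer cannot be allocated. */
long count_newlines(FILE *f) {
    char *buf = malloc(CHUNK);
    if (buf == NULL)
        return -1;
    long lines = 0;
    size_t got;
    while ((got = fread(buf, 1, CHUNK, f)) > 0)
        for (size_t i = 0; i < got; ++i)
            if (buf[i] == '\n')
                ++lines;
    free(buf);
    return lines;
}
</syntaxhighlight>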
 
==When to optimize==
<!-- This section is linked from [[Python (programming language)]] -->
 
Typically, optimization involves choosing the best overall algorithms and data structures.<ref>{{cite web|url=https://ubiquity.acm.org/article.cfm?id=1513451|title=The Fallacy of Premature Optimization}}</ref> Algorithmic improvements can frequently yield performance gains of several orders of magnitude, whereas micro-optimizations rarely improve performance by more than a few percent.<ref>{{cite web|url=https://ubiquity.acm.org/article.cfm?id=1513451|title=The Fallacy of Premature Optimization}}</ref> If one waits to optimize until the end of the development cycle, then changing the algorithm requires a complete rewrite.
 
Frequently, micro-optimization can reduce [[readability]] and complicate programs or systems. That can make programs more difficult to maintain and debug.
 
[[Donald Knuth]] made the following two statements on optimization:
 
<blockquote>"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%"<ref name="autogenerated268">{{cite journal | last = Knuth | first = Donald | citeseerx = 10.1.1.103.6084 | title = Structured Programming with go to Statements | journal = ACM Computing Surveys | volume = 6 | issue = 4 |date=December 1974 | page = 268 | doi = 10.1145/356635.356640 | s2cid = 207630080 }}</ref></blockquote>
(He also attributed the quote to [[Tony Hoare]] several years later,<ref>''The Errors of [[TeX]]'', in ''Software—Practice & Experience'', Volume 19, Issue 7 (July 1989), pp. 607–685, reprinted in his book Literate Programming (p. 276).</ref> although this might have been an error as Hoare disclaims having coined the phrase.<ref><!--Tony Hoare, a 2004 email-->{{Cite web|title=Premature optimization is the root of all evil|url=https://hans.gerwitz.com/2004/08/12/premature-optimization-is-the-root-of-all-evil.html|access-date=2020-12-18|quote=Hoare, however, did not claim it when I queried him in January of 2004|website=hans.gerwitz.com|language=en}}</ref>)
<blockquote> "In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"<ref name="autogenerated268"/></blockquote>
 
"Premature optimization" is often used as a rallying cry against all optimization in all situations for all purposes. <ref>{{cite web|url=https://ubiquity.acm.org/article.cfm?id=1513451|title=The Fallacy of Premature Optimization}}</ref><ref>{{cite web|url=https://www.javacodegeeks.com/2012/11/not-all-optimization-is-premature.html|title=Not All Optimization is Premature}}</ref><ref>{{cite web|url=https://www.infoworld.com/article/2165382/when-premature-optimization-isn-t.html|title=When Premature Optimization Is'nt}}</ref><ref>{{cite web|url=https://prog21.dadgum.com/106.html|title="Avoid Premature Optimization" Does Not Mean "Write Dump Code"}}</ref> Frequently, [[SOLID|Clean Code]] causes code to be more complicated than simpler more efficient code. <ref>{{cite web|url=https://devshift.substack.com/p/premature-abstractions|title=Premature Abstractions}}</ref>
 
When deciding what to optimize, [[Amdahl's law]] should be used to prioritize parts based on the actual time spent in a certain part, which is not always clear from looking at the code without a [[Profiling (computer programming)|performance analysis]].
 
In practice, it is often necessary to keep performance goals in mind when first designing software, yet programmers must balance various tradeoffs. Development cost is significant, and hardware is fast.
 
Modern compilers are efficient enough that the intended performance increases sometimes fail to materialize. Since compilers perform many automatic optimizations, some optimizations may yield an identical executable. Also, sometimes hardware may reduce the impact of micro-optimization. For example, hardware may cache data that is cached at a software level.
 
==Macros==
Optimization during code development using [[Macro (computer science)|macros]] takes on different forms in different languages.
 
In some procedural languages, such as [[C (programming language)|C]] and [[C++]], macros are implemented using token substitution. Nowadays, [[inline function]]s can be used as a [[type safe]] alternative in many cases. In both cases, the inlined function body can then undergo further compile-time optimizations by the compiler, including [[constant folding]], which may move some computations to compile time.
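
A minimal sketch contrasting a token-substitution macro with a type-safe inline function (names are illustrative); with constant arguments, both forms can be reduced to constants at compile time by [[constant folding]].

<syntaxhighlight lang="c">
#define SQUARE_MACRO(x) ((x) * (x))     /* textual substitution */

static inline int square(int x) {       /* type-checked, single evaluation */
    return x * x;
}

int example(void) {
    int a = SQUARE_MACRO(3);            /* folded to 9 at compile time */
    int b = square(4);                  /* typically inlined and folded to 16 */
    return a + b;
}
</syntaxhighlight>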
 
In many [[functional programming]] languages, macros are implemented using parse-time substitution of parse trees/abstract syntax trees, which it is claimed makes them safer to use. Since in many cases interpretation is used, that is one way to ensure that such computations are only performed at parse-time, and sometimes the only way.
 
[[Lisp programming language|Lisp]] originated this style of macro,{{Citation needed|date=September 2008}} and such macros are often called "Lisp-like macros". A similar effect can be achieved by using [[template metaprogramming]] in [[C++]].
 
In both cases, work is moved to compile-time. The difference between [[C (programming language)|C]] macros on one side, and Lisp-like macros and [[C++]] [[template metaprogramming]] on the other side, is that the latter tools allow performing arbitrary computations at compile-time/parse-time, while expansion of [[C (programming language)|C]] macros does not perform any computation, and relies on the optimizer ability to perform it. Additionally, [[C (programming language)|C]] macros do not directly support [[recursion (computer science)|recursion]] or [[iteration]], so are not [[Turing complete]].
 
As with any optimization, however, it is often difficult to predict where such tools will have the most impact before a project is complete.
 
==Automated and manual optimization==
{{Main|Optimizing compiler}}
''See also [[:Category:Compiler optimizations]]''
 
Optimization can be automated by compilers or performed by programmers. Gains are usually limited for local optimization, and larger for global optimizations. Usually, the most powerful optimization is to find a superior [[algorithm]].
 
Optimizing a whole system is usually undertaken by programmers because it is too complex for automated optimizers. In this situation, programmers or [[system administrator]]s explicitly change code so that the overall system performs better. Although it can produce better efficiency, it is far more expensive than automated optimizations. Since many parameters influence the program performance, the program optimization space is large. Meta-heuristics and machine learning are used to address the complexity of program optimization.<ref>{{cite journal|last1=Memeti|first1=Suejb|last2=Pllana|first2=Sabri|last3=Binotto|first3=Alécio|last4=Kołodziej|first4=Joanna|last5=Brandic|author5-link= Ivona Brandić |first5=Ivona|title=Using meta-heuristics and machine learning for software optimization of parallel computing systems: a systematic literature review|journal=Computing|volume=101|issue=8|pages=893–936|date=26 April 2018|doi=10.1007/s00607-018-0614-9|publisher=Springer Vienna|arxiv=1801.09444|bibcode=2018arXiv180109444M|s2cid=13868111}}</ref>
 
Use a [[Profiler (computer science)|profiler]] (or [[Profiling (computer programming)|performance analyzer]]) to find the sections of the program that are taking the most resources{{snd}} the ''bottleneck''. Programmers sometimes believe they have a clear idea of where the bottleneck is, but intuition is frequently wrong.{{citation needed|date=May 2012}} Optimizing an unimportant piece of code will typically do little to help the overall performance.
 
When the bottleneck is localized, optimization usually starts with a rethinking of the algorithm used in the program. More often than not, a particular algorithm can be specifically tailored to a particular problem, yielding better performance than a generic algorithm. For example, the task of sorting a huge list of items is usually done with a [[quicksort]] routine, which is one of the most efficient generic algorithms. But if some characteristic of the items is exploitable (for example, they are already arranged in some particular order), a different method can be used, or even a custom-made sort routine.
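
For instance, a sketch of exploiting a known property of the input (names are illustrative, and whether the pre-check pays off depends on how often the input really is sorted): if the data is frequently already in order, a linear scan can skip the full sort.

<syntaxhighlight lang="c">
#include <stdlib.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* O(n) check first; fall back to the generic O(n log n) sort only when
   an out-of-order pair is found. */
void sort_if_needed(int *a, size_t n) {
    for (size_t i = 1; i < n; ++i)
        if (a[i - 1] > a[i]) {
            qsort(a, n, sizeof a[0], cmp_int);
            return;
        }
    /* already sorted: nothing to do */
}
</syntaxhighlight>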
 
After the programmer is reasonably sure that the best algorithm is selected, code optimization can start. Loops can be unrolled (for lower loop overhead, although this can often lead to ''lower'' speed if it overloads the [[CPU cache]]), data types as small as possible can be used, integer arithmetic can be used instead of floating-point, and so on. (See [[algorithmic efficiency]] article for these and other techniques.)
 
Performance bottlenecks can be due to language limitations rather than algorithms or data structures used in the program. Sometimes, a critical part of the program can be re-written in a different [[programming language]] that gives more direct access to the underlying machine. For example, it is common for very [[High-level programming language|high-level]] languages like [[Python (programming language)|Python]] to have modules written in [[C (programming language)|C]] for greater speed. Programs already written in C can have modules written in [[assembly language|assembly]]. Programs written in [[D programming language|D]] can use the [[inline assembler]].
 
Rewriting sections "pays off" in these circumstances because of a general "[[rule of thumb]]" known as the 90/10 law, which states that 90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code. So, putting intellectual effort into optimizing just a small part of the program can have a huge effect on the overall speed{{snd}} if the correct part(s) can be located.
 
Manual optimization sometimes has the side effect of undermining readability. Thus code optimizations should be carefully documented (preferably using in-line comments), and their effect on future development evaluated.
 
The program that performs an automated optimization is called an '''optimizer'''. Most optimizers are embedded in compilers and operate during compilation. Optimizers can often tailor the generated code to specific processors.
 
Today, automated optimizations are almost exclusively limited to [[compiler optimization]]. However, because compiler optimizations are usually limited to a fixed set of rather general optimizations, there is considerable demand for optimizers which can accept descriptions of problem and language-specific optimizations, allowing an engineer to specify custom optimizations. Tools that accept descriptions of optimizations are called [[program transformation]] systems and are beginning to be applied to real software systems such as C++.
 
Some high-level languages ([[Eiffel (programming language)|Eiffel]], [[Esterel]]) optimize their programs by using an [[intermediate language]].
 
[[Grid computing]] or [[distributed computing]] aims to optimize the whole system, by moving tasks from computers with high usage to computers with idle time. [[Load balancing (computing)|Load balancing]] likewise spreads the load over a large number of servers, often transparently to users, for example using a [[layer 4 router]].
 
==Time taken for optimization==
Sometimes, the time taken to undertake optimization may itself be an issue.
 
Optimizing existing code usually does not add new features, and worse, it might add new [[Software bug|bugs]] in previously working code (as any change might). Because manually optimized code might sometimes have less "readability" than unoptimized code, optimization might impact maintainability of it as well. Optimization comes at a price and it is important to be sure that the investment is worthwhile.
 
An automatic optimizer (or [[optimizing compiler]], a program that performs code optimization) may itself have to be optimized, either to further improve the efficiency of its target programs or else speed up its own operation. A compilation performed with optimization "turned on" usually takes longer, although this is usually only a problem when programs are quite large.
 
In particular, for [[just-in-time compiler]]s the performance of the [[Run time environment|run time]] compile component, executing together with its target code, is the key to improving overall execution speed.
 
==False optimization==
 
Sometimes, "optimizations" may hurt performance. Parallelism and concurrency causes a significant overhead performance cost, especially energy usage. Keep in mind that C code rarely uses explicit multiprocessing, yet it typically runs faster than any other programming language. Disk caching, paging, and swapping often cause significant increases to energy usage and hardware wear and tear. Running processes in the background to improve startup time slows down all other processes.
 
==See also==
<!-- Please keep entries in alphabetical order & add a short description {{annotated link|WP:SEEALSO}} -->
{{div col|small=yes|colwidth=20em}}
* {{annotated link|Benchmark (computing)|Benchmark}}
* {{annotated link|Cache (computing)}}
* {{annotated link|Empirical algorithmics}}
* {{annotated link|Optimizing compiler}}
* {{annotated link|Performance engineering}}
* {{annotated link|Performance prediction}}
* {{annotated link|Performance tuning}}
* {{annotated link|Profile-guided optimization}}
* {{annotated link|Software development}}
* {{annotated link|Software performance testing}}
* {{annotated link|Static code analysis}}
{{div col end}}
<!-- please keep entries in alphabetical order -->
 
==References==
{{Reflist}}
 
==Further reading==
{{wikibooks|Optimizing Code for Speed}}
* [[Jon Bentley (computer scientist)|Jon Bentley]]: ''Writing Efficient Programs'', {{ISBN|0-13-970251-2}}.
* [[Donald Knuth]]: ''[[The Art of Computer Programming]]''
* [http://www.ece.cmu.edu/~franzf/papers/gttse07.pdf How To Write Fast Numerical Code: A Small Introduction]
* [http://people.redhat.com/drepper/cpumemory.pdf "What Every Programmer Should Know About Memory"] by Ulrich Drepper{{snd}} explains the structure of modern memory subsystems and suggests how to utilize them efficiently
* [http://icl.cs.utk.edu/~mucci/latest/pubs/Notur2009-new.pdf "Linux Multicore Performance Analysis and Optimization in a Nutshell"], presentation slides by Philip Mucci
* [http://www.azillionmonkeys.com/qed/optimize.html Programming Optimization] by Paul Hsieh
* [http://www.new-npac.org/projects/cdroms/cewes-1999-06-vol1/nhse/hpccsurvey/orgs/sgi/bentley.html Writing efficient programs ("Bentley's Rules")] by [[Jon Bentley (computer scientist)|Jon Bentley]]
* [http://queue.acm.org/detail.cfm?id=1117403 "Performance Anti-Patterns"] by Bart Smaalders
{{Compiler optimizations}}
{{DEFAULTSORT:Program Optimization}}
[[Category:Software optimization|*]]
[[Category:Programming language topics]]
[[Category:Articles with example C code]]
[[Category:Computer optimization|*]]