Compiler-compiler
{{Use dmy dates|date=January 2020|cs1-dates=y}}
 
In [[computer science]], a '''compiler-compiler''' or '''compiler generator''' is a [[programming tool]] that creates a [[Parsing#Computer languages|parser]], [[interpreter (computer software)|interpreter]], or [[compiler]] from some form of formal description of a [[programming language]] and machine.
 
The most common type of compiler-compiler is called a '''parser generator'''.<ref>{{cite book |url=https://www.worldcat.org/oclc/70775643 |title=Compilers : principles, techniques, & tools |date=2007 |others=Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, Alfred V. Aho |isbn=978-0-321-48681-3 |edition=Second |___location=Boston |page=287 |oclc=70775643}}</ref> It handles only [[syntactic analysis]].
 
A formal description of a language is usually a [[formal grammar|grammar]] used as an input to a parser generator. It often resembles [[Backus–Naur form]] (BNF), [[extended Backus–Naur form]] (EBNF), or has its own syntax. Grammar files describe the [[Syntax (programming languages)|syntax]] of a generated compiler's target programming language and the actions that should be taken against its specific constructs.
 
[[Source code]] for a parser of the programming language is returned as the parser generator's output. This source code can then be compiled into a parser, which may be either standalone or embedded. The compiled parser then accepts the source code of the target programming language as an input and performs an action or outputs an [[abstract syntax tree]] (AST).
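For illustration (a hypothetical grammar and a hand-written sketch in Python; real parser generators such as yacc emit far more elaborate tables or code), here is a small EBNF-style grammar and the kind of recursive-descent parser that might be generated from it, accepting source text and producing an abstract syntax tree:

```python
# Hypothetical grammar (EBNF-like), supplied to an imagined generator:
#   expr   = term { ("+" | "-") term } ;
#   term   = factor { ("*" | "/") factor } ;
#   factor = NUMBER | "(" expr ")" ;
# A generator would emit code along these lines: one function per
# grammar rule, each returning an AST node.

import re

def tokenize(src):
    return re.findall(r"\d+|[()+\-*/]", src)

def parse(src):
    toks = tokenize(src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def eat(expected=None):
        nonlocal pos
        t = peek()
        if expected is not None and t != expected:
            raise SyntaxError(f"expected {expected!r}, got {t!r}")
        pos += 1
        return t

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = (eat(), node, term())   # left-associative
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            node = (eat(), node, factor())
        return node

    def factor():
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return int(eat())

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input: {peek()!r}")
    return tree

print(parse("1+2*3"))   # ('+', 1, ('*', 2, 3))
```

The tuples here stand in for AST nodes; a real generated parser would typically build richer node objects and attach semantic actions from the grammar file.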
One of the earliest (1964), and surprisingly powerful, compiler-compilers is [[META II]], which accepted an analytical grammar with output facilities [[Code generation (compiler)|that produce stack machine]] code, and was able to compile its own source code and other languages.
 
Among the earliest programs of the original [[Unix]] versions being built at [[Bell Labs]] was the two-part [[lex (software)|lex]] and [[yacc]] system, which was normally used to output [[C programming language]] code, but had a flexible output system that could be used for everything from programming languages to [[text file]] conversion. Their modern [[GNU Project|GNU]] versions are [[Flex (lexical analyser generator)|flex]] and [[GNU Bison|bison]].
 
Some experimental compiler-compilers take as input a formal description of programming language semantics, typically using [[denotational semantics]]. This approach is often called 'semantics-based compiling', and was pioneered by [[Peter Mosses]]' Semantic Implementation System (SIS) in 1978.<ref>Peter Mosses, "SIS: A Compiler-Generator System Using Denotational Semantics," Report 78-4-3, Dept. of Computer Science, University of Aarhus, Denmark, June 1978</ref> However, both the generated compiler and the code it produced were inefficient in time and space. No production compilers are currently built in this way, but research continues.
The Production Quality Compiler-Compiler ([[PQCC]]) project at [[Carnegie Mellon University]] does not formalize semantics, but does have a semi-formal framework for machine description.
 
Compiler-compilers exist in many flavors, including bottom-up rewrite machine generators (see [https://jburg.sourceforge.net/ JBurg]) used to tile syntax trees according to a rewrite grammar for code generation, and [[attribute grammar]] parser generators (e.g. [[ANTLR]] can be used for simultaneous type checking, constant propagation, and more during the parsing stage).
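The tiling idea can be shown with a much-simplified sketch (Python; the tree shape and instruction names are invented, and real BURS-style tools such as JBurg select tiles by cost via dynamic programming, which this omits): each subtree is matched against a pattern and covered by one emitted instruction.

```python
# Simplified sketch of tiling a syntax tree for code generation:
# every subtree is covered by a "tile" that emits one instruction.
# Instruction names and register allocation are invented.

_count = 0
def fresh():
    """Return a fresh virtual register number."""
    global _count
    _count += 1
    return _count

def tile(tree, emit):
    """Cover `tree` with tiles; return the register holding its value."""
    if isinstance(tree, int):                     # pattern: CONST
        r = fresh()
        emit(f"LOADI r{r}, {tree}")
        return r
    op, left, right = tree                        # pattern: op(l, r)
    rl, rr = tile(left, emit), tile(right, emit)  # tile children first
    r = fresh()
    emit(f"{'ADD' if op == '+' else 'MUL'} r{r}, r{rl}, r{rr}")
    return r

code = []
tile(("+", 1, ("*", 2, 3)), code.append)
print(code)
# ['LOADI r1, 1', 'LOADI r2, 2', 'LOADI r3, 3',
#  'MUL r4, r2, r3', 'ADD r5, r1, r4']
```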
 
===Metacompilers===
 
====The meaning of metacompiler====
In computer science, the prefix ''[[Meta (prefix)#Epistemology|meta]]'' is commonly used to mean ''about (its own category)''. For example, [[metadata]] are data that describe other data. A language that is used to describe other languages is a [[metalanguage]]. Meta may also mean [[Meta (prefix)|''on a higher level of abstraction'']]. A metalanguage operates on a higher level of abstraction in order to describe properties of a language. [[Backus–Naur form]] (BNF) is a formal metalanguage originally used to define [[ALGOL 60]]. BNF is a weak metalanguage, for it describes only the [[syntax]] and says nothing about the [[semantics]] or meaning. Metaprogramming is the writing of [[computer program]]s with the ability to treat [[Computer program|program]]s as their data. A metacompiler takes as input a [[metaprogramming|metaprogram]] written in a specialized [[metalanguage]] (a higher-level abstraction) specifically designed for the purpose of metaprogramming.<ref name="CWIC" /><ref name="TMETA" /> The output is an executable object program.
 
An analogy can be drawn: just as a ''C++'' compiler takes as input a ''C++'' programming language program, a ''meta''compiler takes as input a [[metaprogramming|''meta''programming]] [[metalanguage]] program.
This Forth use of the term metacompiler is disputed in mainstream computer science. See [[Forth (programming language)#Self-compilation and cross compilation|Forth (programming language)]] and [[History of compiler construction#Forth|History of compiler construction]]. The actual Forth process of compiling itself is a combination of Forth being a [[History of compiler construction#Self-hosting compilers|self-hosting]] [[extensible programming]] language and sometimes [[History of compiler construction#Cross compilation|cross compilation]], long-established terminology in computer science. Metacompilers are a general compiler-writing system. Moreover, the Forth metacompiler concept is indistinguishable from a self-hosting, extensible language: the actual process acts at a lower level, defining a minimum subset of Forth ''words'' that can be used to define additional Forth words; a full Forth implementation can then be defined from the base set. This sounds like a bootstrap process. The problem is that almost every general-purpose language compiler also fits the Forth metacompiler description.
: When (self-hosting compiler) X processes its own source code, resulting in an executable version of itself, X is a metacompiler.
Replace X with any common language, such as C, C++, [[Java (programming language)|Java]], [[Pascal (programming language)|Pascal]], [[COBOL]], [[Fortran]], [[Ada (programming language)|Ada]], or [[Modula-2]], and X would be a metacompiler according to the Forth usage of the term. A metacompiler operates at an abstraction level above the compiler it compiles; it operates at the same (self-hosting compiler) level only when compiling itself. The problem with this definition of metacompiler is that it can be applied to almost any language.
 
However, on examining the concept of programming in Forth, adding new words to the dictionary, thereby extending the language, is metaprogramming. It is this metaprogramming in Forth that makes it a metacompiler.
 
Programming in Forth is adding new words to the language. Changing the language in this way is [[metaprogramming]]. Forth is a metacompiler because it is a language specifically designed for metaprogramming. Programming in Forth extends Forth: adding words to the Forth vocabulary creates a new Forth [[Dialect (computing)|dialect]]. Forth is a specialized metacompiler for Forth language dialects.
 
==History==
Design of the original compiler-compiler was started by [[Tony Brooker]] and Derrick Morris in 1959, with initial testing beginning in March 1962.<ref>{{Cite web |last=Lavington |first=Simon |date=April 2016 |title=Tony Brooker and the Atlas Compiler Compiler |url=http://curation.cs.manchester.ac.uk/atlas/elearn.cs.man.ac.uk/_atlas/docs/Tony%20Brooker%20and%20the%20Atlas%20Compiler%20Compiler.pdf |url-status=live |access-date=2023-09-29 |format=PDF |archive-date=2023-03-26 |archive-url=https://web.archive.org/web/20230326214708/http://curation.cs.manchester.ac.uk/atlas/elearn.cs.man.ac.uk/_atlas/docs/Tony%20Brooker%20and%20the%20Atlas%20Compiler%20Compiler.pdf }}</ref> The Brooker Morris Compiler Compiler (BMCC) was used to create compilers for the new [[Atlas (computer)|Atlas]] computer at the [[University of Manchester]], for several languages: [[Mercury Autocode]], Extended Mercury Autocode, [[Atlas Autocode]], [[ALGOL 60]] and ASA [[Fortran]]. At roughly the same time, related work was being done by E. T. (Ned) Irons at Princeton, and by Alick Glennie at the Atomic Weapons Research Establishment at Aldermaston, whose "Syntax Machine" paper (declassified in 1977) inspired the META series of translator writing systems mentioned below.
 
The early history of metacompilers is closely tied with the history of SIG/PLAN Working group 1 on Syntax Driven Compilers. The group was started primarily through the effort of Howard Metcalfe in the Los Angeles area.<ref name="Metcalfe1"/> In the fall of 1962, Howard Metcalfe designed two compiler-writing interpreters. One used a bottom-to-top analysis technique based on a method described by Ledley and Wilson.<ref name="Ledleyl"/> The other used a top-to-bottom approach based on work by Glennie to generate random English sentences from a [[context-free grammar]].<ref name="Glenniel"/>
 
At the same time, Val Schorre described two "meta machines", one generative and one analytic. The generative machine was implemented and produced random algebraic expressions. Meta I, the first metacompiler, was implemented by Schorre on an IBM 1401 at UCLA in January 1963. His original interpreters and metamachines were written directly in a pseudo-machine language. [[META II]], however, was written in a higher-level metalanguage able to describe its own compilation into the pseudo-machine language.<ref name="METAII"/><ref name="SMALGOL"/><ref name="META1"/>
With the resurgence of ___domain-specific languages and the need for parser generators which are easy to use, easy to understand, and easy to maintain, metacompilers are becoming a valuable tool for advanced software engineering projects.
 
Other examples of parser generators in the yacc vein are [[ANTLR]], [[Coco/R]],<ref name="Rechenberg-Mössenböck_1985"/> CUP,{{Citation needed|date=March 2012}} [[GNU Bison]], Eli,<ref>{{cite journal|doi=10.1145/129630.129637 |title=Eli: A complete, flexible compiler construction system |year=1992 |last1=Gray |first1=Robert W. |last2=Levi |first2=Steven P. |last3=Heuring |first3=Vincent P. |last4=Sloane |first4=Anthony M. |last5=Waite |first5=William M. |journal=Communications of the ACM |volume=35 |issue=2 |pages=121–130 |s2cid=5121773 |doi-access=free }}</ref> FSL,{{Citation needed|date=March 2012}} [[SableCC]], SID (Syntax Improving Device),<ref>{{cite journal|doi=10.1093/comjnl/11.1.31 |doi-access=free |title=A syntax improving program |year=1968 |last1=Foster |first1=J. M. |journal=The Computer Journal |volume=11 |pages=31–34 }}</ref> and [[JavaCC]]. While useful, pure parser generators only address the parsing part of the problem of building a compiler. Tools with broader scope, such as [[PQCC]], [[Coco/R]] and [[DMS Software Reengineering Toolkit]] provide considerable support for more difficult post-parsing activities such as semantic analysis, code optimization and generation.
 
==Schorre metalanguages==
A syntax equation:
<pre><name> = <body>;</pre>
is a compiled ''test'' function returning ''success'' or ''failure''. <name> is the function name. <body> is a form of logical expression consisting of tests that may be grouped, have alternates, and output productions. A ''test'' is like a ''bool'' in other languages, ''success'' being ''true'' and ''failure'' being ''false''.
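The behavior of such compiled test functions can be sketched in Python (the combinator names ''literal'', ''alt'', and ''seq'' are illustrative, not Schorre's): each returns success or failure, success consuming input and failure leaving it for alternatives.

```python
# Sketch: each syntax equation compiles to a "test" function that
# tries to match input at a position. Success advances the position;
# failure leaves it unchanged so an alternative can be tried.

def literal(s):
    """Test for a literal string."""
    def test(inp, pos):
        if inp.startswith(s, pos):
            return True, pos + len(s)   # success
        return False, pos               # failure: nothing consumed
    return test

def alt(*tests):
    """Alternates: succeed if any test succeeds."""
    def test(inp, pos):
        for t in tests:
            ok, new_pos = t(inp, pos)
            if ok:
                return True, new_pos
        return False, pos
    return test

def seq(*tests):
    """Grouped sequence: every test must succeed in order."""
    def test(inp, pos):
        p = pos
        for t in tests:
            ok, p = t(inp, p)
            if not ok:
                return False, pos
        return True, p
    return test

# Hypothetical equation:  greeting = ("hi" / "hello") "!" ;
greeting = seq(alt(literal("hi"), literal("hello")), literal("!"))

print(greeting("hello!", 0))  # (True, 6)
print(greeting("hey!", 0))    # (False, 0)
```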
 
Defining a programming language analytically top down is natural. For example, a program could be defined as:
{{main|TREE-META}}
 
TREE-META introduced the tree-building operators ''':'''<''node_name''> and '''['''<''number''>''']''', moving the output production transforms to unparse rules. The tree-building operators were used in the grammar rules, directly transforming the input into an [[abstract syntax tree]]. Unparse rules are also test functions, ones that match tree patterns. Unparse rules are called from a grammar rule when an [[abstract syntax tree]] is to be transformed into output code. The building of an [[abstract syntax tree]] and unparse rules allowed local optimizations to be performed by analyzing the parse tree.
 
Moving the output productions to the unparse rules made a clear separation of grammar analysis and code production. This made programs easier to read and understand.
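A rough sketch of this separation (Python; the node shapes, the `unparse` function, and the stack-machine output are invented for illustration): the parser builds nodes such as `('ADD', left, right)`, and unparse rules pattern-match those trees, apply local optimizations, and emit code.

```python
# Sketch of TREE-META's split: grammar rules build tree nodes;
# unparse rules pattern-match trees and emit code, which allows
# local optimizations on the tree before emission.

def unparse(tree, out):
    """Test-style unparse rules for a tiny expression tree."""
    if isinstance(tree, tuple) and tree[0] == "ADD":
        _, x, y = tree
        if y == 0:                  # local optimization: ADD[x, 0] => x
            return unparse(x, out)
        unparse(x, out)
        unparse(y, out)
        out.append("ADD")           # emit a stack-machine add
        return True                 # the rule matched: success
    if isinstance(tree, int):
        out.append(f"PUSH {tree}")
        return True
    return False                    # no rule matched: failure

code = []
unparse(("ADD", ("ADD", 1, 0), 2), code)
print(code)   # ['PUSH 1', 'PUSH 2', 'ADD']
```

Note how the `ADD[x, 0]` case shows the kind of local optimization the text describes: the redundant addition never reaches the output.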
In 1968–1970, Erwin Book, Dewey Val Schorre, and Steven J. Sherman developed CWIC (Compiler for Writing and Implementing Compilers)<ref name="CWIC" /> at [[System Development Corporation]] ([http://special.lib.umn.edu/findaid/xml/cbi00090-098.xml#series6 Charles Babbage Institute Center for the History of Information Technology], Box 12, folder 21).
 
CWIC is a compiler development system composed of three special-purpose, ___domain-specific languages, each intended to permit the description of certain aspects of translation in a straightforward manner. The syntax language is used to describe the recognition of source text and the construction from it of an intermediate [[Tree (data structure)|tree]] structure. The generator language is used to describe the transformation of the [[Tree (data structure)|tree]] into the appropriate object language.
 
The syntax language follows Dewey Val Schorre's previous line of metacompilers. It most resembles TREE-META, having [[Tree (data structure)|tree]]-building operators in the syntax language. The unparse rules of TREE-META are extended to work with the object-based generator language, which is based on [[LISP 2]].
 
CWIC includes three languages:
* '''Syntax''': Transforms the source program input into list structures using grammar transformation formulas. A parsed expression structure is passed to a generator by placing a generator call in a rule. A [[Tree (data structure)|tree]] is represented by a list whose first element is a node object. The language has operators, '''<''' and '''>''', specifically for making lists. The colon ''':''' operator is used to create node objects: ''':ADD''' creates an ADD node. The exclamation '''!''' operator combines a number of parsed entries with a node to make a [[Tree (data structure)|tree]]. Trees created by syntax rules are passed to generator functions, which return success or failure. The syntax language is very close to TREE-META; both use a colon to create a node. CWIC's [[Tree (data structure)|tree]]-building exclamation !<number> functions the same as TREE-META's [<number>].
* '''Generator''': a named series of transforming rules, each consisting of an unparse (pattern-matching) rule and an output production written in a LISP 2-like language. The translation was to IBM 360 binary machine code. Other facilities of the generator language generalized output.<ref name="CWIC" />
* '''[[MOL-360]]''': an independent [[system programming language|mid level implementation language]] for the IBM System/360 family of computers developed in 1968 and used for writing the underlying support library.
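CWIC's tree-as-list convention can be sketched as follows (Python; `Node`, `make_tree`, and the parse stack are assumed names for illustration, not CWIC's): a tree is a list whose first element is a node object, and the `:name !n` pair combines the top parsed entries with a new node.

```python
# Sketch: CWIC represents a tree as a list whose first element is
# the node. ':ADD' creates a node; '!2' combines the top two parsed
# entries with that node, so  x + y  becomes  [ADD, x, y].

class Node:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

def make_tree(node_name, parse_stack, arity):
    """Mimic ':name' followed by '!arity': pop entries, build a tree."""
    branches = parse_stack[-arity:]
    del parse_stack[-arity:]
    parse_stack.append([Node(node_name)] + branches)

stack = ["x", "y"]          # two parsed entries
make_tree("ADD", stack, 2)  # corresponds to  :ADD !2
print(stack)                # [[ADD, 'x', 'y']]
```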
<pre>
...</pre>
The code to process a given [[Tree (data structure)|tree]] included the features of a general-purpose programming language, plus a form: &lt;stuff&gt;, which would emit (stuff) onto the output file.
A generator call may be used in the unparse_rule. The generator is passed the element of the unparse_rule pattern in which it is placed, and its return values are listed in (). For example:
<pre> expr_gen(ADD[expr_gen(x),expr_gen(y)]) =&gt;
&lt;AR + (x*16)+y;&gt;
return r1;
...</pre>
That is, if the parse [[Tree (data structure)|tree]] looks like (ADD[<something1>,<something2>]), expr_gen(x) would be called with <something1> and return x. A variable in the unparse rule is a local variable that can be used in the production_code_generator. expr_gen(y) is called with <something2> and returns y. A generator call in an unparse rule is passed the element in the position it occupies. Hopefully, in the above, x and y will be registers on return. The last transform is intended to load an atomic into a register and return the register. The first production would be used to generate the 360 "AR" (Add Register) instruction with the appropriate values in general registers. The above example is only part of a generator. Every generator expression evaluates to a value that can then be further processed. The last transform could just as well have been written as:
<pre>
(x)=&gt; return load(getreg(), x);
</pre>
* [[GNU Bison]]
* [[Coco/R]], Coco-2<ref name="Rechenberg-Mössenböck_1985"/>
* Copper<ref>{{Cite web |title=Copper {{!}} Minnesota Extensible Language Tools Group |url=https://melt.cs.umn.edu/copper/ |access-date=2025-03-25 |website=melt.cs.umn.edu}}</ref>
* [[DMS Software Reengineering Toolkit]], a program transformation system with parser generators
* Epsilon Grammar Studio
* [[Parboiled (Java)|parboiled]], a Java library for building parsers.
* [[Packrat parser]]
* [[PackCC]], a [[packrat parser]] with [[left recursion]] support.
* [[PQCC]], a compiler-compiler that is more than a parser generator.
* Syntax Improving Device (SID)
* [[SYNTAX]], an integrated toolset for compiler construction.
* tacc – The Alternative Compiler Compiler<ref>{{Cite web |url=http://legomatrix.com/tacc/tacc.htm |access-date=2025-03-25 |website=legomatrix.com}}</ref>
* [[TREE-META]]
* [[Yacc]]