Sawzall (programming language): Difference between revisions

"Sawzall," not "SAWZALL," seems to be the capitalization used in Google research papers (and some of the time in this article), so I'm going to standardize on this convention.
Remove hatnote-like text: not necessary at this unambiguous title (WP:NAMB); term does not redirect here
 
(55 intermediate revisions by 49 users not shown)
Line 1:
{{Short description|Programming language}}
{{otheruses3|Sawzall}}
{{refimprove|date=April 2011}}
{{Infobox programming language
|name = Sawzall
|logo =
|paradigm =
|year = {{Start date and age|2003}}
|designer =
|developer = [[Google]]
|latest_release_version =
|latest_release_date =
|typing =
|implementations =
|dialects =
|influenced_by =
|influenced =
|current version =
|operating_system =
|license = [[Apache License 2.0]]
|website = {{URL|https://code.google.com/archive/p/szl/}}
}}
'''Sawzall''' is a procedural [[domain-specific language|domain-specific]] [[programming language]], used by [[Google]] to process large numbers of individual [[log file|log]] records. Sawzall was first described in 2003,<ref>Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan. [http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/sv//archive/sawzall-sciprog.pdf Interpreting the Data: Parallel Analysis with Sawzall]</ref> and the szl runtime was open-sourced in August 2010.<ref>[http://code.google.com/p/szl/ Sawzall's open source project at Google Code].</ref> However, since the [[MapReduce]] table aggregators have not been released,<ref name="open-source-scope"/> the open-sourced runtime is not useful for large-scale data analysis of multiple log files off the shelf. Sawzall has been replaced by Lingo (logs in [[Go (programming language)|Go]]) for most purposes within Google.<ref>{{cite web|url=http://www.unofficialgoogledatascience.com/2015/12/replacing-sawzall-case-study-in-domain.html|title=Replacing Sawzall|date=2015-12-04|access-date=2018-06-18}}</ref>
 
==Motivation==
Sawzall is an interpreted, procedural, domain-specific programming language used at [[Google]] to process very large quantities of data. [[MapReduce]] and Haskell are related tools that offer powerful, functional-style list processing.
Google's server logs are stored as large collections of records ([[Protocol Buffers]]) that are partitioned over many disks within [[Google File System|GFS]]. In order to perform calculations involving the logs, engineers can write [[MapReduce]] programs in C++ or Java. MapReduce programs need to be compiled and may be more verbose than necessary, so writing a program to analyze the logs can be time-consuming. To make it easier to write quick scripts, [[Rob Pike]] et al. developed the Sawzall language. A Sawzall script runs within the Map phase of a MapReduce and "emits" values to tables. Then the Reduce phase (which the script writer does not have to be concerned about) aggregates the tables from multiple runs into a single set of tables.
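
The division of labour described above can be modelled in a few lines of Python. This is an illustrative sketch only, not Google's implementation: the names <code>count_and_sum_script</code>, <code>run_map_phase</code>, and <code>reduce_tables</code> are hypothetical, the records are plain numbers rather than Protocol Buffers, and only <code>sum</code>-style tables are modelled.

```python
from collections import defaultdict


def count_and_sum_script(record, emit):
    """Hypothetical per-record script: sees one record, outputs only via emit."""
    emit("count", 1)
    emit("total", record)


def run_map_phase(script, shard):
    """Run the script once per record in one shard, filling per-shard sum tables."""
    tables = defaultdict(float)

    def emit(table_name, value):
        tables[table_name] += value

    for record in shard:
        script(record, emit)
    return dict(tables)


def reduce_tables(per_shard_tables):
    """Aggregate the tables from every shard into a single set of tables."""
    merged = defaultdict(float)
    for tables in per_shard_tables:
        for name, value in tables.items():
            merged[name] += value
    return dict(merged)


# Stand-in for log records partitioned over several disks.
shards = [[1.0, 2.0], [3.0], [4.0, 5.0]]
partials = [run_map_phase(count_and_sum_script, shard) for shard in shards]
result = reduce_tables(partials)
```

The script author writes only something like <code>count_and_sum_script</code>; the merging step is generic, which is why the script writer does not need to be concerned with it.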
 
Currently, only the language runtime (which runs a Sawzall script once over a single input) has been open-sourced; the supporting program built on MapReduce has not been released.<ref name="open-source-scope">[http://groups.google.com/group/szl-users/browse_thread/thread/c0d90423d0fc27bd Discussion on which parts of Sawzall are open-source].</ref>
A discussion group exists at UCSC School of Engineering led by [http://groups-beta.google.com/group/ucsc-cmps-253-spring-2007 Cormac Flanagan].
 
==Features==
Notable features include:
* A Sawzall script has a single input (a log record) and can output only by emitting to tables. The script can have no other side-effects.
* A script can define any number of output tables. Table types include:
** <code>collection</code> saves every value emitted
** <code>sum</code> saves the sum of every emitted value
** <code>maximum(n)</code> saves only the n highest values, ranked by a given weight.
* In addition, there are several statistical table types that give approximate results. The higher the parameter n, the more accurate the estimates are.
** <code>sample(n)</code> gives a random sample of n values from all the emitted values
** <code>quantile(n)</code> calculates a cumulative probability distribution of the given numbers.
** <code>top(n)</code> gives n values that are probably the most frequent of the emitted values.
** <code>unique(n)</code> estimates the number of unique values emitted.
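
The exact table types listed above can be approximated in Python as small aggregator classes. This is a sketch under stated assumptions rather than the szl implementation: the class names, the <code>emit</code> and <code>result</code> methods, and the sample data are all hypothetical, and the statistical table types are not modelled.

```python
import heapq


class CollectionTable:
    """Models `collection`: keeps every emitted value."""

    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)


class SumTable:
    """Models `sum`: keeps only the running total."""

    def __init__(self):
        self.total = 0

    def emit(self, value):
        self.total += value


class MaximumTable:
    """Models `maximum(n)`: keeps the n values with the highest weights."""

    def __init__(self, n):
        self.n = n
        self.entries = []  # (weight, value) pairs, heaviest first

    def emit(self, value, weight):
        self.entries.append((weight, value))
        # Trim to the n heaviest entries so memory stays bounded.
        self.entries = heapq.nlargest(self.n, self.entries, key=lambda e: e[0])

    def result(self):
        return [value for _, value in self.entries]


seen = CollectionTable()
total_latency = SumTable()
slowest = MaximumTable(2)

# Hypothetical (query, latency) records.
for query, latency in [("a", 30), ("b", 120), ("c", 75)]:
    seen.emit(query)
    total_latency.emit(latency)
    slowest.emit(query, weight=latency)
```

Each table needs only its own emitted values to produce a result, which is what allows tables from separate runs to be combined afterwards.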
 
Sawzall's design favors efficiency and engine simplicity over power:
* Sawzall is statically typed, and the engine compiles the script to [[x86]] before running it.
* Sawzall supports the [[compound data type]]s lists, maps, and structs. However, there are no references or pointers. All assignments and function arguments create copies. This means that [[recursive data structure]]s and cycles are impossible.
* As in C, functions can modify [[global variable]]s and [[local variable]]s but are not closures.
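
The copy-on-assignment behaviour can be illustrated by contrast with Python, where assignment normally shares a reference. In the sketch below, the helper names <code>assign</code> and <code>add_hit</code> are hypothetical, and <code>copy.deepcopy</code> stands in for the copying that the text says Sawzall performs on every assignment and function argument.

```python
import copy


def assign(value):
    """Model a copying assignment: the target receives an independent copy."""
    return copy.deepcopy(value)


def add_hit(counts, key):
    """Model pass-by-value: the function works on its own copy of the map."""
    counts = assign(counts)
    counts[key] = counts.get(key, 0) + 1
    return counts


original = {"home": 1, "paths": ["/a"]}

snapshot = assign(original)
snapshot["paths"].append("/b")  # changes the copy only

updated = add_hit(original, "home")  # the caller's map is left untouched
```

Because every value is copied rather than shared, a value can never end up containing a reference to itself, which is consistent with the statement that cycles are impossible.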
 
====Sawzall code====
This complete Sawzall program will read the input and produce three results: the number of records, the sum of the values, and the sum of the squares of the values.
Line 17 ⟶ 57:
emit sum_of_squares <- x * x;
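
For readers unfamiliar with Sawzall, the behaviour described above can be approximated in Python. Only the final <code>emit</code> statement appears in this excerpt, so the names <code>count</code> and <code>total</code>, the function name <code>run_program</code>, and the assumption that each input record is a single floating-point value are taken from the prose description rather than from the program text.

```python
def run_program(records):
    """Return the three results the program is described as producing."""
    count = 0             # number of records (assumed name)
    total = 0.0           # sum of the values (assumed name)
    sum_of_squares = 0.0  # sum of the squares of the values

    for record in records:
        x = float(record)        # assumed: each record holds one number
        count += 1
        total += x
        sum_of_squares += x * x  # mirrors `emit sum_of_squares <- x * x;`

    return count, total, sum_of_squares


count, total, sum_of_squares = run_program([1.0, 2.0, 3.0])
```

Unlike this loop, a Sawzall script sees one record at a time and leaves the accumulation to the tables it emits to.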
 
== See also ==
* [[Pig (programming tool)|Pig]] – similar tool and language for use with [[Apache Hadoop]]
* [[Sawmill (software)]]
 
== References ==
{{Reflist}}
 
== Further reading ==
* S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, in: 19th ACM Symposium on Operating Systems Principles, Proceedings, 17 ACM Press, 2003, pp.&nbsp;29–43.
 
== External links ==
* [https://code.google.com/archive/p/szl/ Sawzall (szl) project at the Google Code Archive]
* [https://web.archive.org/web/20110604204310/http://www.soe.ucsc.edu/classes/cmps253/Spring07/notes/mapreduce.pdf MapReduce]
 
{{Rob Pike navbox}}
{{Google FOSS}}
 
[[Category:Domain-specific programming languages]]
[[Category:Procedural programming languages]]
[[Category:Google software]]
[[Category:Programming languages created in 2003]]
[[Category:Software using the Apache license]]

{{compu-prog-stub}}
{{Software-type-stub}}