Cuneiform (programming language)

Cuneiform
Paradigm	functional, scientific workflow
Designed by	Jörgen Brandt
First appeared	2013
Stable release	3.0.3 / April 17, 2018
Typing discipline	Simple Types
Implementation language	Erlang
OS	Linux, Mac OS
License	Apache License 2.0
Filename extensions	.cfl
Website	cuneiform-lang.org
Influenced by
	Taverna, Lisp

Cuneiform is an open-source workflow language for large-scale scientific data analysis.^[1]^[2] It is a workflow DSL in the form of a functional programming language promoting parallelizable algorithmic skeletons. External tools and libraries, written in languages such as R or Python, can be integrated via a foreign function interface. Cuneiform's data-driven evaluation model and integration of external software originate in scientific workflow languages like Taverna, KNIME, or Galaxy, while its algorithmic skeletons (second-order functions) for parallel execution originate in data-parallel programming models like MapReduce or Pig Latin. Cuneiform is implemented in Erlang and therefore must run on an Erlang virtual machine (BEAM), similar to the way Java must run on a Java virtual machine (JVM). Cuneiform scripts can be executed on top of Hadoop.^[3]^[4]^[5]^[6]^[7]

External software integration

External tools and libraries are integrated in a Cuneiform script through its foreign function interface. By defining a task in a foreign language, it is possible to use the API of an external tool or library. This way, tools can be integrated directly without the need to write a wrapper or reimplement the tool.^[8]

Currently supported foreign programming languages (as of version 3.0.x) are Bash, Erlang, Java, MATLAB, GNU Octave, Perl, Python, R, and Racket.

Parallel execution

The task applications in a Cuneiform script form a data dependency graph. This dependency graph constrains the order in which tasks can be evaluated. Apart from data dependencies, tasks can be evaluated in any order, assuming tasks are always side effect-free and deterministic.
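The ordering constraint can be illustrated with a minimal Python sketch (not Cuneiform syntax, and not how the Cuneiform interpreter is implemented). The task names and their dependencies are hypothetical; the sketch only shows that tasks with no data dependency between them fall into the same batch and could therefore run in any order or in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks whose output it consumes.
dependencies = {
    "sort_a": set(),
    "sort_b": set(),
    "merge": {"sort_a", "sort_b"},
    "index": {"merge"},
    "stats": {"merge"},
}

def parallel_batches(deps):
    """Group tasks into batches; tasks inside one batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # All tasks whose inputs are already available.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        # Mark the whole batch as finished so dependent tasks become ready.
        sorter.done(*ready)
    return batches

batches = parallel_batches(dependencies)
print(batches)  # [['sort_a', 'sort_b'], ['merge'], ['index', 'stats']]
```

Reordering tasks inside a batch is only safe under the stated assumption that tasks are side effect-free and deterministic.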

How much parallelism a script exposes depends on the skeletons it uses:

Map: Applies a task to each element in a list. Each task application can run in parallel.
Cartesian product: Takes the Cartesian product of several lists and applies a task to each combination. Each task application can run in parallel.
Dot product: Given a pair of lists of equal sizes, each element of the first list is combined with its corresponding element in the second list. A task is applied to each combination. Each task application can run in parallel.
Aggregate: Applies a task to the list as a whole without decomposing it. Since the task is applied only once for the whole list, this skeleton leaves the parallelism potential unchanged.
Conditional: Evaluates a program branch, depending on a condition computed at runtime. This skeleton leaves the parallelism potential unchanged.
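The skeletons listed above can be approximated in Python as follows. This is an illustrative sketch, not Cuneiform syntax; the function names are hypothetical, and a thread pool stands in for whatever execution backend actually runs the task applications.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def map_skeleton(task, xs):
    """Apply task to each element; applications are independent of each other."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, xs))

def cartesian_skeleton(task, *lists):
    """Apply task to every combination drawn from the input lists."""
    combos = list(product(*lists))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda combo: task(*combo), combos))

def dot_skeleton(task, xs, ys):
    """Apply task to corresponding elements of two lists of equal size."""
    if len(xs) != len(ys):
        raise ValueError("dot product requires lists of equal size")
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, xs, ys))

def aggregate_skeleton(task, xs):
    """Apply task once to the whole list; no additional parallelism."""
    return task(xs)

def conditional_skeleton(condition, then_branch, else_branch):
    """Evaluate only the branch selected by a condition known at runtime."""
    return then_branch() if condition else else_branch()

def concat(x, y):
    return x + y

print(map_skeleton(str.upper, ["a", "b"]))                # ['A', 'B']
print(cartesian_skeleton(concat, ["a", "b"], ["1", "2"]))  # ['a1', 'a2', 'b1', 'b2']
print(dot_skeleton(concat, ["a", "b"], ["1", "2"]))        # ['a1', 'b2']
print(aggregate_skeleton(len, ["a", "b", "c"]))            # 3
```

Map, Cartesian product, and dot product each produce one task application per element or combination, which is where the parallelism comes from; aggregate and conditional produce a single evaluation.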

By partitioning input data and using parallelizable skeletons to process the partitions, the interpreter can exploit data parallelism even if the integrated tools are single-threaded. Workflows can also be executed in distributed compute environments.
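The partition-then-process pattern can be sketched in Python under a few assumptions: count_words is a hypothetical stand-in for a single-threaded external tool, the partition size is arbitrary, and a thread pool replaces the real execution backend.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_words(lines):
    """Hypothetical single-threaded tool: counts the words in a chunk of lines."""
    return sum(len(line.split()) for line in lines)

def parallel_word_count(lines, size):
    """Partition the input, process each partition independently, combine the results."""
    chunks = partition(lines, size)
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Combining the partial results is a single aggregate step.
    return sum(partial_counts)

lines = ["a b c", "d e", "f", "g h i j"]
print(parallel_word_count(lines, size=2))  # 10
```

In pure Python the threads mainly illustrate the structure; the speed-up described in the text would come from running an external tool once per partition.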

Examples

A hello-world script:

def greet( person : Str ) -> <out : Str>
in Bash *{
  out="Hello $person"
}*

( greet( person = "world" )|out );

This script defines a task greet in Bash which prepends the string "Hello " to its string argument person. The function produces a record with a single string field out. Applying greet, binding the argument person to the string "world", produces the record <out = "Hello world">. Projecting this record to its field out evaluates to the string "Hello world".
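The evaluation described here can be mimicked in Python. The sketch below is an assumption about how such a foreign task could be emulated, not a description of Cuneiform's actual foreign function interface: the helper run_shell_task is hypothetical, it binds each argument as a shell variable, runs the task body, and reads the output variables back into a dictionary that plays the role of the record.

```python
import shlex
import shutil
import subprocess

def run_shell_task(body, inputs, outputs):
    """Bind inputs as shell variables, run the body, return the output variables."""
    shell = shutil.which("bash") or "sh"
    # Bind each argument as a shell variable, quoting the value safely.
    prelude = "\n".join(f"{name}={shlex.quote(value)}" for name, value in inputs.items())
    # After the body has run, print each output variable on its own line.
    epilogue = "\n".join(f'printf "%s\\n" "${name}"' for name in outputs)
    script = "\n".join([prelude, body, epilogue])
    result = subprocess.run([shell, "-c", script], capture_output=True, text=True, check=True)
    # Assumes single-line output values printed last, one per output field.
    values = result.stdout.splitlines()[-len(outputs):]
    return dict(zip(outputs, values))

def greet(person):
    """Python counterpart of the greet task: returns a record with one field, out."""
    return run_shell_task('out="Hello $person"', {"person": person}, ["out"])

record = greet(person="world")
print(record)         # {'out': 'Hello world'}
print(record["out"])  # Hello world
```

Indexing the dictionary with "out" corresponds to projecting the record to its field out.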

Command line tools can be integrated by defining a task in Bash:

def samtoolsSort( bam : File ) -> <sorted : File>
in Bash *{
  sorted=sorted.bam
  samtools sort -m 2G $bam -o $sorted
}*

In this example a task samtoolsSort is defined. It calls the tool SAMtools, consuming an input file, in BAM format, and producing a sorted output file, also in BAM format.

Release History

Version	Appearance	Implementation Language	Distribution Platform	Foreign Languages
3.0.x	Feb. 2018	Erlang	Distributed Erlang	Bash, Erlang, Java, MATLAB, GNU Octave, Perl, Python, R, Racket
2.2.x	Apr. 2016	Erlang	HTCondor, Apache Hadoop	Bash, Perl, Python, R
2.0.x	Mar. 2015	Java	HTCondor, Apache Hadoop	Bash, BeanShell, Common Lisp, MATLAB, GNU Octave, Perl, Python, R, Scala
1.0.0	May 2014	Java	Apache Hadoop	Bash, Common Lisp, GNU Octave, Perl, Python, R, Scala

In April 2016, Cuneiform's implementation language switched from Java to Erlang and, in February 2018, its major distributed execution platform changed from Hadoop to distributed Erlang. Additionally, from 2015 to 2018, HTCondor was maintained as an alternative execution platform.

Cuneiform's surface syntax was revised twice, as reflected in the major version number.

Version 3

The current version of Cuneiform's surface syntax, in comparison to earlier drafts, is an attempt to close the gap to mainstream functional programming languages. It features a simple, statically checked type system and introduces records, in addition to lists, as a second type of compound data structure. Booleans are a separate base data type.

def untar( tar : File ) -> <fileLst : [File]>
in Bash *{
  tar xf $tar
  fileLst=`tar tf $tar`
}*

let hg38Tar : File =
  'hg38/hg38.tar';

let <fileLst = faLst : [File]> =
  untar( tar = hg38Tar );

faLst;

Version 2

The second draft of the Cuneiform surface syntax, first published in March 2015, remained in use for three years, surviving the transition from Java to Erlang as Cuneiform's implementation language. Evaluation differs from earlier approaches in that the interpreter reduces a query expression instead of traversing a static graph. While this surface syntax was in use, the interpreter was formalized and simplified, which resulted in a first specification of Cuneiform's semantics. Conditionals were added as a language feature. However, Booleans were encoded as lists, with the empty list serving as Boolean false and any non-empty list as Boolean true. Recursion was added later as a byproduct of formalization. Static type checking, however, was introduced only in Version 3.

The following script decompresses a zipped file and splits it into evenly sized partitions.

deftask unzip( <out( File )> : zip( File ) ) in bash *{
  unzip -d dir $zip
  out=`ls dir | awk '{print "dir/" $0}'`
}*

deftask split( <out( File )> : file( File ) ) in bash *{
  split -l 1024 $file txt
  out=txt*
}*

sotu = "sotu/stateoftheunion1790-2014.txt.zip";
fileLst = split( file: unzip( zip: sotu ) );

fileLst;

Version 1

In its first draft, published in May 2014, Cuneiform was closely related to Make in that it constructed a static data dependency graph which the interpreter traversed during execution. The major difference from later versions was the lack of conditionals, recursion, and static type checking. Files were distinguished from strings by prefixing single-quoted string values with a tilde ~. The script's query expression was introduced with the target keyword. Bash was the default foreign language. Function application had to be performed by using an apply form that took task as its first keyword argument. One year later, this surface syntax was replaced by a streamlined but similar version.

The following example script downloads a reference genome from an FTP server.

declare download-ref-genome;

deftask download-fa( fa : ~path ~id ) *{
    wget $path/$id.fa.gz
    gunzip $id.fa.gz
    mv $id.fa $fa
}*

ref-genome-path = ~'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes';
ref-genome-id = ~'chr22';

ref-genome = apply(
    task : download-fa
    path : ref-genome-path
    id : ref-genome-id
);

target ref-genome;

References

[1] https://github.com/joergen7/cuneiform
[2] Brandt, Jörgen; Bux, Marc N.; Leser, Ulf (2015). "Cuneiform: A functional language for large scale scientific data analysis" (PDF). Proceedings of the Workshops of the EDBT/ICDT. 1330: 17–26.
[3] http://www.saasfee.io
[4] "Scalable Multi-Language Data Analysis on Beam: The Cuneiform Experience by Jörgen Brandt". Erlang Central. Retrieved 28 October 2016.
[5] Bux, Marc; Brandt, Jörgen; Lipka, Carsten; Hakimzadeh, Kamal; Dowling, Jim; Leser, Ulf (2015). "SAASFEE: scalable scientific workflow execution engine" (PDF). Proceedings of the VLDB Endowment. 8 (12): 1892–1895.
[6] Bessani, Alysson; Brandt, Jörgen; Bux, Marc; Cogo, Vinicius; Dimitrova, Lora; Dowling, Jim; Gholami, Ali; Hakimzadeh, Kamal; Hummel, Michael; Ismail, Mahmoud; Laure, Erwin; Leser, Ulf; Litton, Jan-Eric; Martinez, Roxanna; Niazi, Salman; Reichel, Jane; Zimmermann, Karin (2015). "Biobankcloud: a platform for the secure storage, sharing, and processing of large biomedical data sets" (PDF). The First International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH 2015).
[7] "Scalable Multi-Language Data Analysis on Beam: The Cuneiform Experience". Erlang-factory.com. Retrieved 28 October 2016.
[8] "A Functional Workflow Language Implementation in Erlang" (PDF). Retrieved 28 October 2016.
