General Architecture for Text Engineering (GATE) is a Java software toolkit developed at the University of Sheffield, beginning in 1995, and now used worldwide by a broad community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.
GATE comprises an architecture, a free open-source API, a framework and a graphical development environment.
The GATE community and its research are involved in several European research projects, including TAO and SEKT.
The main component is ANNIE (A Nearly-New Information Extraction System), a set of modules comprising a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named-entity transducer and a coreference tagger. Currently supported languages are English, Spanish, Chinese, Arabic, French, German, Hindi, Cebuano and Romanian. Many plugins exist: for machine learning with Weka, RASP, MAXENT and SVM Light; for managing ontologies such as WordNet; for querying search engines such as Google or Yahoo; and for part-of-speech tagging with Brill or TreeTagger.
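The idea of chaining modules, each adding its own annotations to a document, can be sketched in a few lines of plain Java. This is only an illustration of the first two ANNIE stages (tokenizer and gazetteer); it does not use the GATE API, and the names `ToyPipeline`, `Annotation`, `tokenize` and `lookup` are hypothetical, chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy pipeline; NOT the GATE API.
class ToyPipeline {
    /** A typed span of document text, given as character offsets [start, end). */
    static final class Annotation {
        final String type;
        final int start;
        final int end;
        final String text;

        Annotation(String type, int start, int end, String text) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + "]=" + text;
        }
    }

    // A word, or a single non-space punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    /** Stage 1, tokenizer: one "Token" annotation per word or punctuation mark. */
    static List<Annotation> tokenize(String doc) {
        List<Annotation> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(doc);
        while (m.find()) {
            out.add(new Annotation("Token", m.start(), m.end(), m.group()));
        }
        return out;
    }

    /** Stage 2, gazetteer: a "Lookup" annotation for every token found in the word list. */
    static List<Annotation> lookup(List<Annotation> tokens, Set<String> gazetteer) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation t : tokens) {
            if (gazetteer.contains(t.text)) {
                out.add(new Annotation("Lookup", t.start, t.end, t.text));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "GATE was developed in Sheffield.";
        List<Annotation> tokens = tokenize(doc);
        List<Annotation> lookups = lookup(tokens, Set.of("Sheffield"));
        System.out.println(tokens.size() + " tokens, lookups: " + lookups);
    }
}
```

Each stage reads the output of the previous one and only adds annotations, so further stages (sentence splitting, tagging and so on) could be appended in the same way without changing the earlier ones.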
GATE can work with at least TXT, HTML, XML, DOC and PDF documents, and with Java serialization, Lucene, PostgreSQL and Oracle data stores, the relational databases being accessed through RDBMS storage over JDBC.
GATE also uses the JAPE (Java Annotation Patterns Engine) language for writing rules that annotate documents with tags. A debugger, a corpus benchmark tool and an annotation comparison tool are also included.
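The general shape of such an annotation rule, a pattern to match and an annotation to create when it matches, can be illustrated in plain Java. JAPE has its own rule syntax, which is not reproduced here; the sketch below is a hypothetical stand-in, and the names `ToyRule`, `Span` and `apply`, as well as the title list, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical pattern-to-annotation rule in plain Java; NOT JAPE syntax or the GATE API.
class ToyRule {
    /** A labelled span over token indices [from, to), plus the covered text. */
    static final class Span {
        final String label;
        final int from;
        final int to;
        final String text;

        Span(String label, int from, int to, String text) {
            this.label = label;
            this.from = from;
            this.to = to;
            this.text = text;
        }
    }

    // Assumed list of titles, for illustration only.
    private static final Set<String> TITLES = Set.of("Mr", "Mrs", "Dr", "Prof");

    /**
     * Pattern: a title token followed by a capitalised token.
     * Action: emit a "Person" span covering both tokens.
     */
    static List<Span> apply(List<String> tokens) {
        List<Span> out = new ArrayList<>();
        int i = 0;
        while (i + 1 < tokens.size()) {
            String current = tokens.get(i);
            String next = tokens.get(i + 1);
            boolean capitalised = !next.isEmpty() && Character.isUpperCase(next.charAt(0));
            if (TITLES.contains(current) && capitalised) {
                out.add(new Span("Person", i, i + 2, current + " " + next));
                i += 2; // skip past the match so spans never overlap
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Yesterday", "Dr", "Smith", "met", "Prof", "jones");
        for (Span s : apply(tokens)) {
            System.out.println(s.label + ": " + s.text); // prints "Person: Dr Smith"
        }
    }
}
```

Here "Prof jones" is not matched because the second token is not capitalised, which shows how a rule's pattern constrains which spans receive the new tag.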