Apache Nutch


Nutch is an effort to build an open source search engine that uses Lucene Java for its search and index components. The fetcher ("robot" or "web crawler") was written from scratch solely for this project. Nutch has a highly modular architecture that allows developers to create plugins for media-type parsing, data retrieval, querying and clustering. It is written entirely in the Java programming language, but its data is stored in language-independent formats.

In June 2003, the project ran a successful demonstration system covering 100 million pages. In June 2005, Nutch graduated from the Apache Incubator and became a subproject of Lucene.

To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities have since been spun out into their own subproject, called Hadoop.
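As an illustration of the MapReduce style of processing mentioned above, the following is a minimal single-process sketch that builds an inverted index from a handful of pages. It is not Nutch or Hadoop code: the function names (`map_page`, `shuffle`, `reduce_term`, `build_index`) and the example URLs are hypothetical, and a real deployment would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict


def map_page(url, text):
    """Map step: emit one (term, url) pair per distinct word on a page."""
    for term in set(text.lower().split()):
        yield term, url


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_term(term, urls):
    """Reduce step: collapse all URLs for a term into a sorted posting list."""
    return term, sorted(set(urls))


def build_index(pages):
    """Run map, shuffle and reduce in sequence over a dict of url -> page text."""
    pairs = (pair for url, text in pages.items() for pair in map_page(url, text))
    return dict(reduce_term(term, urls) for term, urls in shuffle(pairs).items())


# Hypothetical sample data, for illustration only.
pages = {
    "http://example.org/a": "open source search",
    "http://example.org/b": "search engine",
}
index = build_index(pages)
```

Because each page is mapped independently and each term is reduced independently, both steps can in principle be split across machines, which is the property that makes this model suited to large crawl and index jobs.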

Lucene Nutch
Developer(s): Apache Software Foundation
Stable release: 0.9.0 / April 2, 2007
Operating system: Cross-platform
Type: Search engine
License: Apache License 2.0
Website: http://lucene.apache.org/nutch

Scalability

IBM Research studied the performance [1] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project [2]. The study found that Nutch/Lucene running on a cluster of blades could reach a level of performance that was not achievable on any scale-up computer such as the Power5.

Search engines built with Nutch

References