* The average number of characters in any given word on a page may be estimated at 5 ([[Wikipedia:Size comparisons]])
 
Given this scenario, an uncompressed index (assuming a non-[[conflation|conflated]], simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone.{{cn}} This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.{{cn}}
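The figure follows directly from the stated assumptions, with 500 billion word entries at five characters (one byte each) per word:

:<math>5 \times 10^{11}\ \text{words} \times 5\ \tfrac{\text{bytes}}{\text{word}} = 2.5 \times 10^{12}\ \text{bytes} = 2500\ \text{gigabytes}</math>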
 
Notably, large-scale search engine designs incorporate the cost of storage as well as the cost of the electricity to power the storage. Thus, compression is a form of cost saving.{{cn}}
 
==Document parsing==