{{Short description|Method for data management}}
'''Search engine indexing''' is the collecting, [[parsing]], and storing of data to facilitate fast and accurate [[information retrieval]]. Index design incorporates interdisciplinary concepts from [[linguistics]], [[cognitive psychology]], mathematics, [[informatics]], and [[computer science]]. An alternate name for the process, in the context of [[search engine]]s designed to find [[Web page|web pages]] on the Internet, is ''[[web indexing]]''.
Popular search engines focus on the [[Full-text search|full-text]] indexing of online, [[Natural language processing|natural language]] documents.<ref>Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.</ref> [[Media type]]s such as pictures, video,<ref>{{cite journal |last=Sikos |first=L. F. |date=August 2016 |title=RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing |journal=Multimedia Tools and Applications |doi=10.1007/s11042-016-3705-7 |s2cid=254832794 |url=https://ap01.alma.exlibrisgroup.com/view/delivery/61USOUTHAUS_INST/12165436490001831 }}{{Dead link|date=August 2023 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> audio,<ref>http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf {{Bare URL PDF|date=March 2022}}</ref> and graphics<ref>Charles E. Jacobs, Adam Finkelstein, David H. Salesin. [http://grail.cs.washington.edu/projects/query/mrquery.pdf Fast Multiresolution Image Querying]. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006</ref> are also searchable.
[[Metasearch engine|Meta search engines]] reuse the indices of other services and do not store a local index whereas cache-based search engines permanently store the index along with the [[text corpus|corpus]]. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while [[Intelligent agent|agent]]-based search engines index in [[Real time business intelligence|real time]].
==Indexing==
The purpose of storing an index is to optimize speed and performance in finding [[relevance (information retrieval)|relevant]] documents for a search query. Without an index, the search engine would [[Lexical analysis|scan]] every document in the [[Text corpus|corpus]], which would require considerable time and computing power.
===Index design factors===
| title = Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
| publisher = Cambridge University Press
| location =
| isbn = 0-521-58519-8}}.
</ref> An alternate representation is a [[suffix array]], which is considered to require less virtual memory and supports data compression such as the [[Burrows–Wheeler transform|BWT]] algorithm.
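The suffix array mentioned above can be illustrated with a minimal sketch (the helper name is illustrative, and this naive sort-based construction stands in for the linear-time algorithms production systems use):

```python
def build_suffix_array(text):
    """Return the starting offsets of all suffixes of `text`,
    ordered lexicographically.  Naive O(n^2 log n) construction:
    sort the offsets by comparing the suffixes they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])

# The sorted suffixes of "banana" are: "a", "ana", "anana", "banana", "na", "nana"
sa = build_suffix_array("banana")   # [5, 3, 1, 0, 4, 2]
```

Because the suffixes are in sorted order, any substring query over the text can then be answered with a binary search over the array.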
|}
This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a [[Boolean data type|Boolean]] index.
The inverted index is a [[sparse matrix]], since not all words are present in each document. To reduce [[Computer Storage|computer storage]] memory requirements, it is stored differently from a two-dimensional [[Array data structure|array]]. The index is similar to the [[document-term matrix|term document matrices]] employed by [[latent semantic analysis]]. The inverted index can be considered a form of a hash table. In some cases the index is a form of a [[binary tree]], which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a [[distributed hash table]].<ref>Tang, Hunqiang. [[Sandhya Dwarkadas|Dwarkadas, Sandhya]]. "Hybrid Global Local Indexing for Efficient Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://www.cs.rochester.edu/u/sandhya/papers/nsdi04.ps</ref>
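The hash-table view described above can be sketched as a dictionary mapping each term to the set of documents containing it (the document collection and function name here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs containing it.
    A hash table (dict) keyed by term: this stores only the
    non-empty cells of the sparse term-document matrix."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"]
index = build_inverted_index(docs)
index["quick"]   # documents 0 and 2 contain "quick"
```

As a Boolean index, this answers only "which documents contain this word": it records no frequencies or positions.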
==== Implementation of phrase search using an inverted index ====
For phrase searching, a specialized form of an inverted index called a positional index is used. A positional index stores not only the ID of the document containing the token but also the exact position(s) of the token within the document in the [[postings list]]. The occurrences of the phrase specified in the query are retrieved by navigating these postings lists and identifying the positions at which the desired terms occur in the expected order (the same as the order in the phrase). So if we are searching for occurrences of the phrase "First Witch", we would:
# Retrieve the postings list for "first" and "witch"
# Identify the first time that "witch" occurs after "first"
# Check that this occurrence is immediately after the occurrence of "first".
# If not, continue to the next occurrence of "first".
The postings lists can be navigated using a binary search in order to minimize the time complexity of this procedure.<ref>{{Cite book |last=Büttcher |first=Stefan |title=Information retrieval: implementing and evaluating search engines |last2=Clarke |first2=Charles L. A. |last3=Cormack |first3=Gordon V. |date=2016 |publisher=The MIT Press |isbn=978-0-262-52887-0 |edition=First MIT Press paperback |___location=Cambridge, Massachusetts London, England}}</ref>
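The steps above can be sketched with a simple word-offset positional index (the document collection and function names are illustrative, and the linear scan over positions stands in for the binary-search navigation just mentioned):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} postings, where
    positions are word offsets within each document."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return the IDs of documents in which the words of `phrase`
    occur consecutively, in order."""
    words = phrase.lower().split()
    postings = [index.get(w, {}) for w in words]
    # Candidate documents must contain every word of the phrase.
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = set()
    for doc_id in candidates:
        # For each occurrence of the first word, check that the i-th
        # phrase word occurs exactly i positions later.
        for p in postings[0][doc_id]:
            if all(p + i in postings[i][doc_id] for i in range(1, len(words))):
                hits.add(doc_id)
                break
    return hits

docs = ["enter the first witch", "the witch was first", "first witch speaks"]
idx = build_positional_index(docs)
phrase_search(idx, "first witch")   # matches documents 0 and 2, but not 1
```

Document 1 contains both words but not adjacently, so it is correctly rejected at step 3.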
===Index merging===
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of [[data compression|compression]] to reduce the size of the indices on [[computer storage|disk]].<ref>H.S. Heaps. Storage analysis of a compression coding for a document database. 1NFOR, I0(i):47-61, February 1972.</ref> Consider the following scenario for a full text, Internet search engine.
* It takes 8 bits (or 1 [[byte]]) to store a single character. Some [[character encoding|encodings]] use 2 bytes per character<ref>[https://www.unicode.org/faq/basic_q.html#15 The Unicode Standard - Frequently Asked Questions]. Verified Dec 2006.</ref><ref>[https://web.archive.org/web/20010209140313/http://www.uplink.freeuk.com/data.html Storage estimates]. Verified Dec 2006.</ref>
* The average number of characters in any given word on a page may be estimated at 5 ([[Wikipedia:Size comparisons]])
Given this scenario, an uncompressed index (assuming a non-[[conflation|conflated]], simple, index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone.
Notably, large scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.{{cn|date=December 2023}}
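The arithmetic of the scenario above can be checked directly (the 250 words per page is the average implied by 500 billion word entries over 2 billion pages, not a figure stated in the scenario):

```python
pages = 2_000_000_000     # web pages in the corpus, per the scenario
words_per_page = 250      # implied by the 500 billion word entries figure
chars_per_word = 5        # average word length estimate
bytes_per_char = 1        # single-byte encoding

word_entries = pages * words_per_page                       # 500 billion entries
total_bytes = word_entries * chars_per_word * bytes_per_char
gigabytes = total_bytes / 10**9                             # decimal gigabytes

print(word_entries)   # 500000000000
print(gigabytes)      # 2500.0
```

Doubling to 2 bytes per character, as some encodings require, doubles the estimate accordingly.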
==Document parsing==
===Challenges in natural language processing===
; Word boundary ambiguity: Native [[English language|English]] speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a [[multilingual]] indexer. In digital form, the texts of other languages such as [[Chinese language|Chinese]] or [[Japanese language|Japanese]] do not delineate words with whitespace, which makes identifying word boundaries a greater challenge.
; Language ambiguity: To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its [[language]] or [[lexical category]] ([[part of speech]]). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
; Diverse file formats: In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines that support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
; Faulty storage: The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. [[Binary data|Binary]] characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word referred to as a token. Such a program is commonly called a [[tokenizer]] or [[parser]] or [[Lexical analysis|lexer]]. Many search engines, as well as other natural language processing software, incorporate [[Comparison of parser generators|specialized programs]] for parsing, such as [[YACC]] or [[Lex programming tool|Lex]].
During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation. The parser can also identify entities such as [[email]] addresses, phone numbers, and [[URL]]s.
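A minimal tokenizer in the spirit described above might treat any run of alphanumeric characters as a token (a deliberate simplification; production lexers handle punctuation, entities, and non-Latin scripts far more carefully):

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens, treating anything
    that is not a letter or digit as a boundary -- a minimal stand-in
    for the lexers described above."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

tokenize("Computers don't 'know' that a space separates words.")
# -> ['computers', 'don', 't', 'know', 'that', 'a', 'space', 'separates', 'words']
```

Note how even this tiny example exposes a design decision: the apostrophe in "don't" splits the word in two, which a real tokenizer would need a rule to handle.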
===Language recognition===
* [[Gzip]] - File compressed with gzip
* [[Bzip2|BZIP]] - File compressed using bzip2
* [[Tar (computing)|Tape ARchive]] (TAR) - Unix archive file, not (itself) compressed
* TAR.Z, TAR.GZ or TAR.BZ2 - [[Unix]] archive files compressed with Compress, GZIP or BZIP2
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Document authors can manipulate the formatting information to include additional, hidden content. Examples of abusing document formatting for [[spamdexing]]:
* Including hundreds or thousands of words in a section that is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in [[HTML]], possibly through the use of [[CSS]] or [[JavaScript]]).
* Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
===Section recognition===
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the [[Internet|web]], such as newsletters and corporate reports, contain erroneous content and side-sections that do not contain primary material (that which the document is about). If such content is indexed as if it were normal content, two problems can result:
* Content in different sections is treated as related in the index
* Organizational ''side bar'' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the ''noscript'' tag to ensure that the web page is indexed properly. At the same time, this fact can also be [[spamdexing|exploited]] to cause the search engine indexer to 'see' different content than the viewer.
===HTML priority system===
{{Original research section|date=November 2013}}
Indexing often has to recognize [[HTML]] tags to organize priority. For example, tags such as ''strong'' and ''link'' can raise the priority of the words they mark up, although emphasized words at the very beginning of the text are not necessarily relevant. Some indexers, like those of [[Google]] and [[Bing (search engine)|Bing]], ensure that the [[search engine]] does not treat large blocks of text as a relevant source solely because of their length.
===Meta tag indexing===
Meta tag indexing plays an important role in organizing and categorizing web content. Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the [[meta tag]] contains keywords which are also included in the index. Earlier Internet [[search engine technology]] would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was [[computer hardware]] able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.<ref>Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.</ref>
As the Internet grew through the 1990s, many [[brick and mortar business|brick-and-mortar corporations]] went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to [[spamdexing]], which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the [[customer lifetime value]] equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.{{cn|date=August 2025}}
In [[desktop search]], many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full text index.{{cn|date=August 2025}}
==See also==
* [[Controlled vocabulary]]
* [[Database index]]
* [[Full-text search]]
* [[Information extraction]]
* [[Key Word in Context]]
* [[Selection-based search]]