Revision as of 23:41, 18 September 2024 by Naruyoko: →Section recognition: Explicitly mention the Wikipedia website rather than "this article" for the layout
Revision as of 14:43, 22 September 2024 by Kvng: m grammar
Line 1:
{{Short description|Method for data management}}
'''Search engine indexing''' is the collecting, [[parsing]], and storing of data to facilitate fast and accurate [[information retrieval]]. Index design incorporates interdisciplinary concepts from [[linguistics]], [[cognitive psychology]], mathematics, [[informatics]], and [[computer science]]. An alternate name for the process, in the context of [[search engine]]s designed to find [[web page]]s on the Internet, is ''[[web indexing]]''.

Line 118 ⟶ 119:
; Language ambiguity: To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its [[language]] or [[lexical category]] ([[part of speech]]). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
; Diverse file formats: In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines ~~which~~that support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
; Faulty storage: The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. [[Binary data|Binary]] characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Line 125 ⟶ 126:
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes.
Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word referred to as a token. Such a program is commonly called a [[tokenizer]] or [[parser]] or [[Lexical analysis|lexer]]. Many search engines, as well as other natural language processing software, incorporate [[Comparison of parser generators|specialized programs]] for parsing, such as [[YACC]] or [[Lex programming tool|Lex]].
During tokenization, the parser identifies sequences of characters ~~which~~that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[Uniform Resource Locator|URL]]s. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.

===Language recognition===

Line 160 ⟶ 161:
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for [[spamdexing]]:
* Including hundreds or thousands of words in a section ~~which~~that is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden [[Span and div|"div" tag]] in [[HTML]], which may incorporate the use of [[CSS]] or [[JavaScript]] to do so).
* Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
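
As an illustration of the two spamdexing patterns listed above, the following is a minimal sketch of how an indexer might separate visible words from hidden ones. It is not taken from any particular search engine. The names `HiddenTextFinder`, `split_visible_hidden`, `parse_style`, and `is_hidden` are hypothetical. The sketch assumes that hiding is done through inline `style` attributes or the HTML `hidden` attribute, and that the markup closes its tags properly; text hidden through external stylesheets or JavaScript would need a full rendering step and is not covered.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_style(style):
    """Turn an inline style string into a {property: value} dict (lowercased)."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip().lower()
    return decls


def is_hidden(attrs):
    """Heuristic (an assumption of this sketch): hidden attribute,
    display:none, visibility:hidden, or foreground color equal to background."""
    if "hidden" in attrs:
        return True
    style = parse_style(attrs.get("style") or "")
    if style.get("display") == "none" or style.get("visibility") == "hidden":
        return True
    fg, bg = style.get("color"), style.get("background-color")
    return fg is not None and fg == bg


class HiddenTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._stack = []    # one flag per open element: is it inside hidden markup?
        self.visible = []   # words a reader would see
        self.hidden = []    # words only the indexer would see

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = self._stack[-1] if self._stack else False
        self._stack.append(parent_hidden or is_hidden(dict(attrs)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        target = self.hidden if (self._stack and self._stack[-1]) else self.visible
        target.extend(data.split())


def split_visible_hidden(html):
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.visible, finder.hidden


page = ('<div><p>Real article text</p>'
        '<div style="display: none">cheap pills casino</div>'
        '<span style="color:#fff; background-color:#fff">loans</span></div>')
visible, hidden = split_visible_hidden(page)
print(visible)
print(hidden)
```

An indexer built along these lines could index only the `visible` list, or treat a large `hidden` list as a signal that the page deserves closer inspection. Comparing color strings literally is a simplification: `#fff` and `white` denote the same color but would not match here.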
===Section recognition===
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the [[Internet|web]], such as newsletters and corporate reports, contain erroneous content and side-sections ~~which~~that do not contain primary material (that which the document is about). For example, articles on the Wikipedia website ~~displays~~display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:
* Content in different sections is treated as related in the index, when in reality it is not
* Organizational ''side bar'' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly.
Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the [https://worldwidenews.ru/2020/05/27/noscript-tag/ Noscript] tag to ensure that the web page is indexed properly. At the same time, this fact can also be [[spamdexing|exploited]] to cause the search engine indexer to 'see' different content than the viewer.
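
The word-proximity problem described in this section can be made concrete with a small sketch. It is a simplified illustration, not a description of how any real search engine performs section recognition. The function `index_tokens`, the class `SectionAwareTokenizer`, and the set `SIDE_SECTIONS` are hypothetical names, and the assumption that side material is wrapped in `nav`, `aside`, `header`, `footer`, `script`, or `style` elements will not hold for every page. No JavaScript is evaluated, so content rendered by scripts is simply not seen.

```python
from html.parser import HTMLParser
import re

# Assumption of this sketch: these elements hold navigation or other side material.
SIDE_SECTIONS = frozenset({"nav", "aside", "header", "footer", "script", "style"})
TOKEN = re.compile(r"\w+")


class SectionAwareTokenizer(HTMLParser):
    def __init__(self, skipped):
        super().__init__()
        self._skipped = skipped
        self._skip_depth = 0   # > 0 while inside a skipped section
        self.tokens = []       # (position, token) pairs in indexing order

    def handle_starttag(self, tag, attrs):
        if tag in self._skipped:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._skipped and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        for match in TOKEN.finditer(data):
            # Position is the token's ordinal among the tokens that are indexed.
            self.tokens.append((len(self.tokens), match.group().lower()))


def index_tokens(html, skip_sections=True):
    """Return (position, token) pairs; optionally ignore side sections."""
    skipped = SIDE_SECTIONS if skip_sections else frozenset()
    parser = SectionAwareTokenizer(skipped)
    parser.feed(html)
    parser.close()
    return parser.tokens


page = ('<nav>Main page Contents</nav>'
        '<article><p>Search engine indexing</p></article>'
        '<aside>Donate</aside><p>stores data</p>')
print(index_tokens(page, skip_sections=False))  # sidebar words sit next to article words
print(index_tokens(page))                       # only primary content receives positions
```

Without section recognition, the menu word "contents" receives the position directly before "search", so a proximity query would treat the two as related. With the side sections skipped, positions reflect only the primary text.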

Search engine indexing: Difference between revisions