Revision as of 15:06, 15 September 2024 by 81.2.123.64 (edit summary: despam; tag: Undo)
Revision as of 23:41, 18 September 2024 by Naruyoko (edit summary: →Section recognition: Explicitly mention the Wikipedia website rather than "this article" for the layout; tag: 2017 wikitext editor)
===Section recognition===
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the [[Internet|web]], such as newsletters and corporate reports, contain erroneous content and side-sections that do not contain primary material (that which the document is about). For example, articles on the Wikipedia website display a side menu with links to other web pages.

Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:
* Content in different sections is treated as related in the index, when in reality it is not
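
The effect of section recognition on token order can be illustrated with a minimal sketch. It assumes, purely for illustration, that side-sections are marked with the HTML elements <code>nav</code>, <code>aside</code>, <code>header</code> and <code>footer</code>; real pages often do not follow this convention, and production search engines may use quite different heuristics. The names <code>SIDE_TAGS</code>, <code>TokenExtractor</code>, <code>extract_tokens</code> and <code>PAGE</code> are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: these elements hold side-sections rather than primary material.
SIDE_TAGS = {"nav", "aside", "header", "footer"}


class TokenExtractor(HTMLParser):
    """Collects lowercase word tokens in source order, optionally skipping side-sections."""

    def __init__(self, skip_side_sections=True):
        super().__init__()
        self.skip_side_sections = skip_side_sections
        self.side_depth = 0  # number of side-section elements currently open
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in SIDE_TAGS:
            self.side_depth += 1

    def handle_endtag(self, tag):
        if tag in SIDE_TAGS and self.side_depth > 0:
            self.side_depth -= 1

    def handle_data(self, data):
        # Drop text found inside a side-section when section recognition is enabled.
        if self.skip_side_sections and self.side_depth > 0:
            return
        self.tokens.extend(re.findall(r"\w+", data.lower()))


def extract_tokens(html, skip_side_sections=True):
    parser = TokenExtractor(skip_side_sections)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.tokens


# Hypothetical page: a side menu stored in the markup just before the primary text.
PAGE = """
<body>
<nav><a href="/donate">Donate</a> <a href="/random">Random article</a></nav>
<main><p>Search engines build an inverted index.</p></main>
</body>
"""

if __name__ == "__main__":
    print(extract_tokens(PAGE, skip_side_sections=False))
    print(extract_tokens(PAGE, skip_side_sections=True))
```

Without section recognition, the menu word "article" is stored immediately before "search", so a proximity-based index would treat the two as adjacent even though they are rendered in different parts of the screen. With the side-section skipped, only the tokens of the primary text reach the indexer.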

Search engine indexing: Difference between revisions