===Section recognition===
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the [[Internet|web]], such as newsletters and corporate reports, contain extraneous content and side sections that do not contain primary material (that which the document is about). For example:
* Content in different sections is treated as related in the index, when in reality it is not
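
The problem in the bullet above can be illustrated with a minimal sketch. It assumes, purely for illustration, that sections are separated by blank lines; real section recognition relies on layout or markup analysis. The function names (<code>split_sections</code>, <code>tokenize</code>, <code>build_index</code>, <code>near</code>) are hypothetical and not taken from any particular search engine. Each token is stored with the identifier of its section, so a proximity check only matches terms that share a section.

```python
import re
from collections import defaultdict


def split_sections(text):
    """Split text on blank lines (an assumed stand-in for real section recognition)."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]


def tokenize(section):
    """Lowercase the section and extract runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", section.lower())


def build_index(text):
    """Map each token to a list of (section_id, position_within_section) pairs."""
    index = defaultdict(list)
    for section_id, section in enumerate(split_sections(text)):
        for position, token in enumerate(tokenize(section)):
            index[token].append((section_id, position))
    return dict(index)


def near(index, a, b, window=3):
    """Return True if a and b occur within `window` tokens of each other in the same section."""
    return any(
        sec_a == sec_b and abs(pos_a - pos_b) <= window
        for sec_a, pos_a in index.get(a, [])
        for sec_b, pos_b in index.get(b, [])
    )


document = "Search engines build an index.\n\nSubscribe to our newsletter today."
index = build_index(document)
print(near(index, "build", "index"))      # both terms are in the first section
print(near(index, "index", "subscribe"))  # adjacent in the raw text, but in different sections
```

Without the section identifier, "index" and "subscribe" would be adjacent tokens and would appear related, even though the second block is a side section unrelated to the main text.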