Search engine indexing: Difference between revisions

Shamsaseo (talk | contribs)
m Added new information about search engine indexing including sources
rmv odd, poorly written addition to lead
Line 3:
 
Popular search engines focus on the [[Full-text search|full-text]] indexing of online, [[Natural language processing|natural language]] documents.<ref>Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.</ref> [[Media type]]s such as pictures, video,<ref>{{cite journal |last=Sikos |first=L. F. |date=August 2016 |title=RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing |journal=Multimedia Tools and Applications |doi=10.1007/s11042-016-3705-7 |s2cid=254832794 |url=https://ap01.alma.exlibrisgroup.com/view/delivery/61USOUTHAUS_INST/12165436490001831 }}{{Dead link|date=August 2023 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> audio,<ref>http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf {{Bare URL PDF|date=March 2022}}</ref> and graphics<ref>Charles E. Jacobs, Adam Finkelstein, David H. Salesin. [http://grail.cs.washington.edu/projects/query/mrquery.pdf Fast Multiresolution Image Querying]. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006</ref> are also searchable.
 
'''How do Google bots collect data from websites?'''
 
Google's [[Googlebot|bots]], also known as web crawlers, collect data from websites by crawling: they visit web pages, follow the links on them, and download the content of the pages they discover. The downloaded content is then parsed and stored in Google's index, which helps Google understand and organize the information on the web. Website owners can influence this process using <code>[[robots.txt]]</code> files, which tell crawlers which paths they may fetch, and sitemaps, which list the pages the owner wants discovered.
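
As a rough illustration of how a <code>robots.txt</code> file can restrict a crawler, the following sketch uses Python's standard <code>urllib.robotparser</code> module. The rules, the <code>example.com</code> URLs, the <code>ExampleBot</code> user agent, and the helper names are all hypothetical; this is not a description of how Googlebot is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (an assumption for
# illustration; a real crawler would download this file from the site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the rules permit this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/notes.html",
]
crawlable = allowed_urls(parser, "ExampleBot", candidates)
print(crawlable)
```

A well-behaved crawler would run a check like this before downloading each page, so that paths under <code>/private/</code> are never requested.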
 
[[Metasearch engine|Meta search engines]] reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the [[text corpus|corpus]]. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the time and processing costs involved, while [[Intelligent agent|agent]]-based search engines index in [[Real time business intelligence|real time]].
 
==Indexing==
The purpose of storing an index is to optimize speed and performance in finding [[relevance (information retrieval)|relevant]] documents for a search query. Without an index, the search engine would [[Lexical analysis|scan]] every document in the [[Text corpus|corpus]], which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional [[Computer Storage|computer storage]] required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.
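
The trade-off can be sketched with a small example. It assumes an inverted index (a mapping from each term to the set of documents containing it), which is one common index structure; the tiny corpus, the whitespace tokenizer, and the function names are illustrative assumptions only.

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Very simplified tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Pay the cost once: map each term to the IDs of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_index(index: dict[str, set[int]], term: str) -> set[int]:
    """Indexed lookup: a single dictionary access, independent of corpus size."""
    return set(index.get(term.lower(), set()))


def search_scan(docs: dict[int, str], term: str) -> set[int]:
    """Sequential scan: every document is re-read for every query."""
    term = term.lower()
    return {doc_id for doc_id, text in docs.items() if term in tokenize(text)}


# Hypothetical three-document corpus.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking",
}
index = build_inverted_index(docs)

# Both strategies return the same documents; only the work per query differs.
print(search_index(index, "quick"))
print(search_scan(docs, "quick"))
```

Building <code>index</code> takes extra storage and must be repeated or patched whenever a document changes, which is the update cost described above; in exchange, each query avoids rescanning the corpus.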
 
===Index design factors===