Search engine indexing: Difference between revisions

Tags: Reverted Visual edit
m Reverted edit by 39.34.187.197 (talk) to last version by FrescoBot
Line 1:
'''Search engine indexing''' is the collecting, [[parsing]], and storing of data to facilitate fast and accurate [[information retrieval]]. Index design incorporates interdisciplinary concepts from [[linguistics]], [[cognitive psychology]], mathematics, [[informatics]], and [[computer science]]. An alternate name for the process, in the context of [[search engine]]s designed to find [[Web page|web pages]] on the Internet, is ''[[web indexing]]''.
 
Popular search engines focus on the [[Full-text search|full-text]] indexing of online, [[Natural language processing|natural language]] documents.<ref>Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.</ref> [[Media type]]s such as pictures, video,<ref>{{cite journal |last=Sikos |first=L. F. |date=August 2016 |title=RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing |journal=Multimedia Tools and Applications |doi=10.1007/s11042-016-3705-7 |s2cid=254832794 |url=https://ap01.alma.exlibrisgroup.com/view/delivery/61USOUTHAUS_INST/12165436490001831 }}{{Dead link|date=August 2023 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> audio,<ref>http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf {{Bare URL PDF|date=March 2022}}</ref> and graphics<ref>Charles E. Jacobs, Adam Finkelstein, David H. Salesin. [http://grail.cs.washington.edu/projects/query/mrquery.pdf Fast Multiresolution Image Querying]. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006</ref> are also searchable.
Line 5:
[[Metasearch engine|Meta search engines]] reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the [[text corpus|corpus]]. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the time and processing costs involved, while [[Intelligent agent|agent]]-based search engines index in [[Real time business intelligence|real time]].
 
==Indexing==
The purpose of storing an index is to optimize speed and performance in finding [[relevance (information retrieval)|relevant]] documents for a search query. Without an index, the search engine would [[Lexical analysis|scan]] every document in the [[Text corpus|corpus]], which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional [[Computer Storage|computer storage]] required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.
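
The trade-off described above can be illustrated with a minimal Python sketch. It is not how any particular search engine is implemented; the sample documents and the function names <code>scan_search</code>, <code>build_index</code>, and <code>index_search</code> are assumptions made for illustration. The sequential scan re-reads every document for each query, while the index pays a one-time build cost and extra memory so that each query becomes a single dictionary lookup.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}

def scan_search(docs, word):
    """Sequential scan: tokenize every document again on every query."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_index(docs):
    """Build a word -> set of document ids mapping once, ahead of queries."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_search(index, word):
    """Answer a query with one dictionary lookup instead of a full scan."""
    return sorted(index.get(word, set()))

index = build_index(docs)
print(scan_search(docs, "cat"))    # [2]
print(index_search(index, "cat"))  # [2]
```

Both functions return the same results; only the cost per query differs. Adding a document to <code>docs</code> also requires updating <code>index</code>, which is the update cost mentioned above.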
 
===Index design factors===
Major factors in designing a search engine's architecture include:
 
Line 18:
;Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, [[partition (database)|partitioning]], and schemes such as [[hash function|hash-based]] or composite partitioning,<ref>[http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html Linear Hash Partitioning]. MySQL 5.1 Reference Manual. Verified Dec 2006</ref> as well as [[Replication (computer science)|replication]].
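
As a rough illustration of hash-based partitioning combined with replication, the following Python sketch assigns each document to a shard by hashing its identifier and places copies on neighbouring shards. The function names, the use of SHA-256, and the "next shard" replica placement are assumptions chosen for the example, not a description of any specific system.

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard using a stable hash.

    hashlib is used instead of the built-in hash(), whose value for
    strings changes between Python processes.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def replica_shards(doc_id: str, num_shards: int, copies: int = 2) -> list:
    """Return the primary shard followed by the shards holding replicas.

    Assumes copies <= num_shards, so every copy lands on a distinct shard.
    """
    primary = shard_for(doc_id, num_shards)
    return [(primary + i) % num_shards for i in range(copies)]

def partition(doc_ids, num_shards):
    """Group document ids by their primary shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

doc_ids = [f"doc-{n}" for n in range(10)]  # hypothetical identifiers
shards = partition(doc_ids, 3)
for number, members in enumerate(shards):
    print(number, members)
```

If one shard is lost or corrupted, the documents it held can still be served from the replica shard, and the damage is isolated to one partition rather than the whole index.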
 
===Index data structures===
Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors.
 
Line 38:
;[[Document-term matrix]]: Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional [[sparse matrix]].
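
A document-term matrix can be sketched in a few lines of Python. The example below is only an illustration under simple assumptions: whitespace tokenization, lower-casing, and a dictionary keyed by (row, column) as the sparse representation, so that zero entries are never stored. The function name <code>build_document_term_matrix</code> and the sample documents are hypothetical.

```python
from collections import Counter

def build_document_term_matrix(docs):
    """Return (vocabulary, matrix) for a list of document strings.

    Rows are documents, columns are vocabulary words. The matrix is a
    dict mapping (row, column) -> count and holds only non-zero cells.
    """
    vocabulary = sorted({word for text in docs for word in text.lower().split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = {}
    for i, text in enumerate(docs):
        for word, count in Counter(text.lower().split()).items():
            matrix[(i, column[word])] = count
    return vocabulary, matrix

docs = ["the cow says moo", "the cat and the hat"]
vocabulary, matrix = build_document_term_matrix(docs)

the_column = vocabulary.index("the")
print(matrix[(1, the_column)])                     # 2
print(len(matrix), len(docs) * len(vocabulary))    # 8 14
```

Only 8 of the 14 possible cells are stored here; on a real corpus, where most words are absent from most documents, the saving from a sparse layout is far larger.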
 
===Challenges in parallelism===
A major challenge in the design of search engines is the management of concurrent computing processes. There are many opportunities for [[race conditions]] and coherence faults. For example, when a new document is added to the corpus, the index must be updated while it continues to respond to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a [[web crawler]] is the consumer of this information, grabbing the text and storing it in a cache (or [[Text corpus|corpus]]). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a '''producer-consumer model'''. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve [[distributed computing]], where the search engine consists of several machines operating in unison. This increases the possibilities for incoherence and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.<ref>Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.</ref>
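
The producer-consumer relationship described above can be sketched on a single machine with Python's standard <code>queue</code> and <code>threading</code> modules. This is a simplified illustration, not a distributed design: the names <code>crawler</code>, <code>indexer</code>, <code>search</code>, and the sample pages are assumptions, and one lock guards the whole index, which a real system would likely refine.

```python
import queue
import threading
from collections import defaultdict

index = defaultdict(set)        # word -> set of document ids
index_lock = threading.Lock()   # guards reads and writes of the index
crawl_queue = queue.Queue()     # hand-off between crawler and indexer
STOP = object()                 # sentinel telling the indexer to finish

def crawler(pages):
    """Producer: put fetched documents on the queue."""
    for doc_id, text in pages:
        crawl_queue.put((doc_id, text))
    crawl_queue.put(STOP)

def indexer():
    """Consumer: take documents off the queue and update the index."""
    while True:
        item = crawl_queue.get()
        if item is STOP:
            break
        doc_id, text = item
        with index_lock:  # a query never sees a half-indexed document
            for word in text.split():
                index[word].add(doc_id)

def search(word):
    """Query path: may run at any time, including during indexing."""
    with index_lock:
        return sorted(index.get(word, ()))

pages = [(1, "the cow says moo"), (2, "the cat and the hat")]
producer = threading.Thread(target=crawler, args=(pages,))
consumer = threading.Thread(target=indexer)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(search("the"))  # [1, 2] once both threads have finished
```

Without the lock, a query running while the indexer adds a document could observe some of that document's words but not others, which is the kind of race condition the section refers to.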
 
===Inverted indices===
{{Main|Inverted index}}