{{Short description|Content-based retrieval of XML documents}}
{{orphan}}
'''XML retrieval''', or '''XML information retrieval''', is the content-based retrieval of documents structured with [[XML]] (eXtensible Markup Language). As such it is used for computing [[Relevance (information retrieval)|relevance]] of XML documents.<ref>{{Cite web |last=Lalmas |first=Mounia |date=2009 |title=XML Information Retrieval |url=https://fi.wikipedia.org/wiki/XML-tiedonhaku#L%C3%A4hteet |publisher=Morgan & Claypool}}</ref>
 
== Queries ==
Most XML retrieval approaches are based on techniques from the [[information retrieval]] (IR) area, e.g. computing the similarity between a query consisting of keywords ([[query terms]]) and the document. However, in XML retrieval the query can also contain [[Data structure|structural]] [[Hint (SQL)|hints]]. So-called "content and structure" (CAS) queries enable users to specify what structure the requested content can or must have.
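
As an illustration only, a CAS query can be modelled as a set of keywords plus an optional structural hint. The following minimal sketch is not taken from any of the cited systems: the class <code>CASQuery</code>, its fields and the function <code>matching_elements</code> are hypothetical names, and keyword matching is reduced to a plain substring test. The <code>strict</code> flag distinguishes a structure the content ''must'' have from one it merely ''can'' have.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CASQuery:
    """Hypothetical content-and-structure query (illustrative only)."""
    keywords: List[str]
    target_tag: Optional[str] = None  # structural hint, e.g. "sec"
    strict: bool = False              # True: hint must hold; False: hint is optional


def element_text(elem):
    # All text below the element, lower-cased, whitespace-normalised.
    return " ".join(" ".join(elem.itertext()).split()).lower()


def matching_elements(query, doc_xml):
    """Return the tags of all elements that satisfy the query."""
    root = ET.fromstring(doc_xml)
    hits = []
    for elem in root.iter():
        text = element_text(elem)
        # Content condition: every keyword occurs in the element's text.
        if not all(k.lower() in text for k in query.keywords):
            continue
        # Structure condition: only enforced for strict queries.
        if query.strict and query.target_tag and elem.tag != query.target_tag:
            continue
        hits.append(elem.tag)
    return hits


doc = ("<article><title>XML retrieval</title>"
       "<sec><p>ranking of XML elements</p></sec></article>")
strict_hits = matching_elements(CASQuery(["ranking"], "sec", strict=True), doc)
vague_hits = matching_elements(CASQuery(["ranking"], "sec", strict=False), doc)
```

With the strict interpretation only the <code>sec</code> element is returned; with the vague interpretation every element containing the keyword qualifies, and the hint would only influence ranking.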
 
== Exploiting XML structure ==
Taking advantage of the [[Self-documenting|self-describing]] structure of XML documents can significantly improve searching them. This includes the use of CAS queries, assigning different weights to different XML elements, and the focused retrieval of subdocuments.
 
== Ranking ==
Ranking in XML retrieval can incorporate both content relevance and structural similarity, which is the resemblance between the structure given in the query and the structure of the document. Also, the retrieval units resulting from an XML query are not necessarily entire documents, but can be arbitrarily deeply nested XML elements, i.e. dynamic documents. The aim is to find the smallest retrieval unit that is highly relevant. Relevance can be defined according to the notion of specificity, which is the extent to which a retrieval unit focuses on the topic of the request.<ref name="INEX2006">{{Cite web|url=http://www.cs.otago.ac.nz/homepages/andrew/2006-10.pdf |title=Overview of INEX 2006 |last=Malik |first=Saadia |author2=Trotman, Andrew |author3=Lalmas, Mounia |author4=Fuhr, Norbert |year=2007 |work=Proceedings of the Fifth Workshop of the INitiative for the Evaluation of XML Retrieval |access-date=2009-02-10 |archive-url=https://web.archive.org/web/20081016101202/http://www.cs.otago.ac.nz/homepages/andrew/2006-10.pdf |archive-date=October 16, 2008 }}</ref>
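
The interplay of these factors can be sketched in a few lines. The scoring below is an assumption made for illustration, not the measure used by INEX or by any cited system: specificity is approximated as the fraction of an element's tokens that are query terms, structural similarity is a simple 0/1 match against a hinted tag, and the two are mixed linearly with a hypothetical weight <code>alpha</code>.

```python
import xml.etree.ElementTree as ET


def tokens(elem):
    # Lower-cased tokens of all text contained in the element.
    return " ".join(elem.itertext()).lower().split()


def specificity(elem, terms):
    # Share of the element's tokens that are query terms (assumed proxy).
    toks = tokens(elem)
    if not toks:
        return 0.0
    return sum(1 for t in toks if t in terms) / len(toks)


def structural_similarity(elem, hint_tag):
    # Crude 0/1 stand-in: does the element's tag equal the hinted tag?
    return 1.0 if elem.tag == hint_tag else 0.0


def rank_elements(doc_xml, query_terms, hint_tag, alpha=0.7):
    """Score every element; alpha weights content against structure."""
    root = ET.fromstring(doc_xml)
    terms = {t.lower() for t in query_terms}
    scored = []
    for elem in root.iter():
        score = (alpha * specificity(elem, terms)
                 + (1 - alpha) * structural_similarity(elem, hint_tag))
        scored.append((score, elem.tag))
    # Highest score first; the sort is stable for equal scores.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


doc = ("<article><title>History</title><sec><p>xml ranking</p>"
       "<p>unrelated text here now</p></sec></article>")
ranked = rank_elements(doc, ["xml", "ranking"], hint_tag="p")
```

In this example the first paragraph outranks its enclosing section and the whole article, because it is the smallest unit that is entirely about the query terms.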
 
== Existing XML search engines ==
An overview of two potential approaches is available.<ref>{{Cite journal|url=http://www.sigmod.org/record/issues/0612/p16-article-yahia.pdf|title=XML Search: Languages, INEX and Scoring|last=Amer-Yahia|first=Sihem|author2=Lalmas, Mounia |year=2006|journal=SIGMOD Rec. |volume=35 |issue=4|access-date=2009-02-10|doi=10.1145/1228268.1228271|s2cid=17300151}} {{Dead link|date=October 2010|bot=H3llBot}}</ref><ref>{{Cite CiteSeerX |citeseerx = 10.1.1.109.5986|title=XML Retrieval: A Survey|last=Pal|first=Sukomal|date=June 30, 2006}}</ref> The INitiative for the Evaluation of XML Retrieval (''INEX'') was founded in 2002 and provides a platform for evaluating such [[algorithm]]s.<ref name="INEX2006" /> Three different areas influence XML retrieval:<ref name="INEX2002">{{Cite web|url=http://www.is.informatik.uni-duisburg.de/bib/pdf/ir/Fuhr_etal:02a.pdf |title=INEX: Initiative for the Evaluation of XML Retrieval |last=Fuhr |first=Norbert |author2=Gövert, N. |author3=Kazai, Gabriella |author4=Lalmas, Mounia |year=2003 |work=Proceedings of the First INEX Workshop, Dagstuhl, Germany, 2002 |publisher=ERCIM Workshop Proceedings, France |access-date=2009-02-10 |archive-url=https://web.archive.org/web/20081121135758/http://www.is.informatik.uni-duisburg.de/bib/pdf/ir/Fuhr_etal:02a.pdf |archive-date=November 21, 2008 }}</ref>
 
===Traditional XML query languages===
[[Query language]]s such as the [[W3C]] standard [[XQuery]]<ref>{{Cite web|url=http://www.w3.org/TR/2007/REC-xquery-20070123/|title=XQuery 1.0: An XML Query Language|last=Boag|first=Scott|author2=Chamberlin, Don |author3=Fernández, Mary F. |author4=Florescu, Daniela |author5=Robie, Jonathan |author6= Siméon, Jérôme |date=23 January 2007|work=W3C Recommendation|publisher=World Wide Web Consortium|access-date=2009-02-10}}</ref> support complex queries, but only look for exact matches. Therefore, they need to be extended to allow for vague search with relevance computation. Most XML-centered approaches presuppose fairly exact knowledge of the documents' [[Database schema|schemas]].<ref name="Schlieder2002">{{Cite journal|url=http://www.cis.uni-muenchen.de/people/Meuss/Pub/JASIS02.ps.gz |title=Querying and Ranking XML Documents |last=Schlieder |first=Torsten |author2=Meuss, Holger |year=2002 |journal=Journal of the American Society for Information Science and Technology |volume=53 |issue=6 |pages=489–503 |access-date=2009-02-10 |archive-url=https://web.archive.org/web/20070610002349/http://www.cis.uni-muenchen.de/people/Meuss/Pub/JASIS02.ps.gz |archive-date=June 10, 2007 |doi=10.1002/asi.10060 |url-access=subscription }}</ref>
 
===Databases===
Classic [[database]] systems have added support for storing [[Semi-structured model|semi-structured data]],<ref name="INEX2002" /> which resulted in the development of [[XML database]]s. Often, such systems are very formal, concentrate more on searching than on ranking, and are used by experienced users able to formulate complex queries.
 
===Information retrieval===
Classic information retrieval models such as the [[vector space model]] provide relevance ranking, but do not include document structure; only flat queries are supported. Also, they apply a static document concept, so retrieval units usually are entire documents.<ref name="Schlieder2002"/> They can be extended to consider structural information and dynamic document retrieval. Some approaches extend the vector space model by using document [[subtree]]s (index terms plus structure) as dimensions of the vector space.<ref>{{Cite web|url=http://www.cobase.cs.ucla.edu/tech-docs/sliu/SIGIR04.pdf|title=Configurable Indexing and Ranking for XML Information Retrieval|last=Liu|first=Shaorong|author2=Zou, Qinghua |author3=Chu, Wesley W. |year=2004|work=SIGIR'04|publisher=ACM|access-date=2009-02-10}}</ref>
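
A much simplified sketch of this idea follows. It is not the indexing scheme of the cited work: instead of full subtrees it assumes that each dimension is a pair of element path and index term, uses raw term counts as weights, ignores tail text, and compares vectors by cosine similarity. The function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def structural_vector(xml_text):
    """Map (element path, term) pairs to term counts."""
    root = ET.fromstring(xml_text)
    vec = Counter()

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        # Only the element's own leading text is indexed (tail text ignored).
        for tok in (elem.text or "").lower().split():
            vec[(path, tok)] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return vec


def cosine(a, b):
    # Cosine similarity over the dimensions the two vectors share.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = structural_vector("<book><title>xml</title></book>")
doc_title = structural_vector("<book><title>xml retrieval</title></book>")
doc_author = structural_vector("<book><author>xml retrieval</author></book>")

sim_title = cosine(query, doc_title)
sim_author = cosine(query, doc_author)
```

Because the structure is part of each dimension, the query term <code>xml</code> inside a <code>title</code> matches the first document but not the second, where the same word occurs inside an <code>author</code> element. A flat vector space model would score both documents identically.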
 
== Data-centric XML datasets ==
For data-centric XML datasets, the keyword search method XDMA<ref>{{Cite journal|last1=Selvaganesan|first1=S.|last2=Haw|first2=Su-Cheng|last3=Soon|first3=Lay-Ki|title=XDMA: A Dual Indexing and Mutual Summation Based Keyword Search Algorithm for XML Databases|journal=International Journal of Software Engineering and Knowledge Engineering|language=en-US|volume=24|issue=4|pages=591–615|doi=10.1142/s0218194014500223|year=2014}}</ref> was designed for XML databases and is based on dual indexing and mutual summation.
 
==See also==
*[[Document retrieval]]
*[[Information retrieval applications]]
 
== References ==
{{Reflist}}
 
{{DEFAULTSORT:Xml-Retrieval}}
[[Category:XML]]
 
[[Category:Information retrieval genres]]