Apriori algorithm: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 13:41, 8 May 2023 edit DanCherek (talk \| contribs) Autopatrolled, Administrators 74,716 edits Restored revision 1144380472 by Aidan9382 (talk): Sockpuppet citespam Tags: Twinkle Undo ← Previous edit		Latest revision as of 14:04, 16 April 2025 edit undo Citation bot (talk \| contribs) Bots 5,863,276 edits Add: doi, pages. \| Use this bot. Report bugs. \| Suggested by Dominic3203 \| Linked from User:LinguisticMystic/cs/outline \| #UCB_webform_linked 93/2277
(16 intermediate revisions by 12 users not shown)
Line 4: == Overview == The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on [[database]]s containing transactions (for example, collections of items bought by customers, or details of a website frequentation or [[IP address]]es<ref>{{usurped\|1=[https://web.archive.org/web/20210822191810/https://deductive.com/blogs/data-science-ip-matching/ The data science behind IP address matching]}} Published by deductive.com, September 6, 2018, retrieved September 7, 2018</ref>). Other algorithms are designed for finding association rules in data having no transactions ([[Winepi]] and Minepi), or having no timestamps ([[DNA sequencing]]). Each transaction is seen as a set of items (an ''itemset''). Given a threshold <math>C</math>, the Apriori algorithm identifies the item sets which are subsets of at least <math>C</math> transactions in the database. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as ''candidate generation''), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Line 13: '''Apriori'''(T, ε) L<sub>1</sub> ← {large ~~1 -~~singleton itemsets} k ← 2 '''while''' L<sub>k−1</sub> '''is not''' empty C<sub>k</sub> ← ~~Apriori_gen~~Generate_candidates(L<sub>k−1</sub>, k) '''for''' transactions t '''in''' T D<sub>t</sub> ← {c in C<sub>k</sub> : c ⊆ t} Line 25: k ← k + 1 '''return''' Union(L<sub>k</sub>) '''over all''' k '''~~Apriori_gen~~Generate_candidates'''(L, k) result ← ~~list~~empty_set() '''for all''' p ∈ L, q ∈ L '''where''' p~~<sub>1</sub>~~ =and q~~<sub>1</sub>,~~ ~~p<sub>2</sub>~~differ =in ~~q<sub>2</sub>,~~exactly ~~...,~~one ~~p<sub>k-2</sub> = q<sub>k-2</sub> and p<sub>k-1</sub> < q<sub>k-1</sub>~~element c =← p ∪ {q~~<sub>k-1</sub>}~~ '''if''' u ∈ L '''for all''' u ⊆ c '''where''' \|u\| = k-1 result.add(c) '''return''' result == Examples == Line 43: {\| class="wikitable" \|- \| α \|\| β \|\| ~~theta~~ε▼ \|- \| α \|\| β \|\| ~~epsilon~~θ \|- \| α \|\| β \|\| ~~theta~~ε \|- \| &alpha ; \|\| β \|\| ~~epsilon~~θ \|- ▲\| alpha\|\| beta \|\| theta \|} The association rules that can be determined from this database are the following: # 100% of sets with α also contain β # 50% of sets with α, β also have ε # 50% of sets with α, β also have θ we can also illustrate this through a variety of examples. Line 142 ⟶ 141: Also, both the time and space complexity of this algorithm are very high: <math>O\left(2^{\|D\|}\right)</math>, thus exponential, where <math>\|D\|</math> is the horizontal width (the total number of items) present in the database. Later algorithms such as [[Max-Miner]]<ref>{{cite journal\|author=Bayardo Jr, Roberto J.\|title=Efficiently mining long patterns from databases\|journal=ACM SIGMOD Record \|volume=27\|issue=2\|year=1998\|pages=85–93 \|doi=10.1145/276305.276313 \|url=http://www.cs.sfu.ca/CourseCentral/741/jpei/readings/baya98.pdf}}</ref> try to identify the maximal frequent item sets without enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-up approach. == References ==