Apriori algorithm: Difference between revisions

Content deleted Content added
HyRotor (talk | contribs)
m Overview: Consistent assignment operator
Citation bot (talk | contribs)
Add: doi, pages. | Use this bot. Report bugs. | Suggested by Dominic3203 | Linked from User:LinguisticMystic/cs/outline | #UCB_webform_linked 93/2277
 
(9 intermediate revisions by 6 users not shown)
Line 4:
== Overview ==
 
The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on [[database]]s containing transactions (for example, collections of items bought by customers, or details of a website frequentation or [[IP address]]es<ref>[https://deductive.com/blogs/data-science-ip-matching/ The data science behind IP address matching] {{Webarchiveusurped|url1=[https://web.archive.org/web/20210822191810/https://deductive.com/blogs/data-science-ip-matching/ |date=2021-08-22The data science behind IP address matching]}} Published by deductive.com, September 6, 2018, retrieved September 7, 2018</ref>). Other algorithms are designed for finding association rules in data having no transactions ([[Winepi]] and Minepi), or having no timestamps ([[DNA sequencing]]). Each transaction is seen as a set of items (an ''itemset''). Given a threshold <math>C</math>, the Apriori algorithm identifies the item sets which are subsets of at least <math>C</math> transactions in the database.
 
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as ''candidate generation''), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Line 13:
 
'''Apriori'''(T, ε)
L<sub>1</sub> ← {large 1 -singleton itemsets}
k ← 2
'''while''' L<sub>k−1</sub> '''is not''' empty
C<sub>k</sub> ← Apriori_genGenerate_candidates(L<sub>k−1</sub>, k)
'''for''' transactions t '''in''' T
D<sub>t</sub> ← {c in C<sub>k</sub> : c ⊆ t}
Line 25:
k ← k + 1
'''return''' Union(L<sub>k</sub>) '''over all''' k
'''Apriori_genGenerate_candidates'''(L, k)
result ← listempty_set()
'''for all''' p ∈ L, q ∈ L '''where''' p<sub>1</sub> =and q<sub>1</sub>, p<sub>2</sub>differ =in q<sub>2</sub>,exactly ...,one p<sub>k-2</sub> = q<sub>k-2</sub> and p<sub>k-1</sub> < q<sub>k-1</sub>element
c ← p ∪ {q<sub>k-1</sub>}
'''if''' u ∈ L '''for all''' u ⊆ c '''where''' |u| = k-1
result.add(c)
Line 43:
{| class="wikitable"
|-
| &alpha; || &beta; || theta&epsilon;
|-
| &alpha; || &beta; || epsilon&theta;
|-
| &alpha; || &beta; || theta&epsilon;
|-
| &alpha ; || &beta; || epsilon&theta;
|-
| alpha|| beta || theta
|}
 
The association rules that can be determined from this database are the following:
# 100% of sets with &alpha; also contain &beta;
# 50% of sets with &alpha;, &beta; also have &epsilon;
# 50% of sets with &alpha;, &beta; also have &theta;
 
we can also illustrate this through a variety of examples.
Line 142 ⟶ 141:
Also, both the time and space complexity of this algorithm are very high: <math>O\left(2^{|D|}\right)</math>, thus exponential, where <math>|D|</math> is the horizontal width (the total number of items) present in the database.
 
Later algorithms such as [[Max-Miner]]<ref>{{cite journal|author=Bayardo Jr, Roberto J.|title=Efficiently mining long patterns from databases|journal=ACM SIGMOD Record |volume=27|issue=2|year=1998|pages=85–93 |doi=10.1145/276305.276313 |url=http://www.cs.sfu.ca/CourseCentral/741/jpei/readings/baya98.pdf}}</ref> try to identify the maximal frequent item sets without enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-up approach.
 
== References ==