Association Rule Mining, the Apriori Principle, and Brute Force

Gaurav Kumar
4 min read · Apr 20, 2023



Association rule mining is a data mining technique that identifies patterns and relationships between items in a dataset. It finds correlations between items based on how often they co-occur in the same transaction or event. The technique is used in many areas, such as market basket analysis, recommendation systems, and web mining.

The brute-force approach is a simple and straightforward way to find all possible combinations of items in a dataset. It generates every possible itemset and then calculates support and confidence values to identify frequent itemsets and association rules. However, brute force is not feasible for large datasets: with n distinct items there are 2^n - 1 non-empty itemsets to check, so the cost grows exponentially with the number of items.
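To make the exponential blow-up concrete, here is a minimal sketch of the brute-force approach. It is illustrative, not an efficient implementation; the function and variable names are our own, and transactions are assumed to be given as Python sets of items:

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, min_support):
    """Enumerate every possible itemset and keep those meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    # With n distinct items there are 2^n - 1 non-empty itemsets to check.
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            candidate = frozenset(candidate)
            # Support = number of transactions containing the whole itemset.
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_support:
                frequent[candidate] = count
    return frequent
```

Even for a modest 30-item catalog this loop would examine over a billion candidate itemsets, which is exactly the cost Apriori is designed to avoid.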

The Apriori algorithm is a popular method for association rule mining that is efficient and scalable for large datasets. It is based on the Apriori principle: any subset of a frequent itemset must itself be frequent (equivalently, if an itemset is infrequent, every superset of it is infrequent too). The algorithm works level by level, generating frequent itemsets and association rules based on their support and confidence values. It starts by finding the frequent 1-itemsets, uses them to generate candidate 2-itemsets, uses the frequent 2-itemsets to generate candidate 3-itemsets, and so on. The process continues until no more frequent itemsets can be generated.

The Apriori algorithm is efficient because it avoids generating all possible itemsets and only extends the frequent ones, which greatly reduces the computational overhead. It is widely used in market basket analysis and recommendation systems to identify associations between products that co-occur in transactions.
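The Apriori principle translates directly into a pruning test: a candidate k-itemset can be discarded before any counting if one of its (k-1)-item subsets is not frequent. A minimal sketch of that test (names are illustrative):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori pruning: if any (k-1)-subset of a k-item candidate is not a
    frequent (k-1)-itemset, the candidate itself cannot be frequent."""
    return any(
        frozenset(subset) not in frequent_prev
        for subset in combinations(candidate, len(candidate) - 1)
    )
```

Candidates that fail this test are dropped without ever scanning the dataset, which is where most of the savings over brute force come from.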

Let’s look at the steps of the Apriori algorithm:

  1. Set a minimum support threshold: The algorithm starts with a user-defined minimum support threshold, which determines how frequently an itemset must occur to be considered frequent. Any itemset that appears less often than the threshold is infrequent and can be ignored.
  2. Generate frequent 1-itemsets: The algorithm scans the dataset to count the occurrences of each item and keeps the 1-itemsets that meet the minimum support threshold.
  3. Generate candidate 2-itemsets: The frequent 1-itemsets are combined into candidate 2-itemsets, i.e. all pairs of items that are each frequent on their own.
  4. Count the occurrences of candidate 2-itemsets: The algorithm scans the dataset again to count each candidate 2-itemset and keeps the ones that meet the minimum support threshold as frequent 2-itemsets.
  5. Generate candidate k-itemsets: The frequent (k-1)-itemsets are joined to form candidate k-itemsets, and any candidate with an infrequent (k-1)-subset is pruned immediately by the Apriori principle.
  6. Count the occurrences of candidate k-itemsets: The algorithm scans the dataset again to count each remaining candidate k-itemset and keeps the ones that meet the minimum support threshold as frequent k-itemsets.
  7. Generate association rules: Once all frequent itemsets have been found, the algorithm generates association rules from them by applying a confidence threshold. A rule is considered significant if it meets both the support and confidence thresholds. A compact sketch of these steps appears right after this list.
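Putting the seven steps together, here is a minimal sketch of the level-wise loop in Python. It is illustrative rather than optimized (a real implementation would use smarter candidate counting); the function and variable names are our own, and transactions are assumed to be sets of items:

```python
from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Step 2: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    frequent = {}
    for i in items:
        s = support(frozenset([i]))
        if s >= min_support:
            frequent[frozenset([i])] = s
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Steps 3/5: join frequent (k-1)-itemsets into k-item candidates and
        # prune any candidate with an infrequent (k-1)-subset (Apriori principle).
        candidates = set()
        for a in frequent:
            for b in frequent:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in frequent
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Steps 4/6: scan the data to count candidates; keep the frequent ones.
        frequent = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                frequent[c] = s
        all_frequent.update(frequent)
        k += 1

    # Step 7: rules X => Y with confidence = support(X and Y) / support(X).
    rules = []
    for itemset, count in all_frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = count / all_frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return all_frequent, rules
```

With the grocery dataset from the example below, apriori(transactions, min_support=2, min_confidence=0.5) reproduces the frequent itemsets and the eight rules derived by hand.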

Let’s say we have a dataset of customer transactions at a grocery store, and we want to use the Apriori algorithm to find frequent itemsets and association rules. The dataset contains the following transactions:

Transaction 1: Bread, Milk, Cheese

Transaction 2: Bread, Milk, Sugar

Transaction 3: Bread, Cheese, Sugar

Transaction 4: Bread, Milk

Transaction 5: Milk, Sugar

Step 1: Set a minimum support threshold

Let’s set the minimum support threshold to 2, which means an itemset must appear in at least 2 transactions to be considered frequent.

Step 2: Generate frequent 1-itemsets

We scan the dataset to count the occurrences of each item:

Bread: 4, Milk: 4, Cheese: 2, Sugar: 3

All four items appear in at least 2 transactions, so Bread, Milk, Cheese, and Sugar are all frequent 1-itemsets.
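These counts are easy to verify with a quick scan in Python (the transactions are hard-coded here for illustration):

```python
from collections import Counter

transactions = [
    {"Bread", "Milk", "Cheese"},
    {"Bread", "Milk", "Sugar"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Milk"},
    {"Milk", "Sugar"},
]

# Each item is counted once per transaction it appears in.
item_counts = Counter(item for t in transactions for item in t)
print(item_counts)  # e.g. Counter({'Bread': 4, 'Milk': 4, 'Sugar': 3, 'Cheese': 2})
```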

Step 3: Generate candidate 2-itemsets

We use the frequent 1-itemsets to generate candidate 2-itemsets:

{Bread, Milk}, {Bread, Cheese}, {Bread, Sugar}, {Milk, Cheese}, {Milk, Sugar}, {Cheese, Sugar}

Step 4: Count the occurrences of candidate 2-itemsets

We scan the dataset again to count the occurrences of each candidate 2-itemset:

{Bread, Milk}: 3, {Bread, Cheese}: 2, {Bread, Sugar}: 2, {Milk, Cheese}: 1, {Milk, Sugar}: 2, {Cheese, Sugar}: 1

{Milk, Cheese} and {Cheese, Sugar} appear in only one transaction each, so they are not frequent. The frequent 2-itemsets are {Bread, Milk}, {Bread, Cheese}, {Bread, Sugar}, and {Milk, Sugar}.
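The pair counts can be checked the same way as the item counts (illustrative code, with the candidate items hard-coded):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk", "Cheese"},
    {"Bread", "Milk", "Sugar"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Milk"},
    {"Milk", "Sugar"},
]

# For every candidate pair, count the transactions containing both items.
for pair in combinations(["Bread", "Milk", "Cheese", "Sugar"], 2):
    count = sum(1 for t in transactions if set(pair) <= t)
    print(pair, count)
```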

Step 5: Generate candidate 3-itemsets

Joining the frequent 2-itemsets gives the candidates {Bread, Milk, Cheese}, {Bread, Milk, Sugar}, and {Bread, Cheese, Sugar}. By the Apriori principle, {Bread, Milk, Cheese} and {Bread, Cheese, Sugar} can be pruned without any counting, because their subsets {Milk, Cheese} and {Cheese, Sugar} are not frequent. Only one candidate remains:

{Bread, Milk, Sugar}

Step 6: Count the occurrences of candidate 3-itemsets

We scan the dataset again to count the candidate 3-itemset:

{Bread, Milk, Sugar}: 1

Only Transaction 2 contains all three items, so {Bread, Milk, Sugar} falls below the support threshold and there are no frequent 3-itemsets. No larger candidates can be formed, so the itemset-generation phase stops here.

Step 7: Generate association rules

We generate association rules from the frequent itemsets by applying a confidence threshold. Let’s set the confidence threshold to 0.5, which means a rule must have at least 50% confidence to be considered significant. Since there are no frequent 3-itemsets, the rules come from the frequent 2-itemsets. For a rule X => Y, the confidence is the support of X and Y together divided by the support of X:

{Bread} => {Milk} (confidence: 3/4 = 0.75)
{Milk} => {Bread} (confidence: 3/4 = 0.75)
{Cheese} => {Bread} (confidence: 2/2 = 1.0)
{Bread} => {Cheese} (confidence: 2/4 = 0.5)
{Sugar} => {Bread} (confidence: 2/3 ≈ 0.67)
{Bread} => {Sugar} (confidence: 2/4 = 0.5)
{Sugar} => {Milk} (confidence: 2/3 ≈ 0.67)
{Milk} => {Sugar} (confidence: 2/4 = 0.5)

All eight rules meet the 0.5 confidence threshold.

These rules indicate, for example, that every customer who bought Cheese also bought Bread, that two out of three customers who bought Sugar also bought Milk, and that customers who buy Bread are likely to also buy Milk.
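The whole example can also be checked against an off-the-shelf implementation. The sketch below assumes the mlxtend library (pip install mlxtend; the exact API may vary slightly between versions), whose apriori function expresses support as a fraction of transactions, so our threshold of 2 out of 5 becomes 0.4:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk", "Cheese"],
    ["Bread", "Milk", "Sugar"],
    ["Bread", "Cheese", "Sugar"],
    ["Bread", "Milk"],
    ["Milk", "Sugar"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Support as a fraction: 2 of 5 transactions = 0.4.
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(frequent)
print(rules[["antecedents", "consequents", "confidence"]])
```

The output should list the same frequent itemsets and the eight rules derived by hand above.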
