Discover by Association
Cross link analysis is very powerful in data analysis. Many useful information is uncovered by finding and building a link or an association between the entities.
To perform data analysis through association discovery method, the user generally need to perform two steps.
The first step is to build the association that the user is interested in with the general syntax looks like following:
CREATE ASSOCIATION association_name (attribute) BY hint_attribute FROM source_table_name
[WHERE clause]
The attribute within the association_name specifies the link (entities within the table) that the user wants to create the association with. The hint_attribute is the basis of the link from which the total count of each association is generated. Therefore, this association contains information on total count matrix of each attribute pair and exactly which hint attribute contribute to the corresponding attribute pair.
After the association is built, the user can query data from the association he just created with GET statement. The general syntax for such query is following:
GET ACTION(entity) FROM association_name
ACTION is the specific information that the user wants to get from the association. The available ACTION commands in the GET statement are FREQ (frequency), PROB(probability) and FACT. entity within ACTION specifies the items to be searched in the association.
Finally, the user can store the search result into a table. The general syntax for this is:
CREATE TABLE table_name AS (GET statement)
Let's apply the association scheme to ABC company's business. As holidays season approaches, the staff of ABC company would like to give out special bundle deal to its members. The staff will be interested to know which products bundling will be more effective. He can first build the association between two products based on customers.
CREATE ASSOCIATION prod2prod (Product.name) BY Customer.id FROM sales
The staff of ABC company is able to check the full matrix of frequency of each product pair within prod2prod association with a GET FREQ() statement like this
GET FREQ() FROM prod2prod
Since GET FREQ() extracts all the data within the association, the information may be too much for the staff. To take a glance of the result, use GET with LIMIT clause:
GET FREQ() FROM prod2prod limit 10
+--------------+--------------------------------+------+
| from | to | freq |
+--------------+--------------------------------+------+
| Fanta orange | Soda | 1 |
| Fanta orange | Cheese Flavored Snacks Crunchy | 2 |
| Fanta orange | Snacks | 2 |
| Fanta orange | Water | 1 |
| Fanta orange | Cola | 2 |
| Fanta orange | Lemon-lime Soda Diet | 1 |
| Fanta orange | Energy Drink | 1 |
| Fanta orange | Soda Diet | 1 |
| Fanta orange | Lemon-lime Soda | 1 |
| Fanta orange | Cola Diet Wild Cherry | 1 |
+--------------+--------------------------------+------+
Alternatively, he may also specify the product name in his search within the association. If the staff of ABC company wants to know specifically which products bundle with Tropical Berry may have the best sales result, he may do following action with Get Statement.
GET FREQ('Tropical Berry') FROM prod2prod
+----------------+-----------------------------------------------------------------+------+
| from | to | freq |
+----------------+-----------------------------------------------------------------+------+
| Tropical Berry | Coca-Cola | 1 |
| Tropical Berry | Soda | 2 |
| Tropical Berry | Snacks | 1 |
| Tropical Berry | Cola | 1 |
| Tropical Berry | Lemon-lime Soda Diet | 1 |
| Tropical Berry | Soda Lemon-lime | 1 |
| Tropical Berry | Root Beer 3 lt,3.17 qt | 1 |
| Tropical Berry | Root Beer Diet | 1 |
| Tropical Berry | Root Beer Diet 2.1 qt,2 lt | 1 |
| ... | ... | ... |
The staff will get each total count of Tropical Berry bundle with the other products from above statement.
Since above query may return a long list of information, the staff may further organize above query with specific quantifier or instruction such as following:
statement | description |
---|---|
GET TOP ALL FREQ('Tropical Berry') FROM prod2prod |
Return all results and list from top |
GET 10 FREQ('Tropical Berry') FROM prod2prod |
Return 10 results |
GET TOP 10 FREQ('Tropical Berry') FROM prod2prod |
Return top 10 results |
GET 10% FREQ('Tropical Berry') FROM prod2prod |
Return 10% of the results |
GET TOP 10% FREQ('Tropical Berry') FROM prod2prod |
Return first 10% of the results |
GET BOTTOM 10 FREQ('Tropical Berry') FROM prod2prod |
Return bottom 10 results |
GET BOTTOM 10% FREQ('Tropical Berry') FROM prod2prod |
Return bottom 10% of the results |
Below is the outcome of GET TOP 10 FREQ('Tropical Berry') FROM prod2prod
+----------------+----------------------------------------------+------+
| from | to | freq |
+----------------+----------------------------------------------+------+
| Tropical Berry | Tortilla Chips | 4 |
| Tropical Berry | Potato Crisps | 3 |
| Tropical Berry | Barbecue Flavor Potato Chips | 2 |
| Tropical Berry | Cheese Flavored Snacks | 2 |
| Tropical Berry | Code Red Diet Soda Fridge Mate | 2 |
| Tropical Berry | Cookies | 2 |
| Tropical Berry | Diet Voltage Soda | 2 |
| Tropical Berry | Fit Water Beverage Natural Black Cherry 6 ea | 2 |
| Tropical Berry | G2 Grape | 2 |
| Tropical Berry | Instant Oatmeal | 2 |
+----------------+----------------------------------------------+------+
Furthermore, the staff of ABC company can also apply a filter condition to narrow down the Tropical Berry bundle pair.
For example, he can filter to get only those frequency above 2 with this command:
GET FREQ('Tropical Berry') FROM prod2prod WHERE freq > 2
+----------------+----------------+------+
| from | to | freq |
+----------------+----------------+------+
| Tropical Berry | Tortilla Chips | 4 |
| Tropical Berry | Potato Crisps | 3 |
+----------------+----------------+------+
Alternatively, if the staff of ABC company wants to propose a special bundle deal to product managers of Tropical Berry and Potato Crisps, he can pull out the past record of frequency of buying both Tropical Berry and Potato Chips by issuing a joining command which has the similar effect as "A" AND "B". Below is the example:
GET FREQ('Tropical Berry' to 'Potato Crisps') FROM prod2prod
The result is shown below:
+----------------+---------------+------+
| from | to | freq |
+----------------+---------------+------+
| Tropical Berry | Potato Crisps | 3 |
+----------------+---------------+------+
Finally, if ABC company staff is curious about whether Tropical Berry is a good choice for the bundle option in the first place. He can search for the past total frequency of any products with Tropical Berry using following command:
GET TOTAL FREQ('Tropical Berry') FROM prod2prod
The result is the summation of all products associate with Tropical Berry like this:
+----------------+------------+
| from | total_freq |
+----------------+------------+
| Tropical Berry | 227 |
+----------------+------------+
Instead of actual frequency, the staff of ABC company may want to know the probability distribution of each product association pair. In this case, he may use GET PROB statement with syntax and rules that are the same as GET FREQ above. For example if he is curious about which product maybe better bundling with Fanta Orange probability wise, he can issue following command:
GET TOP 10 PROB('Fanta orange') FROM prod2prod
He will get a table that looks like below:
+--------------+---------------------------------+----------+
| from | to | prob |
+--------------+---------------------------------+----------+
| Fanta orange | Potato Chips | 0.241379 |
| Fanta orange | Granola Bars | 0.137931 |
| Fanta orange | AQUARIUS LEMON | 0.103448 |
| Fanta orange | Cereal | 0.103448 |
| Fanta orange | Flavored Potato Chips | 0.103448 |
| Fanta orange | Potato Crisps | 0.103448 |
| Fanta orange | Barley 100% Natural Whole Grain | 0.068966 |
| Fanta orange | Bruce Lee Playing Cards | 0.068966 |
| Fanta orange | Cheese Flavored Snacks | 0.068966 |
| Fanta orange | Cheese Flavored Snacks Crunchy | 0.068966 |
+--------------+---------------------------------+----------+
GET PROB ('A' to 'B') FROM association and GET PROB ('B' to 'A') FROM association may get two different values because the base (distinct count) used in the probability calculation (A vs B) may be different.
If after the promotion period, ABC company staff wants to draw a prize for people who specifically buying the special bundle of Tropical Berry and Potato Crisps, he may do a quick search on which customers actually buying such bundle with following command:
GET FACT('Tropical Berry' to 'Potato Crisps') FROM prod2prod
The customers who bought such bundle will be shown.
+----------------+---------------+------+
| from | to | fact |
+----------------+---------------+------+
| Tropical Berry | Potato Crisps | 1644 |
| Tropical Berry | Potato Crisps | 1939 |
| Tropical Berry | Potato Crisps | 2228 |
+----------------+---------------+------+
During and after the analysis, ABC company staff can always store the association finding into another table for future use. Following is the example:
CREATE TABLE top10 AS (GET TOP 10 FREQ() FROM prod2prod)
A table named top10 will be generated as below:
+------------------------+------------------------+------+
| from | to | freq |
+------------------------+------------------------+------+
| Potato Chips | Tortilla Chips | 427 |
| Tortilla Chips | Potato Chips | 427 |
| Cheese Flavored Snacks | Potato Chips | 365 |
| Potato Chips | Cheese Flavored Snacks | 365 |
| Cheese Flavored Snacks | Tortilla Chips | 247 |
| Tortilla Chips | Cheese Flavored Snacks | 247 |
| Potato Chips | Potato Crisps | 207 |
| Potato Crisps | Potato Chips | 207 |
| Cereal | Potato Chips | 143 |
| Potato Chips | Cereal | 143 |
+------------------------+------------------------+------+