Tree Programming

BigObject allows users to apply their functions in FIND statement. Concept wise, BigObject will first construct a tree object based on the FIND description, apply user's function, and then place those nodes evaluated to TRUE in the result table. The whole process is illustrated as in the following figure.

As an example, following FIND statement can be used to find the number of distinct customers for each product brand:

FIND dcount() Product.brand IN Customer.id FROM sales

How the statement works is depicted below. A hierarchical tree object is created such that all transactions which belong to the same brand, and then to the same customers are grouped together. Function dcount() is evaluated for each node at the Product.brand level. Finally, the number of distinct customers for each brand is reported in the result table.

The fucntion dcount() used in the above FIND statement is as follows:

function dcount(n)
  dc = n:count()
  n:setValue(0 , dc)
  return true
end

The argument n denotes the tree node that is currently evaluated. Function dcount() applies the built-in function count() to obtain the number of children (i.e., this distinct number of customers). And then it uses assign() function to place the distinct value in the first data field (denoted as 0) of node n. Note that function dcount() always returns TRUE, which means all nodes at level Product.brand will be reported in the result table. An example result table is listed below:

Product.brand	_VALUE
Fanta	1186
Coca-Cola	5327
Pepsi	8397
...	...

The distinct number of customers for each product brand is listed in the column _VALUE. It is the first data field for each tree node and created by FIND implicitly. Its value will be included in the result table by default.

Note

The root of sub-tree that is currently being evaluated is always specified as the first argument.

Now assume that a user only wants to report those product brands having more distinct customers than a threshold. We can modify the above dcount() function as follows:

function dcount(n, ts)
  dc = n:count()
  tr = tonumber(ts)
  n:setValue(0 , dc)
  if (dc > tr) then 
    return true 
  else 
    return false 
  end 
end

The threshold value is passed via argument ts. It can be easily seen that only those nodes with more children than the threshold will return TRUE. Note that all arguments besides the tree node n are passed in string type. Therefore, one has to use tonumber() to convert it into a number first.

Thus, the FIND statement to obtain all product brands with more than 1500 distinct customers is straightforward:

FIND dcount(1500) Product.brand IN Customer.id FROM sales

Depth-first-traveral

Function children() returns an iterator over the children of the given node. It is easy to perform a depth-first-traversal of a given node as shown below:

function dfs(n)
  for c in n:children() do
    dfs(c)
  end
  return true
end

A Concrete Example: 80-20 Rule (Pareto Principle)

This example demonstrates how to use in-place programming for FIND to implement a general function "Pareto principle (the 80–20 rule) " as seen in our find by pattern example.

Assume the following FIND statement is used:

FIND pareto(80,20) Product.brand IN Customer.gender BY SUM(total_price) FROM sales

First, the above statement creates a tree object with a hierarchy from Product.brand to Customer.gender. Furthermore, each node includes two data fields: _VALUE and SUM(total_price). The _VALUE field is used to store the computed result. The tree creation is equal to the following CREATE TREE statement:

CREATE TREE temp(_VALUE DOUBLE, SUM(total_price)) FROM sales GROUP BY Product.brand, Customer.gender

The function "pareto()" used is listed as below:

function pareto(a, threshold, percent)
  sv = a:getChildrenValue(1);
  value = nBinPercent(sv, percent)
  a:setValue(0, value)
  if value > (threshold / 100) then
    return true
  end
  return false
end

Line 1: The first argument a is the node to be evaluated. The second argument threshold denotes the threshold (80 in the above example). And the third argument percent denotes the percentage (20 in this case).
Line 2 : The getChildrenValue() function is used to retrieve the date field indexed at 1 of all children, which is data field SUM(total_price).
Line 3 : The function nBinPercent() is used to compute the accumulated percentage of the first n% bin.
Line 4 : It assigns the result to the _VALUE field.
Line 5 ~ 8 : It returns TRUE when the result is greater than the threshold.

Function nBinPercent() is defined as follows:

function nBinPercent(data_array, npercent)
  npercent = npercent / 100
  sum = 0.0
  for i,v in ipairs(data_array) do
    sum = sum + v
  end
  check_size = npercent * #data_array 
  table.sort(data_array, function(a,b) return a>b end)
  temp_sum = 0.0
  for i = 1, math.floor(check_size) do
    temp_sum = temp_sum + data_array[i]
  end 
  return temp_sum / sum
end

Please refer to our Lua on GitHub for more examples.