Aggregate functions list

date functions

avgdtdiff function
avgdtdiff(date)

Returns the average difference between the date values of the group, in milliseconds. Input dates should be in ISO 8601 format.

Examples:

>>> table1('''
... '2007-01-01 00:04:37'
... '2007-01-01 00:04:39'
... '2007-01-01 00:04:40'
... '2007-01-01 00:04:49'
... ''')
>>> sql("select avgdtdiff(a) from table1")
avgdtdiff(a)
------------
3000.0
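The doctest above can be mirrored in plain Python. Note that the result (3000.0) equals the sum of the consecutive differences (12000 ms) divided by the number of values, not by the number of gaps; the sketch below is a hypothetical reconstruction of that behaviour, not the actual madIS implementation:

```python
from datetime import datetime

def avgdtdiff(dates):
    # Sum of consecutive differences (i.e. the total span) divided by
    # the number of values, in milliseconds.
    ts = sorted(datetime.fromisoformat(d) for d in dates)
    span_ms = (ts[-1] - ts[0]).total_seconds() * 1000
    return span_ms / len(ts)

avgdtdiff(['2007-01-01 00:04:37', '2007-01-01 00:04:39',
           '2007-01-01 00:04:40', '2007-01-01 00:04:49'])  # 3000.0
```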
dategroupduration function
dategroupduration(date)

Returns the duration of the group of dates in seconds. Input dates should be in ISO 8601 format.

Examples:

>>> table1('''
... '2007-01-01 00:04:37'
... '2007-01-01 00:04:39'
... '2007-01-01 00:04:40'
... '2007-01-01 00:04:49'
... ''')
>>> sql("select dategroupduration(a) from table1")
dategroupduration(a)
--------------------
12
>>> sql("select dategroupduration(a) from (select '2005-01-01' as a) ")
dategroupduration(a)
--------------------
0
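The semantics match a simple max-minus-min computation; here is a plain-Python sketch (an illustration, not the madIS source):

```python
from datetime import datetime

def dategroupduration(dates):
    # Duration of the group in whole seconds; a single date yields 0.
    ts = [datetime.fromisoformat(d) for d in dates]
    return int((max(ts) - min(ts)).total_seconds())
```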
frecencyindex function
frecencyindex(date)

Returns the Frecency Index, which is computed from a set of date values using predefined time windows. Input dates should be in ISO 8601 format.

Examples:

>>> table1('''
... '2011-04-01 00:04:37'
... '2011-01-01 00:04:39'
... '2011-02-12 00:04:40'
... '2011-02-14 00:04:49'
... ''')
>>> sql("select frecencyindex(a) from table1")
frecencyindex(a)
----------------
2.9
mindtdiff function
mindtdiff(date)

Returns the minimum difference between the date values of the group, in milliseconds. Input dates should be in ISO 8601 format.

Examples:

>>> table1('''
... '2007-01-01 00:03:13'
... '2007-01-01 00:03:27'
... '2007-01-01 00:03:36'
... '2007-01-01 00:04:39'
... '2007-01-01 00:04:40'
... '2007-01-01 00:04:49'
... ''')
>>> sql("select mindtdiff(a) from table1")
mindtdiff(a)
------------
1000
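The behaviour can be sketched in plain Python as the smallest gap between consecutive sorted dates (an illustration, not the madIS source):

```python
from datetime import datetime

def mindtdiff(dates):
    # Smallest difference between consecutive (sorted) dates, in ms.
    ts = sorted(datetime.fromisoformat(d) for d in dates)
    return int(min((b - a).total_seconds() * 1000
                   for a, b in zip(ts, ts[1:])))
```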

graph functions

graphcliques function
graphcliques(node1, node2) → graph cliques

Finds and returns the cliques in the graph defined by the node1<->node2 links.

Examples:

>>> table1('''
... n1   n2
... n2   n3
... n1   n3
... n3   n4
... n4   n5
... n5   n3
... n1   n6
... ''')
>>> sql("select graphcliques(a,b) from table1")  
cliqueid | nodeid
-----------------
0        | n1
0        | n2
0        | n3
1        | n3
1        | n4
1        | n5
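The doctest output suggests that only cliques of three or more nodes are reported (the n1-n6 edge produces no clique row). A stdlib-only sketch of the same idea, using Bron-Kerbosch maximal-clique enumeration (an illustration, not the madIS implementation):

```python
def graphcliques(edges, minsize=3):
    # Bron-Kerbosch maximal-clique enumeration over the undirected
    # graph defined by the node1<->node2 links.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    found = []
    def expand(r, p, x):
        if not p and not x:
            if len(r) >= minsize:
                found.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    expand(set(), set(adj), set())
    return sorted(found)
```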
graphpowerhash function
graphpowerhash(steps, [undirected_edge, ]node1, node2[, node1_details, edge_details, node2_details]) → jpack of graph node hashes

Graph power hashing is based on a power iteration algorithm that calculates hashes at every processing step. The produced output contains, for every node in the input graph, a hash that “describes” its “surroundings”.

Parameters:

steps:

The steps option controls the number of steps that the power hashing will be executed for. Another way to conceptualize the steps parameter is to think of it as the radius of the graph around a particular node that the node’s hash covers.

The steps parameter’s possible values are:

  • null (default). When steps=null, steps is automatically set to number_of_nodes/2
  • Positive integer value.
  • -1. Steps is set to number_of_nodes
  • Negative integers. Steps is set to number_of_nodes / absolute_value(steps)
undirected_edge:

This parameter can only have the null value.

  • Parameter absent. The graph is assumed to be directed.
  • Parameter present and having a null value. The graph is assumed to be undirected
node1,node2:

Node1 connects to Node2. If node1 doesn’t connect to any node, then node2’s value should be null.

node and edge details:

Optional details that are processed along with the graph’s structure. In essence, these parameters define “tags” on the nodes and edges of the graph.

Note

The graph power hash algorithm is an experimental algorithm created by me, Lefteris Stamatogiannakis. I haven’t proved its correctness, so please use it with care. Due to its use of hashes, there is a (very low) probability that two different graphs could hash to the same power hash.

I would be very thankful to anyone knowledgeable in graph theory who could prove it wrong (or correct). If the reader knows of a paper that describes another algorithm similar to this one, I would be glad to be pointed towards it.

Note

The computational complexity of the powerhash algorithm is O(n * steps * average_node_degree). The optimal value for the hash to fully cover the graph is to set the steps parameter to graph_diameter / 2.

Right now, for steps=null, we take the worst-case upper bound of n / 2, so the computational complexity becomes O(n * ~(n/2) * average_node_degree).
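As a rough illustration of the power-iteration idea only (NOT the actual madIS algorithm, whose hashing details are more involved), one can repeatedly rehash every node together with its neighbours’ hashes; isomorphic graphs then produce identical sorted hash lists:

```python
import hashlib

def _md5(s):
    return hashlib.md5(s.encode()).hexdigest()

def graphpowerhash(edges, steps):
    # Toy sketch: every node starts with the same hash, and on each step
    # is rehashed together with its in- and out-neighbours' hashes.
    outs, ins, nodes = {}, {}, set()
    for a, b in edges:
        nodes |= {a, b}
        outs.setdefault(a, []).append(b)
        ins.setdefault(b, []).append(a)
    h = {n: _md5('') for n in nodes}
    for _ in range(steps):
        h = {n: _md5('|'.join(sorted(h[m] for m in ins.get(n, []))) +
                     '>' +
                     '|'.join(sorted(h[m] for m in outs.get(n, []))))
             for n in nodes}
    return sorted(h.values())
```

With the example tables below, the sketch behaves the same way as the doctests: the renumbered graph hashes identically, while the slightly changed one does not.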

Examples:

Directed graph:

>>> table1('''
... 1   2
... 2   3
... 3   4
... 4   5
... 5   3
... ''')
>>> sql("select graphpowerhash(null, a,b) from table1")
graphpowerhash(null, a,b)
------------------------------------------------------------------------------------------------------------------------------
["OaNj+OtIZPqcwjc3QVvKpg","Um7OU79ApcRNA2TKrdcBcA","ZyQT/AoKyjIkwWMNvceK2A","3vaHWSLU/H32HvHTVBkpUQ","+3uZjYUMSXwyZs7HFHKNVg"]

Above graph having its nodes renumbered (its powerhash is the same as above):

>>> table2('''
... 2   5
... 5   4
... 4   1
... 1   3
... 3   4
... ''')
>>> sql("select graphpowerhash(null, a,b) from table2")
graphpowerhash(null, a,b)
------------------------------------------------------------------------------------------------------------------------------
["OaNj+OtIZPqcwjc3QVvKpg","Um7OU79ApcRNA2TKrdcBcA","ZyQT/AoKyjIkwWMNvceK2A","3vaHWSLU/H32HvHTVBkpUQ","+3uZjYUMSXwyZs7HFHKNVg"]

Above graph with a small change (its hash differs from above graphs):

>>> table3('''
... 2   5
... 5   4
... 4   1
... 1   3
... 3   5
... ''')
>>> sql("select graphpowerhash(null, a,b) from table3")
graphpowerhash(null, a,b)
------------------------------------------------------------------------------------------------------------------------------
["APq1eISun1GpYjgUhiMrLA","NPPh9FLzC5cUedxldXV77Q","VVZ93zo6gePuMeRf6f00Zg","df/4yDABlitCTfOGut0NvA","lqo+lY4fcjqujlgsYr+3Yw"]

Actual testing of equality or inequality of above graphs:

>>> sql("select hashmd5( (select graphpowerhash(null, a,b) from table1) )=hashmd5( (select graphpowerhash(null, a,b) from table2) ) as grapheq")
grapheq
-------
1
>>> sql("select hashmd5( (select graphpowerhash(null, a,b) from table1) )=hashmd5( (select graphpowerhash(null, a,b) from table3) ) as grapheq")
grapheq
-------
0

Graph with only one node:

>>> sql("select graphpowerhash(null, a, null) from (select * from table1 limit 1)")
graphpowerhash(null, a, null)
-----------------------------
["TOiuilAk4RLkg01tIwyvcg"]

Undirected version of table1’s graph:

>>> sql("select graphpowerhash(null, null, a,b) from table1")
graphpowerhash(null, null, a,b)
------------------------------------------------------------------------------------------------------------------------------
["JudlYSkYV7rFHjk94abY/A","W88IN4kgDSeVX9kaY36SJg","W88IN4kgDSeVX9kaY36SJg","6ez9ee0N2ogdvKJVQ8VKWA","7gz+LT/LtsyFc+GxMUlL8g"]

Same graph as above, but some of the edges have been reversed (the undirected powerhash matches the powerhash above):

>>> table4('''
... 2   1
... 2   3
... 3   4
... 4   5
... 3   5
... ''')
>>> sql("select graphpowerhash(null, null, a,b) from table4")
graphpowerhash(null, null, a,b)
------------------------------------------------------------------------------------------------------------------------------
["JudlYSkYV7rFHjk94abY/A","W88IN4kgDSeVX9kaY36SJg","W88IN4kgDSeVX9kaY36SJg","6ez9ee0N2ogdvKJVQ8VKWA","7gz+LT/LtsyFc+GxMUlL8g"]

Graph similarity, using the step parameter (value of step defines the radius of the similar subgraphs that can be found):

>>> sql("select jaccard( (select graphpowerhash(3, a, b) from table1), (select graphpowerhash(3, a, b) from table3) ) as jacsim")
jacsim
------
0.0
>>> sql("select jaccard( (select graphpowerhash(1, a, b) from table1), (select graphpowerhash(1, a, b) from table3) ) as jacsim")
jacsim
------
0.25

Powerhash of graph having details (using a chemical composition):

>>> table5('''
... 1   2   O   =   C
... 2   3   C   =   O
... ''')

First without details:

>>> sql("select graphpowerhash(null, null, a, b) from table5")
graphpowerhash(null, null, a, b)
----------------------------------------------------------------------------
["Rw3sDN24TI7YARBNOOmYSg","9m5wcZf9iUxDwgzQkzu6Ag","9m5wcZf9iUxDwgzQkzu6Ag"]

Second with all details:

>>> sql("select graphpowerhash(null, null, a, b, c, d, e) from table5")
graphpowerhash(null, null, a, b, c, d, e)
----------------------------------------------------------------------------
["CPebw+eZYzw5bWgx47/tkg","CPebw+eZYzw5bWgx47/tkg","WNn4aDDBKcoMMi+nrz5JEA"]
graphtodot function
graphtodot(graphname, [undirected_edge, ]node1, node2[, node1_details, edge_details, node2_details]) → graphviz dot graph

Returns the Graphviz DOT representation of the input graph.

Examples:

Directed graph:

>>> table1('''
... 1   2
... 2   3
... 3   4
... 4   5
... 5   3
... ''')
>>> sql("select graphtodot(null, a,b) from table1")
graphtodot(null, a,b)
------------------------------------------------------------------------
digraph  {
"1" -> "2";
"2" -> "3";
"3" -> "4";
"4" -> "5";
"5" -> "3";
}

Undirected graph:

>>> table2('''
... 2   5
... 5   4
... 4   1
... 1   3
... 3   4
... ''')
>>> sql("select graphtodot(null, null, a,b) from table2")
graphtodot(null, null, a,b)
----------------------------------------------------------------------
graph  {
"1" -- "3";
"2" -- "5";
"3" -- "4";
"4" -- "1";
"5" -- "4";
}

Graph with details:

>>> table5('''
... 1   2   O   =   C
... 2   3   C   =   O
... ''')
>>> sql("select graphtodot('chem_comp_1', null, a, b, c, d, e) from table5")
graphtodot('chem_comp_1', null, a, b, c, d, e)
------------------------------------------------------------------------------------------------------------------------
graph chem_comp_1 {
"1" [label="O"];
"1" -- "2" [label="="];
"2" [label="C"];
"2" -- "3" [label="="];
"3" [label="O"];
}
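For plain node1, node2 input (no details), the serialization can be sketched in a few lines of Python; edge ordering may differ from the madIS output, and this illustration omits the label handling shown above:

```python
def graphtodot(name, edges, directed=True):
    # Minimal DOT serialization for (node1, node2) pairs.
    kw, arrow = ('digraph', '->') if directed else ('graph', '--')
    lines = ['%s %s {' % (kw, name or '')]
    lines += ['"%s" %s "%s";' % (a, arrow, b) for a, b in edges]
    lines.append('}')
    return '\n'.join(lines)
```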
graphtotgf function
graphtotgf(node1, node2[, node1_details, edge_details, node2_details]) → TGF graph

Returns the TGF representation of the input graph.

Examples:

>>> table1('''
... 1   2
... 2   3
... 3   4
... 4   5
... 5   3
... ''')
>>> sql("select graphtotgf(a,b) from table1")  
graphtotgf(a,b)
------------------------------------------
1
2
3
4
5
#
1 2
2 3
3 4
4 5
5 3

Graph with details:

>>> table5('''
... 1   2   O   =   C
... 2   3   C   =   O
... ''')
>>> sql("select graphtotgf(a, b, c, d, e) from table5")
graphtotgf(a, b, c, d, e)
--------------------------
1 O
2 C
3 O
#
1 2 =
2 3 =
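The TGF format is simple enough to sketch directly: a node list, a `#` separator, then an edge list. A plain-Python illustration for the detail-less case (not the madIS source):

```python
def graphtotgf(edges):
    # TGF: node list, '#' separator, then edge list; nodes are listed
    # in first-appearance order.
    nodes = []
    for a, b in edges:
        for n in (a, b):
            if n not in nodes:
                nodes.append(n)
    return '\n'.join([str(n) for n in nodes] + ['#'] +
                     ['%s %s' % (a, b) for a, b in edges])
```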

jpacks functions

jdictgroup function
jdictgroup(columns)

Groups columns of a group into a jdict.

Example:

>>> table1('''
... word1   1
... word2   1
... word3   2
... word4   2
... ''')
>>> sql("select jdictgroup(a) from table1 group by b")
jdictgroup(a)
---------------------------
{"word1":null,"word2":null}
{"word3":null,"word4":null}
>>> sql("select jdictgroup(a,b) from table1")
jdictgroup(a,b)
-----------------------------------------
{"word1":1,"word2":1,"word3":2,"word4":2}
>>> table2('''
... [1,2]   1
... [3,4]   1
... [5,6]   2
... [7,8]   2
... ''')
>>> sql("select jdictgroup(a) from table2")
jdictgroup(a)
-----------------------------------------------------
{"[1,2]":null,"[3,4]":null,"[5,6]":null,"[7,8]":null}
>>> sql("select jdictgroup(a,b) from table2")
jdictgroup(a,b)
-----------------------------------------
{"[1,2]":1,"[3,4]":1,"[5,6]":2,"[7,8]":2}
jdictgroupunion function
jdictgroupunion(jdicts) → jdict

Calculates the union of all jdicts inside a group. The returned jdict’s key values are calculated as the maximum length of the lists (or dictionaries) that have been found inside the individual jdicts of the group.

Example:

>>> table1('''
... '{"b":1, "a":1}'
... '{"c":1, "d":[1,2,3]}'
... '{"b":{"1":2,"3":4}, "d":1}'
... ''')
>>> sql("select jdictgroupunion(a) from table1")
jdictgroupunion(a)
-------------------------
{"b":2,"a":1,"c":1,"d":3}
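The key-to-max-length rule above can be sketched in plain Python (scalars count as length 1; an illustration, not the madIS source):

```python
import json

def jdictgroupunion(jdicts):
    # Each key maps to the max "length" seen across the group:
    # len() for lists/dicts, 1 for scalar values.
    out = {}
    for j in jdicts:
        for k, v in json.loads(j).items():
            length = len(v) if isinstance(v, (list, dict)) else 1
            out[k] = max(out.get(k, 0), length)
    return out
```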
jgroup function
jgroup(columns)

Groups columns of a group into a jpack.

Example:

>>> table1('''
... word1   1
... word2   1
... word3   2
... word4   2
... ''')
>>> sql("select jgroup(a) from table1 group by b")
jgroup(a)
-----------------
["word1","word2"]
["word3","word4"]
>>> sql("select jgroup(a,b) from table1")
jgroup(a,b)
-------------------------------------------------
[["word1",1],["word2",1],["word3",2],["word4",2]]
>>> table2('''
... [1,2]   1
... [3,4]   1
... [5,6]   2
... [7,8]   2
... ''')
>>> sql("select jgroup(a) from table2")
jgroup(a)
-------------------------
[[1,2],[3,4],[5,6],[7,8]]
>>> sql("select jgroup(a,b) from table2")
jgroup(a,b)
-----------------------------------------
[[[1,2],1],[[3,4],1],[[5,6],2],[[7,8],2]]
>>> sql("select jgroup(jdict('a',a,'b',b)) from table2")
jgroup(jdict('a',a,'b',b))
-------------------------------------------------------------------------
[{"a":[1,2],"b":1},{"a":[3,4],"b":1},{"a":[5,6],"b":2},{"a":[7,8],"b":2}]
jgroupintersection function
jgroupintersection(columns) → jpack

Calculates the intersection of all jpacks (by treating them as sets) inside a group.

Example:

>>> table1('''
... '[1,2]' 2
... '[2,3]' 2
... '[2,4]' '[2,11]'
... 2 2
... ''')
>>> sql("select jgroupintersection(a,b) from table1")
jgroupintersection(a,b)
-----------------------
2
>>> sql("select jgroupintersection(1)")
jgroupintersection(1)
---------------------
1
jgroupunion function
jgroupunion(columns) → jpack

Calculates the union of the jpacks (by treating them as sets) inside a group.

Example:

>>> table1('''
... '[1,2]' 6
... '[2,3]' 7
... '[2,4]' '[8,11]'
... 5 9
... ''')
>>> sql("select jgroupunion(a,b) from table1")
jgroupunion(a,b)
----------------------
[1,2,6,3,7,4,8,11,5,9]
>>> sql("select jgroupunion(1)")
jgroupunion(1)
--------------
1
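The set-union semantics (jpacks contribute their elements, scalars contribute themselves, first-seen order is kept) can be sketched as follows; this is an illustration of the behaviour shown in the doctest, not the madIS source:

```python
import json

def jgroupunion(values):
    # Treat each value as a jpack: JSON lists contribute their
    # elements, everything else contributes itself; duplicates are
    # dropped while preserving first-seen order.
    seen = []
    for v in values:
        try:
            items = json.loads(v) if isinstance(v, str) else v
        except ValueError:
            items = v
        if not isinstance(items, list):
            items = [items]
        for x in items:
            if x not in seen:
                seen.append(x)
    return seen
```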
jgroupunionkeys function
jgroupunionkeys(columns) → jpack

Calculates the union of the jdict keys. Use it with care, because for performance reasons the input data are not checked at all. They should all be jdicts.

Example:

>>> table1('''
... '{"1":1, "2":3}' '{"a":5}'
... '{"2":1, "3":3}' '{}'
... ''')
>>> sql("select jgroupunionkeys(a,b) from table1")
jgroupunionkeys(a,b)
--------------------
["1","2","a","3"]
>>> sql("select jgroupunionkeys('{}')")
jgroupunionkeys('{}')
---------------------
[]
jgroupuniquelimit function
jgroupuniquelimit(jpack, k, limit) → jpack

Returns the value of the k column at the row where the number of unique values accumulated over all jpacks first reaches limit.

Example:

>>> table1('''
... '[1,2]' 1
... '[2,3,4,5]' 2
... '[2,4]' 3
... 5 4
... ''')
>>> sql("select jgroupuniquelimit(a,b,3) from table1")
jgroupuniquelimit(a,b,3)
------------------------
2
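A plain-Python sketch of the semantics suggested by the doctest (an assumed reconstruction, not the madIS source):

```python
import json

def jgroupuniquelimit(rows, limit):
    # rows: (jpack, k) pairs in group order; returns the k at which the
    # running count of unique jpack values first reaches `limit`.
    seen = set()
    for jpack, k in rows:
        vals = json.loads(jpack) if isinstance(jpack, str) else jpack
        if not isinstance(vals, list):
            vals = [vals]
        seen.update(vals)
        if len(seen) >= limit:
            return k
    return None
```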

mining functions

freqitemsets function
freqitemsets(datacol[, threshold, noautothres, stats, maxlen]) → [itemset_id:int, itemset_length:int, itemset_frequency:int, item:text]

Calculates frequent itemsets on a given column (datacol). The algorithm is tuned for the case when we have many different items (in the order of millions), many input itemsets, but small itemset length (10-20).

Returned table schema:

itemset_id:Automatic itemset id
itemset_length:Length of itemset
itemset_frequency:
 How many times an itemset has been found
item:Itemset’s item value

Parameters:

datacol:

Column on which to calculate frequent itemsets

threshold:

Default is 2

How many times a frequent itemset must appear for it to be included in the results

noautothres:

1/0 (Default is 0)

Do not calculate the threshold automatically

stats:

1/0 (Default is 0)

Return frequent itemset statistics

maxlen:

NUMBER (Default is no limit at all)

Maximum itemset length to search

Examples:

>>> table1('''
... 'car wood bike' 'first group'
... 'car car wood'  'first group'
... 'car wood'      'first group'
... 'car wood ice'  'first group'
... 'ice'           'second group'
... 'car ice'       'second group'
... 'car cream toy' 'second group'
... 'icecream ice car toy'  'second group'
... ''')
>>> sql("select b,freqitemsets(a, 'threshold:2', 'noautothres:1', 'maxlen:2') from table1 group by b")
b            | itemset_id | itemset_length | itemset_frequency | item
---------------------------------------------------------------------
first group  | 1          | 1              | 4                 | wood
first group  | 2          | 1              | 4                 | car
first group  | 3          | 2              | 4                 | car
first group  | 3          | 2              | 4                 | wood
second group | 1          | 1              | 3                 | ice
second group | 2          | 1              | 3                 | car
second group | 3          | 1              | 2                 | toy
second group | 4          | 2              | 2                 | car
second group | 4          | 2              | 2                 | ice
second group | 5          | 2              | 2                 | car
second group | 5          | 2              | 2                 | toy
>>> sql("select b,freqitemsets(a, 'stats:1') from table1 group by b")
b            | MaxTransactionLength | CombinationCount | PassedTransactions | ValidKeywords
-------------------------------------------------------------------------------------------
first group  | 3                    | 2                | 3                  | 2
first group  | 3                    | 1                | 1                  | 2
first group  | 3                    | 0                | 0                  | 0
second group | 4                    | 3                | 3                  | 3
second group | 4                    | 0                | 3                  | 0

sampledistvals function

sampledistvals(sample_size, C1, C2, C3) → [C1, C2, C3]

Sampledistvals returns up to sample_size distinct values for each of the input columns C1..Cn.

>>> table1('''
... test1 2 3
... test1 2 3
... test2 4 2
... test4 2 t
... ''')
>>> sql("select sampledistvals(3, a, b, c) from table1")
C1                        | C2    | C3
---------------------------------------------
["test1","test2","test4"] | [2,4] | [2,3,"t"]

samplegroup function

samplegroup(sample_size, C1, C2, C3)

Returns a random sample of sample_size rows.

>>> table1('''
... test1 2 3
... test1 2 3
... test2 4 2
... test4 2 t
... ''')
>>> sql("select samplegroup(2, a, b, c) from table1") 
C1    | C2 | C3
---------------
...
>>> sql("select samplegroup(2) from (select 5 where 5=6)") 

partialsort functions

partialsort function

partialsort(n, col1, col2, col3, ....)

Sorts its input rows on their first n columns.

Returned multiset schema:
 Columns are automatically named as col1, col2 …

Examples:

>>> table1('''
... aa  43
... ac  34
... ab  21
... as  23
... ''')
>>> sql("select partialsort(1,a,b) from table1")
c1 | c2
-------
aa | 43
ab | 21
ac | 34
as | 23
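The behaviour amounts to a stable sort keyed on the first n columns; a plain-Python sketch (an illustration, not the madIS source):

```python
def partialsort(n, rows):
    # Stable sort of the rows on their first n columns; rows that
    # compare equal on those columns keep their input order.
    return sorted(rows, key=lambda r: r[:n])
```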

selection functions

groupdiff function
groupdiff(compare, values)

Compares group members over the first argument (i.e. compare). It keeps only the rows where compare changes. It appends a column at the end containing how many times compare repeats.

Examples:

>>> table1('''
... 0  a
... 0  b
... 1  c
... 1  d
... 2  e
... 3  e
... 3  f
... 3  g
... ''')
>>> sql("select groupdiff(a,b) as b from table1")
b1 | b2 | b3
------------
0  | a  | 2
1  | c  | 2
2  | e  | 1
3  | e  | 3
>>> sql("select groupdiff(a) as a from table1")
a1 | a2
-------
0  | 2
1  | 2
2  | 1
3  | 3
>>> sql("select groupdiff(b,a) as a from table1")
a1 | a2 | a3
------------
a  | 0  | 1
b  | 0  | 1
c  | 1  | 1
d  | 1  | 1
e  | 2  | 2
f  | 3  | 1
g  | 3  | 1
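The run-length-style behaviour above can be sketched in plain Python (keep the first row of each run of equal compare values and append the run length; an illustration, not the madIS source):

```python
def groupdiff(rows):
    # rows: (compare, value) pairs in group order. Keep the first row
    # of each run of equal `compare` values, appending the run length.
    out = []
    for row in rows:
        if out and out[-1][0] == row[0]:
            out[-1][-1] += 1
        else:
            out.append(list(row) + [1])
    return out
```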

imax function

imax(i, value)

Returns the i-th max value of the group.

Examples:

>>> table1('''
... 34  la
... 18   lo
... 120.0  all
... ''')
>>> sql("select imax(1,a) as first from table1")
first
-----
120
>>> sql("select imax(3,a) as third from table1")
third
-----
18
>>> sql("select imax(2,cast( a as text))  as secstr from table1")
secstr
------
18
>>> sql("select imax(4,a) from table1")
imax(4,a)
---------
None
>>> sql("select imax(4,a) from (select 5 as a where a!=5)")
imax(4,a)
---------
None
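The semantics reduce to picking the i-th largest value, with None when the group is too small; a plain-Python sketch (an illustration, not the madIS source):

```python
def imax(i, values):
    # i-th largest value of the group; None when the group has fewer
    # than i values.
    vals = sorted(values, reverse=True)
    return vals[i - 1] if len(vals) >= i else None
```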
maxrow function
maxrow(compare, value)

Compares group members over the first argument (i.e. compare). When the maximum is located, it returns the corresponding value in the second argument (i.e. value).

Examples:

>>> table1('''
... 34  la
... 18   lo
... 120.0  all
... ''')
>>> sql("select maxrow(a,b) as b from table1")
b
---
all
>>> sql("select maxrow(a,a) as a from table1")
a
---
120
>>> sql("select maxrow(b,a) as a from table1")
a
--
18
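maxrow (and the minrow function below, which mirrors it) behaves like an argmax over the group; a plain-Python sketch (an illustration, not the madIS source):

```python
def maxrow(rows):
    # rows: (compare, value) pairs; value of the row with max compare.
    return max(rows, key=lambda r: r[0])[1]

def minrow(rows):
    # value of the row with min compare.
    return min(rows, key=lambda r: r[0])[1]
```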

minrow function

minrow(compare, value)

Compares group members over the first argument (i.e. compare). When the minimum is located, it returns the corresponding value in the second argument (i.e. value).

Examples:

>>> table1('''
... 34  la
... 18   lo
... 120.0  all
... ''')
>>> sql("select minrow(a,b) as b from table1")
b
--
lo
>>> sql("select minrow(a,a) as a from table1")
a
--
18
>>> sql("select minrow(a,a) as a from (select 5 as a where a!=5)")
a
----
None

ontop function

ontop(n, compare, value1, value2, ....) → [colname1, colname2 ...]

Compares group members over the second argument (i.e. compare), so as to locate the top n members (specified in the first argument) and then returns the corresponding data under the specified columns value1, value2, …..

Returned multiset schema:
 Columns are automatically named as colname1 text, colname2 text…

Examples:

>>> table1('''
... 34  la
... 18   lo
... 120.0  all
... ''')
>>> sql("select ontop(1,a,b) from table1")
top1
----
all
>>> sql("select ontop(2,a,a,b) from table1")
top1 | top2
-----------
120  | all
34   | la
>>> sql("select ontop(2,a,a,b,a,b) from table1")
top1 | top2 | top3 | top4
-------------------------
120  | all  | 120  | all
34   | la   | 34   | la
>>> sql("select ontop(pk) from (select 5 as pk where pk!=5)")
top1
----

skcluster functions

skcluster function
skcluster(str, rowid cols)
NOTE: This operator requires the following packages: ‘numpy’, ‘scipy’, ‘sklearn’. numpy & scipy: https://www.scipy.org/scipylib/download.html, sklearn: http://scikit-learn.org/stable/install.html

Runs the clustering algorithm specified by str and returns, for each sample, its id, its coordinates in space, and its cluster label. cols refers to the feature columns, except for the first one, which contains each data point’s id. In case the algorithm computes cluster centroids, #features additional columns are returned: the coordinates of the centroid of the cluster to which each data point belongs. For instance, if the dataset consists of 4 columns (= 4 features), 4 more columns will be added to the result table when centroids have been computed.

Parameters:

Initstr:Initialization string of the algorithm with optional parameters (from scikit-learn api, ie: KMeans(n_clusters=3)).
Cols:Names of the input-variables

Examples: A dataset with 3 columns (1 column for the id and 2 features):

>>> table1('''
... 0.52   0.40
... 0.63   0.33
... 0.77   0.15
... 0.81   0.74
... ''')
>>> sql("select skcluster('AffinityPropagation()',c1,c2) from table1")
---------------------------
id | label | center1 | center2
--------------
1 | 0 | 0.63 | 0.33
2 | 0 | 0.63 | 0.33
3 | 0 | 0.63 | 0.33
4 | 1 | 0.81 | 0.74

In case the algorithm doesn’t compute centroids, only the ‘id’ and ‘label’ columns appear in the result:

>>> sql("select skcluster('SpectralClustering(n_clusters=2)',c1,c2) from table1")
---------------------------
id | label
--------------
1 | 0 |
2 | 0 |
3 | 0 |
4 | 1 |

skdimred functions

skdimred function
skdimred(initstr, cols)
NOTE: This operator requires the following packages: ‘numpy’, ‘scipy’, ‘sklearn’. numpy & scipy: https://www.scipy.org/scipylib/download.html, sklearn: http://scikit-learn.org/stable/install.html

Implements dimensionality reduction on table t (based on algorithms from the machine learning package scikit-learn.org). Standardization is also performed on the features, by removing the mean and scaling to unit variance. Returns the transformed data in the new space.

Parameters:

Initstr:Initialization string of the algorithm with optional parameters (from the scikit-learn api, ie: PCA(n_components=3)). Three methods are supported: PCA, SVD and TSNE.
Cols:Names of the input-variables

Examples: Sample from the iris dataset with 4 columns (SepalLength, SepalWidth, PetalLength, PetalWidth):

>>> table1('''
... 5.1 3.5     1.4     0.2
... 4.9 3       1.4     0.2
... 4.7 3.2     1.3     0.2
... 4.6 3.1     1.5     0.2
... 5   3.6     1.4     0.2

… — [0|Column names — … [1|SL [2|SW [3|PL [4]PW

… ‘’‘)

>>> sql("select skdimred('PCA(n_components=2)',SL,SW,PL,PW) from table1;")
eig1            |  eig2
------------------------------
-1.52434877924  |  -0.436029188708
0.669710216202  |  0.234613817817
0.378259496001  |  1.74252845419
1.9247054679    |  -1.10077422234
-1.44832640086  |  -0.440338860953

statistics functions

amean function
amean(X) → [amean float]

Computes the arithmetic mean, i.e. the average, thus providing an alternative choice to the traditional avg offered by SQL.

Examples:

>>> table1('''
... 1
... 2
... 2
... 3
... 'text is ignored, as well as null values'
... 'none'
... ''')
>>> sql("select amean(a) from table1")
amean(a)
--------
2.0
frecency function
frecency(actiondate[, points[, now:date]])

Returns a float weighted sum, assigning to each action its points or less, depending on the distance of actiondate from the current date (or now:date). In detail, the action points decrease by 30% at a distance of 10-30 days, by 50% at 1-3 months, by 70% at 3-6 months, and by 90% at greater distances. Date parameters should be in ISO 8601 format.

ISO 8601 format:
Year:
YYYY (eg 1997)
Year and month:
YYYY-MM (eg 1997-07)
Complete date:
YYYY-MM-DD (eg 1997-07-16)
Complete date plus hours and minutes:
YYYY-MM-DD hh:mmTZD (eg 1997-07-16 19:20+01:00)
Complete date plus hours, minutes and seconds:
YYYY-MM-DD hh:mm:ssTZD (eg 1997-07-16 19:20:30+01:00)
Complete date plus hours and minutes:
YYYY-MM-DDThh:mmTZD (eg 1997-07-16T19:20+01:00)
Complete date plus hours, minutes and seconds:
YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)

Examples:

>>> table1('''
... 2009-06-01      1
... 2009-08-28      2
... 2009-09-17      3
... ''')
>>> sql("select frecency(a,'now:2009-09-26 04:38:30') from table1")
frecency(a,'now:2009-09-26 04:38:30')
-------------------------------------
200.0
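The decay schedule can be sketched in plain Python. The default of 100 points per action and the exact window boundaries below are assumptions chosen to reproduce the doctest above (30 + 70 + 100 = 200), not the madIS source:

```python
from datetime import datetime

def frecency(actiondate, now, points=100):
    # Hypothetical sketch of the decay schedule: full points within
    # 10 days, then 70% / 50% / 30% / 10% of points as the action ages.
    days = (datetime.fromisoformat(now) -
            datetime.fromisoformat(actiondate)).days
    if days <= 10:
        return float(points)
    if days <= 30:
        return points * 0.7
    if days <= 90:
        return points * 0.5
    if days <= 180:
        return points * 0.3
    return points * 0.1
```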
fsum function
fsum(X) → json

Computes the sum using fractional computation. It returns the result in JSON format.

Examples:

>>> table1('''
... 1
... 2
... 2
... 10
... ''')
>>> sql("select fsum(a) from table1")
fsum(a)
-------
[15, 1]
>>> table1('''
... 0.99999999
... 3.99999999
... 0.78978989
... 1.99999999
... ''')
>>> sql("select fsum(a) from table1")
fsum(a)
-------------------------------------
[70164189421580937, 9007199254740992]
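The [numerator, denominator] output can be reproduced in the spirit of fsum with Python's exact fractions (a sketch; bit-level agreement with madIS for float inputs is not guaranteed):

```python
from fractions import Fraction

def fsum_frac(values):
    # Exact sum as [numerator, denominator]; Fraction(float) converts
    # each float to its exact binary fraction before summing.
    s = sum(Fraction(v) for v in values)
    return [s.numerator, s.denominator]
```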
gmean function
gmean(X[, m]) → [gmean float]

Computes the generalized mean (also known as the power mean or Hölder mean), which is an abstraction of the Pythagorean means, including the arithmetic, geometric, and harmonic means.

It is defined for a set of n positive real numbers as follows:

gmean = \Big( \frac{1}{N} \sum_{i=1}^{N} x_i^p \Big)^{1/p}

The (optional) second argument stands for the p parameter, thus determining the exact mean type:

  • p=2 : Quadratic mean (computed for both negative and positive values)
  • p=1 : Arithmetic mean
  • p=0 : Geometric mean (only for positive real numbers)
  • p=-1: Harmonic mean (only for positive real numbers)

By default, i.e. in absence of second argument, p is set to 0, computing the geometric mean.

Examples:

>>> table1('''
... 6
... 50
... 9
... 1200
... 'text is ignored, as well as None values'
... 'None'
... ''')
>>> sql("select gmean(a) from table1")
gmean(a)
-------------
42.4264068712
>>> table2('''
... 34
... 27
... 45
... 55
... 22
... 34
... ''')
>>> sql("select gmean(a,1) from table2")
gmean(a,1)
-------------
36.1666666667
>>> sql("select gmean(a,0) from table2")
gmean(a,0)
-------------
34.5451100372
>>> sql("select gmean(a) from table2")
gmean(a)
-------------
34.5451100372
>>> sql("select gmean(a,-1) from table2")
gmean(a,-1)
-------------
33.0179836512
>>> sql("select gmean(a,2) from table2")
gmean(a,2)
-------------
37.8043207407
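The formula above translates directly into a short sketch (geometric mean handled as the p → 0 limit via the product; an illustration, not the madIS source):

```python
def gmean(values, p=0):
    # Generalized (power) mean; p=0 is the geometric mean, taken as
    # the N-th root of the product of the values.
    n = len(values)
    if p == 0:
        prod = 1.0
        for x in values:
            prod *= x
        return prod ** (1.0 / n)
    return (sum(x ** p for x in values) / n) ** (1.0 / p)
```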
median function
median(X) → [median float]

Returns the median, i.e. the numeric value separating the higher half of a sample, a population, or a probability distribution from the lower half. It is computed by arranging all the observations from the lowest value to the highest value and picking the middle one. If there is an even number of observations, there is no single middle value, so the mean of the two middle values is taken. Incoming textual values are simply ignored.

Examples:

>>> table1('''
... 1
... 3
... 6
... 6
... 6
... 6
... 7
... 7
... 7
... 7
... 12
... 12
... 17
... ''')
>>> sql("select median(a) from table1")
median(a)
---------
7.0
>>> table2('''
... 1
... 2
... 2
... 3
... 3
... 9
... ''')
>>> sql("select median(a) from table2")
median(a)
---------
2.5
>>> table3('''
... 1
... 2
... maria
... lala
... null
... 'None'
... 3
... 9
... ''')
>>> sql("select median(a) from table3")
median(a)
---------
2.5
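The described procedure (sort the numeric values, pick the middle one or average the two middle ones, ignore non-numeric entries) maps to a short sketch:

```python
def median(values):
    # Sort numeric values only; middle element for odd counts,
    # mean of the two middle elements for even counts.
    nums = sorted(v for v in values if isinstance(v, (int, float)))
    n = len(nums)
    mid = n // 2
    if n % 2:
        return float(nums[mid])
    return (nums[mid - 1] + nums[mid]) / 2.0
```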
modeop function
modeop(X) → [ModeOpElements int/str, ModeOpValue int]

Returns the mode (i.e. the value that occurs most frequently in a data set), along with the mode value (i.e. the maximum frequency of occurrence). When more than one mode is found in a data set (i.e. when more than one value appears with the maximum frequency), all such values are returned.

For a sample from a continuous distribution, such as [0.935…, 1.211…, 2.430…, 3.668…, 3.874…], the concept of mode is unusable in its raw form, since each value occurs precisely once. Following the usual practice, the data is discretized by rounding to the nearest integer value. For a textual sample, values are first converted to lowercase.

Returned multiset schema:
 Columns are automatically named as ModeOpElements, ModeOpValue

Examples:

>>> table1('''
... 1
... 3
... 6
... 6
... 6
... 6
... 7
... 7
... 7
... 7
... 12
... 12
... 17
... ''')
>>> sql("select modeop(a) from table1")
ModeOpElements | ModeOpValue
----------------------------
6              | 4
7              | 4
>>> table2('''
... 1.1235
... 1
... 5.1
... 5.2
... 5.3
... 5.5
... 5.6
... 5.7
... ''')
>>> sql("select modeop(a) from table2")
ModeOpElements | ModeOpValue
----------------------------
5              | 3
6              | 3
>>> table3('''
... leuteris
... maria
... marialena
... Meili
... meili
... ''')
>>> sql("select modeop(a) from table3")
ModeOpElements | ModeOpValue
----------------------------
meili          | 2
pearson function
pearson(X, Y) → float

Computes the Pearson correlation coefficient of the X and Y datasets.

Examples:

>>> sql("select pearson(c1,1/c1) from range(1,91)")
pearson(c1,1/c1)
----------------
-0.181568259801
>>> sql("select pearson(c1,17*c1+5) from range(1,91)")
pearson(c1,17*c1+5)
-------------------
1.0
>>> sql("select pearson(c1,pyfun('math.pow',2,c1)) from range(1,41)")
pearson(c1,pyfun('math.pow',2,c1))
----------------------------------
0.456349821382
>>> sql("select pearson(a,b) from (select 1 as a, 2 as b)")
pearson(a,b)
------------
0
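The standard formula (covariance over the product of standard deviations) is a short sketch in plain Python; when either variance is zero the doctest above suggests 0 is returned:

```python
def pearson(xs, ys):
    # Pearson r = cov(X, Y) / (sd(X) * sd(Y)); 0 when degenerate.
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sdx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sdy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sdx == 0 or sdy == 0:
        return 0.0
    return cov / (sdx * sdy)
```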
rangef function
rangef(X) → [rangef float]

Computes the numerical range of a dataset X, subtracting the minimum value from the maximum value. Textual and NULL data entries are simply ignored.

Examples:

>>> table1('''
... 1
... 3
... 6
... 6
... 7
... 12
... 12
... 17
... 'text is ignored'
... 'None'
... ''')
>>> sql("select rangef(a) from table1")
rangef(a)
---------
16.0
stdev function
stdev(X[, type]) → [stdev float]

Computes standard deviation of a dataset X, i.e. the square root of its variance. Sample standard deviation is determined by default; population standard deviation can be determined by setting the (optional) second argument to values ‘true’ or ‘population’. When values ‘false’ or ‘sample’ are entered for type, the default sample standard deviation computation is performed.

Examples:

>>> table1('''
... 3
... 7
... 7
... 19
... 'text is ignored'
... 'none'
... ''')
>>> sql("select stdev(a) from table1")
stdev(a)
-------------
6.92820323028
>>> sql("select stdev(a,'population') from table1")
stdev(a,'population')
---------------------
6.0
>>> sql("select stdev(a,'true') from table1")
stdev(a,'true')
---------------
6.0
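Python's standard `statistics` module implements the same two variants and can serve as a cross-check (this is not madIS code):

```python
import statistics

data = [3, 7, 7, 19]  # the numeric rows of table1 above
print(statistics.stdev(data))   # sample standard deviation (divisor n-1), ~6.928
print(statistics.pstdev(data))  # population standard deviation (divisor n): 6.0
```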
variance function
variance(X[, type]) → [variance float]

Computes the variance of a dataset X, a measure of the spread of the data about the mean. Sample variance is computed by default; population variance is computed by setting the optional second argument to 'true' or 'population'. When 'false' or 'sample' is given for type, the default sample variance computation is performed.

Examples:

>>> table1('''
... 1
... 2
... 3
... 4
... 5
... 6
... 'text is ignored'
... 'none'
... ''')
>>> sql("select variance(a) from table1")
variance(a)
-----------
3.5
>>> sql("select variance(a,'false') from table1")
variance(a,'false')
-------------------
3.5
>>> sql("select variance(a,'sample') from table1")
variance(a,'sample')
--------------------
3.5
>>> sql("select variance(a,'True') from table1")
variance(a,'True')
------------------
2.91666666667
>>> sql("select variance(a,'Population') from table1")
variance(a,'Population')
------------------------
2.91666666667
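As with stdev, Python's `statistics` module offers both variants for cross-checking (not madIS code):

```python
import statistics

data = [1, 2, 3, 4, 5, 6]  # the numeric rows of table1 above
print(statistics.variance(data))   # sample variance (divisor n-1): 3.5
print(statistics.pvariance(data))  # population variance (divisor n), ~2.9167
```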
wamean function
wamean(W, X) → [wamean float]

Computes the weighted arithmetic mean, i.e. the weighted average. First column contains the weights and second column contains the actual data values.

wamean = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}

Examples:

>>> table1('''
... 2 1
... 2 2
... 1 2
... 'text is ignored, as well as null values' 3
... 'none' 2
... 1 'text is ignored, as well as null values'
... 2 'none'
... 2 3
... ''')
>>> sql("select wamean(a,b) from table1")
wamean(a,b)
-----------
2.0
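The formula above can be sketched in plain Python (illustrative only; the helper name `wamean` and its pair layout are our assumptions):

```python
def wamean(pairs):
    # pairs of (weight, value); rows with a non-numeric weight or value
    # are ignored, matching wamean's handling of text and NULLs.
    num = den = 0.0
    for w, x in pairs:
        if isinstance(w, (int, float)) and isinstance(x, (int, float)):
            num += w * x
            den += w
    return num / den

rows = [(2, 1), (2, 2), (1, 2), ('text', 3), (None, 2),
        (1, 'text'), (2, None), (2, 3)]
print(wamean(rows))  # (2*1 + 2*2 + 1*2 + 2*3) / (2+2+1+2) = 14/7 = 2.0
```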

subgroup functions

condbreak function

condbreak(groupid, C1, C2, ...., condition, orderby) → [bgroupid,C1,C2....]

Returns an expanded groupid and the values C1…CN, performing a new grouping whenever condition is true. Rows grouped together are those that, ordered by the orderby column, have no intermediate true values for condition.

Returned multiset schema:
 
  • bgroupid
    groupid appended with an integer value indicating the subgroup of the row.
  • C1, C2 ..
    The input values of the row.


>>> table1('''
... 1 user1  open
... 2 user1  read
... 3 user1  close
... 4 user1  open
... 5 user1  write
... 6 user1  close
... 7 user2  open
... 8 user2  write
... ''')
>>> sql("select condbreak(b,c,c='open',a) from table1 group by b")
bgroupid | C1
----------------
user11   | open
user11   | read
user11   | close
user12   | open
user12   | write
user12   | close
user21   | open
user21   | write
>>> sql("select condbreak(b,c,c='open',a) from (select 4 as a, 6 as b, 9 as c where c!=9)")
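The subgrouping logic can be sketched in plain Python (illustrative only; the helper name `condbreak` and its tuple layout are our assumptions):

```python
def condbreak(groupid, rows):
    # rows: (orderby, value, condition) tuples; ordered by orderby,
    # each True condition opens a new subgroup appended to groupid.
    out, sub = [], 0
    for _, value, cond in sorted(rows):
        if cond:
            sub += 1
        out.append(('%s%d' % (groupid, sub), value))
    return out

rows = [(1, 'open', True), (2, 'read', False), (3, 'close', False),
        (4, 'open', True), (5, 'write', False), (6, 'close', False)]
for bgroupid, value in condbreak('user1', rows):
    print(bgroupid, value)  # user11 open ... user12 close, as in the example
```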

datediffbreak function

datediffbreak(groupid, C1, C2, ...., date, maxdiff[, 'order', orderbycol1, orderbycol2, ...]) → [bgroupid,C1,C2....]

Returns an expanded groupid and the values C1…CN, performing a new grouping whenever subsequent rows' date values differ by more than maxdiff milliseconds. Rows grouped together are those that, ordered by the orderby columns (or, if omitted, taken in the given order), have a date distance of less than maxdiff. Input dates should be in ISO 8601 format.

Returned multiset schema:
 
  • bgroupid
    groupid appended with an integer value indicating the subgroup of the row.
  • C1, C2 ..
    The input values of the row.


>>> table1('''
... 1 session1 '2007-01-01 00:03:13'
... 2 session1 '2007-01-01 00:03:27'
... 3 session1 '2007-01-01 00:03:36'
... 4 session2 '2007-01-01 00:04:39'
... 5 session2 '2007-01-01 00:04:40'
... 6 session3 '2007-01-01 00:04:49'
... 7 session3 '2007-01-01 00:04:59'
... ''')
>>> sql("select datediffbreak(b,a,c,10*1000,'order',c,a) from table1 group by b")
bgroupid  | C1
--------------
session10 | 1
session11 | 2
session11 | 3
session20 | 4
session20 | 5
session30 | 6
session30 | 7
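The gap-based break can be sketched in plain Python (illustrative only; the helper name `datediffbreak` and its tuple layout are our assumptions):

```python
from datetime import datetime

def datediffbreak(groupid, rows, maxdiff_ms):
    # rows: (value, iso_date) tuples, already ordered; a gap of more than
    # maxdiff_ms milliseconds opens a new subgroup (numbered from 0).
    out, sub, prev = [], 0, None
    for value, ds in rows:
        d = datetime.strptime(ds, '%Y-%m-%d %H:%M:%S')
        if prev is not None and (d - prev).total_seconds() * 1000 > maxdiff_ms:
            sub += 1
        out.append(('%s%d' % (groupid, sub), value))
        prev = d
    return out

rows = [(1, '2007-01-01 00:03:13'), (2, '2007-01-01 00:03:27'),
        (3, '2007-01-01 00:03:36')]
print(datediffbreak('session1', rows, 10 * 1000))
# [('session10', 1), ('session11', 2), ('session11', 3)]
```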

datedifffilter function

datedifffilter(maxdiff, date, C1, C2 ....) → [date,C1,C2....]

Returns only a subset of the provided entries, performing a form of entry clustering based on the entries' date difference. Each cluster is represented by its latest entry. The first argument defines the time difference threshold employed for entry clustering, provided in seconds. The second argument is assumed to contain the date column. Entries are assumed to be provided in ascending order by the date column. Input dates should be in ISO 8601 format. All subsequent columns remain unchanged.

Returned multiset schema:
 
  • date, C1, C2 ..
    The selected input values of the row.


>>> table1('''
... 2010-01-01T01:32:03Z value1
... 2010-01-01T01:32:04Z value2
... 2010-01-01T01:32:06Z value3
... 2010-01-01T01:32:08Z value4
... 2010-01-01T01:32:29Z value5
... 2010-01-01T02:35:03Z value6
... 2010-01-01T02:35:04Z value7
... 2010-01-01T03:55:04Z value8
... ''')
>>> sql("select datedifffilter(20, a,b) from table1")
date                 | C1
-----------------------------
2010-01-01T01:32:08Z | value4
2010-01-01T01:32:29Z | value5
2010-01-01T02:35:04Z | value7
2010-01-01T03:55:04Z | value8
>>> table1('''
... 2010-01-01T01:32:03Z value1
... ''')
>>> sql("select datedifffilter(20, a,b) from table1")
date                 | C1
-----------------------------
2010-01-01T01:32:03Z | value1
>>> table1('''
... '2010-01-01 01:32:03' value1
... '2010-01-01 01:32:04' value2
... '2010-01-01 01:32:06' value3
... '2010-01-01 01:32:08' value4
... '2010-01-01 01:32:29' value5
... '2010-01-01 02:35:03' value6
... '2010-01-01 02:35:04' value7
... '2010-01-01 03:55:04' value8
... ''')
>>> sql("select datedifffilter(30, a,b) from table1")
date                | C1
----------------------------
2010-01-01 01:32:29 | value5
2010-01-01 02:35:04 | value7
2010-01-01 03:55:04 | value8
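The keep-latest-per-cluster behaviour can be sketched in plain Python (illustrative only; the helper name `datedifffilter` and the `fmt` parameter are our assumptions):

```python
from datetime import datetime

def datedifffilter(maxdiff, rows, fmt='%Y-%m-%dT%H:%M:%SZ'):
    # rows: (iso_date, value) in ascending date order; an entry closer than
    # maxdiff seconds to its successor is dropped, so each cluster is
    # represented by its latest entry.
    out, prev = [], None
    for ds, value in rows:
        d = datetime.strptime(ds, fmt)
        if prev is not None and (d - prev[0]).total_seconds() > maxdiff:
            out.append((prev[1], prev[2]))
        prev = (d, ds, value)
    if prev is not None:
        out.append((prev[1], prev[2]))
    return out

rows = [('2010-01-01T01:32:03Z', 'value1'), ('2010-01-01T01:32:04Z', 'value2'),
        ('2010-01-01T01:32:06Z', 'value3'), ('2010-01-01T01:32:08Z', 'value4'),
        ('2010-01-01T01:32:29Z', 'value5'), ('2010-01-01T02:35:03Z', 'value6'),
        ('2010-01-01T02:35:04Z', 'value7'), ('2010-01-01T03:55:04Z', 'value8')]
for ds, value in datedifffilter(20, rows):
    print(ds, value)  # value4, value5, value7, value8 survive, as above
```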

datediffgroup function

datediffgroup(maxdiff, date, C1, C2 ....) → [groupid, date,C1,C2....]

Performs entry clustering based on the date difference between entries. The cluster id assigned to each entry is returned in the first column, followed by the entry's original contents.

The first argument defines the time difference threshold employed for entry clustering, provided in seconds. The second argument is assumed to contain the date column. Entries are assumed to be provided in ascending order by the date column. Input dates should be in ISO 8601 format. All subsequent columns remain unchanged.

Returned multiset schema:
 
  • date, C1, C2 ..
    The selected input values of the row.


>>> table1('''
... 2010-01-01T01:32:03Z value1
... 2010-01-01T01:32:04Z value2
... 2010-01-01T01:32:06Z value3
... 2010-01-01T01:32:08Z value4
... 2010-01-01T01:32:29Z value5
... 2010-01-01T02:35:03Z value6
... 2010-01-01T02:35:04Z value7
... 2010-01-01T03:55:04Z value8
... ''')
>>> sql("select datediffgroup(20,a,b) from table1")
groupid | date                 | C1
---------------------------------------
1       | 2010-01-01T01:32:03Z | value1
1       | 2010-01-01T01:32:04Z | value2
1       | 2010-01-01T01:32:06Z | value3
1       | 2010-01-01T01:32:08Z | value4
2       | 2010-01-01T01:32:29Z | value5
3       | 2010-01-01T02:35:03Z | value6
3       | 2010-01-01T02:35:04Z | value7
4       | 2010-01-01T03:55:04Z | value8
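The cluster-id assignment can be sketched in plain Python (illustrative only; the helper name `datediffgroup` and the `fmt` parameter are our assumptions):

```python
from datetime import datetime

def datediffgroup(maxdiff, rows, fmt='%Y-%m-%dT%H:%M:%SZ'):
    # rows: (iso_date, value) in ascending date order; a gap of more than
    # maxdiff seconds starts a new cluster id, counting from 1.
    out, gid, prev = [], 1, None
    for ds, value in rows:
        d = datetime.strptime(ds, fmt)
        if prev is not None and (d - prev).total_seconds() > maxdiff:
            gid += 1
        out.append((gid, ds, value))
        prev = d
    return out

rows = [('2010-01-01T01:32:03Z', 'value1'), ('2010-01-01T01:32:04Z', 'value2'),
        ('2010-01-01T01:32:06Z', 'value3'), ('2010-01-01T01:32:08Z', 'value4'),
        ('2010-01-01T01:32:29Z', 'value5'), ('2010-01-01T02:35:03Z', 'value6'),
        ('2010-01-01T02:35:04Z', 'value7'), ('2010-01-01T03:55:04Z', 'value8')]
for gid, ds, value in datediffgroup(20, rows):
    print(gid, ds, value)  # cluster ids 1,1,1,1,2,3,3,4, as above
```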

datediffnewsesid function

datediffnewsesid(maxdiff, date, groupid, C1, C2, ....) → [bgroupid, C1, C2, ...]

Returns only the C1, C2, … rows whose groupid should be updated to bgroupid, so that the input groups are effectively broken on gaps larger than maxdiff (given in seconds). Input dates should be in ISO 8601 format.

Returned multiset schema:
 
  • bgroupid
    groupid appended with an integer value indicating the subgroup of the row.
  • C1, C2 ..
    The input values of the row.
>>> table1('''
... 1 session1 '2007-01-01 00:03:13'
... 2 session1 '2007-01-01 00:03:27'
... 3 session1 '2007-01-01 00:03:36'
... 4 session2 '2007-01-01 00:04:39'
... 5 session2 '2007-01-01 00:05:40'
... 6 session3 '2007-01-01 00:04:49'
... 7 session3 '2007-01-01 00:06:59'
... 8 session3 '2007-01-01 00:06:59'
... 9 session4 '2007-01-01 00:04:59'
... ''')
>>> sql("select datediffnewsesid(10, c, b, a) from table1 group by b")
bgroupid  | C1
--------------
session11 | 2
session11 | 3
session21 | 5
session31 | 7
session31 | 8
groupmax function
groupmax(n, col1, col2, col3, ....)

It groups by the first n columns of the input, and returns the maximum value of each of the remaining columns independently.

Returned schema:
 Columns are automatically named as col1, col2 …

Examples:

>>> table1('''
... aa  t1 43
... ac  t2 34
... aa  t3 12
... ab  t4 21
... ac  t5 14
... as  t6 23
... ''')
>>> sql("select groupmax(1,a,b,c) from table1")
c1 | c2 | c3
------------
ac | t5 | 34
aa | t3 | 43
ab | t4 | 21
as | t6 | 23
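The per-column maximum can be sketched in plain Python (illustrative only; the helper name `groupmax` is ours). Note that each remaining column is maximised independently, so a result row may combine values from different input rows (e.g. 'aa' pairs t3 with 43 above):

```python
def groupmax(n, rows):
    # Group by the first n columns; take the maximum of each remaining
    # column independently of the others.
    groups = {}
    for row in rows:
        key, rest = tuple(row[:n]), list(row[n:])
        if key in groups:
            groups[key] = [max(a, b) for a, b in zip(groups[key], rest)]
        else:
            groups[key] = rest
    return [list(k) + v for k, v in groups.items()]

rows = [('aa', 't1', 43), ('ac', 't2', 34), ('aa', 't3', 12),
        ('ab', 't4', 21), ('ac', 't5', 14), ('as', 't6', 23)]
for out in groupmax(1, rows):
    print(out)  # e.g. ['aa', 't3', 43]
```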
groupsum function
groupsum(n, col1, col2, col3, ....)

It groups by the first n columns of the input; of the remaining columns, numeric values are summed and text values are collected into a jset (JSON list).

Returned schema:
 Columns are automatically named as col1, col2 …

Examples:

>>> table1('''
... aa  t1 43
... ac  t2 34
... aa  t3 12
... ab  t4 21
... ac  t5 14
... as  t6 23
... ''')
>>> sql("select groupsum(1,a,b,c) from table1")
c1 | c2          | c3
---------------------
ac | ["t2","t5"] | 48
aa | ["t1","t3"] | 55
ab | t4          | 21
as | t6          | 23
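The sum/collect behaviour can be sketched in plain Python (illustrative only; the helper name `groupsum` is ours, and plain lists stand in for madIS jsets such as ["t1","t3"]):

```python
def groupsum(n, rows):
    # Group by the first n columns; sum numeric columns, and collect a
    # text column's values into a sorted list when a group holds several.
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[:n]), []).append(row[n:])
    out = []
    for key, members in groups.items():
        cols = []
        for vals in zip(*members):
            if all(isinstance(v, (int, float)) for v in vals):
                cols.append(sum(vals))
            elif len(vals) == 1:
                cols.append(vals[0])
            else:
                cols.append(sorted(set(vals)))
        out.append(list(key) + cols)
    return out

rows = [('aa', 't1', 43), ('ac', 't2', 34), ('aa', 't3', 12),
        ('ab', 't4', 21), ('ac', 't5', 14), ('as', 't6', 23)]
for out in groupsum(1, rows):
    print(out)  # e.g. ['aa', ['t1', 't3'], 55]
```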

text functions

concatgroup function
concatgroup(X)

Concatenates strings in a group/dataset X.

Example:

>>> table1('''
... word1   1
... word2   1
... word3   2
... word4   2
... ''')
>>> sql("select concatgroup(a) from table1 group by b")
concatgroup(a)
--------------
word1word2
word3word4
concatlines function
concatlines(X)

Concatenates strings in a group/dataset X, separating them with newlines.

Example:

>>> table1('''
... word1
... word2
... word3
... word4
... ''')
>>> sql("select concatlines(a) from table1")
concatlines(a)
-----------------------
word1
word2
word3
word4
concatterms function
concatterms(text1, text2, ...)

Concatenates strings in a group/dataset X while keeping them disjoint, i.e. joining them with a single space delimiter.

Examples:

>>> table1('''
... word1   1
... word2   1
... word3   2
... word4   2
... ''')
>>> sql("select concatterms(a) from table1 group by b")
concatterms(a)
--------------
word1 word2
word3 word4

util functions

expandgroups function
expandgroups(args) → args

Expands the contents of groups. Useful for debugging group-related queries.

Examples:

Directed graph:

>>> table1('''
... 1   2
... 2   3
... 1   4
... 2   5
... 3   3
... ''')
>>> sql("select expandgroups(a,b) from table1")
C1 | C2
-------
1  | 2
2  | 3
1  | 4
2  | 5
3  | 3
>>> sql("select expandgroups(a,b) as gr from table1")
gr1 | gr2
---------
1   | 2
2   | 3
1   | 4
2   | 5
3   | 3
>>> sql("select a,expandgroups(b) as gr from table1 group by a")
a | gr
------
1 | 2
1 | 4
2 | 3
2 | 5
3 | 3
showgroups function
showgroups(args) → string

Shows the contents of groups. Useful for debugging group-related queries.

Examples:

Directed graph:

>>> table1('''
... 1   2
... 2   3
... 1   4
... 2   5
... 3   3
... ''')
>>> sql("select showgroups(a,b) from table1") 
showgroups(a,b)
--------------------

1        2
2        3
1        4
2        5
3        3
>>> sql("select showgroups(b) as gr from table1 group by a")
gr
----

2
4

3
5

3