Detailed Specifications¶
Functions¶
Functions are the main mechanism with which extensions are implemented. The are three kinds of functions:
- Scalar functions
- Aggregate functions
- Polymorphic tables
YeSQL also provides syntactic extensions, in its supported SQL, that simplify the use of above functions.
Scalar functions
Programmable scalar functions work in a similar way as standard SQL scalar functions such as abs(), lower() and upper(). Their implementation is done in Python, and they are able to use all Python facilities and libraries.
An example of an SQL statement using an external scalar function is:
yesql> select detectlang('Il en est des livres comme du feu de nos foyers');
french
Above statement executes the detectlang function, which tries to detect the language of a snippet of text by analysing its statistical properties. On the above example the snippet is a Voltaire quote, and the correct answer from detectlang is that it is in french.
If we wished to do the same for multiple quotes then we could use the following SQL statement (assuming the existence of a quotes_table which contains the column quote):
yesql> select detectlang(quote) from quotes_table;
french
english
...
See also
Aggregate functions
Programmable aggregate functions work in a similar way as standard SQL aggregate functions such as sum(), min() and max(). Their implementation is done in Python, and as is also the case for scalar functions, all Python facilities and libraries are available.
An example of an SQL statement using an external aggregate function is:
yesql> select "term1+term2" as a UNION select "term2 term3" as a;
term1+term2
term2 term3
yesql> select concatterms(a)
from (select "term1+term2" as a UNION select "term2 term3" as a);
term1+term2 term2 term3
The statement above, executes the concatterms()
function, which concatenates strings of terms together, while keeping the terms disjoint.
Using the aggregate function above, together with a scalar function is also possible. To concatenate together only the input string keywords we could execute the following SQL:
yesql> select concatterms(keywords(a))
from (select "term1+term2" as a UNION select "term2 term3" as a);
term1 term2 term2 term3
Polymorphic tables
Polymorphic tables are actually functions that take parameters and output table like data. They can be used in the SQL syntax wherever a regular table would be placed.
Polymorphic tables are one of the most powerful function classes in YeSQL. They can function in a regular table fashion, where the output data is finite in number. In case YeSQL is integrated with SQLite API, the output data can be in a streaming fashion, where the output stream can be infinite.
This, together with the SQL syntax extensions of YeSQL, creates a very powerful environment with which a variety of standard YeSQL features have been developed, such as the multisets, workflow engine, direct data load from filesystem or network, etc.
Typical examples of polymorphic tables sources are files, SQL query resultsets, or even external programs output.
Using polymorphic tables can be done in a variety of ways. The first is parametric table:
% python yesql.py
yesql> select * from file('./demo/continents.tsv') limit 2;
Asia|AF
Europe|AL
What the example above does, is to read the tab separated “continents.tsv” filesystem file as a table. Using the standard limit SQL directive, the output is limited to only 2 rows.
An example showing the streaming nature of YeSQL's polymorphic tables on top of SQLITE's API is:
yesql> select C1 from pipe('query:date');
Fri Feb 1 12:34:55 EET 2000
""
yesql> select strsplit(C1,'delimiter: ') from pipe('query:date');
Mon|Feb||1|12:34:56|EET|2000
The first of the above examples executes system’s “date” command through the use of the pipe
polymorphic table.
Notice that the pipe
polymorphic table has a default returned schema, having one column named C1.
The second of the above examples does the same as the previous one, however it also splits pipe
’s output,
using space as a delimiter, into multiple columns through strsplit()
scalar function.
At this point, we should skip a little ahead, and mention two peculiarities on the queries above.
The first one is the “query:date” parameter of pipe
.
This in YeSQL is called a named parameter, where the part before the “:” is a parameter’s name and the second part is a parameter’s value.
The second peculiarity is the that the scalar function strsplit()
returns multiple columns.
This is a multiset function, able to return as many columns and rows as it wishes.
To see the column names and column types of second query above, we can execute:
yesql> coltypes select strsplit(C1,'delimiter: ') from pipe('query:date');
C1|text
C2|text
C3|text
C4|text
C5|text
C6|text
C7|text
The “coltypes” in the above query may look like a special function of the terminal but in essence it is also a polymorphic table that takes as input a query and returns a table having as first column the column names of the inside query and as second column the column types of the inside query.
Note
coltypes
is able to be used in front of a query due to YeSQL's language extensions.
See also
Multisets¶
Early in the development of YeSQL, the need to return multiple rows and columns from scalar and aggregate functions, arose. This need was satisfied through the use of expand polymorphic table.
Polymorphic table expand works by receiving in its input the results of a function that returns multiple rows and columns. It expands its input to multiple rows/columns and returns them as a table.
To make multiset usage transparent to the user, YeSQL’ SQL preprocessor, inserts automatically the expand polymorphic table whenever a multiset producing function is met.
An example of a multiset scalar function which produces multiple columns is:
yesql> select strsplit("one,two,three");
one|two|three
And one which produces multiple rows is:
yesql> select strsplitv("one,two,three");
one
two
three
Both of the functions above, break their input on “comma” by default.
See also
Workflows¶
From the beginning YeSQL was designed for building complete data processing systems with it. In YeSQL, workflows are viewed as a series of queries to be executed sequentially.
The main function that implements YeSQL’ workflow engine is exec
polymorphic table:
yesql> select 'select 5';
select 5
yesql> exec select 'select 5';
1
Polymorphic table exec
, takes as input an SQL query, executes it and then operates on the results of the input query, executing them sequentially and returning “1” if all the queries executed successfully, or throwing an error if any queries failed to execute correctly.
This design enables exec
to work in tandem with other polymorphic tables that feed to it the workflows to be executed.
The most frequent combination is:
yesql> exec flow file 'workflow.sql';
What the query above does, is to open and read the “workflow.sql” filesystem file via
the file polymorphic table, then feed its results into the flow polymorphic table
which collects in one line the “broken into multiple lines” queries, and finally
to feed the resulting queries to exec
polymorphic table which will execute them.
The above combination of polymorphic tables presents, one the most powerful aspects of YeSQL. Namely the ability to create complex functionality (in this instance a workflow engine), by quickly combining simple entities.
Note
Another benefit of using a relational database to implement a workflow engine, is that all ACID properties of the database are still present during the workflow processing.
Designing and building a data processing system with YeSQL¶
Concerning the design of a data processing system. The required steps that have to be made when designing a data processing system with YeSQL are:
- Define the problem you are trying to solve
- Break the problem into workflows
- Break the workflows into queries
- Find the functions that the queries will need
- Find the Python libraries that the functions will need
The steps above, are mostly self evident, and should be made in the order that they are presented.
To build the data processing system, the above steps have to be followed, finding/building the required entities (functions, queries), for each step, going from bottom to top (Step 5 to Step 2),
In practice we have found that there is always some overlap between the designing and building stages and between steps. Great care should be given, when designing, on the most difficult and time consuming step, which is step 1. When step 1 is successfully executed, the speed with which the rest of the steps are designed and following that, building them from bottom to top, is extremely fast in our experience.