Run SQL Queries on TileDB Arrays via PrestoDB
Bridging the SQL world and scientific computing
Today we are excited to release the TileDB connector for PrestoDB. PrestoDB is a popular open source distributed SQL engine for analytic queries against large amounts of data. TileDB connector support means that users can run SQL queries directly on TileDB arrays and join data stored in TileDB with external data sources. TileDB is an optimized open source storage engine for multi-dimensional structured data which can be queried directly using a growing set of interfaces and APIs from C/C++ to Python to R and now SQL.

TileDB models traditional relational database data as multi-dimensional arrays, taking advantage of data that exhibits a natural sorted order along one or more dimensions. As an example, the figure below shows sample New York Stock Exchange (NYSE) quote data as both a logical relational table in PrestoDB and as a two dimensional sparse array in TileDB. Stocks quotes have a natural sorted order on datetime. A common operation is to perform aggregation queries of quotes by stock symbol across a limited time window. By modeling quote data as a 2D array, this type of query reduces to an aggregation over a small two dimensional slice of the original data.
Modeling NYSE quote tabular data as a TileDB 2D sparse array
The TileDB connector for PrestoDB supports projection (dimension / attribute subselection) and range predicate pushdown. Projection queries are efficiently handled in TileDB due to its "columnar" format (which groups, compresses and stores the values of each attribute separately). Behind the scenes, SQL predicates are converted into a subarray slicing query over a subset of attributes. TileDB materializes only the results of the projected subarray, hence speeding up selective range queries on large volumes of sorted data. Features supported by the TileDB storage engine, such as parallel compression / decompression, at-rest encryption, and optimized access to cloud object stores such as AWS S3, are fully supported. In addition, any future TileDB feature under development (e.g., multi-cloud support, fine-grained access control, etc.) will be directly inherited by PrestoDB via the TileDB connector.

Distributed SQL capability on the same storage format, natively accessible by high-level technical computing environments such as Python and R, allows users to mix and match technologies that best address the problem at hand. ETL can be run in SQL, persisted, and read into Python to run a machine learning operation, or a complex time series statistic only available in R. Bridging the SQL world and the environments used by data scientists with a common directly accessible data format helps smooth over a long-standing pain point between traditional relational database storage and scientific computing.
Utilize SQL engines and data science tools on data in a common powerful format
Try it out today!

We have included a NYSE data ingestor tool to easily load the NYSE data into TileDB arrays. We also have a full-fledged demo in Jupyter Notebook that showcases SQL queries with PrestoDB on TileDB data using the TileDB connector, combined with direct data access via the TileDB Python API.

Useful Links:

We are hiring!

Join a highly skilled and passionate team that seeks to disrupt big data management and analytics. See our open positions here.
© 2018 TileDB, Inc. All rights reserved.
[email protected]