Jekyll2022-01-18T11:03:59+00:00https://www.frootlab.org/feed/corporate.xmlFrootlab | CorporateLearn more about automated collaborative data science at the homepage and corporate blog of the Frootlab Organization and the Vivid Code frameworkOld wine in smart bottles2019-03-25T00:00:00+00:002019-03-25T00:00:00+00:00https://www.frootlab.org/blog/corporate/vivid-store<p><strong>For almost 10 years, the rate of data science publications has been growing
enormously! As a result, it is becoming more and more difficult for scientists
and developers to keep track of suitable current approaches.</strong></p>
<!--more-->
<p>The accelerated publication rate is not only caused by increased demand, but
also by the very nature of this area: The more you already know about your
data domain (wise people call it a “prior belief”), the better your estimates for the
observed sample can be. This simple Bayesian wisdom has great implications for
data science: Publications are not only growing in number, but also
in the number of partially overlapping domains!</p>
<p>So it should be quite clear what the issue is. But what are the current tools
to address it? Corresponding to their individual data domains, the research
papers in data science are distributed among different platforms. A typical
example that I would like to pick is
<a href="https://arxiv.org/" target="_blank">arXiv</a> (pronounced <em>archive</em>). Since the
early days of the web, this ‘Grande Dame’ has essentially served as a plain
repository for PDF pre-prints. And since data science papers usually deal with
algorithms, these PDFs are quite often endowed with pseudo code, which basically
allows their implementation in any programming language of choice.</p>
<p>It is quite obvious that this organically grown <a href="/blog/corporate/19079-three-obstacles-in-data-science.html">paper bottleneck</a> has substantial drawbacks: Due to
the limited space, the provided pseudo code often loses valuable details of
the original algorithm. And even if the original algorithm is provided online,
it can be grueling to properly identify its scope and adapt it to the
underlying prerequisites - only to decide about its suitability!</p>
<h2 id="what-is-vivid-store">What is Vivid Store?</h2>
<p>In a nutshell: The development and exploration process in data science is
currently heavily impaired by the detour that publications take through paper
form. While it has become easier and easier to publish, it has become harder
and harder to get an overview. This is why we want to provide a better solution!</p>
<p><a href="/brea.html">Vivid Store</a> (alias <em>Brea</em>) is a smart algorithm repository server
that enforces unified data interfaces for different algorithm categories. This
allows Vivid Store not only to automatically evaluate and compare the hosted
algorithms with respect to given metrics, but thereupon also to determine which
algorithm of a given category and data domain is the currently best fitting
(<a href="/blog/tags#CBF">CBF</a>) algorithm with respect to some metric. An example of
such a metric would be the average prediction accuracy within a fixed set of
gold standard samples of the respective domain of application (e.g. Latin
handwriting samples, spoken word samples, TCGA gene expression data, etc.).</p>
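<p>To make the idea of a unified interface and a CBF selection concrete, here is a minimal Python sketch. All names (<code>Algorithm</code>, <code>best_fitting</code>, the toy classifiers) are illustrative assumptions, not the actual Vivid Store API:</p>

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical sketch: one algorithm category ("classifier") with a unified
# interface, plus a metric-driven selection of the currently best fitting
# (CBF) algorithm against a fixed gold-standard sample.

@dataclass
class Algorithm:
    name: str
    category: str
    fit_predict: Callable[[Sequence[float]], Sequence[int]]

def accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Average prediction accuracy on a fixed gold-standard sample."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

def best_fitting(algorithms, gold_input, gold_labels, metric=accuracy):
    """Return the algorithm scoring highest under the given metric."""
    return max(algorithms,
               key=lambda a: metric(a.fit_predict(gold_input), gold_labels))

# Two toy classifiers sharing the same enforced interface:
threshold = Algorithm("threshold", "classifier",
                      lambda xs: [int(x > 0.5) for x in xs])
always_one = Algorithm("always-one", "classifier",
                       lambda xs: [1 for _ in xs])

cbf = best_fitting([threshold, always_one], [0.1, 0.9, 0.4], [0, 1, 0])
print(cbf.name)  # -> threshold
```

<p>Because both algorithms expose the same call signature, they can be scored and exchanged without touching the caller - which is exactly what the enforced interfaces are for.</p>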
<p>According to our <a href="/about#us">convictions</a>, Vivid Store is
<a href="https://fsfe.org/freesoftware/basics/summary.html" target="_blank">free software</a>,
based on the <a href="https://www.python.org/" target="_blank">Python</a> programming
language and actively developed as part of our <a href="/vivid">Vivid Code</a> framework.</p>Patrick MichlFor almost 10 years, the rate of data science publications has been growing enormously! As a result, it is becoming more and more difficult for scientists and developers to keep track of suitable current approaches.How to tame the Plug Jumble2019-03-23T00:00:00+00:002019-03-23T00:00:00+00:00https://www.frootlab.org/blog/corporate/vivid-db<p><strong>Online analytical processing and predictive analytics in combination with
machine learning provide a new challenge in data warehousing: the response time
for large transactions of data from different domains.</strong></p>
<!--more-->
<p>Today’s data analysis and enterprise analytics applications increasingly
utilize complex statistical models, like artificial neural networks, which
demand large amounts of raw, unaggregated data. Since the response time is
critical for many of these applications, however, it is becoming quite clear
that the traditional approach of dumping day-to-day data into a huge
repository for data analysis needs to be revised.</p>
<p>So how do you create a high-throughput data transaction structure with low latency?
Of course, the key is decentralization! The simple idea is to directly “plug”
the analysis applications into the source data systems they require. On closer
inspection, however, this idea turns out to be horrible! It not only spawns an
absolutely unmanageable jumble of data interfaces (which first of all have to be
implemented), but also does not provide any flexibility in the underlying
structure. … Nevertheless - at no time have people been deterred from a simple
idea by the argument that it is “a bad idea”! The result is where we find
ourselves today: in a Plug Jumble!</p>
<h2 id="what-is-vivid-db">What is Vivid DB?</h2>
<p><strong>In order to bring a little more order into the chaos, we have decided not to
follow the simplest idea, but the one right after it: a multi-plug!</strong></p>
<p><a href="/projects/deet.html">Vivid DB</a> (alias <em>Deet</em>) is a universal data interface and
SQL database engine that mediates between data source systems (like operational
databases) and data analysis applications. To this end, Vivid DB implements the
two fundamental layers of a data warehouse.</p>
<p>The <strong>integration layer</strong> of Vivid DB is implemented by a modular plugin system,
which allows it to stay light-weight, while flexibly supporting a wide variety
of different data sources. The included data support comprises an SQL plugin,
which utilizes <a href="https://www.sqlalchemy.org" target="_blank">SQLAlchemy</a> to
allow connections to a variety of SQL databases (<a href="https://www.ibm.com/analytics/us/en/db2/" target="_blank">IBM
Db2</a>, <a href="https://www.oracle.com/database/" target="_blank">Oracle
Database</a>, <a href="https://www.sap.com/products/hana.html" target="_blank">SAP
HANA</a>, <a href="https://www.microsoft.com/sql-server" target="_blank">Microsoft
SQL</a>,
<a href="https://www.mysql.com" target="_blank">MySQL</a>,
<a href="https://www.postgresql.org/" target="_blank">PostgreSQL</a>, …). Beyond that,
Vivid DB aims to ship with integrated support for the most common laboratory
measurement devices, flat files and data generators that appear in the wild.</p>
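<p>A modular plugin system of this kind can be sketched in a few lines of Python. The registry, decorator and URL scheme below are purely illustrative assumptions (the real Vivid DB plugin API is not shown here), and the standard-library <code>sqlite3</code> module stands in for the SQLAlchemy-backed SQL plugin:</p>

```python
import sqlite3

# Hypothetical sketch of a modular integration layer: data-source plugins
# register under a scheme name and expose one common read interface.

PLUGINS = {}

def plugin(scheme):
    """Class decorator that registers a plugin under its URL scheme."""
    def register(cls):
        PLUGINS[scheme] = cls
        return cls
    return register

@plugin("sqlite")
class SQLitePlugin:
    def __init__(self, dsn):
        self.conn = sqlite3.connect(dsn)

    def fetch(self, statement):
        """Common interface: execute a statement, return rows as tuples."""
        return self.conn.execute(statement).fetchall()

def connect(url):
    """Dispatch 'scheme://dsn' to the registered plugin."""
    scheme, _, dsn = url.partition("://")
    return PLUGINS[scheme](dsn)

db = connect("sqlite://:memory:")
db.fetch("CREATE TABLE samples (value REAL)")
db.fetch("INSERT INTO samples VALUES (1.5), (2.5)")
print(db.fetch("SELECT value FROM samples"))  # -> [(1.5,), (2.5,)]
```

<p>Adding support for a new source then means registering one more plugin class, while every consumer keeps calling the same <code>fetch</code> interface.</p>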
<p>The <strong>staging layer</strong> of Vivid DB is currently implemented as a native
SQL database engine, featuring a DB-API 2.0 interface with full SQL:2016
support, a vertical data storage manager and real-time encryption. On this
foundation, Vivid DB aims to support sampling in common data analysis
formats: <a href="http://www.numpy.org/" target="_blank">NumPy arrays</a> and
<a href="https://www.r-project.org/" target="_blank">R tables</a>.</p>
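<p>From the consumer side, a DB-API 2.0 interface means the usual connect / cursor / execute / fetch protocol. The hedged sketch below uses <code>sqlite3</code> (itself a DB-API 2.0 driver) as a stand-in for Vivid DB’s native engine, and pivots fetched rows into column vectors - the natural shape for NumPy- or R-style analysis; the helper name and sample table are assumptions for illustration:</p>

```python
import sqlite3

# Any DB-API 2.0 driver exposes cursor.execute() and cursor.description,
# which is all we need to return query results column-wise.

def fetch_columns(cursor, query, params=()):
    """Run a query and return {column_name: list_of_values}."""
    cursor.execute(query, params)
    names = [d[0] for d in cursor.description]
    columns = list(zip(*cursor.fetchall())) or [()] * len(names)
    return {n: list(c) for n, c in zip(names, columns)}

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE expression (gene TEXT, level REAL)")
cur.executemany("INSERT INTO expression VALUES (?, ?)",
                [("TP53", 3.2), ("BRCA1", 1.7)])

cols = fetch_columns(cur, "SELECT gene, level FROM expression")
print(cols["gene"])   # -> ['TP53', 'BRCA1']
print(cols["level"])  # -> [3.2, 1.7]
```

<p>Each column list could be passed directly to <code>numpy.array()</code>, which is why a column-oriented (vertical) storage layout pairs well with these analysis formats.</p>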
<p>According to our <a href="/about#us">convictions</a>, Vivid DB is
<a href="https://fsfe.org/freesoftware/basics/summary.html" target="_blank">free software</a>,
based on the <a href="https://www.python.org/" target="_blank">Python</a> programming
language and actively developed as part of <a href="/vivid">Vivid Code</a>.</p>Patrick MichlOnline analytical processing and predictive analytics in combination with machine learning provide a new challenge in data warehousing: the response time for large transactions of data from different domains.Three obstacles in data science and one vision2019-03-20T00:00:00+00:002019-03-20T00:00:00+00:00https://www.frootlab.org/blog/corporate/three-obstacles-in-data-science<p><strong>For the current development and exploration process in data science, three
obstacles in particular appear as outstanding hurdles when it comes to
realizing projects - and even more so when it comes to venturing collaborations.</strong></p>
<!--more-->
<p>Some years ago - in the early 2010s - when Google’s
<a href="https://www.tensorflow.org/" target="_blank">TensorFlow</a> was still only an idea and Geoffrey
Hinton’s daredevil <a href="https://www.cs.toronto.edu/~hinton/science.pdf" target="_blank">Science
article</a> had still only received a
bunch of citations, the undisputed technical issues in data science were the
absence of computing power and the absence of a common playground. Of course,
during the last decade, NVIDIA and Google respectively stepped into the breach
with <a href="https://developer.nvidia.com/cuda-zone" target="_blank">CUDA</a> and TensorFlow / Keras. So
the question arises: “What are today’s foremost technical obstacles in data
science?”</p>
<h2 id="1-the-plug-jumble">#1: The Plug Jumble</h2>
<p>Data scientists are concerned with the analysis of statistical samples. A large
part of the resources, however, often goes into the integration and mapping of
data sources into appropriate data analysis formats. Moreover, this task
frequently turns out to be an unappreciated and frustrating job that belongs
neither to system administration nor to data science. In particular for
collaborations with different or changing operational data landscapes, the
additional effort can become a permanent and critical factor that impedes the
advance of projects.</p>
<figure>
<a href="/images/fig/vivid-db.svg" title="Vivid DB unifies various data sources into a common data
interface">
<img src="/images/fig/vivid-db.svg" alt="Vivid DB unifies various data sources into a common data
interface" />
</a>
<figcaption>Vivid DB unifies various data sources into a common data
interface</figcaption>
</figure>
<p>We want to solve this issue with <a href="/projects/deet.html">Vivid DB</a>, a universal
data mapper that mediates between data analysis and data sources. On the data
analysis side, Vivid DB supports many de-facto standards like
<a href="http://www.numpy.org/" target="_blank">NumPy arrays</a> and
<a href="https://www.r-project.org/" target="_blank">R tables</a>. On the backend side, Vivid DB aims
to support a large variety of different data sources that appear in the wild.
This comprises a broad selection of SQL databases
(<a href="https://www.ibm.com/analytics/us/en/db2/" target="_blank">IBM Db2</a>,
<a href="https://www.oracle.com/database/" target="_blank">Oracle Database</a>,
<a href="https://www.sap.com/products/hana.html" target="_blank">SAP HANA</a>,
<a href="https://www.microsoft.com/sql-server" target="_blank">Microsoft SQL</a>,
<a href="https://www.mysql.com" target="_blank">MySQL</a>,
<a href="https://www.postgresql.org/" target="_blank">PostgreSQL</a>, …), assorted NoSQL databases,
flat-file databases like CSV and R table exports, as well as assorted laboratory
measurement devices.</p>
<h2 id="2-paper-bottlenecks">#2: Paper Bottlenecks</h2>
<p>The development and exploration process in data science heavily depends on the
ability to adapt current cutting-edge approaches. This ability, however,
is frequently impaired by the detour that publications take through paper form:
Due to the limited space, the provided pseudo code often loses valuable details
of the original algorithm. And even if the original algorithm is provided
online, it can take tremendous effort to properly identify its scope and adapt
it to the underlying prerequisites - only to decide about its suitability!</p>
<figure>
<a href="/images/fig/vivid-store.svg" title="Resolution of abstract code requests by currently best fitting
algorithms, using a self contained cluster of multiple Vivid Stores">
<img src="/images/fig/vivid-store.svg" alt="Resolution of abstract code requests by currently best fitting
algorithms, using a self contained cluster of multiple Vivid Stores" />
</a>
<figcaption>Resolution of abstract code requests by currently best fitting
algorithms, using a self contained cluster of multiple Vivid Stores</figcaption>
</figure>
<p>Our approach to this obstacle is <a href="/projects/brea.html">Vivid Store</a> - a smart
algorithm repository, which enforces unified interfaces for different algorithm
categories. This allows an automatic evaluation and comparison of the hosted
algorithms with respect to different applied data types and evaluation metrics.
On this basis, Vivid Store is able to determine the currently best fitting algorithms
with respect to the algorithm category, the used data type and the evaluation
metric. This information is stored in evaluation tables that can be requested
by clients and other Vivid Stores. This makes it possible to interconnect the
individual stores of different organizations and therefore to provide an
entirely new level of collaboration!</p>
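<p>The evaluation tables and the cross-store lookup can be pictured with a small Python sketch. The table layout, store names and scores are invented for illustration and do not reflect Vivid Store’s actual data model:</p>

```python
# Each store keeps an evaluation table keyed by
# (category, data type, metric); a lookup across interconnected stores
# returns the currently best fitting entry.

local_store = {
    ("classifier", "gene-expression", "accuracy"): [("algo-17", 0.91)],
}
partner_store = {
    ("classifier", "gene-expression", "accuracy"): [("algo-42", 0.95)],
}

def best_entry(stores, category, data_type, metric):
    """Scan all stores' tables; return (store_name, algorithm_id) or None."""
    best, best_score = None, float("-inf")
    for name, table in stores.items():
        for algo_id, score in table.get((category, data_type, metric), []):
            if score > best_score:
                best, best_score = (name, algo_id), score
    return best

stores = {"local": local_store, "partner": partner_store}
print(best_entry(stores, "classifier", "gene-expression", "accuracy"))
# -> ('partner', 'algo-42')
```

<p>Because the lookup only reads evaluation tables, a client never needs to inspect the algorithms themselves to find the best one on offer across organizations.</p>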
<h2 id="3-riding-dead-horses">#3: Riding Dead Horses</h2>
<p>Due to the rapid advances in data science, data analysis applications suffer
from short code lifetimes like in no other domain. This follows from the simple
rule that products only survive as long as they remain competitive. And once the
zenith has been reached, the law of the happy hunting grounds applies:</p>
<blockquote>
<p>When you’re riding a dead horse, the best strategy is to get off!</p>
<p>Wisdom of the Dakota Indians</p>
</blockquote>
<p>So the question arises how data analysis projects can be kept competitive
without permanently binding valuable resources!?</p>
<figure>
<a href="/images/fig/vivid-node.svg" title="Processing of data analysis flows using Vivid Node">
<img src="/images/fig/vivid-node.svg" alt="Processing of data analysis flows using Vivid Node" />
</a>
<figcaption>Processing of data analysis flows using Vivid Node</figcaption>
</figure>
<p>Our answer to this issue is the rapid prototyping system <a href="/projects/rian.html">Vivid
Node</a>, which separates the program flow of data analysis
applications from the algorithms they use. However, Vivid Node is not only a rapid
prototyping system, but indeed takes the idea a large step further by
integrating cloud-based automatic algorithm and model selection. The
fundamental observation behind this approach is that it is almost never
required to use a specific algorithm or model, but only one that does the job -
so why not simply use the best one that is currently available?</p>
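<p>The separation of program flow from concrete algorithms can be sketched as a late-binding catalog. The <code>CATALOG</code>, <code>publish</code> and <code>resolve</code> names are hypothetical stand-ins for Vivid Node’s internals:</p>

```python
# The analysis flow names only an algorithm *category*; the concrete
# implementation is resolved at run time, so swapping in a better
# algorithm never touches the flow itself.

CATALOG = {}  # category -> (score, function), updated as algorithms improve

def publish(category, score, func):
    """Register func if it beats the currently catalogued algorithm."""
    if category not in CATALOG or score > CATALOG[category][0]:
        CATALOG[category] = (score, func)

def resolve(category):
    """Return the currently best catalogued implementation."""
    return CATALOG[category][1]

def analysis_flow(data):
    """The flow references 'minimizer' abstractly, never a concrete one."""
    return resolve("minimizer")(data)

publish("minimizer", 0.7, lambda xs: min(xs))             # old algorithm
print(analysis_flow([4, 1, 9]))                           # -> 1
publish("minimizer", 0.9, lambda xs: min(xs, default=0))  # improved one
print(analysis_flow([4, 1, 9]))                           # -> 1, now via the new code
```

<p>The flow itself never changed between the two calls; only the resolution behind the abstract category did, which is the point of separating flow from algorithm.</p>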
<h2 id="the-vision-vivid-code">The Vision: Vivid Code</h2>
<p>In order to use the currently best fitting algorithms, any Vivid Node instance
communicates with one or many connected Vivid Store instances. The communication
is initiated with an <code class="language-plaintext highlighter-rouge">EVALUATION REQUEST</code> to every connected Vivid Store. This
request comprises (E1) an Algorithm Category, (E2) the used Data Type and (E3)
the applied Evaluation Metric. Thereupon, the connected Vivid Stores respectively
use their evaluation lookup tables to respond to this request with a <code class="language-plaintext highlighter-rouge">CODE
OFFER</code>. This includes the above given information, as well as (O1) an Evaluation
Score and (O2) an Algorithm ID, which identifies the algorithm within the Store.
The collection of all offers received by Vivid Node within a pre-defined time
window is ranked by evaluation score.</p>
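<p>The message shapes described above can be sketched as plain Python dataclasses. The field names are illustrative assumptions, not the actual Vivid Code wire format:</p>

```python
from dataclasses import dataclass

# Hedged sketch of the EVALUATION REQUEST / CODE OFFER exchange.

@dataclass
class EvaluationRequest:
    category: str       # (E1) algorithm category
    data_type: str      # (E2) used data type
    metric: str         # (E3) applied evaluation metric

@dataclass
class CodeOffer:
    request: EvaluationRequest
    score: float        # (O1) evaluation score
    algorithm_id: str   # (O2) identifies the algorithm within the store
    store: str          # domain name of the offering store

def rank_offers(offers):
    """Rank all offers received within the time window by score."""
    return sorted(offers, key=lambda o: o.score, reverse=True)

req = EvaluationRequest("classifier", "gene-expression", "accuracy")
offers = [CodeOffer(req, 0.91, "algo-17", "store.example.org"),
          CodeOffer(req, 0.95, "algo-42", "partner.example.org")]
print([o.algorithm_id for o in rank_offers(offers)])  # -> ['algo-42', 'algo-17']
```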
<p>Thereupon, the highest ranked code offers are looked up in a local algorithm
cache, using the combination of the domain name of the store and the
algorithm ID. If this combination cannot be found, however, Vivid Node
creates a <code class="language-plaintext highlighter-rouge">CODE REQUEST</code> to the respective store, which includes (C1) the
Algorithm ID and (C2) a cryptographic token that identifies the user. Finally,
the transaction is finished when the store responds to the code request with a
<code class="language-plaintext highlighter-rouge">CODE ANSWER</code>. This answer depends on the authorization of the user: If the user
is unknown or not allowed to receive the algorithm, the answer is constituted by
(A1) the Algorithm ID and a respective (A2) Error Notification Flag. If the
user is authorized, however, the error flag is empty and the answer also
comprises (A3) the encoded algorithm.</p>
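<p>The cache lookup and the <code>CODE REQUEST</code> / <code>CODE ANSWER</code> exchange can be illustrated with a local stub. The token, IDs, answer fields and the in-process "store" below are all hypothetical placeholders for what would really be a network call:</p>

```python
# Hypothetical sketch of the node-side cache and the code request flow.

cache = {}  # (store_domain, algorithm_id) -> code

def store_answer(algorithm_id, token):
    """Store-side stub: authorize the token, then return a CODE ANSWER."""
    if token != "valid-token":
        return {"id": algorithm_id, "error": "unauthorized"}  # (A1), (A2)
    return {"id": algorithm_id, "error": None,
            "code": "def fit(x): return x"}                   # (A3)

def fetch_code(store, algorithm_id, token):
    """Serve from the local cache, else issue a CODE REQUEST."""
    key = (store, algorithm_id)
    if key not in cache:
        answer = store_answer(algorithm_id, token)            # (C1), (C2)
        if answer["error"]:
            raise PermissionError(answer["error"])
        cache[key] = answer["code"]
    return cache[key]

print(fetch_code("store.example.org", "algo-42", "valid-token"))
# -> def fit(x): return x
```

<p>A second call with the same key is then served entirely from the cache, so the store is only contacted once per (domain, algorithm ID) combination.</p>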
<p>At this point, the idea of “<em>currently best fitting</em>” should be quite clear. The
above description, however, conceals one essential detail that is necessary
to share code between different organizations: For any instance and any
collaboration partner, the algorithms of a given category are required to
use the same data interface in order to be interchangeable! At this point, Vivid DB,
as a universal data mapper, joins the team. Vivid Nodes as well as Vivid Stores
use Vivid DB to connect to data sources. This allows collaborating organizations
to share (or to offer) algorithms and code without the need to share data!
Together, the three components constitute the <strong>Vivid Code</strong> framework.</p>
<h2 id="chances-and-applications">Chances and Applications</h2>
<p>For enterprises, the incorporation of customer and market information is becoming
ever more important. Consequently, many enterprises extend their analytical tools for
market research and decision support with business intelligence software.
Usually, there are two options for the implementation of such projects: in-house
development and outsourcing to consultants. On closer inspection, however, it
becomes apparent that both approaches share a common weakness of individual
software: the high total cost of ownership (TCO). The Vivid Code framework provides
a third option: minimizing the TCO of data analysis projects through the synergy
effects of automated collaborative data science, while keeping the projects
state-of-the-art.</p>
<figure>
<a href="/images/fig/vivid-code-collaboration.svg" title="Collaboration between different organizations using the Vivid
Code framework">
<img src="/images/fig/vivid-code-collaboration.svg" alt="Collaboration between different organizations using the Vivid
Code framework" />
</a>
<figcaption>Collaboration between different organizations using the Vivid
Code framework</figcaption>
</figure>
<p>As a data scientist, imagine the following situation: The new postgraduate in
your workgroup just released a gradient descent that by far outperforms the one you
wrote some years ago. The bad news, however, is that nearly every single
application in your lab uses your old algorithm. So the basic benefit of the
Vivid Code framework in this situation should be quite clear: All your
applications automatically use the new algorithm. But now, let’s take one step
further and imagine that your workgroup is interconnected with the algorithm
catalogs of many other workgroups… To be quite honest: Personally, this
picture gives me the creeps.</p>Patrick MichlFor the current development and exploration process in data science, three obstacles in particular appear as outstanding hurdles when it comes to realizing projects - and even more so when it comes to venturing collaborations.Welcome at Frootlab2019-03-19T00:00:00+00:002019-03-19T00:00:00+00:00https://www.frootlab.org/blog/corporate/welcome-at-frootlab<p><strong>We are a young team of developers with strong expertise in data science and
networking technologies. Our vision is a democratic and federated AI revolution
that does not lead to exclusion or incapacitation, but promotes public
well-being, social justice and social cohesion.</strong></p>
<!--more-->
<p>For us, communication and cooperation are the key factors and driving forces
behind scientific and industrial progress. In order to realize our vision, we
therefore started to identify <a href="/blog/corporate/19079-three-obstacles-in-data-science.html">the foremost obstacles for collaborations in data
science</a> and started
to develop comprehensive solutions, based on an entirely new <a href="/blog/tags#CAMP">programming
paradigm</a>.</p>
<p>According to our fundamental convictions, we release our products as <a href="https://fsfe.org/freesoftware/basics/summary.html" target="_blank">free
software</a>
to provide universal access to research and education. Since the application of
our projects in industrial environments requires stable development and
compliance with industrial standards, it is currently important to us to
maintain developmental sovereignty, which is why we release our products as
single-vendor open-source projects rather than community-driven ones.</p>Patrick MichlWe are a young team of developers with strong expertise in data science and networking technologies. Our vision is a democratic and federated AI revolution that does not lead to exclusion or incapacitation, but promotes public well-being, social justice and social cohesion.