Jekyll2022-01-18T11:03:59+00:00https://www.frootlab.org/feed/corporate.xmlFrootlab | CorporateLearn more about automated collaborative data science at the homepage and corporate blog of the Frootlab Organization and the Vivid Code frameworkOld wine in smart bottles2019-03-25T00:00:00+00:002019-03-25T00:00:00+00:00https://www.frootlab.org/blog/corporate/vivid-store<p><strong>For almost 10 years, the rate of data science publications has been growing
enormously! As a result, it is becoming more and more difficult for scientists
and developers to keep track of suitable current approaches.</strong></p>
<!--more-->
<p>The accelerated publication rate is not only caused by increased demand, but
also by the very nature of this area: The more you already know about your
data domain (wise people call it a “prior belief”), the better your estimates for the
observed sample can be. This simple Bayesian wisdom has great implications for
data science: Publications are not only growing in number, but also
in the number of partially overlapping domains!</p>
<p>So it should be quite clear what the issue is. But what are the current tools
to address it? Corresponding to their individual data domains, the research
papers in data science are distributed among different platforms. A typical
example that I would like to pick is
<a href="https://arxiv.org/" target="_blank">arXiv</a> (pronounced <em>archive</em>). Since the
early days of the web, this ‘Grande Dame’ has essentially served as a plain
repository for PDF pre-prints. And since data science papers usually deal with
algorithms, these PDFs are quite often endowed with pseudo code, which basically
allows their implementation in any programming language of choice.</p>
<p>It is quite obvious that this organically grown <a href="/blog/corporate/19079-three-obstacles-in-data-science.html">paper bottleneck</a> has substantial drawbacks: Due to
the limited space, the provided pseudo code often loses valuable details of
the original algorithm. And even if the original algorithm is provided online,
it can be grueling to properly identify its scope and adapt it to the
underlying prerequisites - only to decide about its suitability!</p>
<h2 id="what-is-vivid-store">What is Vivid Store?</h2>
<p>In a nutshell: The development and exploration process in data science is
currently heavily impaired by the detour that publications take through paper
form. While it has become easier and easier to publish, it has become harder
and harder to get an overview. This is why we want to provide a better solution!</p>
<p><a href="/brea.html">Vivid Store</a> (alias <em>Brea</em>) is a smart algorithm repository server
that enforces unified data interfaces for different algorithm categories. This
allows Vivid Store not only to automatically evaluate and compare the hosted
algorithms with respect to given metrics, but thereupon also to determine which
algorithm of a given category and data domain is the currently best fitting
(<a href="/blog/tags#CBF">CBF</a>) algorithm with respect to some metric. An example of
such a metric would be the average prediction accuracy within a fixed set of
gold standard samples of the respective domain of application (e.g. Latin
handwriting samples, spoken word samples, TCGA gene expression data, etc.).</p>
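<p>To make the idea of a unified interface and a CBF selection concrete, here is a minimal Python sketch. All names (<code>Algorithm</code>, <code>best_fitting</code>, the toy classifiers) are illustrative assumptions, not the actual Vivid Store API:</p>

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical sketch: one algorithm category ("classifier") with a unified
# interface, plus a metric-driven selection of the currently best fitting
# (CBF) algorithm against a fixed gold-standard sample.

@dataclass
class Algorithm:
    name: str
    category: str
    fit_predict: Callable[[Sequence[float]], Sequence[int]]

def accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Average prediction accuracy on a fixed gold-standard sample."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

def best_fitting(algorithms, gold_input, gold_labels, metric=accuracy):
    """Return the algorithm scoring highest under the given metric."""
    return max(algorithms,
               key=lambda a: metric(a.fit_predict(gold_input), gold_labels))

# Two toy classifiers sharing the same enforced interface:
threshold = Algorithm("threshold", "classifier",
                      lambda xs: [int(x > 0.5) for x in xs])
always_one = Algorithm("always-one", "classifier",
                       lambda xs: [1 for _ in xs])

cbf = best_fitting([threshold, always_one], [0.1, 0.9, 0.4], [0, 1, 0])
print(cbf.name)  # -> threshold
```

<p>Because both algorithms expose the same call signature, they can be scored and exchanged without touching the caller - which is exactly what the enforced interfaces are for.</p>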
<p>According to our <a href="/about#us">convictions</a>, Vivid Store is
<a href="https://fsfe.org/freesoftware/basics/summary.html" target="_blank">free software</a>,
based on the <a href="https://www.python.org/" target="_blank">Python</a> programming
language and actively developed as part of our <a href="/vivid">Vivid Code</a> framework.</p>Patrick MichlFor almost 10 years, the rate of data science publications has been growing enormously! As a result, it is becoming more and more difficult for scientists and developers to keep track of suitable current approaches.How to tame the Plug Jumble2019-03-23T00:00:00+00:002019-03-23T00:00:00+00:00https://www.frootlab.org/blog/corporate/vivid-db<p><strong>Online analytical processing and predictive analytics in combination with
machine learning provide a new challenge in data warehousing: the response time
for large transactions of data from different domains.</strong></p>
<!--more-->
<p>Today’s data analysis and enterprise analytics applications increasingly
utilize complex statistical models, like artificial neural networks, which
demand large amounts of raw, unaggregated data. Since the response time is
critical for many of these applications, however, it is becoming quite clear
that the traditional approach of dumping day-to-day data into a huge
repository for data analysis needs to be revised.</p>
<p>So how do you create a high-throughput data transaction structure with low latency?
Of course, the key is decentralization! The simple idea is to directly “plug”
the analysis applications into the source data systems they require. On closer
inspection, however, this idea turns out to be horrible! It not only spawns an
absolutely unmanageable jumble of data interfaces (which first of all have to be
implemented), but also does not provide any flexibility in the underlying
structure. … Nevertheless - at no time have people been deterred from a simple
idea by the argument that it is “a bad idea”! The result is where we find
ourselves today: in a Plug Jumble!</p>
<h2 id="what-is-vivid-db">What is Vivid DB?</h2>
<p><strong>In order to bring a little more order into the chaos, we have decided not to
follow the simplest idea, but the one right after it: a multi-plug!</strong></p>
<p><a href="/projects/deet.html">Vivid DB</a> (alias <em>Deet</em>) is a universal data interface and
SQL database engine that mediates between data source systems (like operational
databases) and data analysis applications. To this end, Vivid DB implements the
two fundamental layers of a data warehouse.</p>
<p>The <strong>integration layer</strong> of Vivid DB is implemented by a modular plugin system,
which allows it to stay light-weight, while flexibly supporting a wide variety
of different data sources. The included data support comprises an SQL plugin,
which utilizes <a href="https://www.sqlalchemy.org" target="_blank">SQLAlchemy</a> to
allow connections to a variety of SQL databases (<a href="https://www.ibm.com/analytics/us/en/db2/" target="_blank">IBM
Db2</a>, <a href="https://www.oracle.com/database/" target="_blank">Oracle
Database</a>, <a href="https://www.sap.com/products/hana.html" target="_blank">SAP
HANA</a>, <a href="https://www.microsoft.com/sql-server" target="_blank">Microsoft
SQL</a>,
<a href="https://www.mysql.com" target="_blank">MySQL</a>,
<a href="https://www.postgresql.org/" target="_blank">PostgreSQL</a>, …). Beyond that,
Vivid DB aims to ship with integrated support for the most common laboratory
measurement devices, flat files and data generators that appear in the wild.</p>
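<p>A modular plugin system of this kind can be sketched in a few lines of Python. The registry, decorator and URL scheme below are purely illustrative assumptions (the real Vivid DB plugin API is not shown here), and the standard-library <code>sqlite3</code> module stands in for the SQLAlchemy-backed SQL plugin:</p>

```python
import sqlite3

# Hypothetical sketch of a modular integration layer: data-source plugins
# register under a scheme name and expose one common read interface.

PLUGINS = {}

def plugin(scheme):
    """Class decorator that registers a plugin under its URL scheme."""
    def register(cls):
        PLUGINS[scheme] = cls
        return cls
    return register

@plugin("sqlite")
class SQLitePlugin:
    def __init__(self, dsn):
        self.conn = sqlite3.connect(dsn)

    def fetch(self, statement):
        """Common interface: execute a statement, return rows as tuples."""
        return self.conn.execute(statement).fetchall()

def connect(url):
    """Dispatch 'scheme://dsn' to the registered plugin."""
    scheme, _, dsn = url.partition("://")
    return PLUGINS[scheme](dsn)

db = connect("sqlite://:memory:")
db.fetch("CREATE TABLE samples (value REAL)")
db.fetch("INSERT INTO samples VALUES (1.5), (2.5)")
print(db.fetch("SELECT value FROM samples"))  # -> [(1.5,), (2.5,)]
```

<p>Adding support for a new source then means registering one more plugin class, while every consumer keeps calling the same <code>fetch</code> interface.</p>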
<p>The <strong>staging layer</strong> of Vivid DB is currently implemented as a native
SQL database engine, featuring a DB-API 2.0 interface with full SQL:2016
support, a vertical data storage manager and real-time encryption. On this
foundation, Vivid DB aims to support sampling in common data analysis
formats: <a href="http://www.numpy.org/" target="_blank">NumPy arrays</a> and
<a href="https://www.r-project.org/" target="_blank">R tables</a>.</p>
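<p>From the consumer side, a DB-API 2.0 interface means the usual connect / cursor / execute / fetch protocol. The hedged sketch below uses <code>sqlite3</code> (itself a DB-API 2.0 driver) as a stand-in for Vivid DB’s native engine, and pivots fetched rows into column vectors - the natural shape for NumPy- or R-style analysis; the helper name and sample table are assumptions for illustration:</p>

```python
import sqlite3

# Any DB-API 2.0 driver exposes cursor.execute() and cursor.description,
# which is all we need to return query results column-wise.

def fetch_columns(cursor, query, params=()):
    """Run a query and return {column_name: list_of_values}."""
    cursor.execute(query, params)
    names = [d[0] for d in cursor.description]
    columns = list(zip(*cursor.fetchall())) or [()] * len(names)
    return {n: list(c) for n, c in zip(names, columns)}

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE expression (gene TEXT, level REAL)")
cur.executemany("INSERT INTO expression VALUES (?, ?)",
                [("TP53", 3.2), ("BRCA1", 1.7)])

cols = fetch_columns(cur, "SELECT gene, level FROM expression")
print(cols["gene"])   # -> ['TP53', 'BRCA1']
print(cols["level"])  # -> [3.2, 1.7]
```

<p>Each column list could be passed directly to <code>numpy.array()</code>, which is why a column-oriented (vertical) storage layout pairs well with these analysis formats.</p>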
<p>According to our <a href="/about#us">convictions</a>, Vivid DB is
<a href="https://fsfe.org/freesoftware/basics/summary.html" target="_blank">free software</a>,
based on the <a href="https://www.python.org/" target="_blank">Python</a> programming
language and actively developed as part of <a href="/vivid">Vivid Code</a>.</p>Patrick MichlOnline analytical processing and predictive analytics in combination with machine learning provide a new challenge in data warehousing: the response time for large transactions of data from different domains.Three obstacles in data science and one vision2019-03-20T00:00:00+00:002019-03-20T00:00:00+00:00https://www.frootlab.org/blog/corporate/three-obstacles-in-data-science<p><strong>For the current development and exploration process in data science, three
obstacles in particular appear as outstanding hurdles when it comes to
realizing projects - and even more so when it comes to venturing collaborations.</strong></p>
<!--more-->
<p>Some years ago - in the early 2010s - when Google’s
<a href="https://www.tensorflow.org/" target="_blank">TensorFlow</a> was still only an idea and Geoffrey
Hinton’s daredevil <a href="https://www.cs.toronto.edu/~hinton/science.pdf" target="_blank">Science
article</a> had still only received a
bunch of citations, the undisputed technical issues in data science were the
absence of computing power and the absence of a common playground. Of course,
during the last decade, NVIDIA and Google respectively stepped into the breach
with <a href="https://developer.nvidia.com/cuda-zone" target="_blank">CUDA</a> and TensorFlow / Keras. So
the question arises: “What are today’s foremost technical obstacles in data
science?”</p>
<h2 id="1-the-plug-jumble">#1: The Plug Jumble</h2>
<p>Data scientists are concerned with the analysis of statistical samples. A large
part of the resources, however, often goes into the integration and mapping of
data sources into appropriate data analysis formats. Moreover, this task
frequently turns out to be an unappreciated and frustrating job that belongs
neither to system administration nor to data science. In particular for
collaborations with different or changing operational data landscapes, the
additional effort can become a permanent and critical factor that impedes the
advance of projects.</p>
<figure>
<a href="/images/fig/vivid-db.svg" title="Vivid DB unifies various data sources into a common data
interface">
<img src="/images/fig/vivid-db.svg" alt="Vivid DB unifies various data sources into a common data
interface" />
</a>
<figcaption>Vivid DB unifies various data sources into a common data
interface</figcaption>
</figure>
<p>We want to solve this issue with <a href="/projects/deet.html">Vivid DB</a>, a universal
data mapper that mediates between data analysis and data sources. On the data
analysis side, Vivid DB supports many de-facto standards like
<a href="http://www.numpy.org/" target="_blank">NumPy arrays</a> and
<a href="https://www.r-project.org/" target="_blank">R tables</a>. On the backend side, Vivid DB aims
to support a large variety of different data sources that appear in the wild.
This comprises a broad selection of SQL databases
(<a href="https://www.ibm.com/analytics/us/en/db2/" target="_blank">IBM Db2</a>,
<a href="https://www.oracle.com/database/" target="_blank">Oracle Database</a>,
<a href="https://www.sap.com/products/hana.html" target="_blank">SAP HANA</a>,
<a href="https://www.microsoft.com/sql-server" target="_blank">Microsoft SQL</a>,
<a href="https://www.mysql.com" target="_blank">MySQL</a>,
<a href="https://www.postgresql.org/" target="_blank">PostgreSQL</a>, …), assorted NoSQL databases,
flat-file databases like CSV and R table exports, as well as assorted laboratory
measurement devices.</p>
<h2 id="2-paper-bottlenecks">#2: Paper Bottlenecks</h2>
<p>The development and exploration process in data science heavily depends on the
ability to adapt current cutting-edge approaches. This ability, however,
is frequently impaired by the detour that publications take through paper form:
Due to the limited space, the provided pseudo code often loses valuable details
of the original algorithm. And even if the original algorithm is provided
online, it can take tremendous effort to properly identify its scope and adapt
it to the underlying prerequisites - only to decide about its suitability!</p>
<figure>
<a href="/images/fig/vivid-store.svg" title="Resolution of abstract code requests by currently best fitting
algorithms, using a self contained cluster of multiple Vivid Stores">
<img src="/images/fig/vivid-store.svg" alt="Resolution of abstract code requests by currently best fitting
algorithms, using a self contained cluster of multiple Vivid Stores" />
</a>
<figcaption>Resolution of abstract code requests by currently best fitting
algorithms, using a self contained cluster of multiple Vivid Stores</figcaption>
</figure>
<p>Our approach to this obstacle is <a href="/projects/brea.html">Vivid Store</a> - a smart
algorithm repository, which enforces unified interfaces for different algorithm
categories. This allows an automatic evaluation and comparison of the hosted
algorithms with respect to different applied data types and evaluation metrics.
On this basis, Vivid Store is able to determine the currently best fitting algorithms
with respect to the algorithm category, the used data type and the evaluation
metric. This information is stored in evaluation tables that can be requested
by clients and other Vivid Stores. This makes it possible to interconnect the
individual stores of different organizations and therefore to provide an
entirely new level of collaboration!</p>
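<p>The evaluation tables and the cross-store lookup can be pictured with a small Python sketch. The table layout, store names and scores are invented for illustration and do not reflect Vivid Store’s actual data model:</p>

```python
# Each store keeps an evaluation table keyed by
# (category, data type, metric); a lookup across interconnected stores
# returns the currently best fitting entry.

local_store = {
    ("classifier", "gene-expression", "accuracy"): [("algo-17", 0.91)],
}
partner_store = {
    ("classifier", "gene-expression", "accuracy"): [("algo-42", 0.95)],
}

def best_entry(stores, category, data_type, metric):
    """Scan all stores' tables; return (store_name, algorithm_id) or None."""
    best, best_score = None, float("-inf")
    for name, table in stores.items():
        for algo_id, score in table.get((category, data_type, metric), []):
            if score > best_score:
                best, best_score = (name, algo_id), score
    return best

stores = {"local": local_store, "partner": partner_store}
print(best_entry(stores, "classifier", "gene-expression", "accuracy"))
# -> ('partner', 'algo-42')
```

<p>Because the lookup only reads evaluation tables, a client never needs to inspect the algorithms themselves to find the best one on offer across organizations.</p>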
<h2 id="3-riding-dead-horses">#3: Riding Dead Horses</h2>
<p>Due to the rapid advances in data science, data analysis applications suffer
from short code lifetimes like in no other domain. This follows from the simple
rule that products only survive as long as they remain competitive. And once the
zenith has been reached, the law of the happy hunting grounds applies:</p>
<blockquote>
<p>When you’re riding a dead horse, the best strategy is to get off!</p>
<p>Wisdom of the Dakota Indians</p>
</blockquote>
<p>So the question arises how data analysis projects can be kept competitive
without permanently binding valuable resources!?</p>
<figure>
<a href="/images/fig/vivid-node.svg" title="Processing of data analysis flows using Vivid Node">
<img src="/images/fig/vivid-node.svg" alt="Processing of data analysis flows using Vivid Node" />
</a>
<figcaption>Processing of data analysis flows using Vivid Node</figcaption>
</figure>
<p>Our answer to this issue is the rapid prototyping system <a href="/projects/rian.html">Vivid
Node</a>, which separates the program flow of data analysis
applications from the algorithms they use. However, Vivid Node is not only a rapid
prototyping system, but indeed takes the idea a large step further by
integrating cloud-based automatic algorithm and model selection. The
fundamental observation behind this approach is that it is almost never
required to use a specific algorithm or model, but only one that does the job -
so why not simply use the best one that is currently available?</p>
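<p>The separation of program flow from concrete algorithms can be sketched as a late-binding catalog. The <code>CATALOG</code>, <code>publish</code> and <code>resolve</code> names are hypothetical stand-ins for Vivid Node’s internals:</p>

```python
# The analysis flow names only an algorithm *category*; the concrete
# implementation is resolved at run time, so swapping in a better
# algorithm never touches the flow itself.

CATALOG = {}  # category -> (score, function), updated as algorithms improve

def publish(category, score, func):
    """Register func if it beats the currently catalogued algorithm."""
    if category not in CATALOG or score > CATALOG[category][0]:
        CATALOG[category] = (score, func)

def resolve(category):
    """Return the currently best catalogued implementation."""
    return CATALOG[category][1]

def analysis_flow(data):
    """The flow references 'minimizer' abstractly, never a concrete one."""
    return resolve("minimizer")(data)

publish("minimizer", 0.7, lambda xs: min(xs))             # old algorithm
print(analysis_flow([4, 1, 9]))                           # -> 1
publish("minimizer", 0.9, lambda xs: min(xs, default=0))  # improved one
print(analysis_flow([4, 1, 9]))                           # -> 1, now via the new code
```

<p>The flow itself never changed between the two calls; only the resolution behind the abstract category did, which is the point of separating flow from algorithm.</p>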
<h2 id="the-vision-vivid-code">The Vision: Vivid Code</h2>
<p>In order to use the currently best fitting algorithms, any Vivid Node instance
communicates with one or many connected Vivid Store instances. The communication
is initiated with an <code class="language-plaintext highlighter-rouge">EVALUATION REQUEST</code> to every connected Vivid Store. This
request comprises (E1) an Algorithm Category, (E2) the used Data Type and (E3)
the applied Evaluation Metric. Thereupon, the connected Vivid Stores respectively
use their evaluation lookup tables to respond to this request with a <code class="language-plaintext highlighter-rouge">CODE
OFFER</code>. This includes the above given information, as well as (O1) an Evaluation
Score and (O2) an Algorithm ID, which identifies the algorithm within the Store.
The collection of all offers received by Vivid Node within a pre-defined time
window is ranked by evaluation score.</p>
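<p>The message shapes described above can be sketched as plain Python dataclasses. The field names are illustrative assumptions, not the actual Vivid Code wire format:</p>

```python
from dataclasses import dataclass

# Hedged sketch of the EVALUATION REQUEST / CODE OFFER exchange.

@dataclass
class EvaluationRequest:
    category: str       # (E1) algorithm category
    data_type: str      # (E2) used data type
    metric: str         # (E3) applied evaluation metric

@dataclass
class CodeOffer:
    request: EvaluationRequest
    score: float        # (O1) evaluation score
    algorithm_id: str   # (O2) identifies the algorithm within the store
    store: str          # domain name of the offering store

def rank_offers(offers):
    """Rank all offers received within the time window by score."""
    return sorted(offers, key=lambda o: o.score, reverse=True)

req = EvaluationRequest("classifier", "gene-expression", "accuracy")
offers = [CodeOffer(req, 0.91, "algo-17", "store.example.org"),
          CodeOffer(req, 0.95, "algo-42", "partner.example.org")]
print([o.algorithm_id for o in rank_offers(offers)])  # -> ['algo-42', 'algo-17']
```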
<p>Thereupon, the highest ranked code offers are looked up in a local algorithm
cache, using the combination of the domain name of the store and the
algorithm ID. If this combination cannot be found, however, Vivid Node
creates a <code class="language-plaintext highlighter-rouge">CODE REQUEST</code> to the respective store, which includes (C1) the
Algorithm ID and (C2) a cryptographic token that identifies the user. Finally,
the transaction is finished when the store responds to the code request with a
<code class="language-plaintext highlighter-rouge">CODE ANSWER</code>. This answer depends on the authorization of the user: If the user
is unknown or not allowed to receive the algorithm, the answer is constituted by
(A1) the Algorithm ID and a respective (A2) Error Notification Flag. If the
user is authorized, however, the error flag is empty and the answer also
comprises (A3) the encoded algorithm.</p>
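<p>The cache lookup and the <code>CODE REQUEST</code> / <code>CODE ANSWER</code> exchange can be illustrated with a local stub. The token, IDs, answer fields and the in-process "store" below are all hypothetical placeholders for what would really be a network call:</p>

```python
# Hypothetical sketch of the node-side cache and the code request flow.

cache = {}  # (store_domain, algorithm_id) -> code

def store_answer(algorithm_id, token):
    """Store-side stub: authorize the token, then return a CODE ANSWER."""
    if token != "valid-token":
        return {"id": algorithm_id, "error": "unauthorized"}  # (A1), (A2)
    return {"id": algorithm_id, "error": None,
            "code": "def fit(x): return x"}                   # (A3)

def fetch_code(store, algorithm_id, token):
    """Serve from the local cache, else issue a CODE REQUEST."""
    key = (store, algorithm_id)
    if key not in cache:
        answer = store_answer(algorithm_id, token)            # (C1), (C2)
        if answer["error"]:
            raise PermissionError(answer["error"])
        cache[key] = answer["code"]
    return cache[key]

print(fetch_code("store.example.org", "algo-42", "valid-token"))
# -> def fit(x): return x
```

<p>A second call with the same key is then served entirely from the cache, so the store is only contacted once per (domain, algorithm ID) combination.</p>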
<p>At this point, the idea of “<em>currently best fitting</em>” should be quite clear. The
above description, however, conceals one essential detail that is necessary
to share code between different organizations: For any instance and any
collaboration partner, the algorithms of a given category are required to
use the same data interface in order to be interchangeable! At this point, Vivid DB,
as a universal data mapper, joins the team. Vivid Nodes as well as Vivid Stores
use Vivid DB to connect to data sources. This allows collaborating organizations
to share (or to offer) algorithms and code without the need to share data!
Together, the three components constitute the <strong>Vivid Code</strong> framework.</p>
<h2 id="chances-and-applications">Chances and Applications</h2>
<p>For enterprises, the incorporation of customer and market information is becoming
ever more important. Consequently, many enterprises extend their analytical tools for
market research and decision support with business intelligence software.
Usually, there are two options for the implementation of such projects: in-house
development and outsourcing to consultants. On closer inspection, however, it
becomes apparent that both approaches share a common weakness of individual
software: the high total cost of ownership (TCO). The Vivid Code framework provides
a third option: minimizing the TCO of data analysis projects through the synergy
effects of automated collaborative data science, while keeping the projects
state-of-the-art.</p>
<figure>
<a href="/images/fig/vivid-code-collaboration.svg" title="Collaboration between different organizations using the Vivid
Code framework">
<img src="/images/fig/vivid-code-collaboration.svg" alt="Collaboration between different organizations using the Vivid
Code framework" />
</a>
<figcaption>Collaboration between different organizations using the Vivid
Code framework</figcaption>
</figure>
<p>As a data scientist, imagine the following situation: The new postgraduate in
your workgroup just released a gradient descent that by far outperforms the one you
wrote some years ago. The bad news, however, is that nearly every single
application in your lab uses your old algorithm. So the basic benefit of the
Vivid Code framework in this situation should be quite clear: All your
applications automatically use the new algorithm. But now, let’s take one step
further and imagine that your workgroup is interconnected with the algorithm
catalogs of many other workgroups… To be quite honest: Personally, this
picture gives me the creeps.</p>Patrick MichlFor the current development and exploration process in data science, three obstacles in particular appear as outstanding hurdles when it comes to realizing projects - and even more so when it comes to venturing collaborations.Welcome at Frootlab2019-03-19T00:00:00+00:002019-03-19T00:00:00+00:00https://www.frootlab.org/blog/corporate/welcome-at-frootlab<p><strong>We are a young team of developers with strong expertise in data science and
networking technologies. Our vision is a democratic and federated AI revolution
that does not lead to exclusion or incapacitation, but promotes public
well-being, social justice and social cohesion.</strong></p>
<!--more-->
<p>For us, communication and cooperation are the key factors and driving forces
behind scientific and industrial progress. In order to realize our vision, we
therefore started to identify <a href="/blog/corporate/19079-three-obstacles-in-data-science.html">the foremost obstacles for collaborations in data
science</a> and started
to develop comprehensive solutions, based on an entirely new <a href="/blog/tags#CAMP">programming
paradigm</a>.</p>
<p>According to our fundamental convictions, we release our products as <a href="https://fsfe.org/freesoftware/basics/summary.html" target="_blank">free
software</a>
to provide universal access to research and education. Since the application of
our projects in industrial environments requires stable development and
compliance with industrial standards, it is currently important to us to
maintain developmental sovereignty, which is why we release our products as
single-vendor open-source projects rather than community-driven ones.</p>Patrick MichlWe are a young team of developers with strong expertise in data science and networking technologies. Our vision is a democratic and federated AI revolution that does not lead to exclusion or incapacitation, but promotes public well-being, social justice and social cohesion.