From the beginning, Zet Universe was envisioned as a knowledge management and analytic platform that could be used to solve different problems: organizing personal information, learning new things in your spare time, researching different topics, analyzing competitors, tracking work projects, and more.
One of the biggest values Zet Universe offers is its ability to work offline and locally. This sets it apart from other well-known analytic platforms, such as Palantir Gotham or Quid, which typically require a constant network connection and (except for Palantir Gotham) a high-speed connection to their cloud computing services.
Zet Universe, in contrast, lets you keep your project data with you, residing on your computer, while you are on the go: riding in a car to your client, or getting to a new destination by train or plane. Zet Universe contains a powerful semantic infrastructure that is capable of extracting and analyzing data from your own projects while working entirely on your own machine.
To make this possible, Zet Universe requires reasonably good hardware: ideally no less than 4GB of RAM, an Intel Core processor, and an SSD. But good hardware is only part of the equation; we also have to tailor Zet Universe's software to its specific tasks.
As you, our Insiders, work with Zet Universe, we learn more and more about the strengths and weaknesses of our current product design, and as you tell us more about your needs, we get a better understanding of how we should improve the product to make it more relevant for you.
This blog post is the story of our Scaling Up Effort, and in this series we will discuss two areas of this effort: the storage layer and the thumbnails cache.
SCALING UP STORAGE LAYER
In 2013, Ari Gesher and Danielle Kramer of Palantir gave a talk at the Strata Conference titled "AtlasDB: ACID Transactions for Your Favorite Key-value Store". In this talk they discussed a bolt-on layer for key-value stores (distributed or otherwise), a system in use at Palantir called AtlasDB, which sits on top of either LevelDB (on local computers) or Cassandra (on distributed systems).
Palantir Gotham's interactive analytic core, which was originally built with a traditional RDBMS as its backing store, was hitting the limits of economical and easy scaling. They needed to move to a distributed backing store for scalability, and they also needed to keep running Palantir Gotham on local machines. To solve this problem, the Palantir development team designed and built a special transactional layer running on top of key-value stores, enabling Palantir Gotham to work with far larger data collections.
Meanwhile, at Zet Universe, we were hitting a similar obstacle. The v2 of Zet Universe (the one we've been developing since mid-2013) was originally built with file-based storage as its backing store. Each time you add a new document or ask Zet Universe to track a folder, special items called entities are created, and each new change is stored in an individual JSON file. This approach was dictated by the goal of making Zet Universe data easily syncable between a user's PCs and those of his or her colleagues using fast and convenient synchronized cloud storage systems like Dropbox, OneDrive, or Box.
After running a series of experiments with early customers, we came to the conclusion that we should employ a custom synchronization solution. In late 2014 we ran another series of experiments in an effort to scale up our storage layer (including RDF-based stores, SQLite-based stores, generic key-value stores, and others), and it became clear to us that most of these solutions were not really relevant to our problem. We decided to keep using the existing one and let the product's real-world usage guide us in building a more efficient backing store.
Thanks to six months of the Insider Preview Program, and to your invaluable feedback, we are now glad to share the good news: starting with the February 2016 build, Zet Universe uses a more efficient backing store.
DESIGNING NEW STORAGE LAYER: UNDERSTANDING DATA AND DEVELOPMENT AT ZET UNIVERSE
We take a research approach to building software. Instead of just using existing solutions, we look at a wide variety of different systems running in production, recognize their patterns, compare them to the patterns of Zet Universe, and ask, "What technical solution would be the most efficient for our user in his or her work?" To answer this question, we, very much like our counterparts at Palantir, rely on a holistic understanding of how low-level data integration, scalable data stores, API layers, and an entire suite of user interface tools, when properly integrated, create an efficient and simple user experience. For us, it is very important to keep the technology out of our users' way, and yet make each piece of it work at its best.
Certainly, when we have components that already exist to serve our needs, we are glad to use them - be that Lucene as a high-quality open source search engine, or WPF as our presentation framework. But when we identify a capability gap, we build new things.
DESIGN GOALS
As noted at the beginning of this blog post, one of our core goals is to make sure you can keep using Zet Universe not only when you have a high-speed internet connection, but also when you are offline or only occasionally connected to the network. There are, however, several other goals we've set for ourselves as extremely important to fulfill:
- It should work in offline mode, keeping the current user's projects fully accessible and processing newly added local data within the system while offline,
- It should be kept small (the current installation is about 11MB),
- It shouldn't require admin rights to install and use,
- It should be able to work on 32-bit machines (and it has to run as a 32-bit application due to the use of specific libraries),
- It should allow working with relatively large data sets (from 10K up to 1M objects within a project).
UNDERSTANDING OUR DATA MODEL
Zet Universe's data model is, in many ways, unique, as it has aspects of both spatial and temporal databases:
- Each object in its data model has its place on the 2-dimensional information space, and each object can have a history of changes.
- At each moment of time, Zet Universe is showing you only a current snapshot of its data.
If you have, say, 10K objects, Zet Universe stores approximately 100K historical change records about those objects, but it needs to load only those 10K objects into memory.
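To make this concrete, here is a minimal sketch of such a model in C#. The type and member names are hypothetical, invented for illustration (the real Zet Universe types are not shown in this post); the point is simply that every entity accumulates change records over time, while the "current snapshot" is just the latest record per entity at a given moment.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a single historical change record: which entity it belongs to,
// when the change happened, where the entity sits in the 2D information space, and its properties.
public sealed class ChangeRecord
{
    public Guid EntityId { get; set; }
    public DateTime Timestamp { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
    public string PropertiesJson { get; set; }
}

public static class Snapshot
{
    // The current snapshot is the newest change record per entity at or before 'asOf':
    // 10K entities end up in memory even if their history holds 100K records on disk.
    public static List<ChangeRecord> AsOf(IEnumerable<ChangeRecord> history, DateTime asOf)
    {
        return history
            .Where(r => r.Timestamp <= asOf)
            .GroupBy(r => r.EntityId)
            .Select(g => g.OrderByDescending(r => r.Timestamp).First())
            .ToList();
    }
}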
ROUND 1: EXPLORING SQL DATABASE SERVERS
Originally (in the Alpha version, 2012-2013), Zet Universe used a complex approach, with data stored in Microsoft SQL Server 2012 and accessed via a locally deployed web service.
This approach had several benefits:
- Both the user's data and metadata are stored within one data store,
- The database runs in a separate process, and only the required chunks of data are sent to the client,
- SQL Server provided us with built-in full-text indexing, semantic similarity search, proven backup solutions, and more.
After the first deployments within the team and the first customer deployments, it became clear to us that this approach wasn't really efficient; the Alpha version deployments also helped us formulate some of the design goals:
- It should be kept small (the current installation is about 11MB),
- It shouldn't require admin rights to install and use,
ROUND 2: JSON FILES
Once we formulated the new design goals, we decided to find the simplest possible data store and, combined with the idea of making metadata and project data synchronizable via consumer- and business-grade cloud storage platforms like Dropbox, OneDrive, and Box, we ended up with a simple approach:
- A user can have multiple projects,
- Each project is technically a folder, and it can have one or more objects (entities) inside it,
- Each entity is represented as a collection of individual storage records, saved as JSON files.
This approach worked quite well for some time (a rough sketch of the loading logic follows the list below):
- Zet Universe loaded the file names on each startup,
- Zet Universe then picked only the files it needed to get a current snapshot of the spatio-temporal database (which those files altogether make up),
- Each new record means a new, separate file on disk.
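Here is a rough, hypothetical sketch of that loading logic, reusing the ChangeRecord type from the sketch above. The real implementation filtered by file name first so it didn't have to open every file; this simplified version deserializes them all, and the folder layout and the use of Json.NET are assumptions for illustration.

using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json;

public static class JsonFileStore
{
    // Round 2 approach: every change record lives in its own JSON file inside the project folder;
    // the current snapshot is the newest record per entity at or before 'asOf'.
    public static ChangeRecord[] LoadSnapshot(string projectFolder, DateTime asOf)
    {
        return Directory.EnumerateFiles(projectFolder, "*.json")
            .Select(path => JsonConvert.DeserializeObject<ChangeRecord>(File.ReadAllText(path)))
            .Where(r => r != null && r.Timestamp <= asOf)
            .GroupBy(r => r.EntityId)
            .Select(g => g.OrderByDescending(r => r.Timestamp).First())
            .ToArray();
    }
}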
However, this solution wasn't scalable, as we wanted to move from hundreds of objects within a single project to tens and hundreds of thousands of objects. We also found that Microsoft's antivirus, Windows Defender, was running checks on our files each time Zet Universe launched, which meant that each application startup could become slower due to these extensive antimalware checks.
We've continued the search.
ROUND 3: EXPLORING EMBEDDED SQL AND NOSQL STORES
In late 2014 we decided to return to SQL-based solutions, employing SQLite as a possible embedded database.
Unfortunately, this approach wasn't really right for us: SQLite runs a query processor and maintains tables, indexes, and other artifacts of relational databases, all of which were irrelevant for our data model.
Between late 2014 and early 2016 we tried other embedded solutions, such as Google's LevelDB, Symas Lightning Memory-Mapped Database, ESENT (Microsoft's ISAM database), Brightstar, and others.
In general, we came to the following conclusions:
- Each embedded solution loads at least part of the database into memory (think Google's LevelDB), which isn't acceptable for us,
- Most of the embedded solutions have a lot of functionality we didn't really need (think full-text search, SQL language support, triple stores, and other things).
In the long run, even the most efficient key-value store would become a bottleneck for us, as it would use precious memory for its data structures and, most likely, provide us with functionality we don't need.
ROUND 4: EXPLORING FILE-BASED SOLUTIONS
In late 2014 we had researched the option of persisting the entire data graph to disk, but decided against it. As mentioned earlier in this post, we don't need to load everything into memory; all we need is to load the current snapshot (where "current" is based on the current date and time, which changes every moment).
We returned to the file-based solutions. This time, we wanted to use one file on disk per project.
One possible way would be to use a ZIP or TAR archive and keep the metadata inside it. However, the internals of the ZIP archive logic made it clear that updating it often would lead to large memory usage, which was undesirable for us.
Another option was to use virtual file systems. Windows has had built-in support for VHD disks since as early as Windows 7; however, to create a new VHD disk or to attach an existing one, the user needs admin rights, which goes against our design goals.
After a technical discussion with our advisors, we got a recommendation to look at an old piece of Windows functionality, "Locking and Unlocking Byte Ranges in Files", which is also used by the old OLE/COM Structured Storage format in its implementation called compound files. This is the same format Microsoft used for old Microsoft Office files.
The internals of this format are simple: essentially, it's a sort of virtual file system (built around a FAT-like allocation table). It has a root, which can contain storages (think directories) and streams (think files). It has an internal hierarchical index, and to read specific parts of the file you specify an internal path to them instead of loading the entire file into memory.
This fits our situation pretty well. We could store our metadata right within the compound file, save historical records into it, and stop worrying that the entire file could grow much larger than the available physical memory (a bottleneck for the rest of the embedded stores).
Indeed, this is the solution we've ended up with for now. Each project's metadata (objects, properties, relationships) is now stored in its own compound file, and as each such file's size is limited only by the file system's limits, this approach is quite scalable for our needs.
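To make the storage/stream idea concrete, here is a minimal sketch using the open-source OpenMcdf library, one managed way to read and write compound files from .NET. The post doesn't say which API Zet Universe actually uses, and the file name, storage name, and stream-naming scheme below are made up for illustration.

using System.Text;
using OpenMcdf; // open-source managed reader/writer for OLE compound files

class CompoundFileSketch
{
    static void Main()
    {
        // Create a per-project compound file with a storage ("directory") that holds
        // one stream ("file") per change record.
        var cf = new CompoundFile();
        CFStorage records = cf.RootStorage.AddStorage("ChangeRecords");
        records.AddStream("entity-42#2016-02-01T10-00-00Z")
               .SetData(Encoding.UTF8.GetBytes("{ \"x\": 10, \"y\": 20 }"));
        cf.Save("MyProject.metadata");
        cf.Close();

        // Later, read a single record back by its internal path, without loading
        // the whole compound file into memory.
        var existing = new CompoundFile("MyProject.metadata");
        byte[] data = existing.RootStorage
                              .GetStorage("ChangeRecords")
                              .GetStream("entity-42#2016-02-01T10-00-00Z")
                              .GetData();
        string json = Encoding.UTF8.GetString(data);
        existing.Close();
    }
}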
WHAT'S NEXT?
Our current approach is rather straightforward. We migrate the old JSON files into the compound file, and then we form a spatio-temporal index during each application start. This made sense when we used the file system directly, as new files could appear while the program wasn't running. Now that synchronization will work within the application's boundaries, we will cache the spatio-temporal index within the compound file as well to speed up the application's startup.
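One possible way to do that, sketched below with hypothetical types and names (the post doesn't describe the actual index format), is to serialize the index and store it as one extra stream in the project's compound file, using the same SetData call shown earlier.

using System;
using System.Collections.Generic;
using System.Text;
using Newtonsoft.Json;

// Hypothetical shape of a cached spatio-temporal index entry: for each entity, where it sits
// in the 2D space and which stream holds its newest change record, so the next startup can
// rebuild the current snapshot without scanning every historical record.
public sealed class SpatioTemporalIndexEntry
{
    public Guid EntityId { get; set; }
    public string LatestRecordStream { get; set; }
    public DateTime LatestTimestamp { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
}

public static class IndexCache
{
    // Serialize the whole index so it can be written as one extra stream in the compound file
    // and read back on the next startup instead of being rebuilt from every record.
    public static byte[] Serialize(IEnumerable<SpatioTemporalIndexEntry> entries)
    {
        return Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(entries));
    }

    public static List<SpatioTemporalIndexEntry> Deserialize(byte[] data)
    {
        return JsonConvert.DeserializeObject<List<SpatioTemporalIndexEntry>>(Encoding.UTF8.GetString(data));
    }
}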
Another area for improvement is to store the entire object graph of the current snapshot, load it at startup, and then use the spatio-temporal index only for new items to update the snapshot.
These improvements won't be available as part of the February 2016 build but will follow with the upcoming builds.
SCALING UP THUMBNAILS CACHE
Historically, most of the data tracked by our users in Zet Universe has been in the form of folders and files. The visual metaphor of an infinite zoomable space made showing meaningful thumbnails of those files a critically important piece of the overall frictionless user experience we wanted to provide our customers with.
With this goal in mind, we employed an aggressive thumbnail caching strategy. However, although this strategy worked when users only needed to work with files and folders, it became a bottleneck in our effort to support much larger data sets.
Starting with the February 2016 build, Zet Universe maintains a very small thumbnail cache, storing only commonly used thumbnails (generic thumbnails for each kind of data), and loads thumbnails for the rest of the items from disk asynchronously.
This helped us support, for example, 10 projects with more than 16K records entirely in memory while spending roughly 200MB of memory for the entire application. Given that part of that memory goes to the visuals that define the user interface itself, this is a pretty good use of memory; previously it was easy to hit 1GB of memory with a far smaller number of projects and records.
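A minimal sketch of this strategy in WPF terms is shown below. The class name and asset paths are hypothetical, not the actual Zet Universe code: generic per-kind thumbnails are cached for the lifetime of the application, while item-specific thumbnails are decoded off the UI thread, frozen, and handed back without being retained.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Media.Imaging;

public class ThumbnailProvider
{
    // Generic thumbnails (one per kind of data) stay cached for the application's lifetime.
    private readonly ConcurrentDictionary<string, BitmapImage> genericCache =
        new ConcurrentDictionary<string, BitmapImage>();

    public BitmapImage GetGeneric(string kind)
    {
        // Assumed asset layout: one placeholder image per kind of data.
        return genericCache.GetOrAdd(kind, k => LoadFrozen(Path.Combine("Assets", k + ".png")));
    }

    // Item-specific thumbnails are decoded on a background thread and not retained here.
    public Task<BitmapImage> LoadAsync(string thumbnailPath)
    {
        return Task.Run(() => LoadFrozen(thumbnailPath));
    }

    private static BitmapImage LoadFrozen(string path)
    {
        var image = new BitmapImage();
        image.BeginInit();
        image.CacheOption = BitmapCacheOption.OnLoad;   // decode now, don't keep the file handle open
        image.UriSource = new Uri(Path.GetFullPath(path));
        image.EndInit();
        image.Freeze();                                 // frozen bitmaps can be handed to the UI thread
        return image;
    }
}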
ZET UNIVERSE TODAY
Zet Universe now has an efficient and scalable local data store based on compound files, where the only limitation is the file system's limits. It also has a new entity thumbnail system that caches only commonly used thumbnails and loads the rest asynchronously.
Stay tuned for part two of the Scaling Up Effort series, where we'll do a deep dive into the next steps of further improving our local data store.