ArangoDB Combines Several Data Models in One NoSQL Database
Is it something new or a repeat of an old idea?
- By Dan Kusnetzky
A representative of ArangoDB reached out to present what the company thinks is a new idea. I was told that "We are working on the native multi-model idea, meaning that we found a way to efficiently combine three data models (key/value, documents, graphs) into one NoSQL database core and let users access their data with one declarative query language: AQL."
A review of database history quickly shows that many different approaches have been applied to address developers' needs to store and retrieve different types of data. I can recall several that were widely used. (The following list certainly can't be considered comprehensive.)
Various Database Approaches
One approach was called a navigational database. Another was developed by the Conference on Data Systems Languages (CODASYL), and was known both as a "CODASYL database" and as a "Network database." Yet another approach was known as a "Key/Value database," which was used in both the MUMPS and PICK systems. The "Relational database" was next. Now we see NoSQL databases becoming popular. Let's look at each of these in turn.
The earliest approach to constructing a useful data store that would allow applications to find records based upon a number of criteria was a navigational database. It was made up of a number of separate files that contained indices, and a single file that contained the actual data. Each entry in an index file contained a search field and the number of the matching record in the data file. Applications could search one of the indices to find the needed record, then go to the data file and jump directly to that record. The indices were kept sorted so that the data file could be read in order, based on any of the indices. Later, the index files were brought back into the data file. This was called an "Indexed sequential file." Purists would point out that this really can't be considered a true database, even though many early transactional systems used this approach for data storage.
This approach was fairly simple and easy to implement on what would be considered tiny systems today. I used this approach for a hospital information system on a system that only had 16KB of main memory and three 6MB disks.
Unfortunately, it was also easy for the indices and the main data file to get out of synchronization. Once this issue was discovered, it was fairly straightforward to go through the data file and re-create the corrupted index.
One challenge when one or more indices became corrupted was uncovering when the corruption happened, and fixing all updates to the main data file that may have been done erroneously. Another challenge was restoring this type of database after a system crash. If backup procedures didn't capture all the indices, IT staff would be forced to first discover that the index was missing, then run through a procedure to re-create it.
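The index-plus-data-file arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list standing in for the data file, the field names, and the helper functions build_index, lookup and read_in_order are all hypothetical, not a reconstruction of any particular system.

```python
from bisect import bisect_left

# Hypothetical data file: a record's position in the list is its record number.
data_file = [
    {"patient_id": 1003, "name": "Ortiz", "ward": "B"},
    {"patient_id": 1001, "name": "Chen", "ward": "A"},
    {"patient_id": 1002, "name": "Abara", "ward": "C"},
]


def build_index(records, field):
    """Scan the data file and return sorted (field value, record number) pairs."""
    return sorted((rec[field], num) for num, rec in enumerate(records))


def lookup(index, records, value):
    """Binary-search the index, then jump directly to the record in the data file."""
    pos = bisect_left(index, (value,))
    if pos < len(index) and index[pos][0] == value:
        return records[index[pos][1]]
    return None


def read_in_order(index, records):
    """Walk an index to read the data file in that index's sort order."""
    return [records[num] for _, num in index]


# One index per search field. Re-running build_index is also how a
# corrupted index would be re-created from the data file.
name_index = build_index(data_file, "name")
id_index = build_index(data_file, "patient_id")
```

Calling lookup(name_index, data_file, "Chen") finds the record through the name index, while read_in_order(id_index, data_file) returns the records sorted by patient_id without sorting the data file itself.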
In 1959, a consortium -- the Conference on Data Systems Languages (CODASYL) -- was formed to guide both the development of a standard programming language and the development of a standard way to store "lists." The COBOL language was one of this consortium's projects, and a "network database" was the result of its work on a standard way to store and retrieve data.
Without going into minute detail, the committee developed the concept of a Data Description Language (DDL) that was used to define the items that were to be stored and the relationships between and among them. A Data Manipulation Language (DML) was used by developers to give commands to the database, allowing programs to store, retrieve and update data.
This database model made it possible for multiple records to be linked to multiple owner records and vice-versa. Some described this as an "upside-down tree" in which each record was linked to one or more owners and, potentially, other data items in a "mesh." While powerful, this approach was hard to understand and quite complex. I am aware of a situation in which a malicious developer, who was planning to leave the company, created a program that exploited a badly designed mesh. This program would make database queries that could never be satisfied, so that all the processing power of the host system would be consumed following links back and forth and up and down the mesh. Not funny.
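To see why a mesh with links running "back and forth and up and down" can trap a query, consider the following minimal Python sketch. It is not CODASYL DML; the record names and the links dictionary are invented for illustration. The point is that a traversal has to remember which records it has already visited, otherwise a cycle in the mesh can keep it busy forever.

```python
from collections import deque

# Hypothetical mesh: each record lists the records it links to.
# "order-7" and "part-9" point at each other, which forms a cycle.
links = {
    "customer-1": ["order-7"],
    "order-7": ["part-9", "customer-1"],
    "part-9": ["order-7", "supplier-3"],
    "supplier-3": [],
}


def reachable(start, target):
    """Follow links breadth-first; the 'seen' set guarantees termination."""
    seen = {start}
    queue = deque([start])
    while queue:
        record = queue.popleft()
        if record == target:
            return True
        for nxt in links.get(record, []):
            if nxt not in seen:  # without this check, the cycle never ends
                seen.add(nxt)
                queue.append(nxt)
    return False
```

A search for a record that doesn't exist, such as reachable("customer-1", "warehouse-5"), still finishes because every record is visited at most once. Drop the seen check and the same search would bounce between order-7 and part-9 indefinitely.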
The key/value approach to databases was somewhat similar to the navigational database model, but the data was stored in the indexes themselves. This meant that data could be retrieved in sorted order without an explicit sort process. Small files containing the index data would be developed that pointed back to where the rest of the "record" could be retrieved. This database model was very good for applications that stored and retrieved individual data items. It wasn't very good for applications in which the entire data store needed to be traversed to create a report.
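A rough Python sketch of this idea follows. It is a toy, not how MUMPS or PICK actually stored data; the class name SortedKeyValueStore, its methods and the key layout are assumptions made for illustration. The values live alongside keys that are always kept in order, so sorted retrieval needs no separate sort step.

```python
from bisect import bisect_left, insort


class SortedKeyValueStore:
    """Toy store whose keys are always held in sorted order."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key, value):
        if key not in self._values:
            insort(self._keys, key)  # insert while keeping the list sorted
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def items(self):
        """Return every (key, value) pair in key order, with no explicit sort."""
        return [(k, self._values[k]) for k in self._keys]

    def items_with_prefix(self, prefix):
        """Return the contiguous run of pairs whose keys start with prefix."""
        start = bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._values[key]))
        return result


store = SortedKeyValueStore()
store.put("patient:1002:name", "Abara")
store.put("patient:1001:ward", "A")
store.put("patient:1001:name", "Chen")
```

Fetching one item with store.get("patient:1001:name") is a single lookup, which is the case this model handles well. A report over the whole store has to walk every key through items(), which is the case the paragraph above describes as a poor fit.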
The relational database model is extremely popular today, and can be found supporting applications in everything from embedded numerical control applications, to applications in mobile phones and tablets, to PCs, to every type of server.
Proposed by E.F. Codd in 1970, it organizes data into one or more "tables" or "relations." The columns in these tables are known as "attributes." Rows are known as "records." Each record has a unique key value. Database developers break down the data into a normalized form that places related data for one product or customer in different tables. So, a customer database might be made up of an address table, an order table, and a credit card table.
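The customer example can be sketched with Python's built-in sqlite3 module. The table and column names below are assumptions chosen for illustration (the order table is called orders because ORDER is a reserved word in SQL), and the credit card table is left out to keep the sketch short.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE address (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    item        TEXT NOT NULL
);
INSERT INTO customer VALUES (1, 'Acme Corp');
INSERT INTO address VALUES (10, 1, 'Boston');
INSERT INTO orders VALUES (100, 1, 'widget'), (101, 1, 'gasket');
""")

# Related data for one customer lives in separate tables and is
# reassembled at query time by joining on the customer's key.
rows = conn.execute("""
SELECT c.name, a.city, o.item
FROM customer AS c
JOIN address AS a ON a.customer_id = c.id
JOIN orders  AS o ON o.customer_id = c.id
ORDER BY o.id
""").fetchall()
```

Each row in the result combines columns from all three tables, even though no single table holds the complete picture of the customer.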
This approach has proven to be very flexible, but at times it can be cumbersome for simple search and retrieval applications or applications that need to process non-structured data, such as documents, maps, graphs, presentations or even operational log files.
The NoSQL database can be seen as a reaction to the limitations of relational databases. The data is not kept in structured tables, but still can be searched to find relevant records.
Multi-model databases maintain data in a format that's easy to create, search and update, and then offer mechanisms that support relational, non-relational and key/value store access. Object-relational database systems can be seen as early forms of multi-model databases. There are a number of suppliers offering databases in this category, including ArangoDB, Cosmos DB, Couchbase, CrateDB, DataStax, EnterpriseDB, MarkLogic and even Oracle. There are a few others, but I think you get the point.
One prominent proponent was FoundationDB, which was acquired by Apple after several impressive wins. Apple pulled FoundationDB from the market shortly after acquiring it.
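As a rough illustration of the multi-model idea, here is a toy in-memory sketch in Python. It is emphatically not ArangoDB's API or any vendor's implementation; the class ToyMultiModelStore and all of its method names are hypothetical. It only shows one set of stored records being reached three ways: by key, by document field, and by following graph edges.

```python
class ToyMultiModelStore:
    """One set of records, reachable as key/value pairs, documents or a graph."""

    def __init__(self):
        self._docs = {}    # key -> document (a dict of fields)
        self._edges = []   # (from_key, to_key, label) triples

    # Key/value access: store and fetch a whole record by its key.
    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key):
        return self._docs.get(key)

    # Document access: find keys of records whose field matches a value.
    def find(self, field, value):
        return [k for k, d in self._docs.items() if d.get(field) == value]

    # Graph access: connect records and follow labeled edges.
    def link(self, from_key, to_key, label):
        self._edges.append((from_key, to_key, label))

    def neighbors(self, key, label):
        return [t for f, t, lbl in self._edges if f == key and lbl == label]


db = ToyMultiModelStore()
db.put("alice", {"type": "person", "city": "Boston"})
db.put("bob", {"type": "person", "city": "Austin"})
db.put("acme", {"type": "company", "city": "Boston"})
db.link("alice", "acme", "works_at")
db.link("bob", "acme", "works_at")
db.link("alice", "bob", "knows")
```

The same record for alice can be fetched with db.get("alice"), found with db.find("city", "Boston"), or used as a starting point with db.neighbors("alice", "works_at"). A real multi-model engine would also have to make all three paths efficient, which this sketch makes no attempt to do.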
Dan's Take: Intriguing Technology, but Questions Need to be Asked
As I read through ArangoDB's materials, I saw that the ideas being presented by the company make sense. If it were possible for an enterprise to standardize on a single database engine, it would be possible to simplify development and support. If a single database engine could access data using many different access mechanisms, the data could be moved from all the separate databases in current enterprise use without having to change a huge portfolio of applications.
The key questions about this "centralization of data" are whether the multi-model database would be as efficient as, or more efficient than, the database engines in current use, and whether the performance of database-based applications would be the same or better.
ArangoDB says that "You can store your data as key/value pairs, graphs or documents and access any or all of your data using a single declarative query language." Does that mean that all established applications must be changed to use this query language? If so, it's unlikely that enterprises will change what they're doing to move from one database to another. Any savings realized through the use of a single database would be more than consumed by the effort to change everything.
Another concern is how portable this database technology is. Enterprises currently have mainframes, midrange systems running a number of single-vendor operating systems, midrange systems running UNIX and many industry-standard x86-based systems running Windows, Linux and UNIX. It isn't clear if ArangoDB supports all of these platforms. If not, database unification isn't really possible.
I hope to learn more about the company and its products and, perhaps, speak with users of this technology. While I think this is an interesting idea, as usual, the devil is in the details.
Daniel Kusnetzky, a reformed software engineer and product manager, founded Kusnetzky Group LLC in 2006. He's literally written the book on virtualization and often comments on cloud computing, mobility and systems software. He has been a business unit manager at a hardware company and head of corporate marketing and strategy at a software company.