Memory-resident Journal-Backed Databases

"It's easy to write memory-resident database engines"

Many databases these days can fit comfortably into RAM

Many software companies deal with databases that fit easily into main memory.  For example, a school administration system can fit a large school's data (excluding photos and other blob-like items) into 2 GB of RAM.  However, most software companies have been slow to realise that they're better off ditching traditional RDBMSs and putting everything in a memory-resident database - whether of the SQL or NoSQL variety.

Instead there's a lot of interest in cloud databases, despite the fact that more software applications than ever can fit comfortably onto a single compute node, or even into the main memory of a single compute node.  The database of all road users in the UK could fit into 8 GB.


TimesTen and other memory-resident SQL databases

I've written proprietary database engines for 4 companies (see below).  A pretty basic thing I learnt was:  "if you want it to run fast, put it all in main memory" - if you can.  So I had a good look at TimesTen, an SQL RDBMS designed to operate in main memory (plus logging to disk in case you need to restart, of course).  They have a nice white paper explaining how the internals of a database engine can be greatly simplified and sped up if everything is in main memory - a good deal of the complexity in an RDBMS comes from efficiently managing access to disk.  I thought "this would be good for Edval".  Then I thought:  "Hang on, why would I need TimesTen to store objects in main memory?  If it's in main memory, then I can do that myself using plain old Java objects (or C++ objects, etc).  I need the data to be in the form of programming-language objects at some point anyway.  I'd rather not have to load them into SQL and out of SQL.  I'd rather deal with them directly in Java - what value does the SQL layer give me?"


The Mrjb Concept

Suppose you're programming in Java or C# or Python or C++, and all the data that your application deals with fits into main memory (other than photos and other blobs - see below).  Suppose also that you never have to restart your web server.  Then there'd be nothing you'd need to do - you'd already have an object-oriented database:  to create an object, use "new".  To delete an object, remove references to it.  Use "synchronized" methods where necessary.  Use HashMaps and ArrayLists and trees and whatever other collections you need.
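To make this concrete, here's a minimal sketch in Java (the class and method names are my own, invented for illustration):

import java.util.HashMap;
import java.util.Map;

// A plain in-memory "database": no SQL layer, just objects and collections.
class Appointment {
    final int id;
    String purpose;
    Appointment(int id, String purpose) { this.id = id; this.purpose = purpose; }
}

class AppointmentDb {
    private final Map<Integer, Appointment> byId = new HashMap<>();

    // "Insert" is just new + put.
    synchronized Appointment create(int id, String purpose) {
        Appointment a = new Appointment(id, purpose);
        byId.put(id, a);
        return a;
    }

    // "Delete" is just removing the reference; the garbage collector does the rest.
    synchronized void delete(int id) { byId.remove(id); }

    synchronized Appointment find(int id) { return byId.get(id); }
}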

Ah yes, but what happens when you do need to restart the web server?  You'd lose all your data, right?  So of course we need to log all updates to disk.  This can be done in a single file, a small number of files, or a large number of files, in append-only mode:  some people call this a "log-structured system".  It's well known that you can write to disk extremely fast in a log-structured system, because you're spending most of the time appending to the one file.  In normal operation you never read this file.  You don't need to index the file, you don't need to reprocess anything, and depending on the operating system, you might find that the disk head rarely moves.
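A minimal sketch of such a journal writer in Java (my own illustration, not code from any particular product):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Append-only journal: every committed transaction is one flushed write.
class Journal implements AutoCloseable {
    private final FileOutputStream out;
    private final Writer writer;

    Journal(String path) throws IOException {
        out = new FileOutputStream(path, true);   // true = append mode
        writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }

    // Returns only once the transaction string is physically on disk.
    synchronized void append(String transaction) throws IOException {
        writer.write(transaction);
        writer.write('\n');
        writer.flush();
        out.getFD().sync();   // force the OS to flush to the device
    }

    public void close() throws IOException { writer.close(); }
}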

When you restart the server, you need to read this log file:  start at the beginning and replay all the transactions in sequence.  Unless there's a bug in your code, you'll end up with the same in-memory state as before you restarted the server.
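The restart path can be sketched like this, assuming one transaction string per line and a parser/dispatcher of your own:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Rebuild the in-memory state by replaying the journal in commit order.
// As long as apply() is deterministic, you get back the old state exactly.
class Replayer {
    static void replay(Path logPath, Consumer<String> apply) throws IOException {
        try (Stream<String> lines = Files.lines(logPath)) {
            lines.forEachOrdered(apply);   // parse and apply one transaction per line
        }
    }
}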

Doesn't it take a long time to do this?  Well, yes... in fact this is the main downside.  In the case of Edval, I have one file per school per academic year, all on the one server.  It takes about 15 minutes to restart the server towards the end of the academic year, with 100 schools stored on the one compute node.  If I don't need to do this often, and especially if I can do it at night-time, then it's not a problem;  and better still, if you can swap the IP addresses between the production server and the hot standby server, the users don't experience any downtime.

One technical detail:  when processing an incoming transaction string, I check whether anything has actually changed;  if nothing has changed, the transaction isn't written to the file at all.  Similarly, when constructing the original transaction string, I sometimes optimise it by outputting only the fields that have changed.
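A sketch of that optimisation, assuming each transaction is represented as a map of field names to values (names and format invented for illustration):

import java.util.Map;

class TransactionBuilder {
    // Build a transaction string containing only the fields whose values
    // differ from the current in-memory state; return null if nothing
    // changed, in which case the caller skips logging entirely.
    static String diff(String type, int id,
                       Map<String, String> before, Map<String, String> after) {
        StringBuilder sb = new StringBuilder(type).append(' ').append(id);
        boolean changed = false;
        for (Map.Entry<String, String> e : after.entrySet()) {
            if (!e.getValue().equals(before.get(e.getKey()))) {
                sb.append(' ').append(e.getKey())
                  .append("=\"").append(e.getValue()).append('"');
                changed = true;
            }
        }
        return changed ? sb.toString() : null;
    }
}

So a call like diff("APPT", 106, before, after) might yield just APPT 106 purpose="Board meeting" if only the purpose changed, and null if nothing did.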


The Benefits

  • Every update and every query seems to happen instantaneously
  • You're directly manipulating the objects using your favourite programming language
  • You have a comprehensive audit trail:  everything that happens is recorded in the log file.  You can study the sequence of events, explain to a customer why something he thought was wrong actually happened, or replay the events to reproduce a bug.
  • The transaction strings can be sent via a socket to the hot standby system at the same time as they're written to the log file, if you're implementing high availability
  • No SQL injection attacks

I love having the audit trail.  I could have used XML or JSON for the messages, but instead I wrote my own simple text-based format, designed to sit mid-way between easy-to-parse and easy-for-a-human-to-read, for example:

20100621 15:43:05 127.0.0.1  {
APPT 106 purpose="Board meeting" time1=20100816~11:00 time2=20100816~12:00 attendees=me location="(-33.895212, 151.213589)"
}
20100621 15:43:05 127.0.0.1  {
APPT 107 purpose="Renata's volleyball final" time1=20100821~20:00 time2=20100821~21:00 attendees=me location="(-33.895212, 151.213589)"
}
20100621 16:27:41 127.0.0.1  {
FIELD name="Tim Cooper" password="password1"
}
20100621 16:27:41 127.0.0.1  {
APPT 101 purpose="Discuss iPhones app with Claus" time1=20100811~15:41 time2=20100811~16:41 attendees=me,claus@buzzhives.com location=(-34.040054,151.008132)
}

In development, I often shut down the server and edit the file in Notepad, e.g. to produce test cases.  In production, I occasionally shut down one dataset (for Edval, one dataset means one school's file), edit the file in Notepad, and restart the dataset.  I do this, for example, to roll back the database to a previous state, or to fix up major stuff-ups the user has performed, e.g. deleting all courses and then doing 2 hours' data entry before noticing the mistake.  In several cases I've unwound major user stuff-ups by removing one set of transactions while retaining other transactions that happened subsequent to the stuff-up.


Disadvantages

  • As mentioned above, it can take a while to reload all the data when restarting the server.  Ways to mitigate this are:
    • Have a fast server and use multiple threads
    • Restart the server at quiet times
    • Do development with a small dataset
    • Set up a high availability system with a hot-hot configuration, where one server serves users while the other one is restarting
    • If absolutely necessary (it hasn't been for me), you can have an end-of-year job where the in-memory data-structures are dumped to a clean log file such that only the current state, but not the history of how we got to this state, is retained. 
  • It is difficult to implement transactions and roll-back.  It's possible, but it requires a lot of effort if many objects are affected.  Personally I find transactions overrated:  I haven't needed complex transactions for Edval or BuzzHives.
  • It does require work to translate transactions to strings and to parse them back again.  In an SQL system you need to do the same sort of work to translate an action to a string (an SQL UPDATE command), but reading the data back is easier.  I don't find this requirement onerous - it takes up very little of my time - and in fact it gives me a good place to handle "schema evolution":  converting data from superseded data models to the newer data model (see the sketch below).
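As a sketch of what handling schema evolution in the parser can look like (the field names here are invented for illustration):

import java.util.HashMap;
import java.util.Map;

// Old transaction strings stay in the log forever, so the parser must
// understand every field name that was ever written, mapping superseded
// names onto the current data model.
class SchemaEvolution {
    private static final Map<String, String> RENAMED = new HashMap<>();
    static {
        RENAMED.put("tutor", "teacher");   // hypothetical rename
        RENAMED.put("room_no", "room");    // hypothetical rename
    }

    static String currentFieldName(String nameInLog) {
        return RENAMED.getOrDefault(nameInLog, nameInLog);
    }
}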


How does this technique handle High Availability?

You have 2 nodes, each with the same data in memory.  This is a peer-to-peer configuration, not master-slave, so in principle users can choose either node at random (which can give you load-balancing benefits as well as high availability), although you may set up your firewall to direct users to one particular node.

Each transaction is written locally first, and then copied to the other server via a TCP/IP pipe.  I block the "attemptTransaction()" function until the local transaction is flushed to disk, but I use a separate thread to write to the TCP/IP socket.  The other server processes and writes the transaction the same way as a locally generated transaction.
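A sketch of that write path, reusing the hypothetical Journal class from the earlier sketch (the queueing details are my own invention):

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Commit path: block on the local disk flush, but hand the copy for the
// peer node to a background thread so the caller isn't slowed by the network.
class ReplicatingJournal {
    private final Journal local;
    private final BlockingQueue<String> outbound = new LinkedBlockingQueue<>();

    ReplicatingJournal(Journal local, String peerHost, int peerPort) {
        this.local = local;
        Thread sender = new Thread(() -> sendLoop(peerHost, peerPort));
        sender.setDaemon(true);
        sender.start();
    }

    void attemptTransaction(String tx) throws IOException {
        local.append(tx);    // returns only once flushed to the local disk
        outbound.add(tx);    // replication to the peer happens asynchronously
    }

    private void sendLoop(String host, int port) {
        try (Socket s = new Socket(host, port);
             Writer w = new OutputStreamWriter(s.getOutputStream(),
                                               StandardCharsets.UTF_8)) {
            while (true) {
                w.write(outbound.take());
                w.write('\n');
                w.flush();
            }
        } catch (IOException | InterruptedException e) {
            // real code would reconnect and resynchronise the peer here
        }
    }
}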

Related to high availability - I haven't done the following, but I know someone who has:  if you have a satellite site that becomes disconnected from the internet/the main site, and you're ambitious enough to try to keep the system going in some capacity, then the two transaction streams can be fairly easily merged when the connection is finally restored:  just concatenate/interleave the messages.  In most cases the users' work will be correctly applied.  In a small percentage of cases, transactions will fail due to incompatible changes having been made.  All the transactions I've used for Edval or BuzzHives are idempotent, meaning it doesn't matter if the same one is applied twice in a row.


FAQ: Does the Mrjb concept handle multiple writers?

Yes.  With Tomcat, I'm just using a single JVM, so all I need to worry about is that the writes are "synchronized".  More generally, however, there might be different programs accessing the same database - I might even want a C++ program and a Java program to share the same database.  All this works fine provided you follow these steps each time you write a transaction (a sketch in Java follows the numbered steps):

1. Lock the file
2. Read & process any messages that other programs might have written
3. Process your new transaction.  If it produces an error, e.g. because it references an object which was deleted, unlock the file & abort.
4. Write your new transaction to the file
5. Unlock the file
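
Here's a sketch of one write cycle following those five steps, using Java's standard file-locking API (the helper methods are hypothetical placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;

class MultiWriterJournal {
    static void writeTransaction(String path, String tx) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileLock lock = raf.getChannel().lock()) {              // 1. lock the file
            catchUp(raf);                                            // 2. read others' messages
            if (!applies(tx)) return;                                // 3. abort on error
            raf.seek(raf.length());
            raf.write((tx + "\n").getBytes(StandardCharsets.UTF_8)); // 4. append
        }                                                            // 5. unlock (try-with-resources)
    }

    // Hypothetical: replay any lines appended since our last read position.
    static void catchUp(RandomAccessFile raf) { /* ... */ }

    // Hypothetical: check the transaction still applies to the in-memory state.
    static boolean applies(String tx) { return true; }
}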

With the Edval and BuzzHives implementations, I don't bother locking the file or reading other programs' transactions, because there are no other programs.  With the SmartsGroup implementation, the "alerts database", we do follow these steps - in fact, with High Availability, a failover node is an example of an additional writer.


FAQ: Are you providing a library?

I'm describing a technique here, not a product.  You can write your own database engine easily enough.  That said, the high-availability code does have some complexity... maybe I'll release Mrjb as a library one day... or just ask me for Mrjb.java and MrjbDatasetManager.java.


FAQ: How do you handle Blobs (Binary Large Objects) such as photos?

I don't put photos, text documents, or other blob-like objects into the database described here.  Instead I store them as a single file per blob, somewhere in the file-system.  File-systems already provide a convenient way to store blobs, namely multi-level folders.  The difference between blobs and the typical objects ("rows") you deal with is that blobs don't contain references and don't need to be indexed - or at least not in the same way - so you lose little by keeping them non-memory-resident.  (If you want to index text documents, I wouldn't recommend either RDBMSs or memory-resident programming-object databases.)
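One simple way to lay out those multi-level folders, sketched in Java (the two-level fan-out is just an illustrative choice of mine):

import java.nio.file.Path;

// Blobs live in the ordinary file-system, fanned out over two levels of
// folders so that no single directory grows too large.
class BlobStore {
    static Path pathFor(Path root, String blobId) {
        int h = blobId.hashCode();
        String d1 = String.format("%02x", h & 0xff);          // first-level folder
        String d2 = String.format("%02x", (h >>> 8) & 0xff);  // second-level folder
        return root.resolve(d1).resolve(d2).resolve(blobId);
    }
}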


Other companies using the Mrjb technique

http://martinfowler.com/articles/lmax.html  : A high performance transaction processing engine for the finance industry

http://www.nasdaqomx.com/digitalAssets/71/71219_q10-0979_x-stream_fs_1029.pdf : The X-Stream trading engine uses RDBMSs to load reference data into memory and saves data back to RDBMS tables at the end of the day, but during the day everything happens in-memory and is logged to a "forward transaction log".  Everything works well, except that if they need to restart the trading engine near the end of the day for any reason (typically software-related), it can be a little slow to come back online.  X-Stream got an early lead on its competitors in the 1990s thanks to the high transaction rate it achieved with in-memory order matching, but in nearly every initial sales meeting they needed to convince the customer that they could achieve high availability despite not committing each trade to disk in real time.



4 companies for which I've written proprietary database engines

http://www.edval.com.au  :  software for high schools, for timetabling and roll-call and teacher absences/excursions/room bookings.  Uses Mrjb for web-hosted functionality.
http://www.buzzhives.com  : a top secret social media web startup.  Uses Mrjb.
http://www.smartsgroup.com  : surveillance software for stock exchanges.  Handles regulation for a quarter of all trades worldwide.  At some sites we handle 4 billion messages per day.  Uses several database techniques that all make the most of the available RAM, including my favourite technique:  "data-structure inside memory-mapped file" (maybe the subject of another post in future).
http://www.brainresource.com  : the world's largest database of brain scans and integrated neuroscience measures.  Uses techniques based on memory-mapped files.


Other links

If I needed to use a disk-based SQL RDBMS, it would be this one:  http://www.justonedb.com/

