Wednesday, November 29, 2006

Why another database manager?

A friend asked me recently why I spent my time implementing a DBMS, when there are already a number of open source databases. It is a good question because clearly I am not breaking any new ground here. In fact, I find myself incapable of inventing great new technology. Most of what I am implementing is well known stuff. I am a software engineer, rather than a scientist, going by the definitions of engineers and scientists by C.A.R.Hoare. It seems that this project is pure self indulgence, when I could be spending my time more fruitfully, either contributing to projects like Apache Derby, or working on something more relevant.

I guess that if I am honest with myself, I have to admit that there is an element of self indulgence here. But there is also some utility. The knowledge I gain in the process is a benefit to me in my work life. But apart from that, I think that my DBMS implementation is better documented and easier to understand than other opensource implementations. This is for a couple of reasons:
  1. I use well established algorithms, which are well documented in computer science literature. I am also putting more effort into documentation than is typical of many opensource projects.
  2. The system is decomposed into well defined modules which are loosely coupled. I find most other implementations are far more integrated, and therefore difficult to understand. I have traded off performance and efficiency in favour of ease of understanding.
There are still no books that describe how to build a real DBMS, and also show you with real code how DBMS features such as transactions, locking, recovery, btrees, etc. work. The only book that comes close is Transaction Processing: Concepts and Techniques, but the sample code contained in this book is not complete. In some ways, my project provides a sample implementation of many of the techniques described in this book.

Finally, there is the question of pride. When I started this project many years ago in C++, I never thought I could do it. I had no training in this field, and no access to people who do this type of stuff at work. Having come this far, it seems a shame to give up.

Tuesday, November 28, 2006

Premature Optimisation

Here is an example of C.A.R.Hoare's statement that premature optimization is the root of all evils.

In implementing the Lock Manager in SimpleDBM, I was concerned about the impact thread synchronisation would have on the scalability of the Lock Manager. The Lock Manager uses a hash table to speed up lookups of the locks. Many implementations of Lock Managers synchronize the entire hash table while manipulating locks. There is an issue within the Apache Derby project discussing the scalability problems posed by such global synchronisation. The SimpleDBM implementation does not use a global lock. Instead it uses a custom hash table, where every hash bucket contains a ReentrantLock. The hash bucket lock is acquired before any traversal or modification is performed on the bucket's chain. I went even further and put a ReentrantLock into each Lock Header object. A Lock Header is required for each active lock in the system. The Lock Header maintains a queue of lock requests (clients) for the lock. Lock Headers are the objects that get put on the hash chains.

The idea was that the hash bucket lock could be released early by obtaining a lock on the Lock Header, once the Lock Header had been found. This would increase concurrency by reducing the duration for which the hash bucket needs to be kept locked.

In my attempt to further reduce locking, I tried to find ways of reducing the time for which the hash bucket lock was held. One such optimisation involved setting a field in the Lock Header to null rather than manipulating the hash bucket chain when releasing a lock. This code worked fine until the day I tried testing in a multi processor environment (CoreDuo), when I found that one of my test cases started failing. This test case was designed to stress concurrent operations on the same hash bucket. The problem was that in attempting to optimise I had broken the code. The optimisation was flawed because in one section of the code I was modifying the field in the Lock Header while holding the Lock Header lock, while in another section, a search was being performed on the hash bucket while holding the hash bucket lock, and this search was inspecting the field that was being updated.

A lesson from this is that multi-threaded code must be tested on a multi-processor machine.

Anyway, when I hit the bug described above, I decided that it was time to revisit the thread synchronisation code in the Lock Manager. I realised that I was attacking the wrong problem in my tuning efforts, because there were bigger scalability issues that needed fixing compared to the problem I was trying to fix.

The first problem was of memory usage. I was using too many ReentrantLocks; one per hash bucket, plus one for each lock header. In a system with 10,000 concurrent active locks, the total number of ReentrantLocks would be 10,000 plus the number of buckets in the hash chain. If I wanted to make the Lock Manager scalable, I needed to reduce the amount of memory used per lock.

The second problem was that the hash table was not dynamic. Its size was fixed at the time of creating the Lock Manager. This was likely to cause severe performance and scalability problems due to hash collisions. It would be virtually impossible to set the correct hash table size statically, and a poor hash table would cause more hash collisions leading to long hash chains, which would mean greater contention between threads trying to access the same hash bucket.

I decided that locking at the level of Lock Headers was not giving me sufficient bang for money, whereas, if I made the hash table dynamic, and used Java's inbuilt monitors instead of ReentrantLocks to synchronise the buckets, the system's overall efficiency and scalability would be greater than what my current code was achieving. The increase in the number of hash buckets would reduce the number of hash collisions and thereby lock contention amongst threads. The reduction in the number of ReentrantLocks would reduce memory usage and therefore make the Lock Manager more scalable. The refactored code would also be simpler and easier to maintain.

Thursday, November 16, 2006

Lock Manager updates

SimpleDBM's Lock Manager is based upon the algorithms described in Transaction Processing: Concepts and Techniques. Until now, the Lock Manager did not have support for deadlock detection. A simple timeout mechanism was used to abort transactions that were in a deadlock. Yesterday I decided to start implementing a deadlock detector. As I was researching this, I discovered a problem in the way rollback to a savepoint is implemented by the Transaction Manager. When rolling back to savepoint, any locks acquired after the savepoint are released. However, any locks released after the savepoint are not re-acquired. There are times when locks can be released early, for example, when using cursor stability mode, the lock on current row is released when the cursor moves to the next row. When rolling back to a savepoint, we need to restore cursors to their status as at the time of savepoint, and ensure that the cursors reacquire locks on the current row. Since SimpleDBM does not have cursors yet, this is not a problem right now. The difficulty is how to implement this type of interaction between cursors and the Transaction Manager. It seems to me that the Transaction Manager needs to know which cursors are being used by a transaction, and for each cursor, save current row in the savepoint. When the transaction rolls back to a savepoint, the saved information can be used to reposition the cursors to the correct rows and reacquire locks. Since I want to avoid this type of close dependency between the transaction manager and other modules, it will most likely be necessary to implement the Observer pattern.

Speaking of the Observer pattern, I had to recently add a LockListener interface to the Lock Manager module. This was required to allow me to write better test cases. Designing test cases where multiple threads interact in a certain way is complicated. Earlier I used a simple strategy - threads went to sleep for certain intervals, and the pauses between the thread executions were timed just so that the correct thread interaction could be achieved. Unfortunately, this meant that the test cases often had a number of sleeps in them, which slowed everything down. Also, I dislike the approach because it is a hack. In some of the new and revised test cases, I now use the LockListener facility to identify wait points within the locking subsystem. Thus an event is generated when the lock requester is about to start waiting for a lock. This event is trapped by the test cases to synchronize between multiple concurrent threads of activity.

The deadlock detection algorithm I am using is based upon the simple deadlock detector described in the Transaction Processing book alluded to before. However, the code presented in the book requires tight coupling between the Lock Manager and the Transaction Manager, whereas in my implementation, the Lock Manager has no knowledge of the Transaction Manager. Loose coupling makes code more flexible, but there is a cost. Tight coupling can reduce memory consumption quite a bit and improve efficiency, as there is less need for redundant information. For example, in the book, each Transaction contains a pointer to the lock request that is being waited for, and the Lock Manager freely accesses this information. In SimpleDBM, the Lock Manager maintains a separate HashMap where lock waits are recorded, so that the Lock Manager can navigate the lock waits graph without any knowledge of the clients invoking the Lock Manager.

Saturday, November 04, 2006

New source repository for SimpleDBM

SimpleDBM has a new source code repository. I moved the source code from ObjectWeb's repository to Google Code because I found the ObjectWeb repository somewhat slow and unresponsive. The SimpleDBM web pages are still hosted at ObjectWeb.

In the past few months, SimpleDBM's source code repository has move from to ObjectWeb to Google Code, which I hope will be its final resting place. Compared with ObjectWeb and, Google Code has the usual Google stamp of clean uncluttered interface, and relatively faster response. Another thing I like is that as soon as the code is checked in, it appears on the web site.

Thursday, November 02, 2006

Package structure

There has been a revision in the package structure within SimpleDBM.

In the earlier structure, the API and the implementation of a module were designed to be together. Hence, if there was a module named test, the packages would have been org.simpledbm.rss.test for the API, and org.simpledbm.rss.test.impl for the implementation.

The current structure is more favoured towards separating the whole of the API from the implementation. In the new structure, all APIs are under org.simpledbm.rss.api.<module>, and all implementations are under org.simpledbm.rss.impl.<module>. I decided to separate the API from the implementation in this way for two reasons:
  1. It would be easier to generate JavaDoc for the API alone.
  2. The API could be packaged separately in its own jar file without too much effort.

Sunday, October 29, 2006

Unit testing and MVCC in Berkeley DB

After a break of several months, I am about to start working on SimpleDBM again. I am determined not to add any new functionality until existing functionality is thoroughly tested and fully documented.

Currently I am working on improving the unit test cases for the BTree module. This is easily the most complex module in SimpleDBM, and producing comprehensive unit test cases is a significant task in its own right. Anyhow, the effort will be worthwhile as a database system is only useful if it is completely reliable.

It was exciting to learn that Oracle has added support for Multi Version Concurrency in Berkeley DB. I haven't looked at the implementation in great detail but a cursory look seems to indicate that the changes are mainly in the Buffer Pool. To support MVCC, the Buffer Pool has been enhanced to hold multiple versions of pages. Readers can obtain a snapshot of the database using older versions of pages, and thus avoid obtaining locks on rows being read. Unlike Oracle's own version which reconstructs older versions of pages using the undo log, the Berkeley DB implementation appears to be a simple memory based versioning solution. The downside will be that this will not scale to large workloads as it will require huge amounts of memory if the number of pages being versioned increases significantly.

Sunday, June 11, 2006

SimpleDBM moving to ObjectWeb

SimpleDBM is moving to ObjectWeb. The reasons for this move are:
  1. ObjectWeb specializes in building middleware solutions. I hope to build a community of users and developers who are like-minded and interested in middleware technologies.
  2. is run by and has too much focus on Sun Microsystems. In some ways it is a propaganda tool for Sun. I wanted a more open environment for SimpleDBM, where every project has the same status.

A related news is that I am migrating the version control from CVS to Subversion.

Tuesday, April 04, 2006


Due to changes in my work life, I have not been able to devote much time to SimpleDBM in recent weeks. Things are getting back to normal slowly and I am beginning to work on some of the outstanding tasks. The main focus is to provide a usable RSS component, which is feature complete, but needs more test cases, and more testing.

Although test coverage is a useful measure of the effectiveness of test cases, it does not help you much with issues related to multi-threading. Sometimes, obscure bugs are discovered by chance - for example, I discovered a bug in my BTree test harness when I disabled some diagnostic logging. The small time difference caused by this change led to a different interaction between threads, and exposed the bug.

I want to devote more time to testing the RSS components because they provide core services that are fundamental to the rest of the system. Higher level components need to be able to rely completely on the correct functioning of RSS.

Monday, February 20, 2006

Tuple Manager implementation is feature complete

Recently completed the implementation of Tuple Scans. Unlike Index Scans which use next key locking to avoid phantom reads, the Tuple Scan implementation does not protect the gap between one tuple and the next. This means that Tuple Scans cannot provide strictly Serializable behaviour. I wonder if this will be an issue later on.

Saturday, January 28, 2006

On licensing

SimpleDBM is licensed under GPL V2 or later. I decided to use GPL because I believe in the values that the GNU movement stands for. It is a pity that so much FUD is generated regarding the GPL, and more pity that there is such a proliferation of OpenSource licences. If GPL was Business Unfriendly, then Linux would never have been successful.

When GPL V3 comes out finally, I will adopt it for SimpleDBM.

On a side note, I finally managed to get around to implementing a few things that were long on my TODO list:
  1. The Log Manager now automatically deletes older archive log files.
  2. There is a background thread for generating Checkpoints.
  3. Rollback to a Savepoint will discard PostCommitActions that were scheduled after the Savepoint was created. This means that if you create a Savepoint, drop a container, and then rollback to the Savepoint, the drop action will be discarded.

Monday, January 23, 2006

Priorities for January

This month I am working on finishing the Developer's Guide, and also plan to update the Javadoc documentation. While updating the documentation I realized that I should not have used the term BTree when defining the interface for the Index Module. A BTree is an implementation strategy for Indexes, therefore, it is better to use a more generic term when specifying the interface. I am refactoring the code to correct this.

My priorities for January and February are to:
  1. Complete the documentation.
  2. Augment JUnit test cases.
  3. Tie up loose ends and produce a usable RSS component.

If I complete all this by February, from March onwards I shall start working on building the next layer of the system, i.e., type system, system catalogs, tables and indexes with multiple attributes.

Saturday, January 14, 2006

Multiversion Concurrency

Some time ago, I promised to write about some of the techniques used in Oracle. A very early draft of a paper on Multi-Version concurrency is now available. It discussed MVCC implementations in PostgreSQL and Oracle.

SimpleDBM does not implement MVCC on purpose, as I wanted to understand traditional implementations before attempting to implement MVCC. Perhaps one day, a different version of SimpleDBM will implement MVCC.

I am very keen on ensuring that this paper is accurate in its description of PostgreSQL and Oracle implementations. I would also like to add descriptions of other DBMSes like Firebird and MySQL/InnoDb.

Friday, January 13, 2006

Documentation moving to LaTeX

I am ashamed to say that I just discovered LaTeX. Of course, I knew about TeX but never thought I would use it ... well, I have just converted the SimpleDBM Reference Manual to LaTeX, and I love the results. The output is so much better, and the document looks professional. Here's the link to the PDF output.

Friday, January 06, 2006

More on Exceptions

In a previous post, I blogged about why I favour Checked Exceptions over Unchecked ones. Today, I'll talk about how I am circumventing some of the issues with Checked Exceptions.

The big advantage with Checked Exceptions is that the method signature tells you what Exceptions are likely to be thrown. Great as this is, it is also a liability, because any change in the Exception specification can break client code.

SimpleDBM comprises of several modules. In SimpleDBM, the module is the unit of reusability. Each module has an API which is represented by a set of Interfaces, and one or more implementation. An important objective is to make each Module's API stable so that as the code evolves, other modules are not impacted by changes in the API. This is where Exception specification becomes important.

Let us assume there are two modules, A and B, and also assume that B depends upon A. Now, suppose that some methods in A's API throw an exception called ExceptionA and some of B's methods throw ExceptionB. Since B's methods call A's methods, when defining B's methods, we have following options:

  1. Allow B's methods to throw ExceptionA.
  2. Catch ExceptionA in B's methods and wrap them in ExceptionB.

The problem with first approach is that it makes B's API unstable. What if A's methods start throwing a new Exception type?

The problem with the second approach is that ExceptionA has been wrapped and cannot be caught by a client. A client that invokes B may want to handle ExceptionA in a specific manner. In the first option, the client could write following code, but with the second option, this is not possible:

try {
// call B's API
catch (ExceptionA e) {
// Catch and handle ExceptionA
What we want is to preserve the information that A threw ExceptionA, but still avoid having ExceptionA in B's method signatures.

The solution is to wrap ExceptionA with a specific sub-class of ExceptionB, rather than plan ExceptionB. Let us call this sub-class ExceptionBExpetionA. The trick is that methods in B should only be specified to throw ExceptionB. This is okay because ExceptionBExceptionA is a sub-class of ExceptionB. However, now clients can catch ExceptionBExceptionA and handle this particular exception, while ignoring other instances of ExceptionB.
try {
// call B's API
catch (ExceptionBExceptionA e) {
// Catch and handle ExceptionBExceptionA
Not all exceptions thrown by A need be wrapped in this manner - only those that are specifically useful to the client.

In SimpleDBM, each module defines its own Exception class. Methods of the module can only throw instances of the module Exception. However, where necessary, sub-classes of the Exception are created that represent more specific information, sometimes wrapping Exceptions thrown by other modules.