Saturday, August 23, 2008

A History of SimpleDBM

The SimpleDBM project started as the DM1 project in August 2001. Initially, I started writing the database code in C. A year later, I ported the code to C++. Here is what I wrote at the time about my switch to C++:
I decided to port the product to C++ primarily for following reasons:
  • I wanted to use the automatic construction and destruction facility in C++. I think this is one of the coolest features of C++. As an aside, I like Java's finally solution better than C++ destructors because with finally, I am able to control the destruction of objects better.

  • I was apprehensive about using C++, because I find that C++ is a large and complicated language. In order to keep the DM1 Threads code simple, I made some decisions that are sure to be controversial:

  • I do not use the C++ standard library, including the STL, at all. This is because I do not like the code bloat that results from its use. I also do not like the idea of using the heap for creating simple objects like strings. I think that C programs are generally more efficient than C++ programs, because they are much more circumspect about using the heap. One advantage of not using the C++ standard library is that it makes DM1 Threads more portable.

  • I do not use templates other than in a very small way.

  • I use sparingly or not at all, certain features of C++ language, such as multiple inheritance, operator overloading, new style typecast operators, etc. This is because I feel that none of these features are essential (Java does without them nicely).

  • I think that Java is a better language than C++, but the lack of certain features (efficient array handling, pointers, etc.) makes it unsuitable for a project like DM1. In many ways, I use C++ as a better Java. C# would have been a good choice, but due to its lack of availability on UNIX platforms, I was unable to use it.
Progress over the next few years was slow. This was partly because there were long periods when I wrote nothing, and partly because I had to do research for the project. I am self taught programmer with no formal degree in Computer Science. I have also never worked for a database vendor. Implementing a database has been an ambition for a long time. I had already implemented single user persistence storage system in C++, but this was nowhere near a real database system with transactions.

I was helped tremendously by reading code of other Open Source systems such as Shore and PostgreSQL. Another source of information was the research published by ACM and also by IBM researcher C. Mohan. I became a member of ACM in 2001 just so that I could access all the research.

There weren't and still aren't any books that go into the nitty-gritty of implementing relational databases. The only book that came remotely close to discussing some of the details of a database engine was the book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter. This was a life saver.

By 2005 I had become very frustrated with C++. I found that the IDEs available for C++ were no match for the Java IDEs. Even simple refactoring of code was pretty onerous.

Then IBM announced in 2004 that they were open sourcing Cloudscape. I was very excited because here was a database that was written in Java, and that performed quite well. Another key event was the release of Java 1.5 (Java 5.0). Finally there was a version of Java that had the locking primitives that I needed for my database project. I procrastinated for a while because I did not fancy porting all the code I had written in C++ to Java.

Fortunately (for the project), the company I was working for at the time, got taken over. There was a period of inactivity as we awaited the fate of our IT department. Having not much to do at work meant that I had spare time and was able to use this time to port the code to Java. This was in late 2005, which was the most fruitful six months I spent on the project.

I completed most of the core modules of SimpleDBM in the six months between July and Dec 2005. Since then I have spent more time testing and refactoring, and added the TypeSystem and Database API modules. After I changed my job in 2006, progress has been sporadic. I had a period of year and half when I was so busy at work that I was taking work home every day. I was unable to spend time on SimpleDBM at all.

Early in 2008, I was approached by an ex-database enterpreneur who suggested that I should do some work on Apache Derby. I thought of ditching SimpleDBM to work on Derby. Unfortunately, this relationship did not work out. Now I think that if I am working on my own free time, then it is more fulfilling to work on SimpleDBM, as everything I create is my own. I have considerable freedom, and the ability to work on areas that interest me. I would very much like to work on Apache Derby but cannot afford to do so unless it is a paid job.



Wednesday, July 09, 2008

SimpleDBM release 1.0.8 BETA is out

At long last, I have put together the 1.0.8 BETA release. This contains the experimental TypeSystem and Database API module.

Thursday, July 03, 2008

Update on SimpleDBM's Data Dictionary

I have chosen to implement option 3 as described in a previous post. For each table, a special container is created, which is used to store not only the table structure but also structures of all associated indexes. Special containers are created for indexes as well but these contain pointers to the table structures. (The reason we need the index definition containers is that during recovery an index update may be performed prior to a table update.)

A memory cache is maintained of the entire data dictionary. When a new table is created, its structure is immediately saved into the memory cache. The structure is also persisted using a PostCommitAction - which is a type of redo only log that is executed once the transaction commits.

At system restart, the storage system is scanned for containers that have a specific naming convention and are in a well known location. This way, SimpleDBM is able to initialize the memory cache at startup. Note that this initialization occurs post recovery. The list of open containers is consulted to ensure that only those definitions are loaded that are relevant.

There is a problem that during recovery, any redo/undo operation on a table or index will require access to the table/index definition. Since the data dictionary memory cache is not available at this stage - it is populated after recovery, the solution is to retrieve the definitions on demand.

The table/index creation process uses a standalone transaction. During this time the container IDs are exclusively locked. This ensures that by the time any subsequent transaction performs a data manipulation operation against the table, the data dictionary memory cache has been updated to contain the table definition.

At present tables cannot be dropped, as this has not yet been implemented.

Along with the Data Dictionary implementation, there is also a high level Database API that is available. This hasn't been released yet as it is still being tested. I am hoping that a fully tested version will be available towards the later part of this year.

Wednesday, February 27, 2008

SimpleDBM update

SimpleDBM project work has been slow for the past couple of months. This is because I have decided to develop SimpleDBM as a plug-in storage engine for Apache Derby, as an alternative to the Derby Store implementation. If anyone is interested in tracking this work, please visit the Apache JIRA system and look for DERBY-3351.

There are a few problems to be overcome for this to happen. One of this is to make some changes to Derby so that the Store components can be easily replaced. The good news is that Derby is already very modular, and its original design was intended to allow using the Store implementation on its own. I am working on improving the modularity of Derby, and to reduce dependencies between the layers. This work is likely to take several months or even the whole of this year.

The other main hurdle is that there are some fundamental differences in the design approaches taken by SimpleDBM and Derby.

Principal amongst these is that SimpleDBM treats Row and Index data as transparent blobs. This is an attempt to allow higher level implementations to choose their own row/index formats. In Derby, the row structure and the type system is known to the Storage Engine. The former approach is more flexible, but adds performance overheads. At present, my thinking is that to interoperate with Derby, I may have to adopt the Derby type-system.

Another fundamental design choice is the way modules plug into each other. In SimpleDBM, a constructor based injection pattern is used. The modules themselves do not decide how to obtain references to other modules. Derby allows each module to obtain reference to other modules via an intermediary, the Monitor.

SimpleDBM does not use static data in any way. Even the core of the system that instantiates objects, uses a dynamic lookup mechanism (ObjectRegistry) to map type codes to class types. Derby on the other hand, uses a fair number of static constructs. The Monitor system and the Context Manager are two key examples.

There are other areas of incompatibility. In the end, I do not know if this experiment will succeed in the way I would like it to.