Saturday, January 28, 2006

On licensing

SimpleDBM is licensed under GPL V2 or later. I decided to use the GPL because I believe in the values that the GNU movement stands for. It is a pity that so much FUD is generated regarding the GPL, and an even greater pity that there is such a proliferation of Open Source licences. If the GPL were business unfriendly, Linux would never have been successful.

When GPL V3 finally comes out, I will adopt it for SimpleDBM.

On a side note, I finally managed to get around to implementing a few things that were long on my TODO list:
  1. The Log Manager now automatically deletes older archive log files.
  2. There is a background thread for generating Checkpoints.
  3. Rollback to a Savepoint will discard PostCommitActions that were scheduled after the Savepoint was created. This means that if you create a Savepoint, drop a container, and then rollback to the Savepoint, the drop action will be discarded.

Monday, January 23, 2006

Priorities for January

This month I am working on finishing the Developer's Guide, and also plan to update the Javadoc documentation. While updating the documentation I realized that I should not have used the term BTree when defining the interface for the Index Module. A BTree is an implementation strategy for Indexes, therefore, it is better to use a more generic term when specifying the interface. I am refactoring the code to correct this.

My priorities for January and February are to:
  1. Complete the documentation.
  2. Augment JUnit test cases.
  3. Tie up loose ends and produce a usable RSS component.

If I complete all this by February, from March onwards I shall start working on building the next layer of the system, i.e., type system, system catalogs, tables and indexes with multiple attributes.

Saturday, January 14, 2006

Multiversion Concurrency

Some time ago, I promised to write about some of the techniques used in Oracle. A very early draft of a paper on Multi-Version Concurrency is now available. It discusses MVCC implementations in PostgreSQL and Oracle.

SimpleDBM does not implement MVCC on purpose, as I wanted to understand traditional implementations before attempting to implement MVCC. Perhaps one day, a different version of SimpleDBM will implement MVCC.

I am very keen on ensuring that this paper is accurate in its description of the PostgreSQL and Oracle implementations. I would also like to add descriptions of other DBMSes, such as Firebird and MySQL/InnoDB.

Friday, January 13, 2006

Documentation moving to LaTeX

I am ashamed to say that I just discovered LaTeX. Of course, I knew about TeX but never thought I would use it ... well, I have just converted the SimpleDBM Reference Manual to LaTeX, and I love the results. The output is so much better, and the document looks professional. Here's the link to the PDF output.

Friday, January 06, 2006

More on Exceptions

In a previous post, I blogged about why I favour Checked Exceptions over Unchecked ones. Today, I'll talk about how I am circumventing some of the issues with Checked Exceptions.

The big advantage with Checked Exceptions is that the method signature tells you what Exceptions are likely to be thrown. Great as this is, it is also a liability, because any change in the Exception specification can break client code.

SimpleDBM comprises several modules. In SimpleDBM, the module is the unit of reusability. Each module has an API, represented by a set of Interfaces, and one or more implementations. An important objective is to make each Module's API stable, so that as the code evolves, other modules are not impacted by changes in the API. This is where the Exception specification becomes important.

Let us assume there are two modules, A and B, and also assume that B depends upon A. Now, suppose that some methods in A's API throw an exception called ExceptionA, and some of B's methods throw ExceptionB. Since B's methods call A's methods, when defining B's methods, we have the following options:

  1. Allow B's methods to throw ExceptionA.
  2. Catch ExceptionA in B's methods and wrap them in ExceptionB.

The problem with the first approach is that it makes B's API unstable. What if A's methods start throwing a new Exception type?

The problem with the second approach is that ExceptionA has been wrapped and cannot be caught by a client. A client that invokes B may want to handle ExceptionA in a specific manner. With the first option, the client could write the following code; with the second option, this is not possible:

try {
    // call B's API
    B.someMethod();
} catch (ExceptionA e) {
    // Catch and handle ExceptionA
}

What we want is to preserve the information that A threw ExceptionA, but still avoid having ExceptionA in B's method signatures.

The solution is to wrap ExceptionA in a specific sub-class of ExceptionB, rather than plain ExceptionB. Let us call this sub-class ExceptionBExceptionA. The trick is that methods in B should only be specified to throw ExceptionB. This is okay because ExceptionBExceptionA is a sub-class of ExceptionB. However, clients can now catch ExceptionBExceptionA and handle this particular exception, while ignoring other instances of ExceptionB.
try {
    // call B's API
    B.someMethod();
} catch (ExceptionBExceptionA e) {
    // Catch and handle ExceptionBExceptionA
}

Not all exceptions thrown by A need be wrapped in this manner - only those that are specifically useful to the client.

In SimpleDBM, each module defines its own Exception class. Methods of the module can only throw instances of the module Exception. However, where necessary, sub-classes of the Exception are created that represent more specific information, sometimes wrapping Exceptions thrown by other modules.
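To make this concrete, here is a minimal sketch of the pattern. The class names mirror the example above and are hypothetical; they are not SimpleDBM's actual exception classes.

// Module A and its exception (hypothetical).
class ExceptionA extends Exception {
    ExceptionA(String message) {
        super(message);
    }
}

class A {
    void doSomething() throws ExceptionA {
        // ... module A's work ...
    }
}

// Module B's base exception: the only type declared in B's method signatures.
class ExceptionB extends Exception {
    ExceptionB(String message) {
        super(message);
    }
    ExceptionB(String message, Throwable cause) {
        super(message, cause);
    }
}

// Specific sub-class that wraps ExceptionA; clients can catch this selectively
// while handling all other failures as plain ExceptionB.
class ExceptionBExceptionA extends ExceptionB {
    ExceptionBExceptionA(ExceptionA cause) {
        super("error in module A: " + cause.getMessage(), cause);
    }
}

class B {
    private final A moduleA = new A();

    // B's signature declares only ExceptionB, so it remains stable even
    // though the ExceptionA information is preserved in the cause chain.
    public void someMethod() throws ExceptionB {
        try {
            moduleA.doSomething();
        } catch (ExceptionA e) {
            throw new ExceptionBExceptionA(e);
        }
    }
}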

Wednesday, December 14, 2005

Free Space Management

Despite my previous resolve, I have started work on the Tuple Manager module this month. I'll talk about some of the issues and challenges in this module.

The Tuple Manager module provides a low-level interface for managing the persistence of table rows. It is low-level in the sense that this module has no knowledge of what is contained in a table row. I use the term tuple instead of table row, though even this is not quite the right term, as a tuple properly means a collection of attributes.

To the Tuple Manager, tuples are just blobs of data that can span multiple pages. When a tuple is inserted for the first time, it is assigned a unique Location, which is really an abstraction of the ROWID concept in other databases. The Tuple Manager module implements the Location interface, but other modules do not need to know anything about the internal structure of these objects.

Like BTree indexes, tuples are stored in containers. A container that is specialized for storing tuples is called a Tuple Container. By design, only one type of tuple may be stored in a particular container. In a higher level module, a tuple can be mapped to a table row, and the container to a table.

I have kept the interface of the Tuple Manager generic by ensuring that it knows very little about tuples. Unfortunately, this means that tuple updates cannot be handled efficiently, especially with regard to logging, as both the before and after images of the tuple must be logged. A possible optimisation would be to use some form of binary diff algorithm to generate a change vector, and store only the change vector in the log record. I will initially implement the less efficient logging method, and later on, when time permits, implement the optimisation.

Both BTrees and Tuple Containers need free space management. By free space management we mean the process of identifying pages where new data can go. In the case of BTrees, SimpleDBM uses space map pages that use one bit per page. This is okay, because in a BTree, a page is either allocated or not. My paper examines some of the space management issues in BTrees in more detail.

In the case of Tuple Containers, I am using space map pages that use two bits to store the space information for a single page. This means that we can track the following states: full (3), two-thirds full (2), one-third full (1), and empty (0). Initially, pages start out empty; as they are used, their status changes as required.
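As an illustration of the idea (this is a sketch, not SimpleDBM's actual space map code), two-bit entries can be packed four to a byte:

// Illustrative two-bit space map: each page's fullness (0-3) is packed
// four entries to a byte.
class TwoBitSpaceMap {
    private final byte[] map;

    TwoBitSpaceMap(int numPages) {
        map = new byte[(numPages + 3) / 4];
    }

    // value must be 0 (empty), 1 (one-third full), 2 (two-thirds full) or 3 (full)
    void set(int pageNumber, int value) {
        int index = pageNumber / 4;
        int shift = (pageNumber % 4) * 2;
        map[index] = (byte) ((map[index] & ~(3 << shift)) | ((value & 3) << shift));
    }

    int get(int pageNumber) {
        int index = pageNumber / 4;
        int shift = (pageNumber % 4) * 2;
        return (map[index] >> shift) & 3;
    }
}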

There are a couple of issues related to space management that merit discussion. Unfortunately, there are very few papers that go into all the details. I have found the following papers very useful:

[1] C. Mohan and D. Haderle. Algorithms for Flexible Space Management in Transaction Systems Supporting Fine-Granularity Locking. In Proceedings of the International Conference on Extending Database Technology, March 1994.

[2] Mark L. McAuliffe, Michael J. Carey and Marvin H. Solomon. Towards Effective and Efficient Free Space Management. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, June 1996.

The first issue is how to handle deletes. There are two options.

The first option is to delete a tuple physically and update the space map page to reflect the change. However, this poses a problem: if the transaction aborts and the tuple needs to be restored, the restore will fail if some other transaction has used up the released space in the meantime. To prevent this, some extra information needs to be stored in the page to indicate that although the tuple has been deleted, its space is reserved and cannot be used by any other transaction. One possible approach is to store the transaction id and the amount of space reserved in the space previously occupied by the tuple. If the tuple occupied more than one page, then space must be reserved on all affected pages, since otherwise, when the tuple is to be restored, the pages may no longer have space to hold the tuple data. If logical undo is implemented, then it is possible to avoid reserving space in pages other than the first page, because a logical undo allows new pages to be commissioned if necessary to accommodate the restored tuple. Since the tuple's unique id (Location) is bound to the first page, this page must always have space available for the tuple to be restored.

If, as suggested above, the space map information is updated as soon as the tuple is deleted, then other transactions looking for free space may end up visiting pages that have been affected by the tuple delete. However, when they access a page, those transactions may discover that space is not actually available. One solution to this problem would be to defer the space map page update until the tuple delete is known to have been committed. However, this would be inefficient, as the transaction that performs the delete would have to visit all pages affected by deleted tuples at commit time, and update the free space map information for these pages.

The space map update could also be delegated to the next transaction that needs the space. The problem with this is that if the page remains marked as fully allocated, then no other transaction will visit that page unless the tuples on the page need to be updated. There is the risk that the tuple space will never be reclaimed.

The problem of unnecessary visits to a page containing reserved space can be avoided by techniques described in [2]. This involves maintaining a cache of recently used pages and avoiding scans of the free space map as long as there is a candidate page available in the cache. When a page is affected by a tuple delete, it is added to the cache provided that its total free space, including the space reserved for the deleted tuple, is greater than that of the fullest page in the cache. If a transaction visits such a page and is unable to use the reserved space, it removes the page from the cache.
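Here is a rough sketch of how such a cache might work. The names and the sizing policy are mine, for illustration only; see [2] for the real algorithm.

import java.util.LinkedHashMap;
import java.util.Map;

// A minimal sketch of the free-space cache idea from [2]; not SimpleDBM code.
class FreeSpaceCache {
    private static final int CAPACITY = 16;
    // page number -> total free space (including space reserved by deletes)
    private final Map<Integer, Integer> cache = new LinkedHashMap<Integer, Integer>();

    // Called when a tuple delete frees space on a page.
    synchronized void pageFreed(int pageNumber, int totalFreeSpace) {
        if (cache.size() < CAPACITY) {
            cache.put(pageNumber, totalFreeSpace);
            return;
        }
        // Only admit the page if it has more free space than the fullest
        // (i.e. least free) page currently in the cache.
        Map.Entry<Integer, Integer> leastFree = null;
        for (Map.Entry<Integer, Integer> e : cache.entrySet()) {
            if (leastFree == null || e.getValue() < leastFree.getValue()) {
                leastFree = e;
            }
        }
        if (totalFreeSpace > leastFree.getValue()) {
            cache.remove(leastFree.getKey());
            cache.put(pageNumber, totalFreeSpace);
        }
    }

    // Called by a transaction looking for a page with at least 'required' bytes;
    // returns -1 if the caller must fall back to scanning the space map.
    synchronized int findCandidate(int required) {
        for (Map.Entry<Integer, Integer> e : cache.entrySet()) {
            if (e.getValue() >= required) {
                return e.getKey();
            }
        }
        return -1;
    }

    // Called when a visiting transaction finds the page's space is still reserved.
    synchronized void remove(int pageNumber) {
        cache.remove(pageNumber);
    }
}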

In summary then, the preferred option appears to be to update the space map information as soon as the tuple is deleted.

Physically deleting tuples affects the amount of logging that must be performed. Since the tuple's data is removed, the log must contain the entire contents of the deleted tuple. Similarly, when undoing the delete, the Compensation log record must again contain the full content of the tuple. Thus the tuple data potentially gets logged twice: once when the delete occurs, and again if the transaction aborts.

This brings us to the second option, which is to use logical deletes. In this solution, the tuple remains as is, but is marked as deleted. No space reservation is needed, as the tuple still exists. The space map information is updated as before, that is, at the time the tuple is deleted. Using logical deletes makes the undo of such deletes a simple matter of resetting the deleted flag. The logging overhead is substantially reduced.

With logical deletes, however, none of the space can be released prior to the transaction commit. In contrast, with physical deletes, if logical undo is implemented, at least some of the space can be immediately released.

Whether logical or physical deletes are used, we still have the issue of how to inform other transactions whether the tuple space is still needed. In both cases, the solution is the same: the Lock Manager can be used to ascertain whether the deleted tuple is still locked. If it is not, then the transaction can infer that the tuple delete has been committed. The Lock Manager solution works even if the ID of the transaction that deleted the tuple is unknown, as it relies upon the tuple's Location only. If each tuple were tagged with the ID of the last transaction that updated it, then it would be possible to directly query the transaction table for the status of the transaction. However, in that case, the system would have to maintain the status of all transactions, even those that have committed or aborted.

SimpleDBM maintains the status of only active transactions, and also does not tag tuples with the IDs of transactions. Hence, it is appropriate to use the Lock Manager solution in SimpleDBM.
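A sketch of how such a check might look follows. The interface here is hypothetical; SimpleDBM's actual Lock Manager API may differ.

// Hypothetical Lock Manager interface, for illustration only.
interface LockManager {
    // Attempts an instant-duration exclusive lock without waiting;
    // returns false if the lock is currently held by another transaction.
    boolean tryInstantExclusiveLock(Object lockable);
}

class TupleSpaceChecker {
    // If no transaction holds a lock on the tuple's Location, the deleting
    // transaction must have completed, so its reserved space can be reclaimed.
    static boolean isDeleteCommitted(LockManager lockManager, Object location) {
        return lockManager.tryInstantExclusiveLock(location);
    }
}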

I mentioned before that there were two issues related to Space Management that I wanted to discuss. The second issue is related to logging of free space map pages. I am still working on this issue, and have not reached any conclusions yet. I will blog on this subject once I have completed work on the Tuple Manager. In the meantime, if you are interested in the logging issues, I suggest reading [1].

Monday, December 05, 2005

December Priorities

The priority for December is to complete the documentation and create sample programs. Both of these are intended to help new developers to get started with SimpleDBM.

So far, I have created a BTreeDemo sample - here is a screenshot. The demo program allows a user to interactively manipulate a BTree index. The program creates two threads and allows the user to switch from one to the other. This is a good way to simulate locking and concurrency issues.

I am working on a Transaction Manager demo, which will be based upon the One Bit Resource Manager described in section 10.3.7.2 of the classic book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter.

The developer's guide is coming along slowly. The latest version can always be found here.

The other theme of this month is to carry on with more BTree test cases. I am reluctant to start work on anything else until I am completely satisfied with the testing.

I mentioned in my previous post that I would need to enhance the BTree module to allow generic multiple attribute index keys. This is now done, and a sample implementation of multiple attribute keys is available as part of the BTreeDemo program.

Tuesday, November 15, 2005

Design choices - flexibility versus performance part 2

When designing the interface for the BTree module, there are some conflicting priorities.

An important goal is to ensure that the interface is reusable and makes as few assumptions about the content of "keys" and "values" as possible. Some of the points to consider are:

a) Should the BTree module be aware of the type system available in the system?
b) Should it be aware of how multiple attribute keys are handled?

I chose to keep the level of abstraction higher. Therefore in my implementation of the BTree module, the module has no knowledge of what is contained in the key, whether it is made up of single or multiple attributes, etc. The advantage is that a key can be anything that satisfies a relatively simple interface defined by the system. The disadvantages are:

1) The system cannot perform specific performance optimisations that would be dependent upon the knowledge of the attribute structure and type system. For example, the system cannot perform compression on the key structure.
2) The system cannot support search operators at an attribute level. When searching and fetching keys and values, only one composite operator is supported: >=. That is, the fetch algorithm knows how to obtain the next key that is greater than or equal to the specified key (a sketch follows below).

On the whole, I am happy with the interface as it provides greater reusability. It allows the BTree module to be independently useful.
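To illustrate point 2, a fetch loop under this interface might look like the sketch below. The API is hypothetical, not SimpleDBM's actual interface; the point is that any per-attribute filtering has to happen in the caller.

interface IndexScan {
    // Positions on the next key greater than or equal to the search key,
    // then on each successive key; returns false when the index is exhausted.
    boolean fetchNext();
    byte[] getCurrentKey(); // opaque bytes: the BTree knows nothing of attributes
    void close();
}

class ScanExample {
    static void scan(IndexScan scan) {
        try {
            while (scan.fetchNext()) {
                byte[] key = scan.getCurrentKey();
                // any per-attribute predicate (e.g. equality on the second
                // attribute only) must be applied here, outside the BTree
            }
        } finally {
            scan.close();
        }
    }
}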

A problem I have been grappling with recently is how to support data-driven construction of keys, without breaking the high-level interface defined so far. In a DBMS, the definition of keys is stored in the System Catalogs. There has to be a way by which the key definition can be used to generate keys. In the current design, the BTree module expects to be given a factory class for instantiating keys. I thought of creating a key factory that would read a key definition and then generate keys based upon the definition. However, the problem is that such a key factory would be capable of generating many different types of keys, whereas the current BTree interface expects a one-to-one relationship between the key factory and the type of key. I have come to the conclusion that I need to enhance the system in two ways:

Firstly, I need to support a one-to-many relationship between the key factory and the key. Since each BTree instance is tied to a specific type of key, the interface needs to be enhanced to supply an "id" that enables the key factory to generate the specific type of key required by the BTree instance. This means that instead of invoking:

key = keyFactory.getNewInstance()

the BTree instance would invoke:

key = keyFactory.getNewInstance(myid)

The key factory would be able to use the "id" to determine the type of key required.
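In sketch form, the change looks like this. The interfaces are hypothetical, not the actual SimpleDBM API:

// Keys must be sortable and support SimpleDBM's form of serialization
// (the serialization methods are elided here).
interface IndexKey extends Comparable<IndexKey> {
}

interface IndexKeyFactory {
    // 'id' tells the factory which type of key to construct, allowing a
    // single data-driven factory to serve many BTree instances - for example,
    // by looking up a key definition read from the System Catalogs.
    IndexKey getNewInstance(int id);
}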

The second enhancement has to do with performance. At present, the Object Registry always generates new instances of objects, which would be inefficient for a key factory that needs to maintain its own internal registry. I therefore need to enhance the registry to support Singletons - key factories need to be cached in the registry so that the same factory can be used by multiple BTree instances.

My final point about the trade-offs between flexibility and performance is to do with the storage structure of keys and log records. I have tried to make the storage structures self-describing. This means that the keys and values, as well as the log records, must persist sufficient information to be able to reconstruct their state when required. In a multi-attribute key, for example, this means that each attribute must store its type information, data length, etc., along with the data. This naturally increases the storage space requirement of entities. The benefit is that the system does not require external support to determine how to read data back from persistent storage. For example, the BTree module does not require the System Catalog to be available; it has sufficient information to be able to read and write keys to persistent storage.
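As a small illustration of what self-describing storage means for a single attribute (this is a sketch, not SimpleDBM's actual format):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: each attribute writes a type tag and a length alongside
// its data, so the stored form can be read back without the System Catalog.
class StringAttribute {
    static final byte TYPE_STRING = 1; // hypothetical type code

    String value;

    void store(ByteBuffer bb) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        bb.put(TYPE_STRING);    // type information travels with the data
        bb.putInt(data.length); // so does the data length
        bb.put(data);
    }

    void retrieve(ByteBuffer bb) {
        byte type = bb.get();   // a full implementation would dispatch on this
        int length = bb.getInt();
        byte[] data = new byte[length];
        bb.get(data);
        value = new String(data, StandardCharsets.UTF_8);
    }
}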

Thursday, November 10, 2005

Refactoring

In a major refactoring exercise I have split the SimpleDBM modules into two higher level packages. The Latch Manager, the Object Registry and the Util packages all go under the package org.simpledbm.common. The rest of the packages go under org.simpledbm.rss. I decided to use the acronym RSS for the low level data management API in honour of System R. System R called its low level API Research Storage System or RSS in short. See the paper Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, Vera Watson: System R: Relational Approach to Database Management. ACM Trans. Database Syst. 1(2): 97-137(1976).

Of course, all this refactoring has left CVS in a mess, as now there are many empty subdirectories under src/org/simpledbm.

I have also been working on a Developer's Reference Manual. This will hopefully contain sufficient information for interested hackers and database technology enthusiasts to play around with various modules. You can access the document here. I would welcome any feedback.

Finally, the SimpleDBM project has now graduated out of the incubator at www.java.net. This should lead to greater exposure and interest in the project.

Monday, November 07, 2005

BTree Implementation Progress Report Part 4

BTree scans are now available as well. This means that the BTree implementation is feature complete, although, there are still some areas that need more work.

For the rest of November, I am going to concentrate on improving the code, refactoring bits that I don't like, updating documentation, and generally cleaning up the code. All this in preparation for a release of the code towards the end of the month.

From December onwards, work can start on the remaining bits in the Data Manager layer of SimpleDBM, i.e., tables. I have not yet decided whether knowledge about data types should be part of this layer or whether it is best left to a higher-level layer. I am tempted to keep the data layer as low-level as possible; this will make it more reusable.

Wednesday, November 02, 2005

BTree Implementation Progress Report Part 3

Both insert key and delete key operations are now available. More test cases are being written; in order to test lock conflicts when the same key is concurrently inserted/deleted by different transactions, the new test cases have to use multiple threads. It is harder to debug such test cases, but Eclipse makes it easy. I simply run the JUnit tests in debug mode and set break points at appropriate places.

By the end of this week I shall be working on the BTree scan operations. There aren't many research publications that deal with the inner workings of an index scan; the only one that seems to discuss a real implementation is Mohan, C., An Efficient Method for Performing Record Deletions and Updates Using Index Scans, Proc. 28th International Conference on Very Large Databases, Hong Kong, August 2002. The paper discusses many of the performance issues with implementing scans. My implementation will suffer from the problems Mohan describes, as I want to do a basic implementation first before looking at optimisations.

To ensure that the BTree implementation is tested thoroughly, I have started to use the code coverage tool Clover. The vendor has graciously provided me a free license to use this tool. Using this tool I am able to determine which code paths have not been tested, and then write test cases to exercise those.

Friday, October 28, 2005

BTree Implementation Progress Report Part 2

Currently I am busy testing the code for inserting keys into the BTree. My testing methodology is:

  1. Write a JUnit test case.
  2. Debug the code, stepping through the entire test.
  3. Fix any bugs and retest until test runs okay.

Out of the 16 test cases written so far, only 4 are on insert - with many more to come.

To facilitate testing I need to construct trees with specific properties. The way I have chosen to do this is to specify the data in an XML file and then load it to create the desired tree. Just goes to show the amount of effort required to construct proper test cases. My current estimate is that I won't have a fully tested BTree implementation until end of November.

Thursday, October 13, 2005

BTree Implementation Progress Report

I have been working hard on getting the BTree implementation completed. It is about 75% complete, however, delivery will take longer as I need to ensure that adequate test cases are available to verify the correctness of the algorithms. Already, this is probably the largest module. It will also be one of the most complex.

As mentioned before, I am implementing the B-link tree algorithm as published in the paper Ibrahim Jaluta, Seppo Sippu and Eljas Soisalon-Soininen. Concurrency control and recovery for balanced B-link trees. The VLDB Journal, Volume 14, Issue 2 (April 2005), Pages: 257 - 277, ISSN:1066-8888. I think this will probably be one of the few BTree implementations that fully supports merging of pages when keys are deleted.

I have written a short paper discussing some space management issues in the implementation. The paper can be found here.

Sunday, October 02, 2005

Database Technology camps

Mainstream Relational Database technology is split into two camps.

On the one hand are the systems that are directly or indirectly based upon techniques and algorithms derived from IBM's System R project. Most published research also falls into this group.

In the second camp are systems that have been inspired by Oracle. Oracle has a very interesting history. Although Oracle started with the specifications of System R, it actually did not use any of the System R ideas. A good summary of how Oracle evolved can be found here.

SimpleDBM is based upon the System R heritage. The main reason for this is that I do not work in a company that makes DBMSes. Hence I do not have access to information other than published research. Since Oracle does not publish details of its algorithms, it would be much harder for me to implement a system based upon Oracle. I have, however, given much thought to some of the techniques used in Oracle, and hope to discuss some of these in future blogs and papers.

Friday, September 30, 2005

Design choices - flexibility versus performance

One of the design choices I made in SimpleDBM is to build the system as a collection of modules, and keep the modules loosely coupled via interfaces. There is also a hierarchy within the modules; some modules come before others, and provide services used by other modules. It is very important to enforce this hierarchy, because otherwise, the dependencies between the modules would become cyclical, making it much harder to replace a module.

A benefit of choosing this design is flexibility. For example, although there are different types of Log records in the system, the Log Manager has no knowledge of them. Similarly, the Transaction Manager interacts with disk pages without knowing much about the structure of those pages. Even the Buffer Manager, which is responsible for caching disk pages in memory, actually knows nothing about disk pages. Therefore, it is possible to implement new page types without impacting the Buffer Manager. As a final example, the module I am working on currently, the BTree manager, has no knowledge about the types of index keys and rowids it has to handle. It relies upon a couple of externally provided factory classes to generate and manage these objects. I am therefore able to develop the BTree module in advance of defining the table row structure, the field types or even the field operations. From the BTree module's perspective, all that is needed is that the keys should be sortable and support the form of serialization that is defined by SimpleDBM.

The downside to this flexibility is increased overhead. There is extra memory and processing overhead, as various object types must be created and managed dynamically using reflection. There is extra overhead in maintaining and looking up type information. For instance, when log records are retrieved during restart recovery, and we need to replay the changes that have occurred within a BTree page, we need to first obtain instances of the index key generation factory and the rowid generation factory. The log records must save extra type information to allow this.

Testing

One constraint I have placed upon myself when coding is that I will not start work on a new module until I have created and executed test cases for the module I am currently working on. This slows me down, but the payoff is tremendous.

Writing test cases for a module often takes more time than coding the module itself. I think that this is not unusual, and any project that is serious about testing must factor in additional time and effort for this activity. In the company I work for, we never write automated tests. All testing is carried out by running the program, feeding input and checking output manually. This form of testing is important as well, but totally inadequate. A unit test written by the developer can test aspects that a user or operator can never have access to.

Of course, all test cases are not created equal. Often we hear about projects that use a JUnit-based testing framework, but this in itself does not prove how good the testing is. There are several things to look for:
  1. Test coverage - how much of the program has been subjected to testing?
  2. Test scenarios - a program is meant to be used in many different scenarios. How many of the possible scenarios have been tested?
  3. Test data - has the program been tested for variations in data?

For a project like SimpleDBM, testing is additionally complicated by two things:

  1. Liberal use of threads and multi-threading constructs complicate testing.
  2. Recovery operations need the system to be crashed in various ways.

It is not always possible to write test cases that are non-intrusive. Sometimes it is necessary for the program being tested to cooperate with the test framework. For example, in the BTree split page test case, I need to be able to crash the system at specific points in order to test recovery. To do this, I use flags within the BTree module that the test framework can manipulate to trigger this behaviour. For normal operations, the flags can be switched off.
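A minimal sketch of this arrangement is shown below; the names are illustrative, not the actual SimpleDBM code.

// Sketch of the cooperating test-flag idea: the module checks a flag at a
// known point, and the test framework sets it to force a simulated crash.
class BTreeTestFlags {
    // 0 = normal operation; non-zero values identify specific crash points
    static volatile int crashPoint = 0;
}

class BTreeModuleSketch {
    void doSplit() {
        // ... perform the first phase of the page split ...
        if (BTreeTestFlags.crashPoint == 1) {
            // simulate a crash part-way through the split, so that the test
            // can then exercise restart recovery
            throw new RuntimeException("simulated crash at split point 1");
        }
        // ... complete the split ...
    }
}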

Thursday, September 15, 2005

Page oriented redo/undo versus Logical Undo

ARIES uses the term page-oriented redo/undo to describe logging that is always applied to a specific page. This means that if some page P1 generated an undo/redo log record L1, then during redo, the redo portion of L1 will be applied to P1, and during undo, the undo portion of L1 will be applied to P1.

The term Logical Undo is used to describe logging where the undo action may not always be applied to the page that generated the log record. However, Logical Undos can still be page-oriented. The difference is that the undo action may be applied to a different page. It is also possible for a logical undo operation to modify more than one page, and generate additional log records.

Since a Logical Undo can be page-oriented, the term page-oriented undo is ambiguous.

Monday, September 05, 2005

In favour of Checked Exceptions

I said in a previous post that I favour Checked Exceptions over Unchecked ones. I will try to explain my reasons for doing so.

A good discussion of the problems with Checked Exceptions is in this article by Brian Goetz. The main problems can be summarized as:
  1. Unnecessary exposure of implementation details.
  2. Unstable method signatures.
  3. Unreadable code.
  4. Exception swallowing.
  5. Too much Exception wrapping.

Problems 1 and 2 are caused by poor design. By ensuring that exceptions are thrown at the right level of abstraction, these problems can be avoided. Of course, this means greater effort by the programmer, and also leads to problem 5, i.e., too much exception wrapping.

As for problem 3, I think that any piece of code that has proper error handling will always be less readable than code that has no error handling. This is the reason why so much text book code has very little or no error handling. In a production system, unlike in text books, readability is important, but proper error handling is even more important.

Since so many articles talk about the problems with Checked Exceptions, perhaps a novel approach to this issue is to look at the problems caused by Unchecked Exceptions. These, in my view, can be summarized as:

  1. Lack of information in method signatures. So that by looking at a method signature, you cannot tell what exceptions are likely to be thrown.
  2. The Java compiler will not warn you if you ignore exceptions. Therefore, rather than thinking about every possible exception, you will be blissfully unaware of potential exceptional conditions. The result is likely to be an unstable system; one that crashes with indecipherable errors.
  3. Loss of information. Since information about exceptional conditions is fairly local within a system, when an exception at a lower level is caught and wrapped at a higher level, there is potential for additional information to be supplied to diagnose the problem. Since unchecked exceptions pass through silently, all you will get is the cryptic error at the lowest level of abstraction with very little contextual information other than the stack trace.
  4. Exception swallowing. Let us not forget that the most common method of swallowing Exceptions, i.e., catching all Exceptions, will also swallow Unchecked Exceptions. I have seen programmers swallow Exceptions because they find this an easy and simple way to fix that annoying NullPointerException.
  5. Whether you like it or not, Java uses Checked Exceptions everywhere in its libraries. Therefore, even if you throw Unchecked Exceptions, chances are that programmers will still swallow Exceptions, because they still need to deal with the Checked Exceptions thrown by Java libraries.
  6. As it is, programmers do not like to think about error handling. To encourage the use of Unchecked Exceptions is to encourage this behaviour, as there will be even less inclination to think about exceptional conditions.

If you are still not convinced, read this excellent interview with Bruce Lindsay, where he talks about the type of error handling that goes into building a DBMS.

Friday, September 02, 2005

BTrees and Logical Undos

Certain BTree operations require Logical Undo to be performed during rollbacks and restart recovery. This is because, in a BTree, keys can move from one page to another, and therefore an undo operation may have to be applied to a different page from the one where the original action took place.

To take an example, suppose transaction T1 deletes the key D from page P1. Another transaction T2 comes along and inserts the key B into page P1. This causes a page split, after which, all keys greater than B end up in a new page P2. Now, if the first transaction T1 aborts, then it needs to reinstate the deleted key D. However, if it looks at page P1, it will find that the deleted key no longer belongs there. Therefore it must locate the new page P2 and perform the undo action there.

There are a couple of design decisions that affect the complexity of the logical undo actions:

If the BTree is designed to physically remove keys when they are deleted, then undo actions require the key to be physically re-inserted. This means that during the undo of a key delete, page splits can occur if the insert cannot be performed due to the page being full. The splits can recurse up the tree causing more page splits.

The advantage of the physical removal option is that the BTree never contains data that is superfluous. If a large number of rows are deleted, the BTree will be immediately compacted, as the pages used by deleted keys would be removed. Keeping the BTree dense and small is good for performance as fewer pages need to be scanned during traversals.

The other approach is to use logical key deletes, where the key is not physically removed but merely marked as deleted. To undo the delete, only the mark needs to be reset. Logical deletes allow undo operations to proceed without page splits, and therefore Logical Undos can be page-oriented.
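A minimal sketch of the logical delete idea (illustrative only, not an actual page format):

// Sketch: the key stays in the page; only a flag changes, so undo is
// page-oriented and can never trigger a page split.
class IndexEntry {
    byte[] key;
    boolean deleted;

    void logicalDelete() {
        deleted = true;   // key remains physically present in the page
    }

    void undoDelete() {
        deleted = false;  // undo is a simple flag reset
    }
}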

The problem with logical deletes, of course, is that these keys must be garbage collected at some point, and key traversals need to obtain instant locks on the deleted keys to determine whether they are truly deleted or not. The presence of a large number of deleted keys can also make the BTree less dense, and lengthen the path for traversals.

It seems to me that the logical delete key approach is more suited to a scenario where there are frequent rollbacks.

The issue of logical deletes is not discussed at any great length anywhere. There is a brief discussion of this in Mohan's paper Commit_LSN: A Novel and Simple Method for Reducing Locking and Latching in Transaction Processing systems.

A BTree implementation that uses logical key deletes is in Apache Derby.

Thursday, September 01, 2005

Space Management

This week I am working on the Space Manager module, which will handle page allocation in containers. This module is not so exciting; it is tedious work with lots of little details, all of which have to be right for it to work correctly. While working on this, I did have some interesting experiences, though.

I am a firm believer in Checked Exceptions. I think Unchecked Exceptions should be used very, very rarely, and only after considerable thought. It seems to me that those who champion Unchecked Exceptions are primarily looking for clean code that is aesthetically pleasing and not cluttered with a lot of error handling code. While this may be desirable for improving the readability of the code, correct error handling is more important than readability, especially in a mission-critical system. Unchecked Exceptions have the big problem that the Java compiler does not complain if you ignore them. Therefore, you can write code that happily ignores exceptions, and if this is part of a large system, believe me, bad things will happen.

I will write more on this subject in a separate post. The reason I brought this up is that in SimpleDBM, almost all exceptions are checked exceptions, and while implementing the Space Manager module, I encountered a problem that would not have occurred had I used unchecked exceptions. The Space Manager module is transactional, and therefore needs to implement the transactional redo/undo interfaces defined by the Transaction Manager. While implementing these interfaces, I suddenly found that because I was using checked exceptions, the existing redo/undo interfaces did not allow the Space Manager to throw exceptions that are unknown to the Transaction Manager. At first I thought that maybe I should expand the interfaces to allow this, but then, I realized that doing this would break the modular structure of the system, as it would create a cycle in the dependency graph between the two modules. Also, the Transaction Manager has to be able to manage any number of transactional modules, and therefore must not know about the specifics of any module. If I allowed the TM to know about exceptions thrown by the Space Manager, then it would mean that the same would have to be done for every other module that is implemented in future!

The solution to this problem is to specify the redo/undo interfaces in such a way that modules that implement these interfaces can wrap any undeclared exceptions in a specific exception that is meant to wrap such exceptions. This way, we can continue to use and benefit from checked exceptions, without having to resort to unchecked exceptions.
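Here is a sketch of what I mean. The names and signatures are hypothetical and greatly simplified, not the actual SimpleDBM interfaces:

// Module-specific exception that the Transaction Manager knows nothing about.
class SpaceManagerException extends Exception {
}

// The exception type designated by the redo/undo interfaces for wrapping
// undeclared module exceptions.
class UndeclaredModuleException extends Exception {
    UndeclaredModuleException(Throwable cause) {
        super(cause);
    }
}

interface TransactionalModule {
    // The Transaction Manager only ever sees this declared type;
    // an undo() method would be declared the same way.
    void redo() throws UndeclaredModuleException;
}

class SpaceManagerModuleSketch implements TransactionalModule {
    public void redo() throws UndeclaredModuleException {
        try {
            applyRedo();
        } catch (SpaceManagerException e) {
            // wrap the module-specific exception so that the Transaction
            // Manager's interfaces remain independent of this module
            throw new UndeclaredModuleException(e);
        }
    }

    private void applyRedo() throws SpaceManagerException {
        // ... module-specific redo logic ...
    }
}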