Wednesday, December 14, 2005

Free Space Management

Despite my previous resolve, I have started work on the Tuple Manager module this month. I'll talk about some of the issues and challenges in this module.

The Tuple Manager module provides a low-level interface for managing persistence of table rows. It is low-level in the sense that this module has no knowledge of what is contained in a table row. I use the term tuple instead of table row, but even this is not the right term, as a tuple means a collection of attributes.

To the Tuple Manager, tuples are just blobs of data that can span multiple pages. When a tuple is inserted for the first time, it is assigned a unique Location, which is really an abstraction of the ROWID concept in other databases. The Tuple Manager module implements the Location interface, but other modules do not need to know anything about the internal structure of these objects.

Like BTree indexes, tuples are stored in containers. A container that is specialized for storing tuples is called a Tuple Container. By design, only one type of tuple may be stored in a particular container. In a higher level module, a tuple can be mapped to a table row, and the container to a table.

I have kept the interface of Tuple Manager generic by ensuring that it knows very little about tuples. Unfortunately, this means that tuple updates cannot be handled efficiently, specially with regards to logging, as the contents of the both before and after images of the tuple must be logged. A possible optimisation would be to use some form of binary diff algorithm to generate a change vector, and store the change vector only in the log record. I will initially implement the less efficient logging method, and later on, when time permits, implement the optimisation.

Both BTrees and Tuple Containers need free space management. By free space management we mean the process of identifying pages where new data can go. In the case of BTrees, SimpleDBM uses space map pages that use one bit per page. This is okay, because in a BTree, a page is either allocated or not. My paper examines some of the space management issues in BTrees in more detail.

In case of Tuple Containers, I am using space map pages that use two bits to store the space information for a single page. This means that we can track the following states: full (3), two-thirds full (2), one-third full (1), and empty (0). Initially pages start out empty, but when they are used, their status changes as required.

There are a couple of issues related to space management that merit discussion. Unfortunately, there are very few papers that go into all the details. I have found the following papers very useful:

[1] C.Mohan and D.Haderle. Algorithms for Flexible Space Management in Transaction Systems Supporting Fine-Granularity Locking. In Proceedings of the International Conference on Extending Database Technology, March 1994.

[2] Mark L.McAuliffe, Michael J. Carey and Marvin H. Solomon. Towards Effective and Efficient Free Space Management. ACM SIGMOD Record. Proceedings of the 1996 ACM SIGMOD international conference on Management of Data, June 1996.

The first issue is how to handle deletes. There are two options.

The first option is to delete a tuple physically and update space map page to reflect the change. However, this poses the problem that if the transaction aborts and the tuple needs to be restored, then the restore will fail if some other transaction uses up the released space in the meantime. To prevent this, some extra information needs to be stored in the page to indicate that although the tuple has been deleted, its space is reserved, and cannot be used by any other transaction. One possible approach is to store the transaction id and the amount of space reserved using the space previously occupied by the tuple. If the tuple occupied more than one page, then space must be reserved on all affected pages, since otherwise, when the tuple is to be restored, the pages may no longer have space to hold the tuple data. If logical undo is implemented, then it is possible to avoid reserving space in pages other than the first page, because a logical undo will allow new pages to be commissioned if necessary to accomodate the restored tuple. Since the tuple's unique id (Location) is bound to the first page, this page must always have space available for the tuple to be restored.

If as suggested above, the space map information is updated as soon as the tuple is deleted, then other transactions looking for free space may end up visiting the pages that have been affected by the tuple delete. However, those transactions may discover when they access the page, that space is not actually available. As a solution to this problem, the space map page update could be deferred until the tuple delete is known to have been committed. However, this would be inefficient, as the transaction that performs the delete will have to visit all pages affected by deleted tuples at commit time, and update the free space map information for these pages.

The space map update could also be delegated to the next transaction that needs the space. The problem with this is that if the page remains marked as fully allocated, then no other transaction will visit that page unless the tuples on the page need to be updated. There is the risk that the tuple space will never be reclaimed.

The problem of unnecessary visits to a page containing reserved space can be avoided by techniques described in [2]. This involves maintaining a cache of recently used pages and avoiding scanning of the free space map as long as there is candidate page available in the cache. When a page is affected by a tuple delete, it is added to the cache provided that its total free space, including the space reserved for deleted tuple, is greater than the fullest page in the cache. If a transaction visits such a page and is unable to use the reserved space, it removes the page from the cache.

In summary then, the preferred option appears to be to update the space map information as soon as the tuple is deleted.

Physically deleting tuples affects the amount of logging that must be performed. Since the tuple's data is removed, the log must contain the entire contents of the deleted tuple. Similarly, when undoing the delete, the Compensation log record must again contain the full content of the tuple. Thus the tuple data gets logged twice potentially, once when the delete occurs, and again if the transaction aborts.

This brings us to the second option, which is to use logical deletes. In this solution, the tuple remains as is, but is marked as deleted. No space reservation is needed, as the tuple still exists. The space map information is updated as before, that is, at the time of tuple being deleted. Using logical deletes makes undo of such deletes a simple matter of resetting the deleted flag. Logging overhead is substantially reduced.

With logical deletes, however, none of the space can be released prior to the transaction commit. In contrast, with physical deletes, if logical undo is implemented, at least some of the space can be immediately released.

Whether logical or physical deletes are used, in both cases, we still have the issue of how to inform other transactions that the tuple space is still needed. In both cases, the solution is the same. The Lock Manager can be used to ascertain whether the deleted tuple is still locked. If not, then the transaction can infer that tuple delete has been committed. The Lock Manager solution works even if the ID of the transaction that deleted the tuple is unknown, as it relies upon the tuple's Location only. If each tuple is tagged with the ID of the last transaction that updated the tuple, then it would be possible to directly query the transaction table for the status of the transaction. However in this case, the system would have to maintain the status of all transactions, even those that have committed or aborted.

SimpleDBM maintains the status of only active transactions, and also does not tag tuples with the IDs of transactions. Hence, it is appropriate to use the Lock Manager solution in SimpleDBM.

I mentioned before that there were two issues related to Space Management that I wanted to discuss. The second issue is related to logging of free space map pages. I am still working on this issue, and have not reached any conclusions yet. I will blog on this subject once I have completed work on the Tuple Manager. In the meantime, if you are interested in the logging issues, I suggest reading [1].

Monday, December 05, 2005

December Priorities

The priority for December is to complete the documentation and create sample programs. Both of these are intended to help new developers to get started with SimpleDBM.

So far, I have created a BTreeDemo sample - here is a screenshot. The demo program allows a user to interactively manipulate a BTree index. The program creates two threads and allows user to switch from one to the other. This is a good way to simulate locking and concurrency issues.

I am working on a Transaction Manager demo, which will be based upon the One Bit Resource Manager described in section of the classic book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter.

The developer's guide is coming along slowly. The latest version can always be found here.

The other theme of this month is to carry on with more BTree test cases. I am reluctant to start work on anything else until I am completely satisfied with the testing.

I mentioned in my previous post that I would need to enhance the BTree module to allow generic multiple attribute index keys. This is now done, and a sample implementation of multiple attribute keys is available as part of the BTreeDemo program.

Tuesday, November 15, 2005

Design choices - flexibility versus performance part 2

When designing the interface for the BTree module, there are some conflicting priorities.

An important goal is to ensure that the interface is reusable and makes as little assumptions about the content of "keys" and "values" as possible. Some of the points to consider are:

a) Should the BTree module be aware of the type system available in the system?
b) Should it be aware of how multiple attribute keys are handled?

I chose to keep the level of abstraction higher. Therefore in my implementation of the BTree module, the module has no knowledge of what is contained in the key, whether it is made up of single or multiple attributes, etc. The advantage is that a key can be anything that satisfies a relatively simple interface defined by the system. The disadvantages are:

1) The system cannot perform specific performance optimisations that would be dependent upon the knowledge of the attribute structure and type system. For example, the system cannot perform compression on the key structure.
2) The system cannot support search operators at an attribute level. When searching and fetching keys and values, only one composite operator is supported - >=. That is, the fetch algorithm knows how to obtain the next key that is equal or greater than the specified key.

On the whole, I am happy with the interface as it provides greater reusability. It allows the BTree module to be independently useful.

A problem I have been grappling with recently is how to support data driven construction of keys, without breaking the high level interface defined so far. In a DBMS system, the definition of the keys is stored in System Catalogs. There has to be a way by which the key definition can be used to generate keys. In the current design, the BTree module expects to be given a factory class for instantiating keys. I thought of creating a Key factory that would read a key definition and then generate keys based upon the definition. However, the problem is that such a key factory would be capable of generating many different types of keys, whereas the current BTree interface expects a one to one relationship between the key factory and the type of key. I have come to the conclusion that I need to enhance the system in two ways:

Firstly, I need to support one to many relationship between the key factory and the key. Since the BTree instance is tied to a specific type of key, it therefore needs to be enhanced to supply an "id" that enables the key factory to generate the specific type of key required by the BTree instance. This means that instead of invoking:

key = keyFactory.getNewInstance()

the BTree instance would invoke:

key = keyFactory.getNewInstance(myid)

The key factory would be able to use the "id" to determine the type of key required.

The second enhancement I need is to do with performance. At present, the Object Registry always generates new instances of objects, which would be inefficient for a key factory that needs to maintain its own internal registry. I therefore need to enhance the registry to support Singletons - key factories need to be cached in the registry so that the same factory can be used by multiple BTree instances.

My final point about the tradeoffs between flexibility and performance is to do with storage structure of keys and log records. I have tried to make the storage structures self describing. This means that the keys and values, as well as the log records, must persist sufficient information to be able to reconstruct their state when required. In a multi-attribute key, for example, this means that each attribute must store its type information, data length, etc. along with the data. This naturally increases the storage space requirement of entities. The benefit is that the system does not require external support to determine how to read data back from persistent storage. For example, the BTree module does not require the System Catalalog to be available, it has sufficient information to be able to read and write keys to persistent storage.

Thursday, November 10, 2005


In a major refactoring exercise I have split the SimpleDBM modules into two higher level packages. The Latch Manager, the Object Registry and the Util packages all go under the package org.simpledbm.common. The rest of the packages go under org.simpledbm.rss. I decided to use the acronym RSS for the low level data management API in honour of System R. System R called its low level API Research Storage System or RSS in short. See the paper Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, Vera Watson: System R: Relational Approach to Database Management. ACM Trans. Database Syst. 1(2): 97-137(1976).

Of course, all this refactoring has left CVS in a mess, as now there are many empty subdirectories under src/org/simpledbm.

I have also been working on a Developer's Reference Manual. This will hopefully contain sufficient information for interested hackers and database technology enthusiasts to play around with various modules. You can access the document here. I would welcome any feedback.

Finally, the SimpleDBM project has now graduated out of the incubator at This should lead to greater exposure and interest in the project.

Monday, November 07, 2005

BTree Implementation Progress Report Part 4

BTree scans are now available as well. This means that the BTree implementation is feature complete, although, there are still some areas that need more work.

For the rest of November, I am going to concentrate on improving the code, refactoring bits that I don't like, updating documentation, and generally cleaning up the code. All this in preparation for a release of the code towards the end of the month.

From December onwards, work can start on the remaining bits in the Data Manager layer of SimpleDBM, i.e, tables. I have not yet decided whether knowledge about data types should be part of this layer or whether it is best left to a higher-level layer. I am tempted to keep the data layer as low level as possible; this will make it more reusable.

Wednesday, November 02, 2005

BTree Implementation Progress Report Part 3

Both insert key and delete key operations are now available. More test cases are being written; in order to test lock conflicts when the same key is concurrently inserted/deleted by different transactions, the new test cases have to use multiple threads. It is harder to debug such test cases, but Eclipse makes it easy. I simply run the JUnit tests in debug mode and set break points at appropriate places.

By end of this week I shall be working on the BTree scan operations. There aren't many research publications that deal with the inner workings of an index scan; the only paper that seems to discuss a real implementation is the paper by Mohan, C. An Efficient Method for Performing Record Deletions and Updates Using Index Scans, Proc. 28th International Conference on Very Large Databases, Hong Kong, August 2002. The paper discusses many of the performance issues with implementing scans. My implementation will suffer from the problems Mohan describes, as I want to do a basic implementation first before looking at optimisations.

To ensure that the BTree implementation is tested thoroughly, I have started to use the code coverage tool Clover. The vendor has graciously provided me a free license to use this tool. Using this tool I am able to determine which code paths have not been tested, and then write test cases to exercise those.

Friday, October 28, 2005

BTree Implementation Progress Report Part 2

Currently I am busy testing the code for inserting keys into the BTree. My testing methodology is:

  1. Write a JUnit test case.
  2. Debug the code, stepping through the entire test.
  3. Fix any bugs and retest until test runs okay.

Out of the 16 test cases written so far, only 4 are on insert - with many more to come.

To facilitate testing I need to construct trees with specific properties. The way I have chosen to do this is to specify the data in an XML file and then load it to create the desired tree. Just goes to show the amount of effort required to construct proper test cases. My current estimate is that I won't have a fully tested BTree implementation until end of November.

Thursday, October 13, 2005

BTree Implementation Progress Report

I have been working hard on getting the BTree implementation completed. It is about 75% complete, however, delivery will take longer as I need to ensure that adequate test cases are available to verify the correctness of the algorithms. Already, this is probably the largest module. It will also be one of the most complex.

As mentioned before, I am implementing the B-link tree algorithm as published in the paper Ibrahim Jaluta, Seppo Sippu and Eljas Soisalon-Soininen. Concurrency control and recovery for balanced B-link trees. The VLDB Journal, Volume 14, Issue 2 (April 2005), Pages: 257 - 277, ISSN:1066-8888. I think this will probably be one of the few BTree implementations that fully supports merging of pages when keys are deleted.

I have written a short paper discussing some space management issues in the implementation. The paper can be found here.

Sunday, October 02, 2005

Database Technology camps

Mainstream Relational Database technology is split into two camps.

On the one hand are the systems that are directly or indirectly based upon techniques and algorithms derived from IBM's System R project. Most published research also falls into this group.

In the second camp are systems that have been inspired by Oracle. Oracle has a very interesting history. Although Oracle started with the specifications of System R, it actually did not use any of the System R ideas. A good summary of how Oracle evolved can be found here.

SimpleDBM is based upon the System R heritage. The main reason for this is that I do not work in a company that makes DBMSes. Hence I do not have access to information other than published research. Since Oracle does not publish details of its algorithms, it would be much harder for me to implement a system based upon Oracle. I have, however, given much thought to some of the techniques used in Oracle, and hope to discuss some of these in future blogs and papers.

Friday, September 30, 2005

Design choices - flexibility versus performance

One of the design choices I made in SimpleDBM is to build the system as a collection of modules, and keep the modules loosely coupled via interfaces. There is also a hierarchy within the modules; some modules come before others, and provide services used by other modules. It is very important to enforce this hierarchy, because otherwise, the dependencies between the modules would become cyclical, making it much harder to replace a module.

A benefit of choosing this design is flexibility. For example, although there are different types of Log records in the system, the Log Manager has no knowledge of them. Similarly, the Transaction Manager interacts with disk pages without knowing much about the structure of those pages. Even the Buffer Manager, which is responsible for caching disk pages in memory, actually knows nothing about disk pages. Therefore, it is possible to implement new page types without impacting the Buffer Manager. As a final example, the module I am working on currently, the BTree manager, has no knowledge about the types of index keys and rowids it has to handle. It relies upon a couple of externally provided factory classes to generate, and manage these objects. I am therefore able to develop the BTree module in advance of defining the table row structure, the field types or even the field operations. From the BTree module's perspective, all that is needed is that the keys should be sortable and support the form of serialization that is defined by SimpleDBM.

The downside to this flexibility is increased overhead. There is extra memory and processing overhead, as various object types must be created and managed dynamically using reflection. There is extra overhead in maintaining type information and looking up type information. For instance, when the log records are retrieved during restart recovery, and we need to replay the changes that have occured within a BTree page, we need to first obtain instances of the index key generation factory and the rowid generation factory. The log records must save extra type information to allow this.


One constraint I have placed upon myself when coding is that I will not start work on a new module until I have created and executed test cases for the module I am currently working on. This slows me down, but the payoff is tremendous.

Writing test cases for a module often takes more time than coding the module itself. I think that this is not unusual, and any project that is serious about testing must factor in additional time and effort for this activity. In the company I work for, we never write automated tests. All testing is carried out by running the program, feeding input and checking output manually. This form of testing is important as well, but totally inadequate. A unit test written by the developer can test aspects that a user or operator can never have access to.

Of course, all test cases are not created equal. Often we hear about projects that use JUnit based testing framework, but this in itself does not prove how good the testing is. There are several things to look for:
  1. Test coverage - how much of the program has been subjected to testing.
  2. Test scenario - a program is meant to be used in many different scenarios. How many of the possible scenarios have been tested.
  3. Test data - has the program been tested for variations in data?

For a project like SimpleDBM, testing is additionally complicated by two things:

  1. Liberal use of threads and multi-threading constructs complicate testing.
  2. Recovery operations need the system to be crashed in various ways.

It is not always possible to write test cases that are non-intrusive. Sometimes it is necessary for the program being tested to cooperate with the test framework. For example, in the BTree split page test case, I need to be able to crash the system at specific points in order to test recovery. To do this, I use flags within the BTree module that the test framework can manipulate to trigger this behaviour. For normal operations, the flags can be switched off.

Thursday, September 15, 2005

Page oriented redo/undo versus Logical Undo

ARIES uses the term page-oriented redo/undo to describe logging that is always applied to a specific page. This means that if some page P1 generated a undo/redo log record L1, then during redo, the redo portion of L1 will be applied to P1, and during undo, the undo portion of L1 will be applied to P1.

The term Logical Undo is used to describe logging where the undo action may not always be applied to the page that generated the log record. However, Logical Undos can still be page-oriented. The difference is that the undo action may be applied to a different page. It is also possible for a logical undo operation to modify more than one page, and generate additional log records.

Since a Logical Undo can be page-oriented, the term page-oriented undo is ambiguous.

Monday, September 05, 2005

In favour of Checked Exceptions

I said in a previous post that I favour Checked Exceptions over Unchecked ones. I will try to explain my reasons for doing so.

A good discussion of the problems with Checked Exceptions is in this article by Brian Goetz. The main problems can be summarized as:
  1. Unnecessary exposure of implementation details.
  2. Unstable method signatures.
  3. Unreadable code.
  4. Exception swallowing.
  5. Too much Exception wrapping.

Problems 1 and 2 are caused by poor design. By ensuring that exceptions are thrown at the right level of abstraction, these problems can be avoided. Of course, this means greater effort by the programmer, and also leads to problem 5 - ie - too much exception wrapping.

As for problem 3, I think that any piece of code that has proper error handling will always be less readable than code that has no error handling. This is the reason why so much text book code has very little or no error handling. In a production system, unlike in text books, readability is important, but proper error handling is even more important.

Since so many articles talk about the problems with Checked Exceptions, perhaps a novel approach to this issue is to look at the problems caused by Unchecked Exceptions. These, in my view, can be summarized as:

  1. Lack of information in method signatures. So that by looking at a method signature, you cannot tell what exceptions are likely to be thrown.
  2. Java compiler will not warn you if you ignore exceptions. Therefore, rather than thinking about every possible exception, you will be blissfully unaware of potential exceptional conditions. The result is likely to be an unstable system; one that crashes with indecipherable errors.
  3. Loss of information. Since information about exceptional conditions is fairly local within a system, when an exception at a lower level is caught and wrapped at a higher level, there is potential for additional information to be supplied to diagnose the problem. Since unchecked exceptions pass through silently, all you will get is the cryptic error at the lowest level of abstraction with very little contextual information other than the stack trace.
  4. Exception swallowing. Let us not forget that the most common method of swallowing Exceptions, ie, catching all Exceptions, will also swallow Unchecked Exceptions. I have seen programmers swallowing Exceptions because they find this an easy and simple way to fix that annoying NullPointerException.
  5. Whether you like it or not, Java uses Checked Exceptions everywhere in its libraries. Therefore, even if you throw Unchecked Exceptions, chances are that programmers will still swallow Exceptions, because they still need to deal with the Checked Exceptions thrown by Java libraries.
  6. As it is, programmers do not like to think about error handling. To encourage the use of Unchecked Exceptions is to encourage this behaviour, as there will be even less inclination to think about exceptional conditions.

If you are still not convinced, read this excellent interview with Bruce Lindsay, where he talks about the type of error handling that goes into building a DBMS.

Friday, September 02, 2005

BTrees and Logical Undos

Certain BTree operations require Logical Undo to be performed during rollbacks and restart recovery. This is because, in a BTree, keys can move from one page to another, and therefore an undo operation may have to be applied to a different page from the one where the original action took place.

To take an example, suppose transaction T1 deletes the key D from page P1. Another transaction T2 comes along and inserts the key B into page P1. This causes a page split, after which, all keys greater than B end up in a new page P2. Now, if the first transaction T1 aborts, then it needs to reinstate the deleted key D. However, if it looks at page P1, it will find that the deleted key no longer belongs there. Therefore it must locate the new page P2 and perform the undo action there.

There are a couple of design decisions that affect the complexity of the logical undo actions:

If the BTree is designed to physically remove keys when they are deleted, then undo actions require the key to be physically re-inserted. This means that during the undo of a key delete, page splits can occur if the insert cannot be performed due to the page being full. The splits can recurse up the tree causing more page splits.

The advantage of the physical removal option is that the BTree never contains data that is superfluous. If a large number of rows are deleted, the BTree will be immediately compacted, as the pages used by deleted keys would be removed. Keeping the BTree dense and small is good for performance as fewer pages need to be scanned during traversals.

The other approach is to use logical key deletes, where the key is not physically removed but merely marked as deleted. To undo the delete, only the mark needs to be reset. Logical deletes allows undo operations to proceed without page splits, and therefore Logical Undos can be page oriented.

The problem with logical deletes, of course is that these keys must be garbage collected at some point and key traversals need to obtain instant locks on the deleted keys to determine whether they are truly deleted or not. The presence of a large number of deleted keys can also make the BTree less dense, and increase the path for traversals.

It seems to me that the logical delete key approach is more suited to a scenario where there are frequent rollbacks.

The issue of logical deletes is not discussed at any great length anywhere. There is a brief discussion of this in Mohan's paper Commit_LSN: A Novel and Simple Method for Reducing Locking and Latching in Transaction Processing systems.

A BTree implementation that uses logical key deletes is in Apache Derby.

Thursday, September 01, 2005

Space Management

This week I am working on the Space Manager module which will handle page allocation in containers. This module is not so exciting, it is tedious work with lots of little details all of which have to be right for it to work correctly. While working on this, I did have some interesting experiences though.

I am a firm believer in Checked Exceptions. I think Unchecked Exceptions should be used very very rarely, and only after considerable thought. It seems to me that those who champion Unchecked Exceptions are primarily looking for clean code that is aesthetically pleasing and is not cluttered with a lot of error handling code. While this may be desirable for improving the readability of the code, correct error handling is more important than improving readability, especially in any mission critical system. Unchecked Exceptions have the big problem that the Java compiler does not flag it if you ignore such exceptions. Therefore, you can write code that happily ignores exceptions, and if this is part of a large system, believe me, bad things will happen.

I will write more on this subject in a separate post. The reason I brought this up is that in SimpleDBM, almost all exceptions are checked exceptions, and while implementing the Space Manager module, I encountered a problem that would not have occurred had I used unchecked exceptions. The Space Manager module is transactional, and therefore needs to implement the transactional redo/undo interfaces defined by the Transaction Manager. While implementing these interfaces, I suddenly found that because I was using checked exceptions, the existing redo/undo interfaces did not allow the Space Manager to throw exceptions that are unknown to the Transaction Manager. At first I thought that maybe I should expand the interfaces to allow this, but then, I realized that doing this would break the modular structure of the system, as it would create a cycle in the dependency graph between the two modules. Also, the Transaction Manager has to be able to manage any number of transactional modules, and therefore must not know about the specifics of any module. If I allowed the TM to know about exceptions thrown by the Space Manager, then it would mean that the same would have to be done for every other module that is implemented in future!

The solution to this problem is to specify the redo/undo interfaces in such a way that modules that implement these interfaces can wrap any undeclared exceptions in a specific exception that is meant to wrap such exceptions. This way, we can continue to use and benefit from checked exceptions, without having to resort to unchecked exceptions.

Sunday, August 28, 2005

More on Transaction Manager

After many hours of debugging, I finally have a version of the Transaction Manager that successfully handles the BitMgr Test. The BitMgr is modeled after the OneBit Resource Manager described in Transaction Processing: Concepts and Techniques. It uses a single data page, containing a set of bits that can either be on or off. Despite the simple design, the One BitMgr allows many of the features of a transactional system to tested.

Although the ARIES algorithm is described in great detail by its inventors, there are still a few areas that are alluded to but not fully described in the paper. For example:

  1. A post commit action is something that must be done after a successful commit. An example of a post commit action is the deleting of a container after it has been dropped. Since deleting a container is not recoverable, it is necessary to defer this action until it is definitely known that the transaction will commit. The challenge is how to ensure that post commit actions are properly executed, despite failures that may disrupt the commit processing.
  2. ARIES uses the LSN within a page to track the status of an update. This does not map very well to actions that are not necessarily related to a page, and also when they need to be redone unconditionally at system restart. As an example, when a container is created, the system should log it in such a way that at restart the container is recreated if it was not successfully created before. The way to do this is to maintain the status of the container in a page. However, this page also must reside in some container, so we have a recursive situation. Another example is the action of opening a container, which must be redone at system restart unconditionally.
  3. If a transaction has rolled back to a Savepoint, then any locks acquired after the Savepoint was established can be released. However, how can we determine which locks are safe to delete? ARIES suggests using the LSN as the Savepoint marker, however, this cannot be used to determine which locks are safe to remove. For instance, read locks are not related to LSNs.
  4. The transaction manager interacts with many other modules, and it is a challenge to ensure high concurrency, and deadlock avoidance between it and other modules.

Tuesday, August 23, 2005

SimpleDBM's Transaction Manager

This week I am working on the Transaction Manager for SimpleDBM. The Transaction Manager is based upon the well known algorithm known as ARIES. There is a freely available paper at ARIES inventor C.Mohan's website that describes ARIES in quite a lot of detail.

I wrote a short description of ARIES, which can be found at Apache Derby website.

ARIES is based upon the recovery algorithms used in IBM's original Database 2 (DB2) product. A good description of the DB2 recovery method can be found in the paper Data Recovery in IBM Database 2, by R.A.Crus. The main contribution of ARIES is that during system restart, failed transactions are also redone, as opposed to redoing only active/committed transactions. This simplifies the subsequent recovery.

Confusion with terminology and Latch implementation

The terms Lock and Log have very specific meaning in a DBMS. A DBMS Lock is a dynamically allocated synchronisation primitive, supporting multiple lock modes, used to manage concurrent access to Table rows. Locks are allocated as part of a transaction and retained until the transaction commits. Unfortunately, the term Lock is also used in a programming language like Java to represent Monitors and Mutexes, for example, Java 5.0 uses names like ReadLock, WriteLock, ReentrantLock, and ReentrantReadWriteLock. In DBMS terms, these are not locks. The preferred DBMS term for such primitives is the Latch.

In a DBMS, the term Log means the Write Ahead Log, used for logging all changes within the DBMS. To confuse matters, programming languages uses logging toolkits for generating trace messages. A well-known example is the Log4J toolkit.

Today, I came across a paper called Concurrency control and recovery for balanced B-link trees, by Ibrahim Jaluta, Seppo Sippu, and Eljas Soisalon-Soininen. I am pretty excited about this paper as I am going to work on the BTree implementation for SimpleDBM soon, and have been considering various alternative algorithms. I notice that this paper assumes that Latches support three different lock modes - Shared, Update and Exclusive. Another requirement is to support upgrades from Update to Exclusive and downgrades from Exclusive to Update latches. These requirements have led me to create my own Latch implementation, as the Java ReentrantReadWriteLock I have been using until now only supports Shared and Exclusive lock modes. My implementation uses the new LockSupport class available in Java 5.0, which allows a Thread to be safely suspended and then resumed. I toyed with the idea of building my Latch implementation using the AbstractQueuedSynchronizer class, but must admit that the working of this class is beyond my comprehension.

Sunday, August 21, 2005

Fine grained locking and Linked Lists

The Buffer Manager in SimpleDBM uses a number of different locks in order to obtain fine-grained locking. The assumption is that with fine-grained locking, concurrency will be higher and the system will scale better with larger volumes and greater number of CPUs. However, whether this is actually achieved depends largely on the efficiency of the locks. It would be interesting to create a different version of the Buffer Manager that uses coarse locking, and then compare the performance of the two. I suspect that the benefits of fine-grained locking will only show when the buffer pool is large.

Fine grained locking does have couple of issues. It makes your code more complicated and difficult to maintain and debug. It also increases memory usage. To give you an example, in the Buffer Manager, we have following locks.

  1. A ReentrantReadWriteLock to protect access to LRU chain.
  2. A ReentrantReadWriteLock for each cached page.
  3. A ReentrantLock for each Buffer Control Block.
  4. A ReentrantReadWriteLock for each bucket in the Hash Table.

If we have a buffer pool of 500 pages, and a hash table of 193 buckets, then the total number of lock objects would be:

1 + 500 + 193 = 694 ReentrantReadWriteLocks
plus 500 ReentrantLocks.

I hope that the java.util.concurrent.lock package is upto this type of usage.

One problem in the Buffer Manager implementation is the single global LRU lock. It would be a good idea to split the global LRU list to multiple lists, each protected by its own lock.

Linked Lists

Linked Lists are used heavily in the Buffer Manager. The LRU list is a linked list. Each of the hash buckets has its own linked list.

The problem with the standard linked lists is that the remove operation requires a search within the list. This did not strike me until recently, when I was considering how to implement LRU placement hints. The larger the list grows, the longer it will take to search for the item to be removed. Since hash chains are expected to be small, maybe it is not an issue there. The LRU chain, however, is expected to be long. A possible solution (ugly) would be to implement some custom linked list code for the LRU list. Sigh.

Friday, August 19, 2005

Buffer Manager implementation

Just completed the initial port of the Buffer Manager. Also checked in all the source code in the UNSTABLE_V100 branch.

Sunday, August 14, 2005

Buffer Manager coming soon

This week I am working on porting the Buffer Manager module. I expect to upload the code for it by the end of the week.

Why Java?

I had started the DBMS project DM1 back in 2002 using C++. I said then that I had chosen to code DM1 in C++ because of lack of certain features in Java, such as pointers and efficient array handling. It seemed to me that a DBMS needed a language that allowed full control of memory.

SimpleDBM is a pure Java project. So what has triggered the change from C++ to Java?
  1. Firstly, I no longer think that in the scale of things, absolute control of memory is that important, unless the DBMS is being targetted at an embedded environment and must use as little memory as possible. The trade-off is much simpler memory management. In SimpleDBM, I rely completely on Java's garbage collection facilities.
  2. My productivity in Java is far higher. This is not only due to a better designed language, but also because of the availability of free IDEs such as Eclipse, which make development so much faster and pleasurable.
  3. I tried using the C++ try/catch construct and destructors for proper cleanup of resources. My experience is that it is unworkable in large projects. Java's concept of finally block combined with garbage collection of memory, is a better and neater option. The finally block provides control of the resource cleanup, whereas C++ destructors don't.
  4. The biggest problem I had with Java was the lack of adequate primitives for concurrency control. The default synchronisation mechanisms were simply not adequate for the type of fine-grained concurrency I wanted to have in the DBMS. Fortunately, due to wonderful efforts of Doug Lea and others, Java 5.0 has new concurrency packages that solve my problem. There are still a few issues, I think, but on the whole, if I had to pick the single most important factor in enabling my shift to Java, I would say this is it.
  5. Java is simply a better language than C++. Why I think this is so will have to the subject of another post, but in a few words, Java's strengths are smaller and simpler language, support for interfaces (a feature that was proposed for C++ and rejected), built-in support for Threads and concurrency, garbage collection, finally blocks, dynamic loading and binding, checked Exceptions.
  6. A testamount to the superiority of Java is that the ported version is superior to the original C++ version, with better separation of modules, greater concurrency.


Welcome to the official weblog for the SimpleDBM project. SimpleDBM is a new project that aims to implement relational database manager in Java. Please visit the SimpleDBM website for further details about the project.