Monday, September 28, 2009
On Optimistic Methods for Concurrency Control
In this paper, the authors present two methods of non-locking concurrency controls for DBMS: serial and parallel validation. These methods are “optimistic” because they rely on the hope that conflicts will not occur. The authors first make it clear that locking approaches to concurrency control have numerous disadvantages (locking overhead, deadlock). The argument for “optimistic” concurrency control is as follows: reads are completely unrestricted since they can never cause a loss of integrity, and writes are restricted. A transaction consists of two or three phases: a read phase, a validation phase, and a possible write phase. These methods may be superior to locking methods for systems where transaction conflict is highly unlikely, such as query-dominant systems and very large tree indexes. Such an optimistic system would be inefficient where the transaction conflict is not rare. The paper suggests that a system should vary the amount of locking versus optimistic approaches as the likelihood of transaction conflict in the system varies. However, the authors do not delve into such an analysis.
Granularity of Locks and Degrees of Consistency in a Shared Data Base
This paper discusses the granularity of locks in a data base management system: small granularity means high concurrency but high overhead, large granularity means low overhead but low concurrency. A fine lockable system is preferable for a simple transaction which accesses few records and a coarse lockable system is preferable for a complex transaction which accesses many records. Therefore, it is favorable to have lockable units of different granularities coexisting in the same system. This can be accomplished with the use of hierarchical locks, which allow one to lock a node and all of its descendants. Various access modes can be used to ensure that two lock requests by two different transactions are compatible (can be granted concurrently). These ideas can be generalized to work for directed acyclic graphs (DAG), which are really just more general than trees. We also have to remember that indices and files are created and destroyed continually and so the lock graphs are dynamic. While there are issues that must be considered to ensure correctness of this added complication to a DBMS, it is worthwhile due to the need for the performance gain that will result.
Tuesday, September 22, 2009
Experience With Processes and Monitors in Mesa
This paper discusses how to deal with threads (processes running concurrently within a process). This is the period 30 years ago when concurrency was increasing in programs. This paper tried to tackle the issues involved with using monitors to synchronize these threads. The authors chose to synchronize processes with preemptive scheduling of light-weight processes and monitors: need to do I/O, allow for multiprocessors, allow information hiding between modules, more structured locking.
This paper made an attempt to work out the subtle issues involved in implementing a system running many light-weight processes concurrently. Today, the layer of the system that allows for such concurrency functions effectively and allows us to develop many complex systems with very large concurrency.
This paper made an attempt to work out the subtle issues involved in implementing a system running many light-weight processes concurrently. Today, the layer of the system that allows for such concurrency functions effectively and allows us to develop many complex systems with very large concurrency.
Wednesday, September 16, 2009
Lightweight Recoverable Virtual Memory
Goal: How simple can a transactional facility be, while remaining a potent tool for fault-tolerance? The authors omitted what they could without crippling RVM.
RVM is intended for Unix applications with persistent data and the paper is presented in three parts: rationale, architecture, and implementation.
Rational: Existing solutions, such as Camelot, were too heavy-weight. The solution was a light version that provides only recoverable virtual memory. The central principle the authors adopted in designing RVM was to value simplicity over generality. The authors followed to systems principle of building simple components that do things well.
Architecture: LRVM is managed in segments as opposed to pages. Regions of the segments are mapped to Unix virtual memory. There is no aliasing: mappings cannot overlap in virtual memory and no region of a segment may be mapped more than once by the same process. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. The designers focused on using meta-data to minimize the disk space that recoverable memory would occupy. Many other small features were also omitted.
Implementation:
Log Management: No undo/redo since value logging strategy because it never reflects uncommitted changes to an external data segment. Crash recovery is idempotent by being done last. Log truncation is also handled.
Optimizations: Intra-transaction optimizations arise when set-range calls specifying identical, overlapping, or adjacent memory addresses are issued within a single transaction. Inter-transaction optimizations occur only in the context of no-flush transactions.
Result: RVM handles more transaction/sec than Camelot in all cases. However, the performance significantly drops when recoverable memory size approaches physical memory size.
Analysis: A lightweight RVM can provide the necessary functionally, while improving performance and modularity. This lesson can be applied in general when building a complex system. Additional features can be added later as components built on top of the system. While this paper did not demonstrate this idea, complex distributed systems today, such as Google, are built with layered component blocks providing a given functionality.
RVM is intended for Unix applications with persistent data and the paper is presented in three parts: rationale, architecture, and implementation.
Rational: Existing solutions, such as Camelot, were too heavy-weight. The solution was a light version that provides only recoverable virtual memory. The central principle the authors adopted in designing RVM was to value simplicity over generality. The authors followed to systems principle of building simple components that do things well.
Architecture: LRVM is managed in segments as opposed to pages. Regions of the segments are mapped to Unix virtual memory. There is no aliasing: mappings cannot overlap in virtual memory and no region of a segment may be mapped more than once by the same process. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. The designers focused on using meta-data to minimize the disk space that recoverable memory would occupy. Many other small features were also omitted.
Implementation:
Log Management: No undo/redo since value logging strategy because it never reflects uncommitted changes to an external data segment. Crash recovery is idempotent by being done last. Log truncation is also handled.
Optimizations: Intra-transaction optimizations arise when set-range calls specifying identical, overlapping, or adjacent memory addresses are issued within a single transaction. Inter-transaction optimizations occur only in the context of no-flush transactions.
Result: RVM handles more transaction/sec than Camelot in all cases. However, the performance significantly drops when recoverable memory size approaches physical memory size.
Analysis: A lightweight RVM can provide the necessary functionally, while improving performance and modularity. This lesson can be applied in general when building a complex system. Additional features can be added later as components built on top of the system. While this paper did not demonstrate this idea, complex distributed systems today, such as Google, are built with layered component blocks providing a given functionality.
Tuesday, September 15, 2009
Segment-Based Recovery: Write-ahead logging revisited
Large-scale distributed systems have challenges that are not adequately addressed by previous recovery systems, or previous architectures based on single systems in general. This paper proposes segment-oriented recovery to enable distributed recovery architectures that are not hindered by the tight coupling of components required by page-oriented recovery. The distributed variations are quite flexible and enable recovery to be a large-scale distributed service.
Page-oriented recovery leads to a tight coupling between the application, the buffer manager and the log manager. The tight coupling might be fine on a traditional single core machine, but it leads to performance issues when distributing the components to different machines and different cores. Segment-oriented recovery enables simpler and looser coupling among components. Write back caching reduces communication between the buffer manager and application, since the communication occurs only on cache eviction. Since there is no shared state, calls to the buffer manager and log manager can be asynchronous. The use of natural layouts for large objects allows DMA and zero-copy I/O in the local case. In the distributed case, this allows application data to be written without copying the data and the LSNs to the same machine. This allows for very flexible large-scale write-ahead logging as a service for cloud computing.
For small transactions, the networked version is roughly ten times slower than the local versions, but approximately 20 times faster than a distributed, page-oriented approach. As transaction sizes increase, segment-based recovery is better able to amortize network round trips due to log and buffer manager requests, and network throughput improves to more than 400 times that of the page-based approach. As above, the local versions of these benchmarks are competitive with local page-oriented approaches, especially for long transactions.
Analysis: The asynchronous approach is necessary for large-scale distributed systems. Also, recovery based on the granularity of application requests is more in line with the transactions in present large-scale Internet systems. However, the focus on large transaction size may not be valid for the cloud computing that this approach targets. The retrieval of a single web page can require communication with hundreds of small sub-services running on remote nodes. Perhaps some sort of lightweight version (we should consider letting go of some constraints if the application does not require them) would be preferable for the short transactions.
This up to date paper looking at modern challenges was well placed in the course
I hope we have more such updated papers throughout the course.
Page-oriented recovery leads to a tight coupling between the application, the buffer manager and the log manager. The tight coupling might be fine on a traditional single core machine, but it leads to performance issues when distributing the components to different machines and different cores. Segment-oriented recovery enables simpler and looser coupling among components. Write back caching reduces communication between the buffer manager and application, since the communication occurs only on cache eviction. Since there is no shared state, calls to the buffer manager and log manager can be asynchronous. The use of natural layouts for large objects allows DMA and zero-copy I/O in the local case. In the distributed case, this allows application data to be written without copying the data and the LSNs to the same machine. This allows for very flexible large-scale write-ahead logging as a service for cloud computing.
For small transactions, the networked version is roughly ten times slower than the local versions, but approximately 20 times faster than a distributed, page-oriented approach. As transaction sizes increase, segment-based recovery is better able to amortize network round trips due to log and buffer manager requests, and network throughput improves to more than 400 times that of the page-based approach. As above, the local versions of these benchmarks are competitive with local page-oriented approaches, especially for long transactions.
Analysis: The asynchronous approach is necessary for large-scale distributed systems. Also, recovery based on the granularity of application requests is more in line with the transactions in present large-scale Internet systems. However, the focus on large transaction size may not be valid for the cloud computing that this approach targets. The retrieval of a single web page can require communication with hundreds of small sub-services running on remote nodes. Perhaps some sort of lightweight version (we should consider letting go of some constraints if the application does not require them) would be preferable for the short transactions.
This up to date paper looking at modern challenges was well placed in the course
I hope we have more such updated papers throughout the course.
Tuesday, September 8, 2009
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging
ARIES attempts to better address the problem of guaranteeing atomicity and durability of transactions despite concurrent executions of multiple transactions and various failures. ARIES uses a log to record the progress of a transaction and its actions that change recoverable data. The method of logging allows partial or total rollbacks of transactions during both normal and restart processing. A lot of issues were discussed such as latches and locks, fine-granularity locking, and buffer management. The paper goes over the data structures used by ARIES, how ARIES processes transactions, restart processing, and recovery.
There are 3 phases in the Aries recovery algorithm:
-Analysis: Scan the log forward from the most recent checkpoint to identify all Xacts that were active, and all dirty pages in the buffer pool at the time of the crash.
-Redo: Redoes all updates to dirty pages in the buffer pool to ensure that all logged updates are carried out and written to disk.
-Undo: Working backwards in the log, the writes of all Xacts that were active at the crash are undone by restoring the before value of the update.
A crash occurring during the recovery process is also handled.
This paper discussed a detailed system for providing atomicity while still allowing us to have multiple transactions at once and recover from inevitable failures.
ARIES attempts to better address the problem of guaranteeing atomicity and durability of transactions despite concurrent executions of multiple transactions and various failures. ARIES uses a log to record the progress of a transaction and its actions that change recoverable data. The method of logging allows partial or total rollbacks of transactions during both normal and restart processing. A lot of issues were discussed such as latches and locks, fine-granularity locking, and buffer management. The paper goes over the data structures used by ARIES, how ARIES processes transactions, restart processing, and recovery.
There are 3 phases in the Aries recovery algorithm:
-Analysis: Scan the log forward from the most recent checkpoint to identify all Xacts that were active, and all dirty pages in the buffer pool at the time of the crash.
-Redo: Redoes all updates to dirty pages in the buffer pool to ensure that all logged updates are carried out and written to disk.
-Undo: Working backwards in the log, the writes of all Xacts that were active at the crash are undone by restoring the before value of the update.
A crash occurring during the recovery process is also handled.
This paper discussed a detailed system for providing atomicity while still allowing us to have multiple transactions at once and recover from inevitable failures.
The HP AutoRAID Hierarchical Storage System
Management of redundant systems can be complicated. The HP AutoRAID attempts to provide a solution to this problem with a two-level storage hierarchy that is abstracted behind a single disk-array controller. Write-active data are mirrored for better performance while write-inactive data are stored in RAID 5 for better cost. The system begins by allocating mirrored space until no more can be stored this way and automatically reallocates some of the storage space to RAID 5. The HP AutoRAID adapts to the workload changes: newly active data are promoted to mirrored storage while data the have become less active are demoted to RAID 5. The HP AutoRAID also allows for easy disk upgrades of disks of different capacity, which allows for easy expansion and upgrade. The system can even have a second controller that it can automatically change over to in the case of failure of the primary controller. The system is simple to administer and setup. The HP AutoRAID system seems to do a good job of providing performance close to that of nonredundant systems for many workloads. At the same time, it provides full data redundancy and can tolerate failures of any single array component.
This system is a classic example of a system that does achieves its function while abstracting away how the function is carried out. Has this system become obsolete with the trend towards clusters of computers networked together? The cost of storage has gotten so cheap that RAID is not needed in such systems. Redundancy can be achieved by replicating data across multiple computers.
This system is a classic example of a system that does achieves its function while abstracting away how the function is carried out. Has this system become obsolete with the trend towards clusters of computers networked together? The cost of storage has gotten so cheap that RAID is not needed in such systems. Redundancy can be achieved by replicating data across multiple computers.
A History of Improving File System Performance
A Fast File System for UNIX
This paper begins by mentioning the types of applications, VLSI design and image processing, that do a small amount of processing on a large quantity of data and need to have a high throughput from the system. In the traditional UNIX file system, the inodes are not consecutively ordered which results in many disk accesses. To improve the file system throughput, a larger block size was used. But a larger block size results in a larger percentage of wasted space. The design team tried to counter this waste by putting small files in fragments within blocks. When this new file system was tested, there were order of magnitude increases in read and write bandwidth. The price that is paid is wasted disk space and increased CPU utilization. With CPU speed and disk size increasing exponentially and disk bandwidth lagging far behind, this seems like a desirable tradeoff.
The Design and Implementation of a Log-Structured File System
This paper reiterates the fact that processor speed is increasing exponentially, while memory and disk speeds are being left behind. While there are ways to increase disk throughput, there are no foreseeable improvements for access time. Information is spread around the disk, which causes too many disk accesses. So it seems that we should attempt to minimize disk accesses. A log-structured file system (LFS), which writes all modifications to disk sequentially, attempts to accomplish this goal.
A traditional file system achieves logical locality by assuming certain access patterns and pays extra on writes, to organize information optimally on disk for the assumed read patterns. In contrast, LFS achieves temporal locality: information that is created or modified at the same time will be grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, then a LFS should have about the same performance on large files as a traditional file system. If temporal locality differs from logical locality then the systems will perform differently: LFS will perform poorly on reads. This speeds up crash recovery because the system only needs to scan the most recent portion of the log, instead of the entire disk.
The main problem with LFS is degraded performance for logical reads, which is what journaling file systems try to address.
Analysis and Evolution of Journaling File Systems
This paper gives an overview of various journaling file systems in use today. Journaling tries to provide the best o both worlds: data is written to the log sequentially, then moved to is logical location. This provides fast crash recovery, while all normal reads occur from the standard blocks. Journaling is how modern commercial file systems work, such as Unix ext3 and Windows NTFS.
This paper begins by mentioning the types of applications, VLSI design and image processing, that do a small amount of processing on a large quantity of data and need to have a high throughput from the system. In the traditional UNIX file system, the inodes are not consecutively ordered which results in many disk accesses. To improve the file system throughput, a larger block size was used. But a larger block size results in a larger percentage of wasted space. The design team tried to counter this waste by putting small files in fragments within blocks. When this new file system was tested, there were order of magnitude increases in read and write bandwidth. The price that is paid is wasted disk space and increased CPU utilization. With CPU speed and disk size increasing exponentially and disk bandwidth lagging far behind, this seems like a desirable tradeoff.
The Design and Implementation of a Log-Structured File System
This paper reiterates the fact that processor speed is increasing exponentially, while memory and disk speeds are being left behind. While there are ways to increase disk throughput, there are no foreseeable improvements for access time. Information is spread around the disk, which causes too many disk accesses. So it seems that we should attempt to minimize disk accesses. A log-structured file system (LFS), which writes all modifications to disk sequentially, attempts to accomplish this goal.
A traditional file system achieves logical locality by assuming certain access patterns and pays extra on writes, to organize information optimally on disk for the assumed read patterns. In contrast, LFS achieves temporal locality: information that is created or modified at the same time will be grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, then a LFS should have about the same performance on large files as a traditional file system. If temporal locality differs from logical locality then the systems will perform differently: LFS will perform poorly on reads. This speeds up crash recovery because the system only needs to scan the most recent portion of the log, instead of the entire disk.
The main problem with LFS is degraded performance for logical reads, which is what journaling file systems try to address.
Analysis and Evolution of Journaling File Systems
This paper gives an overview of various journaling file systems in use today. Journaling tries to provide the best o both worlds: data is written to the log sequentially, then moved to is logical location. This provides fast crash recovery, while all normal reads occur from the standard blocks. Journaling is how modern commercial file systems work, such as Unix ext3 and Windows NTFS.
System R and Architecture of a DB Systems
System R
This paper talks about the initial origins of System R, goes through
the development of prototypes, and finally evaluates and draws
conclusions. The paper begins by talking about the relational data
model proposed by Codd in 1970. Codd proposed that data should be
represented by data values and never by any connections visible to the
user, and users should not have to specify algorithms when they make a
request. This is the original SQL DBMS. The question was whether the
second item could be achieved well. The paper continues by specifying
the key goals established for System R. The phase zero prototype was
just intended to determine the feasibility of design methods and
functions. Work then began on the phase one prototype. The paper
continues by talking about the details of the phase one prototype.
The evaluation of the phase one prototype was generally good and was
considered to have been successful in its goals: simplicity, power,
and independence of data. System R demonstrated that a production
database system was feasible. There is not much to criticize. This
is the foundation and inspiration for modern database systems. While
not as evolved as today’s production systems, the evolution has been a
progressive march over decades.
Architecture of a DB System
This is actually a good follow-up to the System R paper. System R
gives the original thinking when the first relational DBMS were being
developed. This paper goes over what has been learned/formalized over
the decades: process models, parallel architectures, ACID.
Essentially, this paper is an overview of various design
considerations that have come up in the history of DBMS.
I would be curious if we are going to discuss the implications of
today’s large scale systems. Not just your CAP theorem, but the
responses to these challenges (distributed file systems,
virtualization). Hopefully in a bit more detail than your syllabus
from last year. For example, with virtualization, could we look at
and discuss various approaches being considered to manage and spawn
VMs based upon need. What implications would this have with
distributed storage systems? It is not as simple as spawning VMs of a
stateless application layer.
This paper talks about the initial origins of System R, goes through
the development of prototypes, and finally evaluates and draws
conclusions. The paper begins by talking about the relational data
model proposed by Codd in 1970. Codd proposed that data should be
represented by data values and never by any connections visible to the
user, and users should not have to specify algorithms when they make a
request. This is the original SQL DBMS. The question was whether the
second item could be achieved well. The paper continues by specifying
the key goals established for System R. The phase zero prototype was
just intended to determine the feasibility of design methods and
functions. Work then began on the phase one prototype. The paper
continues by talking about the details of the phase one prototype.
The evaluation of the phase one prototype was generally good and was
considered to have been successful in its goals: simplicity, power,
and independence of data. System R demonstrated that a production
database system was feasible. There is not much to criticize. This
is the foundation and inspiration for modern database systems. While
not as evolved as today’s production systems, the evolution has been a
progressive march over decades.
Architecture of a DB System
This is actually a good follow-up to the System R paper. System R
gives the original thinking when the first relational DBMS were being
developed. This paper goes over what has been learned/formalized over
the decades: process models, parallel architectures, ACID.
Essentially, this paper is an overview of various design
considerations that have come up in the history of DBMS.
I would be curious if we are going to discuss the implications of
today’s large scale systems. Not just your CAP theorem, but the
responses to these challenges (distributed file systems,
virtualization). Hopefully in a bit more detail than your syllabus
from last year. For example, with virtualization, could we look at
and discuss various approaches being considered to manage and spawn
VMs based upon need. What implications would this have with
distributed storage systems? It is not as simple as spawning VMs of a
stateless application layer.
Computer Systems Discussions
I am currently taking Eric Brewer's graduate computer systems class at Berkeley.
CS262a: Advanced Topics in Computer Systems
http://www.cs.berkeley.edu/~brewer/cs262/
CS262a: Advanced Topics in Computer Systems
http://www.cs.berkeley.edu/~brewer/cs262/
Subscribe to:
Posts (Atom)