WAL
<i>Aborted effort.%%%
Close all that you have worked on.%%%
You ask far too much.</i>

An [Acronym] for __W__rite-__a__head __l__og.

This is the mechanism used by [PostgreSQL] to guarantee ''durability'' in [ACID]-compliant transactions; without it, durability cannot be guaranteed. The [WAL] is divided into ''segments''. After a configurable number of filled segments, a ''checkpoint'' is created. These terms are explained below. First, however, it is important to understand what the [WAL] actually does.

!! The [WAL] concept

In order to provide durability, the database must be able to correctly recover, after a crash, all the changes made by committed transactions. Further, previously committed data must not be corrupted. The significant errors that can occur during a crash are:

* Filesystem metadata inconsistency
* Missing writes (data did not get written at all)
* Partial writes, or so-called 'torn pages' (only some part of a data page was written), which in turn may cause:
* A disk [I/O] error (the disk's internal sector [CRC] no longer matches)
* Corrupted data (perhaps the disk attempted some sector remapping or [CRC] correction)
* A data page with a mixture of old and new content
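
The 'torn page' failure mode is easy to picture with a toy simulation. This is plain Python, not PostgreSQL code; the 8kiB page and 512-byte sector sizes are illustrative assumptions (at best, disks write atomically per sector):

```python
PAGE_SIZE = 8192   # PostgreSQL's default page size (assumption for this sketch)
SECTOR = 512       # a typical disk sector size

def torn_page(old_page: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Simulate a crash partway through an in-place page overwrite:
    only the first `sectors_written` sectors of the new image reach disk."""
    cut = sectors_written * SECTOR
    return new_page[:cut] + old_page[cut:]

old = bytes([0xAA]) * PAGE_SIZE   # previously committed page contents
new = bytes([0xBB]) * PAGE_SIZE   # page contents after the update

page = torn_page(old, new, sectors_written=7)
# The on-disk page now mixes new and old content: a torn page.
print(page[:7 * SECTOR] == new[:7 * SECTOR])   # first 7 sectors are new
print(page[7 * SECTOR:] == old[7 * SECTOR:])   # the remainder is still old
```

Neither half of the page is wrong by itself, which is exactly why the disk may report no error at all while the page as a whole is inconsistent.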

Using fsync(2) on the modified data files does not provide the required durability. During fsync(2), a partial write may still occur, corrupting ''previously committed data''. This is a very real possibility under heavy [I/O] combined with an [OS] crash or power outage. Furthermore, if a transaction touched many different data files, every one of them must be flushed with fsync(2), resulting in very poor performance.

The [WAL] solves these problems:

* All updates are logged sequentially into the same file. Sequential write performance of modern disks is very good, and fsync(2) needs to be called only once.
* The [WAL] records are guaranteed to be on disk before the data files are written.
* The original database page is logged occasionally, and subsequent modifications to that base page are logged incrementally; thus any partial write to the main data files can be recovered correctly.
* The data files can be written out on an as-needed basis, and the [OS] is given great liberty in how it writes them out.
* The [WAL] content itself is protected by an application-level [CRC] computed by [PostgreSQL].

It is important to understand that before any data page can be written back to the FileSystem, the appropriate [WAL] records for its changes must be on disk. When evicting a page from the shared buffers, [PostgreSQL] ensures that the [WAL] record for the page's last update is actually on disk.
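
The mechanism can be sketched as a toy log. This is a simplified illustration, not PostgreSQL's actual record format; the record layout and method names are invented for the example:

```python
import os
import struct
import zlib

class MiniWAL:
    """Toy write-ahead log: records are appended sequentially to a single
    file, each protected by an application-level CRC, and made durable by
    one fsync(2). Not PostgreSQL's real record format."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.lsn = self.f.tell()     # log position = byte offset into the log
        self.flushed = self.lsn      # how far the log is known to be durable

    def append(self, payload: bytes) -> int:
        # Each record carries its length and a CRC, so a torn tail of the
        # log can be detected and discarded during recovery.
        rec = struct.pack("<II", len(payload), zlib.crc32(payload)) + payload
        self.f.write(rec)
        self.lsn += len(rec)
        return self.lsn              # position just past this record

    def commit(self):
        self.f.flush()
        os.fsync(self.f.fileno())    # one sequential fsync, not one per data file
        self.flushed = self.lsn

    def may_evict_page(self, page_last_lsn: int) -> bool:
        # The WAL-before-data rule: a dirty page may leave the buffer
        # cache only once the record of its last update is safely on disk.
        return page_last_lsn <= self.flushed
```

A buffer manager built on this would call `append()` when a page is modified, remember the returned position on the page, and consult `may_evict_page()` before writing the page back.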

!! Segments

The [WAL], conceptually, is an infinite sequence of blocks, numbered starting at zero. In [PostgreSQL] the default block size is 8kiB, which is also used for all heap and index data file pages. The first block is created when the database system is initialised at install time. To ease [WAL] file management, this infinite sequence of blocks is divided into segments of 2048 blocks each; in the default configuration, this creates 16MiB files. New segments can be created quickly, and old segments can easily be unlinked or moved elsewhere for archival, as required.

The log segments are named according to where in the sequence of the [WAL] they occur.

In principle, if the entire [WAL] from database system initialisation is available, then the exact state of the database at any point in time can be recovered. In practice, the storage requirements for this are prohibitive, and recovery would take an inordinate amount of time in any case.

!! Checkpoints

Checkpoints are one solution to the storage and recovery-time problem. After a configurable quantity of [WAL] activity, all changes made to the entire database are flushed to disk. This operation, called a ''checkpoint'', can take quite a long time; several hundred MiB of data may need to be written out, depending upon the size of the memory caches.

Once a checkpoint has completed, it is guaranteed that all the changes from log segments before the checkpoint are safely on the physical disk. Thus, when recovering from a crash, only the changes since the last fully completed checkpoint need to be replayed. Furthermore, the old log segment files can simply be renamed for reuse as new log segments, providing new segments with effectively no [I/O] overhead. Alternatively, the old log segments can be archived for use as an incremental backup on top of a previous database snapshot.

The first time a data page is modified after a checkpoint, the entire original page is written to the [WAL]. This provides a valid base page for incremental changes to work against. If the page in the main data file is corrupted by a partial write, it does not matter; the correct page can be reconstructed by replaying the [WAL] from the checkpoint.
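
The base-page-plus-deltas recovery idea can be sketched as follows. This is a toy model with invented record tuples, not PostgreSQL's record format:

```python
def replay(pages: dict, wal_records: list) -> dict:
    """Replay WAL records from the last checkpoint over possibly-torn pages.
    The first touch of a page after a checkpoint logged a full image, so a
    corrupt on-disk page is simply overwritten before deltas are applied."""
    for rec in wal_records:
        if rec[0] == "image":                 # ("image", pageno, page_bytes)
            _, pageno, data = rec
            pages[pageno] = bytearray(data)
        else:                                 # ("delta", pageno, offset, bytes)
            _, pageno, off, data = rec
            pages[pageno][off:off + len(data)] = data
    return pages

# A torn page on disk: its garbage content does not matter, because
# recovery rewrites the whole page from the logged base image first.
disk = {7: bytearray(b"GARBAGE!" * 4)}
wal = [
    ("image", 7, b"\x00" * 32),    # full base image from first post-checkpoint touch
    ("delta", 7, 4, b"\xff\xff"),  # later incremental change to the same page
]
recovered = replay(disk, wal)
print(bytes(recovered[7]) == b"\x00" * 4 + b"\xff\xff" + b"\x00" * 26)   # True
```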

[PostgreSQL] provides the <tt>CHECKPOINT</tt> [SQL] command to force a checkpoint to occur.

!! See also

* [PostgreSQLNotes] for how to tune the [WAL]
* PgBench for some benchmarking data
* [Employing the WAL to implement on-line backups | http://www.postgresql.org/docs/8.0/static/backup-online.html] (as of Pg 8.0)
