
Differences between current version and previous revision of WAL.


Newer page: version 5 Last edited on Friday, June 23, 2006 3:55:31 am by AristotlePagaltzis
Older page: version 4 Last edited on Thursday, June 22, 2006 10:59:59 pm by GuyThornley
@@ -1,94 +1,54 @@
- <i>'' Aborted effort.%%% 
+ <i>Aborted effort.%%% 
  Close all that you have worked on.%%% 
- You ask far too much.'' </i> 
+ You ask far too much.</i> 
  
-[WAL] is the [TLA] for the [PostgreSQL] <b>write-ahead log</b>. The write-ahead log is the mechanism used by [PostgreSQL] to guarantee *durability* in [ACID | WikiPedia:ACID] compliant transactions. Without the WAL, it is not possible to guarantee durability.
+An [Acronym] for __W__rite-__a__head __l__og
  
-In [PostgreSQL], the WAL is divided into __segments__. After some configurable number of filled segments, a __checkpoint__ is created. These terms will be explained later. First, however, it is important to understand what the WAL actually does.
+This is the mechanism used by [PostgreSQL] to guarantee ''durability'' in [ACID] compliant transactions. Without it, it is not possible to guarantee durability. The [WAL] is divided into ''segments''. After some configurable number of filled segments, a ''checkpoint'' is created. These terms will be explained later. First, however, it is important to understand what the [WAL] actually does.
+  
+!! The [WAL] concept  
  
-!! The WAL concept  
 In order to provide durability, the database must be able to correctly recover all the changes made by committed transactions before a crash. Further, previously committed data must not be corrupted. The significant errors that can occur during a crash are:
  
-# Filesystem metadata inconsistency  
-# Missing writes (data did not get written)  
-# Partial writes, or so-called 'torn pages' (only some part of a data page was written), which in turn may cause: 
+* Filesystem metadata inconsistency  
+* Missing writes (data did not get written)  
+* Partial writes, or so-called 'torn pages' (only some part of a data page was written), which in turn may cause: 
 * A disk I/O error (the disk's internal sector [CRC] is not correct)
  * Corrupted data (perhaps the disk attempted some sector remapping or [CRC] correction) 
  * A data page with old and new content 
  
-Using <tt>fsync()</tt> on the modified data files does not provide the required durability. During <tt>fsync()</tt> a partial write may still result, corrupting <i>previously committed data</i>. This is a very real possibility under heavy IO and an OS crash or power outage. Furthermore, if many different data files were touched by a transaction, all the data files need to be flushed using <tt>fsync()</tt> and very poor performance is the result.
+Using fsync(2) on the modified data files does not provide the required durability. During fsync(2), a partial write may still result, corrupting ''previously committed data''. This is a very real possibility under heavy [I/O] and an [OS] crash or power outage. Furthermore, if many different data files were touched by a transaction, all the data files need to be flushed using fsync(2) and very poor performance is the result.
  
-The WAL solves these problems:  
-# All updates are logged sequentially, into the same file. Sequential write performance of modern disks is very good, and <tt>fsync()</tt> needs to be called only once  
-# The WAL records are guaranteed to be on disk before the data files are written  
-# The original database page is logged occasionally, and modifications to that base page are logged; thus any partial write to the main data files can be recovered correctly  
-# The data files can be written out on an as-needed basis, and the OS is left great liberty in how it writes them out
-# The WAL content itself is protected by an application level [CRC] performed by [PostgreSQL]  
+The [ WAL] solves these problems: 
  
-It is important to understand that before any data page can be written back to the filesystem, the appropriate WAL records for the changes must be on disk. When removing a page from the shared buffers, [PostgreSQL] ensures the WAL record for the last update is actually on the disk.
+* All updates are logged sequentially, into the same file. Sequential write performance of modern disks is very good, and fsync(2) needs to be called only once.  
+* The [WAL] records are guaranteed to be on disk before the data files are written.  
+* The original database page is logged occasionally, and modifications to that base page are logged; thus any partial write to the main data files can be recovered correctly.  
+* The data files can be written out on an as-needed basis, and the [OS] is left great liberty in how it writes them out.
+* The [WAL] content itself is protected by an application level [CRC] performed by [PostgreSQL].  
+  
+ It is important to understand that before any data page can be written back to the FileSystem, the appropriate [WAL] records for the changes must be on disk. When removing a page from the shared buffers, [PostgreSQL] ensures the [WAL] record for the last update is actually on the disk.
  
 !! Segments 
-The WAL, conceptually, is an infinite sequence of blocks which are numbered starting at zero. In [PostgreSQL] the default block size is 8kiB, and is also used for all heap and index data file pages. The first block is created when the database system is initialised at install time. To ease WAL file management, this infinite sequence of blocks is divided into segments which are 2048 blocks long. In the default configuration, this creates 16MiB files. New segments can be created quickly, and old segments can easily be unlinked or moved elsewhere for archival, as required.  
  
-The log segments are named according to where in the sequence of the WAL they occur.
+The [WAL], conceptually, is an infinite sequence of blocks which are numbered starting at zero. In [PostgreSQL] the default block size is 8kiB, and is also used for all heap and index data file pages. The first block is created when the database system is initialised at install time. To ease [WAL] file management, this infinite sequence of blocks is divided into segments which are 2048 blocks long. In the default configuration, this creates 16MiB files. New segments can be created quickly, and old segments can easily be unlinked or moved elsewhere for archival, as required
  
-In principle, if the entire WAL from the database system initialisation is available, then the exact state of the database at any point in time can be recovered. In practise the storage requirements for this are probably infeasible, and recovery would take an inordinate amount of time in any case. 
+The log segments are named according to where in the sequence of the [WAL] they occur.  
+  
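 As an illustration of the naming (assuming a default 8.x installation, where each name is 24 hexadecimal digits encoding the timeline, log file and segment number; the values shown are hypothetical), the files in <tt>pg_xlog</tt> look something like:
 
 <tt>000000010000000000000041</tt>%%%
 <tt>000000010000000000000042</tt>%%%
 <tt>000000010000000000000043</tt>%%%
 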
+ In principle, if the entire [WAL] from the database system initialisation is available, then the exact state of the database at any point in time can be recovered. In practice the storage requirements for this are probably infeasible, and recovery would take an inordinate amount of time in any case.
  
 !! Checkpoints 
-Checkpoints are one solution to the storage and recovery time problem. After a configurable quantity of WAL activity, all changes made to the entire database are committed to disk. This operation, called a __checkpoint__, can take quite a long time; several hundred MiB of data may need to be flushed to disk depending upon the size of memory caches.  
  
-Once a checkpoint has completed, it is guaranteed that all the changes from log segments before the checkpoint are correctly on the physical disk. Thus, when recovering from a crash, only the changes from the last fully completed checkpoint need to be replayed. Furthermore, the old log segment files can be easily renamed for use as new log segments; this allows new log segments with effectively no I/O overhead. Alternatively, the old log segments could be archived for use as an incremental backup from a previous database snapshot.
+Checkpoints are one solution to the storage and recovery time problem. After a configurable quantity of [WAL] activity, all changes made to the entire database are committed to disk. This operation, called a ''checkpoint'', can take quite a long time; several hundred MiB of data may need to be flushed to disk depending upon the size of memory caches.
  
-The first time a data page is modified after a checkpoint, the entire original page is written to the WAL. This provides the valid base page for incremental changes to work against. If the page in the main data file is corrupted by a partial write, it does not matter; the correct page can be reconstructed by replaying the WAL from the checkpoint.
+Once a checkpoint has completed, it is guaranteed that all the changes from log segments before the checkpoint are correctly on the physical disk. Thus, when recovering from a crash, only the changes from the last fully completed checkpoint need to be replayed. Furthermore, the old log segment files can be easily renamed for use as new log segments; this allows new log segments with effectively no [I/O] overhead. Alternatively, the old log segments could be archived for use as an incremental backup from a previous database snapshot.
+  
+ The first time a data page is modified after a checkpoint, the entire original page is written to the [WAL]. This provides the valid base page for incremental changes to work against. If the page in the main data file is corrupted by a partial write, it does not matter; the correct page can be reconstructed by replaying the WAL from the checkpoint.
  
 [PostgreSQL] provides the <tt>CHECKPOINT</tt> [SQL] command to force a checkpoint to occur. 
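 For example, a checkpoint can be forced from any [SQL] session (say, immediately before taking a filesystem-level backup of the cluster):
 
 <tt>CHECKPOINT;</tt>%%%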
  
-!! Tuning the WAL  
-Several configuration parameters for tuning WAL behaviour are available. They are all documented in the online [PostgreSQL documentation | http://www.postgresql.org/docs/]. A good overview of some of the specific configuration parameters is available [here | http://www.postgresql.org/docs/8./static/wal-configuration.html].
-  
-Many people extol the virtue of placing the WAL on its own device, separate from the main data files. After performing some extensive PgBench testing, this author questions that advice; the performance increase appears to be related to the increase in number of spindles more than the specific disk configuration. Compare the 4-disk RAID1+0 and dual RAID1 configurations. Those tests, however, were performed using a battery-backed write cache; it is entirely possible that without a write-cache the <tt>fsync()</tt> seek latency would significantly impact the main data file I/O.
-  
-Some additional configuration notes:  
-  
-<tt>[commit_delay | http://www.postgresql.org/docs/8./static/runtime-config.html#GUC-COMMIT-DELAY]</tt>%%%  
-<tt>[commit_siblings | http://www.postgresql.org/docs/8./static/runtime-config.html#GUC-COMMIT-SIBLINGS]</tt>%%%  
- These settings trade throughput for <tt>COMMIT</tt> latency. The circumstances under which increasing <tt>commit_delay</tt> has the most notable effect are when:
- # There is no battery-backed write cache  
- # There are a lot of small transactions  
- # The WAL device has a high <tt>fsync()</tt> latency  
- # There are a large number of concurrent connections  
-  
- The author's favourite settings for these parameters are  
-  
- <tt>commit_delay = 20000</tt>%%%  
- <tt>commit_siblings = 3</tt>%%%  
-  
-<tt>[wal_buffers | http://www.postgresql.org/docs/8./static/runtime-config.html#GUC-WAL-BUFFERS]</tt>%%%  
- This setting controls the upper bound on the size of <tt>write()</tt> requests to WAL segment files. Modern operating systems and disks work very well with writes in the range of 128kiB to 512kiB, often reaching peak I/O bandwidth with <tt>write()</tt> requests in that range.
-  
- Before generating some new WAL activity, [PostgreSQL] writes out the WAL buffers if more than half are full. Given a default block size of 8 kiB, and the disk performance above, this suggests that <tt>wal_buffers = 128</tt> is a good starting point. The default setting of <tt>8</tt> is surely too small for most systems.  
-  
- Doing <tt>COMMIT</tt> both writes out all the used WAL buffers and issues an expensive <tt>fsync()</tt>. Thus small transactions come with a heavy I/O penalty, and the WAL buffers are wasted. Using <tt>commit_delay</tt> may help significantly in this situation.  
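 In <tt>postgresql.conf</tt>, the starting point suggested above would read as follows (an illustrative value, not a tested recommendation):
 
 <tt>wal_buffers = 128</tt>%%%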
-  
-<tt>[checkpoint_segments | http://www.postgresql.org/docs/8./static/runtime-config.html#GUC-CHECKPOINT-SEGMENTS]</tt>%%%  
-<tt>[checkpoint_timeout | http://www.postgresql.org/docs/8./static/runtime-config.html#GUC-CHECKPOINT-TIMEOUT]</tt>%%%  
-<tt>[checkpoint_warning | http://www.postgresql.org/docs/8./static/runtime-config.html#GUC-CHECKPOINT-WARNING]</tt>%%%  
- Good settings for these parameters are particularly hard to determine. On one hand, you want to allow [PostgreSQL] and the operating system to perform write-caching of the main data files; on the other hand, you want to constrain crash-recovery times to reasonable levels.
-  
- Current recommendations are to have checkpoints occur no more frequently than every 5 minutes. Thus the settings
-  
- <tt>checkpoint_timeout = 600</tt>%%%  
- <tt>checkpoint_warning = 300</tt>%%%  
-  
- seem reasonable.  
-  
- It is impossible to tune <tt>checkpoint_segments</tt> for worst-case WAL activity, such as a bulk load using the <tt>COPY FROM STDIN</tt> [SQL] command. This might generate 30MiB/s or more of WAL activity. 30MiB/s for 5 minutes is 8.7GiB of WAL data! The best suggestion appears to be to let the server run for a while, then study the timestamps on the WAL segment files in <tt>pg_xlog</tt> to calculate the WAL write rate.
-  
- Once the WAL write rate is deduced, the number of log segments required for 10 minutes of WAL activity can be calculated as <tt>checkpoint_segments = (wal_write_rate * 600) / 16</tt>  
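 As a worked example, a hypothetical sustained write rate of 2MiB/s gives
 
 <tt>checkpoint_segments = (2 * 600) / 16 = 75</tt>%%%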
-  
- It is the author's experience that crash recovery from the WAL is a fairly slow operation. I/O rates in the region of 2MiB/s to 4MiB/s should be expected. The <tt>checkpoint_segments</tt> setting could be limited to the longest permissible recovery time. For example, to complete recovery within 5 minutes at 3.0MiB/sec, the maximum setting of <tt>checkpoint_segments</tt> is <tt>(300 * 3) / 16 = 56</tt>  
-  
-!! Incremental backup using WAL segments  
-<i>Note: The author has __not__ attempted incremental backups with this mechanism!</i>  
+!! See also  
  
-Given the principles of WAL operation, it should be possible to achieve incremental on-line backups using WAL segments. As of [PostgreSQL] 8., the required configuration directives and server-side support have been implemented. The full procedure is documented [here | http://www.postgresql.org/docs/8./static/backup-online.html].
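 As a sketch of the server-side part of that mechanism (the archive destination path here is hypothetical; see the documentation linked above for the full procedure), a completed WAL segment can be copied away with something like:
 
 <tt>archive_command = 'cp %p /mnt/backup/wal_archive/%f'</tt>%%%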
+* [PostgreSQLNotes] for how to tune the [WAL]  
+* PgBench for some benchmarking data  
+* [Employing the WAL to implement on-line backups | http://www.postgresql.org/docs/8./static/backup-online.html] (as of Pg 8.)