/web/jsoc/htdocs/production/postgres_backup_restore.html

Postgres Backup and Restore




Notes
1. Backup
2. archive_command
3. file-system-level backup
4. Restore
5. restore_command
6. Example
6.1 Example dcs0
7. pg_dump


Notes    (Top)

NOTE: This assumes that we're not using any tablespaces.
NOTE: Where does slony fit into this?

1. BACKUP    (Top)

We will do an On-line backup and point-in-time recovery (PITR) for our
Postgres Data Base.
(http://www.postgresql.org/docs/8.1/interactive/backup-online.html)
This consists of combining a file-system-level backup with backup of the 
Write Ahead Log (WAL) files. 
If recovery is needed, we restore the file backup and then replay from the 
backed-up WAL files to bring the backup up to current time.

So we first have to set up a procedure for archiving the WAL files.
The system physically divides the WAL sequence into WAL segment files, which 
are 16MB apiece, and located in /c/pgsql/pg_xlog. We must set up a shell
command to archive completed WAL segment files. 
(NOTE: The /c/pgsql/pg_xlog files are also backed up to tape as part of
the normal nightly incremental backup procedure.)

The shell command to use is specified by the archive_command configuration
parameter, which in practice will always be placed in the postgresql.conf
file (/d/pgsql/data/postgresql.conf).  In this string, any %p is replaced by
the absolute path of the file to archive, while any %f is replaced by the file
name only. Write %% if you need to embed an actual % character in the command. 

The archive command will be executed under the ownership of the same user
that the PostgreSQL server is running as. 

It is important that the archive command return zero exit status if and only
if it succeeded. Upon getting a zero result, PostgreSQL will assume that the
WAL segment file has been successfully archived, and will remove or recycle
it. However, a nonzero status tells PostgreSQL that the file was not archived;
it will try again periodically until it succeeds.

The archive command should generally be designed to refuse to overwrite any
pre-existing archive file. 

(NOTE: We do not currently backup partial WAL segments.)
If you are concerned about being able to recover right up to the current
instant, you may want to take additional steps to ensure that the current,
partially-filled WAL segment is also copied someplace. This is particularly
important if your server generates only little WAL traffic (or has slack
periods where it does so), since it could take a long time before a WAL
segment file is completely filled and ready to archive. One possible way to
handle this is to set up a cron job that periodically (once a minute, perhaps)
identifies the current WAL segment file and saves it someplace safe. Then the
combination of the archived WAL segments and the saved current segment will be
enough to ensure you can always restore to within a minute of current time.

NOTE: Seperate backups must be made of manually edited configuration files:
	/d/pgsql/data/postgresql.conf
	/d/pgsql/data/pg_hba.conf
	/d/pgsql/data/pg_ident.conf
This is automatically done in "2. tar the data dir:" below. But, if they are
subsequently changed, they should be saved.

2. archive_command    (Top)

****************************************************************************

Here is our archive_command found in /d/pgsql/data/postgresql.conf: 

archive_command = 'test ! -f /var/lib/pgsql/backups/wal_arch/%f.gz && gzip -1
-c %p > /var/lib/pgsql/backups/wal_arch/%f.gz'

****************************************************************************

3. file-system-level backup    (Top)

Now that we have archiving of the WAL segment files, we must make a
file-system-level backup. Here's what to do:

1. Login as the postgres user (su - postgres):

> psql -h hmidb jsoc
jsoc=# SELECT pg_start_backup('2006_05_30_JSOC');


2. tar the data dir:

cd /d/pgsql; gtar -czf /d/pgsql/backups/base/backup_2006_05_30.tgz data

NOTE: Some backup tools that you might wish to use emit warnings or errors if
the files they are trying to copy change while the copy proceeds. This
situation is normal, and not an error, when taking a base backup of an active
database.
NOTE: The normal nightly incremental backup to tape will copy /d/pgsql/backups
files, so the .tgz file and the archived WAL files will be on tape in case
of catastrophic disk failures.
NOTE: A seperate backup tape for off-site storage will be made whenever 
the tar of the data dir in done.

Copy this tar file to another machine for extra safety, e.g.:

(you might have to source the ssh-agent info first)
source /tmp/pgaginfo
scp -p /d/pgsql/backups/base/backup_2006_05_30.tgz jim@xim:/scr111/postgres

3. Connect to the database as a superuser and do:

> psql -h hmidb jsoc
jsoc=# SELECT pg_stop_backup();

To make use of this backup, you will need to keep around all the WAL segment
files generated during and after the file system backup. To aid you in doing
this, the pg_stop_backup function creates a backup history file that is
immediately stored into the WAL archive area. This file is named after the
first WAL segment file that you need to have to make use of the backup. For
example, if the starting WAL file is 0000000100001234000055CD the backup
history file will be named something like
0000000100001234000055CD.007C9330.backup. (The second number in the file name
stands for an exact position within the WAL file, and can ordinarily be
ignored.) Once you have safely archived the file system backup and the WAL
segment files used during the backup (as specified in the backup history
file), all archived WAL segments with names numerically less are no longer
needed to recover the file system backup and may be deleted. However, you
should consider keeping several backup sets to be absolutely certain that you
can recover your data. Keep in mind that only completed WAL segment files are
archived, so there will be delay between running pg_stop_backup and the
archiving of all WAL segment files needed to make the file system backup
consistent.

 It's also worth noting that the pg_start_backup function makes a file named
backup_label in the database cluster directory, which is then removed again by
pg_stop_backup. This file will of course be archived as a part of your backup
dump file. The backup label file includes the label string you gave to
pg_start_backup, as well as the time at which pg_start_backup was run, and the
name of the starting WAL file. In case of confusion it will therefore be
possible to look inside a backup dump file and determine exactly which backup
session the dump file came from.

4. Restore    (Top)

1. su - postgres

2. Stop the postmaster, if it's running.

-bash-3.00$ kill `head -1 /var/lib/pgsql/data/postmaster.pid`

If this doesn't work do this as root:
/sbin/service postgresql stop
^D

3. Copy the data directory and pg_xlog dir to a temporary location in case 
you need them later:

cp -rp /d/pgsql/data /c/pgsql/data_tmp
cp -rp /c/pgsql/pg_xlog /c/pgsql/data_tmp

NOTE: If /c does not have enough space then copy off to another machine with
adequate space. If this really starts to get large, then at least copy
/c/pgsql/pg_xlog, as it may contain logs which were not archived before the
system went down.
 
4.  Clean out all existing files and subdirectories under the data directory:

/bin/rm -rf /d/pgsql/data

5. Restore the database files from your backup dump:

cd /d/pgsql; gtar -xzf /d/pgsql/backups/base/backup_2006_05_30.tgz data

6.  If you had unarchived WAL segment files in /c/pgsql/data_tmp/pg_xlog, 
copy them zipped to /var/lib/pgsql/backups/wal_arch/%f.gz. 

7.  Create a recovery command file recovery.conf in /d/pgsql/data containing
the restore_command as follows:
(You may also want to temporarily modify pg_hba.conf to prevent ordinary users 
from connecting until you are sure the recovery has worked.)
restore_command    (Top)
****************************************************************************

restore_command = 'test -f /var/lib/pgsql/backups/wal_arch/%f.gz && gunzip
-1 -c /var/lib/pgsql/backups/wal_arch/%f.gz > %p'

****************************************************************************

8. Start the postmaster. The postmaster will go into recovery mode and proceed 
to read through the archived WAL files it needs. Upon completion of the 
recovery process, the postmaster will rename recovery.conf to recovery.done.

#Do as root:
#/sbin/service postgresql start
#^D
As user postgres:
postmaster -D /var/lib/pgsql/data > & /tmp/postmaster.log &


9. Let in the users by restoring pg_hba.conf to normal.

NOTE: If you were running a sum_svc, you must now stop and restart it
so that it may connect to the new instance of the DB.

6. Example    (Top)

(See below for example for Data Capture systems dcs0-dcs2)

Start the Postgres (if not already running) as user postgres and save 
a baseline tar file of the data directory:

hmidb:/var/lib/pgsql> postmaster -D /var/lib/pgsql/data > & /tmp/postmaster.log&
hmidb:/var/lib/pgsql> psql -h hmidb jsoc

joc=# SELECT pg_start_backup('2006_09_01_JSOC');
 pg_start_backup
-----------------
 2C/EEEFB8C0
(1 row)

jsoc=# \q

hmidb:/var/lib/pgsql> cd /d/pgsql
hmidb:/d/pgsql> gtar -czf /d/pgsql/backups/base/backup_2006_09_01_JSOC.tgz data
hmidb:/d/pgsql> psql -h hmidb jsoc
jsoc=# SELECT pg_stop_backup();
 pg_stop_backup
----------------
 2C/EEEFB910
(1 row)

jsoc=# \q
hmidb:/d/pgsql>

###########################################################################
To later restore:

Do as root:

[postgres@hmidb pgsql]# /sbin/service postgresql stop
Stopping postgresql service:

postgres@hmidb pgsql]# exit
hmidb:/var/lib/pgsql> whoami
postgres

Save current data & logs:
hmidb:/var/lib/pgsql> cp -rp /d/pgsql/data /c/pgsql/data_tmp
hmidb:/var/lib/pgsql> cp -rp /c/pgsql/pg_xlog /c/pgsql/data_tmp

Clean out existing file:
hmidb:/var/lib/pgsql> /bin/rm -rf /d/pgsql/data

Restore the baseline data:
hmidb:/var/lib/pgsql> cd /d/pgsql
hmidb:/d/pgsql> gtar -xzf /d/pgsql/backups/base/backup_2006_09_01_JSOC.tgz data

Zip any WAL segment not archived yet, e.g.:
> cd /c/pgsql/data_tmp/pg_xlog
> gzip -1 -c 000000010000002C000000EE > /var/lib/pgsql/backups/wal_arch/000000010000002C000000EE.gz


Make sure you have the restore command in /d/pgsql/data/recovery.conf:
restore_command = 'test -f /var/lib/pgsql/backups/wal_arch/%f.gz && gunzip
-1 -c /var/lib/pgsql/backups/wal_arch/%f.gz > %p'

Start the Postgres:
hmidb:/d/pgsql> postmaster -D /var/lib/pgsql/data > & /tmp/postmaster.log &


6.1 Example dcs0-2    (Top)

This is done on each Data Capture system in turn: dcs0, dcs1, dcs2.

Start the Postgres (if not already running) as user postgres and save 
a baseline tar file of the data directory:

dcs0:/var/lib/pgsql> postmaster -D /var/lib/pgsql/data > & /tmp/postmaster.log&
dcs0:/var/lib/pgsql> psql jsocdc

jocdc=# SELECT pg_start_backup('2008_01_31_JSOCDC_dcs0');
 pg_start_backup 
-----------------
 0/CAAB5DC0
(1 row)

jsocdc=# \q

dcs0:/var/lib//pgsql> gtar -czf /var/lib/pgsql/backups/base/backup_2008_01_31_JSOCDC_dcs0.tgz data

(NOTE: if get an error like this, it is ok)
gtar: data/base/16384/16385: file changed as we read it

dcs0:/var/lib/pgsql> psql jsocdc
jsocdc=# SELECT pg_stop_backup();
 pg_stop_backup 
----------------
 0/CAB06F28
(1 row)

jsoc=# \q
dcs0:/var/lib/pgsql>

NOTE: Also do for dcs1 and dcs2
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Also do below according to which dcs is the current backup, 
e.g. if dcs1 is doing HMI and dcs2 is doing AIA, then dcs0 is the 
current backup, and so you would do like this:

Make sure that this cron job is running on dcs1 which
will rsync the dcs1 database backups every 15 minutes onto dcs0:
0,15,30,45 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs0.pl
And likwise on dcs2:
0,15,30,45 * * * * /var/lib/pgsql/rsync_pg_dcs2_to_dcs0.pl

Also make sure that this cron job is running on dcs1 which
will rsync the dcs1 /var/lib/pgsql/backups to 
hmidb:/var/lib/pgsql/backups/dcs1
00 6,18 * * * /var/lib/pgsql/rsync_pg_dcs1_to_hmidb.pl
And likewise on dcs2:
00 6,18 * * * /var/lib/pgsql/rsync_pg_dcs2_to_hmidb.pl


NOTE: If dcs1 or dcs2 is rebooted, please update as appropriate 
/var/lib/pgsql/rsync_pg_dcs1_to_dcs0.source and
/var/lib/pgsql/ssh_rsync_pg_dcs1_to_hmidb.source  to set appropriately
SSH_AGENT_PID=11970
SSH_AUTH_SOCK=/tmp/ssh-BbZpg11969/agent.11969
where the values are obtained by doing:
> eval `ssh-agent`
> ssh-add ~/.ssh/id_rsa
and then echo $SSH_AUTH_SOCK and $SSH_AGENT_PID to get the
correct values to put into /var/lib/pgsql/rsync_pg_dcs1_to_dcs0.source
and /var/lib/pgsql/ssh_rsync_pg_dcs1_to_hmidb.source.

###########################################################################
To later restore:

Do as root:

dcs0 lib/pgsql# whoami
root
dcs0 lib/pgsql# /sbin/service postgresql stop
Shutting down PostgreSQLpostmaster stopped

dcs0 lib/pgsql# exit
dcs0:/var/lib/pgsql> whoami
postgres

Save current data & logs:
dsc0:/var/lib/pgsql> cp -rp /var/lib/pgsql/data /var/lib/pgsql/data_tmp
dsc0:/var/lib/pgsql> cp -rp /var/lib/pgsql/data/pg_xlog /var/lib/pgsql/data_tmp

Clean out existing file:
dsc0:/var/lib/pgsql> /bin/rm -rf /var/lib/pgsql/data

Restore the baseline data:
dcs0:/var/lib/pgsql> gtar -xzf /var/lib/pgsql/backups/base/backup_2008_01_31_JSOCDC_dcs0.tgz data

Zip any WAL segment not archived yet, e.g.:
> cd /var/lib/pgsql/data_tmp/pg_xlog
> gzip -1 -c 000000010000002C000000EE > /var/lib/pgsql/backups/wal_arch/000000010000002C000000EE.gz


Make sure you have the restore command in /var/lib/pgsql/data/recovery.conf:
restore_command = 'test -f /var/lib/pgsql/backups/wal_arch/%f.gz && gunzip
-1 -c /var/lib/pgsql/backups/wal_arch/%f.gz > %p'

Start the Postgres:
dcs0:/var/lib/pgsql> postmaster -D /var/lib/pgsql/data &

7. pg_dump    (Top)

In the beginning, the DB is save with pg_dump so that if lost entirely,
all the tables and indexes and permissions, etc. can be restored.

postgres@dcs1:~> pwd
/var/lib/pgsql
postgres@dcs1:~> whoami
postgres
postgres@dcs1:~> pg_dump --create jsocdc > pg_dump.create.jsocdc

In order to restore, first drop the database, then:

put the script, pg_dump.create.jsocdc, into psql as user postgres.

Also do a --clean pg_dump:

postgres@dcs1:~> pg_dump --clean jsocdc > pg_dump.clean.jsocdc

Last modified @ 09:40 Tue Mar 18 2008