Databases need maintenance. Usually there are vendor-specific reasons, but most reasons fall under the categories of space re-allocation and data de-fragmentation.
In PostGres, those two categories are Vacuuming and Reindexing, respectively.
Routine Vacuum
Why vacuum?
1. To recover disk space occupied by updated or deleted rows.
2. To update data statistics used by the PostgreSQL query planner. (VACUUM ANALYZE)
3. To protect against loss of very old data due to transaction ID wraparound
For more information, see PostgreSQL Docs
Why can't vacuum sometimes remove dead rows?
There exists one or more open transaction somewhere in the db cluster, i.e., any db under the same postmaster.
Consider the following transactions:
User: pat Time User: chris
BEGIN TRANSACTION; T1
T2 DELETE FROM vactest WHERE ...;
T3 ANALYZE VERBOSE vactest;
T4 VACUUM VERBOSE vactest;
At time T1, pat begins a new transaction. At a later time T2, chris deletes some rows from table vactest. A subsequent ANALYZE statement at time T3 shows the deleted rows as dead rows in table vactest. At time T4, VACUUM fails to remove these dead rows. This is because the dead rows would be visible to pat, hence the database cannot remove them. In fact, I think the database is doing a very conservative estimate on whether the deads rows will be visible, i.e., neither transaction isolation level nor access permission matters. Suppose in some future time, pat ends the transaction. After that, chris will be able to remove those dead rows with vacuum.
If we change the timing a little in the above example,
User: pat Time User: chris
T1 DELETE FROM vactest WHERE ...;
BEGIN TRANSACTION; T2
T3 ANALYZE VERBOSE vactest;
T4 VACUUM VERBOSE vactest;
Note now pat starts the transaction after chris has deleted the row. This time VACUUM succeeds in removing those dead rows.
These two connection may or may not be even on the same database, as long as they are under the same postmaster, the results are the same.
There are tons of discussion on this subject on PostgreSQL mailing list. Here are two examples: http://archives.postgresql.org/pgsql-general/2006-12/msg00316.php http://archives.postgresql.org/pgsql-hackers/2006-12/msg00717.php
Unfortunately for us, one of the first things drms_server does is to open a transaction, drms_server.c:215. Unless drms_commit() is issued somewhere, this transaction will not be closed until drms_server exits. This leaves a big opportunity for long running transactions, both for time-consumming data processing and runaway drms_server programs.
There are many reasons (I am told they are buried in some emails) why drms_server is designed this way. Because serialiable transaction is a strict requirement, it will not do to move the 'BEGIN TRANSACTION' statement close to the query or insert statements.
In summary, the rate we can effectively removing dead rows is determined by the longest running transaction even with autovacuum turned on.
Caveats and Unknowns
1. Problem with VACUUM FULL
2. Which is better? whole database vacuum or per-table vacuum. There are recommendations to turn off autovacuum in favor of manual vacuum on tables for data warehousing type of applications, like ours.
3. What about REINDEX?
4. Transaction ID wraparound problem
Crons and Scripts
Crons to be included may include:
1. Store Daily backup copies of user logins, internal memory/settings configurations, template databases and any database layout information
2. Store any Slony-1 configuration settings upon change or 1/week, or as determined by Slony-1 DBA
3. WAL file backups and ingest into warm-standby (and any other duplicate servers)
4. Regular (weekly?) automated creation of Data Definition Language scripts that would be able to completely and quickly recreate our database from "scratch" in the event of catastrophic failure.
5. Daily "space" update to tell the DBA the percentage of usable disk space left for PostGres's data.
Some useful scripts may include:
1. Updating database statistics
2. Full database vacuum
3. Cleanup of extra WAL files after a given time has passed
4. Cleanup of other backup files (see Crons, above) after N days, if required
5. Full backup to tape or snapshot disk
6. Space statistics report
7. "Who's doing what" user activity report - may be useful to monitor traffic
8. Emergency response (killed process ids, locked jobs, who to page or email if problems)
9. Load full database from tape or disk
10. Consistency checking utility scripts, such as Checksum of data files.
Documentation
A procedure document written expressly for purposes of switching the database users from primary to warm standby configuration would be useful to the database administrator in times of emergency. This should include information about doing tape loads, backups and transaction log loading or unloading. There are many steps involved in swap-overs; a written procedure will cut down on forgotten steps.
