Differences between revisions 3 and 5 (spanning 2 versions)

NetDRMS - a shared data management system

Introduction

In order to process, archive, and distribute the substantial quantity of data flowing from the Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager (HMI) instruments on the Solar Dynamics Observatory (SDO), the Joint Science Operations Center (JSOC) has developed its own data management system. This system, the Data Record Management System (DRMS), consists of data series, each of which is a collection of related data. For example, there exists a data series named hmi.M_45s, which contains the HMI 45-second cadence magnetograms. Each data series consists of several DRMS objects: records, keywords, segments, and links. A DRMS record is the smallest unit of data-series data. Typically, it represents data for a single observation in time (hence the term series in data series), but there is no restriction on how a user organizes their data. A data series may contain one or more DRMS keywords, each of which represents a named bit of metadata. For example, many data series contain a DRMS keyword named CRPIX1. A DRMS segment is a collection of data that contains storage/retrieval information needed by DRMS to locate auxiliary data files. These data files contain large sets of data like image arrays. Generally, they are image files, but what they contain is arbitrary and user-defined. A data series optionally contains one or more DRMS links, each of which is a collection of data that links the data series to other DRMS data series. Each DRMS record contains record-specific values for the DRMS keywords, segments, and links. In this way, one record may have one set of keyword, segment, and link values, and another record may have a different set of these values.
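The relationship between a data series and its records can be sketched in a few lines of Python. This is only an illustrative model, not the DRMS library API: the class names, the `recnum` field, the segment name `magnetogram`, and all values are hypothetical. Only the series name hmi.M_45s and the keyword CRPIX1 come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One DRMS record: record-specific keyword, segment, and link values."""
    recnum: int                                   # hypothetical record identifier
    keywords: dict = field(default_factory=dict)  # keyword name -> value
    segments: dict = field(default_factory=dict)  # segment name -> storage info
    links: dict = field(default_factory=dict)     # link name -> target series


@dataclass
class DataSeries:
    """A named collection of records that share keyword/segment/link names."""
    name: str
    keyword_names: list
    segment_names: list
    link_names: list = field(default_factory=list)
    records: list = field(default_factory=list)

    def add_record(self, record):
        # Reject values for names that the series does not define.
        for kind, allowed, given in (
            ("keyword", self.keyword_names, record.keywords),
            ("segment", self.segment_names, record.segments),
            ("link", self.link_names, record.links),
        ):
            unknown = set(given) - set(allowed)
            if unknown:
                raise ValueError(f"unknown {kind}(s): {sorted(unknown)}")
        self.records.append(record)


# Two records in the same series carry different values for the same names.
series = DataSeries("hmi.M_45s", keyword_names=["CRPIX1"],
                    segment_names=["magnetogram"])
series.add_record(Record(1, keywords={"CRPIX1": 2048.5},
                         segments={"magnetogram": 1001}))
series.add_record(Record(2, keywords={"CRPIX1": 2047.0},
                         segments={"magnetogram": 1002}))
print([r.keywords["CRPIX1"] for r in series.records])
```

The series fixes the set of names, while each record supplies its own values, which mirrors the last two sentences of the paragraph above.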

The Storage Unit Management System (SUMS) is the file-management system that contains the data files that DRMS records refer to. Each DRMS segment value is used by DRMS code to derive the SUMS file-system path to a single data file. Because each DRMS series may contain multiple DRMS segments, each DRMS record may point to more than one data file.

To manage all these data, DRMS comprises several components, one of which is a database instance in a relational-database management system (PostgreSQL). The DRMS Library code uses a database instance and several tables to implement the DRMS objects. For each data series, there exists a database table that contains one row per DRMS record. The columns of each of these tables contain the DRMS keyword, segment, and link values - bits of data that are all small enough to fit efficiently in a database record. The data-file data are too large to fit into a database record, so those data reside in data files in SUMS. The DRMS-segment values point to the data files, using a unique identifier called a SUNUM. SUMS itself comprises several components, one of which is another database instance that contains several database tables. When DRMS needs a data file, it requests the file from SUMS by providing SUMS with a SUNUM, and then SUMS consults its database tables to derive the path to the data file. SUMS shuttles files between hard disk (aka the disk cache) and tape, so data files have no permanent file path. Therefore, when DRMS requests the path to a file, SUMS must obtain the current path by consulting a database table.
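The SUNUM-to-path lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SUMS implementation: an in-memory SQLite database stands in for the PostgreSQL instance, the table is a cut-down version of SUM_MAIN, and treating ds_index as the SUNUM, online_loc as the path, and 'Y'/'N' as the online_status values are all assumptions made for the example. The paths and the function name `resolve_path` are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL instance that SUMS uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sum_main (
        ds_index      INTEGER PRIMARY KEY,  -- assumed to hold the SUNUM
        online_loc    TEXT NOT NULL,        -- assumed to hold the current path
        online_status TEXT                  -- assumed 'Y' when the file is on disk
    )
    """
)
conn.executemany(
    "INSERT INTO sum_main VALUES (?, ?, ?)",
    [(1001, "/SUM0/D1001", "Y"), (1002, "/SUM1/D1002", "N")],
)


def resolve_path(conn, sunum):
    """Return the current disk path for a SUNUM, or None if it is not on disk."""
    row = conn.execute(
        "SELECT online_loc FROM sum_main"
        " WHERE ds_index = ? AND online_status = 'Y'",
        (sunum,),
    ).fetchone()
    return row[0] if row else None


print(resolve_path(conn, 1001))  # path before the file is moved

# The file is moved to another partition: only the table row changes,
# the SUNUM stored in the DRMS segment value stays the same.
conn.execute("UPDATE sum_main SET online_loc = ? WHERE ds_index = ?",
             ("/SUM3/D1001", 1001))
print(resolve_path(conn, 1001))  # path after the move
print(resolve_path(conn, 1002))  # None: not currently on disk
```

Because the caller holds only the SUNUM, the file can move between partitions, or between disk and tape, without any DRMS record being rewritten.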

Building Your Own DRMS and SUMS

Sites other than the JSOC can also work with DRMS data series. They can maintain local copies of the DRMS and SUMS data created at the JSOC, and they can create their own DRMS data, of which other sites can maintain local copies. To participate in this network of sites sharing data, a site (aka a node) must install a DRMS/SUMS system to become a NetDRMS site. Once a member of this network, a NetDRMS site can selectively share specific data series - it is not necessary to share all series.

There are three fundamental requirements for setting up and operating a DRMS system:

* Reserved disk space to serve as the SUMS disk cache.
* A database server running Postgres version 8.4.
* A "current" copy of the JSOC software tree, available from Stanford.

Setting up a SUMS

The SUMS disk area can be as simple as a directory, but it is probably better to assign at least one disk partition to the SUMS cache. Unless a tape library also exists, the SUMS partition(s) must be large enough to store all the data segments in the DRMS that are to be archived locally. For datasets for which other DRMS servers provide the permanent archive, the local SUMS will serve only as a local cache, so size is dictated by expected usage.

The directory or directories to be used for SUMS must be owned by a user named production (can be any uid), belong to a group named SOI (can be any gid), and have a permissions mode of 2775 (drwxrwsr-x). The group SOI should include as members any users who will be writing data into the DRMS, whether by running modules or otherwise.
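The symbolic permissions drwxrwsr-x describe a setgid directory, which is octal mode 2775. A small check along the following lines can confirm that a candidate SUMS directory meets the ownership and mode rules. It is only a sketch: the function names are hypothetical, and `inspect_sums_dir` relies on the Unix-only `pwd` and `grp` modules.

```python
import stat

REQUIRED_MODE = 0o2775  # setgid + rwxrwxr-x, shown by ls as drwxrwsr-x


def sums_dir_problems(st_mode, owner, group):
    """Return a list of ways a directory fails the SUMS ownership/mode rules."""
    problems = []
    if not stat.S_ISDIR(st_mode):
        problems.append("not a directory")
    if stat.S_IMODE(st_mode) != REQUIRED_MODE:
        problems.append(
            f"mode is {stat.filemode(st_mode)}, expected "
            f"{stat.filemode(stat.S_IFDIR | REQUIRED_MODE)}"
        )
    if owner != "production":
        problems.append(f"owner is {owner}, expected production")
    if group != "SOI":
        problems.append(f"group is {group}, expected SOI")
    return problems


def inspect_sums_dir(path):
    """Look up the real mode, owner, and group of `path` (Unix only)."""
    import grp
    import os
    import pwd

    st = os.stat(path)
    return sums_dir_problems(
        st.st_mode,
        pwd.getpwuid(st.st_uid).pw_name,
        grp.getgrgid(st.st_gid).gr_name,
    )


print(stat.filemode(stat.S_IFDIR | REQUIRED_MODE))  # drwxrwsr-x
print(sums_dir_problems(stat.S_IFDIR | 0o755, "production", "SOI"))
```

The setgid bit matters here because files created inside the directory then inherit the SOI group, so every member of the group can manage data written by the others.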

Setting up the Postgres Database server

You should have Postgres Version 8.1 or higher installed; JSOC database servers are currently (Oct 2006) running on the following systems:

a 64-bit dual-core Xeon running Red Hat Enterprise Linux 4 with Postgres v. 8.1.2
a 32-bit dual-core Pentium 4 running Scientific Linux (?; equinox) with Postgres v. 8.1.4

Populating the Database

First, you must create the database tables and sequences required for SUMS. You can do so by running the following SQL statements in psql:

create table SUM_MAIN (
 ONLINE_LOC             VARCHAR(80) NOT NULL,
 ONLINE_STATUS          VARCHAR(5),
 ARCHIVE_STATUS         VARCHAR(5),
 OFFSITE_ACK            VARCHAR(5),
 HISTORY_COMMENT        VARCHAR(80),
 OWNING_SERIES          VARCHAR(80),
 STORAGE_GROUP          integer,
 STORAGE_SET            integer,
 BYTES                  bigint,
 DS_INDEX               bigint,
 CREATE_SUMID           bigint NOT NULL,
 CREAT_DATE             timestamp(0),
 ACCESS_DATE            timestamp(0),
 USERNAME               VARCHAR(10),
 ARCH_TAPE              VARCHAR(20),
 ARCH_TAPE_POS          VARCHAR(15),
 ARCH_TAPE_FN           integer,
 ARCH_TAPE_DATE         timestamp(0),
 WARNINGS               VARCHAR(260),
 STATUS                 integer,
 SAFE_TAPE              VARCHAR(20),
 SAFE_TAPE_POS          VARCHAR(15),
 SAFE_TAPE_FN           integer,
 SAFE_TAPE_DATE         timestamp(0),
 constraint pk_summain primary key (DS_INDEX)
);

create table SUM_OPEN (
    SUMID      bigint not null,
    OPEN_DATE  timestamp(0),
    constraint pk_sumopen primary key (SUMID)
);

create table SUM_PARTN_ALLOC (
    wd                 VARCHAR(80) not null,
    sumid              bigint not null,
    status             integer not null,
    bytes              bigint,
    effective_date     VARCHAR(20),
    archive_substatus  integer,
    group_id           integer,
    ds_index           bigint not null,
    safe_id            integer
);

create table SUM_PARTN_AVAIL (
       partn_name    VARCHAR(80) not null,
       total_bytes   bigint not null,
       avail_bytes   bigint not null,
       pds_set_num   integer not null,
       constraint pk_sumpartnavail primary key (partn_name)
);

create table SUM_TAPE (
        tapeid          varchar(20) not null,
        nxtwrtfn        integer not null,
        spare           integer not null,
        group_id        integer not null,
        avail_blocks    bigint not null,
        closed          integer not null,
        last_write      timestamp(0),
        constraint pk_tape primary key (tapeid)
);

create sequence SUM_SEQ
  increment 1
  start 2
  no maxvalue
  no cycle
  cache 50;

create sequence SUM_DS_INDEX_SEQ
  increment 1
  start 1
  no maxvalue
  no cycle
  cache 10;

create table SUM_FILE (
        tapeid          varchar(20) not null,
        filenum         integer not null,
        gtarblock       integer,
        md5cksum        varchar(36) not null,
        constraint pk_file primary key (tapeid, filenum)
);

create table SUM_GROUP (
        group_id        integer not null,
        retain_days     integer not null,
        effective_date  VARCHAR(20),
        constraint pk_group primary key (group_id)
);

(These statements are contained in the scripts create_tables.sql, sum_file.sql, and sum_group.sql in the JSOC software library, under base/sums/scripts/postgres.) For example, if you have created a database named mydb on a server named myserver, and one of those scripts is in your working directory, you could enter the command

  psql -h myserver mydb -f create_tables.sql

Or you could simply enter the commands by hand. (You should be the database administrator when you create these tables.)
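If all three scripts are to be loaded, the psql invocation can be repeated for each one. The following sketch builds those commands in Python. The function names, the `dry_run` switch, and the order of the scripts are assumptions made for illustration, and the `ON_ERROR_STOP` setting is an addition that does not appear in the command above; it makes psql stop at the first failing statement.

```python
import subprocess

# Script names as listed above; the order used here is an assumption.
SCRIPTS = ["create_tables.sql", "sum_file.sql", "sum_group.sql"]


def psql_command(host, dbname, script):
    """Build the argument list for running one SQL script with psql."""
    return ["psql", "-h", host, "-v", "ON_ERROR_STOP=1", "-f", script, dbname]


def load_schema(host, dbname, scripts=SCRIPTS, dry_run=True):
    """Run each script in turn, or only print the commands when dry_run is set."""
    commands = [psql_command(host, dbname, s) for s in scripts]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if psql exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


load_schema("myserver", "mydb")
```

With `dry_run=True` nothing is executed, so the commands can be reviewed first; calling `load_schema("myserver", "mydb", dry_run=False)` from the directory that holds the scripts would actually run them.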

- Revision 3 as of 2010-09-03 08:14:14
  Size: 5836
  Editor: rick
  Comment:
+ Revision 5 as of 2013-02-26 04:05:54
  Size: 8801
  Editor: DNab4211fe
  Comment:
 Line 1:
-'''N.B.''' These instruction are largely if not entirely supplanted by the more detailed
instructions for installing and upgrading NetDRMS at
http://vso.stanford.edu/netdrms/
+= NetDRMS - a shared data management system =
-Line 5:
+Line 3:
-= Build Your Own DRMS =
+== Introduction ==
-Line 7:
+Line 5:
-The JSOC data archive is designed to be replicable and able to function with other
cooperating data archives sharing the same basic architecture. Individual archives
can selectively share data in their DRMS databases with other archives and serve as
either master or slave for data record information on a dataset by dataset basis.
It is also possible for data archives to share cached data segments in their individual
Storage Unit Management Systems (SUMS). This page provides information for developers
outside the JSOC who wish to set up such cooperating archives.
+In order to process, archive, and distribute the substantial quantity of data flowing from the Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager (HMI) instruments on the Solar Dynamics Observatory (SDO), the Joint Science Operations Center (JSOC) has developed its own data management system. This system, the Data Record Management System (DRMS), consists of ''data series'', each of which is a collection of related data. For example, there exists a data series named hmi.M_45s, which contains the HMI 45-second cadence magnetograms. Each data series consists of several DRMS objects: records, keywords, segments, and links. A DRMS record is the smallest unit of data-series data. Typically, it represents data for a single observation in time (hence the term ''series'' in data series), but there is no restriction on how a user organizes their data. A data series may contain one or more DRMS keywords, each of which represents a named bit of metadata. For example, many data series contain a DRMS keyword named CRPIX1. A DRMS segment is a collection of data that contains storage/retrieval information needed by DRMS to locate auxiliary data files. These data files contain large sets of data like image arrays. Generally, they are image files, but what they contain is arbitrary and user-defined. A data series optionally contains one or more DRMS links, each of which is a collection of data that ''links'' the data series to other DRMS data series. Each DRMS record contains record-specific values for the DRMS keywords, segments, and links. In this way, one record may have one set of keyword, segment, and link values, and another record may have a different set of these values.

The Storage Unit Management System (SUMS) is the file-management system that contains the data files that DRMS records refer to. Each DRMS segment value is used by DRMS code to derive the SUMS file-system path to a single data file. Because each DRMS series may contain multiple DRMS segments, each DRMS record may ''point'' to more than one data file. 

To manage all these data, DRMS comprises several components, one of which is a database instance in a relational-database management system (PostgreSQL). The DRMS Library code uses a database instance and several tables to implement the DRMS objects. For each data-series record, there exists a database table that contains one row per each DRMS record. The columns of each of these records contain the DRMS keyword, segment, and link values - bits of data that are all small enough to efficiently fit in a database record. The data-file data are too large to fit into a database record, so those data reside in data files in SUMS. The DRMS-segment values ''point'' to the data files, using a unique identifier called a SUNUM. SUMS itself comprises several components, one of which is another database instance that contains several database tables. When DRMS needs a data file, it ''requests'' the file from SUMS by providing SUMS with a SUNUM, and then SUMS consults its database tables to derive the path to the data file. SUMS shuttles files between hard disk (aka the disk cache) and tape, so data files have no permanent file path. Therefore, when DRMS requests the path to a file, SUMS must obtain the current path by consulting a database table.

== Building Your Own DRMS and SUMS ==

Sites other than the JSOC can DRMS data series. They can maintain local copies of the DRMS and SUMS data created at the JSOC. And they can create their own DRMS data, of which other sites can maintain local copies. To participate in this network of sites sharing data, a site (aka a node) must install a DRMS/SUMS system to become a NetDRMS site. Once a member of a this network, a NetDRMS site can selectively share specific data series - it is not necessary to share all series.
 Line 17:
-- A reserved disk space to serve as the SUMS disk cache. (A tape library for permanent
offline or near-line storage is nice, but not essential. The details of setting up a
tape library are NOT discussed elsewhere.)

- A database server running Postgres version ...

- A "current" copy of the JSOC software tree, available from Stanford through ....
+* Reserved disk space to serve as the SUMS disk cache. 
* A database server running Postgres version 8.4.
* A "current" copy of the JSOC software tree, available from Stanford.