NetDRMS - a shared data management system
Introduction
In order to process, archive, and distribute the substantial quantity of data flowing from the Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager (HMI) instruments on the Solar Dynamics Observatory (SDO), the Joint Science Operations Center (JSOC) has developed its own data management system. This system, the Data Record Management System (DRMS), consists of data series, each of which is a collection of related data. For example, there exists a data series named hmi.M_45s, which contains the HMI 45-second cadence magnetograms. Each data series consists of several DRMS objects: records, keywords, segments, and links. A DRMS record is the smallest unit of data-series data. Typically, it represents data for a single observation in time (hence the term series in data series), but there is no restriction on how a user organizes their data. A data series may contain one or more DRMS keywords, each of which represents a named bit of metadata. For example, many data series contain a DRMS keyword named CRPIX1. A DRMS segment is a collection of data that contains storage/retrieval information needed by DRMS to locate auxiliary data files. These data files contain large sets of data like image arrays. Generally, they are image files, but what they contain is arbitrary and user-defined. A data series optionally contains one or more DRMS links, each of which is a collection of data that links the data series to other DRMS data series. Each DRMS record contains record-specific values for the DRMS keywords, segments, and links. In this way, one record may have one set of keyword, segment, and link values, and another record may have a different set of these values.
The Storage Unit Management System (SUMS) is the file-management system that contains the data files that DRMS records refer to. Each DRMS segment value is used by DRMS code to derive the SUMS file-system path to a single data file. Because each DRMS series may contain multiple DRMS segments, each DRMS record may point to more than one data file.
To manage all these data, DRMS comprises several components, one of which is a database instance in a relational-database management system (PostgreSQL). The DRMS Library code uses a database instance and several tables to implement the DRMS objects. For each data series, there exists a database table that contains one row per DRMS record. The columns of each row contain the DRMS keyword, segment, and link values - bits of data that are all small enough to fit efficiently in a database record. The data-file data are too large to fit into a database record, so those data reside in data files in SUMS. The DRMS-segment values point to the data files, using a unique identifier called a SUNUM. SUMS itself comprises several components, one of which is another database instance that contains several database tables. When DRMS needs a data file, it requests the file from SUMS by providing SUMS with a SUNUM, and then SUMS consults its database tables to derive the path to the data file. SUMS shuttles files between hard disk (aka the disk cache) and tape, so data files have no permanent file path. Therefore, when DRMS requests the path to a file, SUMS must obtain the current path by consulting a database table.
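The two-step lookup described above (record row to SUNUM, then SUNUM to current path) can be sketched with a toy model. This is only an illustration under stated assumptions: the table names toy_series and toy_sums, their columns, the SUNUM values, and the paths are all hypothetical, and SQLite stands in for PostgreSQL so that the sketch runs without a server. It is not the actual DRMS or SUMS schema, and it gives each record a single segment for brevity.

```python
import sqlite3

# Toy model only: table and column names here are hypothetical, and SQLite
# stands in for PostgreSQL so the sketch runs without a database server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a data-series table: one row per DRMS record.
    CREATE TABLE toy_series (
        recnum INTEGER PRIMARY KEY,  -- record identifier
        crpix1 REAL,                 -- a keyword value
        sunum  INTEGER               -- segment value pointing into SUMS
    );
    -- Stand-in for a SUMS table: where each storage unit currently lives.
    CREATE TABLE toy_sums (
        sunum        INTEGER PRIMARY KEY,
        current_path TEXT
    );
    INSERT INTO toy_series VALUES (1, 2048.5, 1001);
    INSERT INTO toy_series VALUES (2, 2048.7, 1002);
    INSERT INTO toy_sums VALUES (1001, '/cache0/unit1001');
    INSERT INTO toy_sums VALUES (1002, '/cache1/unit1002');
""")


def sunum_for_record(recnum):
    """DRMS side: read the SUNUM stored in a record's row."""
    row = conn.execute(
        "SELECT sunum FROM toy_series WHERE recnum = ?", (recnum,)
    ).fetchone()
    return row[0] if row else None


def path_for_sunum(sunum):
    """SUMS side: look up the current path for a SUNUM."""
    row = conn.execute(
        "SELECT current_path FROM toy_sums WHERE sunum = ?", (sunum,)
    ).fetchone()
    return row[0] if row else None


before = path_for_sunum(sunum_for_record(1))

# Simulate SUMS moving the file; the SUNUM stored in the record is unchanged.
conn.execute(
    "UPDATE toy_sums SET current_path = '/cache2/unit1001' WHERE sunum = 1001"
)
after = path_for_sunum(sunum_for_record(1))

print(before, "->", after)
```

The point of the indirection is visible in the last few lines: the record keeps the same SUNUM while the path behind it changes, which is why the path has to be looked up each time it is needed.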
Building Your Own DRMS and SUMS
Sites other than the JSOC can also share DRMS data series. They can maintain local copies of the DRMS and SUMS data created at the JSOC, and they can create their own DRMS data, of which other sites can maintain local copies. To participate in this network of sites sharing data, a site (aka a node) must install a DRMS/SUMS system to become a NetDRMS site. Once a member of this network, a NetDRMS site can selectively share specific data series - it is not necessary to share all series.
There are three fundamental requirements for setting up and operating a DRMS system:
* Reserved disk space to serve as the SUMS disk cache.
* A database server running Postgres version 8.4.
* A "current" copy of the JSOC software tree, available from Stanford.
Setting up a SUMS
The SUMS disk area can be as simple as a directory, but it is probably better to assign at least one disk partition to the SUMS cache. Unless a tape library also exists, the SUMS partition(s) must be large enough to store all the data segments in the DRMS that are to be archived locally. For datasets for which other DRMS servers provide the permanent archive, the local SUMS will serve only as a local cache, so size is dictated by expected usage.
The directory or directories to be used for SUMS must be owned by a user named production (the uid can be any value), belong to a group named SOI (the gid can be any value), and have a permissions mask of 2775 (drwxrwsr-x). The group SOI should include as members any users who will be writing data into the DRMS by running modules or otherwise.
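The symbolic string drwxrwsr-x corresponds to the octal mode 2775: the setgid bit (2000) plus rwxrwxr-x (775). As a minimal sketch, the following uses Python's standard stat module to confirm that correspondence and to test whether a directory's mode matches. The helper names mode_string and has_sums_permissions are made up for this illustration; ownership by production and SOI would need a separate check.

```python
import stat

# Octal mode for drwxrwsr-x: setgid bit (0o2000) plus rwxrwxr-x (0o775).
SUMS_DIR_MODE = 0o2775


def mode_string(mode):
    """Render permission bits the way `ls -l` shows them for a directory."""
    return stat.filemode(stat.S_IFDIR | mode)


def has_sums_permissions(st_mode):
    """True if st_mode is a directory carrying exactly the SUMS mode bits."""
    return stat.S_ISDIR(st_mode) and stat.S_IMODE(st_mode) == SUMS_DIR_MODE


print(mode_string(SUMS_DIR_MODE))
```

To check a real directory, pass os.stat(path).st_mode to has_sums_permissions. With the setgid bit set on the directory, files created inside it inherit the directory's group on Linux, which is what lets members of the group write alongside one another.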
Setting up the Postgres Database server
You should have Postgres Version 8.1 or higher installed; JSOC database servers are currently (Oct 2006) running on the following systems:
- a 64-bit dual-core Xeon running Red Hat Enterprise Linux 4 with Postgres v. 8.1.2
- a 32-bit dual-core Pentium 4 running Scientific Linux (?; equinox) with Postgres v. 8.1.4
Populating the Database
First, you must create the database tables required for SUMS. You can do so by running the following SQL statements in psql:
create table SUM_MAIN (
  ONLINE_LOC       VARCHAR(80) NOT NULL,
  ONLINE_STATUS    VARCHAR(5),
  ARCHIVE_STATUS   VARCHAR(5),
  OFFSITE_ACK      VARCHAR(5),
  HISTORY_COMMENT  VARCHAR(80),
  OWNING_SERIES    VARCHAR(80),
  STORAGE_GROUP    integer,
  STORAGE_SET      integer,
  BYTES            bigint,
  DS_INDEX         bigint,
  CREATE_SUMID     bigint NOT NULL,
  CREAT_DATE       timestamp(0),
  ACCESS_DATE      timestamp(0),
  USERNAME         VARCHAR(10),
  ARCH_TAPE        VARCHAR(20),
  ARCH_TAPE_POS    VARCHAR(15),
  ARCH_TAPE_FN     integer,
  ARCH_TAPE_DATE   timestamp(0),
  WARNINGS         VARCHAR(260),
  STATUS           integer,
  SAFE_TAPE        VARCHAR(20),
  SAFE_TAPE_POS    VARCHAR(15),
  SAFE_TAPE_FN     integer,
  SAFE_TAPE_DATE   timestamp(0),
  constraint pk_summain primary key (DS_INDEX)
);

create table SUM_OPEN (
  SUMID      bigint not null,
  OPEN_DATE  timestamp(0),
  constraint pk_sumopen primary key (SUMID)
);

create table SUM_PARTN_ALLOC (
  wd                 VARCHAR(80) not null,
  sumid              bigint not null,
  status             integer not null,
  bytes              bigint,
  effective_date     VARCHAR(20),
  archive_substatus  integer,
  group_id           integer,
  ds_index           bigint not null,
  safe_id            integer
);

create table SUM_PARTN_AVAIL (
  partn_name   VARCHAR(80) not null,
  total_bytes  bigint not null,
  avail_bytes  bigint not null,
  pds_set_num  integer not null,
  constraint pk_sumpartnavail primary key (partn_name)
);

create table SUM_TAPE (
  tapeid        varchar(20) not null,
  nxtwrtfn      integer not null,
  spare         integer not null,
  group_id      integer not null,
  avail_blocks  bigint not null,
  closed        integer not null,
  last_write    timestamp(0),
  constraint pk_tape primary key (tapeid)
);

create sequence SUM_SEQ increment 1 start 2 no maxvalue no cycle cache 50;

create sequence SUM_DS_INDEX_SEQ increment 1 start 1 no maxvalue no cycle cache 10;

create table SUM_FILE (
  tapeid     varchar(20) not null,
  filenum    integer not null,
  gtarblock  integer,
  md5cksum   varchar(36) not null,
  constraint pk_file primary key (tapeid, filenum)
);

create table SUM_GROUP (
  group_id        integer not null,
  retain_days     integer not null,
  effective_date  VARCHAR(20),
  constraint pk_group primary key (group_id)
);
(These statements are contained in the scripts create_tables.sql, sum_file.sql, and sum_group.sql in the JSOC software library base/sums/scripts/postgres.) For example, if you have created a database named mydb on a server named myserver, and one of those scripts is in your working directory, you could enter the command
psql -h myserver mydb -f create_tables.sql
Or you could simply enter the commands by hand. (You should be the database administrator when you create these tables.)
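Because the statements are spread across three scripts, the psql invocation above has to be repeated once per script. The following is a minimal sketch of that loop, assuming psql is on the PATH and that the three scripts sit in one directory. The helper names psql_commands and run_all are made up for this illustration; the host and database names are the example values used above.

```python
import shlex
import subprocess
from pathlib import Path

# Script names as listed in the text, run in the order they are mentioned.
SUMS_SCRIPTS = ["create_tables.sql", "sum_file.sql", "sum_group.sql"]


def psql_commands(host, dbname, script_dir="."):
    """Build one psql argument list per SUMS script (nothing is executed)."""
    return [
        ["psql", "-h", host, dbname, "-f", str(Path(script_dir) / name)]
        for name in SUMS_SCRIPTS
    ]


def run_all(host, dbname, script_dir="."):
    """Run each script in turn; stop at the first psql failure."""
    for argv in psql_commands(host, dbname, script_dir):
        subprocess.run(argv, check=True)


# Print the commands for review instead of running them.
for argv in psql_commands("myserver", "mydb"):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to confirm the host, database, and script paths before calling run_all as the database administrator.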