Overview of JSOC Data Series, DRMS, and SUMS

In the following we describe the logical organization of the HMI/AIA JSOC Data Record Management System (DRMS), also referred to as the JSOC catalog below, and define a number of terms used to describe data in the JSOC at various levels of abstraction. This section is based on the DRMS description by Rasmus Larsen linked at the bottom of the page.

The JSOC data series

Data is stored in the JSOC in "Data Series." A Data Series (or dataseries) is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A dataseries consists of a sequence of Data Records. Usually, each datarecord is the data for one step in "time". Most but certainly not all dataseries are sequences in time. They can be in principle any list of data objects. A good way to think about a dataseries is as a table of rows and columns where each row is a record. The columns contain metadata and descriptors of access to the binary data objects.

A datarecord is the basic "atomic unit" of a dataseries, or more precisely: the smallest unit that will be individually registered and available for export from a data series in the JSOC catalog. Most (if not all) access to the JSOC archive by both pipeline processing modules and external data export services will be in terms of data records. In other words, what we informally call the "JSOC catalog" is first and foremost a data record catalog.

A datarecord consists of keyword-tagged metadata describing the record and zero or more named datasegments, usually containing binary arrays of data values. All datarecords in a given dataseries have the same set of keyword and datasegment names, each with record-specific values. The dataseries description and the datarecords are maintained in a relational database called DRMS (Data Record Management System). DRMS is implemented as a set of [http://www.postgresql.org/ PostgreSQL] tables. There is one database table for each series containing the values of keywords, segment metadata, and links for all data records in the series. The values for a single data record are contained in a single row in that table.

In summary:

A Dataseries consists of a sequence of:
- Records which consist of a set of:
  - Keywords and
  - Segments which consist of:
    - structure information and
    - storage unit identifier
  - Links which provide pointers to associated records in other series.
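
The hierarchy summarized above can be pictured as a handful of nested container types. The following is only an illustrative sketch: the class and field names (Series, Record, Segment, Link, protocol, target_series) are assumptions made for this example and are not the actual DRMS structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

KeywordValue = Union[int, float, str]


@dataclass
class Segment:
    """Descriptor of a data segment; the bulk data itself lives in a storage unit."""
    name: str
    protocol: str                 # structure information, e.g. "fits" or "png"
    sunum: Optional[int] = None   # storage unit identifier, as in the summary above


@dataclass
class Link:
    """Pointer to an associated record in another series."""
    name: str
    target_series: str


@dataclass
class Record:
    """One row of a series: keywords, segments, and links."""
    recnum: int
    keywords: Dict[str, KeywordValue] = field(default_factory=dict)
    segments: Dict[str, Segment] = field(default_factory=dict)
    links: Dict[str, Link] = field(default_factory=dict)


@dataclass
class Series:
    """A sequence of records that all share the same keyword and segment names."""
    name: str
    prime_keys: List[str] = field(default_factory=list)
    records: List[Record] = field(default_factory=list)


# Hypothetical series with a single record holding one segment.
series = Series(name="demo.series", prime_keys=["T_OBS"])
record = Record(recnum=1, keywords={"T_OBS": 0})
record.segments["fd_V"] = Segment(name="fd_V", protocol="fits", sunum=12342)
series.records.append(record)
```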

DataRecords

Datarecords contain several types of metadata including keyword values, segment descriptors, record links, and some processing information. Each record in a particular series is given a record number (called recnum) which serves as its ultimate identification in the database. Usually one or more keywords are designated prime keys, which are the primary way records are identified for the user. The prime keys are used together to identify a dataseries record and to define the main index for the series. Any records with the same set of prime key values are treated as different versions of the same record. Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series. The pre-defined keyword "recnum" is used for the main index if no prime keys are defined. If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.
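
The versioning rule described above (same prime key values means same record, highest recnum wins) can be sketched in a few lines. This is a minimal illustration and not DRMS code: records are plain dictionaries, and the function name latest_versions and the keyword name T_OBS are assumptions made for the example.

```python
def latest_versions(records, prime_keys):
    """Return the newest record (highest recnum) for each prime key value set.

    If no prime keys are defined, recnum itself acts as the index, so every
    record is its own latest version.
    """
    latest = {}
    for rec in records:
        if prime_keys:
            key = tuple(rec[k] for k in prime_keys)
        else:
            key = (rec["recnum"],)
        if key not in latest or rec["recnum"] > latest[key]["recnum"]:
            latest[key] = rec
    return latest


# Hypothetical series with one prime key, T_OBS. Record 3 is a newer
# version of record 1 because both share T_OBS == 0.
records = [
    {"recnum": 1, "T_OBS": 0, "QUALITY": 1},
    {"recnum": 2, "T_OBS": 45, "QUALITY": 0},
    {"recnum": 3, "T_OBS": 0, "QUALITY": 0},
]
current = latest_versions(records, ["T_OBS"])
```

Here the older version (recnum 1) is still present in the input list, just as older versions remain in the series table, but it is no longer the one returned for its prime key value.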

To access a set of records from a series, a description must be provided that selects the desired records. We call that description a "Dataset Name". Thus, in JSOC/DRMS a dataset name is actually a database query. The DRMS dataset name rules have been defined with the goal of providing user-friendly names that are easy to remember and use.

Keywords

A data record contains zero or more (typically many) named keywords that each map to a value of a simple type such as integer, float, string, or time associated with the record. Keywords are often used to store meta-data describing properties, history and/or context of the main image/observable data stored in the record's data segments. This is a concept familiar from standard file-based data formats, such as FITS, where the FITS header keywords would correspond to the JSOC keywords and the primary binary arrays or tables would correspond to the JSOC data segments.

In the JSOC catalog, keyword values are stored in database tables separate from the files holding the data segments. This makes it possible to:

- modify keyword values without having to locate, access and possibly rewrite files on disk or tape,
- rapidly find data records whose keywords satisfy a given condition by executing a database query,
- rapidly extract time series of the values of keywords from all or a subset of records in a series. This can be useful for, e.g., trend analysis or time series analysis of global properties of data products, such as the mean value or other image statistics.
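
The three capabilities listed above can be illustrated with a small stand-in table. This sketch uses Python's built-in sqlite3 module in place of PostgreSQL purely for convenience, and the table name demo_series and the columns t_obs, datamean and quality are assumptions invented for the example, not an actual DRMS series.

```python
import sqlite3

# In-memory stand-in for a per-series keyword table: one row per record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_series ("
    "recnum INTEGER PRIMARY KEY, t_obs INTEGER, datamean REAL, quality INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_series VALUES (?, ?, ?, ?)",
    [(1, 0, 10.0, 0), (2, 45, 12.0, 0), (3, 90, 11.0, 1)],
)

# 1. Modify a keyword value without touching any data file.
conn.execute("UPDATE demo_series SET quality = 0 WHERE recnum = 3")

# 2. Find records whose keywords satisfy a condition.
bright = [
    row[0]
    for row in conn.execute(
        "SELECT recnum FROM demo_series WHERE datamean > 10.5 ORDER BY recnum"
    )
]

# 3. Extract a time series of a keyword for trend analysis.
trend = conn.execute(
    "SELECT t_obs, datamean FROM demo_series ORDER BY t_obs"
).fetchall()
```

None of these statements opens a segment file, which is the point of keeping keyword values in the database.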

There is one database table for each series containing the values of keywords and links for all data records in the series. The values for a single data record are contained in a single row in that table.

Prime Keywords

For many series it is desirable to have a primary index associated with the principal axis (e.g. time or (latitude, longitude)) of each datarecord. The intention is that the primary index maps to a unique value or slot on the principal axis. There might exist multiple versions of the "same observation" (e.g. newer versions could be created to include earlier missing data or to fix a bad calibration). Since there might be multiple versions of the "same" record, the primary index does not uniquely identify a data record.

The primary index consists of one or more keyword values that are logically concatenated to form the full index. If two records have keyword values that differ on any of the keywords comprising the primary index, they are considered different data records (w.r.t. the primary index); otherwise they are considered only different versions of the same data record (w.r.t. the primary index). The default behavior of the JSOC is to return the most recent version of a datarecord for a given primary index. Since record numbers (recnums) are assigned in order of creation, the most recent version is the record with the highest recnum. The primary index has two crucial uses in the JSOC:

- It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This also allows programs and scripts to step through datasets in logical (e.g. time) order, rather than in record creation order as given by the record number, which is arbitrary.
- It allows the JSOC database system to maintain column indexes on the keywords corresponding to the primary index of a series. This vastly speeds up queries that select sets of records based on the primary index (possibly in combination with other criteria), and such queries probably make up the majority of all queries in the system.

Segments

A data record contains zero or more named data segments. The data segments contain data of large volume associated with the data record. While the DRMS record contains the description of each datasegment, the information contained in a datasegment is not stored in the database but is stored in Storage Units "owned" by SUMS (Storage Unit Management System). Storageunits are simply directories containing files. SUMS itself maintains tables in PostgreSQL to track storageunit locations on disk and/or tape. A storage unit may contain data for 1 or more datasegments for 1 or more datarecords.

The segment metadata includes information such as the storage protocol, compression information, image dimensions, etc. The contents of the data segments for a given data record are stored in files in a directory on disk (possibly also archived on tape). To make transfer and storage of JSOC data more manageable and efficient, data segments for multiple data records in a given series may be grouped together and managed as a single storage unit. This typically gives rise to a directory structure like

/SUM1/D012342/S00000/fd_V.fits
                     image.png
                     small_fd_V.png
              S00001/fd_V.fits
                     image.png
                     small_fd_V.png
              S00002/fd_V.fits
                     image.png
                     small_fd_V.png

where in this example the records are assumed to contain three segments named fd_V, image, and small_fd_V, stored in .fits, .png, and .png files respectively. In general the file name for each data segment is of the form:

<disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>

Note that the default naming system does not contain the record identification. This can be determined via the storage unit number if needed. When data is exported outside the JSOC, filenames containing the seriesname and prime key values are usually created.
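
As a rough illustration of the naming pattern above, the helper below assembles a segment file path from its parts. The function name segment_path is hypothetical, and the five-digit zero padding of the slot number is an assumption read off the example directory listing; the storage unit number is passed as a string so that no padding width is assumed for it.

```python
def segment_path(disk, storage_unit, slot, segment_name, protocol):
    """Build <disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>.

    storage_unit is a string so that its digits are used exactly as given;
    slot is zero-padded to five digits (assumption based on the example).
    """
    return f"{disk}/D{storage_unit}/S{slot:05d}/{segment_name}.{protocol}"


# Reproduce the first and last dopplergram files from the example listing.
first = segment_path("/SUM1", "012342", 0, "fd_V", "fits")
last = segment_path("/SUM1", "012342", 2, "fd_V", "fits")
```

Note that nothing in the resulting path identifies the record itself, which matches the remark above about the default naming system.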

Links

A data record contains zero or more named links. Links are pointers between data records and make it possible for data records to inherit keyword values from each other, and to capture other dependencies between them such as processing history. For example, a data record can contain links to the data records that were used in creating it, such as a dopplergram data record pointing to the filtergrams from which it was created. Links come in two varieties, static and dynamic:

- A static link points to a specific data record in the target series identified by (target series name, record number).
- A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specified value of the primary index in the target series. The DRMS resolves (binds) the link to the record number of the latest version each time the data record containing the link is opened.
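
The difference between the two link types can be sketched as follows. This is a toy model under stated assumptions, not the DRMS implementation: records and links are plain dictionaries, and the function name resolve_link and the dictionary keys (kind, recnum, prime_keys, prime_values) are invented for the illustration.

```python
def resolve_link(link, target_records):
    """Return the record in the target series that a link points to.

    A static link names a fixed recnum. A dynamic link names prime key
    values and is bound to the newest matching record at call time.
    """
    if link["kind"] == "static":
        return next(r for r in target_records if r["recnum"] == link["recnum"])
    matches = [
        r
        for r in target_records
        if tuple(r[k] for k in link["prime_keys"]) == link["prime_values"]
    ]
    return max(matches, key=lambda r: r["recnum"])


# Hypothetical target series with two versions of the T_OBS == 0 record.
target = [
    {"recnum": 1, "T_OBS": 0},
    {"recnum": 2, "T_OBS": 0},
    {"recnum": 3, "T_OBS": 45},
]
static_link = {"kind": "static", "recnum": 1}
dynamic_link = {"kind": "dynamic", "prime_keys": ["T_OBS"], "prime_values": (0,)}

before = resolve_link(dynamic_link, target)["recnum"]

# A newer version of the T_OBS == 0 record appears in the target series.
target.append({"recnum": 4, "T_OBS": 0})
after = resolve_link(dynamic_link, target)["recnum"]
pinned = resolve_link(static_link, target)["recnum"]
```

The dynamic link follows the new version once it exists, while the static link keeps pointing at the record it was created with.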

DRMS

The Data Record Management System (DRMS) consists of a set of database tables and software to manage those tables, allowing the user to create and use datarecords. The implementation is described in detail in XXXX. The user's view is usually from the web (e.g. [http://jsoc.stanford.edu/ajax/lookdata.html lookdata.html]), from shell-level commands (e.g. [http://jsoc.stanford.edu/doxygen_html/group__show__info.html show_info]), from compiled programs using the DRMS API described at [http://jsoc.stanford.edu/doxygen_html/group__c__api.html JSOC API man pages], or from user-built support for e.g. IDL.

SUMS

SUMS details

See: [wiki:SumsDataModel SUMS - the Storage Unit Management System] for more information about SUMS implementation and use in the JSOC system.

Storage Units

The atomic unit of data managed by the JSOC storage system is called a storage unit; the storage system is therefore named the Storage Unit Management System (SUMS). Each storage unit contains the data segment part of one or more datarecords from a single dataseries, and corresponds to the contents of a single directory (possibly with subdirectories for each datarecord). A storage unit index (denoted sunum, or internally DSIndex for historical reasons) is stored with each data record and identifies the storage unit holding the data segments for the record.

A storage unit may be stored online on magnetic disk, offline (e.g. on a magnetic tape in a cabinet), or nearline on a tape in a robotic tape library. (The particular storage medium is not important to the concept.) In response to a user's request to access a particular datarecord, the JSOC catalog identifies the storage unit containing that datarecord by looking up its sunum. The sunum is an index into the SUMS internal catalog, which tracks the location of each storage unit. If the requested storage unit is not online, SUMS allocates storage space, names a directory, and copies the storage unit into that directory. SUMS then reports the working directory pathname to the JSOC catalog, where it is accessible to the user.

All storage units are owned and managed by SUMS. Storage units are "write-once" objects, and clients of SUMS can perform only two operations on them:

- open an existing unit as read-only,
- create a new unit.

Deletion or modification of storage units is restricted to SUMS administrative programs and requires special privileges. The datasegments for a particular dataseries will in general be stored in many storage units. The default size (number of records) of a unit is specified when a series is created. It is chosen based on knowledge of the size of the data records and how they are likely to be computed, such that a storage unit corresponds to the output of a "natural" processing batch and/or is a convenient size to handle for data export.
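
The client-facing behaviour described in this section (create a unit, open it read-only, stage it back if it is not online) can be modelled with a toy catalog. This is only a conceptual sketch: the class ToySums and its methods are hypothetical, no files are actually read or written, and the real SUMS interface is not shown here.

```python
class ToySums:
    """Toy model of a storage unit catalog keyed by sunum."""

    def __init__(self, disk="/SUM1"):
        self.disk = disk
        self._catalog = {}      # sunum -> {"files": tuple, "path": str or None}
        self._next_sunum = 1

    def create_unit(self, files):
        """Client operation: create a new write-once unit and return its sunum."""
        sunum = self._next_sunum
        self._next_sunum += 1
        self._catalog[sunum] = {
            "files": tuple(files),              # a tuple, so contents cannot be changed
            "path": f"{self.disk}/D{sunum}",    # unit starts out online
        }
        return sunum

    def take_offline(self, sunum):
        """Administrative step (not a client operation): unit now lives only on tape."""
        self._catalog[sunum]["path"] = None

    def open_unit(self, sunum):
        """Client operation: open a unit read-only, staging it to disk if needed."""
        entry = self._catalog[sunum]
        if entry["path"] is None:
            # Allocate space, name a directory, and "copy" the unit back online.
            entry["path"] = f"{self.disk}/D{sunum}"
        return entry["path"], entry["files"]


sums = ToySums()
sunum = sums.create_unit(["S00000/fd_V.fits", "S00001/fd_V.fits"])
sums.take_offline(sunum)
path, files = sums.open_unit(sunum)
```

Clients only ever see the two operations create_unit and open_unit; taking a unit offline stands in for the administrative side that the text restricts to privileged programs.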

Older Documents

There are several older documents that, while not accurate in describing the JSOC system as it is now implemented, do contain useful information about its design, intent, and usage ideas. These are:

Design discussions:
NASA requested overview, CDRL 326a,b,c

Revision 5 as of 2009-02-17 08:55:39 (editor: DNab42dffa); previous revision 4 as of 2009-02-17 07:24:18.