Differences between revisions 6 and 7

Overview of JSOC Data Series, DRMS, and SUMS

In the following we describe the logical organization of the HMI/AIA JSOC Data Record Management System (DRMS), also referred to as the JSOC catalog below, and define a number of terms used to describe data in the JSOC at various levels of abstraction. This section is based on the DRMS description by Rasmus Larsen linked at the bottom of the page.

The JSOC Data Series

Data are stored in the JSOC in "Data Series." A data series (sometimes dataseries) is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A data series consists of a sequence of Data Records. Usually, each data record is the data for one step in "time". Most but certainly not all data series are sequences in time. They can be in principle any list of data objects. A good way to think about a data series is as a table of rows and columns where each row is a record. The columns contain metadata and descriptors of access to the binary data objects.

A data record is the basic "atomic unit" of a data series, or more precisely, the smallest unit that will be individually registered and available for export from a data series in the JSOC catalog. Most (if not all) access to the JSOC archive by both pipeline processing modules and external data export services will be in terms of data records. In other words, what we informally call the "JSOC catalog" is first and foremost a data record catalog.

A data record consists of keyword-tagged metadata describing the record and zero or more named data segments, usually containing binary arrays of data values. All data records in a given data series have the same set of keyword and data segment names and associated record-specific values. The data series description and the data records are maintained in a relational data base called DRMS (Data Record Management System). DRMS is implemented as a set of [http://www.postgresql.org/ PostgreSQL] tables. There is one data base table for each series containing the values of keywords, segment metadata, and links for all data records in the series. The values for a single data record are contained in a single row in that table.

In summary:

A Data Series consists of a sequence of:
- Records which consist of a set of:
  - Keywords and
  - Segments which consist of:
    - structure information and
    - storage unit identifier
  - Links that provide pointers to associated records in other series.
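The hierarchy summarized above can be sketched as a handful of Python dataclasses. This is only an illustrative model, not the DRMS implementation: the class names, field names, and the example series name and keyword are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

KeywordValue = Union[int, float, str]  # simple keyword types (times kept as strings here)


@dataclass
class Segment:
    name: str
    protocol: str          # structure information, reduced here to the storage protocol
    sunum: Optional[int]   # storage unit identifier; None if no data has been written


@dataclass
class Link:
    name: str
    target_series: str     # series containing the associated record


@dataclass
class Record:
    recnum: int
    keywords: Dict[str, KeywordValue] = field(default_factory=dict)
    segments: Dict[str, Segment] = field(default_factory=dict)
    links: Dict[str, Link] = field(default_factory=dict)


@dataclass
class DataSeries:
    name: str
    prime_keys: Tuple[str, ...]
    records: List[Record] = field(default_factory=list)


# Hypothetical example: one record with one keyword and one segment.
series = DataSeries(name="example.series", prime_keys=("T_OBS",))
rec = Record(recnum=1, keywords={"T_OBS": "2009.04.17_03:24:00"})
rec.segments["fd_V"] = Segment(name="fd_V", protocol="fits", sunum=12342)
series.records.append(rec)
```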

Data Records

Data records contain several types of metadata including keyword values, segment descriptors, record links, and some processing information. Each record in a particular series is given a record number (called recnum) which serves as its ultimate identification in the database. Usually one or more keywords are designated prime keys, which are the primary way records are identified for the user. The prime keys are used together to identify a data series record and are used to define the main index for the series. Any records with the same set of prime key values are treated as different versions of the same record. Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series. The pre-defined keyword "recnum" is used for the main index in the case that no prime keys are defined. If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.

To access a set of records from a series, a description that selects the desired records must be provided. We call that description a "Dataset Name". Thus, in JSOC/DRMS a dataset name is actually a database query. The DRMS dataset name rules have been defined to provide user-friendly names (or at least that is the goal) that are easy to remember and use.

Keywords

A data record contains zero or more (typically many) named keywords that each map to a value of a simple type such as integer, float, string, or time associated with the record. Keywords are often used to store metadata describing properties, history, and/or context of the main image/observable data stored in the record's data segments. This is a concept familiar from standard file-based data formats, such as FITS, where the FITS header keywords would correspond to the JSOC keywords and the primary binary arrays or tables would correspond to the JSOC data segments.

In the JSOC catalog keyword values are stored in database tables separate from the files holding the data segments. This makes it possible to:

- modify keyword values without having to locate, access and possibly rewrite files on disk or tape,
- rapidly find data records whose keywords satisfy a given condition by executing a database query,
- rapidly extract time series of the values of keywords from all or a subset of records in a series. This can be useful for, say, trend analysis or time series analysis of global properties, e.g. the mean value or other image statistics of data products.

There is one database table for each series containing the values of keywords and links for all data records in the series. The values for a single data record will be contained in a single row in that table.
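As a rough illustration of the three operations listed above, the sketch below builds a one-table-per-series layout and runs a keyword update, a conditional search, and a time series extraction. It uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL, and the table name, column names, and values are invented for the example; the real DRMS table layout is not shown here.

```python
import sqlite3

# In-memory database standing in for the catalog; one table per series,
# one row per data record, one column per keyword (all names hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_series ("
    "recnum INTEGER PRIMARY KEY, t_obs TEXT, datamean REAL, quality INTEGER)"
)
conn.executemany(
    "INSERT INTO example_series VALUES (?, ?, ?, ?)",
    [
        (1, "2010-05-01T00:00:00", 10.5, 0),
        (2, "2010-05-01T00:01:00", 11.0, 0),
        (3, "2010-05-01T00:02:00", 99.0, 1),
    ],
)

# 1. Modify a keyword value without touching any data file.
conn.execute("UPDATE example_series SET quality = 0 WHERE recnum = 3")

# 2. Find the records whose keywords satisfy a condition.
good = conn.execute(
    "SELECT recnum FROM example_series WHERE datamean < 50 ORDER BY recnum"
).fetchall()

# 3. Extract a time series of one keyword across the series.
trend = conn.execute(
    "SELECT t_obs, datamean FROM example_series ORDER BY t_obs"
).fetchall()
```

None of these statements opens a segment file, which is the point of keeping keyword values in the database.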

Prime Keywords and the Primary Index

For many data series it is useful to identify a primary index associated with the principal axis of the data records (e.g. time or (latitude, longitude)). The intention is that the primary index maps to a unique value or slot on the principal axis. There might exist multiple versions of the "same observation" (e.g. newer versions could be created to include earlier missing data or to fix a bad calibration). Since there might be multiple versions of the "same" record, the primary index does not uniquely identify a data record.

The primary index consists of one or more prime keyword values that are logically concatenated to form the full index. If two records have keyword values that differ on any of the keywords comprising the primary index, they are considered different data records (w.r.t. the primary index); otherwise they are considered to be only different versions of the same data record (w.r.t. the primary index). The default behavior of the JSOC is to return the most recent version of a data record for a given primary index. Since record numbers (recnums) are assigned in order of creation, the most recent version is the record with the highest recnum. The primary index has two crucial uses in the JSOC:

- It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This will also allow programs and scripts to step through data sets in logical (e.g. time) order, rather than in record creation order as given by the record number, which is arbitrary.
- It allows the JSOC database system to maintain column indexes on the keywords corresponding to the primary index of a series. This vastly speeds up queries that select sets of records based on the primary index (possibly in combination with other criteria), and these are probably the majority of all queries in the system.
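The version rule described in this section (records with the same prime keyword values are versions of one record, and the highest recnum wins) can be sketched in a few lines of Python. The function name, the dictionary layout of a record, and the keyword names T_REC and calib are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def latest_versions(records: List[dict], prime_keys: Tuple[str, ...]) -> List[dict]:
    """Return one record per primary index value: the one with the highest recnum."""
    latest: Dict[tuple, dict] = {}
    for rec in records:
        # Concatenate the prime keyword values to form the primary index.
        index = tuple(rec[key] for key in prime_keys)
        current = latest.get(index)
        if current is None or rec["recnum"] > current["recnum"]:
            latest[index] = rec
    # Step through the result in primary index order, not creation order.
    return [latest[index] for index in sorted(latest)]


# Hypothetical series: recnum 3 is a recalibrated version of recnum 1.
records = [
    {"recnum": 1, "T_REC": 0, "calib": "v1"},
    {"recnum": 2, "T_REC": 60, "calib": "v1"},
    {"recnum": 3, "T_REC": 0, "calib": "v2"},
]
selected = latest_versions(records, ("T_REC",))
```

Here the older version (recnum 1) is still present in the input but is not returned, mirroring the default behavior described above.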

Segments

A data record contains zero or more named data segments. The data segments contain data of large volume associated with the data record. While the DRMS record contains the description of each data segment, the information contained in a data segment is not stored in the data base but is stored in Storage Units "owned" by SUMS (Storage Unit Management System). Storage units are simply directories containing files. SUMS itself maintains tables in PostgreSQL to track storage unit locations on disk and/or tape. A storage unit may contain data for one or more data segments for one or more data records.

The segment metadata includes information such as the storage protocol, compression information, image dimensions, etc. The contents of the data segments for a given data record are stored in files in a directory on disk (possibly also archived on tape). To make transfer and storage of JSOC data more manageable and efficient, data segments for multiple data records in a given series may be grouped together and managed as a single storage unit. This typically gives rise to a directory structure like

/SUM1/D012342/S00000/fd_V.fits
                     image.png
                     small_fd_V.png
              S00001/fd_V.fits
                     image.png
                     small_fd_V.png
              S00002/fd_V.fits
                     image.png
                     small_fd_V.png

where in this example the records are assumed to contain three segments named fd_V, image, and small_fd_V, stored in .fits, .png, and .png files respectively. In general the file name for each data segment is of the form:

<disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>

Note that the default naming system does not contain the record identification. This can be determined via the storage unit number if needed. When data are exported outside the JSOC, file names containing the series name and prime keyword values are usually created.
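The file name pattern above can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and the zero-padding widths (six digits for the storage unit, five for the slot) are inferred from the example listing rather than taken from any specification.

```python
from pathlib import PurePosixPath


def segment_path(disk: str, storage_unit: int, slot: int,
                 segment_name: str, protocol: str) -> PurePosixPath:
    """Build <disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>."""
    # Padding widths are an assumption based on the example directory listing.
    return (PurePosixPath(disk)
            / f"D{storage_unit:06d}"
            / f"S{slot:05d}"
            / f"{segment_name}.{protocol}")


path = segment_path("/SUM1", 12342, 0, "fd_V", "fits")
```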

Links

A data record contains zero or more named links. Links are pointers between data records and make it possible for data records to inherit keyword values from each other, and to capture other dependencies between them such as processing history. For example, a data record can contain links to the data records that were used in creating it, such as a dopplergram data record pointing to the filtergrams from which it was created. Links come in two varieties, static and dynamic:

- A static link points to a specific data record in the target series identified by (target series name, record number).
- A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specified value of the primary index in the target series. The DRMS resolves/binds the link to the record number of the latest version each time the data record containing the link is opened.
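The difference between the two link types shows up when a new version of the target record is created. The following Python sketch models that behavior under simplifying assumptions: the catalog is a plain dictionary from series name to a list of records, and the function name, dictionary keys, and series name are all hypothetical.

```python
def resolve_link(link: dict, catalog: dict) -> dict:
    """Return the target record of a static or dynamic link."""
    records = catalog[link["target_series"]]
    if link["kind"] == "static":
        # Static: bound once to a specific record number.
        for rec in records:
            if rec["recnum"] == link["recnum"]:
                return rec
        raise KeyError("no record with recnum %d" % link["recnum"])
    # Dynamic: latest version among records sharing the primary index value.
    matches = [rec for rec in records if rec["index"] == link["index"]]
    if not matches:
        raise KeyError("no record with the requested primary index")
    return max(matches, key=lambda rec: rec["recnum"])


catalog = {"example.filtergrams": [{"recnum": 10, "index": ("t0",)}]}
static_link = {"kind": "static", "target_series": "example.filtergrams", "recnum": 10}
dynamic_link = {"kind": "dynamic", "target_series": "example.filtergrams", "index": ("t0",)}

before = resolve_link(dynamic_link, catalog)["recnum"]

# A newer version of the same observation is added to the target series.
catalog["example.filtergrams"].append({"recnum": 11, "index": ("t0",)})

after_dynamic = resolve_link(dynamic_link, catalog)["recnum"]
after_static = resolve_link(static_link, catalog)["recnum"]
```

After the new version is added, the dynamic link follows it while the static link still points at the original record.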

DRMS

The Data Record Management System (DRMS) consists of a set of data base tables and software to manage those tables, allowing the user to create and use data records. The implementation is described in detail in XXXX. The user's view is usually from the web (e.g. [http://jsoc.stanford.edu/ajax/lookdata.html lookdata.html]), from shell level commands (e.g. [http://jsoc.stanford.edu/doxygen_html/group__show__info.html show_info]), from compiled programs using the DRMS API described at [http://jsoc.stanford.edu/doxygen_html/group__c__api.html JSOC API man pages], or from user-built support systems, such as those for IDL.

SUMS

SUMS details

See: [wiki:SumsDataModel SUMS - the Storage Unit Management System] for more information about SUMS implementation and use in the JSOC system.

Storage Units

The atomic unit of data managed by the JSOC storage system is called a storage unit. The JSOC storage system is therefore denoted Storage Unit Management System (SUMS). Each storage unit contains the data segment part of one or more data records from a single data series. Each storage unit corresponds to the contents of a single directory [possibly with subdirectories for each data record]. A storage unit index (denoted sunum, or internally as DSIndex for historical reasons) is stored with each data record and identifies the storage unit holding the data segments for the record. A storage unit may be stored online on magnetic disk, nearline on a tape in a robotic tape library, or offline, e.g. on a magnetic tape in a cabinet. The particular storage medium is not important to the concept. In response to a user's request to access a particular data record, the JSOC catalog will identify the storage unit containing that data record by looking up its sunum. The sunum is an index into the SUMS internal catalog, which tracks the location of each storage unit. If the requested storage unit is not online, SUMS will allocate storage space, name a directory, and copy the storage unit into that directory. SUMS will then report the working directory pathname to the JSOC catalog, where it is accessible to the user. All storage units are owned and managed by SUMS.

Storage units are "write-once" objects and clients of SUMS can only perform two operations on them:

- open an existing unit as read-only
- create a new unit.

Deletion or modification of storage units is restricted to SUMS administrative programs and requires special privileges. The data segments for a particular data series will in general be stored in many storage units. The default size (number of records) of a unit is specified when a series is created. It is chosen based on knowledge of the size of the data records and how they are likely to be computed, such that a storage unit corresponds to the output of a "natural" processing batch and/or is a convenient size to handle for data export.
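The client-facing behavior described in this section (create a unit, open it read-only, and have it staged back to disk if it is not online) can be modeled with a toy class. This is a sketch under stated assumptions and bears no relation to the actual SUMS interface: the class, its methods, and the file names are all invented, and staging from tape is reduced to flipping a flag.

```python
class ToySums:
    """Toy stand-in for SUMS: clients may only create units or open them read-only."""

    def __init__(self):
        self._units = {}       # sunum -> tuple of file names (immutable once created)
        self._online = set()   # sunums currently available on disk
        self._next_sunum = 1

    def create_unit(self, file_names):
        """Create a new write-once unit and return its sunum."""
        sunum = self._next_sunum
        self._next_sunum += 1
        self._units[sunum] = tuple(file_names)
        self._online.add(sunum)
        return sunum

    def open_unit(self, sunum):
        """Open a unit read-only, staging it first if it is not online."""
        if sunum not in self._online:
            self._online.add(sunum)  # stand-in for copying the unit back from tape
        return self._units[sunum]

    def is_online(self, sunum):
        return sunum in self._online

    def _admin_take_offline(self, sunum):
        """Administrative operation, not available to ordinary clients."""
        self._online.discard(sunum)


sums = ToySums()
sunum = sums.create_unit(["S00000/fd_V.fits", "S00001/fd_V.fits"])
sums._admin_take_offline(sunum)   # simulate the unit having been moved off disk
files = sums.open_unit(sunum)     # the client request triggers staging
```

Returning a tuple rather than a list is one simple way to reflect the read-only, write-once character of a unit in this model.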

Older Documents

There are several older documents that, while not accurate in describing the JSOC system as it is now implemented, do contain useful information about the design, intent, and usage ideas. These are:

Design discussions:
NASA requested overview, CDRL 326a,b,c

-  ⇤ ← Revision 6 as of 2009-04-17 03:24:54 → 
  Size: 14208
  Editor: solpc2
  Comment:
+   ← Revision 7 as of 2011-05-20 03:17:04 → ⇥
  Size: 14393
  Editor: l4-m0
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 7:
-== The JSOC data series ==
+== The JSOC Data Series ==
 Line 9:
-Data is stored in the JSOC in "Data Series."  A Data Series (or dataseries) is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A dataseries consists of a sequence of Data Records. Usually, each datarecord is the data for one step in "time". Most but certainly not all dataseries are sequences in time. They can be in principle any list of data objects. A good way to think about a dataseries is as a table of rows and columns where each row is a record.  The columns contain metadata and descriptors of access to the binary data objects.
+Data are stored in the JSOC in '''"Data Series."'''  A data series (sometimes ''dataseries'') is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A data series consists of a sequence of '''Data Records.''' Usually, each data record is the data for one step in "time". Most but certainly not all data series are sequences in time. They can be in principle any list of data objects. A good way to think about a data series is as a table of rows and columns where each row is a record.  The columns contain metadata and descriptors of access to the binary data objects.
 Line 11:
-A datarecord is the basic "atomic unit" of a dataseries, or more precisely: The smallest unit that
+A ''data record'' is the basic "atomic unit" of a data series, or more precisely, the smallest unit that
 Line 15:
-catalog" is fi rst and foremost a data record catalog.
+catalog" is first and foremost a data record catalog.
 Line 17:
-A datarecord consists of Keyword tagged meta-data describing the record and 0 or more named Datasegments usually containing binary arrays of data values.  All datarecords in a given dataseries have the same set of keyword and datasegment names and associated record specific values.  The dataseries description and the datarecords are maintained in a relational database called DRMS (Data Record Managment System). DRMS is implemented as a set of [http://www.postgresql.org/ PostgreSQL] tables.  
There is one database table for each series containing the values of keywords, segment metadata, and links for all
data record in the series. The values for a single data record are contained in a single row in
that table.
+A ''data record'' consists of Keyword-tagged metadata describing the record and 0 or more named '''data segments''' usually containing binary arrays of data values.  All data records in a given data series have the same set of keyword and data segment names and associated record-specific values.  The data series description and the data records are maintained in a relational data base called DRMS (Data Record Management System). DRMS is implemented as a set of [http://www.postgresql.org/ PostgreSQL] tables.  
There is one data base table for each series containing the values of keywords, segment metadata, and links for all
data record in the series. The values for a single data record are contained in a single row in that table.
-Line 25:
+Line 24:
- * A '''Dataseries''' consists of a sequence of:
+ * A '''Data Series''' consists of a sequence of:
-Line 31:
+Line 30:
-   * '''Links''' which provide pointers to associated records in other series.
+   * '''Links''' that provide pointers to associated records in other series.
-Line 33:
+Line 32:
-=== DataRecords ===
+=== Data Records ===
-Line 35:
+Line 34:
-Datarecords contain several types of metadata including keyword values, segment descriptors, record links, and some
+'''Data records''' contain several types of metadata including keyword values, segment descriptors, record links, and some
-Line 38:
+Line 37:
-Usually one or more keywords are designated ''prime keys'' which are the primary way records are identified for the user.  The prime keys are used together to uniquely identify a dataseries record and are used to define the main index for the series.  Any records with same sets of prime key values are treated as different versions of the same record.  Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series.  The pre-defined keyword "recnum" is used for the main index in the case that no prime keys are defined.  If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.
+Usually one or more keywords are designated ''prime keys,'' which are the primary way records are identified for the user.  The prime keys are used together to uniquely identify a dataseries record and are used to define the main index for the series.  Any records with the same set of prime key values are treated as different versions of the same record.  Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series.  The pre-defined keyword "recnum" is used for the main index in the case that no prime keys are defined.  If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.
-Line 40:
+Line 39:
-In order to access a set of records from a series a description must be provided to select the desired records.  We call that description a "Dataset Name".  Thus, in JSOC/DRMS a dataset name is actually a database query.  The DRMS dataset name rules have been defined to provide user friendly (well it is the goal) names that are easy to remember and use.
+In order to access a set of records from a series a description must be provided to select the desired records.  We call that description a '''"Dataset Name".'''  Thus, in JSOC/DRMS a dataset name is actually a database query.  The DRMS dataset name rules have been defined to provide user friendly (well it is the goal) names that are easy to remember and use.
-Line 44:
+Line 43:
-A data record contains zero or more (typically many) named keywords that each map to a value
+A data record contains zero or more (typically many) named '''keywords''' that each map to a value
-Line 54:
+Line 53:
- * rapidly finding data records whose keywords satisfy a given condition by executing a database query,
 * rapidly extract time series of the values of keywords from all or a subset of records in a series. This can be useful for, e.g., trend analysis or time series analysis of global properties, e.g. mean value or other image statistics, of data products.
+ * rapidly find data records whose keywords satisfy a given condition by executing a database query,
 * rapidly extract time series of the values of keywords from all or a subset of records in a series. This can be useful for, say, trend analysis or time series analysis of global properties, e.g. the mean value or other image statistics of data products.
-Line 58:
+Line 57:
-data record in the series. The values for a single data record will be contained in a single row in
that table.
+data records in the series. 
The values for a single data record will be contained in a single row in that table.
-Line 61:
+Line 60:
-=== Prime Keywords ===
+=== Prime Keywords and the Primary Index ===
-Line 63:
+Line 62:
-For many series a primary index associated with the principal axis (e.g. time or (lattitude, longitude))
associated with each datarecord is desired. The intention is that the primary index
+For many data series it is useful to identify a primary index associated with the principal axis of the data records (e.g. time or (latitude, longitude)). The intention is that the primary index
-Line 66:
+Line 64:
-observation" (e.g. newer versions could be created to include earlier missing data or to fi x a bad calibration).
+observation" (e.g. newer versions could be created to include earlier missing data or to fix a bad calibration).
-Line 69:
+Line 67:
-The primary index consists of one or more keyword values that are logically concatenated to form the full
+The '''primary index''' consists of one or more '''prime keyword''' values that are logically concatenated to form the full
-Line 71:
+Line 69:
-the keywords comprising the primary index, they are considered diff erent data record (w.r.t. the
primary index), otherwise they are considered only different versions of the same data record (w.r.t.
the primary index). The default behavior of the JSOC is to return the most recent version of a datarecord
+the keywords comprising the primary index, they are considered different data records (w.r.t. the
primary index), otherwise they are considered to be only different versions of the same data record (w.r.t.
the primary index). The default behavior of the JSOC is to return the most recent version of a data record
-Line 75:
+Line 73:
-version is record with the highest recnum. The primary index has two crucial uses in the
+version is the record with the highest recnum. The primary index has two crucial uses in the
-Line 77:
+Line 75:
- * It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This will also allow programs and scripts to step through datasets in logical (e.g. time) order, rather than in record creation order as given by the record number which is arbitrary.
+ * It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This will also allow programs and scripts to step through data sets in logical (e.g. time) order, rather than in record creation order as given by the record number which is arbitrary.
 Line 81:
-A data record contains zero or more named data segments. The data segments contain data of large volume associated with the data record. While the DRMS record contains the description of each datasegment, the information contained in a datasegment is not stored in the database but is stored in Storage Units "owned" by SUMS (Storage Unit Management System).  Storageunits are simply directories containing files.  SUMS itself maintains tables in PostgreSQL to track storageunit locations on disk and/or tape.  A storage unit may contain data for 1 or more datasegments for 1 or more datarecords.
+A data record contains zero or more named '''data segments.''' The data segments contain data of large volume associated with the data record. While the DRMS record contains the description of each data segment, the information contained in a data segment is not stored in the data base but is stored in Storage Units "owned" by SUMS (Storage Unit Management System).  Storage units are simply directories containing files.  SUMS itself maintains tables in PostgreSQL to track storage unit locations on disk and/or tape.  A storage unit may contain data for one or more data segments for one or more data records.
-Line 84:
+Line 85:
-The contents of the data segments for a given data record is stored in files in a directory on disk (possibly also archived on tape). To make transfer and storage of JSOC data more managable and efficient, data segments for multiple data records in a given series may be grouped together and managed as a single storage unit. This typically gives rise to a directory structure like
+The contents of the data segments for a given data record are stored in files in a directory on disk (possibly also archived on tape). To make transfer and storage of JSOC data more manageable and efficient, data segments for multiple data records in a given series may be grouped together and managed as a single storage unit. This typically gives rise to a directory structure like
-Line 96:
+Line 97:
-where in this example the records are assumed to contain three segments named fd_V, image, and small_fd_V stored in fits, .png and .png files respectively. In general the fi le name for each data segment is of the form:
+where in this example the records are assumed to contain three segments named fd_V, image, and small_fd_V stored in fits, .png and .png files respectively. In general the file name for each data segment is of the form:
-Line 98:
+Line 99:
-<disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>.
+<disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>
-Line 100:
+Line 101:
-Note that the default naming system does not contain the record identification.  This can be determined via the storage unit number if needed.  When data is exported outside the JSOC, filenames containing the seriesname and prime key values are usually created.
+Note that the default naming system does not contain the record identification.  This can be determined via the storage unit number if needed.  When data are exported outside the JSOC, file names containing the series name and prime keyword values are usually created.
-Line 103:
+Line 104:
-Line 107:
+Line 109:
- * A static link points to a specifi c data record in the target series identifi ed by (target series name, record number).
 * A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specifi ed value of the primary index in the target series. The DRMS resolves/binds the link to the record number of latest version at the time whenever the data record containing the link is opened.
+ * A static link points to a specific data record in the target series identified by (target series name, record number).
 * A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specified value of the primary index in the target series. The DRMS resolves/binds the link to the record number of latest version each time the data record containing the link is opened.
-Line 112:
+Line 114:
-The Data Record Management System (DRMS) consists of a set of database tables and software to manage those tables allowing the user to create and use datarecords.  The implementation is described in detail in XXXX.  The users view is usually from the web (e.g. [http://jsoc.stanford.edu/ajax/lookdata.html lookdata.html], from shell level commands (e.g. [http://jsoc.stanford.edu/doxygen_html/group__show__info.html show_info]), or from compiled programs using the DRMS API described at [http://jsoc.stanford.edu/doxygen_html/group__c__api.html JSOC API man pages], or from user built support for e.g. IDL.
+The Data Record Management System (DRMS) consists of a set of data base tables and software to manage those tables, allowing the user to create and use data records.  The implementation is described in detail in XXXX.  The user's view is usually from the web (e.g. [http://jsoc.stanford.edu/ajax/lookdata.html lookdata.html], from shell level commands (e.g. [http://jsoc.stanford.edu/doxygen_html/group__show__info.html show_info]), from compiled programs using the DRMS API described at [http://jsoc.stanford.edu/doxygen_html/group__c__api.html JSOC API man pages], or from user-built support systems, such as IDL.
-Line 120:
+Line 122:
-The atomic unit of data that is managed by the JSOC storage system is called a ''storage unit''. The
+The atomic unit of data managed by the JSOC storage system is called a '''storage unit.''' The
-Line 122:
+Line 125:
-unit contains the data segment part of one or more datarecords from a single dataseries, and corresponds to the
contents of a single directory [possibly with subdirectories for each datarecord]. A storage unit index (denoted sunum or internally as DSIndex for historical reasons) is stored with each data record and identifies the storage unit holding the data segments for the record.
A storage unit may be stored online on magnetic disk, offline e.g. on a magnetic tape in a cabinet,
or nearline on a tape in a robotic tape library. (The particular storage media is not important to
the concept). In response to a user's request to access a particular datarecord the JSOC catalog
will identify the storage unit containing that datarecord by looking up its sunum. The sunum
+unit contains the data segment part of one or more data records from a single data series. 
Each storage unut corresponds to the
contents of a single directory [possibly with subdirectories for each data record]. A storage unit index (denoted sunum, or internally as DSIndex for historical reasons) is stored with each data record and identifies the storage unit holding the data segments for the record.
A storage unit may be stored online on magnetic disk, nearline on a tape in a robotic tape library, 
or offline, e.g. on a magnetic tape in a cabinet. The particular storage media is not important to
the concept. In response to a user's request to access a particular data record the JSOC catalog
will identify the storage unit containing that data record by looking up its sunum. The sunum
-Line 133:
+Line 137:
-Line 134:
+Line 139:
-them: 
 * open existing unit as read-only,
+them:

 * open an existing unit as read-only
-Line 137:
+Line 143:
-Line 138:
+Line 145:
-The datasegments for a particular dataseries will in general be stored in many storage units. The
default size (number of records) of a unit is specifi ed when a series is created. It is chosen based on knowledge of the size of the data records and how they are likely to be computed, such that a storage unit
+The data segments for a particular data series will in general be stored in many storage units. The
default size (number of records) of a unit is specified when a series is created. It is chosen based on knowledge of the size of the data records and how they are likely to be computed, such that a storage unit