Differences between revisions 3 and 4

Overview of JSOC Data Series, DRMS, and SUMS

In the following we describe the logical organization of the HMI/AIA JSOC Data Record Management System (DRMS), also referred to as the JSOC catalog below, and define a number of terms used to describe data in the JSOC at various levels of abstraction. This section is based on the DRMS description by Rasmus Larsen linked at the bottom of the page.

The JSOC data series

Data is stored in the JSOC in "Data Series." A Data Series (or dataseries) is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A dataseries consists of a sequence of Data Records. Usually, each datarecord is the data for one step in "time". Most but certainly not all dataseries are sequences in time. They can be in principle any list of data objects. A good way to think about a dataseries is as a table of rows and columns where each row is a record. The columns contain metadata and descriptors of access to the binary data objects.

A datarecord is the basic "atomic unit" of a dataseries, or more precisely: the smallest unit that will be individually registered and available for export from a data series in the JSOC catalog. Most (if not all) access to the JSOC archive by both pipeline processing modules and external data export services will be in terms of data records. In other words, what we informally call the "JSOC catalog" is first and foremost a data record catalog.

A datarecord consists of keyword-tagged metadata describing the record and 0 or more named datasegments, usually containing binary arrays of data values. All datarecords in a given dataseries have the same set of keyword and datasegment names, each with record-specific values. The dataseries description and the datarecords are maintained in a relational database called DRMS (Data Record Management System). DRMS is implemented as a set of [http://www.postgresql.org/ PostgreSQL] tables. There is one database table for each series containing the values of keywords, segment metadata, and links for all data records in the series. The values for a single data record are contained in a single row in that table.

In summary:

A Dataseries consists of a sequence of:
- Records which consist of a set of:
  - Keywords and
  - Segments which consist of:
    - structure information and
    - storage unit identifier
  - Links which provide pointers to associated records in other series.
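The hierarchy above can be sketched as a few plain Python data classes. This is only an illustration of the logical model, not the DRMS implementation: the class names, field names, series names and sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

KeywordValue = Union[int, float, str]


@dataclass
class Segment:
    """Descriptor for one named data segment; the data itself lives in SUMS."""
    name: str
    structure: str          # structure information, kept here as free text
    storage_unit_id: int    # identifier of the storage unit holding the files


@dataclass
class Link:
    """Pointer to an associated record in another series."""
    name: str
    target_series: str
    target_recnum: Optional[int] = None


@dataclass
class Record:
    """One data record: keywords, segment descriptors and links."""
    recnum: int
    keywords: Dict[str, KeywordValue] = field(default_factory=dict)
    segments: Dict[str, Segment] = field(default_factory=dict)
    links: Dict[str, Link] = field(default_factory=dict)


@dataclass
class Series:
    """A data series: a named sequence of records."""
    name: str
    records: List[Record] = field(default_factory=list)


# Hypothetical series with a single record.
series = Series(name="example.series")
rec = Record(recnum=1, keywords={"T_OBS": "2009-02-17T07:24:18", "QUALITY": 0})
rec.segments["image"] = Segment("image", "float[1024][1024]", storage_unit_id=42)
rec.links["source"] = Link("source", target_series="example.input", target_recnum=7)
series.records.append(rec)
```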

DataRecords

Datarecords contain several types of metadata including keyword values, segment descriptors, record links, and some processing information. Each record in a particular series is given a record number (called recnum) which serves as its ultimate identification in the database. Usually one or more keywords are designated prime keys, which are the primary way records are identified for the user. The prime keys are used together to identify a dataseries record and to define the main index for the series. Any records with the same set of prime key values are treated as different versions of the same record. Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series. The pre-defined keyword "recnum" is used for the main index in the case that no prime keys are defined. If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.

To access a set of records from a series, a description must be provided that selects the desired records. We call that description a "Dataset Name". Thus, in JSOC/DRMS a dataset name is actually a database query. The DRMS dataset name rules have been defined with the goal of providing user-friendly names that are easy to remember and use.

Keywords

A data record contains zero or more (typically many) named keywords that each map to a value of a simple type such as integer, float, string, or time associated with the record. Keywords are often used to store meta-data describing properties, history and/or context of the main image/observable data stored in the record's data segments. This is a concept familiar from standard file-based data formats, such as FITS, where the FITS header keywords would correspond to the JSOC keywords and the primary binary arrays or tables would correspond to the JSOC data segments.

In the JSOC catalog, keyword values are stored in database tables separate from the files holding the data segments. This makes it possible to:

- modify keyword values without having to locate, access and possibly rewrite files on disk or tape,
- rapidly find data records whose keywords satisfy a given condition by executing a database query,
- rapidly extract time series of keyword values from all or a subset of the records in a series. This can be useful for trend analysis or time series analysis of global properties of data products, such as the mean value or other image statistics.

There is one database table for each series containing the values of keywords and links for all data records in the series. The values for a single data record are contained in a single row in that table.
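As a rough illustration of the three operations listed above, the following sketch uses an in-memory SQLite table in place of the PostgreSQL tables used by DRMS. The table name, column names and values are invented for the example and do not reflect the actual DRMS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per series, one row per data record (hypothetical columns).
conn.execute(
    "CREATE TABLE example_series ("
    " recnum INTEGER PRIMARY KEY,"
    " t_obs TEXT,"
    " datamean REAL,"
    " quality INTEGER)"
)
conn.executemany(
    "INSERT INTO example_series VALUES (?, ?, ?, ?)",
    [
        (1, "2009-02-17T00:00:00", 10.5, 0),
        (2, "2009-02-17T00:01:00", 11.0, 0),
        (3, "2009-02-17T00:02:00", 99.0, 0),
        (4, "2009-02-17T00:03:00", 11.5, 0),
    ],
)

# 1. Modify a keyword value: only the database row changes, no data file is touched.
conn.execute("UPDATE example_series SET quality = 1 WHERE recnum = 3")

# 2. Find the records whose keywords satisfy a condition.
good_recnums = [
    row[0]
    for row in conn.execute(
        "SELECT recnum FROM example_series WHERE quality = 0 ORDER BY recnum"
    )
]

# 3. Extract a time series of one keyword for a subset of the records.
trend = conn.execute(
    "SELECT t_obs, datamean FROM example_series WHERE quality = 0 ORDER BY t_obs"
).fetchall()
```

None of these statements opens a data segment file; everything is answered from the keyword table alone.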

Prime Keywords

For many series a primary index associated with the principal axis (e.g. time or (latitude, longitude)) of each datarecord is desired. The intention is that the primary index maps to a unique value or slot on the principal axis. There might exist multiple versions of the "same observation" (e.g. newer versions could be created to include earlier missing data or to fix a bad calibration). Since there might be multiple versions of the "same" record, the primary index does not uniquely identify a data record.

The primary index consists of one or more keyword values that are logically concatenated to form the full index. If two records have keyword values that differ on any of the keywords comprising the primary index, they are considered different data records (w.r.t. the primary index); otherwise they are considered different versions of the same data record. The default behavior of the JSOC is to return the most recent version of a datarecord for a given primary index. Since record numbers (recnums) are assigned in order of creation, the most recent version is the record with the highest recnum. The primary index has two crucial uses in the JSOC:

- It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This also allows programs and scripts to step through datasets in logical (e.g. time) order, rather than in record creation order as given by the record number, which is arbitrary.
- It allows the JSOC database system to maintain column indexes on the keywords corresponding to the primary index of a series. This vastly speeds up queries that select sets of records based on the primary index (possibly in combination with other criteria), and such queries are probably the majority of all queries in the system.
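Both uses can be sketched with a small SQLite table standing in for a series table. This is a minimal sketch under stated assumptions: the table, the index name and the choice of `t_obs` as the single prime keyword are hypothetical, and the query only mimics the "most recent version per primary index value" rule described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_series ("
    " recnum INTEGER PRIMARY KEY,"   # assigned in order of creation
    " t_obs TEXT,"                   # hypothetical prime keyword
    " calib_version INTEGER)"        # ordinary (non-prime) keyword
)

# Column index on the prime keyword, so selections on it avoid a full table scan.
conn.execute("CREATE INDEX example_series_prime ON example_series (t_obs)")

conn.executemany(
    "INSERT INTO example_series VALUES (?, ?, ?)",
    [
        (1, "2009-02-17T00:01:00", 1),
        (2, "2009-02-17T00:00:00", 1),  # created later, but earlier in time
        (3, "2009-02-17T00:01:00", 2),  # newer version of the 00:01 record
    ],
)

# Most recent version (highest recnum) for each primary index value,
# returned in logical (time) order rather than creation order.
latest = conn.execute(
    "SELECT t_obs, MAX(recnum) FROM example_series GROUP BY t_obs ORDER BY t_obs"
).fetchall()
```

Here record 1 is still present in the table, but it is hidden behind record 3, which shares its prime key value and has a higher recnum.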

Segments

While the DRMS record contains the description of each datasegment, the information contained in a datasegment is not stored in the database but in Storage Units "owned" by SUMS (Storage Unit Management System). Storageunits are simply directories containing files. SUMS itself maintains tables in PostgreSQL to track storageunit locations on disk and/or tape. A storage unit may contain data for 1 or more datasegments for 1 or more datarecords.

Links

A data record contains zero or more named links. Links are pointers between data records and make it possible for data records to inherit keyword values from each other, and to capture other dependencies between them such as processing history. For example, a data record can contain links to the data records that were used in creating it, such as a dopplergram data record pointing to the filtergrams from which it was created. Links come in two varieties, static and dynamic:

- A static link points to a specific data record in the target series identified by (target series name, record number).
- A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specified value of the primary index in the target series. The DRMS resolves/binds the link to the record number of the latest version each time the data record containing the link is opened.
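The difference between the two varieties can be sketched in a few lines of Python. The `Record`, `Link` and `resolve` names, the in-memory `catalog` dictionary and the series name are hypothetical and only model the behavior described above; they are not part of the DRMS API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    recnum: int
    prime_key: str   # primary index value, e.g. an observation time


@dataclass
class Link:
    target_series: str
    recnum: Optional[int] = None      # set for a static link
    prime_key: Optional[str] = None   # set for a dynamic link


def resolve(link: Link, catalog: Dict[str, List[Record]]) -> Record:
    """Return the record a link points to at the time of the call."""
    records = catalog[link.target_series]
    if link.recnum is not None:
        # Static link: always the same record number.
        return next(r for r in records if r.recnum == link.recnum)
    # Dynamic link: latest version (highest recnum) with the given prime key.
    candidates = [r for r in records if r.prime_key == link.prime_key]
    return max(candidates, key=lambda r: r.recnum)


catalog = {"example.filtergrams": [Record(1, "T0"), Record(2, "T1")]}
static = Link("example.filtergrams", recnum=1)
dynamic = Link("example.filtergrams", prime_key="T0")

before = resolve(dynamic, catalog).recnum

# A reprocessed version of the "T0" record is added to the target series.
catalog["example.filtergrams"].append(Record(3, "T0"))

after = resolve(dynamic, catalog).recnum
static_after = resolve(static, catalog).recnum
```

After the new version is added, the dynamic link resolves to the new record while the static link still points at record 1.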

DRMS

SUMS

[wiki:SumsDataModel SUMS - the Storage Unit Management System]

Implementation

The JSOC Application Programming Interface (API) provides a set of functions, with bindings to host languages including C and FORTRAN, and possibly IDL and MATLAB in the future, that allow programs to connect to the JSOC environment and retrieve and manipulate data records. The API contains groups of functions that

- create new or modify existing data series,
- create, read or update data records,
- query the JSOC catalog database to retrieve data records whose keywords satisfy a given condition,
- get and set the contents of keywords, links and data segments.

The API is described in the man pages and elsewhere in this wiki.

Older Documents

There are several older documents that, while not accurate in describing the JSOC system as it is now implemented, do contain useful information about its design, intent and intended usage. These are:

Design discussions:
NASA requested overview, CDRL 326a,b,c

- Revision 3 as of 2007-11-20 08:58:10
  Size: 3795
  Editor: tucano
+ Revision 4 as of 2009-02-17 07:24:18
  Size: 10550
  Editor: DNab42dffa
 Line 2:
+In the following we describe the logical organization of the HMI/AIA JSOC Data Record Management System
(DRMS), also referred to as the JSOC catalog below, and define a number of terms used to describe
data in the JSOC at various levels of abstraction. ''This section is based on the DRMS description by Rasmus Larsen linked at the bottom of the page.''
-Line 5:
+Line 9:
-Data is stored in the HMI/AIA JSOC in "Data Series."  A Data Series (or dataseries) is a basic sequence of like data objects, typically "images" or other binary data along with associated meta-data. A dataseries consists of a sequence of Data Records. Usually, each datarecord is the data for one step in "time". Most but certainly not all dataseries are sequences in time. They can be in principle any list of data objects. A good way to think about a dataseries is as a table of rows and columns.
+Data is stored in the JSOC in "Data Series."  A Data Series (or dataseries) is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A dataseries consists of a sequence of Data Records. Usually, each datarecord is the data for one step in "time". Most but certainly not all dataseries are sequences in time. They can be in principle any list of data objects. A good way to think about a dataseries is as a table of rows and columns where each row is a record.  The columns contain metadata and descriptors of access to the binary data objects.

A datarecord is the basic "atomic unit" of a dataseries, or more precisely: The smallest unit that
will be individually registered and available for export from a data series in the JSOC catalog. Most
(if not all) access to the JSOC archive by both pipeline processing modules and external data export
services will be in terms of data records. In other words, what we in informally call the "JSOC
catalog" is first and foremost a data record catalog.
-Line 8:
+Line 18:
+There is one database table for each series containing the values of keywords, segment metadata, and links for all
data record in the series. The values for a single data record are contained in a single row in
that table.
-Line 9:
+Line 22:
-While the DRMS record contains the description of each datasegment, the information contained a datasegment is not stored in the database but is stored in Storage Units "owned" by SUMS (Storage Unit Management System).  Storageunits are simply directories.  SUMS itself maintains tables in PostgreSQL to track storageunits locations on disk and/or tape.  A storage unit may contain 1 or more datasegments for 1 or more datarecords.
-Line 13:
+Line 25:
- * A '''Dataseries''' consists of a set of:
+ * A '''Dataseries''' consists of a sequence of:
-Line 19:
+Line 31:
+   * '''Links''' which provide pointers to associated records in other series.
-Line 20:
+Line 33:
-Usually one or more keywords are designated '''prime keys'''.  The prime keys must together uniquely identify a record and are sued to define the main index for the series.  Any records with same sets of prime key values are assumed to be different versions of the same record.  Thus the current version of any record in a given series may be found by specifying the values of the prime keys for that series.  All series have one pre-defined keyword called "recnum" which is has a unique value for each record and is used for the main index in the case that no prime keys are defined.
+=== DataRecords ===

Datarecords contain several types of metadata including keyword values, segment descriptors, record links, and some
processing information.  Each record in a particular series is given a record number (called ''recnum'') which serves
as its ultimate identification in the database.
Usually one or more keywords are designated ''prime keys'' which are the primary way records are identified for the user.  The prime keys are used together to uniquely identify a dataseries record and are used to define the main index for the series.  Any records with same sets of prime key values are treated as different versions of the same record.  Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series.  The pre-defined keyword "recnum" is used for the main index in the case that no prime keys are defined.  If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.
-Line 23:
+Line 41:
+=== Keywords ===

A data record contains zero or more (typically many) named keywords that each map to a value
of a simple type such as integer, float, string, or time associated with the record. Keywords are often
used to store meta-data describing properties, history and/or context of the main image/observable
data stored in the record's data segments. This is a concept familiar from standard file-based data
formats, such as FITS, where the FITS header keywords would correspond to the JSOC keywords
and the primary binary arrays or tables would correspond to the JSOC data segments.

In the JSOC catalog keywords values are stored in database tables separate from the files
holding the data segments. This makes it possible to:
 * modify keyword values without having to locate, access and possibly rewrite files on disk or tape,
 * rapidly finding data records whose keywords satisfy a given condition by executing a database query,
 * rapidly extract time series of the values of keywords from all or a subset of records in a series. This can be useful for, e.g., trend analysis or time series analysis of global properties, e.g. mean value or other image statistics, of data products.

There is one database table for each series containing the values of keywords and links for all
data record in the series. The values for a single data record will be contained in a single row in
that table.

=== Prime Keywords ===

For many series a primary index associated with the principal axis (e.g. time or (lattitude, longitude))
associated with each datarecord is desired. The intention is that the primary index
maps to a unique value or slot on the principal axis.  There might exist multiple versions of the "same
observation" (e.g. newer versions could be created to include earlier missing data or to fix a bad calibration).
Since there might be multiple versions of the "same" record, the primary index does not uniquely identify a data record.

The primary index consists of one or more keyword values that are logically concatenated to form the full
index. If two records have keywords values that differ on any of
the keywords comprising the primary index, they are considered different data record (w.r.t. the
primary index), otherwise they are considered only different versions of the same data record (w.r.t.
the primary index). The default behavior of the JSOC is to return the most recent version of a datarecord
for a given primary index. Since record numbers (recnums) are assigned in order of creation the most recent
version is record with the highest recnum. The primary index has two crucial uses in the
JSOC:
 * It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This will also allow programs and scripts to step through datasets in logical (e.g. time) order, rather than in record creation order as given by the record number which is arbitrary.
 * It allows the JSOC database system to maintain column indexes on the keywords corresponding to the primary index of a series. This vastly speeds up queries that select sets of records based on the primary index (possibly in combination with other criteria), and this is probably majority of all queries in the system.

=== Segments ===

While the DRMS record contains the description of each datasegment, the information contained a datasegment is not stored in the database but is stored in Storage Units "owned" by SUMS (Storage Unit Management System).  Storageunits are simply directories containing files.  SUMS itself maintains tables in PostgreSQL to track storageunit locations on disk and/or tape.  A storage unit may contain data for 1 or more datasegments for 1 or more datarecords.

=== Links ===

A data record contains zero or more named links. Links are pointers between data records and
make it possible for data records to inherit keyword values from each other, and to capture other
dependencies between them such as processing history. For example, a data record can contain links
to the data records that were used in creating it, such as a dopplergram data record pointing to the filtergrams from which is was created. Links come in two varieties, static and dynamic:
 * A static link points to a specific data record in the target series identified by (target series name, record number).
 * A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specified value of the primary index in the target series. The DRMS resolves/binds the link to the record number of latest version at the time whenever the data record containing the link is opened.
-Line 30:
+Line 99:
+The JSOC Application Programming Interface (API) provides a set of functions, with bindings
to host languages including C, and FORTRAN, and maybe someday IDL and MATLAB, that allow programs to
connect to the JSOC environment and retrieve and manipulate data records. The API contains
groups of functions that

 * create new or modify existing data series.
 * create, read or update data records,
 * query the JSOC catalog database to retrieve data records whose keywords satisfy a given
condition,
 * get and set the contents of keywords, links and data segments,

The API is described in the man pages and elsewhere in this wiki.
-Line 31:
+Line 113:
-Line 34:
+Line 117:
-    * [http://hmi.stanford.edu/development/JSOC_Documents/Drafts/dataset_naming_proposal.pdf] - Phil
    * [http://hmi.stanford.edu/development/JSOC_Documents/Drafts/JSOC_common_library.pdf] - Phil
    * [http://hmi.stanford.edu/development/JSOC_Documents/Drafts/Strategy_multiple_environments.pdf] - Phil
    * [http://hmi.stanford.edu/development/JSOC_Documents/DRMS_V10.pdf] - Rasmus
+  * [http://hmi.stanford.edu/development/JSOC_Documents/Drafts/dataset_naming_proposal.pdf] - Phil
  * [http://hmi.stanford.edu/development/JSOC_Documents/Drafts/JSOC_common_library.pdf] - Phil
  * [http://hmi.stanford.edu/development/JSOC_Documents/Drafts/Strategy_multiple_environments.pdf] - Phil
  * [http://hmi.stanford.edu/development/JSOC_Documents/DRMS_V10.pdf] - Rasmus
-Line 39:
+Line 122:
-    * [http://hmi.stanford.edu/doc/SOC_GDS_Plan/JSOC_GDS_Plan_Overview.pdf] - Phil
    * [http://hmi.stanford.edu/doc/SOC_GDS_Plan/JSOC_Data_Processing_Plan.pdf] - Jim
    * [http://hmi.stanford.edu/doc/SOC_GDS_Plan/HMI_pipeline_JSOC_dataproducts.pdf] - Rasmus
+  * [http://hmi.stanford.edu/doc/SOC_GDS_Plan/JSOC_GDS_Plan_Overview.pdf] - Phil
  * [http://hmi.stanford.edu/doc/SOC_GDS_Plan/JSOC_Data_Processing_Plan.pdf] - Jim
  * [http://hmi.stanford.edu/doc/SOC_GDS_Plan/HMI_pipeline_JSOC_dataproducts.pdf] - Rasmus