Overview of JSOC Data Series, DRMS, and SUMS
In the following we describe the logical organization of the HMI/AIA JSOC Data Record Management System (DRMS), also referred to as the JSOC catalog below, and define a number of terms used to describe data in the JSOC at various levels of abstraction. This section is based on the DRMS description by Rasmus Larsen linked at the bottom of the page.
The JSOC Data Series
Data are stored in the JSOC in "Data Series." A data series (sometimes dataseries) is a basic sequence of similar data objects, typically "images" or other binary data along with associated metadata. A data series consists of a sequence of Data Records. Usually, each data record is the data for one step in "time". Most but certainly not all data series are sequences in time. They can be in principle any list of data objects. A good way to think about a data series is as a table of rows and columns where each row is a record. The columns contain metadata and descriptors of access to the binary data objects.
A data record is the basic "atomic unit" of a data series, or more precisely, the smallest unit that will be individually registered and available for export from a data series in the JSOC catalog. Most (if not all) access to the JSOC archive by both pipeline processing modules and external data export services will be in terms of data records. In other words, what we informally call the "JSOC catalog" is first and foremost a data record catalog.
A data record consists of keyword-tagged metadata describing the record and zero or more named data segments, usually containing binary arrays of data values. All data records in a given data series have the same set of keyword and data segment names, each with record-specific values. The data series description and the data records are maintained in a relational database called DRMS (Data Record Management System). DRMS is implemented as a set of PostgreSQL tables. There is one database table for each series containing the values of keywords, segment metadata, and links for all data records in the series. The values for a single data record are contained in a single row in that table.
In summary:
A Data Series consists of a sequence of:
- Records, which consist of a set of:
  - Keywords,
  - Segments, which consist of:
    - structure information and
    - a storage unit identifier,
  - Links that provide pointers to associated records in other series.
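The hierarchy summarized above can be sketched as plain Python data classes. This is only an illustration of the logical organization, not the DRMS implementation. The class names, the field names, the series name "example.series", and the sample keyword and dimension values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

KeywordValue = Union[int, float, str]


@dataclass
class Segment:
    name: str
    protocol: str                 # storage protocol, e.g. "fits"
    dims: Tuple[int, ...]         # structure information
    sunum: Optional[int] = None   # storage unit identifier


@dataclass
class Link:
    name: str
    target_series: str            # series holding the associated record


@dataclass
class Record:
    recnum: int
    keywords: Dict[str, KeywordValue] = field(default_factory=dict)
    segments: Dict[str, Segment] = field(default_factory=dict)
    links: Dict[str, Link] = field(default_factory=dict)


@dataclass
class DataSeries:
    name: str
    records: List[Record] = field(default_factory=list)

    def add_record(self, record: Record) -> None:
        # Every record in a series carries the same keyword and segment names.
        if self.records:
            first = self.records[0]
            same_names = (set(first.keywords) == set(record.keywords)
                          and set(first.segments) == set(record.segments))
            if not same_names:
                raise ValueError("record does not match the series layout")
        self.records.append(record)


# Hypothetical series with one record holding one keyword and one segment.
series = DataSeries("example.series")
series.add_record(Record(
    recnum=1,
    keywords={"T_OBS": 0},
    segments={"fd_V": Segment("fd_V", "fits", (4096, 4096), sunum=12342)},
))
```

The check in add_record mirrors the rule that all records in a series share one set of keyword and segment names, while the values differ from record to record.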
Data Records
Data records contain several types of metadata including keyword values, segment descriptors, record links, and some processing information. Each record in a particular series is given a record number (called recnum) which serves as its ultimate identification in the database. Usually one or more keywords are designated prime keys, which are the primary way records are identified for the user. The prime keys are used together to uniquely identify a dataseries record and are used to define the main index for the series. Any records with the same set of prime key values are treated as different versions of the same record. Thus the most recent instance of any record in a given series may be found by specifying the values of the prime keys for that series. The pre-defined keyword "recnum" is used for the main index in the case that no prime keys are defined. If a record with prime keys has been modified, older versions of the record will still be in the table but will have smaller values of recnum.
To access a set of records from a series, a description must be provided that selects the desired records. We call that description a "Dataset Name". Thus, in JSOC/DRMS a dataset name is actually a database query. The DRMS dataset name rules have been defined with the goal of providing user-friendly names that are easy to remember and use.
Keywords
A data record contains zero or more (typically many) named keywords that each map to a value of a simple type such as integer, float, string, or time associated with the record. Keywords are often used to store meta-data describing properties, history and/or context of the main image/observable data stored in the record's data segments. This is a concept familiar from standard file-based data formats, such as FITS, where the FITS header keywords would correspond to the JSOC keywords and the primary binary arrays or tables would correspond to the JSOC data segments.
In the JSOC catalog keyword values are stored in database tables separate from the files holding the data segments. This makes it possible to:
- modify keyword values without having to locate, access and possibly rewrite files on disk or tape,
- rapidly find data records whose keywords satisfy a given condition by executing a database query,
- rapidly extract time series of the values of keywords from all or a subset of records in a series. This can be useful for, say, trend analysis or time series analysis of global properties, e.g. the mean value or other image statistics of data products.
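The three benefits listed above can be illustrated with a small stand-in database. The sketch below uses Python's built-in sqlite3 module in place of PostgreSQL, and the table name example_series and the keyword columns t_obs, datamean, and quality are assumptions chosen for the example, not the actual DRMS schema.

```python
import sqlite3

# In-memory stand-in for one series table: one row per data record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_series ("
    " recnum INTEGER PRIMARY KEY,"
    " t_obs INTEGER,"      # hypothetical time keyword, in seconds
    " datamean REAL,"      # hypothetical image statistic
    " quality INTEGER)"    # hypothetical quality flag
)
conn.executemany(
    "INSERT INTO example_series VALUES (?, ?, ?, ?)",
    [(1, 0, 10.5, 0), (2, 45, 11.0, 0), (3, 90, 250.0, 0), (4, 135, 10.8, 0)],
)

# 1. Modify a keyword value without touching any data file.
conn.execute("UPDATE example_series SET quality = 1 WHERE recnum = 3")

# 2. Find the records whose keywords satisfy a condition.
good_recnums = [row[0] for row in conn.execute(
    "SELECT recnum FROM example_series WHERE quality = 0 ORDER BY recnum")]

# 3. Extract a time series of one keyword for trend analysis.
trend = conn.execute(
    "SELECT t_obs, datamean FROM example_series ORDER BY t_obs").fetchall()

print(good_recnums)
print(trend)
```

None of these operations opens a segment file, which is the point of keeping keyword values in the database rather than in file headers.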
There is one database table for each series containing the values of keywords and links for all data records in the series. The values for a single data record will be contained in a single row in that table.
Prime Keywords and the Primary Index
For many data series it is useful to identify a primary index associated with the principal axis of the data records (e.g. time or (latitude, longitude)). The intention is that the primary index maps to a unique value or slot on the principal axis. There might exist multiple versions of the "same observation" (e.g. newer versions could be created to include earlier missing data or to fix a bad calibration). Since there might be multiple versions of the "same" record, the primary index does not uniquely identify a data record.
The primary index consists of one or more prime keyword values that are logically concatenated to form the full index. If two records have keyword values that differ on any of the keywords comprising the primary index, they are considered different data records (w.r.t. the primary index); otherwise they are considered to be only different versions of the same data record (w.r.t. the primary index). The default behavior of the JSOC is to return the most recent version of a data record for a given primary index. Since record numbers (recnums) are assigned in order of creation, the most recent version is the record with the highest recnum. The primary index has two crucial uses in the JSOC:
- It allows users to reference data records by their primary index, which will generally have some physical meaning (e.g. for a time series it could be the time or even the number of seconds or hours since some epoch). This will also allow programs and scripts to step through data sets in logical (e.g. time) order, rather than in record creation order as given by the record number which is arbitrary.
- It allows the JSOC database system to maintain column indexes on the keywords corresponding to the primary index of a series. This vastly speeds up queries that select sets of records based on the primary index (possibly in combination with other criteria), which are probably the majority of all queries in the system.
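As a minimal sketch of both uses, the example below builds a column index on a prime keyword and returns the latest version of each record in primary index order. It uses sqlite3 as a stand-in for PostgreSQL. The table name, the prime keyword t_rec, the keyword calver, and the query itself are assumptions for illustration and are not taken from the DRMS source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_series ("
    " recnum INTEGER PRIMARY KEY,"
    " t_rec INTEGER,"     # hypothetical prime keyword (time slot)
    " calver INTEGER)"    # hypothetical non-prime keyword
)
# Column index on the prime keyword, so primary-index lookups avoid full scans.
conn.execute("CREATE INDEX example_series_prime ON example_series (t_rec)")

# Record 4 is a newer version of the record in slot t_rec = 45.
conn.executemany(
    "INSERT INTO example_series VALUES (?, ?, ?)",
    [(1, 0, 1), (2, 45, 1), (3, 90, 1), (4, 45, 2)],
)


def latest_versions(connection):
    """Return the highest-recnum row per prime key value, in t_rec order."""
    return connection.execute(
        "SELECT s.recnum, s.t_rec, s.calver"
        " FROM example_series AS s"
        " JOIN (SELECT MAX(recnum) AS recnum"
        "       FROM example_series GROUP BY t_rec) AS latest"
        " ON s.recnum = latest.recnum"
        " ORDER BY s.t_rec"
    ).fetchall()


rows = latest_versions(conn)
print(rows)
```

Record 2 is still in the table but is hidden by record 4, and the result is ordered by the primary index rather than by creation order.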
Segments
A data record contains zero or more named data segments. The data segments contain data of large volume associated with the data record. While the DRMS record contains the description of each data segment, the information contained in a data segment is not stored in the data base but is stored in Storage Units "owned" by SUMS (Storage Unit Management System). Storage units are simply directories containing files. SUMS itself maintains tables in PostgreSQL to track storage unit locations on disk and/or tape. A storage unit may contain data for one or more data segments for one or more data records.
The segment metadata includes information such as the storage protocol, compression information, image dimensions, etc. The contents of the data segments for a given data record are stored in files in a directory on disk (possibly also archived on tape). To make transfer and storage of JSOC data more manageable and efficient, data segments for multiple data records in a given series may be grouped together and managed as a single storage unit. This typically gives rise to a directory structure like
/SUM1/D012342/
    S00000/
        fd_V.fits
        image.png
        small_fd_V.png
    S00001/
        fd_V.fits
        image.png
        small_fd_V.png
    S00002/
        fd_V.fits
        image.png
        small_fd_V.png
where in this example the records are assumed to contain three segments named fd_V, image, and small_fd_V, stored in .fits, .png, and .png files respectively. In general the file name for each data segment is of the form:
<disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>
Note that the default naming system does not contain the record identification. This can be determined via the storage unit number if needed. When data are exported outside the JSOC, file names containing the series name and prime keyword values are usually created.
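The naming pattern above can be sketched as a small helper. The function name segment_path is hypothetical. The five-digit zero padding of the slot number is inferred from the example directory listing, and the storage unit identifier is passed as an already formatted string because its padding rule is not stated here.

```python
import posixpath


def segment_path(disk, storage_unit, slot, segment_name, protocol):
    """Build <disk>/D<storageunit>/S<slotnumber>/<segmentname>.<protocol>.

    storage_unit is passed as text exactly as it should appear after "D".
    slot is an integer, zero-padded to five digits (an assumption based on
    the S00000, S00001, ... directories in the example listing).
    """
    return posixpath.join(
        disk,
        f"D{storage_unit}",
        f"S{slot:05d}",
        f"{segment_name}.{protocol}",
    )


path = segment_path("/SUM1", "012342", 0, "fd_V", "fits")
print(path)
```

As the note above says, nothing in the resulting path identifies the record itself. That information has to come from the catalog.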
Links
A data record contains zero or more named links. Links are pointers between data records and make it possible for data records to inherit keyword values from each other, and to capture other dependencies between them such as processing history. For example, a data record can contain links to the data records that were used in creating it, such as a dopplergram data record pointing to the filtergrams from which it was created. Links come in two varieties, static and dynamic:
- A static link points to a specific data record in the target series identified by (target series name, record number).
- A dynamic link is represented by (target series name, primary index value) and points to the latest version among the data records with the specified value of the primary index in the target series. The DRMS resolves/binds the link to the record number of the latest version each time the data record containing the link is opened.
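The difference between the two link varieties shows up when a newer version of the target record is created. The sketch below models this with a toy in-memory catalog. The class names, the resolve function, and the series name "example.filtergrams" are assumptions for illustration and are not part of the DRMS API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class StaticLink:
    target_series: str
    recnum: int            # fixed record number in the target series


@dataclass(frozen=True)
class DynamicLink:
    target_series: str
    prime_value: Tuple     # primary index value in the target series


# Toy catalog: series name -> list of (recnum, primary index value) rows.
Catalog = Dict[str, List[Tuple[int, Tuple]]]


def resolve(link: Union[StaticLink, DynamicLink], catalog: Catalog) -> int:
    """Return the record number the link points to at the time of the call."""
    if isinstance(link, StaticLink):
        return link.recnum
    matches = [recnum for recnum, prime in catalog[link.target_series]
               if prime == link.prime_value]
    if not matches:
        raise LookupError("no record with that primary index value")
    return max(matches)    # latest version has the highest recnum


catalog: Catalog = {"example.filtergrams": [(10, (0,)), (11, (45,))]}
static = StaticLink("example.filtergrams", 10)
dynamic = DynamicLink("example.filtergrams", (0,))

before = (resolve(static, catalog), resolve(dynamic, catalog))
# A new version of the record with primary index value (0,) is created.
catalog["example.filtergrams"].append((12, (0,)))
after = (resolve(static, catalog), resolve(dynamic, catalog))
print(before, after)
```

The static link keeps pointing at record 10, while the dynamic link follows the primary index value to the newer record 12.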
DRMS
The Data Record Management System (DRMS) consists of a set of database tables and software to manage those tables, allowing the user to create and use data records. The implementation is described in detail in XXXX. The user's view is usually from the web (e.g. lookdata.html), from shell-level commands (e.g. show_info), from compiled programs using the DRMS API described at JSOC API man pages, or from user-built support systems, such as IDL.
SUMS
SUMS details
See: SUMS - the Storage Unit Management System for more information about SUMS implementation and use in the JSOC system.
Storage Units
The atomic unit of data managed by the JSOC storage system is called a storage unit. The JSOC storage system is therefore denoted Storage Unit Management System (SUMS). Each storage unit contains the data segment part of one or more data records from a single data series. Each storage unit corresponds to the contents of a single directory (possibly with subdirectories for each data record). A storage unit index (denoted sunum, or internally as DSIndex for historical reasons) is stored with each data record and identifies the storage unit holding the data segments for the record. A storage unit may be stored online on magnetic disk, nearline on a tape in a robotic tape library, or offline, e.g. on a magnetic tape in a cabinet. The particular storage medium is not important to the concept.
In response to a user's request to access a particular data record, the JSOC catalog will identify the storage unit containing that data record by looking up its sunum. The sunum is an index into the SUMS internal catalog, which tracks the location of each storage unit. If the requested storage unit is not online, SUMS will allocate storage space, name a directory, and copy the storage unit into that directory. SUMS will report the working directory pathname to the JSOC catalog, where it is accessible to the user. All storage units are owned and managed by SUMS.
Storage units are "write-once" objects and clients of SUMS can only perform two operations on them:
- open an existing unit as read-only
- create a new unit.
Deletion or modification of storage units is restricted to SUMS administrative programs and requires special privileges. The data segments for a particular data series will in general be stored in many storage units. The default size (number of records) of a unit is specified when a series is created. It is chosen based on knowledge of the size of the data records and how they are likely to be computed, such that a storage unit corresponds to the output of a "natural" processing batch and/or is a convenient size to handle for data export.
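The client-facing behavior described in this section can be sketched as a toy in-memory model. This is not the SUMS interface: the class ToySums, its method names, the "/SUM1" disk name, and the D<sunum> directory naming are assumptions made for illustration, and no real files or tapes are involved.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StorageUnit:
    sunum: int
    series: str
    online_dir: Optional[str]   # None when the unit is only on tape


class ToySums:
    """Toy model: clients may only create units or open them read-only."""

    def __init__(self, disk: str = "/SUM1") -> None:
        self._disk = disk
        self._units: Dict[int, StorageUnit] = {}
        self._next_sunum = itertools.count(1)

    def create(self, series: str) -> int:
        # Allocate a new unit and return its storage unit index (sunum).
        sunum = next(self._next_sunum)
        self._units[sunum] = StorageUnit(sunum, series, self._dir_for(sunum))
        return sunum

    def open_readonly(self, sunum: int) -> str:
        # Look up the unit; if it is not online, stage it back to disk first.
        unit = self._units[sunum]
        if unit.online_dir is None:
            unit.online_dir = self._dir_for(sunum)
        return unit.online_dir

    def _admin_archive(self, sunum: int) -> None:
        # Administrative step only: the unit now lives on tape, not on disk.
        self._units[sunum].online_dir = None

    def _dir_for(self, sunum: int) -> str:
        return f"{self._disk}/D{sunum}"


sums = ToySums()
sunum = sums.create("example.series")
first_path = sums.open_readonly(sunum)
sums._admin_archive(sunum)                 # unit goes offline
restaged_path = sums.open_readonly(sunum)  # staged back on request
print(sunum, first_path, restaged_path)
```

There is deliberately no client method to modify or delete a unit, which reflects the write-once rule stated above.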
Older Documents
There are several older documents that, while not accurate in describing the JSOC system as it is now implemented, contain useful information about its design, intent, and usage ideas. These are:
- Design discussions:
http://hmi.stanford.edu/development/JSOC_Documents/Drafts/old/dataset_naming_proposal.pdf - Phil
http://hmi.stanford.edu/development/JSOC_Documents/Drafts/JSOC_common_library.pdf - Phil
http://hmi.stanford.edu/development/JSOC_Documents/Drafts/Strategy_multiple_environments.pdf - Phil
http://hmi.stanford.edu/development/JSOC_Documents/DRMS_V10.pdf - Rasmus
- NASA requested overview, CDRL 326a,b,c