The JSOC storage unit



The atomic unit of data managed by the JSOC storage system is called a storage unit; the storage system is accordingly named the Storage Unit Management System (SUMS). Each storage unit contains the data segment part of data records from a single data series, and corresponds to the contents of a single directory, [possibly with subdirectories for each data record]. A storage unit index (called DSIndex for historical reasons) is stored with each data record and identifies the storage unit holding the data segments for that record.

**Figure 1:** Logical structure of a JSOC data series.
[Figure: jsoc_series.ps]

A storage unit may be stored online on magnetic disk, nearline on a tape in a robotic tape library, or offline, e.g. on a magnetic tape in a cabinet. (The particular storage medium is not important to the concept.) In response to a user's request to access a particular data record, the JSOC catalog identifies the storage unit containing that record by looking up its DSIndex. The DSIndex is an index into the SUMS internal catalog, which tracks the location of each storage unit. If the requested storage unit is not online, SUMS allocates storage space, names a directory, and copies the storage unit into that directory. SUMS then reports the working directory pathname to the JSOC catalog, where it is accessible to the user. All storage units are owned and managed by SUMS.
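The lookup path described above (data record, then DSIndex, then SUMS catalog, then working directory) can be sketched in a few lines of Python. This is only an illustrative model, not the SUMS interface: the class names, the `get_directory` method, the state strings, and the `/staging/unit_N` path scheme are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StorageUnit:
    dsindex: int
    state: str                  # "online", "nearline", or "offline"
    online_dir: Optional[str]   # directory path while online, else None


class Sums:
    """Toy stand-in for the SUMS internal catalog (hypothetical interface)."""

    def __init__(self, units: Dict[int, StorageUnit]):
        self.units = units

    def get_directory(self, dsindex: int) -> str:
        unit = self.units[dsindex]
        if unit.state != "online":
            # Stand-in for: allocate space, name a directory, copy the unit in.
            # The path scheme is invented for this sketch.
            unit.online_dir = f"/staging/unit_{dsindex}"
            unit.state = "online"
        return unit.online_dir


def locate_record(record_to_dsindex: Dict[int, int], sums: Sums, recnum: int) -> str:
    """Catalog lookup: record number -> DSIndex -> working directory."""
    dsindex = record_to_dsindex[recnum]
    return sums.get_directory(dsindex)
```

In this model the caller never learns whether the unit had to be staged; it only receives a directory path, which matches the idea that the storage medium is irrelevant to the concept.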

Storage units are "write-once" objects, and clients of SUMS can perform only two operations on them: 1) open an existing unit read-only, 2) create a new unit. Deletion or modification of storage units will be restricted to SUMS administrative programs and will require special privileges.
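A minimal sketch of what such a two-operation client interface might look like, assuming a unit is simply a directory of files. The class `SumsClient`, its method names, and the `unit_N` directory naming are hypothetical and chosen only to illustrate the write-once rule.

```python
import os


class SumsClient:
    """Hypothetical client view of SUMS: create a unit, or open one read-only."""

    def __init__(self, root: str):
        self.root = root
        self.next_index = 0

    def create_unit(self, files: dict) -> int:
        """Create a new storage unit holding the given files; return its index."""
        dsindex = self.next_index
        self.next_index += 1
        unit_dir = os.path.join(self.root, f"unit_{dsindex}")
        os.mkdir(unit_dir)  # raises FileExistsError rather than reusing a unit
        for name, payload in files.items():
            with open(os.path.join(unit_dir, name), "wb") as fh:
                fh.write(payload)
        return dsindex

    def open_unit(self, dsindex: int, name: str):
        """Open a file in an existing unit in read-only binary mode."""
        return open(os.path.join(self.root, f"unit_{dsindex}", name), "rb")
```

There is deliberately no update or delete method: in this sketch, changing data means creating a new unit, and removal would belong to a separate administrative tool.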

The data records from a particular data series will in general be stored in many storage units. The default size of a unit is specified when a series is created. It should be chosen based on knowledge of the size of the data records and how they are likely to be computed, so that a storage unit corresponds to the output of a "natural" processing batch and/or is a convenient size for data export and efficient transfer to tape. (The tape archival service can probably bundle multiple units, if available, into a single tar command to gain further efficiency.)

One of the goals of the JSOC data model is to allow a user to make a new version of a data record when the value of one or more keywords changes, without having to copy the large files making up the data segments. This is accomplished by allowing multiple data records to point to the same storage unit. A small example illustrates this. Consider a simple data series whose records have two keywords, "seriesnum" (the primary index) and "x", and a single data segment, "data". Assume that each storage unit contains a single data record, and consider the following sequence of events:

1) New record: create a new record with recordnum=0, seriesnum=1, x=10.0; store "data" in a new unit.
2) Keyword change: create a new version of the record with recordnum=1, x=10.1.
3) Data segment change: create a new version of the record with recordnum=2; store the updated data in a new unit.

Here is what happens in each step:

The first step creates a new entry in the DRMS database holding the keyword values. SUMS creates a new directory and inserts a new storage unit entry into its database. The file containing the data segment is stored in the new directory.
The second step only modifies the value of a keyword and gives rise to a new entry in the DRMS database. The data segment part is unchanged, and the DSIndex refers to the data stored in step 1.
In the final step the data segment is modified, which gives rise to new entries in both the DRMS and SUMS databases, i.e. a new data record and a new storage unit.

The three records are all different versions of the same object, since they share the same value of the primary index (seriesnum). The final state of the database tables would look something like this:

[Figure: unit_multi_version.eps]
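The three-step example can also be replayed with a toy model of the two databases. The sketch below is not DRMS or SUMS code: the `Catalog` class, its methods, and the DSIndex values (numbered from 0) are assumptions made for illustration, and it assumes that the third record inherits x=10.1 from the second.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    recordnum: int
    seriesnum: int   # primary index
    x: float
    dsindex: int     # storage unit holding the "data" segment


class Catalog:
    """Toy model: a DRMS record table plus a SUMS storage unit table."""

    def __init__(self):
        self.records: List[Record] = []
        self.units: Dict[int, str] = {}   # DSIndex -> stored data (placeholder)

    def _new_unit(self, data: str) -> int:
        dsindex = len(self.units)
        self.units[dsindex] = data
        return dsindex

    def new_record(self, seriesnum: int, x: float, data: str) -> Record:
        rec = Record(len(self.records), seriesnum, x, self._new_unit(data))
        self.records.append(rec)
        return rec

    def new_version(self, old: Record, x=None, data=None) -> Record:
        # Reuse the old storage unit unless new segment data is supplied.
        dsindex = old.dsindex if data is None else self._new_unit(data)
        new_x = old.x if x is None else x
        rec = Record(len(self.records), old.seriesnum, new_x, dsindex)
        self.records.append(rec)
        return rec


cat = Catalog()
r0 = cat.new_record(seriesnum=1, x=10.0, data="data v1")   # step 1
r1 = cat.new_version(r0, x=10.1)                           # step 2: keyword only
r2 = cat.new_version(r1, data="data v2")                   # step 3: new segment
```

After these calls the record table holds three rows with the same seriesnum, the first two sharing one DSIndex, while the unit table holds only two entries.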

[Question from Jesper Schou: Is the example above the only case where records are allowed to point to the same storage units? Should it be allowed that records with different primary index or even records from different series point to the same storage unit? It sort of wrecks the whole data concept. Are there examples where it would be really useful in a way that cannot be accomplished with links?]


Philip Scherrer 2006-06-17