show_coverage - Examine data completeness of series of certain types
[DRMS Utilities]

show_coverage {ds=}<seriesname> [-h]
show_coverage {ds=}<seriesname> [-igqostmv] [low=<starttime>] [high=<stoptime>] [block=<blocklength>] [key=<pkey>] [mask=<badbits>] [ignore=<ignorebits>] [<other>=


show_coverage_sock {same options as above}

show_coverage generates a "record-completeness map" for a given series over an interval of a single prime-key. For each expected record in a range of time (or a range of another quantity, such as the Frame Serial Number), show_coverage determines whether an actual record exists. To determine which records are "expected", show_coverage assumes that records are generated at a regular cadence over this quantity. For example, if a record is normally generated once every second, and the first record was created at noon, then we can expect records to exist at 12:00:00, 12:00:01, 12:00:02, etc. Most data series have at least one quantity that satisfies this assumption. This quantity is usually a keyword that represents time. Sometimes it is an ID or serial number that is unique to each record. More complex cadence patterns (such as irregularly spaced ones) are neither supported nor needed.
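
The idea of comparing expected records against actual ones can be illustrated with a minimal Python sketch. It is not the program's implementation; the function name coverage_map and the sample values are hypothetical, and times are reduced to plain integer seconds for simplicity.

```python
def coverage_map(first, cadence, count, existing):
    """Return (expected_value, present) pairs for `count` expected records.

    `first` is the value of the first expected record, `cadence` the regular
    spacing, and `existing` the values for which a record actually exists.
    All names here are hypothetical and only illustrate the concept.
    """
    existing = set(existing)
    expected = [first + i * cadence for i in range(count)]
    return [(value, value in existing) for value in expected]


# Hypothetical data: one record per second starting at noon (43200 s past
# midnight), with the record expected at 12:00:02 absent.
noon = 12 * 3600
result = coverage_map(noon, 1, 4, [noon, noon + 1, noon + 3])
```

Each pair in result states whether the expected record was found, which is the information that the completeness map summarizes.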

In keeping with the assumption that records are generated at a regular cadence over some quantity, the quantity used to define the range of expected records must support the representation of regularly spaced values. There is a catch if the quantity is a "time" keyword, however. DRMS stores time keywords as floating-point numbers, but due to limits of precision when representing floating-point numbers in computing, it is not possible to consistently specify the regularly-spaced values in a range of times. This problem with precision of floating-point numbers has been solved by the adoption of a scheme whereby the floating-point values are "slotted" - each floating-point value is mapped to a single integer value. This integer value is known as the "index value". This concept of slotting is discussed in detail elsewhere.
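
The precision problem and the effect of slotting can be seen in a short Python sketch. The rounding rule used in slot_index (nearest slot relative to an epoch) is an assumption made for illustration; this page does not state the exact rule that DRMS applies, and the names epoch and step are hypothetical.

```python
import math


def slot_index(value, epoch, step):
    """Map a floating-point value to an integer slot index.

    Assumed rule (not taken from this page): round to the nearest slot,
    counting slots of width `step` from `epoch`.
    """
    return math.floor((value - epoch) / step + 0.5)


epoch = 0.0
step = 0.1

# Stepping through a range by repeated floating-point addition drifts away
# from the exact multiples of the step.
t = epoch
for _ in range(10):
    t += step
drifted = (t != 1.0)

# The integer index still identifies the intended slot unambiguously.
index = slot_index(t, epoch, step)
```

Here the accumulated value is not exactly 1.0, yet it maps to the same index (10) as the exact value does, so comparisons on the index are reliable where comparisons on the floating-point value are not.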

As a consequence of floating-point imprecision, show_coverage requires that the quantity used to define the range of expected records be a slotted time quantity (if the quantity is a time keyword). Otherwise the quantity must be a keyword of integer type. This allows the individual values in the range to be expressed as integers internally.

The quantity used to define a range of regularly-spaced values is specified in the key parameter. The first and last values in this range are specified with the low and high parameters, respectively. Providing low and high is optional. If low is not specified, then the quantity's value for the first record in the series is used. Similarly, if high is not specified, then the quantity's value for the last record in the series is used. The assumed cadence comes from the data series itself. Slotted keywords have a built-in cadence (defined by the *_step auxiliary keyword), so if a slotted keyword is used for the key quantity, the cadence is taken from that auxiliary keyword. If the key quantity is a keyword of integer data type, then the assumed cadence is 1. For example, if the key quantity is FSN, and low is 100 and high is 200, then the expected values for FSN are 100, 101, 102, etc.
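
A minimal Python sketch of how the range of expected values might be resolved follows. The function expected_values and its arguments are hypothetical; it works on integer values (an integer keyword, or the index values of a slotted keyword) and only mirrors the defaults described above.

```python
def expected_values(low, high, first_record, last_record, cadence=1):
    """List the expected key values between the limits, inclusive.

    `low` or `high` may be None, in which case the value of the first or
    last record in the series is used instead. Hypothetical helper.
    """
    start = first_record if low is None else low
    stop = last_record if high is None else high
    return list(range(start, stop + 1, cadence))


# key=FSN with low=100 and high=200: an integer keyword, so the cadence is 1.
fsn_values = expected_values(100, 200, first_record=1, last_record=5000)
```

With these arguments fsn_values begins 100, 101, 102 and holds 101 values in total.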

show_coverage provides some short-cuts for specifying the low and high values for certain types of key quantities. If key is a

Since it needs a way to know whether any given record is expected, the program only works for series with an integer or slotted prime-key. It will not be helpful for series that do not expect each index value of a slotted series to be present (such as lev0 HK data for HMI and AIA).

The operation is to scan the series for all possible records between the low and high limits, or between the present first and last records if low or high are absent. For each record found, the "quality" of the data will be assessed. If a keyword named "QUALITY" is present, its value will be tested and the record will be labeled "MISSING" if QUALITY is negative. If both a QUALITY keyword is present in the data and the "mask" argument is present, then that mask will be ANDed with QUALITY to determine records to be marked as MISSING. If no QUALITY keyword is found but a keyword named "DATAVALS" is present, then DATAVALS will be tested and a record with DATAVALS == 0 will be labeled MISSING. If the record is not labeled MISSING it will be labeled "OK". If no record is present for a record slot in the range low to high, then that slot will be labeled unknown, or "UNK" for short.
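
The labeling rules can be summarized in a short Python sketch. It is an illustration rather than the program's code: the function classify and the dictionary representation of a record are hypothetical, the short label MISS is used for MISSING as in the output table, and the sketch assumes that when mask is given the bitwise test replaces the sign test.

```python
def classify(record, mask=None):
    """Label one slot as "OK", "MISS" or "UNK".

    `record` is None when no record exists for the slot, otherwise a dict of
    keyword values (hypothetical representation). Assumption: when `mask` is
    given, the bitwise test replaces the QUALITY < 0 test.
    """
    if record is None:
        return "UNK"
    if "QUALITY" in record:
        quality = record["QUALITY"]
        if mask is not None:
            missing = (quality & mask) != 0
        else:
            missing = quality < 0
        return "MISS" if missing else "OK"
    if "DATAVALS" in record and record["DATAVALS"] == 0:
        return "MISS"
    return "OK"


labels = [classify(None), classify({"QUALITY": -1}), classify({"DATAVALS": 12})]
```

Here labels holds one label per examined slot, in the order UNK, MISS, OK.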

The record completeness summary is ordered by a single prime-key. If no key parameter is present the first prime-key of integer or slotted type will be used. If the series is structured with multiple prime-keys and the completeness of a subset specified by additional prime-keys is desired, then those prime keys and selected values may be provided as additional keyword=value pairs.

... - optional additional query clauses to restrict the records used in the survey.

JSOC flags:
DRMS common main program
The prime-key specified by pkey, or else the first integer or slotted prime-key, will be used to order the completeness survey and will be referred to as the "ordering prime-key". The ordering prime-key may be of type TIME, another floating-point slotted keyword, or an integer type. In the discussions here, the word "time" will be used to refer to the ordering keyword even if it is of some other type.

If the -o flag (for online) is present, then each record labeled OK will be tested to verify that any storage unit that has been assigned to that record still exists, either online or on tape in SUMS. If there once was a storage unit but no longer is (due to expired retention time of a non-archived series), the record will be labeled "GONE". Note that this can be an expensive test to make on large series since it requires calls to SUMS and is noticeably slower than running without the -o flag. Please use the -o flag only on the selected ranges of a series where needed. If the -g flag is set, -o will be forced to be set and any records with storage units not online will be labeled as UNK.

Records are identified as MISS if QUALITY is < 0 or, if the mask parameter is provided, only if one or more bits set in mask are also set in QUALITY. If ignore is provided and any bits in ignore are also set in QUALITY, the record is not used in the comparison with mask. The -m flag causes the MISS category to be added to the UNK category. This simplifies finding which records might need to be recomputed.

A summary will be printed if the -s flag is present.

The "other" argument may be used to add where clauses to the query for records. The "other" clauses are not used for determining the first and last slots, but only for the existence information. As such, any records that are excluded by matching the "other" clauses will be counted in the UNK category.

After all specified records have been examined a table of the resulting completeness information is printed. There are two formats for this table. The default format is a list of contiguous segments with the same record label, OK, MISS, UNK, or GONE. For each contiguous segment one line of information will be printed containing the label, the start "time", and the count of records in the contiguous same-label interval. The time printed will be the prime key value of the first record in the interval. If the -t flag is present then the first and last time of each contiguous segment will be printed as well as the count.
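
The default table is in effect a run-length encoding of the per-slot labels. A minimal Python sketch, with a hypothetical function name and made-up key values, is:

```python
from itertools import groupby


def segments(labelled):
    """Collapse (key_value, label) pairs, given in key order, into runs.

    Returns one (label, first_value, last_value, count) tuple per contiguous
    run of identical labels. Hypothetical helper for illustration only.
    """
    out = []
    for label, group in groupby(labelled, key=lambda pair: pair[1]):
        run = list(group)
        out.append((label, run[0][0], run[-1][0], len(run)))
    return out


# Made-up key values and labels.
table = segments([(1, "OK"), (2, "OK"), (3, "UNK"), (4, "OK")])
```

The default output corresponds to the label, first value and count of each tuple. The last value is what the -t flag adds.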

If the block parameter is specified, the alternate table format will be used. In the blocked case the records are grouped in intervals of length <blocklength> and a summary line is printed for each block. The first block is aligned with the first record time (either the first record in the series or low). The summary printed includes the start time of the block and the number of records in the block labeled OK, MISS, UNK, or GONE (if -o is present). If the ordering prime-key is of type TIME, then the blocking interval <blocklength> may have suffixes to specify time intervals, such as s=seconds, d=day, h=hour, etc., as recognized by atoinc(3).
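
The blocked format can be sketched in Python as a per-block count of labels. The function block_summary is hypothetical, and the sketch assumes that the key values and the block length are already expressed in the same integer units (for a TIME key the program converts the suffixed interval itself).

```python
from collections import Counter


def block_summary(labelled, block_length):
    """Count labels per block of `block_length` key units.

    `labelled` holds (key_value, label) pairs in key order. The first block
    is aligned with the first key value. Hypothetical helper.
    """
    if not labelled:
        return []
    first = labelled[0][0]
    blocks = {}
    for value, label in labelled:
        start = first + ((value - first) // block_length) * block_length
        blocks.setdefault(start, Counter())[label] += 1
    return [(start, dict(counts)) for start, counts in sorted(blocks.items())]


# Made-up key values and labels, summarized in blocks of 2.
summary = block_summary(
    [(100, "OK"), (101, "OK"), (102, "UNK"), (103, "MISS"), (104, "OK")], 2
)
```

Each entry of summary pairs a block start value with the number of records carrying each label in that block.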

If the -i flag is present, the start times in the printed table will be the index number rather than the slot label. E.g. if the prime key used for the completeness survey is T_REC then instead of printing the time T_REC, the value of T_REC_index will be printed.

A header will be printed before the completeness table. If the -q (quiet) flag is present the header will not be printed.

Example 1: To show the coverage in the first MDI Dynamics run:
  show_coverage ds=mdi.fd_V_lev18 low=1996.05.23_22_TAI high=1996.07.24_04:17_TAI
This shows many small gaps in the complete dynamics interval.

Example 2: To show the summary of records in a range of data, such as the above MDI dynamics run:

  show_coverage ds=mdi.fd_V_lev18 low=1996.05.23_22_TAI high=1996.07.24_04:17_TAI block=1000d
Here block is set to a large number to gather all the information into a single line. Simple math then shows the data is actually 96.6% complete.
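
The "simple math" is a ratio of counts taken from the single summary line. A small Python sketch follows. It assumes that completeness means the OK count divided by the total number of expected records, and the counts in the example call are hypothetical, not the actual values for this series.

```python
def percent_complete(ok, miss, unk, gone=0):
    """Percentage of expected records labeled OK (assumed definition)."""
    total = ok + miss + unk + gone
    if total == 0:
        return 0.0
    return 100.0 * ok / total


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = percent_complete(ok=90, miss=5, unk=5)
```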

Special Cases:
Some specific code is included to handle HMI and AIA special cases.

Bad FSN: Both HMI and AIA can produce erroneously large values for FSN (Frame Serial Number) due to hardware issues onboard. When the user does not specify a "high" limit for analysis and the first 7 characters of the seriesname are "hmi.lev" or "aia.lev", the query for the series high value adds the clause FSN < 0X1C000000. There are some cases with an erroneous FSN lower than this, but there is no easy test for them.
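
The effect of this clause can be sketched in Python. The program adds the condition to its database query; the hypothetical function below applies the same condition to an in-memory list instead, purely for illustration.

```python
BAD_FSN_LIMIT = 0x1C000000  # FSN values at or above this are treated as bad


def default_high_fsn(series_name, fsn_values):
    """Pick the default "high" FSN, skipping bad values for HMI/AIA series.

    Hypothetical helper: the filter only applies when the first 7 characters
    of the series name are "hmi.lev" or "aia.lev".
    """
    if series_name[:7] in ("hmi.lev", "aia.lev"):
        fsn_values = [fsn for fsn in fsn_values if fsn < BAD_FSN_LIMIT]
    return max(fsn_values) if fsn_values else None


# Made-up FSN values, one of which is erroneously large.
high = default_high_fsn("hmi.lev1", [10, 20, 0x1C000001])
```

In this example the erroneously large value is skipped and high is 20.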

FSN and times: Some series, e.g. hmi.cosmic_rays, have both T_OBS and FSN as prime keys but are not slotted on time. In this case, as well as in the general case of lev0 and lev1 data or other series where a T_OBS keyword exists and is not the selected key for analysis, the user may wish to limit the analysis based on time instead of the integer key (e.g. FSN). "high" and "low" may then be expressed as a time if and only if the time string contains at least one "_" char. If the seriesname starts with "hmi.lev" or "aia.lev", then the appropriate lev0 series will be used to convert times to FSN. Note that this will fail if the low time is after the end of the data or if the high time requested is before the beginning of the mission. NOTE May 2013: changed to hmi.lev1 or aia.lev1 instead of the lev0, since lev0 does not (yet) have shadow tables. The program also looks to see if either T_REC or T_OBS is a prime key and uses the one that is prime. FUTURE - check to see if either is indexed, and use that one.
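
The two decisions described here can be sketched in Python. Both function names are hypothetical, and the order in which T_REC and T_OBS are examined is an assumption, since the text only says that the one that is prime is used.

```python
def is_time_limit(limit):
    """A low/high value is treated as a time only if it contains "_"."""
    return "_" in limit


def choose_time_key(prime_keys):
    """Return whichever of T_REC or T_OBS is a prime key, else None.

    Assumption: T_REC is examined before T_OBS.
    """
    for candidate in ("T_REC", "T_OBS"):
        if candidate in prime_keys:
            return candidate
    return None


as_time = is_time_limit("2010.05.01_TAI")   # contains "_": a time
as_integer = is_time_limit("46080")         # no "_": an integer key value
time_key = choose_time_key(["FSN", "T_OBS"])
```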

Example 3: To show the summary of records in a range of time where the prime key is FSN, such as hmi.lev0a:

  show_coverage hmi.lev0a low=2010.05.01_TAI high=2010.06.01_TAI block=46080 key=FSN

Limitation: Since DRMS is queried using drms_record_getvector, the QUALITY keyword must be an integer type to be used. Only the single prime key value, either QUALITY or DATAVALS, and possibly sunum are used, so all must be convertible to long long by PostgreSQL.
Efficiency: Although records that are not OK need not be queried in SUMS for online status, they are queried and that information is ignored. Some effort is wasted here, but for SDO it should be very small.

See also:
show_info - Examine a dataseries structure or contents

Generated on Mon Mar 26 07:00:53 2018 for JSOC_Documentation by  doxygen