The Global Helioseismology Pipeline: How It Works

The global helioseismology pipeline consists of a sequence of data processing steps that takes images as its input and gives as output inferences of physical parameters inside the Sun, along with several intermediate data products. Conceptually, it can be summarized as follows. Input images are remapped and decomposed into their spherical harmonic components. These spherical harmonic coefficients are rearranged into timeseries, typically of length 72 days. The Fourier transforms of these timeseries are fit to yield the mode parameters, which can then be inverted for either internal rotation or sound speed. In the following sections I give a more detailed account of these steps, as well as a description of the scripts and modules used to carry them out.

Introduction

The typical input for the pipeline is dopplergrams (velocity images), although images of intensity and line core intensity (or even line width) can also be used. Specifically, the pipeline was designed to use data from MDI and HMI, but it can also be used with AIA, GONG, or MWO data. In principle it can be used to analyze any sequence of images that can be imported into DRMS (Data Record Management System; see the documentation at http://jsoc.stanford.edu/jsocwiki/FrontPage).

The MDI instrument took science data from May 1996 to February 2011. Equipped with a 1024^2 CCD, it gave dopplergrams of this resolution for a few months each year. Due to telemetry constraints, for the rest of the time we have only vector-weighted dopplergrams, which means they have been convolved with a Gaussian, subsampled by a factor of five in each dimension, and cropped, to yield images of resolution 192^2. In both cases the cadence was typically one minute. The HMI instrument, with a 4096^2 CCD, started taking science data in April 2010. Full-resolution dopplergrams are available at a cadence of 45 seconds, as well as images of intensity, line depth, and line width.

Regardless of their source, the first step in the global helioseismology pipeline is to remap the input images to a regular grid in longitude and sin(latitude). This gives us an opportunity to correct for instrumental problems, such as distortion, that have not already been removed, and it allows us to use a static set of target spherical harmonics to project onto. The maps are then Fourier transformed in longitude and transposed, and an inner product is taken with a target set of associated Legendre functions to yield the spherical harmonic coefficients up to some maximum degree. Although we will only fit up to l=300, we generate the spherical harmonics up to l=1000 for MDI full disk data, and possibly higher for HMI data.

Since the spherical harmonics form an orthogonal basis, this process would ideally separate the modes from each other. However, since we cannot observe the entire surface of the Sun, modes which are nearby in degree l and order m leak into the target mode. This effect is quantified in the leakage matrix, which is used later when doing the fitting. The leakage matrix is generated by its own branch of the pipeline, and will be described in more detail below.

Once each image is decomposed in this way, the spherical harmonic coefficients are rearranged (retiled) into longer timeseries, one for each l and m. Typically we use 72-day timeseries, although any length is possible.
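To make the decomposition and retiling concrete, here is a minimal numpy sketch of the idea. It is not the pipeline's implementation (the real transform is done by the compiled module jv2ts, described below): normalization, apodization, the transposition, and the leakage correction are all omitted, and every function name here is hypothetical.

    import numpy as np
    from scipy.special import lpmv

    def sht_coeffs(img, lmax):
        # img: one remapped image on a regular grid in longitude x sin(latitude),
        # shape (nlat, nlon). The FFT in longitude isolates the azimuthal order m.
        nlat, nlon = img.shape
        x = np.linspace(-1.0, 1.0, nlat)        # the sin(latitude) grid
        fm = np.fft.rfft(img, axis=1) / nlon    # Fourier transform in longitude
        coeffs = {}
        for l in range(lmax + 1):
            for m in range(l + 1):
                plm = lpmv(m, l, x)             # associated Legendre function P_l^m
                coeffs[(l, m)] = np.sum(fm[:, m] * plm) * (2.0 / nlat)
        return coeffs

    def retile(images, lmax):
        # Retiling: one timeseries per (l, m), one sample per input image.
        series = {}
        for img in images:
            for lm, c in sht_coeffs(img, lmax).items():
                series.setdefault(lm, []).append(c)
        return {lm: np.asarray(ts) for lm, ts in series.items()}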
The next step is to identify bad data points which should be rejected. For MDI and HMI, we start by rejecting any images which were taken with the ISS (Image Stabilization System) turned off. A median filter is also applied to reject outliers. For the purposes of detrending, discontinuities in the data must also be located. For MDI, this has been done manually; for HMI, it can be done automatically using several keywords from the input images. The next step in the processing is to actually perform the detrending, which is done by polynomial subtraction. The data are then gapfilled using an autoregressive algorithm, which is iterated three times. At this point power spectra can be generated for those who need them. A new window function is also generated to reflect the filled gaps.

For our own fitting, we use the Fourier transform of the gapfilled data. The peaks in the Fourier transform are fit using information from the leakage matrix and a maximum likelihood technique, resulting in a peak frequency, amplitude, width, and background for each l and radial order n that could be fit. The m-dependence of the frequencies is parameterized by the a-coefficients, which are fit for directly in the peakbagging, with the other mode parameters assumed to be independent of m. Finally, the frequencies and a-coefficients can be inverted to yield the differential rotation or sound speed inside the Sun.

In the following sections I describe this process from a software point of view. If you are running these programs, they can be found in $JSOCROOT/proj/globalhs/scripts/ and $JSOCROOT/proj/globalhs/apps/. They can also be accessed from the web at http://jsoc.stanford.edu/cvs/JSOC/proj/globalhs/.

The Big Script: doglobalhs

The tcsh script that runs all of these steps in sequence is called doglobalhs. This and all globalhs scripts take arguments in the form name=value, with no spaces allowed. The first thing doglobalhs does is call checkglobalhsargs, passing it the same input parameters doglobalhs was given. This script ensures that there are no spurious spaces in the input parameters, that every parameter given is one that is used by doglobalhs, and that parameters with only a small set of possible values have a value that is known to doglobalhs.

The minimum required arguments are

starttime - the beginning of the time period to be analyzed, typically given as an offset from the MDI epoch of 1993.01.01_00:00:00_TAI (e.g. a day number suffixed with 'd'), but date strings are also allowed.
totaltime - the total amount of time to be analyzed, given as a string, such as '72d' or '48h'. This string will be parsed by the program durcon, the documentation for which can be found below.
label - a string that labels a particular way of running the pipeline. It should uniquely map to the set of all input parameters needed by all modules in the pipeline. This is implemented as parameter file templates located in $JSOCROOT/proj/globalhs/scripts/parmtemplates/, and will be described in more detail in the next section.

Depending on what parameter file templates are present and their content, the doglobalhs script will run a series of other scripts: dosht, doshtcheck, doretilen, dodscopyn, doretile1, dodscopy1, domkgaps, dogapfill, dogfgaps, dopow, dopowslice, dopkbgn, and doinvert. Each of these will be discussed in turn below, and each of them requires a parameter file template called {desc}.parms.blank to be present in the directory in which it is run, where {desc} is a descriptive string for each processing step: 'v2t', 'retn', 'ret1', 'ret', 'tsf', 'pow', 'slc', 'pkb', or 'inv'.
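As an illustration of this template-driven dispatch, here is a schematic in Python (the real doglobalhs is tcsh, and the pairing of working directories with {desc} strings shown here is partial and partly guessed):

    import os, shutil

    # Hypothetical pairs of (working subdirectory, {desc} string); the full
    # correspondence between scripts and {desc} strings is given in the text.
    STEPS = [('sht', 'v2t'), ('retilen', 'retn'), ('retile1', 'ret1'),
             ('gapfill', 'tsf'), ('pow', 'pow'), ('pkbgn', 'pkb'), ('invert', 'inv')]

    def dispatch(label, workdir, templatedir):
        for stepdir, desc in STEPS:
            template = os.path.join(templatedir, '%s.parms.%s' % (desc, label))
            if not os.path.isfile(template):
                continue              # no template for this label: skip the step
            rundir = os.path.join(workdir, stepdir)
            os.makedirs(rundir, exist_ok=True)
            shutil.copy(template, os.path.join(rundir, '%s.parms.blank' % desc))
            # ...doglobalhs would now run the corresponding script in rundir...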
The doglobalhs script copies a file called {desc}.parms.label in $JSOCROOT/proj/globalhs/scripts/parmtemplates/ to the file {desc}.parms.blank in the working directory for each of the scripts it calls. If no such file is found for that label, that processing step is skipped. For example, if I run the command

doglobalhs starttime=6400d totaltime=72d label=mdivw72d

then the first thing the doglobalhs script will do, after setting a number of environment and shell variables, is copy the file $JSOCROOT/proj/globalhs/scripts/parmtemplates/v2t.parms.mdivw72d to the file v2t.parms.blank in the working directory for the script dosht. If there is no such file, it moves on to the next step. The value of the parameter label is also given to the environment variable GLOBALHS_LABEL. This variable is used as the suffix for the names of all jobs submitted to the cluster, which allows multiple labels to be run simultaneously.

The doglobalhs script can also take a large number of optional parameters, which are usually left at their default values. These are discussed here in decreasing order of how likely it is that one would want to specify them:

lmin - minimum spherical harmonic degree to compute, default 0
lmax - maximum spherical harmonic degree to compute, default 300
lchunk - number of l's to chunk together for short timeseries, default 40
firststep - string corresponding to the first processing step desired, default 'sht'
laststep - string corresponding to the last processing step desired, default 'pkbgn'

These last two arguments are only likely to be specified to restart the pipeline in case of a failure or to repeat only a subset of the processing, often because data may have aged off disk, or to run different processing steps for different ranges in l. For example, we might want to create power spectra up to l=1000, but only run the peakbagging up to l=300. In this case I might run doglobalhs once with only its required arguments, and then a second time with 'lmin=301 lmax=1000 laststep=pow'. The allowed values for these two arguments are 'sht', 'shtcheck', 'retilen', 'dscopyn', 'retile1', 'dscopy1', 'mkgaps', 'gapfill', 'gfgaps', 'pow', 'slice', 'pkbgn', and 'invert'.

The next set of optional parameters gives further control of how the l's are grouped together (a sketch of this grouping follows the time parameters below). These will be described in more detail in the sections describing the processing steps they apply to. They are

lchunk1 - passed as lchunk to dosht and lchunkin to doretilen, defaults to lchunk
lchunk2 - passed as lchunkout to doretilen and lchunkin to doretile1, defaults to lchunk
lchunk3 - passed as lchunkout to doretile1, defaults to 1
lchunk4 - passed as lchunk to dogapfill and dopow, defaults to 10
lchunk5 - passed as lchunk to dopowslice, defaults to lchunk4

Similarly, there are parameters that control grouping in time:

timechunk - defaults to '1d', which means one day
timechunk1 - passed as timechunk to dosht, defaults to timechunk
timechunk2 - passed as timechunk to doretilen, defaults to '12d'
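The l-grouping amounts to splitting a degree range into chunks, one chunk per cluster job. A hypothetical sketch (the exact boundary handling in the scripts may differ):

    def lchunks(lmin, lmax, lchunk):
        # Split lmin..lmax into chunks of lchunk degrees; the last chunk is
        # truncated at lmax.
        return [(l, min(l + lchunk - 1, lmax))
                for l in range(lmin, lmax + 1, lchunk)]

    print(lchunks(0, 300, 40))   # [(0, 39), (40, 79), ..., (280, 300)]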
There are three additional parameters that control which portions of timeseries are used to make power spectra. These will be described in more detail in the sections describing the scripts dopow and dopowslice:

powtotaltime - passed as totaltimeout to dopow and dopowslice, defaults to totaltime, which is always passed as totaltimein to these scripts.
powstartoffset - passed as startoffset to dopow and dopowslice, given as a string, defaults to '0d'.
slicestep - passed as slicestep to dopowslice, given as a string, defaults to '3d'.

One may also specify a flag to suppress the writing of FFTs for the peakbagging (in case they have already been written):

pkbwriteflag - passed as writeflag to dopkbgn, defaults to 1.

The doglobalhs script can also take a number of arguments that control how jobs are submitted to the computing cluster for each of the processing steps. For each of these arguments specified, it sets a corresponding environment variable that will be read by one of the scripts it calls. If these variables are unset, a default is assigned by the script that attempts to read them. For dosht and doretilen, the number of jobs they submit to the cluster will be equal to totaltime/timechunk. For the remaining scripts, the number of jobs they submit is given by the following parameters:

retnjobs - sets GLOBALHS_RETNJOBS, read by doretile1
tsfnjobs - sets GLOBALHS_TSFNJOBS, read by dogapfill
pownjobs - sets GLOBALHS_POWNJOBS, read by dopow and dopowslice
pkbnjobs - sets GLOBALHS_PKBNJOBS, read by doiter (called by dopkbgn)

Each processing step also has a job threshold. The number of jobs of a particular type must fall below this number before more jobs may be submitted. These parameters are

v2tjobsthreshold - sets GLOBALHS_V2TJOBTHRESHOLD, read by dosht
retjobsthreshold - sets GLOBALHS_RETJOBTHRESHOLD, read by doretilen and doretile1
tsfjobsthreshold - sets GLOBALHS_TSFJOBTHRESHOLD, read by dogapfill
powjobsthreshold - sets GLOBALHS_POWJOBTHRESHOLD, read by dopow and dopowslice
pkbjobsthreshold - sets GLOBALHS_PKBJOBTHRESHOLD, read by doiter (called by dopkbgn)

Finally, the particular queue to which each processing step submits jobs is a string and may be specified as follows. If the environment variable is unset, the string 'j.q' is always used.

v2tqueue - sets GLOBALHS_V2TQUEUE, read by dosht
retqueue - sets GLOBALHS_RETQUEUE, read by doretilen and doretile1
tsfqueue - sets GLOBALHS_TSFQUEUE, read by dogapfill
powqueue - sets GLOBALHS_POWQUEUE, read by dopow and dopowslice
pkbqueue - sets GLOBALHS_PKBQUEUE, read by doiter (called by dopkbgn)

The last parameter one can specify is

topdir - top level directory in which to make working directories, defaults to '/tmp28'.

That is, the last thing doglobalhs does before the first step is to set the variable workdir to /$topdir/$USER/globalhs_work/$label/$starttime'_'$totaltime/$lmin-$lmax, which is used as the parent directory for processing.

Spherical Harmonic Transform: dosht

If doglobalhs finds a parameter file template for the label specified, it creates the directory workdir/sht and copies the file, as described above, into it. It then constructs the input recordset (see the JSOC wiki) from this file and its input parameters, and runs a show_info command to make sure that the entire input recordset is online (read from tape if necessary) before it proceeds to run the script dosht.

The dosht script starts by setting a variable jobthreshold, explained below, to the value of the environment variable GLOBALHS_V2TJOBTHRESHOLD if it is set, or 70 otherwise. If the environment variable GLOBALHS_LABEL is set, it is used as the suffix for job names via the variable suff; otherwise this variable is set to the empty string. The script gives the variable q the value of GLOBALHS_V2TQUEUE if it is set, or 'j.q' otherwise; this is the cluster queue that jobs will be submitted to.
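This env-or-default pattern, rendered in Python for clarity (the real dosht is tcsh), is simply:

    import os

    jobthreshold = int(os.environ.get('GLOBALHS_V2TJOBTHRESHOLD', 70))
    suff = os.environ.get('GLOBALHS_LABEL', '')      # suffix for cluster job names
    q = os.environ.get('GLOBALHS_V2TQUEUE', 'j.q')   # cluster queue to submit to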
If q gets the value 'a.q' in this manner, the script aliases qsub, qstat, and waittosubmit to qsub2, qstat2, and waittosubmit2, respectively. This will almost never be used, because a.q at present is reserved for compute-intensive jobs, and sht jobs typically involve too much I/O for it. Due to an inadequacy of the queueing system, the same commands cannot be used for both queues. For documentation on qsub{2} and qstat{2}, see their man pages. The documentation for waittosubmit{2} can be found below.

The script then checks for its required parameters, starttime and totaltime, which have the meanings given in the previous section. The value of the variable totaltime is parsed by the program durcon, which gives as its output the number of seconds corresponding to the strings given as its arguments. The variable totalsecs is then set to this number. durcon also parses the value of timechunk, which defaults to '1d' if not given, and sets the value of chunksecs to the corresponding number of seconds; hence, chunksecs will be 86400 by default. The script then knows the number of jobs it will submit to the cluster, which will be totalsecs/chunksecs, as computed by bc, and assigns this value to the variable ntimechunks. Hence, for the usual totaltime of '72d', 72 jobs will be submitted. If chunksecs does not evenly divide totalsecs, the number of jobs submitted will be ntimechunks+1. The parameters lmin, lmax, and lchunk have the meanings given above, and default to 0, 300, and 40, respectively.

The script then checks for the existence of the file v2t.parms.blank, and exits with an error message if it is not found. An example of this file for the label mdivw72d is given here:

in=mdi.vw_V[XXXX]
tsout=mdi.vw_V_sht_6h
MAPMMAX=600
SINBDIVS=200
VCORLEV=1
LGSHIFT=1
DATASIGN=-1
MAPLGMIN=-89.0
MAPLGMAX=89.0
MAPBMAX=89.0
MAPRMAX=0.95
APODIZE=1
APINNER=0.83
APWIDTH=0.04
PERR=-0.2
IERR=-0.1
DISTORT=2
LMIN=NNNN
LMAX=MMMM
LCHUNK=CCCC
TCHUNK=6h
TTOTAL=TTTT
TSTART=SSSS
VERSION=version0
##lzeroref=mdi.vw_V_lzero_72d
##sourcetype=mdi
##scratch=mdi.globalhs_scratch

This file is a template for the input parameters to the module jv2ts, which is the executable that performs the spherical harmonic transform. As with any JSOC module, its arguments can be specified like 'jv2ts @parameterfile' (see the JSOC wiki). In the template, any occurrence of a capital letter repeated four times (like XXXX, MMMM, NNNN, CCCC, TTTT, and SSSS above) will be replaced by the strings necessary for each run of the module. These strings are determined by the input parameters to the script. Any line beginning with '#' is considered a comment by a JSOC module and will be ignored. Any line beginning with '##' is understood by one of the globalhs scripts to specify a parameter needed by the script but not by any module.

The parsing of the parameter file template by dosht is significant, because it removes some of the flexibility that a parameter file would otherwise have. In general, JSOC modules accept 'name=value' in their input parameters, or alternatively 'name= value'. The latter form, however, is not accepted in parameter file templates read by the globalhs scripts, even though the modules would technically accept it. In other words, do not use spurious spaces in parameter file templates, because the globalhs scripts replace every space by a newline to make subsequent parsing easier. In particular, the file parms.tmp is created from v2t.parms.blank by removing lines which have '#' as their first nonspace character, then replacing all occurrences of spaces with newlines, and finally removing lines that contain nothing but spaces. The result is that there is a line in parms.tmp for every parameter specified in v2t.parms.blank, and each parameter is at the beginning of its line.
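Schematically, the normalization of the template and the filling of the four-letter placeholders look like this in Python (the real scripts do this with tcsh and sed; this rendering is only illustrative):

    def normalize(template_text):
        # Drop lines whose first nonspace character is '#', turn every space
        # into a newline, and drop empty lines, so that each parameter starts
        # its own line.
        tokens = []
        for line in template_text.splitlines():
            if line.lstrip().startswith('#'):
                continue
            tokens.extend(t for t in line.split(' ') if t)
        return '\n'.join(tokens)

    def fill(parms, substitutions):
        # substitutions maps each placeholder to its value, e.g.
        # {'NNNN': str(lmin), 'MMMM': str(lmax), 'CCCC': str(lchunk), ...}
        for placeholder, value in substitutions.items():
            parms = parms.replace(placeholder, value)
        return parms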
The next part of dosht queries DRMS so that starttime may be interpreted correctly whether it is given as a date string or a day number. To best facilitate understanding of this, I repeat some of the DRMS documentation here. When a dataseries has a primekey that may take any floating point value, it is highly advisable for this primekey to be slotted. This term refers to a mapping between the floating point value and an integer value (the slot number or index value). This is necessary because floating point values will in general not compare exactly with their printed representations. An example is a primekey of type time, which is equivalent to double. Internally, times are stored as the double precision number of seconds that have elapsed since 1977.01.01_00:00:00_TAI. When the primekey is slotted, however, two additional constant keywords in the dataseries are needed to convert the floating point value to the index value: the epoch (the time of the zeroth slot) and the step (the slot width). For slotted primekeys of type time, DRMS provides the option to specify the beginning of an interval (or a single value) as an offset from the epoch. This is by far the preferred method when using the global pipeline, although date strings are accepted for those who prefer to make things difficult for themselves.

Therefore dosht reads the epoch and step keywords (T_REC_epoch and T_REC_step) from the input dataseries given in v2t.parms.blank in order to compute the index value corresponding to starttime. If the program durcon is able to parse starttime, it is assumed to be an offset from the epoch; this is the normal usage. If durcon cannot parse starttime, it is assumed to be a date string. In either case the input recordset for each time chunk is constructed using index values. (For example, if the input series has a step of one minute, starttime=6400d corresponds to index value 6400*1440 = 9216000.)

dosht then loops over the time chunks, creating one parameter file and one cluster script for each. The parameter file is made from parms.tmp by replacing MMMM with lmax, NNNN with lmin, CCCC with lchunk, and TTTT with timechunk. For each time chunk, XXXX is replaced with the index interval of the corresponding input records, and SSSS is replaced with the date string corresponding to the beginning of the interval. The cluster scripts are named subt.$firstindex.$index.$lmin-$lmax$suff, the essential parts of which are

jv2ts @v2t.parms.$index >& v2t.log.$index
echo $status >& jv2ts.exitstatus.$index

where $index is the index value of the first input record for that time chunk. If timechunk does not evenly divide totaltime, one additional parameter file and cluster script are written to cover the remainder of the time.

dosht then calls waittosubmit, which calls qstat in a loop and returns once the total number of jobs of the specified type (in this case, jobs containing the string 'subt') falls below the number given by jobthreshold (see above).
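Schematically, this polling loop looks like the following (the real waittosubmit is a tcsh script around qstat; this Python rendering is only illustrative):

    import subprocess, time

    def waittosubmit(jobstring, jobthreshold, poll=60):
        # Wait until the number of queued or running jobs whose qstat listing
        # contains jobstring (e.g. 'subt') falls below jobthreshold.
        while True:
            out = subprocess.run(['qstat'], capture_output=True, text=True).stdout
            njobs = sum(jobstring in line for line in out.splitlines())
            if njobs < jobthreshold:
                return
            time.sleep(poll)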
Once waittosubmit returns, all of the cluster scripts are submitted. dosht itself then enters a loop in which it calls qstat, waiting for the jobs it just submitted to finish. Once the jobs have finished, dosht confirms that a log file exists for every parameter file; that is, it makes sure the cluster scripts were actually executed. It then checks that all of the jv2ts.exitstatus.* files contain only 0, and that all of the log files contain the string "successful completion", which all the globalhs modules print if they run successfully. If any failures are detected, they are recorded in the file faillog, and the failed jobs are resubmitted. After they finish, the check is repeated. If some jobs have still failed, the script aborts with exit status 1. Otherwise it echoes the string "successful completion" itself and exits with status 0.

Better Check: doshtcheck

Retiling Several Time Periods: doretilen

Optional Dataset Copy: dodscopyn

Retiling a Single Time Period: doretile1

Optional Dataset Copy: dodscopy1

Determining the Window: domkgaps

Detrending and Gapfilling: dogapfill

Determining the Gapfilled Window: dogfgaps

Generating Power Spectra: dopow

Generating Spectra of Slices of the Timeseries: dopowslice

Peakbagging: dopkbgn

Two Dimensional RLS Inversions: doinvert

Ancillary Programs: durcon and waittosubmit

Scripts Not Called by doglobalhs: dorepeatpow, dorebinsmooth, doreweed

Dataseries and Keywords

Almost all of the dataseries used by the globalhs pipeline have T_START as their first primekey, which is of type time (equivalent to double). Internally, T_START is stored as the double precision number of seconds that have elapsed since 1977.01.01_00:00:00_TAI. This primekey is slotted with a slot type of TS_SLOT, which requires several other constant keywords in the dataseries definition: T_START_epoch, T_START_step, and T_START_round. The true primekey is then the integer T_START_index, given by

T_START_index = (T_START - T_START_epoch + T_START_round/2) / T_START_step,

where T_START_epoch = 1993.01.01_00:00:00_TAI (the MDI epoch). T_START_round is set to the cadence of the input images, but for the purposes of understanding it can be taken as zero. T_START_step is one day for most dataseries, so T_START_index is usually equal to the MDI day number.

Glossary of Acronyms

MDI - Michelson Doppler Imager
HMI - Helioseismic and Magnetic Imager
AIA - Atmospheric Imaging Assembly
GONG - Global Oscillation Network Group
MWO - Mount Wilson Observatory
JSOC - Joint Science Operations Center
DRMS - Data Record Management System
SUMS - Storage Unit Management System
DSDS - Data Storage and Distribution System