The Global Helioseismology Pipeline: How It Works

The global helioseismology pipeline consists of a sequence of data processing steps that takes images as its input and gives as output inferences of physical parameters inside the Sun, along with several intermediate data products. Conceptually, it can be summarized as follows. Input images are remapped and decomposed into their spherical harmonic components. These spherical harmonic coefficients are rearranged into timeseries, typically of length 72 days. The Fourier transforms of these timeseries are fit to yield the mode parameters, which can then be inverted for either internal rotation or sound speed. In the following sections I give a more detailed account of these steps, as well as a description of the scripts and modules used to carry them out.

Introduction

The typical input for the pipeline is dopplergrams (velocity images), although images of intensity and line core intensity (or even line width) can also be used. Specifically, the pipeline was designed to use data from MDI and HMI, but it can also be used with AIA, GONG, or MWO data. In principle it can be used to analyze any sequence of images that can be imported into DRMS (Data Record Management System; see the documentation at http://jsoc.stanford.edu/jsocwiki/FrontPage).

The MDI instrument took science data from May 1996 to February 2011. Equipped with a 1024^2 CCD, it gave dopplergrams of this resolution for a few months each year. Due to telemetry constraints, for the rest of the time we have only vector-weighted dopplergrams, which means they have been convolved with a Gaussian, subsampled by a factor of five in each dimension, and cropped, to yield images of resolution 192^2. In both cases the cadence was typically one minute. The HMI instrument, with a 4096^2 CCD, started taking science data in April 2010. Full-resolution dopplergrams are available at a cadence of 45 seconds, as well as images of intensity, line depth, and line width.

Regardless of their source, the first step in the global helioseismology pipeline is to remap the input images to a regular grid in longitude and sin(latitude). This gives us an opportunity to correct for instrumental problems, such as distortion, that have not already been removed, and it allows us to use a static set of target spherical harmonics to project onto. The maps are then Fourier transformed in longitude and transposed, and an inner product is taken with a target set of associated Legendre functions to yield the spherical harmonic coefficients up to some maximum degree. Although we will only fit up to l=300, we generate the spherical harmonics up to l=1000 for MDI full disk data, and possibly higher for HMI data.

Since the spherical harmonics form an orthogonal basis, this process would ideally separate the modes from each other. However, since we cannot observe the entire surface of the Sun, modes which are nearby in degree l and order m leak into the target mode. This effect is quantified in the leakage matrix, which is used later when doing the fitting. The leakage matrix is generated by its own branch of the pipeline, and will be described in more detail below.

Once each image is decomposed in this way, the spherical harmonic coefficients are rearranged (retiled) into longer timeseries, one for each l and m. Typically we use 72-day timeseries, although any length is possible.
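To make the decomposition and retiling concrete, here is a minimal numpy sketch of the idea. It is not the pipeline's implementation (the real transform is done by the compiled module jv2ts, described below): normalization, apodization, the transposition, and the leakage correction are all omitted, and every function name here is hypothetical.

    import numpy as np
    from scipy.special import lpmv

    def sht_coeffs(img, lmax):
        # img: one remapped image on a regular grid in longitude x sin(latitude),
        # shape (nlat, nlon). The FFT in longitude isolates the azimuthal order m.
        nlat, nlon = img.shape
        x = np.linspace(-1.0, 1.0, nlat)        # the sin(latitude) grid
        fm = np.fft.rfft(img, axis=1) / nlon    # Fourier transform in longitude
        coeffs = {}
        for l in range(lmax + 1):
            for m in range(l + 1):
                plm = lpmv(m, l, x)             # associated Legendre function P_l^m
                coeffs[(l, m)] = np.sum(fm[:, m] * plm) * (2.0 / nlat)
        return coeffs

    def retile(images, lmax):
        # Retiling: one timeseries per (l, m), one sample per input image.
        series = {}
        for img in images:
            for lm, c in sht_coeffs(img, lmax).items():
                series.setdefault(lm, []).append(c)
        return {lm: np.asarray(ts) for lm, ts in series.items()}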
The next step is to identify bad data points which should be rejected. For MDI and HMI, we start by rejecting any images which were taken with the ISS (Image Stabilization System) turned off. A median filter is also applied to reject outliers. For the purposes of detrending, discontinuities in the data must also be located. For MDI, this has been done manually; for HMI, it can be done automatically using several keywords from the input images. The next step in the processing is to actually perform the detrending, which is done by polynomial subtraction. The data are then gapfilled using an autoregressive algorithm, which is iterated three times. At this point power spectra can be generated for those who need them. A new window function is also generated to reflect the filled gaps.

For our own fitting, we use the Fourier transform of the gapfilled data. The peaks in the Fourier transform are fit using information from the leakage matrix and a maximum likelihood technique, resulting in a peak frequency, amplitude, width, and background for each l and radial order n that could be fit. The m-dependence of the frequencies is parameterized by the a-coefficients, which are fit for directly in the peakbagging, with the other mode parameters assumed to be independent of m. Finally, the frequencies and a-coefficients can be inverted to yield the differential rotation or sound speed inside the Sun.

In the following sections I describe this process from a software point of view. If you are running these programs, they can be found in $JSOCROOT/proj/globalhs/scripts/ and $JSOCROOT/proj/globalhs/apps/. They can also be accessed from the web at http://jsoc.stanford.edu/cvs/JSOC/proj/globalhs/.

The Big Script: doglobalhs

The tcsh script that runs all of these steps in sequence is called doglobalhs. This and all globalhs scripts take arguments in the form name=value, with no spaces allowed. The first thing doglobalhs does is call checkglobalhsargs, passing it the same input parameters doglobalhs was given. This script ensures that there are no spurious spaces in the input parameters, that every parameter given is one that is used by doglobalhs, and that parameters with only a small set of possible values have a value that is known to doglobalhs.

The minimum required arguments are

starttime - the beginning of the time period to be analyzed, typically given as an offset from the MDI epoch of 1993.01.01_00:00:00_TAI (e.g. a day number suffixed with 'd'), but date strings are also allowed.
totaltime - the total amount of time to be analyzed, given as a string, such as '72d' or '48h'. This string will be parsed by the program durcon, the documentation for which can be found below.
label - a string that labels a particular way of running the pipeline. It should uniquely map to the set of all input parameters needed by all modules in the pipeline. This is implemented as parameter file templates located in $JSOCROOT/proj/globalhs/scripts/parmtemplates/, and will be described in more detail in the next section.

Depending on what parameter file templates are present and their content, the doglobalhs script will run a series of other scripts: dosht, doshtcheck, doretilen, dodscopyn, doretile1, dodscopy1, domkgaps, dogapfill, dogfgaps, dopow, dopowslice, dopkbgn, and doinvert. Each of these will be discussed in turn below, and each of them requires a parameter file template called {desc}.parms.blank to be present in the directory in which it is run, where {desc} is a descriptive string for each processing step: 'v2t', 'retn', 'ret1', 'ret', 'tsf', 'pow', 'slc', 'pkb', or 'inv'.
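As an illustration of this template-driven dispatch, here is a schematic in Python (the real doglobalhs is tcsh, and the pairing of working directories with {desc} strings shown here is partial and partly guessed):

    import os, shutil

    # Hypothetical pairs of (working subdirectory, {desc} string); the full
    # correspondence between scripts and {desc} strings is given in the text.
    STEPS = [('sht', 'v2t'), ('retilen', 'retn'), ('retile1', 'ret1'),
             ('gapfill', 'tsf'), ('pow', 'pow'), ('pkbgn', 'pkb'), ('invert', 'inv')]

    def dispatch(label, workdir, templatedir):
        for stepdir, desc in STEPS:
            template = os.path.join(templatedir, '%s.parms.%s' % (desc, label))
            if not os.path.isfile(template):
                continue              # no template for this label: skip the step
            rundir = os.path.join(workdir, stepdir)
            os.makedirs(rundir, exist_ok=True)
            shutil.copy(template, os.path.join(rundir, '%s.parms.blank' % desc))
            # ...doglobalhs would now run the corresponding script in rundir...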
The doglobalhs script copies a file called {desc}.parms.label in $JSOCROOT/proj/globalhs/scripts/parmtemplates/ to the file {desc}.parms.blank in the working directory for each of the scripts it calls. If no such file is found for that label, that processing step is skipped. For example, if I run the command

doglobalhs starttime=6400d totaltime=72d label=mdivw72d

then the first thing the doglobalhs script will do, after setting a number of environment and shell variables, is copy the file $JSOCROOT/proj/globalhs/scripts/parmtemplates/v2t.parms.mdivw72d to the file v2t.parms.blank in the working directory for the script dosht. If there is no such file, it moves on to the next step. The value of the parameter label is also given to the environment variable GLOBALHS_LABEL. This variable is used as the suffix for the names of all jobs submitted to the cluster, which allows multiple labels to be run simultaneously.

The doglobalhs script can also take a large number of optional parameters, which are usually left at their default values. These are discussed here in decreasing order of how likely it is that one would want to specify them:

lmin - minimum spherical harmonic degree to compute, default 0
lmax - maximum spherical harmonic degree to compute, default 300
lchunk - number of l's to chunk together for short timeseries, default 40
firststep - string corresponding to the first processing step desired, default 'sht'
laststep - string corresponding to the last processing step desired, default 'pkbgn'

These last two arguments are only likely to be specified to restart the pipeline in case of a failure or to repeat only a subset of the processing, often because data may have aged off disk, or to run different processing steps for different ranges in l. For example, we might want to create power spectra up to l=1000, but only run the peakbagging up to l=300. In this case I might run doglobalhs once with only its required arguments, and then a second time with 'lmin=301 lmax=1000 laststep=pow'. The allowed values for these two arguments are 'sht', 'shtcheck', 'retilen', 'dscopyn', 'retile1', 'dscopy1', 'mkgaps', 'gapfill', 'gfgaps', 'pow', 'slice', 'pkbgn', and 'invert'.

The next set of optional parameters gives further control of how the l's are grouped together (a sketch of this grouping follows the time parameters below). These will be described in more detail in the sections describing the processing steps they apply to. They are

lchunk1 - passed as lchunk to dosht and lchunkin to doretilen, defaults to lchunk
lchunk2 - passed as lchunkout to doretilen and lchunkin to doretile1, defaults to lchunk
lchunk3 - passed as lchunkout to doretile1, defaults to 1
lchunk4 - passed as lchunk to dogapfill and dopow, defaults to 10
lchunk5 - passed as lchunk to dopowslice, defaults to lchunk4

Similarly, there are parameters that control grouping in time:

timechunk - defaults to '1d', which means one day
timechunk1 - passed as timechunk to dosht, defaults to timechunk
timechunk2 - passed as timechunk to doretilen, defaults to '12d'
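The l-grouping amounts to splitting a degree range into chunks, one chunk per cluster job. A hypothetical sketch (the exact boundary handling in the scripts may differ):

    def lchunks(lmin, lmax, lchunk):
        # Split lmin..lmax into chunks of lchunk degrees; the last chunk is
        # truncated at lmax.
        return [(l, min(l + lchunk - 1, lmax))
                for l in range(lmin, lmax + 1, lchunk)]

    print(lchunks(0, 300, 40))   # [(0, 39), (40, 79), ..., (280, 300)]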
There are three additional parameters that control which portions of timeseries are used to make power spectra. These will be described in more detail in the sections describing the scripts dopow and dopowslice:

powtotaltime - passed as totaltimeout to dopow and dopowslice, defaults to totaltime, which is always passed as totaltimein to these scripts.
powstartoffset - passed as startoffset to dopow and dopowslice, given as a string, defaults to '0d'.
slicestep - passed as slicestep to dopowslice, given as a string, defaults to '3d'.

One may also specify a flag to suppress the writing of FFTs for the peakbagging (in case they have already been written):

pkbwriteflag - passed as writeflag to dopkbgn, defaults to 1.

The doglobalhs script can also take a number of arguments that control how jobs are submitted to the computing cluster for each of the processing steps. For each of these arguments specified, it sets a corresponding environment variable that will be read by one of the scripts it calls. If these variables are unset, a default is assigned by the script that attempts to read them. For dosht and doretilen, the number of jobs they submit to the cluster will be equal to totaltime/timechunk. For the remaining scripts, the number of jobs they submit is given by the following parameters:

retnjobs - sets GLOBALHS_RETNJOBS, read by doretile1
tsfnjobs - sets GLOBALHS_TSFNJOBS, read by dogapfill
pownjobs - sets GLOBALHS_POWNJOBS, read by dopow and dopowslice
pkbnjobs - sets GLOBALHS_PKBNJOBS, read by doiter (called by dopkbgn)

Each processing step also has a job threshold. The number of jobs of a particular type must fall below this number before more jobs may be submitted. These parameters are

v2tjobsthreshold - sets GLOBALHS_V2TJOBTHRESHOLD, read by dosht
retjobsthreshold - sets GLOBALHS_RETJOBTHRESHOLD, read by doretilen and doretile1
tsfjobsthreshold - sets GLOBALHS_TSFJOBTHRESHOLD, read by dogapfill
powjobsthreshold - sets GLOBALHS_POWJOBTHRESHOLD, read by dopow and dopowslice
pkbjobsthreshold - sets GLOBALHS_PKBJOBTHRESHOLD, read by doiter (called by dopkbgn)

Finally, the particular queue to which each processing step submits jobs is a string and may be specified as follows. If the environment variable is unset, the string 'j.q' is always used.

v2tqueue - sets GLOBALHS_V2TQUEUE, read by dosht
retqueue - sets GLOBALHS_RETQUEUE, read by doretilen and doretile1
tsfqueue - sets GLOBALHS_TSFQUEUE, read by dogapfill
powqueue - sets GLOBALHS_POWQUEUE, read by dopow and dopowslice
pkbqueue - sets GLOBALHS_PKBQUEUE, read by doiter (called by dopkbgn)

The last parameter one can specify is

topdir - top level directory in which to make working directories, defaults to '/tmp28'.

That is, the last thing doglobalhs does before the first step is to set the variable workdir to /$topdir/$USER/globalhs_work/$label/$starttime'_'$totaltime/$lmin-$lmax, which is used as the parent directory for processing.

Spherical Harmonic Transform: dosht

If doglobalhs finds a parameter file template for the label specified, it creates the directory workdir/sht and copies the file, as described above, into it. It then constructs the input recordset (see the JSOC wiki) from this file and its input parameters, and runs a show_info command to make sure that the entire input recordset is online (read from tape if necessary) before it proceeds to run the script dosht.

The dosht script starts by setting a variable jobthreshold, explained below, to the value of the environment variable GLOBALHS_V2TJOBTHRESHOLD if it is set, or 70 otherwise. If the environment variable GLOBALHS_LABEL is set, it is used as the suffix for job names via the variable suff; otherwise this variable is set to the empty string. The script gives the variable q the value of GLOBALHS_V2TQUEUE if it is set, or 'j.q' otherwise; this is the cluster queue that jobs will be submitted to.
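This env-or-default pattern, rendered in Python for clarity (the real dosht is tcsh), is simply:

    import os

    jobthreshold = int(os.environ.get('GLOBALHS_V2TJOBTHRESHOLD', 70))
    suff = os.environ.get('GLOBALHS_LABEL', '')      # suffix for cluster job names
    q = os.environ.get('GLOBALHS_V2TQUEUE', 'j.q')   # cluster queue to submit to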
If q gets the value 'a.q' in this manner, the script aliases qsub, qstat, and waittosubmit to qsub2, qstat2, and waittosubmit2, respectively. This will almost never be used, because a.q at present is reserved for compute-intensive jobs, and sht jobs typically involve too much I/O for it. Due to an inadequacy of the queueing system, the same commands cannot be used for both queues. For documentation on qsub{2} and qstat{2}, see their man pages. The documentation for waittosubmit{2} can be found below.

The script then checks for its required parameters, starttime and totaltime, which have the meanings given in the previous section. The value of the variable totaltime is parsed by the program durcon, which gives as its output the number of seconds corresponding to the strings given as its arguments. The variable totalsecs is then set to this number. durcon also parses the value of timechunk, which defaults to '1d' if not given, and sets the value of chunksecs to the corresponding number of seconds; hence, chunksecs will be 86400 by default. The script then knows the number of jobs it will submit to the cluster, which will be totalsecs/chunksecs, as computed by bc, and assigns this value to the variable ntimechunks. Hence, for the usual totaltime of '72d', 72 jobs will be submitted. If chunksecs does not evenly divide totalsecs, the number of jobs submitted will be ntimechunks+1. The parameters lmin, lmax, and lchunk have the meanings given above, and default to 0, 300, and 40, respectively.

The script then checks for the existence of the file v2t.parms.blank, and exits with an error message if it is not found. An example of this file for the label mdivw72d is given here:

in=mdi.vw_V[XXXX]
tsout=mdi.vw_V_sht_6h
MAPMMAX=600
SINBDIVS=200
VCORLEV=1
LGSHIFT=1
DATASIGN=-1
MAPLGMIN=-89.0
MAPLGMAX=89.0
MAPBMAX=89.0
MAPRMAX=0.95
APODIZE=1
APINNER=0.83
APWIDTH=0.04
PERR=-0.2
IERR=-0.1
DISTORT=2
LMIN=NNNN
LMAX=MMMM
LCHUNK=CCCC
TCHUNK=6h
TTOTAL=TTTT
TSTART=SSSS
VERSION=version0
##lzeroref=mdi.vw_V_lzero_72d
##sourcetype=mdi
##scratch=mdi.globalhs_scratch

This file is a template for the input parameters to the module jv2ts, which is the executable that performs the spherical harmonic transform. As with any JSOC module, its arguments can be specified like 'jv2ts @parameterfile' (see the JSOC wiki). In the template, any occurrence of a capital letter repeated four times (like XXXX, MMMM, NNNN, CCCC, TTTT, and SSSS above) will be replaced by the strings necessary for each run of the module. These strings are determined by the input parameters to the script. Any line beginning with '#' is considered a comment by a JSOC module and will be ignored. Any line beginning with '##' is understood by one of the globalhs scripts to specify a parameter needed by the script but not by any module.

The parsing of the parameter file template by dosht is significant, because it removes some of the flexibility that a parameter file would otherwise have. In general, JSOC modules accept 'name=value' in their input parameters, or alternatively 'name= value'. The latter form, however, is not accepted in parameter file templates read by the globalhs scripts, even though the modules would technically accept it. In other words, do not use spurious spaces in parameter file templates, because the globalhs scripts replace every space by a newline to make subsequent parsing easier. In particular, the file parms.tmp is created from v2t.parms.blank by removing lines which have '#' as their first nonspace character, then replacing all occurrences of spaces with newlines, and finally removing lines that contain nothing but spaces. The result is that there is a line in parms.tmp for every parameter specified in v2t.parms.blank, and each parameter is at the beginning of its line.
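Schematically, the normalization of the template and the filling of the four-letter placeholders look like this in Python (the real scripts do this with tcsh and sed; this rendering is only illustrative):

    def normalize(template_text):
        # Drop lines whose first nonspace character is '#', turn every space
        # into a newline, and drop empty lines, so that each parameter starts
        # its own line.
        tokens = []
        for line in template_text.splitlines():
            if line.lstrip().startswith('#'):
                continue
            tokens.extend(t for t in line.split(' ') if t)
        return '\n'.join(tokens)

    def fill(parms, substitutions):
        # substitutions maps each placeholder to its value, e.g.
        # {'NNNN': str(lmin), 'MMMM': str(lmax), 'CCCC': str(lchunk), ...}
        for placeholder, value in substitutions.items():
            parms = parms.replace(placeholder, value)
        return parms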
The next part of dosht queries DRMS so that starttime may be interpreted correctly whether it is given as a date string or a day number. To best facilitate understanding of this, I repeat some of the DRMS documentation here. When a dataseries has a primekey that may take any floating point value, it is highly advisable for this primekey to be slotted. This term refers to a mapping between the floating point value and an integer value (the slot number or index value). This is necessary because floating point values will in general not compare exactly with their printed representations. An example is a primekey of type time, which is equivalent to double. Internally, times are stored as the double precision number of seconds that have elapsed since 1977.01.01_00:00:00_TAI. When the primekey is slotted, however, two additional constant keywords in the dataseries are needed to convert the floating point value to the index value: the epoch (the time of the zeroth slot) and the step (the slot width). For slotted primekeys of type time, DRMS provides the option to specify the beginning of an interval (or a single value) as an offset from the epoch. This is by far the preferred method when using the global pipeline, although date strings are accepted for those who prefer to make things difficult for themselves.

Therefore dosht reads the epoch and step keywords (T_REC_epoch and T_REC_step) from the input dataseries given in v2t.parms.blank in order to compute the index value corresponding to starttime. If the program durcon is able to parse starttime, it is assumed to be an offset from the epoch; this is the normal usage. If durcon cannot parse starttime, it is assumed to be a date string. In either case the input recordset for each time chunk is constructed using index values. (For example, if the input series has a step of one minute, starttime=6400d corresponds to index value 6400*1440 = 9216000.)

dosht then loops over the time chunks, creating one parameter file and one cluster script for each. The parameter file is made from parms.tmp by replacing MMMM with lmax, NNNN with lmin, CCCC with lchunk, and TTTT with timechunk. For each time chunk, XXXX is replaced with the index interval of the corresponding input records, and SSSS is replaced with the date string corresponding to the beginning of the interval. The cluster scripts are named subt.$firstindex.$index.$lmin-$lmax$suff, the essential parts of which are

jv2ts @v2t.parms.$index >& v2t.log.$index
echo $status >& jv2ts.exitstatus.$index

where $index is the index value of the first input record for that time chunk. If timechunk does not evenly divide totaltime, one additional parameter file and cluster script are written to cover the remainder of the time.

dosht then calls waittosubmit, which calls qstat in a loop and returns once the total number of jobs of the specified type (in this case, jobs containing the string 'subt') falls below the number given by jobthreshold (see above).
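Schematically, this polling loop looks like the following (the real waittosubmit is a tcsh script around qstat; this Python rendering is only illustrative):

    import subprocess, time

    def waittosubmit(jobstring, jobthreshold, poll=60):
        # Wait until the number of queued or running jobs whose qstat listing
        # contains jobstring (e.g. 'subt') falls below jobthreshold.
        while True:
            out = subprocess.run(['qstat'], capture_output=True, text=True).stdout
            njobs = sum(jobstring in line for line in out.splitlines())
            if njobs < jobthreshold:
                return
            time.sleep(poll)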
Once waittosubmit returns, all of the cluster scripts are submitted. dosht itself then enters a loop in which it calls qstat, waiting for the jobs it just submitted to finish. Once the jobs have finished, dosht confirms that a log file exists for every parameter file; that is, it makes sure the cluster scripts were actually executed. It then checks that all of the jv2ts.exitstatus.* files contain only 0, and that all of the log files contain the string "successful completion", which all the globalhs modules print if they run successfully. If any failures are detected, they are recorded in the file faillog, and the failed jobs are resubmitted. After they finish, the check is repeated. If some jobs have still failed, the script aborts with exit status 1. Otherwise it echoes the string "successful completion" itself and exits with status 0.

Better Check: doshtcheck

Retiling Several Time Periods: doretilen

Optional Dataset Copy: dodscopyn

Retiling a Single Time Period: doretile1

Optional Dataset Copy: dodscopy1

Determining the Window: domkgaps

Detrending and Gapfilling: dogapfill

Determining the Gapfilled Window: dogfgaps

Generating Power Spectra: dopow

Generating Spectra of Slices of the Timeseries: dopowslice

Peakbagging: dopkbgn

Two Dimensional RLS Inversions: doinvert

Ancillary Programs: durcon and waittosubmit

Scripts Not Called by doglobalhs: dorepeatpow, dorebinsmooth, doreweed

Dataseries and Keywords

Almost all of the dataseries used by the globalhs pipeline have T_START as their first primekey, which is of type time (equivalent to double). Internally, T_START is stored as the double precision number of seconds that have elapsed since 1977.01.01_00:00:00_TAI. This primekey is slotted with a slot type of TS_SLOT, which requires several other constant keywords in the dataseries definition: T_START_epoch, T_START_step, and T_START_round. The true primekey is then the integer T_START_index, given by

T_START_index = (T_START - T_START_epoch + T_START_round/2) / T_START_step,

where T_START_epoch = 1993.01.01_00:00:00_TAI (the MDI epoch). T_START_round is set to the cadence of the input images, but for the purposes of understanding it can be taken as zero. T_START_step is one day for most dataseries, so T_START_index is usually equal to the MDI day number.

Glossary of Acronyms

MDI - Michelson Doppler Imager
HMI - Helioseismic and Magnetic Imager
AIA - Atmospheric Imaging Assembly
GONG - Global Oscillation Network Group
MWO - Mount Wilson Observatory
JSOC - Joint Science Operations Center
DRMS - Data Record Management System
SUMS - Storage Unit Management System
DSDS - Data Storage and Distribution System