
		/home/production/cvs/JSOC/doc/whattodolev0.txt  25Nov2008


	------------------------------------------------------
	Running Datacapture & Pipeline Backend lev0 Processing
	------------------------------------------------------


NOTE: For now, this is all done from the xim workstation (Jim's office).

Datacapture:
--------------------------

NOTE: IMPORTANT: Please keep in mind that each datacapture machine has its
own independent /home/production.

FORMERLY: 1. The datacapture systems for AIA/HMI are by convention dcs0/dcs1
respectively. If the spare dcs2 is to be put in place, it is renamed dcs0
or dcs1, and the original machine is renamed dcs2.

1. The datacapture machine serving AIA or HMI is determined by
the entries in:

/home/production/cvs/JSOC/proj/datacapture/scripts/dcstab.txt

This file is listed or edited by the program:

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -h
Display or change the datacapture system assignment file.
Usage: dcstab [-h][-l][-e]
       -h = print this help message
       -l = list the current file contents
       -e = edit with vi the current file contents

For dcs3 the dcstab.txt would look like:
AIA=dcs3
HMI=dcs3
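
For example, to list the current assignments (the output below is
illustrative; on dcs0/dcs1 the entries would point at those hosts):

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -l
AIA=dcs0
HMI=dcs1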

1a. The spare dcs2 normally serves as a backup destination for the postgres
running on dcs0 and dcs1. You should see this postgres cron job on dcs0
and dcs1, respectively:

0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs0_to_dcs2.pl
0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs2.pl

For this to work, the following must be done on dcs0, dcs1 and dcs2, as user
postgres, after any reboot:

> ssh-agent | head -2 > /var/lib/pgsql/ssh-agent.env
> chmod 600 /var/lib/pgsql/ssh-agent.env
> source /var/lib/pgsql/ssh-agent.env
> ssh-add
(The passphrase is written on my whiteboard; same as production's.)
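
To confirm the agent now holds the key, ssh-add -l lists the loaded
identities (the fingerprint and key path printed will vary):

> ssh-add -l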

2. Log in as user production via j0. (The password is on Jim's whiteboard.)

3. Postgres must be running; it is started automatically at boot:

> ps -ef | grep pg
postgres  4631     1  0 Mar11 ?        00:06:21 /usr/bin/postmaster -D /var/lib/pgsql/data
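
If in doubt, you can also ask Postgres directly (a sketch; run as user
postgres, and the pg_ctl path may differ, but the data directory is the
one shown on the postmaster line above):

> /usr/bin/pg_ctl status -D /var/lib/pgsql/data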

4. The root of the datacapture tree is /home/production/cvs/JSOC.
The production user runs as user id 388.

5. The sum_svc is normally running:

> ps -ef | grep sum_svc
388      26958     1  0 Jun09 pts/0    00:00:54 sum_svc jsocdc

Note the SUMS database is jsocdc. This is a separate DB on each dcs.

6. To start/restart the sum_svc and related programs (e.g. tape_svc) do:

> sum_start_dc
sum_start at 2008.06.16_13:32:23
** NOTE: "soc_pipe_scp jsocdc" still running
Do you want me to do a sum_stop followed by a sum_start for you (y or n):

You would normally answer 'y' here.

7. To run the datacapture GUI that will display the data, mark it for archive,
optionally extract lev0, and send it on to the pipeline backend, do this:

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> ./socdc

All you normally need to do is hit "Start Instances for HMI" (or AIA),
depending on which datacapture machine you are on.

8. To optionally extract lev0 do this:

> touch /usr/local/logs/soc/LEV0FILEON

To stop lev0:

> /bin/rm /usr/local/logs/soc/LEV0FILEON

The last 100 images for each VC are kept in /tmp/jim.
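
To check the current state, just test for the flag file shown above:

> test -e /usr/local/logs/soc/LEV0FILEON && echo "lev0 ON" || echo "lev0 OFF"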

NOTE: If you turn lev0 on, processing becomes sensitive to the incoming data,
and you may see failures like this, in which case you have to restart socdc:

ingest_tlm: /home/production/cvs/EGSE/src/libhmicomp.d/decompress.c:1385: decompress_undotransform: Assertion `N>=(6) && N<=(16)' failed.
kill: no process ID specified

9. The datacapture machines automatically copy DDS input data to the
pipeline backend on /dds/socdc, which lives on d01. This is done by the program:

> ps -ef | grep soc_pipe_scp
388      21529 21479  0 Jun09 pts/0    00:00:13 soc_pipe_scp /dds/soc2pipe/hmi /dds/socdc/hmi d01i 30

This requires that an ssh-agent be running. If you reboot a dcs machine do:

> ssh-agent | head -2 > /tmp/ssh-agent.env
> chmod 600 /tmp/ssh-agent.env
> source /tmp/ssh-agent.env
> ssh-add		(or for sonar: ssh-add /home/production/.ssh/id_rsa)
(The passphrase is written on my whiteboard.)

NOTE: cron jobs use this /tmp/ssh-agent.env file.

If you want another window to use the ssh-agent that is already running do:
> source /tmp/ssh-agent.env

NOTE: On any one machine there should be just one ssh-agent running for
user production.


If you see that a dcs has asked for a password, the ssh-agent has failed.
You can probably find an error message on d01 like 'invalid user production'.
You should exit the socdc, make sure there is no soc_pipe_scp still running,
and then restart the socdc.
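
To check for a leftover copy before restarting (pgrep -fl matches against
the full command line; it should print nothing):

> pgrep -fl soc_pipe_scp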

If you find a host that production must reach whose
/home/production/.ssh/authorized_keys file does not yet contain our key,
then do this on the host that you want to have access to:

Pick up the entry in /home/production/.ssh/id_rsa.pub
and put it in this file on that host
(make sure that it's all one line):

/home/production/.ssh/authorized_keys

NOTE: DO NOT do a ssh-keygen or you will have to update all the hosts'
authorized_keys files with the new public key you just generated.
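
A one-line sketch of the copy step above (the target hostname is a
placeholder):

> cat /home/production/.ssh/id_rsa.pub | ssh production@somehost 'cat >> /home/production/.ssh/authorized_keys'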

If not already active, then do what's shown above for the ssh-agent.

10. There should be a cron job running that will archive to the T50 tapes.
Note the names are asymmetric for dcs0 and dcs1:

30 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do

00 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do_dcs1

In the beginning of the world, before any sum_start_dc, the T50 should have
a supply of blank tapes in its active slots (1-24). A cleaning tape must
be in slot 25. The imp/exp slots (26-30) must be vacant.
To see the contents of the T50 before startup do:

> mtx -f /dev/t50 status
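
The output looks roughly like this (drive/slot counts and volume tags here
are illustrative only, not from a real inventory):

  Storage Changer /dev/t50:1 Drives, 30 Slots ( 5 Import/Export )
Data Transfer Element 0:Empty
      Storage Element 1:Full :VolumeTag=000684L4
      ...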

Whenever sum_start_dc is called, all the tapes are inventoried and added
to the SUMS database if necessary.
When a tape is written full by the tapearc_do cron job, the 'Imp/Exp'
button on the t50view display (see 11. and 12. below) will increment its
count. Tapes should be exported before the count gets above 5.

11. The t50view program should be running to display/control the
tape operations:

> t50view -i jsocdc

The -i means interactive mode, which will allow you to change tapes.

12. Every 2 days, inspect the t50view display for the button on the top row
called 'Imp/Exp'. If it is non-zero (and yellow), then some full tapes can be
exported from the T50 and new tapes put in for further archiving.

Hit the 'Imp/Exp' button.
Follow all the directions exactly.
The blank L4 tapes are in the tape room in the computer room.

When the tape drive needs cleaning, hit the "Start Cleaning" button on
the t50view gui.

13. There should be a cron job running as user production on both dcs0 and
dcs1 that will set the Offsite_Ack field in the sum_main DB table:

20 0 * * * /home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl

Where (from the script's header comments):
#/home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl
#
#This reads the .ver files produced by Tim's
#/home/production/tape_verify/scripts/run_remote_tape_verify.pl
#A .ver file looks like:
## Offsite verify offhost:dds/off2ds/HMI_2008.06.11_01:12:27.ver
## Tape   0=success 0=dcs0(aia)
#000684L4 0         1
#000701L4 0         1
##END
#For each tape that has been verified successfully, this program
#sets the Offsite_Ack to 'Y' in the sum_main for all entries
#with Arch_Tape = the given tape id.
#
#The machine names where AIA and HMI processing live
#are found in dcstab.txt, which must be on either dcs0 or dcs1.
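
In effect, for each successfully verified tape the script does an update
like this (a sketch of the SQL described above; the tape id is illustrative
and the column names follow the fields named in the comments):

jsocdc=> update sum_main set offsite_ack='Y' where arch_tape='000684L4';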

14. Other background info is in:

http://hmi.stanford.edu/development/JSOC_Documents/Data_Capture_Documents/DataCapture.html

***************************dcs3*********************************************
NOTE: dcs3 (i.e. the offsite datacapture machine shipped to Goddard Nov 2008)

At Goddard the dcs3 host name will be changed. See the following for
how to accommodate this:

/home/production/cvs/JSOC/doc/dcs3_name_change.txt

This cron job must be run to clean out /dds/soc2pipe/[aia,hmi]:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/production/cvs/JSOC/proj/datacapture/scripts/rm_soc2pipe.pl

Also on dcs3, the offsite_ack check and safe tape check are not done in:
/home/production/cvs/JSOC/base/sums/libs/pg/SUMLIB_RmDo.pgc

Also on dcs3, because there is no pipeline backend, no .arc file is
ever made for the DDS.
***************************dcs3*********************************************

Level 0 Backend:
--------------------------

1. As mentioned above, the datacapture machines automatically copy DDS input
data to the pipeline backend on /dds/socdc living on d01.

2. The lev0 code runs as ingest_lev0 on the cluster machine cl1n001,
which has d01:/dds mounted. cl1n001 can be accessed through j1.

3. All 4 instances of ingest_lev0 for the 4 VCs are controlled by
/home/production/cvs/JSOC/proj/lev0/apps/doingestlev0.pl

If you want to start afresh, kill any ingest_lev0 that is running (this
will later be automated; see the ps check below). Then do:

> cd /home/production/cvs/JSOC/proj/lev0/apps
> start_lev0.pl

You will see 4 instances started, and the log file names will be shown.
You will be advised that to cleanly stop the lev0 processing, run:

> stop_lev0.pl

It may take a while for all the ingest_lev0 processes to get to a point
where they can stop cleanly.

For now, every hour, the ingest_lev0 processes are automatically restarted.
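
To see which ingest_lev0 instances are currently running (the same ps
idiom used elsewhere in this document):

> ps -ef | grep ingest_lev0 | grep -v grep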


4. The output is for the series:

hmi.tlmd
hmi.lev0d
aia.tlmd
aia.lev0d

All of it is saved in DRMS, but only the tlmd series are archived. (See
below if you want to change the archiving status of a dataseries.)

5. If something in the backend goes down such that you can't run
ingest_lev0, then you may want to start this cron job, which will
periodically clean out the /dds/socdc dir of the files that are
coming in from the datacapture systems:

> crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXVnxDO9 installed on Mon Jun 16 16:38:46 2008)
# (Cron version V5.0 -- $Id: whattodolev0.txt,v 1.6 2008/11/07 16:54:37 production Exp $)
#0,20,40 * * * * /home/jim/cvs/jsoc/scripts/pipefe_rm

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Starting and stopping SUMS on d02:

Log in as production on d02 and run:
sum_start_d02

(If SUMS is already running, it will ask you if you want to halt it.
You normally say 'y'.)

Run:
sum_stop_d02
if you just want to stop SUMS.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

SUMS archiving:

Currently SUMS is archiving continuously. The script is:

/home/production/cvs/JSOC/base/sums/scripts/tape_do.pl

To halt it do:

touch /usr/local/logs/tapearc/TAPEARC_ABORT

Try to keep it running, as there is still much to be archived.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Change archiving status of a dataseries:

> psql -h hmidb jsoc

jsoc=> update hmi.drms_series set archive=0 where seriesname='hmi.lev0c';
UPDATE 1
jsoc=> \q
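
To check a series' current setting first, query the same table:

jsoc=> select seriesname, archive from hmi.drms_series where seriesname='hmi.lev0c';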

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The modified dcs reboot procedure is in ~kehcheng/dcs.reboot.notes.
