
		/home/production/cvs/JSOC/doc/whattodolev0.txt  25Nov2008

------------------------------------------------
WARNING!! Some of this is outdated. 3Jun2010
Please see more recent what*.txt files, e.g.
whattodo_start_stop_lev1_0_sums.txt
------------------------------------------------

	------------------------------------------------------
	Running Datacapture & Pipeline Backend lev0 Processing
	------------------------------------------------------


NOTE: For now, this is all done from the xim workstation (Jim's office).

Datacapture:
--------------------------

NOTE: IMPORTANT: Please keep in mind that each datacapture machine has its
own independent /home/production.

FORMERLY: 1. The Datacapture system for aia/hmi is by convention dcs0/dcs1
respectively. If the spare dcs2 is to be put in place, it is renamed dcs0
or dcs1, and the original machine is renamed dcs2.

1. The datacapture machine serving for AIA or HMI is determined by
the entries in:

/home/production/cvs/JSOC/proj/datacapture/scripts/dcstab.txt

This is edited or listed by the program:

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -h
Display or change the datacapture system assignment file.
Usage: dcstab [-h][-l][-e]
       -h = print this help message
       -l = list the current file contents
       -e = edit with vi the current file contents

For dcs3 the dcstab.txt would look like:
AIA=dcs3
HMI=dcs3
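To simply list the current assignments without editing, dcstab.pl -l prints
the file contents. (The dcs0/dcs1 values below are only illustrative, following
the old aia/hmi convention mentioned above; the output on a given machine
shows whatever is actually in dcstab.txt.)

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -l
AIA=dcs0
HMI=dcs1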

1a. The spare dcs2 normally serves as a backup destination for the postgres
instances running on dcs0 and dcs1. You should see the following postgres
cron job on dcs0 and dcs1, respectively:

0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs0_to_dcs2.pl
0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs2.pl

For this to work, the following must be done on dcs0, dcs1 and dcs2, as user
postgres, after any reboot:

> ssh-agent | head -2 > /var/lib/pgsql/ssh-agent.env
> chmod 600 /var/lib/pgsql/ssh-agent.env
> source /var/lib/pgsql/ssh-agent.env
> ssh-add
(The password is the same as production's.)
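To confirm that the agent now holds the key, a quick standard OpenSSH check
(not specific to this setup) is:

> ssh-add -l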

2. Log in as user production via j0 (the password is on Jim's whiteboard).

3. Postgres must be running; it is started automatically on boot:

#######OLD#########################
#> ps -ef |grep pg
#postgres  4631     1  0 Mar11 ?        00:06:21 /usr/bin/postmaster -D /var/lib/pgsql/data
###################################

dcs0:/home/production> px postgres
postgres  6545     1  0 May04 ?        00:09:50 /usr/local/pgsql-8.4/bin/postgres -D /var/lib/pgsql/dcs0_data
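If postgres is not running, it can normally be started by hand as user
postgres with pg_ctl. This is only a sketch: the data directory matches the
machine (e.g. dcs0_data on dcs0, dcs1_data on dcs1), and the log file path
here is an assumption.

> su - postgres
> /usr/local/pgsql-8.4/bin/pg_ctl -D /var/lib/pgsql/dcs0_data -l /var/lib/pgsql/pg_ctl.log start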

4. The root of the datacapture tree is /home/production/cvs/JSOC.
Production runs as user id 388.

5. The sum_svc is normally running:

> ps -ef |grep sum_svc
388      26958     1  0 Jun09 pts/0    00:00:54 sum_svc jsocdc

Note the SUMS database is jsocdc. This is a separate DB on each dcs.

6. To start/restart the sum_svc and related programs (e.g. tape_svc) do:

> sum_start_dc
sum_start at 2008.06.16_13:32:23
** NOTE: "soc_pipe_scp jsocdc" still running
Do you want me to do a sum_stop followed by a sum_start for you (y or n):

You would normally answer 'y' here.

7. To run the datacapture gui that will display the data, mark it for archive,
optionally extract lev0, and send it on to the pipeline backend, do this:

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> ./socdc

Normally all you do is hit "Start Instances for HMI" or "Start Instances for
AIA", depending on which datacapture machine you are on.

8. To optionally extract lev0 do this:

> touch /usr/local/logs/soc/LEV0FILEON

To stop lev0:

> /bin/rm /usr/local/logs/soc/LEV0FILEON

The last 100 images for each VC are kept in /tmp/jim.
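Since the presence of this flag file is what turns lev0 extraction on, a quick
way to check the current state is simply:

> ls /usr/local/logs/soc/LEV0FILEON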

NOTE: If you turn lev0 on, the processing becomes sensitive to the incoming
data, and you may see errors like the following, in which case you have to
restart socdc:

ingest_tlm: /home/production/cvs/EGSE/src/libhmicomp.d/decompress.c:1385: decompress_undotransform: Assertion `N>=(6) && N<=(16)' failed.
kill: no process ID specified

9. The datacapture machines automatically copy DDS input data to the
pipeline backend on /dds/socdc, which lives on d01. This is done by the
program:

>  ps -ef |grep soc_pipe_scp
388      21529 21479  0 Jun09 pts/0    00:00:13 soc_pipe_scp /dds/soc2pipe/hmi /dds/socdc/hmi d01i 30

This requires that an ssh-agent be running. If you reboot a dcs machine do:

> ssh-agent | head -2 > /var/tmp/ssh-agent.env
> chmod 600 /var/tmp/ssh-agent.env
> source /var/tmp/ssh-agent.env
> ssh-add	(or for sonar: ssh-add /home/production/.ssh/id_rsa)
(The password is written on my whiteboard.)

NOTE: on some machines you may have to put the user name in
/etc/ssh/allowed_users

NOTE: cron jobs use this /var/tmp/ssh-agent.env file.

If you want another window to use the ssh-agent that is already running do:
> source /var/tmp/ssh-agent.env

NOTE: on any one machine there should be just one ssh-agent running for user
production.


If you see that a dcs has asked for a password, the ssh-agent has failed.
You can probably find an error msg on d01 like 'invalid user production'.
You should exit the socdc. Make sure there is no soc_pipe_scp still running.
Restart the socdc.

If you find that production's key is not in the
/home/production/.ssh/authorized_keys file on a host that you need to reach,
then do this for the host you want to add:

Pick up the entry in /home/production/.ssh/id_rsa.pub
and put it in this file on the host that you want to have access to
(make sure that it's all one line):

/home/production/.ssh/authorized_keys
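One way to do the append in a single step (a sketch only; the target host
name is just a placeholder):

> cat /home/production/.ssh/id_rsa.pub | ssh production@<newhost> 'cat >> /home/production/.ssh/authorized_keys'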

NOTE: DO NOT do an ssh-keygen, or you will have to update all of the hosts'
authorized_keys files with the new public key you just generated.

If the ssh-agent is not already active, then do what's shown above for the
ssh-agent.


10. There should be a cron job running that will archive to the T50 tapes.
Note that the script names differ between dcs0 and dcs1:

30 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do

00 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do_dcs1

In the beginning of the world, before any sum_start_dc, the T50 should have
a supply of blank tapes in its active slots (1-24). A cleaning tape must
be in slot 25. The imp/exp slots (26-30) must be vacant.
To see the contents of the T50 before startup do:

> mtx -f /dev/t50 status

Whenever sum_start_dc is called, all the tapes are inventoried and added
to the SUMS database if necessary.
When a tape is written full by the tapearc_do cron job, the 'Imp/Exp' button
on the t50view display (see 11. and 12. below) will increment its count.
Tapes should be exported before the count gets above 5.
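To confirm the archive cron job is actually installed for production on the
machine you are on, a simple check is:

> crontab -l | grep tapearc_do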

11. The t50view program should be running to display/control the tape
operations:

> t50view -i jsocdc

The -i means interactive mode, which will allow you to change tapes.

12. Every 2 days, inspect the t50 display for the button on the top row
called 'Imp/Exp'. If it is non-zero (and yellow), then some full tapes can be
exported from the T50 and new tapes put in for further archiving.

Hit the 'Imp/Exp' button.
Follow all the directions exactly.
The blank L4 tapes are in the tape room in the computer room.

When the tape drive needs cleaning, hit the "Start Cleaning" button on
the t50view gui.

13. There should be a cron job running as user production on both dcs0 and
dcs1 that will set the Offsite_Ack field in the sum_main DB table:
20 0 * * * /home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl

Where:
#/home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl
#
#This reads the .ver files produced by Tim's
#/home/production/tape_verify/scripts/run_remote_tape_verify.pl
#A .ver file looks like:
## Offsite verify offhost:dds/off2ds/HMI_2008.06.11_01:12:27.ver
## Tape   0=success 0=dcs0(aia)
#000684L4 0         1
#000701L4 0         1
##END
#For each tape that has been verified successfully, this program
#sets the Offsite_Ack to 'Y' in the sum_main for all entries
#with Arch_Tape = the given tape id.
#
#The machine names where AIA and HMI processing live
#is found in dcstab.txt which must be on either dcs0 or dcs1
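To spot-check the result for a particular tape, a query along these lines
against the jsocdc database should work (a sketch; the lowercase column names
and the psql invocation are assumptions, and the tape id is just the one from
the .ver example above):

> psql jsocdc -c "select arch_tape, offsite_ack from sum_main where arch_tape='000684L4' limit 5;"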

14. Other background info is in:

http://hmi.stanford.edu/development/JSOC_Documents/Data_Capture_Documents/DataCapture.html

***************************dcs3*********************************************
NOTE: dcs3 (i.e. the offsite datacapture machine shipped to Goddard Nov 2008)

At Goddard the dcs3 host name will be changed. See the following for
how to accommodate this:

/home/production/cvs/JSOC/doc/dcs3_name_change.txt

This cron job must be run to clean out the /dds/soc2pipe/[aia,hmi] directories:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/production/cvs/JSOC/proj/datacapture/scripts/rm_soc2pipe.pl

Also on dcs3, the offsite_ack check and safe tape check are not done in:
/home/production/cvs/JSOC/base/sums/libs/pg/SUMLIB_RmDo.pgc

Also on dcs3, because there is no pipeline backend, no .arc file is
ever made for the DDS.
***************************dcs3*********************************************

Level 0 Backend:
--------------------------

!!Make sure to run Phil's watchlev0 script in the background on cl1n001:
/home/production/cvs/JSOC/base/sums/scripts/get_dcs_times.csh

1. As mentioned above, the datacapture machines automatically copy DDS input
data to the pipeline backend on /dds/socdc, which lives on d01.

2. The lev0 code runs as ingest_lev0 on the cluster machine cl1n001,
which has d01:/dds mounted. cl1n001 can be accessed through j1.

3. All 4 instances of ingest_lev0 for the 4 VCs are controlled by
/home/production/cvs/JSOC/proj/lev0/apps/doingestlev0.pl

If you want to start afresh, kill any running ingest_lev0 (this will later be
automated). Then do:

> cd /home/production/cvs/JSOC/proj/lev0/apps
> doingestlev0.pl     (actually a link to start_lev0.pl)

You will see 4 instances started, and the log file names will be shown.
You will be advised that to cleanly stop the lev0 processing, run:

> stop_lev0.pl

It may take a while for all the ingest_lev0 processes to get to a point
where they can stop cleanly.

For now, every hour, the ingest_lev0 processes are automatically restarted.
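To check that all 4 instances (one per VC) are actually up, the usual ps
pattern works:

> ps -ef | grep ingest_lev0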


4. The output goes to the series:

hmi.tlmd
hmi.lev0d
aia.tlmd
aia.lev0d

All of it is saved in DRMS, but only the tlmd series are archived. (See below
if you want to change the archiving status of a dataseries.)
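To see the current archive setting for the hmi series, you can query the same
table that the "Change archiving status" section below updates (the query
itself is just a sketch):

> psql -h hmidb jsoc -c "select seriesname, archive from hmi.drms_series where seriesname in ('hmi.tlmd','hmi.lev0d');"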

5. If something in the backend goes down such that you can't run
ingest_lev0, then you may want to start this cron job that will
periodically clean out the /dds/socdc dir of the files that are
coming in from the datacapture systems.

> crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXVnxDO9 installed on Mon Jun 16 16:38:46 2008)
# (Cron version V5.0 -- $Id: whattodolev0.txt,v 1.9 2010/12/17 18:34:28 production Exp $)
#0,20,40 * * * * /home/jim/cvs/jsoc/scripts/pipefe_rm

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Starting and stopping SUMS on d02:

Log in as production on d02
sum_start_d02

(If SUMS is already running, it will ask you if you want to halt it;
you normally say 'y'.)

sum_stop_d02
if you just want to stop SUMS.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

SUMS archiving:

Currently SUMS is archiving continuously. The script is:

/home/production/cvs/JSOC/base/sums/scripts/tape_do_0.pl  (and _1, _2, _3)

To halt it do:

touch /usr/local/logs/tapearc/TAPEARC_ABORT[0,1,2]

Try to keep it running, as there is still much to be archived.
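To resume archiving later, remove the corresponding abort flag file(s); how
tape_do_*.pl is normally relaunched is not covered here, so treat this as a
sketch only:

> /bin/rm /usr/local/logs/tapearc/TAPEARC_ABORT0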

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Change archiving status of a dataseries:

> psql -h hmidb jsoc

jsoc=> update hmi.drms_series set archive=0 where seriesname='hmi.lev0c';
UPDATE 1
jsoc=> \q
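To confirm the change took effect, the same table and column can be queried,
e.g. in a single shot:

> psql -h hmidb jsoc -c "select seriesname, archive from hmi.drms_series where seriesname='hmi.lev0c';"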

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The modified dcs reboot procedure is in ~kehcheng/dcs.reboot.notes.
