|Return to whattodolev0.txt CVS log||Up to [Development] / JSOC / doc|
File: [Development] / JSOC / doc / whattodolev0.txt
Revision: 1.9, Fri Dec 17 18:34:28 2010 UTC (12 years, 5 months ago) by production
CVS Tags: Ver_6-3, Ver_6-2, Ver_6-1, Ver_6-0, Ver_5-14, Ver_5-13, NetDRMS_Ver_9-9, NetDRMS_Ver_6-3, NetDRMS_Ver_6-2, NetDRMS_Ver_6-1, NetDRMS_Ver_6-0, NetDRMS_Ver_2-7
Changes since 1.8: +12 -4 lines
/home/production/cvs/JSOC/doc/whattodolev0.txt 25Nov2008 ------------------------------------------------ WARNING!! Some of this is outdated. 3Jun2010 Please see more recent what*.txt files, e.g. whattodo_start_stop_lev1_0_sums.txt ------------------------------------------------ ------------------------------------------------------ Running Datacapture & Pipeline Backend lev0 Processing ------------------------------------------------------ NOTE: For now, this is all done from the xim w/s (Jim's office) Datacapture: -------------------------- NOTE:IMPORTANT: Please keep in mind that each data capture machine has its own independent /home/production. FORMERLY: 1. The Datacapture system for aia/hmi is by convention dcs0/dcs1 respectively. If the spare dcs2 is to be put in place, it is renamed dcs0 or dcs1, and the original machine is renamed dcs2. 1. The datacapture machine serving for AIA or HMI is determined by the entries in: /home/production/cvs/JSOC/proj/datacapture/scripts/dsctab.txt This is edited or listed by the program: /home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -h Display or change the datacapture system assignment file. Usage: dcstab [-h][-l][-e] -h = print this help message -l = list the current file contents -e = edit with vi the current file contents For dcs3 the dcstab.txt would look like: AIA=dcs3 HMI=dcs3 1a. The spare dcs2 normally servers as a backup destination of the postgres running on dcs0 and dcs1. You should see this postgres cron job on dcs0 and dcs1, respectively: 0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs0_to_dcs2.pl 0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs2.pl For this to work, this must be done on dcs0, dcs1 and dcs2, as user postgres, after any reboot: > ssh-agent | head -2 > /var/lib/pgsql/ssh-agent.env > chmod 600 /var/lib/pgsql/ssh-agent.env > source /var/lib/pgsql/ssh-agent.env > ssh-add (The password is written on my whiteboard (same as production's)) 2. Login as user production via j0. (password is on Jim's whiteboard). 3. The Postgres must be running and is started automatically on boot: > ps -ef |grep pg postgres 4631 1 0 Mar11 ? 00:06:21 /usr/bin/postmaster -D /var/lib/pgsql/data 4. The root of the datacapture tree is /home/production/cvs/JSOC. The producton runs as user id 388. 5. The sum_svc is normally running: > ps -ef |grep sum_svc 388 26958 1 0 Jun09 pts/0 00:00:54 sum_svc jsocdc Note the SUMS database is jsocdc. This is a separate DB on each dcs. 6. To start/restart the sum_svc and related programs (e.g. tape_svc) do: > sum_start_dc sum_start at 2008.06.16_13:32:23 ** NOTE: "soc_pipe_scp jsocdc" still running Do you want me to do a sum_stop followed by a sum_start for you (y or n): You would normally answer 'y' here. 7. To run the datacapture gui that will display the data, mark it for archive, optionally extract lev0 and send it on the the pipeline backend, do this: > cd /home/production/cvs/JSOC/proj/datacapture/scripts> > ./socdc All you would normally do is hit "Start Instances for HMI" or AIA for what datacapture machine you are on. 8. To optionally extract lev0 do this: > touch /usr/local/logs/soc/LEV0FILEON To stop lev0: > /bin/rm /usr/local/logs/soc/LEV0FILEON The last 100 images for each VC are kept in /tmp/jim. NOTE: If you turn lev0 on, you are going to be data sensitive and you may see things like this, in which case you have to restart socdc: ingest_tlm: /home/production/cvs/EGSE/src/libhmicomp.d/decompress.c:1385: decompress_undotransform: Assertion `N>=(6) && N<=(16)' failed. kill: no process ID specified 9. The datacapture machines automatically copies DDS input data to the pipeline backend on /dds/socdc living on d01. This is done by the program: > ps -ef |grep soc_pipe_scp 388 21529 21479 0 Jun09 pts/0 00:00:13 soc_pipe_scp /dds/soc2pipe/hmi /dds/socdc/hmi d01i 30 This requires that an ssh-agent be running. If you reboot a dcs machine do: > ssh-agent | head -2 > /var/tmp/ssh-agent.env > chmod 600 /var/tmp/ssh-agent.env > source /var/tmp/ssh-agent.env > ssh-add (or for sonar: ssh-add /home/production/.ssh/id_rsa) (The password is written on my whiteboard) NOTE: cron jobs use this /var/tmp/ssh-agent.env file If you want another window to use the ssh-agent that is already running do: > source /var/tmp/ssh-agent.env NOTE: on any one machine for user production there s/b just one ssh-agent running. If you see that a dcs has asked for a password, the ssh-agent has failed. You can probably find an error msg on d01 like 'invalid user production'. You should exit the socdc. Make sure there is no soc_pipe_scp still running. Restart the socdc. If you find that there is a hostname for production that is not in the /home/production/.ssh/authorized_keys file then do this on the host that you want to add: Pick up the entry in /home/production/.ssh/id_rsa.pub and put it in this file on the host that you want to have access to (make sure that it's all one line): /home/production/.ssh/authorized_keys NOTE: DO NOT do a ssh-keygen or you will have to update all the host's authorized_keys with the new public key you just generated. If not already active, then do what's shown above for the ssh-agent. 10. There should be a cron job running that will archive to the T50 tapes. Note the names are asymmetric for dcs0 and dcs1. 30 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do 00 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do_dcs1 In the beginning of the world, before any sum_start_dc, the T50 should have a supply of blank tapes in it's active slots (1-24). A cleaning tape must be in slot 25. The imp/exp slots (26-30) must be vacant. To see the contents of the T50 before startup do: > mtx -f /dev/t50 status Whenever sum_start_dc is called, all the tapes are inventoried and added to the SUMS database if necessary. When a tape is written full by the tapearc_do cron job, the t50view display (see 11. and 12. below) 'Imp/Exp' button will increment its count. Tapes should be exported before the count gets above 5. 11. There should be running the t50view program to display/control the tape operations. > t50view -i jsocdc The -i means interactive mode, which will allow you to change tapes. 12. Every 2 days, inspect the t50 display for the button on the top row called 'Imp/Exp'. If it is non 0 (and yellow), then some full tapes can be exported from the T50 and new tapes put in for further archiving. Hit the 'Imp/Exp' button. Follow explicitly all the directions. The blank L4 tapes are in the tape room in the computer room. When the tape drive needs cleaning, hit the "Start Cleaning" button on the t50view gui. 13. There should be a cron job running as user production on both dcs0 and dcs1 that will set the Offsite_Ack field in the sum_main DB table. 20 0 * * * /home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl Where: #/home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl # #This reads the .ver files produced by Tim's #/home/production/tape_verify/scripts/run_remote_tape_verify.pl #A .ver file looks like: ## Offsite verify offhost:dds/off2ds/HMI_2008.06.11_01:12:27.ver ## Tape 0=success 0=dcs0(aia) #000684L4 0 1 #000701L4 0 1 ##END #For each tape that has been verified successfully, this program #sets the Offsite_Ack to 'Y' in the sum_main for all entries #with Arch_Tape = the given tape id. # #The machine names where AIA and HMI processing live #is found in dcstab.txt which must be on either dcs0 or dcs1 14. Other background info is in: http://hmi.stanford.edu/development/JSOC_Documents/Data_Capture_Documents/DataCapture.html ***************************dsc3********************************************* NOTE: dcs3 (i.e. offsite datacapture machine shipped to Goddard Nov 2008) At Goddard the dcs3 host name will be changed. See the following for how to accomodate this: /home/production/cvs/JSOC/doc/dcs3_name_change.txt This cron job must be run to clean out the /dds/soc2pipe/[aia,hmi]: 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/production/cvs/JSOC/proj/datacapture/scripts/rm_soc2pipe.pl Also on dcs3 the offsite_ack check and safe tape check is not done in: /home/production/cvs/JSOC/base/sums/libs/pg/SUMLIB_RmDo.pgc Also on dcs3, because there is no pipeline backend, there is not .arc file ever made for the DDS. ***************************dsc3********************************************* Level 0 Backend: -------------------------- !!Make sure run Phil's script for watchlev0 in the background on cl1n001: /home/production/cvs/JSOC/base/sums/scripts/get_dcs_times.csh 1. As mentioned above, the datacapture machines automatically copies DDS input data to the pipeline backend on /dds/socdc living on d01. 2. The lev0 code runs as ingest_lev0 on the cluster machine cl1n001, which has d01:/dds mounted. cl1n001 can be accessed through j1. 3. All 4 instances of ingest_lev0 for the 4 VCs are controlled by /home/production/cvs/JSOC/proj/lev0/apps/doingestlev0.pl If you want to start afresh, kill any ingest_lev0 running (will later be automated). Then do: > cd /home/production/cvs/JSOC/proj/lev0/apps > doingestlev0.pl (actually a link to start_lev0.pl) You will see 4 instances started and the log file names can be seen. You will be advised that to cleanly stop the lev0 processing, run: > stop_lev0.pl It may take awhile for all the ingest_lev0 processes to get to a point where they can stop cleanly. For now, every hour, the ingest_lev0 processes are automatically restarted. 4. The output is for the series: hmi.tlmd hmi.lev0d aia.tlmd aia.lev0d #It is all save in DRMS and archived. Only the tlmd is archived. (see below if you want to change the archiving status of a dataseries) 5. If something in the backend goes down such that you can't run ingest_lev0, then you may want to start this cron job that will periodically clean out the /dds/socdc dir of the files that are coming in from the datacapture systems. > crontab -l # DO NOT EDIT THIS FILE - edit the master and reinstall. # (/tmp/crontab.XXXXVnxDO9 installed on Mon Jun 16 16:38:46 2008) # (Cron version V5.0 -- $Id: whattodolev0.txt,v 1.9 2010/12/17 18:34:28 production Exp $) #0,20,40 * * * * /home/jim/cvs/jsoc/scripts/pipefe_rm ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Starting and stoping SUMS on d02: Login as production on d02 sum_start_d02 (if sums is already running it will ask you if you want to halt it. you normally say 'y'.) sum_stop_d02 if you just want to stop sums. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ SUMS archiving: Currently SUM is archiving continuously. The script is: /home/production/cvs/JSOC/base/sums/scripts/tape_do_0.pl (and _1, _2, _3) To halt it do: touch /usr/local/logs/tapearc/TAPEARC_ABORT[0,1,2] Try to keep it running, as there is still much to be archived. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Change archiving status of a dataseries: > psql -h hmidb jsoc jsoc=> update hmi.drms_series set archive=0 where seriesname='hmi.lev0c'; UPDATE 1 jsoc=> \q ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ The modified dcs reboot procedure is in ~kehcheng/dcs.reboot.notes.