production 1.1 /home/production/cvs/JSOC/doc/whattodo_sum_partn_on_off.txt

If a file server is down, its /SUM partitions should be taken
offline so that SUMS does not try to allocate storage from them
and then hang on a mkdir that it is unable to complete.

Here are the /SUM partitions on each file server:

d02: /SUM0 through the /SUM20s
d03: the /SUM30s
d04: the /SUM40s

To take a /SUM partition offline, set its pds_set_num to -1
in the sum_partn_avail db table, which looks like:

 partn_name |  total_bytes   |  avail_bytes  | pds_set_num | pds_set_prime
------------+----------------+---------------+-------------+---------------
 /SUM37     | 25000000000000 | 1349566595072 |           0 |             0
 /SUM4      | 33000000000000 | 1600150855680 |           0 |             0
 /SUM2      | 33000000000000 | 1599634276352 |           0 |             0
 /SUM20     | 33000000000000 | 1599828013056 |           0 |             0
 /SUM21     | 33000000000000 | 1600223768576 |           0 |             0
 /SUM22     | 33000000000000 | 1599709511680 |           0 |             0
 /SUM3      | 33000000000000 | 1599469428736 |           0 |             0
[etc.]

Do this as user production on a machine with psql (e.g. n02),
adjusting for the file server being taken offline (d03 in this example):

> psql -h hmidb -p 5434 jsoc_sums
jsoc_sums=> select * from sum_partn_avail where partn_name like '/SUM3%';
 partn_name |  total_bytes   |  avail_bytes  | pds_set_num | pds_set_prime
------------+----------------+---------------+-------------+---------------
 /SUM3      | 33000000000000 | 1599825543168 |           0 |             0
 /SUM30     | 22000000000000 | 1199689433088 |           0 |             0
 /SUM31     | 22000000000000 | 1199430434816 |           0 |             0
 /SUM32     | 22000000000000 |  747345281024 |           0 |             0
 /SUM33     | 22000000000000 |  786056609792 |           0 |             0
 /SUM34     | 25000000000000 | 1349826641920 |           0 |             0
 /SUM35     | 25000000000000 | 1349841321984 |           0 |             0
 /SUM36     | 25000000000000 | 1349845516288 |           0 |             0
 /SUM37     | 25000000000000 | 1349560303616 |           0 |             0
(9 rows)

jsoc_sums=> update sum_partn_avail set pds_set_num=-1 where
jsoc_sums-> partn_name in ('/SUM30', '/SUM31', '/SUM32', '/SUM33',
jsoc_sums(> '/SUM34', '/SUM35', '/SUM36', '/SUM37');

[Notice that '/SUM3' is not in the update list: it matches the
'/SUM3%' pattern in the select, but it is not on the d03 file server.]

jsoc_sums=> \q

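The update statement used in the psql session can also be generated from an
explicit list of partition names, which keeps a look-alike such as /SUM3
out of it. What follows is only a sketch, written for a POSIX sh such as
bash rather than the shell behind the ">" prompt in this note.
partn_update_sql is a hypothetical helper, not part of SUMS, and it only
prints the statement; it does not touch the database.

```shell
# Hypothetical helper (not part of SUMS): print the UPDATE statement that
# sets pds_set_num for an explicit list of partitions.
# usage: partn_update_sql <pds_set_num> <partn_name>...
partn_update_sql() {
    num=$1
    shift
    list=""
    for p in "$@"; do
        # Append each name as a quoted SQL literal, comma-separated.
        list="${list:+$list, }'$p'"
    done
    printf "update sum_partn_avail set pds_set_num=%s where partn_name in (%s);\n" \
        "$num" "$list"
}

# Statement for taking the d03 partitions offline (printed for review only).
partn_update_sql -1 /SUM30 /SUM31 /SUM32 /SUM33 /SUM34 /SUM35 /SUM36 /SUM37
```

The printed statement can be reviewed and then pasted at the jsoc_sums=>
prompt. Calling the helper with 0 in place of -1 prints the matching
statement for putting the same partitions back online.
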
Now force all the sum_svc processes to reread the updated sum_partn_avail table:

> sumrepartn

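To confirm which partitions the table currently marks as offline, a query
along the following lines may help. This is a sketch in POSIX sh; the table
and column names are taken from the sum_partn_avail listing in this note,
while CHECK_SQL is just an illustrative variable name. It inspects the db
table only and says nothing about what the running sum_svc processes have
loaded.

```shell
# Query listing every partition currently marked offline (pds_set_num = -1).
CHECK_SQL="select partn_name, pds_set_num from sum_partn_avail where pds_set_num = -1 order by partn_name;"

# Print the query for review.
echo "$CHECK_SQL"

# To run it directly, as user production on a machine with psql:
#   psql -h hmidb -p 5434 jsoc_sums -c "$CHECK_SQL"
```

For the d03 example, and assuming no other partitions are offline, the
result should list /SUM30 through /SUM37.
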
When d03 is back online, rerun the update command shown above with
pds_set_num=0, then run sumrepartn again:

> sumrepartn

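Spelled out for the d03 example, the bring-back-online step might look like
the sketch below. It is written in POSIX sh and assumes the same eight d03
partitions and the same psql connection options used earlier in this note;
D03_PARTNS and ONLINE_SQL are illustrative variable names. The block itself
only prints the statement.

```shell
# The eight d03 partitions from the example (note: /SUM3 is not among them).
D03_PARTNS="'/SUM30', '/SUM31', '/SUM32', '/SUM33', '/SUM34', '/SUM35', '/SUM36', '/SUM37'"

# Same update as for taking the partitions offline, but with pds_set_num=0.
ONLINE_SQL="update sum_partn_avail set pds_set_num=0 where partn_name in ($D03_PARTNS);"

# Print the statement for review.
echo "$ONLINE_SQL"

# When it looks right, run it as user production, then refresh sum_svc:
#   psql -h hmidb -p 5434 jsoc_sums -c "$ONLINE_SQL"
#   sumrepartn
```
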