Sunday, July 11, 2010

SB: TLM restarted again on glidein-2

stuck for 3h:
[1411] crab@glidein-2 ~$ date
Sun Jul 11 14:11:55 PDT 2010
-rw-rw-rw- 1 crab uscms 369490122 Jul 11 08:11 .SEinteraction.log
-rw-rw-rw- 1 crab uscms 5843593 Jul 11 08:11 ComponentLog

no error message in ComponentLog or .SEinteraction.log
last lines of ComponentLog
2010-07-11 08:11:44,589:Timeleft for retrieved proxy: (exit code 0) 76112
2010-07-11 08:11:44,589:Requested voms validity: 21:08

last lines of .SEinteraction.log
2010-07-11 08:11:21.484095:
Executed: env X509_USER_PROXY=/tmp/del_proxies/cf44bdea0cb960a56b620421578d26cb6e594b8c uberftp glidein-2.t2.ucsd.edu "rm /data/CSstoragePath/hbrun_PhotonJetPt15_9p2x6m/crab_fjr_123.xml"
Done with exit code: 0
and output:
220 glidein-2.t2.ucsd.edu GridFTP Server 2.8 (gcc64dbg, 1217607445-63) [VDT patched 4.0.8] ready.
230 User uscms2341 logged in.
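
The manual check above (compare `date` with the modification time of the logs) could be scripted. A minimal sketch, assuming that a log file not modified for longer than some threshold means the component is stuck; the function name, the threshold and the choice of directory are illustrative and not part of the CRAB server tools:

```python
import os
import time


def stale_logs(paths, max_age_seconds, now=None):
    """Return (path, age_in_seconds) for each log not modified recently."""
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        age = now - os.path.getmtime(path)
        if age > max_age_seconds:
            stale.append((path, age))
    return stale


if __name__ == "__main__":
    # Assumed component directory and a 20 minute threshold; adjust per server.
    logs = [os.path.expanduser("~/work/TaskLifeManager/ComponentLog")]
    existing = [p for p in logs if os.path.exists(p)]
    for path, age in stale_logs(existing, 20 * 60):
        print("%s not updated for %.0f min" % (path, age / 60))
```

Run from cron, something like this would only flag a candidate for restart; a quiet log can also mean there was simply nothing to do.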

Tuesday, July 6, 2010

SB: TLM restarted again on glidein-2

because stuck for 3 hours
[1525] crab@glidein-2 ~/work/TaskLifeManager$ date
Mon Jul 5 15:25:29 PDT 2010
[1525] crab@glidein-2 ~/work/TaskLifeManager$ tail -1 ComponentLog
2010-07-05 12:31:36,384:[/tmp/del_proxies/a7835c47ea54226fc019d9de30a45fa4b89bd287]
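
How long the component has been silent can also be read from the timestamp of the last ComponentLog line instead of the file date. A small sketch, assuming every line starts with a timestamp in the format seen here (`YYYY-MM-DD HH:MM:SS,mmm`); the function name is made up for illustration:

```python
from datetime import datetime


def time_since_last_entry(last_line, now):
    """Return how long ago the last ComponentLog line was written."""
    # The first 23 characters look like "2010-07-05 12:31:36,384"
    stamp = datetime.strptime(last_line[:23], "%Y-%m-%d %H:%M:%S,%f")
    return now - stamp


line = "2010-07-05 12:31:36,384:[/tmp/del_proxies/a7835c47ea54226fc019d9de30a45fa4b89bd287]"
# Wall clock taken from the `date` output above (same time zone as the log)
print(time_since_last_entry(line, datetime(2010, 7, 5, 15, 25, 29)))
```

With the values from this entry the gap is a little under three hours, consistent with the "3 hours" quoted above.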

Wednesday, June 30, 2010

restarted CSW on glidein-2

because of
https://savannah.cern.ch/bugs/?68929

last lines of ComponentLog

2010-06-30 13:14:27,228:Input whitelist:
2010-06-30 13:14:27,228:Input blacklist:
2010-06-30 13:14:27,228:Converted whitelist:
2010-06-30 13:14:27,229:Converted blacklist:
2010-06-30 13:14:27,913:MS: get requested
2010-06-30 13:14:27,914:Publishing TTXmlLogging
2010-06-30 13:14:27,914:MS: publish requested
2010-06-30 13:14:27,915:Publish Event: worker_2 TTXmlLogging meridian_EGPD_v4_v1_138572_b5m1n7::-1::/data/CSstoragePath/logs/meridian_EGPD_v4_v1_138572_b5m1n7_spec/1277928863.75.pkl
2010-06-30 13:14:27,915:Publishing CrabServerWorkerComponent:FatWorkerResult
2010-06-30 13:14:27,916:MS: publish requested
2010-06-30 13:14:27,916:(1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Worker worker_2. Problem performing List Match for task meridian_EGPD_v4_v1_1385' at line 7")
[1455] crab@glidein-2 ~/work/CrabServerWorker$
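
The 1064 error above is what one would expect if the message text is pasted directly into the SQL string and happens to contain a quote; this is a guess from the log, not a confirmed diagnosis (see the Savannah bug for the real one). A sketch of the usual remedy, bound parameters, using the standard library `sqlite3` so that it runs anywhere; the table and column names are hypothetical, and with MySQLdb the placeholder would be `%s` instead of `?`:

```python
import sqlite3


def store_message(conn, event, payload):
    # Placeholders let the driver quote the payload safely.
    conn.execute(
        "INSERT INTO ms_message (event, payload) VALUES (?, ?)",
        (event, payload),
    )


def store_message_unsafe(conn, event, payload):
    # String interpolation: a quote inside the payload breaks the statement.
    conn.execute(
        "INSERT INTO ms_message (event, payload) VALUES ('%s', '%s')"
        % (event, payload)
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ms_message (event TEXT, payload TEXT)")  # hypothetical table
payload = "Worker worker_2. Problem performing 'List Match' for task"
store_message(conn, "CrabServerWorkerComponent:FatWorkerResult", payload)

try:
    store_message_unsafe(conn, "CrabServerWorkerComponent:FatWorkerResult", payload)
except sqlite3.Error as err:
    print("interpolated query failed:", err)
```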

Tuesday, June 29, 2010

SB: TLM restarted again on glidein-2

because stuck for ~2 hours. Last line in ComponentLog was
2010-06-29 12:37:32,800:[/tmp/del_proxies/0e6a5e75ddd4d72a44ce4593b9653a0ea51dcb91]

SB: TaskLifeManager restarted on glidein-2

was stuck for more than 1:30 hours. No visible error. Last message
2010-06-29 03:13:09,331:[/tmp/del_proxies/4999da3342d14feb5c3f70ae3789e4326fa2d82c]
According to the green box

availability info:
Error: CrabServerWorker not running
status:
not available

Again.

Dave

Thursday, June 24, 2010

SB: TaskLifeManager restarted on submit-2

was restarted earlier by Sanjay. Posting this for record keeping

restarted TaskLifeManager on submit-2

no update to log for >20 min. Last lines in log:

crab@submit-2 ~/work/TaskLifeManager$ tail -5 ComponentLog
2010-06-23 12:30:34,142:[/tmp/del_proxies/241fb22338757d8c32122053135f8cecd3d645a6]
2010-06-23 14:49:14,720:Exception raised: 'Error deleting [/data/CSstoragePath/slehti_crab_0_100618_161934_qe306c/out_files_1149.tgz]'
2010-06-23 14:49:14,721:problems deleting osb for job 1149
2010-06-23 14:51:41,590:Exception raised: 'Error deleting [/data/CSstoragePath/slehti_crab_0_100618_161934_qe306c/crab_fjr_1173.xml]'
2010-06-23 14:51:41,590:problems deleting osb for job 1173
crab@submit-2 ~/work/TaskLifeManager$

Wednesday, June 23, 2010

GetOutput restarted again on vocms21

because it had not been updating since 20:00. Also http://vocms21.cern.ch:8888/compstatus/ was showing ~1K jobs in JC queue, now there are 8.

Tuesday, June 22, 2010

GetOutput restarted on vocms21

because of
https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/1404/1/1/1.html
no useful message was in ComponentLog

vocms21 updated

earlier this morning vocms21 was updated live to crab server 112. So far so good,
components are running but we keep the server off the available pool
https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/1403/1.html

plan is to drain it and then scratch the 3GB MySqlDB and start a new one
with autoextend set to 1GB.

stefano (upon info from Sanjay)

taking slc5cern = vocms21.cern.ch off the available servers list

because TaskTracking is not restarting

TaskTracking on vocms21 stuck, fails to restart

Restarted TT because of
https://savannah.cern.ch/bugs/index.php?69074

But fails to restart
Component: TaskTracking NOT Running

2010-06-22 10:48:14,365:Problem registering component
[(1205, 'Lock wait timeout exceeded; try restarting transaction')]
2010-06-22 10:48:14,368:Traceback (most recent call last):
File "/home/crab/sw_area/slc5_amd64_gcc434/cms/crab-server/CRABSERVER_1_1_1_pre12/lib/TaskTracking/TaskTrackingComponent.py", line 1185, in startComponent
self.ms.registerAs("TaskTracking")
File "/home/crab/sw_area/slc5_amd64_gcc434/cms/prodagent/PRODAGENT_0_12_17_CRAB_1-cmp7/lib/MessageService/MessageService.py", line 154, in registerAs
cursor.execute(sqlCommand)
File "/build/diego/crabserver/slc5_amd64_gcc434/external/py2-mysqldb/1.2.2-cmp4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 166, in execute
File "/build/diego/crabserver/slc5_amd64_gcc434/external/py2-mysqldb/1.2.2-cmp4/lib/python2.4/site-packages/MySQLdb/connections.py", line 35, in defaulterrorhandler
OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')



Component keeps retrying automatically, but keeps failing.

Thursday, June 10, 2010

FatWorker.py patched

to version 1.204 as per Fabio's indication https://savannah.cern.ch/bugs/index.php?68634 and CrabServerWorker restarted both at CERN (vocms21) and UCSD (glidein-2)

(n.b. previously had version 1.203 at UCSD and 1.194.2.6 at CERN. No reason why CERN was like that. 1.203 is what is tagged for server 112pre1)

Saturday, June 5, 2010

glidein-2 patched for gLite JDL spurious resubmission

I updated FatWorker.py in /sw_area/slc5_amd64_gcc434/cms/crab-server/CRABSERVER_1_1_1_pre12/lib/CrabServerWorker to current CVS head with Fabio's fix

On 20/05/2010 10.42, Fabio Farina wrote:
> The file to replace is this one
> http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CRAB/CRABSERVER/src/python/CrabServerWorker/FatWorker.py?view=log&pathrev=HEAD
> Sanjay Padhi wrote:
>> Hi Fabio,
>> Is it possible for you to tell me which patch was responsible for the
>> retry count in glite ?

Tuesday, June 1, 2010

SB: TaskLifeManager restarted on glidein-2

usual problem, log ends with:
2010-06-01 02:52:37,590:Checking proxy [/tmp/del_proxies/067fc18b4c5f9b5613eb5dce1fd527b044bb0463]
2010-06-01 02:52:37,702:Credential still valid for: 114894 s
2010-06-01 02:52:37,702:Trying to renew proxy [/tmp/del_proxies/067fc18b4c5f9b5613eb5dce1fd527b044bb0463]
2010-06-01 02:52:39,411:MyProxy renewal - logon :
unset X509_USER_CERT X509_USER_KEY && env X509_USER_CERT=$HOME/.globus/hostcert.pem X509_USER_KEY=$HOME/.globus/hostkey.pem myproxy-logon -d -n -s $MYPROXY_SERVER -a /tmp/del_proxies/067fc18b4c5f9b5613eb5dce1fd527b044bb0463 -o /tmp/del_proxies/067fc18b4c5f9b5613eb5dce1fd527b044bb0463 -k ab771271de34096b70a435a7805fdef141e2197d -t 168:00
2010-06-01 02:52:39,452:Timeleft for retrieved proxy: (exit code 0) 114946
2010-06-01 02:52:39,452:Requested voms validity: 31:55

and component stuck for 2 hours.
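
When reading these proxy lines, the "Requested voms validity" value appears to be the "Timeleft" seconds expressed as hours:minutes, truncated to whole minutes (114946 s gives 31:55 above). That relation is inferred from the logs in this blog, not taken from the TaskLifeManager code. A sketch of the conversion, with a made-up function name:

```python
def seconds_to_hhmm(seconds):
    """Convert a time left in seconds to H:MM, dropping leftover seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    return "%d:%02d" % (hours, remainder // 60)


print(seconds_to_hhmm(114946))  # 31:55, matching the log above
```

Handy for a quick sanity check that the renewed proxy really has the lifetime the component then asks VOMS for.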

Monday, May 31, 2010

SB: TaskLifeManager restarted on vocms21 (again)

last lines of ComponentLog:

2010-05-31 15:29:56,383:MyProxy renewal - logon :
unset X509_USER_CERT X509_USER_KEY && env X509_USER_CERT=$HOME/.globus/hostcert.pem X509_USER_KEY=$HOME/.globus/hostkey.pem myproxy-logon -d -n -s $MYPROXY_SERVER -a /tmp/del_proxies/c3bd92f55a15d0209614e7de9f151860d2af7690 -o /tmp/del_proxies/c3bd92f55a15d0209614e7de9f151860d2af7690 -k 95a232654d8bb8afedd32875d0d6885983237bdb -t 168:00
2010-05-31 15:29:56,420:Timeleft for retrieved proxy: (exit code 0) 9303
2010-05-31 15:29:56,421:Requested voms validity: 2:35
-bash-3.2$

SB: TaskLifeManager restarted on vocms21

stuck for 1h15min. Last lines in log:
2010-05-31 13:44:56,492:MyProxy renewal - logon :
unset X509_USER_CERT X509_USER_KEY && env X509_USER_CERT=$HOME/.globus/hostcert.pem X509_USER_KEY=$HOME/.globus/hostkey.pem myproxy-logon -d -n -s $MYPROXY_SERVER -a /tmp/del_proxies/4b6f844bf24d14714dc465576061745027b95e30 -o /tmp/del_proxies/4b6f844bf24d14714dc465576061745027b95e30 -k 95a232654d8bb8afedd32875d0d6885983237bdb -t 168:00
2010-05-31 13:44:56,530:Timeleft for retrieved proxy: (exit code 0) 79906
2010-05-31 13:44:56,531:Requested voms validity: 22:11


was also restarted at 9:05 this morning by Sanjay

Thursday, May 27, 2010

stale BDII at CERN (from Burt Holzman)

https://gus.fzk.de/ws/ticket_info.php?ticket=58525 The OSG information has not been updated in 1 day for one of the CERN BDIIs in the round-robin. I don't know if the EGI side is out-of-date as well, but this potentially impacts analysis operations (ERTs are not being updated for example).

SiteDB times out, most things fail

since yesterday afternoon. Many users have plenty of problems. Status tracked here so far: https://savannah.cern.ch/bugs/index.php?67990 .
Operator restarted services at 10:05, it appears fixed now.

Wednesday, April 28, 2010

OOOOPPPPSSSSS

sorry, I pasted the wrong URL.
The pointer to the new e-log is
https://prod-grid-logger.cern.ch/elog/Analysis+Operations/

to subscribe click on Config in top menu

Stefano

new E-Log

FacOps has enabled our e-log in the official place
https://lxbra1907.cern.ch/elog/Analysis+Operations/
it has no RSS feed, but people can subscribe to get mails of postings. I think we should switch to use that. That way anyone in CMS (e.g. Computing shifter) can post. If you do not already have an e-log password you will need to register. I am using the same username and password as Hypernews, to keep it simple. I will make an attempt at importing history from the blog there, but if it does not work right away, I'll give up.
See you on the new logbook.
Stefano

Tuesday, April 27, 2010

put back slc5cern=vocms21 in AvailableServerList

Fabio fixed the problem http://savannah.cern.ch/bugs/?66389 . One test task went OK, status from web OK, also some users were submitting anyhow (guess they hardcoded slc5cern in their crab.cfg) and it all looks normal. Stefano

Monday, April 26, 2010

how to add/remove a Crab server etc.

our operation twiki now has instructions on how to use crab config files to e.g. add/remove one crab server to pool. The TWISTY thing I put is maybe not so nice, feel free to edit.
https://twiki.cern.ch/twiki/bin/view/CMS/CrabUserSupportFAQ#Links
need to click on More... (that's the not so cool and nice part)
stefano

today's news

slc5cern aka vocms21 is still unusable. It seems it just takes an awful amount of time to go through the backlog, and Fabio is looking into speeding that up. It is an interesting problem and we are not in a hurry, so let's keep it off the available list.
CMSSW_370 is breaking crab compatibility (something changed in FWCore), details in Savannah.
Mystery of "file already exists but I am sure directory was empty" deepens.
Stefano

Sunday, April 25, 2010

Reply all is default

I have made reply all the default for the feedback list - makes it easier not to forget to do reply all when replying to messages as I often forget. Found this in the "labs" part of the settings for the gmail when looking for statistical tools.

Cheers,
Dave

vocms21 removed

I commented out slc5cern (i.e. vocms21.cern.ch) in /afs/cern.ch/cms/LCG/crab/config/AvailableServerList
(and committed change to CVS). Stefano

restarted again

was stuck again. Restarted a couple of times. But still left with a bunch of jobs in Submitting state. Do not know what to do, will leave for developers to investigate.

CrabServerWorker component restarted



vocms21 was not submitting (noticed after a user complained). I restarted CrabServerWorker as per instructions https://twiki.cern.ch/twiki/bin/view/CMS/CrabServerOpsIssues#Down_Component and things seem recovered. Details for developers in Savannah https://savannah.cern.ch/bugs/index.php?66389 . Image from http://vocms21.cern.ch:8888/graphjobstcum/?length=24&span=hours shows that activity was stopped then restarted. Stefano

Saturday, April 24, 2010

added back wms212

wms212.cern.ch was removed (by me) from /afs/cern.ch/cms/LCG/crab/config/glite_wms_CERN.conf on Apr 7 because Filippo Spiga reported it was hanging output. Enough time has passed that it should be drained. Status from CERN monitoring page looks good. I added it back and committed the change to CVS. Stefano

Opened a ticket with T2_Vienna

analysis jobs are not getting enough CPU there. Ticket is https://savannah.cern.ch/support/index.php?114034 , pointer to feedback thread:
https://hypernews.cern.ch/HyperNews/CMS/get/crabFeedback/3150/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html . Stefano

test, please ignore

Wednesday, April 21, 2010

Also crab.support@gmail.com can now post

for your convenience when logged on gmail as crab.support. Please remember to sign with your own name. And keep entries terse, avoid unnecessary new lines. Stefano

crab server fails with CMSSW_3_6_0

See previous entry. I told users to use direct submissions in reply to Olga's HN posting.

cms_ui_env_NEW removed

to avoid confusion I removed UI setup scripts with _NEW from both AFS and CVS. Users should use the standard one as per workbook. _NEW was only for testing before we changed the default. Stefano

Ok I'm in

Nice tool to exchange infos...I've subscribed to the feed in order to see the updates directly on my browser's bar.

Cheers,
Marco.

ok, I signed up ....

... and it took me half a dozen trials to remember my google passwd and account.
Man, you guys are finally bringing me into the 21st century, it seems.

test

test - apologies for noise.
Cheers,
Dave

Tuesday, April 20, 2010

first entry

I created this blog to temporarily supply the missing e-log for Analysis Operations.
Should all be self-explanatory. Let's try to have no more than a few terse postings a day.
Stefano