CPU Benchmark appeared to cause WU error (Linux_AMD64)
Author | Message |
---|---|
Joined: 3 May 11 Posts: 4 Credit: 135,212,867 RAC: 0 |
I have observed a couple of intermittent failed WUs on a Linux_AMD64 machine. These failed WUs occurred very far apart. I was unable to work out what exactly was causing the problem until one occurred while I was watching. Normally I can manually suspend and resume a WU without any problems. On this occasion the BOINC Manager suspended all projects to execute the scheduled CPU Benchmark cycle. At the start of that event the WU failed. It was approximately 80% through.

The box is host#769:

OS: Ubuntu desktop 10.10
Kernel: Linux 2.6.35-28-generic
Arch: AMD64
RAM: 16GB
CPU: 6 core AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0]
GPU: CAL ATI Radeon HD5700 series (Juniper) (1024MB), driver: 1.4.1332
S/W (of note): ia32libs, ati-stream-sdk-v2.2-lnx64, BOINC 6.10.58
Task: 1403959
WU: 1082519

I'm really not that concerned about the issue myself as it is so infrequent, but I thought I should post the details of the event in case it matters to someone.

[EDIT] the event may also coincide with an orphaned client app process: I just found an extra instance of the client app that had been running for 11+ hours. It was unaccounted for by the BOINC Manager. It is also not the process belonging to the WU that failed. Also, this host shows no failed WU from approx. 10 hours ago, when that orphaned process should have returned or failed (nor from the whole week prior). Mysterious. I killed the process and will resume monitoring. [/EDIT] |
Joined: 22 Jun 11 Posts: 2080 Credit: 1,844,408,008 RAC: 90 |
> I have observed a couple of intermittent failed WUs on a Linux_AMD64 machine.

I don't know about the 'orphan' process, but the rest, crashing a unit during the benchmark process, was a common Windows problem a few versions ago. Your PCs are hidden so I can't tell if you are using the latest version of BOINC. You also might ask on the BOINC Developers Mailing List; sorry, I do not have a link to it.

Dr. David Anderson of SETI is the creator and maintainer of BOINC and is in charge of the Mailing List. Be aware that he can be a bit gruff at times; it is HIS BABY we are using! He seems to like to learn but seems to dislike re-educating people on stuff he has said in the past. He has a team of programmers helping him, but again it is HIS BABY!! |
Joined: 3 May 11 Posts: 4 Credit: 135,212,867 RAC: 0 |
That's ok, thanks. I am a long-time BOINCer and I am familiar with the forums. The note is more for the admins (who can see hidden hosts).

I don't think there is a real issue with the Moo wrapper or the dnetc client app, and I do not believe this would be a BOINC Manager issue in any way that I can think of, or we would see a lot more of this, which we don't at this time. Point to note: this is a very VERY busy host, so an issue from time to time is expected. I'm posting FTR in case this happens to anyone else, so it might be correlated.

The oddity is the orphaned process. In the scheme of things the wrapper should have shut it down at some point, but all relevant tasks in the time-frame were completed successfully, so I can't correlate it. Anyway, it is gone now. I'm watching to see if it happens again. I may write a bash script to tell me immediately when more than one dnetc task occurs on the host so that I can interrogate it and nail it down. But given the frequency I don't expect that to occur again for another week or two.

Thanks for your reply, though. |
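The monitoring idea in the post above (flag the host as soon as more than one dnetc process is running) could be sketched roughly as follows. This is written in Python rather than bash, and it is only an illustration under stated assumptions: the process name prefix `dnetc` is a guess at what the client binary is called on the host, and the helper names `find_processes` and `report_extras` are hypothetical. It reads the Linux `/proc/<pid>/comm` files, so it is Linux-only.

```python
import os


def find_processes(name_prefix, proc_root="/proc"):
    """Return the sorted PIDs whose command name starts with name_prefix.

    Reads <proc_root>/<pid>/comm, which on Linux holds the (possibly
    truncated) executable name of each process.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the process exited between listdir() and open()
        if comm.startswith(name_prefix):
            pids.append(int(entry))
    return sorted(pids)


def report_extras(pids, expected=1):
    """Return a warning string if more processes run than expected, else None."""
    if len(pids) > expected:
        return "expected %d matching process(es), found %d: %s" % (
            expected, len(pids), pids)
    return None
```

Calling `report_extras(find_processes("dnetc"))` from a cron job or a polling loop and mailing any non-`None` result would give the immediate notice the post describes. The expected count of 1 is itself an assumption; a host running several tasks at once would need a higher value.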
Joined: 18 May 11 Posts: 46 Credit: 1,254,302,893 RAC: 0 |
> I have observed a couple of intermittent failed WUs on a Linux_AMD64 machine.

It may be related to the problem I posted here: http://moowrap.net/forum_thread.php?id=91

WUs that are stopped and restarted are marked invalid. All other WUs validate fine. |
Joined: 23 May 11 Posts: 1 Credit: 59,590 RAC: 0 |
I see the same issues on my system also:

20-Jul-2011 17:44:37 [---] Running CPU benchmarks
20-Jul-2011 17:44:37 [---] Suspending computation - CPU benchmarks in progress
20-Jul-2011 17:44:38 [Moo! Wrapper] Computation for task dnetc_r72_1311076534_2_32_0 finished
20-Jul-2011 17:44:38 [Moo! Wrapper] Output file dnetc_r72_1311076534_2_32_0_0 for task dnetc_r72_1311076534_2_32_0 absent

15-Jul-2011 17:15:08 [---] Running CPU benchmarks
15-Jul-2011 17:15:08 [---] Suspending computation - CPU benchmarks in progress
15-Jul-2011 17:15:09 [Moo! Wrapper] Computation for task dnetc_r72_1310728004_2_32_0 finished
15-Jul-2011 17:15:09 [Moo! Wrapper] Output file dnetc_r72_1310728004_2_32_0_0 for task dnetc_r72_1310728004_2_32_0 absent

10-Jul-2011 17:10:55 [---] Running CPU benchmarks
10-Jul-2011 17:10:55 [---] Suspending computation - CPU benchmarks in progress
10-Jul-2011 17:10:56 [Moo! Wrapper] Computation for task dnetc_r72_1310207188_2_32_2 finished
10-Jul-2011 17:10:56 [Moo! Wrapper] Output file dnetc_r72_1310207188_2_32_2_0 for task dnetc_r72_1310207188_2_32_2 absent

linux_AMD64 (Ubuntu 11.04), ATI 11.6, BOINC 6.12.18 |
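For anyone wanting to check their own message log for the same pattern, here is a small Python sketch that pairs each "Running CPU benchmarks" line with any "Output file ... absent" line that follows within a few seconds. The line layout is taken from the log excerpt in the post above; the 5-second window and the function names `parse_events` and `failures_near_benchmark` are assumptions made for illustration, not anything provided by BOINC.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the excerpt: "DD-Mon-YYYY HH:MM:SS [source] message"
LINE_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")
ABSENT_RE = re.compile(r"^Output file (\S+) for task (\S+) absent$")


def parse_events(lines):
    """Turn raw log lines into (timestamp, source, message) tuples."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
            events.append((ts, m.group(2), m.group(3)))
    return events


def failures_near_benchmark(events, window=timedelta(seconds=5)):
    """Return (task_name, timestamp) for each 'absent' message logged
    within `window` after a 'Running CPU benchmarks' message."""
    starts = [ts for ts, _, msg in events if msg == "Running CPU benchmarks"]
    hits = []
    for ts, _, msg in events:
        m = ABSENT_RE.match(msg)
        if m and any(timedelta(0) <= ts - s <= window for s in starts):
            hits.append((m.group(2), ts))
    return hits


sample = [
    "20-Jul-2011 17:44:37 [---] Running CPU benchmarks",
    "20-Jul-2011 17:44:37 [---] Suspending computation - CPU benchmarks in progress",
    "20-Jul-2011 17:44:38 [Moo! Wrapper] Computation for task dnetc_r72_1311076534_2_32_0 finished",
    "20-Jul-2011 17:44:38 [Moo! Wrapper] Output file dnetc_r72_1311076534_2_32_0_0 for task dnetc_r72_1311076534_2_32_0 absent",
]
for task, when in failures_near_benchmark(parse_events(sample)):
    print(task, "failed at", when)
```

Month abbreviations are parsed with `%b`, which depends on the process locale, so this assumes an English or C locale.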
Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
> On this occasion the BOINC Manager suspended all projects to execute the scheduled CPU Benchmark cycle.

Looks like one of those dreadful signal 11 (SIGSEGV) crashes. :( This is a bug in our wrapper, which seems to get triggered when it executes a code path for some not-so-normal situation. I'd need to figure out what part that is and fix it, but this suspend/resume logic is a good pointer, so now I know where to start looking. :)

> the event may also coincide with an orphaned client app process:

I've seen this happen also, and especially on GPU it causes noticeable lag (two processes giving the same GPU work). I believe it usually is caused by BOINC Client killing our wrapper before we complete shutting down the d.net client (we first ask nicely and get nasty only if it doesn't comply). Unfortunately, BOINC Client doesn't kill our children (the d.net client) as well, so it's left running (it should end normally eventually, but can also hang).

I don't know if I can fix this in our wrapper, but there have been some BOINC Client fixes in this area. At least on Linux they should now also kill our children in this case (which, most likely, will make it lose packets, but that's another problem). Unless I was dreaming, BOINC Client might have also extended (or fixed) the delay between asking us to go away and killing us.

-w |
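The "ask nicely, then get nasty" shutdown described in the post above, combined with signalling the whole process group so that no child is left orphaned, can be illustrated with a short Python sketch. This is not the Moo! Wrapper's or BOINC's actual code; the function names, the 5-second grace period, and the choice of SIGTERM followed by SIGKILL are assumptions made for the example, and it is POSIX-only.

```python
import os
import signal
import subprocess


def start_child(argv, **popen_kwargs):
    """Start a child in its own session, so it leads a new process group
    whose ID equals the child's PID."""
    return subprocess.Popen(argv, start_new_session=True, **popen_kwargs)


def stop_child(proc, grace_seconds=5.0):
    """Stop the child's whole process group.

    Sends SIGTERM first (the polite request), waits up to grace_seconds,
    then sends SIGKILL if the child is still alive.
    Returns True if SIGKILL was needed, False otherwise.
    """
    if proc.poll() is not None:
        return False  # already exited, nothing to do
    pgid = proc.pid  # valid because start_child used start_new_session=True
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        proc.wait()
        return False  # the group vanished just before we signalled it
    try:
        proc.wait(timeout=grace_seconds)
        return False
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # the child did not comply
        proc.wait()
        return True
```

Signalling the group rather than the single PID is the design point here: anything the child spawned in the same group receives the signal too. It does not help when the supervising process is itself killed before it reaches `stop_child`, which is the situation the post attributes to the BOINC Client.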