CPU Benchmark appeared to cause WU error (Linux_AMD64)


Message boards : Number crunching : CPU Benchmark appeared to cause WU error (Linux_AMD64)

AMDave
Joined: 3 May 11
Posts: 4
Credit: 135,212,867
RAC: 0
Message 800 - Posted: 15 Jul 2011, 9:48:33 UTC
Last modified: 15 Jul 2011, 10:20:30 UTC

I have observed a couple of intermittent failed WUs on a Linux_AMD64 machine.

These failed WUs occurred very far apart.

I was unable to work out what exactly was causing the problem until one occurred while I was watching.

Normally I can manually suspend and resume a WU without any problems.

On this occasion the BOINC Manager suspended all projects to execute the scheduled CPU Benchmark cycle.

At the start of that event the WU failed.

It was approximately 80% through.

The box is host#769
OS: Ubuntu desktop 10.10
Kernel: Linux 2.6.35-28-generic
Arch: AMD64
RAM: 16GB
CPU: 6-core AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0]
GPU: CAL ATI Radeon HD5700 series (Juniper) (1024MB) driver: 1.4.1332
S/W: (of note)
ia32libs
ati-stream-sdk-v2.2-lnx64
BOINC 6.10.58

Task: 1403959
WU: 1082519

I'm really not that concerned about the issue myself as it is so infrequent, but thought I should post the details of the event, in case it matters to someone.

[EDIT]
The event may also coincide with an orphaned client app process:
I just found an extra instance of the client app that had been running for 11+ hours.
It was unaccounted for by the BOINC Manager.
It is also not the process belonging to the WU that failed.
Also, around 10 hours ago, when that orphaned process should have returned or failed, this host recorded no failed WU (nor any in the whole week prior).
Mysterious.
I killed the process and will resume monitoring.
[/EDIT]
ID: 800 · Rating: 0
mikey
Joined: 22 Jun 11
Posts: 2080
Credit: 1,826,667,680
RAC: 32,419
Message 801 - Posted: 15 Jul 2011, 10:59:38 UTC - in response to Message 800.  

I don't know about the 'orphan' process, but the rest, a work unit crashing when the benchmark starts, was a common Windows problem a few versions ago. Your PCs are hidden, so I can't tell whether you are running the latest version of BOINC. You might also ask on the BOINC developers mailing list; sorry, I do not have a link to it. Dr. David Anderson of SETI is the creator and maintainer of BOINC and is in charge of the mailing list. Be aware that he can be a bit gruff at times; it is HIS BABY we are using! He seems to like to learn but dislikes re-educating people on things he has said in the past. He has a team of programmers helping him, but again, it is HIS BABY!!
ID: 801 · Rating: 0
AMDave
Joined: 3 May 11
Posts: 4
Credit: 135,212,867
RAC: 0
Message 804 - Posted: 16 Jul 2011, 4:57:27 UTC - in response to Message 801.  

That's ok, thanks.
I am a long-time BOINCer and I am familiar with the forums.
The note is more for the admins (who can see hidden hosts).

I don't think there is a real issue with the Moo wrapper or the dnetc client app, and I do not believe this would be a BOINC manager issue in any way that I can think of, or we would see a lot more of this, which we don't at this time.

Point to note - this is a very VERY busy host so an issue from time to time is expected.

I'm posting FTR in case this happens to anyone else so it might be correlated.

The oddity is the orphaned process. In the scheme of things the wrapper should have shut it down at some point, but all relevant tasks in the time-frame were completed successfully, so I can't correlate it.

Anyway it is gone now. I'm watching to see if it happens again. I may write a bash script to tell me immediately when more than one dnetc task occurs on the host so that I can interrogate it and nail it down. But given the frequency I don't expect that to occur again for another week or two.
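
A monitor of the kind described above could look roughly like the following. This is only a sketch, written in Python rather than bash, and it rests on assumptions that are not confirmed anywhere in this thread: that the client shows up in /proc with a command name containing "dnetc", and that the host normally runs at most `expected` such processes. The function names are made up for illustration.

```python
import os


def list_processes(proc_root="/proc"):
    """Return (pid, command name) pairs read from a Linux /proc tree."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                found.append((int(entry), fh.read().strip()))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return found


def matching_pids(processes, needle="dnetc"):
    """Return the sorted PIDs whose command name contains `needle`.

    The default needle "dnetc" is an assumption about the client's name.
    """
    return sorted(pid for pid, name in processes if needle in name)


def check(expected=1, needle="dnetc"):
    """Print a warning when more matching processes run than expected."""
    pids = matching_pids(list_processes(), needle)
    if len(pids) > expected:
        print(f"WARNING: {len(pids)} '{needle}' processes "
              f"(expected at most {expected}): {pids}")
    return pids
```

Calling `check()` from cron every few minutes would flag an extra client instance shortly after it appears, while its parent process can still be inspected.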

Thanks for your reply, though.
ID: 804 · Rating: 0
Beyond
Joined: 18 May 11
Posts: 46
Credit: 1,254,302,893
RAC: 0
Message 827 - Posted: 23 Jul 2011, 12:54:32 UTC - in response to Message 800.  

I have observed a couple of intermittent failed WUs on a Linux_AMD64 machine.

These failed WUs occurred very far apart.

I was unable to work out what exactly was causing the problem until one occurred while I was watching.

Normally I can manually suspend and resume a WU without any problems.

On this occasion the BOINC Manager suspended all projects to execute the scheduled CPU Benchmark cycle.

At the start of that event the WU failed.

It was approximately 80% through.

It may be related to the problem I posted here:

http://moowrap.net/forum_thread.php?id=91

WUs that are stopped and restarted are marked invalid. All other WUs validate fine.
ID: 827 · Rating: 0
cisf
Joined: 23 May 11
Posts: 1
Credit: 59,590
RAC: 0
Message 828 - Posted: 23 Jul 2011, 14:01:53 UTC

I see the same issues on my system also:

20-Jul-2011 17:44:37 [---] Running CPU benchmarks
20-Jul-2011 17:44:37 [---] Suspending computation - CPU benchmarks in progress
20-Jul-2011 17:44:38 [Moo! Wrapper] Computation for task dnetc_r72_1311076534_2_32_0 finished
20-Jul-2011 17:44:38 [Moo! Wrapper] Output file dnetc_r72_1311076534_2_32_0_0 for task dnetc_r72_1311076534_2_32_0 absent



15-Jul-2011 17:15:08 [---] Running CPU benchmarks
15-Jul-2011 17:15:08 [---] Suspending computation - CPU benchmarks in progress
15-Jul-2011 17:15:09 [Moo! Wrapper] Computation for task dnetc_r72_1310728004_2_32_0 finished
15-Jul-2011 17:15:09 [Moo! Wrapper] Output file dnetc_r72_1310728004_2_32_0_0 for task dnetc_r72_1310728004_2_32_0 absent


10-Jul-2011 17:10:55 [---] Running CPU benchmarks
10-Jul-2011 17:10:55 [---] Suspending computation - CPU benchmarks in progress
10-Jul-2011 17:10:56 [Moo! Wrapper] Computation for task dnetc_r72_1310207188_2_32_2 finished
10-Jul-2011 17:10:56 [Moo! Wrapper] Output file dnetc_r72_1310207188_2_32_2_0 for task dnetc_r72_1310207188_2_32_2 absent


linux_AMD64 (ubuntu 11.04), ati 11.6, boinc 6.12.18
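
The pattern in the excerpts above (benchmark start, then an "absent" output file one second later) can be searched for mechanically. The sketch below assumes the log lines have exactly the layout shown in this post; the function name and the 5-second window are arbitrary choices for illustration, not anything BOINC defines.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the excerpts: "DD-Mon-YYYY HH:MM:SS [source] message"
LINE = re.compile(
    r"^(\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$"
)
ABSENT = re.compile(r"^Output file \S+ for task (\S+) absent$")


def failures_near_benchmark(lines, window_seconds=5):
    """Return (task name, timestamp) for each 'output file absent' message
    logged within `window_seconds` after a 'Running CPU benchmarks' line."""
    window = timedelta(seconds=window_seconds)
    last_benchmark = None
    hits = []
    for raw in lines:
        m = LINE.match(raw.strip())
        if not m:
            continue  # blank or unrecognised line
        stamp = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
        message = m.group(3)
        if message == "Running CPU benchmarks":
            last_benchmark = stamp
            continue
        absent = ABSENT.match(message)
        if absent and last_benchmark is not None:
            delta = stamp - last_benchmark
            if timedelta(0) <= delta <= window:
                hits.append((absent.group(1), stamp))
    return hits
```

Fed the client's message log line by line, this returns only the failures that follow a benchmark run, which separates them from unrelated errors.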
ID: 828 · Rating: 0
Teemu Mannermaa
Project administrator
Project developer
Project tester
Joined: 20 Apr 11
Posts: 388
Credit: 822,356,221
RAC: 0
Message 866 - Posted: 1 Aug 2011, 3:05:22 UTC - in response to Message 800.  

On this occasion the BOINC Manager suspended all projects to execute the scheduled CPU Benchmark cycle.
At the start of that event the WU failed.


Looks like one of those dreadful signal 11 (SIGSEGV) crashes. :( This is a bug in our wrapper, which seems to get triggered when it executes a code path for some not-so-normal situation. I'd need to figure out what part that is and fix it, but this suspend/resume logic is a good pointer, so now I know where to start looking. :)

the event may also coincide with an orphaned client app process:
I just found an extra instance of the client app that had been running for 11+ hours.


I've seen this happen too, and especially on GPU it causes noticeable lag (two processes giving the same GPU work). I believe it is usually caused by BOINC Client killing our wrapper before we finish shutting down the d.net client (we first ask nicely and get nasty only if it doesn't comply). Unfortunately, BOINC Client doesn't kill our children (the d.net client) as well, so it's left running (it should end normally eventually, but can also hang).

I don't know if I can fix this in our wrapper, but there have been some BOINC Client fixes in this area. At least on Linux they should now also kill our children in this case (which, most likely, will make it lose packets, but that's another problem). Unless I was dreaming, BOINC Client might have also extended (or fixed) the delay between asking us to go away and killing us.
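
For readers unfamiliar with the "ask nicely, then get nasty" pattern and with killing children along with their parent, here is a generic POSIX illustration. It is not the Moo! Wrapper or BOINC Client code; the function names and the 10-second grace period are invented for the example. The child is started in its own process group so that one signal reaches it and anything it spawned.

```python
import os
import signal
import subprocess


def start_in_own_group(argv):
    """Start a child as leader of a new session, and thus of a new
    process group whose ID equals the child's PID."""
    return subprocess.Popen(argv, start_new_session=True)


def stop_group(proc, grace_seconds=10.0):
    """Ask the child's whole process group to exit; force it on timeout.

    Returns the child's exit status (negative signal number if a
    signal ended it).
    """
    os.killpg(proc.pid, signal.SIGTERM)      # ask nicely
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # get nasty
        proc.wait()
    return proc.returncode
```

Signalling the group rather than the single PID is what keeps a grandchild from being left behind when the process in the middle dies first. One limitation of this sketch: if the group leader exits within the grace period while a descendant ignores SIGTERM, that descendant survives.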

-w
ID: 866 · Rating: 0
