Faulty WUs


Message boards : Number crunching : Faulty WUs

NeoMetal*
Joined: 27 May 11
Posts: 10
Credit: 150,418,811
RAC: 0
Message 1734 - Posted: 14 Dec 2011, 23:21:53 UTC - in response to Message 1708.  
Last modified: 14 Dec 2011, 23:27:32 UTC


EDIT:: (I was looking at my screen after closing a web session and caught a MOO work unit get a computation error.
"Output File dnetc_r72_1323592031_308_768_4_0 for task dnetc_r72_1323592031_308_768_4 Exceeds Size Limit
FileSize: 54240.000000 Bytes. Limit: 51200.000000 Bytes".
So this could be the problem, it shows as a Failed Upload and File Transfer Error when reported)

Conan.


Another thing is a packet limit error. Here's the end of one of my WU stderrs:


[Dec 14 08:03:11 UTC] RC5-72: 1 packet (2.00 stats units) remains in in.r72
Projected ideal time to completion: 0.00:00:04.00
[Dec 14 08:03:11 UTC] RC5-72: 387 packets (764.00 stats units) are in
out.r72
[Dec 14 08:03:20 UTC] RC5-72: Completed CA:5BBA5EB5:00000000 (3.00 stats units)
0.00:00:08.45 - [1,523,938,721 keys/s]
[Dec 14 08:03:20 UTC] RC5-72: Loaded CA:5BBA5ECC:00000000:2*2^32
[Dec 14 08:03:20 UTC] RC5-72: Summary: 388 packets (767.00 stats units)
0.00:35:59.60 - [1,525.39 Mkeys/s]
[Dec 14 08:03:20 UTC] RC5-72: 0 packets remain in in.r72
[Dec 14 08:03:20 UTC] RC5-72: 388 packets (767.00 stats units) are in
out.r72
[Dec 14 08:03:27 UTC] RC5-72: Completed CA:5BBA5ECC:00000000 (2.00 stats units)
0.00:00:05.66 - [1,516,852,305 keys/s]
[Dec 14 08:03:27 UTC] Shutdown - packet limit exceeded.
[Dec 14 08:03:27 UTC] RC5-72: Summary: 389 packets (769.00 stats units)
0.00:36:05.26 - [1,525.37 Mkeys/s]
[Dec 14 08:03:27 UTC] RC5-72: 0 packets remain in in.r72
[Dec 14 08:03:27 UTC] RC5-72: 389 packets (769.00 stats units) are in
out.r72
[Dec 14 08:03:27 UTC] *Break* Shutting down...
[Dec 14 08:03:27 UTC] Shutdown complete.
00:03:27 (1444): called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>dnetc_r72_1323738425_389_769_1_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>


It happens at varying lengths, so it must have something to do with the file size. It also looks like it's adding extra computation results to the out file from ghost packets.
I've been lucky so far: only 3 bad WUs in the last 2 days (all on 1 box). Another thing: the in and out files in the slot folders always contain garbled (or encrypted?) characters, but the checkpoint file is garbled only sometimes and clear text at other times. This is on 3 different boxes, 1 with a 4850 and 2 with 5830s. Any thoughts?
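
The file-size theory above could be checked by pulling the final packet count and the upload error code out of each failed task's stderr and comparing them across tasks. A minimal sketch follows; the function name `summarize_stderr` and the regular expressions are illustrative assumptions based only on the log lines quoted in this post, not on any BOINC or D.net tool.

```python
import re

# Patterns are guesses modelled on the stderr excerpt quoted in this post.
SUMMARY_RE = re.compile(r"RC5-72: Summary: (\d+) packets \(([\d.]+) stats units\)")
ERROR_RE = re.compile(r"<error_code>(-?\d+)</error_code>")


def summarize_stderr(text):
    """Return the last reported packet total and any upload error code."""
    summaries = SUMMARY_RE.findall(text)
    packets, units = (None, None)
    if summaries:
        # The last Summary line reflects the final state of out.r72.
        packets, units = int(summaries[-1][0]), float(summaries[-1][1])
    error = ERROR_RE.search(text)
    return {
        "packets": packets,
        "stats_units": units,
        "packet_limit_hit": "packet limit exceeded" in text,
        "error_code": int(error.group(1)) if error else None,
    }


# Three lines copied from the excerpt above.
SAMPLE = """\
[Dec 14 08:03:27 UTC] Shutdown - packet limit exceeded.
[Dec 14 08:03:27 UTC] RC5-72: Summary: 389 packets (769.00 stats units)
<error_code>-131</error_code>
"""

print(summarize_stderr(SAMPLE))
```

Collecting these values for both good and failed tasks would show whether failures cluster above some packet count.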

NeoMetal
ID: 1734

Senilix
Joined: 11 May 11
Posts: 26
Credit: 50,059,517
RAC: 0
Message 1735 - Posted: 15 Dec 2011, 0:05:17 UTC

I'm now playing with a manual fix for that issue:

1. Request a cache of 1 day of Moo WUs.
2. Shut down BOINC after all WUs have been downloaded.
3. Manually edit the file 'client_state.xml' and replace the line
<max_nbytes>51200.000000</max_nbytes>

by
<max_nbytes>61200.000000</max_nbytes>

for every Moo! <file_info> section in that file.
4. Restart BOINC.

I'm quite confident that this will fix the issue but of course only time will tell if it really does...
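
For anyone with several boxes, step 3 could be scripted roughly as below. This is only a sketch under stated assumptions: the helper `raise_max_nbytes` is hypothetical, the `dnetc_` file-name prefix is inferred from the task names quoted in this thread, and it edits the text with regular expressions instead of an XML parser. BOINC must be shut down before the file is touched.

```python
import re

# Each <file_info> block is matched as plain text (assumed layout).
FILE_INFO_RE = re.compile(r"<file_info>.*?</file_info>", re.DOTALL)
NAME_RE = re.compile(r"<name>\s*([^<]+?)\s*</name>")


def raise_max_nbytes(state_text, old="51200.000000", new="61200.000000",
                     name_prefix="dnetc_"):
    """Return (patched_text, number_of_blocks_changed)."""
    old_tag = f"<max_nbytes>{old}</max_nbytes>"
    new_tag = f"<max_nbytes>{new}</max_nbytes>"
    changed = 0

    def patch(match):
        nonlocal changed
        block = match.group(0)
        name = NAME_RE.search(block)
        # Only touch blocks whose file name looks like a Moo! output file.
        if name and name.group(1).startswith(name_prefix) and old_tag in block:
            changed += 1
            return block.replace(old_tag, new_tag)
        return block

    return FILE_INFO_RE.sub(patch, state_text), changed


# Hypothetical usage (the path depends on your BOINC data directory):
# from pathlib import Path
# path = Path("client_state.xml")
# text = path.read_text()
# patched, count = raise_max_nbytes(text)
# path.with_name(path.name + ".bak").write_text(text)  # keep a backup first
# path.write_text(patched)
```

Blocks belonging to other projects are left untouched, and the returned count makes it easy to see whether anything was actually changed.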
ID: 1735

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1736 - Posted: 15 Dec 2011, 0:15:02 UTC - in response to Message 1735.  

I'm now playing with a manual fix for that issue:

1. Request a cache of 1 day of Moo WUs.
2. Shut down BOINC after all WUs have been downloaded.
3. Manually edit the file 'client_state.xml' and replace the line
<max_nbytes>51200.000000</max_nbytes>

by
<max_nbytes>61200.000000</max_nbytes>

for every Moo! <file_info> section in that file.
4. Restart BOINC.

I'm quite confident that this will fix the issue but of course only time will tell if it really does...


lol ... I'll detach and go to another project before going through all that work on 4 different boxes ... I can live with the 10-15 percent error rate for a few more days, but at the rate it's going that will go up to 25% or more, so it'll be time to move on until the problem's fixed ...

ID: 1736

Senilix
Joined: 11 May 11
Posts: 26
Credit: 50,059,517
RAC: 0
Message 1737 - Posted: 15 Dec 2011, 0:25:05 UTC

Actually, it looks like Teemu is already working on a solution as I'm not getting any new GPU work:
15.12.2011 01:25:05 | Moo! Wrapper | Requesting new tasks for ATI GPU
15.12.2011 01:25:07 | Moo! Wrapper | Scheduler request completed: got 0 new tasks
15.12.2011 01:25:07 | Moo! Wrapper | No tasks sent

Server Status page is showing plenty of available tasks but they seem to be CPU only...
ID: 1737

Teemu Mannermaa
Project administrator
Project developer
Project tester
Joined: 20 Apr 11
Posts: 388
Credit: 822,356,221
RAC: 0
Message 1745 - Posted: 15 Dec 2011, 8:34:17 UTC

Hi,

Thanks for PMing me, and sorry that it took this long to notice, but I'm working on this problem now. This also seems to affect our throughput. :(

This seems to be related to the upstream resending old work which leads to high fragmentation in the work they send to us which in turn increases the packets in our work units.

The error -131 means the output file is bigger than what we accept. Since more packets mean bigger files, that'll explain why we've now hit this limit. I've increased this limit for new work we generate and next I'll see what I can do with the existing work that keeps failing. (These will eventually fail permanently but that can take some time so I'd rather get them fixed or aborted centrally.)
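
As a rough illustration of the "more packets mean bigger files" point, the one size figure reported earlier in this thread can be turned into a per-packet estimate. This is a back-of-the-envelope sketch, not the project's actual sizing logic. It assumes that the `308` in the task name `dnetc_r72_1323592031_308_768_4` is the packet count (the 389-packet task and its stderr summary suggest this naming, but it is not confirmed) and that file size grows in direct proportion to packet count, ignoring any fixed header.

```python
import math

# Single data point reported in this thread: a 54240-byte output file
# for a task whose name suggests 308 packets (ASSUMPTION, see lead-in).
OBSERVED_BYTES = 54240
OBSERVED_PACKETS = 308
OLD_LIMIT_BYTES = 51200  # the limit quoted in the error message


def bytes_per_packet():
    """Average output bytes per packet under the proportional model."""
    return OBSERVED_BYTES / OBSERVED_PACKETS


def max_packets(limit_bytes):
    """Largest packet count whose estimated output fits under the limit."""
    return math.floor(limit_bytes / bytes_per_packet())


def estimated_size(packets):
    """Estimated output file size in bytes for a given packet count."""
    return packets * bytes_per_packet()


print(round(bytes_per_packet(), 1))   # estimated bytes per packet
print(max_packets(OLD_LIMIT_BYTES))   # packets that fit under the old limit
print(round(estimated_size(389)))     # estimate for the 389-packet task
```

Under these assumptions a packet costs about 176 bytes, so the old 51200-byte limit would be hit somewhere around 290 packets, which would be consistent with the 308- and 389-packet tasks failing.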

-w
ID: 1745

Chris S
Joined: 2 Oct 11
Posts: 238
Credit: 386,562,838
RAC: 11,644
Message 1753 - Posted: 15 Dec 2011, 13:21:35 UTC

Across 8 rigs I only have a total of 6 error -131's.

HD5850 - 3
HD4870 - 2
HD4770 - 1

So either I am very lucky or it's about to hit me ....
I iz also got icons!



ID: 1753

Teemu Mannermaa
Project administrator
Project developer
Project tester
Joined: 20 Apr 11
Posts: 388
Credit: 822,356,221
RAC: 0
Message 1757 - Posted: 15 Dec 2011, 15:45:06 UTC

Hi hi,

Looks like these big packet counts also affected our work generation (or there was some other problem) and I had to fix that as well. We seem to be running out of huge work units at the moment but it should eventually catch up, I hope. I'll keep an eye on that.

The -131 error codes should now also be fixed for existing work units, at least as much as they can be. Until all the faulty work units have cycled off your systems, there might still be some errors generated. If you are in a hurry or badly affected, a project reset to trigger a resending of work will hopefully update the WU metadata stored by your BOINC Client.

-w
ID: 1757

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1758 - Posted: 15 Dec 2011, 17:24:23 UTC

Thanks Teemu.

I'll try that reset again.
ID: 1758

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1759 - Posted: 15 Dec 2011, 21:16:33 UTC
Last modified: 15 Dec 2011, 21:17:44 UTC

I don't think I've had a Computation Error since aborting all work and slowly getting new work. I can't be positive about that, but I don't think I have ... Thanks Teemu

Seems the WUs are taking longer to finish though ...
ID: 1759

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1760 - Posted: 15 Dec 2011, 21:33:21 UTC

I think we were seeing 2 things happening -

1. The -131 error, where a WU ran its full time but errored, so no credit was granted (for me this rose from 5% towards 10%+). Hopefully, as Teemu said, this may now be cured, but I need to leave it 24 hours to check (I did a project reset like Ste\/e).

2. The WUs seemed to need about 5% to 10% more crunch time, and were getting a marginally lower credit at 6.5K rather than 7K.

I think the latter is what Ste\/e was posting about. I can confirm the same on 2 rigs.
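
To see how the two effects above might combine, here is a small sketch of the resulting credit per hour relative to before. It uses only the figures from this post (5% to 10% errors, 5% to 10% more crunch time, 6.5K instead of 7K credit) and assumes that an errored WU runs its full time and earns nothing; the function name is made up for illustration.

```python
def relative_credit_rate(credit_before, credit_after, slowdown, error_rate):
    """Fraction of the earlier credit-per-hour that remains.

    slowdown:   extra crunch time per WU, e.g. 0.10 for 10% longer
    error_rate: share of WUs that run to the end but earn no credit
    """
    credit_ratio = credit_after / credit_before
    return credit_ratio * (1.0 - error_rate) / (1.0 + slowdown)


# Milder and harsher ends of the ranges quoted above.
best = relative_credit_rate(7000, 6500, slowdown=0.05, error_rate=0.05)
worst = relative_credit_rate(7000, 6500, slowdown=0.10, error_rate=0.10)
print(f"{best:.3f} {worst:.3f}")
```

With these inputs the remaining rate comes out at roughly 76% to 84% of the earlier level, so three individually small effects can add up to a noticeable drop.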
ID: 1760

Conan
Joined: 2 May 11
Posts: 53
Credit: 255,356,125
RAC: 6,763
Message 1761 - Posted: 15 Dec 2011, 23:54:57 UTC

Although the actual number of faulty work units was not overly large (about 10% seems to be the number), my performance here has dropped markedly.

My run times are as good as before the errors, my throughput is the same (or appears to be the same) as normal, and the amount awarded per result seems to be the same.
Yet my RAC is dropping faster every day, which corresponds to the drop in output that I have noticed as well.
RAC is down by at least 60,000; since RAC is an average figure, this shows it has been happening for a little while now.
This is despite most things appearing to run fine.

I don't get why there has been such a big drop.
When I checked on Boincstats everyone has had a major drop in output.

Puzzled

Conan
ID: 1761

Copycat-Digital for WCG*
Joined: 11 May 11
Posts: 44
Credit: 291,412,341
RAC: 0
Message 1762 - Posted: 16 Dec 2011, 0:42:47 UTC

I'm having the same longer crunch times
A 768 used to take about 30 minutes on a 5850 / 6950
Now it is up to 38 minutes
ID: 1762

Teemu Mannermaa
Project administrator
Project developer
Project tester
Joined: 20 Apr 11
Posts: 388
Credit: 822,356,221
RAC: 0
Message 1765 - Posted: 16 Dec 2011, 6:21:18 UTC - in response to Message 1759.  

Seems the WUs are taking longer to finish though ...


Since there are more packets in work units due to fragmentation, there are more context switches for the GPU as the D.net Client moves through the packets. This can slow down performance, and I wouldn't be surprised to see higher CPU usage as well, just as having the client interrupted all the time can increase CPU usage due to increased GPU traffic.

I'm not sure I can do anything about this other than hope we blow through these old fragmented areas quickly so we can get to fresh blocks. Block fetching for the work generation is entirely handled by non-opensource upstream code, so I'm kinda at the mercy of that code. :(

That said, I'll try to look if there's more going on to affect our RACs, as I find that concerning. Could be it's just an effect of not catching this problem sooner. It has been going on since at least the 10th of December. :(

-w
ID: 1765

NeoMetal*
Joined: 27 May 11
Posts: 10
Credit: 150,418,811
RAC: 0
Message 1767 - Posted: 16 Dec 2011, 7:05:58 UTC - in response to Message 1765.  

Seems the WUs are taking longer to finish though ...


Since there are more packets in work units due to fragmentation, there are more context switches for the GPU as the D.net Client moves through the packets. This can slow down performance, and I wouldn't be surprised to see higher CPU usage as well, just as having the client interrupted all the time can increase CPU usage due to increased GPU traffic.

I'm not sure I can do anything about this other than hope we blow through these old fragmented areas quickly so we can get to fresh blocks. Block fetching for the work generation is entirely handled by non-opensource upstream code, so I'm kinda at the mercy of that code. :(

That said, I'll try to look if there's more going on to affect our RACs, as I find that concerning. Could be it's just an effect of not catching this problem sooner. It has been going on since at least the 10th of December. :(

-w

Since these WUs are taking 10%-15% longer to complete, how about raising the credit 10%-15% to compensate, then going back to the old credit when they are finished (or not)? I know some of the other WUs are mixed in, but those would just be bonuses during this time. A simple short-term fix, I would think.

NeoMetal
ID: 1767

Copycat-Digital for WCG*
Joined: 11 May 11
Posts: 44
Credit: 291,412,341
RAC: 0
Message 1768 - Posted: 16 Dec 2011, 10:13:02 UTC

The increase in time varies from card to card.
On my HD6950 WUs take about 10-15% longer, but on the older HD5850, which is normally 5% slower than the 6950, the time increased by about 30-35%.
Both cards are optimized to run on the fastest GPU core with a free CPU core available on both PCs
My personal RAC dropped from 835k (on the 10th) to currently 775k & is still falling rapidly
I checked my failed WUs & found only 4 so this can't be the reason
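
The RAC figures above can be turned into a rough estimate of the current daily output. BOINC's RAC is an exponentially decaying average with a half-life of one week, so it lags behind any change. The sketch below is a simplified model, not BOINC's actual update code: it assumes RAC was at a steady 835k on the 10th, that daily credit then stepped down to a new constant level, and that 6 days have passed. Both function names are made up for illustration.

```python
HALF_LIFE_DAYS = 7.0  # RAC half-life in BOINC is one week


def rac_after(days, rac_start, new_daily_credit):
    """RAC after `days` at a new constant daily credit (simplified model)."""
    decay = 2.0 ** (-days / HALF_LIFE_DAYS)
    return new_daily_credit + (rac_start - new_daily_credit) * decay


def implied_daily_credit(days, rac_start, rac_now):
    """Invert rac_after(): the constant daily credit that explains rac_now."""
    decay = 2.0 ** (-days / HALF_LIFE_DAYS)
    return (rac_now - rac_start * decay) / (1.0 - decay)


# Figures from this post: 835k on the 10th, 775k six days later.
implied = implied_daily_credit(6, 835_000, 775_000)
print(round(implied), f"{1 - implied / 835_000:.1%}")
```

Under these assumptions the 60k fall in RAC would correspond to a sustained output of roughly 700k per day, a drop of about 16%. That is in the same range as the longer crunch times reported here, so a handful of failed WUs need not be the explanation.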
ID: 1768

SLAYER OF DEATH
Joined: 12 Jul 11
Posts: 112
Credit: 229,191,777
RAC: 0
Message 1769 - Posted: 16 Dec 2011, 10:46:51 UTC

Everything all normal here now...Cool! Crunch times are right on, heat, the roaring hummmmm of the fans!
THANKS!!!
ID: 1769

Bernt
Joined: 26 May 11
Posts: 568
Credit: 121,524,886
RAC: 0
Message 1770 - Posted: 16 Dec 2011, 11:55:52 UTC

About 50% of my WUs failed. Hoping for better results. I trust that Teemu is fixing/has fixed this!
ID: 1770

Bernt
Joined: 26 May 11
Posts: 568
Credit: 121,524,886
RAC: 0
Message 1777 - Posted: 16 Dec 2011, 22:52:50 UTC - in response to Message 1770.  
Last modified: 16 Dec 2011, 22:53:42 UTC

Slowly but surely recovering!
I´m happy.
ID: 1777

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1780 - Posted: 17 Dec 2011, 9:09:50 UTC

The -131 errors seem to have been corrected by Teemu.

Still showing marginally extended crunch times for WUs, which I hope will rectify over time.

Some crunchers seem totally unaffected, which is weird.
ID: 1780

DigitalDingus
Joined: 2 May 11
Posts: 54
Credit: 117,821,513
RAC: 0
Message 1784 - Posted: 17 Dec 2011, 14:59:38 UTC

I noticed a credit drop of around 500 per WU. No big deal per se, but just wondering if anyone has noticed a slight drop. Like 5% or so.
ID: 1784


 
Copyright © 2011-2024 Moo! Wrapper Project