Faulty WUs
Author | Message |
---|---|
Send message Joined: 27 May 11 Posts: 10 Credit: 150,418,811 RAC: 0 |
Another thing is a packet limit error. Here's the end of one of my WU stderrs: [Dec 14 08:03:11 UTC] RC5-72: 1 packet (2.00 stats units) remains in in.r72 It happens at varying lengths, so it must have something to do with the file size, and it looks like it's adding extra computation results to the out file from ghost packets. I've been lucky so far: only 3 bad WUs in the last 2 days (all on 1 box). Another thing is that the in and out files in the slot folders are always garbled (or encrypted?) characters, but the checkpoint file is that way only sometimes and clear text at other times. This is on 3 different boxes, 1 with a 4850 and 2 with 5830s. Any thoughts? NeoMetal |
Send message Joined: 11 May 11 Posts: 26 Credit: 50,059,517 RAC: 0 |
I'm now playing with a manual fix for that issue: 1. Request a cache of 1 day of Moo WUs. 2. Shut down BOINC after all WUs have been downloaded. 3. Manually edit the file 'client_state.xml' and replace the line <max_nbytes>51200.000000</max_nbytes> with <max_nbytes>61200.000000</max_nbytes> in every Moo! <file_info> section in that file. 4. Restart BOINC. I'm quite confident that this will fix the issue, but of course only time will tell if it really does... |
Send message Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
"I'm now playing with a manual fix for that issue:" lol ... I'll detach and go to another project before going through all that work on 4 different boxes ... I can live with the 10-15 percent error rate for a few more days, but at the rate it's going that's going to go up to 25% or more, so it'll be time to move on until the problem's fixed ... |
Send message Joined: 11 May 11 Posts: 26 Credit: 50,059,517 RAC: 0 |
Actually, it looks like Teemu is already working on a solution as I'm not getting any new GPU work: 15.12.2011 01:25:05 | Moo! Wrapper | Requesting new tasks for ATI GPU 15.12.2011 01:25:07 | Moo! Wrapper | Scheduler request completed: got 0 new tasks 15.12.2011 01:25:07 | Moo! Wrapper | No tasks sent Server Status page is showing plenty of available tasks but they seem to be CPU only... |
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
Hi, Thanks for PMing me, and sorry that it took this long to notice, but I'm working on this problem now. This seems to also affect our throughput. :( This seems to be related to the upstream resending old work, which leads to high fragmentation in the work they send to us, which in turn increases the packets in our work units. The error -131 means the output file is bigger than what we accept. Since more packets mean bigger files, that'll explain why we've now hit this limit. I've increased this limit for new work we generate, and next I'll see what I can do with the existing work that keeps failing. (These will eventually fail permanently, but that can take some time, so I'd rather get them fixed or aborted centrally.) -w |
Send message Joined: 2 Oct 11 Posts: 238 Credit: 386,559,766 RAC: 11,660 |
Across 8 rigs I only have a total of 6 error -131's. HD5850 - 3 HD4870 - 2 HD4770 - 1 So either I am very lucky or it's about to hit me .... I iz also got icons! |
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
Hi hi, Looks like these big packet counts also affected our work generation (or there was some other problem) and I had to fix that as well. We seem to be running out of huge work units at the moment, but it should eventually catch up, I hope. I'll keep an eye on that. The -131 error codes should now also be fixed for any existing work units, at least as much as they can be. Until all the faulty work units have cycled from your systems there might still be some errors generated. If you are in a hurry or badly affected, a project reset to trigger a resending of work will hopefully update the WU metadata stored by your BOINC Client. -w |
Send message Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
Thanks Teemu. I'll try that reset again. |
Send message Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
I don't think I've had a Computation Error since Aborting all work & slowly getting new work, can't be positive on that though but don't think I have ... Thanks Teemu Seems the Wu's are taking a longer time to finish though ... |
Send message Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
I think we were seeing 2 things happening - 1. The -131 error, where a WU went full time but errored, so no credit (for me this rose from 5% towards 10%+). Hopefully, as Teemu said, this may now be cured, but I need to leave it 24 hours to check (I did a project reset like Ste\/e). 2. The WUs seemed to need about 5% to 10% more crunch time, and were getting a marginally lower credit at 6.5K rather than 7K. I think the latter is what Ste\/e was posting about. I can confirm the same on 2 rigs. |
Send message Joined: 2 May 11 Posts: 53 Credit: 255,355,741 RAC: 6,779 |
Although the actual number of faulty work units was not overly large (about 10% seems to be the number), my performance here has dropped markedly. My run times are as good as before the errors, my throughput is the same (or appears to be the same) as normal, and the amount awarded per result seems to be the same. Yet my RAC is dropping faster every day, which corresponds to the drop in output that I have noticed as well. RAC is down at least 60,000 or more; with RAC being an average figure, that shows it has been happening for a little while now. This is despite most things appearing to be running fine. I don't get why there has been such a big drop. When I checked on Boincstats, everyone has had a major drop in output. Puzzled Conan |
Send message Joined: 11 May 11 Posts: 44 Credit: 291,412,341 RAC: 0 |
I'm having the same longer crunch times. A 768 used to take about 30 minutes on a 5850 / 6950. Now it is up to 38 minutes. |
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
"Seems the Wu's are taking a longer time to finish though ..." Since there are more packets in the work units due to fragmentation, that means more context switches for the GPU when the D.net Client moves through the packets. This can slow down performance for obvious reasons, and I wouldn't be surprised to see higher CPU usage as well, just like having the client interrupted all the time can increase CPU usage due to increased GPU traffic. I'm not sure I can do anything about this other than hope we blow through these old fragmented areas quickly so we can get to fresh blocks. Block fetching for the work generation is entirely handled by non-opensource upstream code, so I'm kinda at the mercy of that code. :( That said, I'll try to look at whether there's more going on to affect our RACs, as I find that concerning. Could be it's just an effect of not catching this problem sooner. It has been going on since at least the 10th of December. :( -w |
Send message Joined: 27 May 11 Posts: 10 Credit: 150,418,811 RAC: 0 |
"Seems the Wu's are taking a longer time to finish though ..." Since these WUs are taking 10%-15% longer to complete, how about raising the credit 10%-15% to compensate, then going back to the old credit when they are finished (or not)? I know some of the other WUs are mixed in, but those would just be bonuses during this time. A simple short-term fix, I would think. NeoMetal |
Send message Joined: 11 May 11 Posts: 44 Credit: 291,412,341 RAC: 0 |
The increase in time varies from card to card. On my HD6950 it is about 10 - 15% longer, but on the older HD5850, which is normally 5% slower than the 6950, the time increased by about 30 - 35%. Both cards are optimized to run on the fastest GPU core with a free CPU core available on both PCs. My personal RAC dropped from 835k (on the 10th) to currently 775k and is still falling rapidly. I checked my failed WUs and found only 4, so this can't be the reason. |
Send message Joined: 12 Jul 11 Posts: 112 Credit: 229,191,777 RAC: 0 |
Everything all normal here now...Cool! CrunchTimes are right on, heat, the roaring hummmmm of the fans! THANKS!!! |
Send message Joined: 26 May 11 Posts: 568 Credit: 121,524,886 RAC: 0 |
About 50% of my WUs failed. Hoping for a better result. I trust that Teemu is fixing or has fixed this! |
Send message Joined: 26 May 11 Posts: 568 Credit: 121,524,886 RAC: 0 |
Slowly but surely recovering! I'm happy. |
Send message Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
The -131 errors seem to have been corrected by Teemu. Still showing marginally extended crunch times for WUs, which I hope will rectify over time. Some crunchers seem totally unaffected, which is weird. |
Send message Joined: 2 May 11 Posts: 54 Credit: 117,821,513 RAC: 0 |
I noticed a credit drop of around 500 per WU. No big deal per se, but just wondering if anyone has noticed a slight drop. Like 5% or so. |
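
For anyone who wants to try the manual workaround described earlier in the thread (raising `<max_nbytes>` from 51200 to 61200 in the Moo! `<file_info>` sections of client_state.xml while BOINC is shut down), here is a minimal Python sketch of step 3. It is only an illustration under assumptions: the two limit strings are taken verbatim from that post, but the `marker` argument and the sample file names are hypothetical, since the thread does not say how to recognise a Moo! section. Inspect your own file and choose a substring that appears only in the sections you want to change.

```python
import re

# The exact lines quoted in the thread.
OLD_LIMIT = "<max_nbytes>51200.000000</max_nbytes>"
NEW_LIMIT = "<max_nbytes>61200.000000</max_nbytes>"

# Matches one <file_info>...</file_info> section, non-greedy, across lines.
FILE_INFO = re.compile(r"<file_info>.*?</file_info>", re.DOTALL)


def raise_upload_limit(state_text, marker):
    """Return (new_text, sections_changed).

    `marker` is a hypothetical substring used to recognise the project's
    sections (for example part of a file name). It is an assumption of this
    sketch, not something stated in the thread.
    """
    changed = 0

    def fix(match):
        nonlocal changed
        section = match.group(0)
        # Only touch sections that carry the marker and the old limit.
        if marker in section and OLD_LIMIT in section:
            changed += 1
            return section.replace(OLD_LIMIT, NEW_LIMIT)
        return section

    return FILE_INFO.sub(fix, state_text), changed


# Made-up sample text; real client_state.xml sections contain more fields.
sample = """<client_state>
<file_info>
    <name>moo_example_result_0</name>
    <max_nbytes>51200.000000</max_nbytes>
</file_info>
<file_info>
    <name>other_project_result_0</name>
    <max_nbytes>51200.000000</max_nbytes>
</file_info>
</client_state>"""

patched, changed = raise_upload_limit(sample, "moo_example")
print(changed)
```

To apply this to a real file you would read client_state.xml into a string, call the function, and write the result back, after keeping a backup copy and only while BOINC is not running. Since the limit has since been raised on the server side for new work, this would only matter for tasks that were downloaded before that change.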