Faulty WUs
Author | Message |
---|---|
Joined: 11 May 11 Posts: 26 Credit: 50,059,517 RAC: 0 |
I have had a couple of computation errors on my rig over the last two days. As all of my wingmen are getting the same error, there seems to be something wrong with the corresponding WUs. Here are some examples: http://moowrap.net/workunit.php?wuid=4382462 http://moowrap.net/workunit.php?wuid=4440517 http://moowrap.net/workunit.php?wuid=4415816 All show the same error code: </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>dnetc_r72_1323456479_401_768_2_0</file_name> <error_code>-131</error_code> </file_xfer_error> </message> |
Joined: 2 May 11 Posts: 53 Credit: 255,355,741 RAC: 6,779 |
I too am getting a few of these errors. They started on the 10th, with work units created on the 9th. Both computers have shown this error. Conan |
Joined: 17 Nov 11 Posts: 23 Credit: 5,700,133 RAC: 0 |
"I was experiencing a couple of computation errors on my rig during the last two days. As all of my wingmen are getting the same error, there seems to be something wrong with the corresponding WUs. Here's some examples:" Interesting you should mention it; I am as well, and I assumed it was just me. http://moowrap.net/workunit.php?wuid=4382112 http://moowrap.net/workunit.php?wuid=4414515 http://moowrap.net/workunit.php?wuid=4400706 http://moowrap.net/workunit.php?wuid=4400704 http://moowrap.net/workunit.php?wuid=4400701 http://moowrap.net/workunit.php?wuid=4400680 I believe I had others; they may have been cleared out by Moo. A few of my wingmen also had errors. |
Joined: 5 May 11 Posts: 233 Credit: 351,414,150 RAC: 0 |
An additional factor for WUs that needs checking out on the server: in the last 12 hours I have been getting not so much faulty ones as a vast array of different stats unit sizes. I used to get 768s, which were the best for a twin 5970. Now it's all shapes and sizes, leading to inefficient use of the 4 GPUs. Something appears to have changed, or not been re-enabled, in the method used to select the WU stats unit sizes when splitting out WUs to the different GPU types. Regards Zy |
Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
I can confirm there seems to be something, not too vital, going on. My 5970 appears to be slowing down by about 100+ seconds per WU. No other observation, though. |
Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
I've had about 20 error out on 4 different boxes over the last 3 days, plus about the same number marked invalid. The ones that error out appear to run for their normal run time but then error out for some reason. Another odd thing: while watching one box, the WUs were over 90% done in 10 minutes but then took 6-8 minutes more to finish the last 10%. I watched 4-5 in a row do this. I'm also noticing what Zydor said: WU times are all over the place; one will run 14 minutes, the next 20, then maybe 16, etc. ... STE\/E |
Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
My HD5850 seems to be slowing down as well. The WU crunch times have risen slightly from 2,100-2,350 seconds (lower end for 6,500 credits) to 2,350-2,450 seconds for 6,550 credits. It used to be 7,000 credits for this time. The HD5970 has risen to 1,150 seconds for 6,500 credits, not 7,000 as it was. Something is definitely going on. |
Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
Getting worse: 5 each on the 10th and 11th and 17 today, the 12th; I already have 1 dated the 13th ... |
Joined: 17 Nov 11 Posts: 23 Credit: 5,700,133 RAC: 0 |
Yikes, those are huge numbers compared to my bad ones. I would have thought someone from the project would have shared what the issue is and when to expect a fix; or at least an acknowledgement of an issue and 'we're working on it'. |
Joined: 10 May 11 Posts: 17 Credit: 8,125,868,606 RAC: 21,940,970 |
Something has changed. The log file (stderr) of the new work units is longer than it was some time ago (check a new one against an old one and see for yourself). See http://moowrap.net/result.php?resultid=4876874 and http://moowrap.net/result.php?resultid=5451946. It is stepping 1 packet at a time instead of 64. Also, the pattern I see when looking at GPU usage with Afterburner is different (less linear). |
Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
17 more computation errors today already. I'm running all 4 boxes at their stock speeds but the errors keep coming ... :/ |
Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
More than 5% of my output is erroring, with the WU crunching for the full time and then reporting errors. Also, the time to crunch a WU is growing quite considerably. If this continues I will detach and reattach. Overall AC has dropped from 810K+ since the longer crunch times started. |
Joined: 24 Nov 11 Posts: 1 Credit: 60,996,973 RAC: 0 |
Hello, same error code here (-131). All work units with computation errors have a very long stderr log file, but it contains only result and computation information. There is no error information or anything like that, apart from at the end. I'm lucky that there are only 10 work units with computation errors. All errors occurred on the same machine, the one with an ATI Radeon HD5870. Maybe this can help to solve the problem. DeeKay |
Joined: 2 May 11 Posts: 53 Credit: 255,355,741 RAC: 6,779 |
This "File Transfer Error" -131 is starting to cause problems. I have had 50-odd of these errors in the last few days, and they are increasing each day. The WU types are all different; run times vary, but they always did, so I can't prove anything there. The errors seem to be on work units that have over 300 packets in them (I only had a cursory glance, but anything under 200 packets seems to be OK). My RAC normally goes up and down with the WU types that I crunch from other projects, but gradually, changing by a few tens of thousands over many days and then back up again. Now it is dropping like a stone (over 20,000 in the last day), which seems very odd. EDIT: I was looking at my screen after closing a web session and caught a MOO work unit getting a computation error: "Output File dnetc_r72_1323592031_308_768_4_0 for task dnetc_r72_1323592031_308_768_4 Exceeds Size Limit FileSize: 54240.000000 Bytes. Limit: 51200.000000 Bytes". So this could be the problem: the output file is 3,040 bytes over the limit, and it shows as a failed upload and file transfer error when reported. Conan |
Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
Frustrating. By day I've had 4, 5, 18, 26, and already 14 today, so I expect more than 26 today. No contact from the project; it doesn't take more than 5 seconds to come in and say "we're working on it, guys", but then they may not even be aware of the problem. Not a good way to run a project ... |
Joined: 11 May 11 Posts: 4 Credit: 263,352,134 RAC: 0 |
I am seeing all the symptoms: file upload errors, short WU times, long WU times, etc. My daily output has dropped by 150K. One box seems to lead the plunge (but all are taking hits), and it is no different from the others (3 of the same GPUs). I reduced my cache when this problem started a few days ago. Last night (and this morning) I 'detached' and 'reattached' to the project on 2 boxes. Let's see what happens... |
Joined: 12 Jul 11 Posts: 112 Credit: 229,191,777 RAC: 0 |
Just caught it, 100 WUs trashed. Same -131 code, only affecting my 2 5870s and NOT my GTS250 or 4670. Stopped crunching and aborted the rest. Teemu must be away; he is very good about catching and fixing stuff. Moo had this problem a few long months ago?? |
Joined: 27 Jul 11 Posts: 342 Credit: 252,653,488 RAC: 0 |
Has anyone PMed Teemu about this problem yet? When I did the other day, on a different matter, he responded very quickly. Unless he looks into the forum he may not be aware of the problem! |
Joined: 2 May 11 Posts: 57 Credit: 250,035,598 RAC: 0 |
I just PM'ed him about the problem ... |
Joined: 11 May 11 Posts: 26 Credit: 50,059,517 RAC: 0 |
Darn! 3 out of my last 5 WUs ended with error_code -131. I'm putting the project to 'no new work' now ... |
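
For anyone wanting to check their own failed results, the two pieces of evidence in this thread can be tied together in a few lines of Python. This is only a rough sketch, not anything from the project: it parses the `<file_xfer_error>` snippet from the first post and compares the file size and limit from the client message Conan caught. The function names are made up for illustration, and reading -131 as BOINC's "file too big" error is an assumption based on the BOINC client's error numbers; nobody from the project has confirmed it in this thread.

```python
import xml.etree.ElementTree as ET

# Error snippet as quoted in the first post (the stray </stderr_txt> is dropped).
SNIPPET = """<message>upload failure: <file_xfer_error>
  <file_name>dnetc_r72_1323456479_401_768_2_0</file_name>
  <error_code>-131</error_code>
</file_xfer_error></message>"""

# ASSUMPTION: -131 is taken to be BOINC's ERR_FILE_TOO_BIG; not confirmed here.
ASSUMED_FILE_TOO_BIG = -131

# Values from the client message Conan quoted, in bytes.
REPORTED_SIZE = 54240
REPORTED_LIMIT = 51200


def parse_xfer_error(snippet):
    """Return (file_name, error_code) from a <message> block (hypothetical helper)."""
    root = ET.fromstring(snippet)
    err = root.find("file_xfer_error")
    return err.findtext("file_name").strip(), int(err.findtext("error_code"))


def exceeds_limit(size, limit):
    """True when the output file is larger than the allowed upload size."""
    return size > limit


if __name__ == "__main__":
    name, code = parse_xfer_error(SNIPPET)
    print(name, code, code == ASSUMED_FILE_TOO_BIG)
    print("over limit:", exceeds_limit(REPORTED_SIZE, REPORTED_LIMIT))
    print("bytes over:", REPORTED_SIZE - REPORTED_LIMIT)
```

If the size check comes out true for the failing tasks, that would fit Conan's observation that the errors cluster on work units with many packets, since the longer stderr logs people are reporting suggest the output grows with the packet count. That link is a guess from the posts above and would need confirming on the server side.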