Faulty WUs

Senilix

Joined: 11 May 11
Posts: 26
Credit: 50,059,517
RAC: 0
Message 1684 - Posted: 12 Dec 2011, 0:02:28 UTC

I was experiencing a couple of computation errors on my rig during the last two days. As all of my wingmen are getting the same error, there seems to be something wrong with the corresponding WUs. Here's some examples:
http://moowrap.net/workunit.php?wuid=4382462
http://moowrap.net/workunit.php?wuid=4440517
http://moowrap.net/workunit.php?wuid=4415816
All are showing the same error code:
</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>dnetc_r72_1323456479_401_768_2_0</file_name>
  <error_code>-131</error_code>
</file_xfer_error>

</message>
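
For anyone who wants to scan their own task output for this failure, here is a minimal Python sketch that pulls the file name and error code out of a fragment like the one above. The function name `parse_xfer_errors` is made up for illustration, and a regular expression is used instead of an XML parser because the pasted fragment is not a complete XML document.

```python
import re

# Matches one <file_xfer_error> block and captures the file name and error code.
XFER_ERROR_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def parse_xfer_errors(text):
    """Return a list of (file_name, error_code) tuples found in text."""
    return [
        (m.group("name"), int(m.group("code")))
        for m in XFER_ERROR_RE.finditer(text)
    ]


# The fragment exactly as posted in this thread.
sample = """</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>dnetc_r72_1323456479_401_768_2_0</file_name>
  <error_code>-131</error_code>
</file_xfer_error>

</message>"""

print(parse_xfer_errors(sample))
# prints [('dnetc_r72_1323456479_401_768_2_0', -131)]
```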

Conan
Joined: 2 May 11
Posts: 53
Credit: 255,355,933
RAC: 6,764
Message 1686 - Posted: 12 Dec 2011, 1:02:29 UTC

I too am getting a few of these errors, started happening on the 10th with work units created on the 9th.
Both computers have shown this error.

Conan

Philadelphia
Joined: 17 Nov 11
Posts: 23
Credit: 5,700,133
RAC: 0
Message 1688 - Posted: 12 Dec 2011, 3:50:35 UTC - in response to Message 1684.  
Last modified: 12 Dec 2011, 3:55:42 UTC

I was experiencing a couple of computation errors on my rig during the last two days. As all of my wingmen are getting the same error, there seems to be something wrong with the corresponding WUs. Here's some examples:
http://moowrap.net/workunit.php?wuid=4382462
http://moowrap.net/workunit.php?wuid=4440517
http://moowrap.net/workunit.php?wuid=4415816
All are showing the same error code:
</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>dnetc_r72_1323456479_401_768_2_0</file_name>
  <error_code>-131</error_code>
</file_xfer_error>

</message>


Interesting you should mention it; I am seeing these as well. I assumed it was just me.

http://moowrap.net/workunit.php?wuid=4382112
http://moowrap.net/workunit.php?wuid=4414515
http://moowrap.net/workunit.php?wuid=4400706
http://moowrap.net/workunit.php?wuid=4400704
http://moowrap.net/workunit.php?wuid=4400701
http://moowrap.net/workunit.php?wuid=4400680

I believe I had others, but they may have been cleared out by Moo. A few of my wingmen also had errors.

Zydor
Joined: 5 May 11
Posts: 233
Credit: 351,414,150
RAC: 0
Message 1690 - Posted: 12 Dec 2011, 7:32:00 UTC
Last modified: 12 Dec 2011, 7:33:00 UTC

An additional factor for WUs that needs checking out on the server .... in the last 12 hours I have been getting not so much faulty ones as a vast array of different stats unit sizes. I used to get 768s, which were the best for a twin 5970. Now it's all shapes and sizes, leading to inefficient use of the 4xGPUs.

Something appears to have changed / not been re-enabled on the methodology used to select the WU stats unit sizes when splitting out WUs to the various different GPU types.

Regards
Zy

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1692 - Posted: 12 Dec 2011, 9:09:47 UTC

I can confirm there seems to be something, not too vital, going on.

My 5970 appears to be slowing down by about 100+ seconds per WU. No other observation, though.

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1694 - Posted: 12 Dec 2011, 14:25:18 UTC
Last modified: 12 Dec 2011, 14:31:52 UTC

I've had about 20 WUs error out on 4 different boxes over the last 3 days, plus about the same number marked invalid. The ones that error out appear to run for their normal run time but then error out for some reason.

Another thing that seems odd: while watching one box, the WUs were over 90% done in 10 minutes but then took 6-8 minutes more to finish the last 10%. I watched 4-5 in a row do this.

I'm also noticing what Zydor said, WU times all over the place: one will run 14 minutes, the next 20, then maybe 16, etc ... STE\/E

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1695 - Posted: 12 Dec 2011, 17:42:36 UTC

My HD5850 seems to be slowing down also.

The WU crunch times have risen slightly from 2,100-2,350 seconds (lower end for 6,500 credits) to 2,350-2,450 seconds for 6,550 credits. It used to be 7,000 credits for this time.

The HD5970 has risen to 1,150 seconds for 6,500 credits not 7,000 as it was.

Something definitely going on.

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1696 - Posted: 13 Dec 2011, 0:52:46 UTC

Getting worse: 5 each on the 10th and 11th and 17 today, the 12th. I already have 1 dated the 13th ...

Philadelphia
Joined: 17 Nov 11
Posts: 23
Credit: 5,700,133
RAC: 0
Message 1697 - Posted: 13 Dec 2011, 4:11:12 UTC - in response to Message 1696.  

Yikes, those are huge numbers compared to my bad ones.

I would have thought someone from the project would have shared what the issue is and when to expect a fix; or at least an acknowledgement of an issue and 'we're working on it'.

valterc
Joined: 10 May 11
Posts: 17
Credit: 8,126,343,078
RAC: 21,941,045
Message 1698 - Posted: 13 Dec 2011, 11:14:17 UTC
Last modified: 13 Dec 2011, 11:19:54 UTC

Something has changed.

The log file (stderr) of the new work units is longer than it was some time ago (check a new one against an old one and see for yourself).

See this http://moowrap.net/result.php?resultid=4876874 and http://moowrap.net/result.php?resultid=5451946. Stepping 1 packet at a time instead of 64.

Also the pattern I see while looking at the gpu usage with afterburner is different (less linear)
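
To get a feel for how much the step size alone could lengthen a log, here is a rough Python sketch. It assumes, purely for illustration, that the client writes one log entry per step; that is not confirmed anywhere in this thread, and `step_count` and the packet counts used are hypothetical.

```python
import math


def step_count(packets, step):
    """Number of steps needed to cover `packets` when advancing `step` at a time."""
    return math.ceil(packets / step)


# Hypothetical packet counts, compared at the two step sizes mentioned above.
for packets in (300, 768):
    old = step_count(packets, 64)  # stepping 64 packets at a time
    new = step_count(packets, 1)   # stepping 1 packet at a time
    print(f"{packets} packets: {old} steps at 64, {new} steps at 1")
```

Under that assumption, a 768-packet unit would go from 12 entries to 768, which would fit the longer logs described above.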

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1703 - Posted: 13 Dec 2011, 20:16:45 UTC

17 more Computation Errors today already. I'm running all 4 boxes at their stock speeds but the errors keep coming ... :/

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1704 - Posted: 13 Dec 2011, 21:14:21 UTC
Last modified: 13 Dec 2011, 21:15:45 UTC

More than 5% of my output is erroring, with the WU crunching to full time then reporting errors.

Also, the time to crunch a WU is growing quite considerably.

If this continues then I will detach and reattach.

Overall RAC has dropped from 810K+ since the longer crunch times started.

[SG-SPEG] DeeKay
Joined: 24 Nov 11
Posts: 1
Credit: 60,996,973
RAC: 0
Message 1705 - Posted: 13 Dec 2011, 21:50:21 UTC

Hello,

Same error code here (-131).
All the work units with computation errors have a very long stderr log file, but it contains only result and computation information. There is no error information or anything like that, apart from at the end.

I'm lucky that there are only 10 work units with computation errors.
All errors occurred on the same machine, the one with an ATI Radeon HD5870.

Maybe this can help to solve this problem.

DeeKay

Conan
Joined: 2 May 11
Posts: 53
Credit: 255,355,933
RAC: 6,764
Message 1708 - Posted: 14 Dec 2011, 6:37:50 UTC
Last modified: 14 Dec 2011, 7:09:43 UTC

This "File Transfer Error" -131 is starting to cause problems.
I have had 50-odd of these errors in the last few days, and they are increasing per day.
The WU types are all different and run times vary, but they always did, so I can't prove anything there.
The errors seem to be on work units that have over 300 packets in them (I only had a cursory glance, but anything under 200 packets seems to be OK).

My RAC normally goes up and down with the WU types that I crunch from other projects but is gradual, changing by a few 10s of thousands over many days and then back up again.
Now it is dropping like a stone (over 20,000 in the last day) which seems very odd.

EDIT:: (I was looking at my screen after closing a web session and caught a MOO work unit get a computation error.
"Output File dnetc_r72_1323592031_308_768_4_0 for task dnetc_r72_1323592031_308_768_4 Exceeds Size Limit
FileSize: 54240.000000 Bytes. Limit: 51200.000000 Bytes".
So this could be the problem, it shows as a Failed Upload and File Transfer Error when reported)

Conan.
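
The size check quoted in the edit above can be reproduced with a little arithmetic. The Python sketch below is only an illustration using the two figures from that message; the helper name `overshoot` is made up here, and this is not the BOINC client's actual code.

```python
def overshoot(size_bytes, limit_bytes):
    """Return (bytes over the limit, percentage over the limit)."""
    over = size_bytes - limit_bytes
    return over, 100 * over / limit_bytes


# Figures taken from the client message quoted above.
file_size = 54240
limit = 51200

over, pct = overshoot(file_size, limit)
if over > 0:
    print(f"Output file is {over} bytes over the limit ({pct:.1f}% too large)")
# prints "Output file is 3040 bytes over the limit (5.9% too large)"
```

So the file in that example missed the limit by only about 3 KB.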

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1709 - Posted: 14 Dec 2011, 8:58:24 UTC

Frustrating. By day I've had 4, 5, 18 and 26, with 14 already today, so I expect more than 26 today. No contact from the Project. It doesn't take more than 5 seconds to come in and say "we're working on it, guys", but then they may not even be aware of the problem. Not a good way to run a Project ...

Bender10
Joined: 11 May 11
Posts: 4
Credit: 263,352,134
RAC: 0
Message 1711 - Posted: 14 Dec 2011, 10:57:28 UTC

I am seeing all the symptoms: file upload errors, short WU times, long WU times, etc. My daily output has dropped by 150K. One box seems to lead the plunge (but all are taking hits), and it is no different from the others (3 of the same GPUs).

I reduced my cache when this problem started a few days ago. Last night (and this morning) I 'detached' and 'reattached' to the project on 2 boxes.

Let's see what happens...

SLAYER OF DEATH
Joined: 12 Jul 11
Posts: 112
Credit: 229,191,777
RAC: 0
Message 1712 - Posted: 14 Dec 2011, 11:00:51 UTC
Last modified: 14 Dec 2011, 11:28:20 UTC

Just caught it, 100 WUs trashed. Same -131 code, only affecting my two 5870s and NOT my GTS250 or 4670. Stopped crunching and aborted the rest. Teemu must be away; he is very good about catching and fixing stuff. Didn't Moo have this problem a few long months ago?

John Clark
Joined: 27 Jul 11
Posts: 342
Credit: 252,653,488
RAC: 0
Message 1723 - Posted: 14 Dec 2011, 17:00:03 UTC

Has anyone PMed Teemu about this problem yet?

When I did the other day, on a different matter, he responded very quickly.

Unless he looks into the forum he may not be aware of the problem!

STE\/E
Joined: 2 May 11
Posts: 57
Credit: 250,035,598
RAC: 0
Message 1729 - Posted: 14 Dec 2011, 19:24:17 UTC

I just PMed him about the problem ...

Senilix
Joined: 11 May 11
Posts: 26
Credit: 50,059,517
RAC: 0
Message 1733 - Posted: 14 Dec 2011, 23:02:33 UTC

Darn! 3 out of my last 5 WUs ended with error_code -131.

I'm putting the project to 'no new work' now ...

 
Copyright © 2011-2024 Moo! Wrapper Project