2 WU on cuda cards?

Semtex Joined: 14 Aug 11 Posts: 7 Credit: 20,109,913 RAC: 0
Message 929 - Posted: 24 Aug 2011, 19:19:58 UTC
Is there a way to run 2 WUs at once on a CUDA card? I modified dnetc-1.00.ini (added max-threads=2), but dnetc518-win32-x86-cuda31.exe crashed after a few minutes. Is it possible to make an app_info.xml with
<coproc>
    <type>CUDA</type>
    <count>0.50</count>
</coproc>
Thanks in advance.

mikey Joined: 22 Jun 11 Posts: 2080 Credit: 1,846,500,808 RAC: 31,411
Message 931 - Posted: 25 Aug 2011, 12:35:38 UTC - in response to Message 929.
No, Moo and Dnetc are designed to use all the GPU power on one unit.

Semtex Joined: 14 Aug 11 Posts: 7 Credit: 20,109,913 RAC: 0
Message 936 - Posted: 25 Aug 2011, 17:39:41 UTC - in response to Message 931.
Anyway, I would like to see a working app_info, if only for testing.

Semtex Joined: 14 Aug 11 Posts: 7 Credit: 20,109,913 RAC: 0
Message 940 - Posted: 27 Aug 2011, 12:29:12 UTC - in response to Message 936.
Does anybody know how to make an app_info for Moo?

Teemu Mannermaa Project administrator Project developer Project tester Joined: 20 Apr 11 Posts: 389 Credit: 822,556,349 RAC: 0
Message 955 - Posted: 31 Aug 2011, 15:38:45 UTC
Hi, Sure, I can make you one. Is there a particular reason you need one? You can limit Moo to use only the first card found if you want to run something else on the other one, but running two at the same time, like mikey said, is unfortunately not possible. -w

Semtex Joined: 14 Aug 11 Posts: 7 Credit: 20,109,913 RAC: 0
Message 956 - Posted: 31 Aug 2011, 18:52:44 UTC - in response to Message 955.
Why is it not possible? Is it a hardware or a software limitation? I have compared a GTX 465 and a GTX 470. They have the same GPU, memory and shader clock speeds, but the GTX 470 has 25% more ROPs and 27% more shaders. The time to crunch 1 WU is not 25% shorter on the GTX 470. If Moo uses all the GPU power, why don't I get a 25% gain? I am interested in trying to crunch 2 WUs at the same time to see whether there is any speed improvement. Sorry for my English, it is not my native language. Best regards.
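
A rough check of the arithmetic in the post above, sketched in Python. The 27% figure is taken from the post; the model itself (crunch time scaling exactly as 1 / shader count, with nothing else limiting speed) is an assumption made only for illustration. Under that assumption, 27% more shaders can shorten the time per WU by about 21% at best, so a full 25% cut would not be expected even with perfect scaling.

```python
# Upper bound on the per-WU time saving, ASSUMING crunch time scales as
# 1 / throughput. The 27% shader figure comes from the post above; the
# perfect-scaling model is an illustrative assumption, not a measurement.

def time_reduction(extra_throughput: float) -> float:
    """Fraction of time saved per work unit when throughput rises by
    `extra_throughput` (0.27 means 27% more)."""
    if extra_throughput <= -1.0:
        raise ValueError("extra_throughput must be greater than -1")
    return 1.0 - 1.0 / (1.0 + extra_throughput)


shader_gain = 0.27
saved = time_reduction(shader_gain)
print(f"{shader_gain:.0%} more shaders -> at most {saved:.1%} less time per WU")
```

The distinction is between "27% more work per hour" and "27% less time per WU": the first does not imply the second.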

Teemu Mannermaa Project administrator Project developer Project tester Joined: 20 Apr 11 Posts: 389 Credit: 822,556,349 RAC: 0
Message 958 - Posted: 31 Aug 2011, 21:58:25 UTC - in response to Message 956.
Semtex wrote: "Why is it not possible? Hardware or software limitation?"
We use the D.net Client, which detects cards by itself and uses all it finds by default. It has a software limitation: it can't be told to use only a specific card. There's a workaround that allows limiting the number of cards it uses (only in the order it sees them). -w

Teemu Mannermaa Project administrator Project developer Project tester Joined: 20 Apr 11 Posts: 389 Credit: 822,556,349 RAC: 0
Message 960 - Posted: 31 Aug 2011, 22:38:33 UTC
Hi, There's an example 32-bit Windows ATI cruncher app_info.xml available at http://moowrap.net/download/app_info-win32-ati14-example.xml. It might be usable as is, except that the ATI card <count> field might need adjusting. Also, the <flops> field value is based on my own host, and you should probably copy the actual value for your host from your client state file. (The value gets calculated for each host by BOINC.) You can also try without one if you can live with funky estimates until BOINC sorts itself out. :) -w

Semtex Joined: 14 Aug 11 Posts: 7 Credit: 20,109,913 RAC: 0
Message 961 - Posted: 1 Sep 2011, 5:20:49 UTC - in response to Message 960.
Thanks for trying, but it does not work. It errors out after 17 s of crunching. It is also missing cudart32_31_9.dll. I tried adding
<file_info>
    <name>cudart32_31_9.dll</name>
</file_info>
After all that, dnetc518-win32-x86-cuda31.exe does not start, only dnetc_1.02_windows_intelx86__cuda31.exe. Here is the app_info:
<app_info>
    <app>
        <name>dnetc</name>
        <user_friendly_name>Distributed.net Client</user_friendly_name>
    </app>
    <file_info>
        <name>dnetc_1.02_windows_intelx86__cuda31.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>dnetc518-win32-x86-cuda31.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cudart32_31_9.dll</name>
    </file_info>
    <file_info>
        <name>dnetc-1.00.ini</name>
    </file_info>
    <file_info>
        <name>job-cuda31-1.00.xml</name>
    </file_info>
    <app_version>
        <app_name>dnetc</app_name>
        <version_num>102</version_num>
        <platform>windows_intelx86</platform>
        <avg_ncpus>0.050000</avg_ncpus>
        <max_ncpus>0.895864</max_ncpus>
        <plan_class>cuda31</plan_class>
        <api_version>6.13.0</api_version>
        <file_ref>
            <file_name>dnetc_1.02_windows_intelx86__cuda31.exe</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>dnetc518-win32-x86-cuda31.exe</file_name>
            <copy_file/>
        </file_ref>
        <file_ref>
            <file_name>dnetc-1.00.ini</file_name>
            <open_name>dnetc.ini</open_name>
            <copy_file/>
        </file_ref>
        <file_ref>
            <file_name>job-cuda31-1.00.xml</file_name>
            <open_name>job.xml</open_name>
            <copy_file/>
        </file_ref>
        <coproc>
            <type>CUDA</type>
            <count>1.000000</count>
        </coproc>
        <gpu_ram>262144000.000000</gpu_ram>
    </app_version>
</app_info>

Teemu Mannermaa Project administrator Project developer Project tester Joined: 20 Apr 11 Posts: 389 Credit: 822,556,349 RAC: 0
Message 980 - Posted: 4 Sep 2011, 0:02:13 UTC
Hi, Oops, for some reason I thought you wanted one for ATI, but clearly you were talking about CUDA cards there. Sorry about that, but it looks like you got it working after all. :) We use 0.20 for [...] for CUDA and also a lower [...] requirement, but those shouldn't matter that much, especially if your host crunches with those settings. I did have one example for CUDA cards up at http://moowrap.net/download/app_info-example.xml, which still has the old setting but otherwise should show a working example. -w

Semtex Joined: 14 Aug 11 Posts: 7 Credit: 20,109,913 RAC: 0
Message 986 - Posted: 4 Sep 2011, 20:13:19 UTC - in response to Message 980.
Thanks a lot. It works after a little modification. 4 short WUs at once: [image]
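
The thread does not show which modification was made. A minimal Python sketch of one plausible change, assuming it was the <coproc> <count> value: BOINC treats that value as the fraction of a GPU each task claims, so 0.25 would let four tasks share one card. The helper name `set_gpu_share` and the trimmed sample XML are hypothetical, and later replies in this thread argue that Moo is not designed to run this way.

```python
import xml.etree.ElementTree as ET


def set_gpu_share(app_info_xml: str, tasks_per_gpu: int) -> str:
    """Return app_info XML with each <coproc><count> set to 1/tasks_per_gpu.

    Hypothetical helper for illustration; it is not part of BOINC or Moo.
    """
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.fromstring(app_info_xml)
    for coproc in root.iter("coproc"):
        count = coproc.find("count")
        if count is not None:
            # Same six-decimal style as the app_info posted in this thread.
            count.text = f"{1.0 / tasks_per_gpu:.6f}"
    return ET.tostring(root, encoding="unicode")


# Trimmed-down sample; a real app_info.xml also needs the file entries.
sample = """<app_info>
  <app_version>
    <app_name>dnetc</app_name>
    <coproc>
      <type>CUDA</type>
      <count>1.000000</count>
    </coproc>
  </app_version>
</app_info>"""

updated = set_gpu_share(sample, 4)
print(updated)
```

Editing the file by hand works just as well; the script only makes the relation between tasks per card and the <count> value explicit.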

@@$tars_Finder@@ Joined: 30 Sep 11 Posts: 3 Credit: 43,358,783 RAC: 0
Message 1892 - Posted: 24 Dec 2011, 13:25:35 UTC - in response to Message 960.
Teemu Mannermaa wrote: "Hi, There's an example 32-bit Windows ATI cruncher app_info.xml available at http://moowrap.net/download/app_info-win32-ati14-example.xml. It might be usable as is except ATI card <count> field might need adjusting. Also <flops> field value is based on my own host and you should probably copy actual value for your host from your client state file. (The value gets calculated for each host by BOINC.) You can also try without one if you can live with funky estimates until BOINC sorts itself out. :) -w"
Hi :-). I'd like to run 2 WUs on my ATI card. To do that, I tried this app_info for my ATI card and it worked fine. After that, I tried changing the <coproc> count from 1 to 0.5, like this:
<coproc>
    <type>ATI</type>
    <count>0.500000</count>
</coproc>
but it does not work :-(. I receive an "HTTP Error" from the server when I do an update. Can you please help me? Bye.

Teemu Mannermaa Project administrator Project developer Project tester Joined: 20 Apr 11 Posts: 389 Credit: 822,556,349 RAC: 0
Message 1893 - Posted: 24 Dec 2011, 14:34:34 UTC
Hi, I don't think the scheduler expects people to try to use half an ATI card, and it might actually crash. I'll have to take a look so that it at least won't crash. Why are you trying to run two tasks on one card? I highly doubt it'll be twice as fast, since both of them will use the same card (at full speed), and the D.Net Client might even get so tangled that it can't finish correctly. -w

@@$tars_Finder@@ Joined: 30 Sep 11 Posts: 3 Credit: 43,358,783 RAC: 0
Message 1895 - Posted: 24 Dec 2011, 15:47:38 UTC - in response to Message 1893.
Hi :-). I'm trying to run two WUs on a single ATI card because the card's load often fluctuates between 88% and 98%. With two WUs running, the card would always be at 100%. I use this approach on other projects and the performance improvement is about 5% - 10%, so I would be happy if I could use it for Moo :-). Thank you.

Wizzo Joined: 1 Jan 12 Posts: 13 Credit: 21,324,276 RAC: 0
Message 2069 - Posted: 3 Jan 2012, 2:16:56 UTC - in response to Message 1895. Last modified: 3 Jan 2012, 2:17:32 UTC
I want in on that also. Moo works fine, but POEM uses 25-35% of each GPU. I should be able to double, or maybe triple up and see huge gains.

mikey Joined: 22 Jun 11 Posts: 2080 Credit: 1,846,500,808 RAC: 31,411
Message 2073 - Posted: 3 Jan 2012, 12:20:21 UTC - in response to Message 2069.
As Teemu said, it sounds good but doesn't really work in practice. The problem is the way GPU memory works, how much of it is on each card, and how the project app uses that GPU memory. Unfortunately, GPU memory is not suited for 'side by side' calculations, meaning one thing at a time, i.e. NO multitasking! So whether you run one unit or 10 units at once, each unit beyond the first must wait for the memory to be released, and therefore it is not faster. This releasing and capturing, over and over, by multiple units is slower than just one unit running through.
Moo uses ALL the GPUs in your system to do its thing; it is just designed that way. For example, on an AMD 5770 it usually takes about an hour to finish one workunit; putting two 5770s in one box cuts that time in half to about 30 minutes, which is about as long as an AMD 5870 takes! You are talking about going the OTHER WAY here!

Zydor Joined: 5 May 11 Posts: 233 Credit: 351,414,150 RAC: 0
Message 2076 - Posted: 3 Jan 2012, 13:04:24 UTC
Wizzo wrote: "... I should be able to double, or maybe triple up and see huge gains ..."
Not happening. GPUs are only designed for the execution of one program at a time; they cannot multitask like a multi-threaded CPU. That will change with AMD's new architecture, which will give that ability; however, on anything less than 7XXX, it will not.
What happens is ... it will load 2 or 3 WUs, whatever you give it (it will choke on four), and then divide time equally between the loaded WUs. The net result is no gain: for (say) two WUs, each takes twice as long to crunch, so the time per WU comes out the same.
With very small WUs (e.g. MilkyWay), you can get a gain of about 2-3% by loading two per GPU, as you save time in the load/unload/windup of each WU. It can be a hassle as well, so horses for courses as to whether or not it's worth it even at MW.
... but here ... not a chance. Don't waste time trying; it will not work the way you anticipate.
Regards Zy
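
The time-slicing argument in the post above can be put into numbers with a small Python sketch. The model is an assumption: the GPU sits idle during each WU's load/unload overhead when a single WU runs, and that overhead is fully hidden once a second WU is loaded. The function name and all timings below are hypothetical, chosen only to show how a gain of a few percent for short WUs and almost none for long WUs could arise.

```python
def throughput_gain(crunch_min: float, overhead_min: float) -> float:
    """Relative gain in WUs per hour from keeping a second WU loaded.

    ASSUMED model: one WU at a time costs crunch + overhead per WU; with
    two WUs sharing the GPU the overhead is hidden, so the cost per WU is
    the crunch time alone (each WU's wall time doubles, throughput does not).
    """
    if crunch_min <= 0 or overhead_min < 0:
        raise ValueError("crunch_min must be > 0 and overhead_min >= 0")
    single = 60.0 / (crunch_min + overhead_min)  # WUs per hour, one at a time
    shared = 60.0 / crunch_min                   # WUs per hour, GPU never idle
    return shared / single - 1.0


# Hypothetical timings: a short WU and a long WU with the same overhead.
short_wu = throughput_gain(crunch_min=4.0, overhead_min=0.1)
long_wu = throughput_gain(crunch_min=60.0, overhead_min=0.1)
print(f"short WU gain: {short_wu:.1%}")
print(f"long WU gain:  {long_wu:.1%}")
```

In this model the gain equals overhead divided by crunch time, so it shrinks as WUs get longer, which matches the claim that doubling up only pays off on very small WUs.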

@@$tars_Finder@@ Joined: 30 Sep 11 Posts: 3 Credit: 43,358,783 RAC: 0
Message 2077 - Posted: 3 Jan 2012, 14:17:12 UTC - in response to Message 2076. Last modified: 3 Jan 2012, 14:30:33 UTC
Hi. I use this approach on another project (two WUs running on an Nvidia card) and I get a good increase (I recently bought an AMD card and I would like to try it on Moo). Did you mean that AMD works differently, or were you talking about all GPUs in general? Bye and thank you :-)

Zydor Joined: 5 May 11 Posts: 233 Credit: 351,414,150 RAC: 0
Message 2079 - Posted: 3 Jan 2012, 15:29:12 UTC - in response to Message 2077. Last modified: 3 Jan 2012, 15:32:12 UTC
GPUs can only run one thread, that's all they are physically designed for, so they can only do one task at a time. If two are present, the GPU shares time between both WUs; it physically can't do anything else. I used to run NVidia in the 9800GTX+ days, going back to when they first started, and it didn't work then either. I moved to AMD after the final Fermi farce. It may be that something in the current NVidia architecture gives some room, but it will not be crunching both; it can only crunch one at a time.
The other major factor is that, as pointed out above, Moo is a multi-threaded GPU app designed to use up all GPU space as it becomes available. There is an issue with part of that at present, as Upstream are feeding us fragmented WUs, and the usual allocation tailored to card type is hard to impossible until "clean" WUs start again. That's why Teemu gave an extra 20% as a temporary measure whilst we battle through the fragged ones. That special case may give some temporary space for gain ... don't know until you try, I guess, but you will need to use a special app_info to prevent the Moo WU grabbing all GPUs as it's designed to do.
Bear in mind, if this is for NVidia cards, that NVidia works sloooow here ... not the best of ideas to run them at Moo, unless there is a particular personal reason.
Regards Zy

Wizzo Joined: 1 Jan 12 Posts: 13 Credit: 21,324,276 RAC: 0
Message 2085 - Posted: 4 Jan 2012, 0:10:59 UTC - in response to Message 2079.
I don't intend to double up on anything that uses 90%+ of the GPU; I was referring to POEM@Home, which is running at between 25% and 35% of each GPU. I am looking more for the low-hanging fruit here than to squeeze every last bit out of the GPUs. I have 2 ATI 6990s, and my eyes are rolling back in my head trying to find out how to do this (if it doesn't work, so be it). Can anyone point me in the right direction?