Nvidia GPU workunits won't start
Author | Message |
---|---|
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
None of my Moo! Wrapper Nvidia GPU workunits will start running. They all say "Ready to Start", but even when there are no other GPU workunits running from other projects, the Moo! Wrapper workunits won't start running. I have tried suspending any other projects (CPU & GPU ones) but still the Moo! Wrapper workunits won't start. I have tried them on 2 different PCs (one with Windows 7 and one with Windows 8) and I have the same problem. I am using v7.4.42 of BOINC Manager. I have tried upgrading my Nvidia drivers to the latest version, but this made no difference. GPU workunits from other projects work fine. When I last crunched GPU workunits for this project (about a month ago) they also worked fine, but not anymore. Any help would be appreciated. |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
After reading some older threads in this forum, it appears that other people have had the same problem and that it might be due to the fact that both my PCs have dual graphics cards. It's strange that it was working about a month ago, but doesn't work now. I don't recall changing anything. I've tried an app_info.xml file mentioned in another thread but the problem persists. I'll keep trying. |
Send message Joined: 22 Jun 11 Posts: 2080 Credit: 1,844,376,616 RAC: 12,696 |
"After reading some older threads in this forum, it appears that other people have had the same problem and that it might be due to the fact that both my PCs have dual graphics cards." Go into your account's computing preferences and check whether the GPU is set NOT to run while the PC is in use. The setting is "Suspend GPU computing when computer is in use". |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
Hi Mikey, "Suspend when computer is in use" is unticked, and "Suspend GPU computing when computer is in use" is unticked. I wonder if it has anything to do with the new v1.04 (opencl_nvidia_101) application that was released around the end of May. Although maybe not, since my PCs aren't actually getting the new application (I'm still receiving the v1.03 (cuda31) application). Daniel. |
Send message Joined: 5 May 11 Posts: 12 Credit: 74,488,285 RAC: 0 |
The units you did in May show "driver version 340.52", while your systems now show "driver: 353.06". I would look here first. Roll back the driver and see if it starts processing work. |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
I was having the problem with the old driver as well. I installed the new driver a few days ago to see if it would fix my problem, but it didn't. |
Send message Joined: 22 Jun 11 Posts: 2080 Credit: 1,844,376,616 RAC: 12,696 |
"Hi Mikey," How about this setting? "Suspend when non-BOINC CPU usage is above --- %" |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
"Suspend when non-BOINC CPU usage is above" was set to 25%. I've now unticked it so that it won't suspend, but my problem still persists. I've noticed that the News section of the main Moo! Wrapper page says: "The old app version is still available for systems running older BOINC Client. However, if your system has multiple GPUs, please consider updating its BOINC Client to a supported version. There are known issues running the old app on systems with multiple GPUs. New apps fix these by running one cruncher per device." I'm still getting the v1.03 (cuda31) application (which was working OK previously). Do you know if there is a way to get the new v1.04 (opencl_nvidia_101) application? Maybe the new application will fix my problem. |
Send message Joined: 22 Jun 11 Posts: 2080 Credit: 1,844,376,616 RAC: 12,696 |
"'Suspend when non-BOINC CPU usage is above' - was set to 25%. I've now unticked it so that it won't suspend, but my problem still persists." No, I do not know how to force the change. |
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
"None of my Moo! Wrapper Nvidia GPU workunits will start running. They all say 'Ready to Start', but even when there are no other GPU workunits running from other projects, the Moo! Wrapper workunits won't start running." I noticed this too on my test nVidia system that has dual GPUs. BOINC Client has major problems when trying to schedule work units that require multiple GPUs (like all our old app workunits do). I've tested all major versions since 7.2.33 and all seem to be affected, at least on my test system. I did disable sending old app versions to any multi-GPU systems using BOINC Client v7.2.x or later, but it seems some users have managed to get our old app working, so this prevented them from getting any work. I have no idea why, or how to detect those systems on the server side. So for now only multi-GPU systems with BOINC Client v7.2.33 won't get old apps (this is the version that's available for CentOS 6 in their repo). There was a fix committed to BOINC Client for this problem recently (see https://github.com/BOINC/boinc/commit/8c7aef5b997c028e007dc158d76eab3a5502e3c4) but I don't think there's a release version of it yet. I've tested it with a test build and it does seem to fix the scheduling problem. However, that won't help old BOINC Client users. So the best course is probably to have everybody use our new app. For that we need a new enough Dnet Client that supports selecting the GPU to be used, and there are no CUDA builds available for that. I've gotten one compiled for Linux and thus was able to release cuda60/cuda70 builds with the new app. I'll try to get Windows ones built too. In the meantime, nVidia OpenCL apps are the thing to get. I've fixed some problems in the BOINC Server code that could have prevented people from getting that app. (There have been various problems with sending apps when there's more than one possible app for a platform.) Additionally, the more stats the server has for an app version, the better it knows its true speed relative to other apps.
So, could you test whether you can get OpenCL apps now? If not, tell me the host # that's affected and I'll try to debug the app sending problem further. Thanks! -w |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
Thanks for your reply Teemu. I've just downloaded some more workunits, but I'm still having the same problem. The workunits are v1.03 cuda31 and they won't start running. Here is a link to my computer ... http://moowrap.net/show_host_detail.php?hostid=185032 |
Send message Joined: 22 Jun 11 Posts: 2080 Credit: 1,844,376,616 RAC: 12,696 |
"Thanks for your reply Teemu." Is this your only PC, or do you have others at your other projects? Because you seem to be doing several GPU projects at the same time, could it be that your cache is too big, you have too many units overall, and Moo is suffering from a lack of time to run them? Your units say 'abandoned' or, in the case of one unit, 'didn't run by the deadline'. |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
Yes I have other computers. Two of them have Nvidia GPUs and both of them are having the same problem with this project. I do run other GPU projects, but even when I have no other GPU workunits or when I suspend the other projects, workunits from this project still won't start. |
Send message Joined: 22 Jun 11 Posts: 2080 Credit: 1,844,376,616 RAC: 12,696 |
"Yes I have other computers." I am out of ideas then, sorry. |
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
"Here is a link to my computer ..." Right, thanks for the host #. I've blacklisted the nVidia CUDA app for that host for now, so you should get the nVidia OpenCL work units now. Hope they work better. Please reset the project and fetch new work to test, thanks! The problem with getting nVidia OpenCL work in the first place is that BOINC Server thinks that the OpenCL app is much slower than the CUDA one for your host. It obviously doesn't take into account that CUDA doesn't start at all now. :( Specifically the scheduler says: "Comparing AV#38 (43.01 GFLOPS) against AV#26 (178.36 GFLOPS)" where AV#38 is the OpenCL and AV#26 is the CUDA. The CUDA speed comes from the measured time of successfully returned work, and the OpenCL speed is a guess by the server. I've been trying to make it guess better but have not succeeded fully yet. -w |
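For anyone wondering why the blacklist was needed, here is a rough Python sketch of the selection behaviour described above. It is not BOINC Server code: `AppVersion`, `pick_app_version` and the `blacklisted_ids` parameter are hypothetical names made up for illustration, and pairing AV#26 with the cuda31 plan class is an assumption. Only the two AV numbers and GFLOPS figures come from the scheduler message quoted in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppVersion:
    av_id: int         # app version number as printed in the scheduler message
    plan_class: str    # assumed plan class label, for readability only
    est_gflops: float  # measured (CUDA) or server-guessed (OpenCL) speed


def pick_app_version(candidates, blacklisted_ids=frozenset()):
    """Return the fastest-looking app version that is not blacklisted.

    Hypothetical helper: it only mimics the "compare GFLOPS, take the
    bigger one" behaviour described in the post.
    """
    usable = [av for av in candidates if av.av_id not in blacklisted_ids]
    if not usable:
        return None
    return max(usable, key=lambda av: av.est_gflops)


# Figures from the quoted scheduler message.
cuda = AppVersion(26, "cuda31", 178.36)
opencl = AppVersion(38, "opencl_nvidia_101", 43.01)

# Without a blacklist the CUDA version wins on estimated speed,
# even though it never starts on this host.
default_choice = pick_app_version([opencl, cuda])

# Blacklisting AV#26 for the host leaves only the OpenCL version.
after_blacklist = pick_app_version([opencl, cuda], blacklisted_ids={26})

print(default_choice.plan_class)
print(after_blacklist.plan_class)
```

The point of the sketch is that a speed estimate alone cannot tell the scheduler that the faster app never starts, which is why a per-host blacklist was the quick workaround.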
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
Thanks Teemu! You've fixed my problem. I'm now getting the OpenCL workunits and they start running straight away. Perhaps you could add an option in our account preferences that lets us choose which application to run. Thanks also to Mikey & Pooh Bear 27 for your suggestions. |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
Looks like I spoke too soon. :( While the tasks do now start running straight away, the GPU load reported by GPU-Z is always 0%. After a few seconds of running the task progress bar stops at 17.567% and no further progress is made. I tried rebooting the computer and resetting the project, but this made no difference. |
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
"After a few seconds of running the task progress bar stops at 17.567% and no further progress is made." Oh no. :( Can you find the stderr.log file in the slot directory from a task that's stuck and post it either here or PM it to me? (Note that the log file might contain your Distributed.net ID which you might not want to disclose publicly.) -w |
Send message Joined: 2 May 13 Posts: 13 Credit: 105,231,599 RAC: 9 |
I think I know what the problem is. The application is trying to use my Intel GPU instead of NVidia GPUs. When I alternately suspend & resume the Moo!Wrapper project my NVidia GPU usage stays at 0%, but my Intel GPU usage goes from 0 to 100%. A similar problem was reported by anson1998 in this thread http://moowrap.net/forum_thread.php?id=406 Here are the contents of my stderr.txt ...
10:32:13 (1612): wrapper v1.4 build 18 for nVidia OpenCL starting (BOINC Wrapper v7.5.26011)
10:32:13 (1612): device: OpenCL: NVIDIA GPU 0: GeForce GTX 760 (driver version 353.06, device version OpenCL 1.2 CUDA, 2048MB, 1958MB available, 2650 GFLOPS peak)
10:32:13 (1612): device: OpenCL: NVIDIA GPU 1 (not used): GeForce GTX 760 (driver version 353.06, device version OpenCL 1.2 CUDA, 2048MB, 1958MB available, 2650 GFLOPS peak)
10:32:13 (1612): checkpoint interval: 0h15m00s00 (task 268800 GFLOPS, 0h01m41s44 per packet)
10:32:13 (1612): wrapper: running dnetc520-win32-x86-opencl.exe (-ini dnetc.ini -runoffline -multiok=1) - attempt 1/10
dnetc v2.9111-520-CTR-12082118 for OpenCL on Win32 (WindowsNT 6.2).
* ==========================================================================
* The client is not configured with your email address (distributed.net ID)
* Work done cannot be credited until it is set. Please run 'dnetc -config'
* ==========================================================================
[Jul 11 00:32:14 UTC] Automatic processor type detection did not recognize the processor (tag: "Intel(R) HD Graphics 4000 ")
[Jul 11 00:32:34 UTC] RC5-72: using core #1 (CL 1-pipe).
[Jul 11 00:32:34 UTC] RC5-72: Loaded D3:CB6FEE00:00000000:64*2^32
Packet was from a different user/core/client cpu/os/build.
[Jul 11 00:32:34 UTC] RC5-72: 2 packets (128.00 stats units) remain in in.r72
Projected ideal time to completion: 0.04:07:28.00
[Jul 11 00:32:34 UTC] RC5-72: 0 packets are in out.r72
[Jul 11 00:32:34 UTC] 1 cruncher has been started. |
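If you want to check your own stderr.txt for the same symptom, a rough Python sketch follows. The two regular expressions are assumptions based only on the line formats visible in the log in this post, not on any documented wrapper or dnetc output format, and `summarize` is a made-up helper name. The sample text is a shortened copy of the log above.

```python
import re

# Assumed format of the wrapper's "device:" lines, taken from the log above.
WRAPPER_DEVICE = re.compile(
    r"device: OpenCL: (\w+) GPU (\d+)( \(not used\))?: ([^(]+)"
)
# Assumed format of the dnetc processor-detection line, taken from the log above.
DNETC_TAG = re.compile(r'did not recognize the processor \(tag: "([^"]*)"\)')


def summarize(stderr_text):
    """Return (devices the wrapper assigned, processor tags dnetc reported)."""
    assigned = [
        m.group(4).strip()
        for m in WRAPPER_DEVICE.finditer(stderr_text)
        if not m.group(3)  # skip devices marked "(not used)"
    ]
    tags = [m.group(1).strip() for m in DNETC_TAG.finditer(stderr_text)]
    return assigned, tags


sample = """\
10:32:13 (1612): device: OpenCL: NVIDIA GPU 0: GeForce GTX 760 (driver version 353.06)
10:32:13 (1612): device: OpenCL: NVIDIA GPU 1 (not used): GeForce GTX 760 (driver version 353.06)
[Jul 11 00:32:14 UTC] Automatic processor type detection did not recognize the processor (tag: "Intel(R) HD Graphics 4000 ")
"""

assigned, tags = summarize(sample)
print("wrapper assigned:", assigned)
print("dnetc detected:  ", tags)
```

If the device the wrapper assigned and the processor tag dnetc reports name different hardware, as they do here, the cruncher is most likely running on the wrong device.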
Send message Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
"The application is trying to use my Intel GPU instead of NVidia GPUs." Aha! Thanks for this information, at least now we know what the problem is. The fix for this might be a bit involved, though. Meanwhile, I'll probably try to get newer CUDA builds with our new app for Windows built and deployed, as those will use 1 device at a time and won't suffer from the scheduling problem. This stems from the two different views of the OpenCL devices on the system. BOINC only sees the nVidia devices while D.net Client sees all the OpenCL capable devices and thus device 0 is different. :( -w |
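To make the "two different views" problem concrete, here is a small Python sketch. It is only an illustration and not how BOINC or the D.net Client enumerate devices: the device list, its ordering (Intel first) and the helper names `nvidia_only_view` and `translate_index` are assumptions chosen to match the symptom in the posted log.

```python
def nvidia_only_view(all_devices):
    """Devices as BOINC is described as seeing them here: NVIDIA GPUs only."""
    return [d for d in all_devices if d["vendor"] == "NVIDIA"]


def translate_index(boinc_index, all_devices):
    """Map an index in the NVIDIA-only list to an index in the full list.

    Hypothetical helper showing the kind of translation a fix would need.
    """
    target = nvidia_only_view(all_devices)[boinc_index]
    # Compare by identity, because the two GTX 760 entries are otherwise equal.
    return next(i for i, d in enumerate(all_devices) if d is target)


# Assumed full OpenCL device list, as a client enumerating every
# OpenCL-capable device might see it on the host from this thread.
all_devices = [
    {"vendor": "Intel", "name": "Intel(R) HD Graphics 4000"},
    {"vendor": "NVIDIA", "name": "GeForce GTX 760"},
    {"vendor": "NVIDIA", "name": "GeForce GTX 760"},
]

# Passing BOINC's "device 0" through unchanged lands on the Intel GPU.
naive_pick = all_devices[0]["name"]

# Translating the index first lands on the first GTX 760.
fixed_pick = all_devices[translate_index(0, all_devices)]["name"]

print("naive:", naive_pick)
print("fixed:", fixed_pick)
```

Under these assumptions, device 0 in one list is device 1 in the other, which matches the Intel GPU going to 100% while both GTX 760s stay idle.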