BOINC Scheduler changes for multiple app version case
BOINC Scheduler has had problems sending different app versions to clients when there are multiple possible versions for a platform. For example, this happens when there are both OpenCL and Stream/CUDA app versions, or both 32-bit and 64-bit CPU app versions, available. To hopefully fix this, our scheduler has been changed to send each app version until it has enough host-specific speed samples. The only exception is when that version has been failing.
Please report in our forums any problems getting work or tasks failing more often than before. Thank you and happy crunching!
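The selection rule described above can be illustrated with a minimal sketch. This is not the actual BOINC scheduler code: the `AppVersion` fields, the `MIN_SAMPLES` threshold and the "fastest version wins once all are sampled" fallback are assumptions made only for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real number of samples required is not stated.
MIN_SAMPLES = 10


@dataclass
class AppVersion:
    name: str
    host_samples: int       # speed samples collected on this particular host
    est_speed: float        # estimated speed on this host (higher is better)
    failing: bool = False   # True if this version has been failing


def pick_version(versions):
    """Pick which app version to send to a host (illustrative only)."""
    # Versions that have been failing are never sent.
    usable = [v for v in versions if not v.failing]
    if not usable:
        return None
    # Any version still lacking samples is sent first so it can be measured.
    unsampled = [v for v in usable if v.host_samples < MIN_SAMPLES]
    if unsampled:
        return min(unsampled, key=lambda v: v.host_samples)
    # Assumed fallback: once all versions are measured, prefer the fastest.
    return max(usable, key=lambda v: v.est_speed)


versions = [
    AppVersion("opencl", host_samples=3, est_speed=100.0),
    AppVersion("cuda", host_samples=12, est_speed=150.0),
    AppVersion("stream", host_samples=0, est_speed=200.0, failing=True),
]
print(pick_version(versions).name)
```

In this example the OpenCL version is chosen because it still lacks samples, while the failing Stream version is skipped even though it has none.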
2 May 2018, 12:00:34 UTC · Discuss
BOINC Server updated
The BOINC Server code used by the project has been updated to the latest version available from upstream, which includes various fixes, improvements and changes. The most visible change is the partial migration to the Bootstrap web framework and a refreshed theme. This should make our web site behave much better on mobile browsers but there are some pages that could still be improved.
As this is a big change, please report on our forums any problems or strange behaviour in both servers and website. Thank you and happy crunching!
7 Oct 2017, 8:42:12 UTC · Discuss
Disk failure on the main project server
The main project server failed about 36h ago due to I/O errors caused by bad sectors on its hard disk. I've migrated the data to the secondary disk and the project services are now recovering. There might be a few tasks affected (they will be shown as download errors) as some generated data files were not recoverable.
3 May 2017, 13:41:41 UTC · Discuss
Year 2017 is upon us and it has been an insanely long time since my last activity on these message boards. So I wanted to let everyone know that I am still around, just hiding in the shadows for various reasons. My main priority regarding the Moo! Wrapper project has been to keep it up and running at the current level. I am very aware that there are various issues and improvements pending that should be addressed and I'm sorry about the lack of progress on them. Hopefully I will manage to get these done eventually.
5 Jan 2017, 11:43:43 UTC · Discuss
Outage due to primary disk failure
Due to the primary disk finally failing completely there was about 8h of downtime for the project today. For some reason the disk failure brought the server down even though we have been running from the backup disk since July.
The failed disk has been replaced now and the project is slowly catching up with work distribution. Apologies for the outage!
5 Dec 2015, 15:54:20 UTC · Discuss
We had a few hours of outage today due to network connectivity issues at our server hosting provider that affected our DB server. Things should be stabilizing again barring any pending network maintenance.
18 Sep 2015, 15:54:02 UTC · Discuss
Fresh work available! Come get yours!
Keymaster is back online, at least temporarily pending more hardware replacements, so we are now generating fresh work for your hungry computers. Come get some and get your crunch on!
Thanks for your patience and understanding during this extended work outage due to upstream keymaster hardware failure.
22 Jul 2015, 18:48:44 UTC · Discuss
Out of work
There's no more work available at the moment due to the distributed.net keymaster hardware failure. Our local cache, which lasted for about two days, has now also been depleted. So let's finish what we have and then move on to backup projects (yes, they are always a good idea to have) while we wait for the distributed.net staff to repair the keymaster. Rest assured, they are working hard to get the keymaster back up. As soon as that happens, we'll get fresh work out for our hungry systems to crunch.
For more details and latest developments, please read http://blogs.distributed.net/. Thanks for your patience!
12 Jul 2015, 11:40:16 UTC · Discuss
Disk maintenance and work generation woes
The project was down today between 18:00 and 21:30 EEST (that's from 15:00 UTC/7:00 PST to 18:30 UTC/10:30 PST) for about 3 and a half hours while the previously failing disk was swapped with a backup disk and data was copied over. Now the project is running from a disk that's not showing signs of collapsing due to read errors. The backup disk is as old as the failed disk but hasn't had that much use so it should last until the project server is migrated to a new server with SSD disks later this year.
This happened a bit unannounced as I took advantage of the D.net keymaster problems that seem to slow down our work generation for some reason. That's also why we ran out of work before the maintenance and why we still don't have full work buffers. Hopefully that will fix itself once the keymaster is back in action. In any case, our local proxy will eventually run out of work unless the keymaster is resurrected.
For more information about the keymaster failure, please read http://blogs.distributed.net/2015/07/10/04/28/bovine/. Thanks and happy crunching once the dust settles!
10 Jul 2015, 18:44:33 UTC · Discuss
Unscheduled downtime due to disk problems
There was an unscheduled downtime on 5th of July from about 13:00 EEST+3 (that's 10:00 UTC or 3:00 PDT-7) until about 1:30 EEST (22:30 UTC or 15:30 PDT) for a total of 12 hours 30 mins. There might have been problems with services for 2 hours or even longer before the start of downtime.
This was due to disk I/O failures on our main server that meant it had to be brought down for a disk check and temporary repairs. There was no critical data affected as the only file blocks permanently lost due to bad sectors were parts of log files. All other files were repaired successfully.
The affected disk will need to be replaced as it will most likely fail completely soon. This will require additional downtime in coming days.
For now, I'm bringing services back up slowly while the backlog gets processed.
5 Jul 2015, 23:29:31 UTC · Discuss
New v1.4 apps for nVidia OpenCL and Android
More apps based on v1.4 of wrapper code are now available.
First there are Android apps with both non-PIE (for pre-4.1) and PIE (for 4.1 and later, especially 5) versions. Android apps are no longer in beta so everybody should get them but please report any problems at the new Android forum http://moowrap.net/forum_forum.php?id=27.
Secondly there's a new nVidia OpenCL app available for Windows. This brings the same level of support for nVidia GPUs that was previously available only for AMD GPUs. The Linux versions are on the list next and will be deployed soon.
24 May 2015, 1:12:04 UTC · Discuss
New app v1.4 with OpenCL support deployed
Finally we have a newer application version that supports OpenCL! Updated versions of both AMD/ATI Stream and OpenCL apps for Windows have been offered by our server since yesterday to participants meeting requirements.
These apps require BOINC Client v7.2.28 or newer and Windows XP or newer. The Stream app is not available to AMD Radeon HD 7870/7950/7970/R9 280X series (Tahiti) or newer cards due to compatibility issues. OpenCL app is preferred for those cards.
The old app version is still available for systems running an older BOINC Client. However, if your system has multiple GPUs, please consider updating its BOINC Client to a supported version. There are known issues running the old app on systems with multiple GPUs. The new apps fix these by running one cruncher per device.
For detailed app change history, please read application changelog. For latest requirements enforced by our server, please read latest requirements.
Please report any problems you might encounter with new apps and our changes. Especially if you no longer get any work from our server where you previously did. You can report them by posting at our forum.
7 Sep 2014, 5:18:57 UTC · Discuss
We have badges!
As a result of our recent server code update, we can now have badges. I've enabled the badge assignment task and first ones are now granted and getting exported in our stats. You can also see them all over our website where users are mentioned.
Badges in use are the default user and team gold/silver/bronze badges for top 1%/5%/25% RAC. Please let me know if you have any ideas for other badges. Preferably with cool icons to use.
Happy crunching while hunting those badges!
29 Aug 2014, 10:58:32 UTC · Discuss
I'm migrating our primary server to a new host that has a newer OS and more modern hardware. It's also closer to our DB server so any delays between the servers should be minimized. To top this all off, it's also cheaper.
As things are changing there might be one or more disruptions to services but I'll keep an eye on things and don't expect any loss of work once the dust settles. No action is needed on your part unless you are hard coding our server IP address somewhere, such as in your firewall rules. In that case you should prepare to change the address and think about using the name (moowrap.net) instead of hard coding an IP.
I'll let you know when I'm done. Thank you and happy crunching with a snappier server!
27 Aug 2014, 7:28:43 UTC · Discuss
I've just finished updating BOINC Server code to the latest available. This brought us about one year's worth of fixes and changes from upstream developers. There's better support for communicating with the latest BOINC Client versions and an interesting new CPU list at http://moowrap.net/cpu_list.php in addition to the previously available GPU list at http://moowrap.net/gpu_list.php.
Additionally, earlier our domain was moved to a new registrar and DNS service provider. We also finally have a non-expired certificate in use. And with the code update the protected pages actually work correctly now. Previously they might have refused to load style sheets and other such resources.
As always, please report any new/existing problems and I'll take a look. Thanks and happy crunching!
16 Aug 2014, 16:39:00 UTC · Discuss
Another transitioner problem
We've run out of work to send and transitioner backlog is about 9h at the moment because of what turned out to be a bug in transitioner code. It failed to properly detect and handle anonymous platforms for app_version statistics. I fixed the bug and transitioner is now catching up. As soon as that's done, I'll let the other daemons loose on their backlogs so we'll probably get back to full operations later today.
26 Jan 2014, 11:34:05 UTC · Discuss
Transitioner problem solved
It turned out the problem was actually caused by a corrupted index in one of the primary database tables used by BOINC Server. Recreated the table/index and things are looking much better so I've restarted the system. However, transitioner is now going through all the 600k workunits in the DB and only once that is done, the system will start creating new workunits. Until then we are out of work to send. Sorry! I'll keep an eye out to make sure we recover fully.
22 May 2013, 17:05:39 UTC · Discuss
Weird BOINC server bug causing problems
It seems transitioner is not transitioning anything since 5h ago so nothing is progressing through our pipe. I've shut down most background tasks to be safe until I can figure out what's going on with BOINC server programs at the moment. Unfortunately, this has to wait until later today.
You don't have to cancel current work units you might have as they will get reported and credited once services are back up. They can even be 1-2 days overdue and still get accepted. However, you might want to switch to your backup projects until we have new work ready.
Sorry for the problems and this unexpected extended downtime!
22 May 2013, 5:22:06 UTC · Discuss
Work shortage and previous outages
It seems we ran out of work to send today for the second time this week (the last incident happened three days ago). Both incidents happened because the upstream (distributed.net) keymaster did not send us any work to generate into work units for you to crunch on.
Three days ago the keymaster ran out of work (they currently have limited alerts for such issues due to a monitoring server failure) until an operator queued more work. Today the keymaster was somehow confused and not sending anything until the same operator restarted the service. So work is getting generated and sent to you again.
Also, there have been some total outages earlier that were due to our primary database server crashing. I'm still trying to figure out the exact cause of its crashing under high load but for now my workarounds have been keeping it up. If all else fails, I'm going to change the server hardware to one that's hopefully more reliable.
21 Mar 2013, 18:51:56 UTC · Discuss
Plan class configuration and requirements changed
I'm switching to using an external XML file to define plan classes (for technical details, please read http://boinc.berkeley.edu/trac/wiki/AppPlanSpec) so there can be some strange responses from the scheduler while things stabilize. I'll try to watch how the scheduler behaves but, please, let me know if your previously working systems no longer get any work assigned.
Additionally, I'm going to lower the ATI memory requirements so that some cards with limited memory can also participate. Native ATI Stream app is no longer available for ATI 7xxx class cards in preparation for getting a working OpenCL app out for them.
Apologies for any problems and thanks for your understanding and crunch power!
15 Dec 2012, 17:57:11 UTC · Discuss
Server code update and fix to GPU work scheduling
Just finished an update to the latest BOINC Server code that brought a lot of fixes and improvements from BOINC devs in the last 6 months. These included web feature changes as well as various backend process changes. (And before you ask, no, the credit system is still the same and no granting adjustments have been done there.)
Additionally, since the code refresh didn't seem to fix it, I debugged the "only 1 GPU work unit sent by scheduler" problem and hopefully got it fixed. It seems the matchmaker scoring scheduling algorithm we switched to some time ago doesn't work correctly for GPU work requests. (Specifically the new resource based work requests of new BOINC client versions.) I switched us back to the old trusty array scheduling algorithm and now the GPU work requests are being fulfilled fully by our scheduler.
As always, please report any oddities you might encounter and thanks for crunching!
10 Sep 2012, 4:17:53 UTC · Discuss
There have been some strange networking problems affecting our proxy connection to the D.net upstream key distribution servers. This is the reason why our work generators have failed to generate fresh work and we've run out of work to send to you. I'm working on trying to figure out what the problem is here. AFAIK, these problems don't affect any other connections or your ability to upload/download work.
In fact, while working on this I broke the network on the server totally and that's the reason we were offline for over 12h. Services are now back up and backlog is getting worked through. Sorry about that, I'll try to be more careful in the future!
10 Jul 2012, 17:01:25 UTC · Discuss
Fresh subspaces ahead!
We have been processing key blocks from a fresh subspace for a few days now. This should help with any problems caused by the previous more or less fragmented work units we've been sending out. Thanks for sticking with us through the bumpy times.
You can read the upstream announcement at http://blogs.distributed.net/2012/05/04/15/23/bovine/.
Let's keep our CPUs and GPUs hot with work! :)
12 May 2012, 17:37:17 UTC · Discuss
Domain expiration mishap
The Moo! Wrapper domain moowrap.net was inaccessible for a few hours because the domain expired yesterday (30. Apr). I failed to renew it on time because of some miscommunication with my service provider representative and me being too busy with my day job to handle this correctly, on time.
The domain is now renewed for the next three (3) years and should be accessible once again for everybody (or slowly getting there as DNS record updates make their way through the net). I'm sorry about this short hiccup in project availability.
1 May 2012, 15:11:51 UTC · Discuss
Hard drive failure on our primary database server
Our shiny new primary database server, that's been responsible for the nice performance lately, decided that things have been way too stable. So this Sunday morning at about 6:38 EET* the server killed its primary hard drive bringing everything to a grinding halt. :(
I've switched to using our replica DB until data center staff can replace our failed hard drive and/or server. I'm currently bringing the services back online slowly to catch things up. Note that things might be slower until the first onslaught of clients reconnecting is over.
Good news is that there shouldn't be more than a few seconds of DB changes lost because our database is replicated to the secondary server. Please, do tell if you see something strange. Bad news is that there's going to be a maintenance break in the near future when I switch the primary DB back to the resurrected server (maybe next weekend, if things run fine with only one DB server).
*=That's 5:38 CET or Sat 20:38 PST and for other timezones, please see http://www.timeanddate.com/worldclock/fixedtime.html?iso=20120212T0635&p1=101&sort=1.
12 Feb 2012, 18:33:28 UTC · Discuss
Planned database maintenance completed
Finished a planned database maintenance a few hours ago, where our primary DB was moved to a separate host. Service levels should be returning to normal.. or actually they should be getting better now that this change is done. So no more transitioner backlog or strange scheduler failures. :)
Downtime was a bit longer than expected (started around 1:00EET/23:00UTC/14:00PST) because moving over 30G of data actually takes a while and then there were complications while making our Python based backend and PHP based web use SSL when connecting to the new MySQL server.
27 Jan 2012, 16:10:07 UTC · Discuss
Application v1.3 deployed
New application v1.3 has been deployed. It's first available on Linux but the rest of the current platforms will follow as deployment progresses. There are going to be a few new platforms as well, namely 64-bit Windows CPU and 32-bit Linux CPU/ATI Stream.
Main fixes are for a Linux crash during suspend and checkpoint/hang detection improvements. There's also support for setting a separate ATI/NV/CPU core for a host, which you can now set in project preferences at http://moowrap.net/prefs.php?subset=project.
Note that there's no urgent need to abort tasks on older applications. Work from them will still be accepted and validated normally.
For a detailed changelog, please see post http://moowrap.net/forum_thread.php?id=206.
16 Jan 2012, 21:27:06 UTC · Discuss
Santa came by and demanded that I do something about recent credit levels. I obviously said no since we are already granting such high levels to begin with but he wouldn't take no for an answer. Now there's some Santa Magic in effect that makes our validator grant double credits for everybody! I'm so sorry about this and I'll fix it as soon as I have time to figure out what Santa did. :(
In other news, base credit is now 9 per stat units (the last number in wu name) and fragmented work will get a 20% bump in compensation. A fragmented work is one that has more than twice the normal amount of packets in it (the second to last number in wu name). Both of these conditions are in effect for now but I might adjust values after seeing how things progress.
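As a rough illustration of the rule above, the credit for a work unit could be computed like this. The function and parameter names are hypothetical, and the "normal" packet count is passed in as a parameter because its actual value isn't given here; this is a sketch, not the validator's real code.

```python
BASE_CREDIT_PER_STAT_UNIT = 9   # from the announcement above
FRAGMENT_BONUS_PERCENT = 20     # from the announcement above


def granted_credit(stat_units, packets, normal_packets):
    """Credit for one work unit (illustrative sketch only).

    stat_units     -- the last number in the wu name
    packets        -- the second to last number in the wu name
    normal_packets -- assumed parameter: the "normal" packet count
    """
    credit = stat_units * BASE_CREDIT_PER_STAT_UNIT
    # A wu counts as fragmented when it has more than twice the normal packets.
    if packets > 2 * normal_packets:
        credit = credit * (100 + FRAGMENT_BONUS_PERCENT) / 100
    return credit


# Made-up packet counts, just to show both branches.
print(granted_credit(448, packets=4, normal_packets=4))
print(granted_credit(448, packets=9, normal_packets=4))
```

With 448 stat units the base credit is 4032, and a fragmented wu of the same size would get 4838.4.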
24 Dec 2011, 15:32:05 UTC · Discuss
2% done, 98% or 9 to 137 years to go
The RC5-72 project of distributed.net (where we get our work from) announced a while back that they had completed 2% of the keyspace we are checking. They also calculated that there's still more work for 9 to 137 years so we don't have to worry about running out of work just yet.
For the announcement itself, please read http://blogs.distributed.net/2011/11/27/18/26/bovine/.
Thanks for crunching, keep it up! :) How about we try to push the time to get to 3% under a year? With your help, we can do it!
15 Dec 2011, 6:46:03 UTC · Discuss
Read-only replica DB deployed
There's now a replica DB that's used for certain heavy read-only DB operations (top statistics, server_status, lists of user workunits/tasks etc). It should take a bit of load off the main DB and by doing so, help the scheduler and other critical processes do their job.
Please, do let me know if there's something that no longer seems to work. Thanks and happy crunching!
30 Nov 2011, 14:34:02 UTC · Discuss
BOINC Server code updated
Just finished an update of the BOINC Server code to the latest available. This brought us about three months worth of fixes and new features from the upstream developers. Most notable fixes:
As always, please let us know if you find something broken. Thanks!
26 Oct 2011, 17:01:49 UTC · Discuss
Another unexpected outage
There was another out-of-memory event on our main (and only) server that brought the project down for about six hours last Friday. Services were fully down from 21:00 to 3:30 local time (EEST+3, so that's 18:00 to 0:30 UTC and 11:00 to 17:30 PDT).
Anybody who is interested in reading the long technical details, please see the http://moowrap.net/forum_thread.php?id=113 forum post. Thanks for crunching!
4 Sep 2011, 3:18:16 UTC · Discuss
It seems we had our first unexpected downtime yesterday. :( We've had some offline times previously but most of them have been for a shorter period of time and on purpose due to maintenance and updates.
Access to our services was failing on Thursday from about 4:00 to 20:30 when I finally brought things back online. Those times are in my local EEST+3 timezone, which means from 1:00 to 17:30 UTC and from Wed 18:00 to 10:30 PDT. This is about 16 and a half hours of lost time.
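For anyone who wants to double-check the conversions, here is a small sketch using fixed UTC offsets. The calendar date (Thursday 4 Aug 2011) is inferred from the posting date of this news item, and the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as stated in the text: EEST is UTC+3, PDT is UTC-7.
EEST = timezone(timedelta(hours=3), "EEST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Date inferred from the posting date (assumption): Thursday 4 Aug 2011.
start = datetime(2011, 8, 4, 4, 0, tzinfo=EEST)
end = datetime(2011, 8, 4, 20, 30, tzinfo=EEST)

for label, tz in (("UTC", timezone.utc), ("PDT", PDT)):
    s = start.astimezone(tz)
    e = end.astimezone(tz)
    print(label, s.strftime("%a %H:%M"), "to", e.strftime("%a %H:%M"))

downtime = end - start
print("downtime:", downtime)
```

The outage starts on Wednesday evening in PDT because 4:00 EEST is ten hours ahead of Pacific time.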
Looking through logs, this seems to have been caused by the server running out of memory and subsequently OOM-killing itself to death. I have a few things I can do to prevent the same problem bringing us down in the future. (Like moving the DB to a different server as the OOM-killer chose the poor DB to die in the first round.)
I did notice the problems in the morning but due to unrelated complications (non-project ones) I didn't manage to get the server back online until that evening. I do apologize for this extending our downtime. :(
Everything should be back to normal now but do let me know if there are still problems around. Thanks and now let's crunch hard to make up for the lost time! ;)
5 Aug 2011, 10:33:13 UTC · Discuss
Donations are now accepted using Paypal. You can go to http://moowrap.net/donations.php or use the link on our home page to help with our monthly operation costs. We would like to cover half of our costs through donations, which is 100 USD per month. Other half is privately sponsored by the project administrator (or covered through discounts and other such means). For details, see the donations page. Thanks for your help, every dollar counts!
8 Jul 2011, 10:59:49 UTC · Discuss
Mac applications deployed
New Mac applications have been deployed with both CPU and CUDA 3.1 support. There's a separate CPU application for all three CPU variants (PowerPC, Intel 32-bit and Intel 64-bit). For CUDA 3.1 there's only an Intel 32-bit application available (it might work on Intel 64-bit too) and BOINC Client needs to already detect your nVidia card correctly.
This deployment is still based on application v1.2 so any known problems from Linux version are most likely still there.
14 Jun 2011, 16:28:01 UTC · Discuss
CPU applications deployed and CUDA memory requirements changed
New CPU applications have been deployed for 32-bit Windows and 64-bit Linux. The Windows version should also run on 64-bit Windows systems. So you should now disable CPU applications in project preferences if you don't want to use your CPU for this project.
There's a known problem with checkpoint interval (uses a default 2h interval) and core selection (selected core is shared with ATI/nVidia on the same host) for these applications. Both problems should be fixed in next application version so you should wait for that if you have a problem due to either limitation.
Additionally, CUDA requirements were changed to accept cards with only 64MB of memory.
26 May 2011, 12:41:04 UTC · Discuss
Different workunit sizes added
I just completed adding different workunit sizes and the scheduler should now send you work based on the measured speed of your host. Additionally, the scheduler sends workunits that are a better "match" for the number of cards your host has. This should minimize idle cards at the end of a wu.
I'm pretty sure there's still some tweaking to do and I will be watching how the scheduler performs tomorrow. Please, do let me know if there seems to be something odd with workunits given to your host and especially if you are now unable to get any work. Thanks!
22 May 2011, 12:39:09 UTC · Discuss
Application version 1.2 deployed
I've deployed new application v1.2, with following major changes:
18 May 2011, 15:26:53 UTC · Discuss
Switched to granting static credit
Due to numerous inconsistencies in the BOINC credit calculation, I've switched to granting static amount of credit based on stat units in a wu. Stat units in a particular wu is the last number in its name (second to last in a task name) and measures the relative amount of work in a wu.
At the moment we are giving 5cr per stat unit, which gives a little over 2kcr per current WU size. (For example, a standardish wu with 448 stat units gives 2240cr.) We will probably switch to 7cr eventually, which gives about 3kcr per current WU size (the same 448 then gives 3136cr regardless of how long it took).
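The arithmetic behind those numbers is simply stat units multiplied by the per-unit rate. A tiny sketch (the function name is made up for illustration):

```python
def static_credit(stat_units, credit_per_stat_unit):
    """Static credit: depends only on stat units, never on run time."""
    return stat_units * credit_per_stat_unit


print(static_credit(448, 5))  # current rate: 2240
print(static_credit(448, 7))  # possible future rate: 3136
```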
Partial result crediting works the same way it used to.
12 May 2011, 17:39:16 UTC · Discuss
New application version deployed
I just released v1.01 of our wrapper application. In addition to the previous 32-bit Windows versions there are now versions for 64-bit Linux. Both Stream and CUDA are available for both.
Some of the major changes in this release:
6 May 2011, 19:58:20 UTC · Discuss
I've been working on getting the crediting to stabilize. During these changes there was a time where valid results were given a near zero credit (it was shown as zero). This affected 290 results for 54 users and 80 hosts.
I've just fixed credit for these anomalies by granting them a fixed amount. Also, the near zero credits shouldn't happen anymore since they seem to have been the result of a --max_granted_result switch for the validator that I tried as a solution for the credits. I'm not sure if it's a bug in BOINC Server code or if it's really supposed to work that way..
Otherwise, the crediting seems to have somewhat stabilized but I think we are giving too much credit at the moment. I'm going to bring it down a bit.
There are still cases where huge credit is given for some results. If the validator doesn't correct itself I'll need to change how we credit results and bypass all this scaling that goes on.
5 May 2011, 10:14:13 UTC · Discuss
The project now enforces the following ATI/nVidia requirements. The most notable change is that CC 2.0 and above are no longer rejected for CUDA. This means devices like GTX 465 are now offered the CUDA application. We'll see how things work for them.
There's also better logging (for me) and also a generic notification to new BOINC Clients when these requirements are not met.
ATI Stream requirements:
nVidia CUDA requirements:
I probably should find a place on the website to list these requirements. They might change as we get more information about what is really required.
3 May 2011, 8:13:55 UTC · Discuss
Outbound mail is delayed
Looks like BOINC server software went a little crazy with sending emails so the server hit its outbound email limit for today. There are 406 mails waiting to be sent. The limit should get reset any minute now (or it'll take another 24 hours) and at least the queued mail should get sent then.
My guess is that this flurry is from initial global team import and other notifications from users registering on the project and on forum.
I hope mail volume gets a little easier once first days pass by and things get rolling. If not, I'll need to do something about that limit. For now, let's wait patiently as queues catch up. Thanks! :)
BTW, inbound mail for the project addresses (mentioned throughout the site) is most likely bouncing at the moment. I'll need to set those up soon but let's use this forum for communication for now.
2 May 2011, 8:58:58 UTC · Discuss
Up and running!
Basic infrastructure is now up and running and at least ATI application should be usable. There's also CUDA application but that can have some performance issues still. I'm opening account creation for those brave enough to alpha test while I finish some loose ends.
1 May 2011, 23:36:35 UTC · Discuss