Another transitioner problem
We've run out of work to send and transitioner backlog is about 9h at the moment because of what turned out to be a bug in transitioner code. It failed to properly detect and handle anonymous platforms for app_version statistics. I fixed the bug and transitioner is now catching up. As soon as that's done, I'll let the other daemons loose on their backlogs so we'll probably get back to full operations later today.
26 Jan 2014, 11:34:05 UTC · Comment
Transitioner problem solved
It turned out the problem was actually caused by a corrupted index in one of the primary database tables used by BOINC Server. I recreated the table/index and things are looking much better, so I've restarted the system. However, transitioner is now going through all the 600k workunits in the DB and the system will start creating new workunits only once that is done. Until then we are out of work to send. Sorry! I'll keep an eye out to make sure we recover fully.
22 May 2013, 17:05:39 UTC · Comment
Weird BOINC server bug causing problems
It seems transitioner has not transitioned anything for the last 5h, so nothing is progressing through our pipe. I've shut down most background tasks to be safe until I can figure out what's going on with BOINC server programs at the moment. Unfortunately, this has to wait until later today.
You don't have to cancel current work units you might have as they will get reported and credited once services are back up. They can even be 1-2 days overdue and still get accepted. However, you might want to switch to your backup projects until we have new work ready.
Sorry for the problems and this unexpected extended downtime!
22 May 2013, 5:22:06 UTC · Comment
Work shortage and previous outages
It seems we ran out of work to send today for the second time this week (the last incident happened three days ago). Both incidents happened because the upstream (distributed.net) keymaster did not send us any work to generate into work units for you to crunch on.
Three days ago the keymaster ran out of work (they currently have limited alerts for such issues due to a monitoring server failure) until an operator queued more work. Today the keymaster was somehow confused and did not send anything until the same operator restarted the service. So work is being generated and sent to you again.
Also, there have been some total outages earlier that were due to our primary database server crashing. I'm still trying to figure out the exact cause of its crashing under high load, but for now my workarounds have been keeping it up. If all else fails, I'm going to change the server hardware to one that's hopefully more reliable.
21 Mar 2013, 18:51:56 UTC · Comment
Plan class configuration and requirements changed
I'm switching to using an external XML file to define plan classes (for technical details, please read http://boinc.berkeley.edu/trac/wiki/AppPlanSpec) so there can be some strange responses from the scheduler while things stabilize. I'll try to watch how the scheduler behaves but please let me know if your previously working systems no longer get any work assigned.
Additionally, I'm going to lower the ATI memory requirements so that some cards with limited memory can also participate. Native ATI Stream app is no longer available for ATI 7xxx class cards in preparation for getting a working OpenCL app out for them.
Apologies for any problems and thanks for your understanding and crunch power!
15 Dec 2012, 17:57:11 UTC · Comment
Server code update and fix to GPU work scheduling
Just finished an update to the latest BOINC Server code that brought a lot of fixes and improvements from BOINC devs in the last 6 months. These included web feature changes as well as various backend process changes. (And before you ask, no, the credit system is still the same and no granting adjustments have been done there.)
Additionally, since the code refresh didn't seem to fix it, I debugged the "only 1 GPU work unit sent by scheduler" problem and hopefully got it fixed. It seems the matchmaker scoring scheduling algorithm we switched to some time ago doesn't work correctly for GPU work requests. (Specifically the new resource-based work requests of new BOINC client versions.) I switched us back to the old trusty array scheduling algorithm and now the GPU work requests are being fulfilled fully by our scheduler.
As always, please report any oddities you might encounter and thanks for crunching!
10 Sep 2012, 4:17:53 UTC · Comment
There have been some strange networking problems affecting our proxy connection to the D.net upstream key distribution servers. This is the reason why our work generators have failed to generate fresh work and we've run out of work to send to you. I'm trying to figure out what the problem is. AFAIK, these problems don't affect any other connections or your ability to upload/download work.
In fact, while working on this I broke the network on the server totally and that's the reason we were offline for over 12h. Services are now back up and backlog is getting worked through. Sorry about that, I'll try to be more careful in the future!
10 Jul 2012, 17:01:25 UTC · Comment
Fresh subspaces ahead!
We have been processing key blocks from a fresh subspace for a few days now. This should help with any problems caused by the previous more or less fragmented work units we've been sending out. Thanks for sticking with us through the bumpy times.
You can read the upstream announcement at http://blogs.distributed.net/2012/05/04/15/23/bovine/.
Let's keep our CPUs and GPUs hot with work! :)
12 May 2012, 17:37:17 UTC · Comment
Domain expiration mishap
The Moo! Wrapper domain moowrap.net was inaccessible for a few hours because the domain expired yesterday (30 Apr). I failed to renew it on time because of some miscommunication with my service provider's representative and because I was too busy with my day job to handle this correctly.
The domain is now renewed for the next three (3) years and should be accessible once again for everybody (or slowly getting there as DNS record updates make their way through the net). I'm sorry about this short hiccup in project availability.
1 May 2012, 15:11:51 UTC · Comment
Hard drive failure on our primary database server
Our shiny new primary database server, which has been responsible for the nice performance lately, decided that things have been way too stable. So this Sunday morning at about 6:38 EET* the server killed its primary hard drive, bringing everything to a grinding halt. :(
I've switched to using our replica DB until data center staff can replace our failed hard drive and/or server. I'm currently bringing the services back online slowly to catch things up. Note that things might be slower until the first onslaught of clients reconnecting is over.
Good news is that there shouldn't be more than a few seconds of DB changes lost because our database is replicated to the secondary server. Please do tell if you see something strange. Bad news is that there's going to be a maintenance break in the near future when I switch the primary DB back to the resurrected server (maybe next weekend, if things run fine with only one DB server).
*=That's 5:38 CET or Sat 20:38 PST and for other timezones, please see http://www.timeanddate.com/worldclock/fixedtime.html?iso=20120212T0635&p1=101&sort=1.
12 Feb 2012, 18:33:28 UTC · Comment
Planned database maintenance completed
Finished a planned database maintenance a few hours ago, in which our primary DB was moved to a separate host. Service levels should be returning to normal, or actually they should be getting better now that this change is done. So no more transitioner backlog or strange scheduler failures. :)
Downtime was a bit longer than expected (started around 1:00EET/23:00UTC/14:00PST) because moving over 30G of data actually takes a while and then there were complications in making our Python-based backend and PHP-based web use SSL when connecting to the new MySQL server.
27 Jan 2012, 16:10:07 UTC · Comment
Application v1.3 deployed
New application v1.3 has been deployed. It's available on Linux first, but the rest of the current platforms will follow as deployment progresses. There are going to be a few new platforms as well, namely 64-bit Windows CPU and 32-bit Linux CPU/ATI Stream.
Main fixes are for a Linux crash during suspend and checkpoint/hang detection improvements. There's also support for setting a separate ATI/NV/CPU core for a host, which you can now set in project preferences at http://moowrap.net/prefs.php?subset=project.
Note that there's no urgent need to abort tasks on older applications. Work from them will still be accepted and validated normally.
For a detailed changelog, please see post http://moowrap.net/forum_thread.php?id=206.
16 Jan 2012, 21:27:06 UTC · Comment
Santa came by and demanded that I do something about recent credit levels. I obviously said no since we are already granting such high levels to begin with but he wouldn't take no for an answer. Now there's some Santa Magic in effect that makes our validator grant double credits for everybody! I'm so sorry about this and I'll fix it as soon as I have time to figure out what Santa did. :(
In other news, base credit is now 9 per stat unit (the last number in wu name) and fragmented work will get a 20% bump in compensation. A fragmented work unit is one that has more than twice the normal amount of packets in it (the second to last number in wu name). Both of these conditions are in effect for now but I might adjust values after seeing how things progress.
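The crediting rule above can be sketched roughly as follows. This is only an illustration, not the validator's actual code: the function name, the `normal_packets` parameter and the example numbers are hypothetical (the post does not say what the normal packet count is), and it assumes the 20% bump is simply multiplied onto the base credit. The Santa doubling is not modelled.

```python
def credit_for_workunit(stat_units, packets, normal_packets,
                        base_per_stat_unit=9, fragment_bonus_pct=20):
    """Sketch of the rule: 9 credits per stat unit, +20% if fragmented.

    normal_packets is a hypothetical parameter; the real "normal amount
    of packets" is not given in the post.
    """
    credit = base_per_stat_unit * stat_units
    # Fragmented work: MORE than twice the normal amount of packets.
    if packets > 2 * normal_packets:
        # Assumption: the bump is applied multiplicatively to the base credit.
        return credit * (100 + fragment_bonus_pct) / 100
    return float(credit)


# Illustrative values only (448 stat units, normal packet count of 10).
print(credit_for_workunit(448, packets=10, normal_packets=10))  # not fragmented
print(credit_for_workunit(448, packets=25, normal_packets=10))  # fragmented
```

Note that a work unit with exactly twice the normal packet count does not count as fragmented under this reading of "more than twice".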
24 Dec 2011, 15:32:05 UTC · Comment
2% done, 98% or 9 to 137 years to go
The RC5-72 project of distributed.net (where we get our work from) announced a while back that they had completed 2% of the keyspace we are checking. They also calculated that there's still more work for 9 to 137 years, so we don't have to worry about running out of work just yet.
For the announcement itself, please read http://blogs.distributed.net/2011/11/27/18/26/bovine/.
Thanks for crunching, keep it up! :) How about we try to push the time to get to 3% under a year? With your help, we can do it!
15 Dec 2011, 6:46:03 UTC · Comment
Read-only replica DB deployed
There's now a replica DB that's used for certain heavy read-only DB operations (top statistics, server_status, lists of user workunits/tasks etc). It should take a bit of load off the main DB and, by doing so, help the scheduler and other critical processes do their job.
Please do let me know if there's something that no longer seems to work. Thanks and happy crunching!
30 Nov 2011, 14:34:02 UTC · Comment
BOINC Server code updated
Just finished an update of the BOINC Server code to the latest available. This brought us about three months worth of fixes and new features from the upstream developers. Most notable fixes:
As always, please let us know if you find something broken. Thanks!
26 Oct 2011, 17:01:49 UTC · Comment
Another unexpected outage
There was another out-of-memory event on our main (and only) server that brought the project down for about six hours last Friday. Services were fully down from 21:00 to 3:30 local time (EEST+3, so that's 18:00 to 0:30 UTC and 11:00 to 17:30 PDT).
Anybody interested in reading the long technical details, please see the forum post at http://moowrap.net/forum_thread.php?id=113. Thanks for crunching!
4 Sep 2011, 3:18:16 UTC · Comment
It seems we had our first unexpected downtime yesterday. :( We've had some offline times previously but most of them have been for a shorter period of time and on purpose due to maintenance and updates.
Access to our services was failing on Thursday from about 4:00 to 20:30, when I finally brought things back online. Those times are in my local EEST+3 timezone, which means from 1:00 to 17:30 UTC and from Wed 18:00 to 10:30 PDT. This is about 16 and a half hours of lost time.
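For anyone who wants to check the conversions above for their own timezone, here is a small sketch using fixed UTC offsets (EEST as UTC+3, PDT as UTC-7). The calendar date (Thursday 4 Aug 2011) is inferred from the date of this post and is an assumption; the variable and function names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, so no timezone database is needed.
EEST = timezone(timedelta(hours=3), "EEST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Assumed date: Thursday 4 Aug 2011, inferred from the post date.
start = datetime(2011, 8, 4, 4, 0, tzinfo=EEST)
end = datetime(2011, 8, 4, 20, 30, tzinfo=EEST)


def show(moment, tz):
    """Format a moment as weekday, time and zone name in the given zone."""
    return moment.astimezone(tz).strftime("%a %H:%M %Z")


for label, moment in (("start", start), ("end", end)):
    print(label, show(moment, timezone.utc), "/", show(moment, PDT))

# Length of the outage in hours.
outage_hours = (end - start).total_seconds() / 3600
print("outage:", outage_hours, "hours")
```

To convert to another zone, add a `timezone(timedelta(hours=...))` with your own offset and pass it to `show`.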
Looking through the logs, this seems to have been caused by the server running out of memory and subsequently OOM-killing itself to death. There are a few things I can do to prevent the same problem from bringing us down in the future. (Like moving the DB to a different server, as the OOM-killer chose the poor DB to die in the first round.)
I did notice the problems in the morning but due to unrelated complications (non-project ones) I didn't manage to get the server back online until that evening. I do apologize for this extending our downtime. :(
Everything should be back to normal now but do let me know if there are still problems around. Thanks and now let's crunch hard to make up for the lost time! ;)
5 Aug 2011, 10:33:13 UTC · Comment
Donations are now accepted using PayPal. You can go to http://moowrap.net/donations.php or use the link on our home page to help with our monthly operation costs. We would like to cover half of our costs through donations, which is 100 USD per month. The other half is privately sponsored by the project administrator (or covered through discounts and other such means). For details, see the donations page. Thanks for your help, every dollar counts!
8 Jul 2011, 10:59:49 UTC · Comment
Mac applications deployed
New Mac applications deployed with both CPU and CUDA 3.1 support. There's a separate CPU application for all three CPU variants (PowerPC, Intel 32-bit and Intel 64-bit). For CUDA 3.1 there's only an Intel 32-bit application available (it might work on Intel 64-bit too) and BOINC Client needs to already detect your nVidia card correctly.
This deployment is still based on application v1.2, so any known problems from the Linux version are most likely still there.
14 Jun 2011, 16:28:01 UTC · Comment
CPU applications deployed and CUDA memory requirements changed
New CPU applications deployed for 32-bit Windows and 64-bit Linux. The Windows version should also run on 64-bit Windows systems. So you should now disable CPU applications in the project preferences if you don't want to use your CPU for this project.
There's a known problem with checkpoint interval (uses a default 2h interval) and core selection (selected core is shared with ATI/nVidia on the same host) for these applications. Both problems should be fixed in next application version so you should wait for that if you have a problem due to either limitation.
Additionally, CUDA requirements were changed to accept cards with only 64MB of memory.
26 May 2011, 12:41:04 UTC · Comment
Different workunit sizes added
I just completed adding different workunit sizes and the scheduler should now send you work based on the measured speed of your host. Additionally, the scheduler sends workunits that are a better "match" for the number of cards your host has. This should minimize idle cards at the end of a wu.
I'm pretty sure there's still some tweaking to do and I will be watching how the scheduler performs tomorrow. Please, do let me know if there seems to be something odd with workunits given to your host and especially if you are now unable to get any work. Thanks!
22 May 2011, 12:39:09 UTC · Comment
Application version 1.2 deployed
I've deployed new application v1.2, with following major changes:
18 May 2011, 15:26:53 UTC · Comment
Switched to granting static credit
Due to numerous inconsistencies in the BOINC credit calculation, I've switched to granting a static amount of credit based on stat units in a wu. The stat units value of a particular wu is the last number in its name (second to last in a task name) and measures the relative amount of work in a wu.
At the moment we are giving 5cr per stat unit, which gives a little over 2kcr per current WU size. (For example, a standardish wu with 448 stat units gives 2240cr.) We will probably switch to 7cr eventually, which gives about 3kcr per current WU size (the same 448 then gives 3136cr regardless of how long it took).
Partial result crediting works the same way it used to.
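As a rough sketch of the static credit arithmetic above: the helper names and the example wu/task names below are made up for illustration (the real naming scheme is not shown in this post); only the "last number in a wu name, second to last in a task name" rule and the 5cr/7cr rates come from the text.

```python
import re


def stat_units_from_name(name, is_task=False):
    """Pick the stat units out of a wu or task name.

    Per the post: last number in a wu name, second to last in a task name.
    """
    numbers = [int(n) for n in re.findall(r"\d+", name)]
    return numbers[-2] if is_task else numbers[-1]


def static_credit(stat_units, credit_per_stat_unit):
    """Credit depends only on stat units, not on how long the wu took."""
    return stat_units * credit_per_stat_unit


# Hypothetical names, for illustration only.
wu_units = stat_units_from_name("example_wu_12_448")
task_units = stat_units_from_name("example_wu_12_448_0", is_task=True)

print(static_credit(wu_units, 5))    # current rate: 5cr per stat unit
print(static_credit(task_units, 7))  # possible future rate: 7cr per stat unit
```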
12 May 2011, 17:39:16 UTC · Comment
New application version deployed
I just released v1.01 of our wrapper application. In addition to the previous 32-bit Windows versions there are now versions for 64-bit Linux. Both Stream and CUDA are available for both.
Some of the major changes in this release:
6 May 2011, 19:58:20 UTC · Comment
I've been working on getting the crediting to stabilize. During these changes there was a time where valid results were given a near zero credit (it was shown as zero). This affected 290 results for 54 users and 80 hosts.
I've just fixed credit for these anomalies by granting them a fixed amount. Also, the near zero credits shouldn't happen anymore since it seems to have been the result of a --max_granted_result switch for the validator that I tried as a solution for the credits. I'm not sure if it's a bug in BOINC Server code or if it's really supposed to work that way.
Otherwise, the crediting seems to have somewhat stabilized but I think we are giving too much credit at the moment. I'm going to bring it down a bit.
There are still cases where huge credit is given for some results. If the validator doesn't correct itself, I need to change how we credit results and bypass all this scaling that goes on.
5 May 2011, 10:14:13 UTC · Comment
Project now enforces the following ATI/nVidia requirements. The most notable change is that CC 2.0 and above are no longer rejected for CUDA. This means devices like GTX 465 are now offered the CUDA application. We'll see how things work for them.
There's also better logging (for me) and a generic notification to new BOINC Clients when these requirements are not met.
ATI Stream requirements:
nVidia CUDA requirements:
I probably should find a place on the website to list these requirements. They might change as we get more information about what is really required.
3 May 2011, 8:13:55 UTC · Comment
Outbound mail is delayed
Looks like BOINC server software went a little crazy with sending emails so the server hit its outbound email limit for today. There are 406 mails waiting to be sent. The limit should get reset any minute now (or it'll take another 24 hours) and at least the queued mail should get sent then.
My guess is that this flurry is from initial global team import and other notifications from users registering on the project and on forum.
I hope mail volume gets a little easier once first days pass by and things get rolling. If not, I'll need to do something about that limit. For now, let's wait patiently as queues catch up. Thanks! :)
BTW, inbound mail for the project addresses (mentioned throughout the site) is most likely bouncing at the moment. I'll need to set those up soon but let's use this forum for communication for now.
2 May 2011, 8:58:58 UTC · Comment
Up and running!
Basic infrastructure is now up and running and at least ATI application should be usable. There's also CUDA application but that can have some performance issues still. I'm opening account creation for those brave enough to alpha test while I finish some loose ends.
1 May 2011, 23:36:35 UTC · Comment