Crediting problems |
Message boards : Number crunching : Crediting problems
| Author | Message | |
|---|---|---|
Hi, | ||
| ID: 2465 | Rating: 0 | rate: | [Reply to this post] | |
Because the DB ran dry of work, the scheduler allowed two different types of hosts to claim the same WU (for validation) at the same time (a race condition), with the result that your WU was not matched to another host of the same hardware class. This should be fixed for now, because there is more work available again. | ||
| ID: 2467 | Rating: 0 | rate: | [Reply to this post] | |
You should perhaps turn on homogeneous redundancy, as WUs from Lin/Apple and Win don't seem to match. | ||
| ID: 2492 | Rating: 0 | rate: | [Reply to this post] | |
I've seen this problem recently on my Linux machine too. | ||
| ID: 2494 | Rating: 0 | rate: | [Reply to this post] | |
Homogeneous redundancy is turned on; this is a race condition in the shmem table. When the DB does not have enough work ready, the two result units of the same workunit are both put into the shmem table to be scheduled to hosts. If at that point two hosts of different hardware classes each claim a result unit of the same workunit, the race condition happens: both crunch fine and both return results, but the results won't match, and the server will generate another result unit to be scheduled in order to find consensus. See? This is a 'design' flaw in the BOINC feeder schedulers, which weren't exactly made for homogeneous redundancy in the first place. | ||
| ID: 2496 | Rating: 0 | rate: | [Reply to this post] | |
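The race described in the post above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not BOINC's actual feeder or scheduler code: the names `Workunit`, `racy_dispatch`, and `serialized_dispatch` are hypothetical, `hr_class = 0` is taken to mean "not yet committed to a hardware class", and the racy path assumes that the last request to write back simply overwrites `hr_class`.

```python
from dataclasses import dataclass, field

UNASSIGNED = 0  # assumption: hr_class of a fresh workunit
LINUX = 1       # hypothetical hardware-class codes, for illustration only
WINDOWS = 2


@dataclass
class Workunit:
    wu_id: int
    hr_class: int = UNASSIGNED
    assigned_classes: list = field(default_factory=list)


def can_claim(seen_hr_class: int, host_class: int) -> bool:
    """A host may take a result if the WU is uncommitted or already in its class."""
    return seen_hr_class in (UNASSIGNED, host_class)


def racy_dispatch(wu: Workunit, host_classes: list) -> Workunit:
    """Every request reads hr_class before any request writes it back."""
    stale_reads = [wu.hr_class for _ in host_classes]
    for seen, host_class in zip(stale_reads, host_classes):
        if can_claim(seen, host_class):
            wu.assigned_classes.append(host_class)
            wu.hr_class = host_class  # assumed: last writer wins
    return wu


def serialized_dispatch(wu: Workunit, host_classes: list) -> Workunit:
    """Each request re-reads hr_class after the previous request has committed."""
    for host_class in host_classes:
        if can_claim(wu.hr_class, host_class):
            wu.assigned_classes.append(host_class)
            wu.hr_class = host_class
    return wu


racy = racy_dispatch(Workunit(wu_id=1), [LINUX, WINDOWS])
safe = serialized_dispatch(Workunit(wu_id=2), [LINUX, WINDOWS])
```

In this model, `racy.assigned_classes` ends up containing both classes, which corresponds to the mismatched pair of results, while `safe.assigned_classes` contains only the first host's class, because the second request sees the committed `hr_class` and is refused.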
"Homogeneous redundancy is turned on;" If HR were turned on, the different results of the same WU would go only to similar machines, i.e. only Linux or only Windows. The results go out to different machines, so HR is definitely not turned on. If it had been turned on and no free result for a Linux machine were available, something like this would show up in my message tab: "message from server: there was work but it was committed to other platforms". The same WU was sent to both Linux and Windows computers; that is by definition without HR. ____________ Greetings from Saenger. For questions about BOINC, look in the BOINC-Wiki | ||
| ID: 2502 | Rating: 0 | rate: | [Reply to this post] | |
No, you misunderstand; let me repeat: | ||
| ID: 2503 | Rating: 0 | rate: | [Reply to this post] | |
"A bit clearer now?" Yep, it is. Is there a way to delay the sending of the second result of a WU, so that this cannot happen? (Probably not, otherwise you would probably have turned it on ;) I don't know how fast you need the WUs back; one possibility to achieve this could be to create only one result first and create the second one after the first has been sent with the right HR class. But I'm no programmer, so I dunno whether and how that's possible. | ||
| ID: 2505 | Rating: 0 | rate: | [Reply to this post] | |
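The suggestion in the post above, holding back the second result until the first has been sent and the workunit has an HR class, could look roughly like this. It is a hypothetical sketch, not how the BOINC feeder is implemented: `fill_slots`, the `slots` list standing in for the shmem table, and the `hr_class` dictionary are all invented for illustration.

```python
UNASSIGNED = 0  # assumption: hr_class of a workunit that no host has claimed yet


def fill_slots(pending, slots, hr_class, capacity):
    """Move results (identified by workunit ID) from pending into slots.

    A result is held back when its workunit is still uncommitted and a
    sibling result of the same workunit is already exposed in the slots.
    Returns the list of results that were held back.
    """
    held = []
    for wu_id in pending:
        uncommitted = hr_class.get(wu_id, UNASSIGNED) == UNASSIGNED
        sibling_exposed = wu_id in slots
        if len(slots) < capacity and not (uncommitted and sibling_exposed):
            slots.append(wu_id)
        else:
            held.append(wu_id)
    return held


# Two results of workunit 7 and one result of workunit 8 are waiting.
slots = []
hr_class = {}
held = fill_slots([7, 7, 8], slots, hr_class, capacity=4)
first_pass_slots = list(slots)  # only one result of workunit 7 is exposed
first_pass_held = list(held)

# A host of (hypothetical) class 1 claims the exposed result of workunit 7.
slots.remove(7)
hr_class[7] = 1
held = fill_slots(held, slots, hr_class, capacity=4)
```

After the first pass only one result of workunit 7 is visible to hosts, so two hosts of different classes cannot claim the pair at the same moment. The second result is released on the next pass, once `hr_class[7]` is set. The trade-off is extra latency before the second copy goes out.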
"*IF*, however, both result units of a fresh workunit (with hr_class = 0) are present in the shmem table, and two different hosts claim it at the same time (because hr_class = 0), the final hr_class value is determined by the host with the slowest scheduler request." I just looked at some of my still-pending results, and your explanation doesn't fit. Workunit 7166268: result 17029650, host 68237, sent 13 Oct 2009 18:39:28 UTC, reported 13 Oct 2009 22:54:32 UTC, Over / Success / Done, CPU time 3,142.88, claimed credit 15.41, granted credit pending; result 17029651, host 47289, sent 15 Oct 2009 14:48:07 UTC, reported 15 Oct 2009 18:10:31 UTC, Over / Success / Done, CPU time 2,077.09, claimed credit 14.60, granted credit pending; result 17061804, host 68814, sent 20 Oct 2009 13:42:48 UTC, reported 20 Oct 2009 14:19:39 UTC, Over / Client error / Compute error, CPU time 0.00, claimed credit 0.00, granted credit ---; result 17148681, host 65520, sent 20 Oct 2009 14:21:09 UTC, deadline 24 Oct 2009 14:21:09 UTC, In Progress / Unknown / New, CPU time ---, claimed credit ---, granted credit ---. My Linux computer is #47289; it got its result nearly 2 days after the Windows computer #68237 got its own, and even considerably after that one had returned the crunched result. The second result should not have been sent to my computer if your theory were right. The same has happened on other WUs as well. HR is definitely not switched on, or at least it is switched on in an erroneous way. ____________ Greetings from Saenger. For questions about BOINC, look in the BOINC-Wiki | ||
| ID: 2506 | Rating: 0 | rate: | [Reply to this post] | |
H'mmm... This I need to investigate. I'm definitely sure HR is turned on; otherwise we would have all sorts of other serious problems! It could be that this is a glitch that somehow occurs when I temporarily stop and restart the project for maintenance and backups... I did have some issues at the beginning of the month, when the DB ran dry and I had to shorten the delay time of jobs a bit... Got no clue yet, but I'm really sure HR is turned on and is working fine... | ||
| ID: 2507 | Rating: 0 | rate: | [Reply to this post] | |
Just wondering what happened to this result: | ||
| ID: 2653 | Rating: 0 | rate: | [Reply to this post] | |