Crediting problems



Message boards : Number crunching : Crediting problems

[BMF]DevinK
Joined: Oct 28, 2008
Posts: 2
ID: 17935
Credit: 237,988
RAC: 17
Message 2465 - Posted 12 Aug 2009 21:28:14 UTC

Hi,

My Linux host finishes the WUs, and after a while they get no credit awarded. I don't understand why, because the log doesn't indicate an error or faulty hardware; the exit status is normal on all WUs.

Sometimes I get credited, but lots of results come back without points. All graphical libraries are installed.


Example

Thx in advance, DevinK

m.somers
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: Nov 14, 2005
Posts: 641
ID: 1
Credit: 1,417,572
RAC: 2
Message 2467 - Posted 13 Aug 2009 8:38:29 UTC

Because the db ran dry of work, the scheduler allowed two different types of hosts to claim the same WU (for validation) at the same time (a race condition), with the result that your WU was not matched to another host of the same hardware class. This should be fixed for the time being because there is more work now.

m.


____________
M.F. Somers

Saenger
Joined: Feb 15, 2006
Posts: 18
ID: 179
Credit: 27,250
RAC: 8
Message 2492 - Posted 16 Oct 2009 14:19:37 UTC

You should perhaps turn on homogeneous redundancy, as WUs from Lin/Apple and Win don't seem to match.
I've got a lot of WUs where I get invalidated because the other computers of the quorum were Windows, some where a Win got invalidated against two Lin (or, in at least one case, my Lin and a Darwin), and some that went just fine because I was paired with a Lin from the beginning.

It didn't happen in the past; something seems to have changed either with the app or with the WUs.
____________
Gruesse vom Saenger


For questions about Boinc look in the BOINC-Wiki

Van Fanel
Joined: Dec 12, 2008
Posts: 1
ID: 18687
Credit: 72,595
RAC: 206
Message 2494 - Posted 20 Oct 2009 18:22:00 UTC

I've seen this problem recently on my Linux machine too.

m.somers
Joined: Nov 14, 2005
Posts: 641
ID: 1
Credit: 1,417,572
RAC: 2
Message 2496 - Posted 21 Oct 2009 5:43:14 UTC
Last modified: 21 Oct 2009 5:45:19 UTC

Homogeneous redundancy is turned on; this is a race condition in the shmem-table. When the db does not have enough work ready, the two result units of the same workunit are both put into the shmem table to be scheduled to hosts. When at that point two hosts of different hardware classes each claim a result unit of the same workunit, this race condition happens: both crunch fine and both return results, but the results won't match, and the server will generate another result unit to be scheduled to find consensus. See? This is a 'design' flaw in the boinc feeder schedulers, which weren't exactly made for homogeneous redundancy in the first place.

Our adapted feeder takes this effect into account correctly only if enough work is available in the db. This is a design decision: what would you do if you had only one wu of work and with it two result units to schedule? Would you hold back the second result and let all hosts wait until the first result has been computed, or would you accept the possibility of a mismatch and schedule it anyway? For the science and the project the latter choice is wiser ;-).

m.


____________
M.F. Somers

Saenger
Joined: Feb 15, 2006
Posts: 18
ID: 179
Credit: 27,250
RAC: 8
Message 2502 - Posted 22 Oct 2009 22:15:10 UTC - in response to Message ID 2496.

Homogeneous redundancy is turned on;

If HR were turned on, the different results of the same WU would go only to similar machines, i.e. only Linux or only Windows.
The results go out to different machines, so HR is definitely not turned on.

If it were turned on and no free result for a Lin machine were available, something like this would be in my message tab:
message from server: there was work but it was committed to other platforms

The same WU was sent to both Linux and Windows computers; that's by definition without HR.
____________
Gruesse vom Saenger


For questions about Boinc look in the BOINC-Wiki

m.somers
Joined: Nov 14, 2005
Posts: 641
ID: 1
Credit: 1,417,572
RAC: 2
Message 2503 - Posted 23 Oct 2009 8:01:14 UTC - in response to Message ID 2502.

No, you misunderstand; let me repeat:

"this is a race condition in the shmem-table. When the db does not have enough work ready, the two result units of the same workunit are both put into the shmem table to be scheduled to hosts. When at that point two hosts of different hardware classes each claim a result unit of the same workunit, this race condition happens: both crunch fine and both return results, but the results won't match, and the server will generate another result unit to be scheduled to find consensus. See? This is a 'design' flaw in the boinc feeder schedulers, which weren't exactly made for homogeneous redundancy in the first place."

and explain a bit more:

"What you probably do not know is how hr_class in the Db is assigned its value: it starts off with value 0 (meaning any host can get this result unit to crunch) until a host has claimed a result unit, at which point hr_class gets a value corresponding to the hardware class of that machine. After that, only hosts of the same hr_class will get the other result unit.

*IF*, however, both result units of a fresh workunit (with hr_class = 0) are present in the shmem-table, and two different hosts claim them at the same time (because hr_class = 0), the final hr_class value is determined by the host with the slowest scheduler request. This is a race condition that happens when fresh results are added to the Db and the Db is running dry."
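
The sequence described above can be sketched as a small toy model. This is only an illustration and not the actual BOINC feeder or scheduler code: the Workunit class, the claim function and the class numbers for LINUX and WINDOWS are made-up names, and the shmem-table is reduced to "the hr_class value a request happened to see".

```python
from dataclasses import dataclass, field
from typing import List

ANY_CLASS = 0            # hr_class 0: any host may take a result unit
LINUX, WINDOWS = 1, 2    # hypothetical class numbers, for illustration only


@dataclass
class Workunit:
    """Toy stand-in for a workunit row; only hr_class is modelled."""
    hr_class: int = ANY_CLASS
    sent_to: List[int] = field(default_factory=list)


def claim(wu: Workunit, seen_hr_class: int, host_class: int) -> bool:
    """Claim a result unit based on the hr_class value the request saw.

    seen_hr_class may be stale: it is what the request read from the
    shared-memory table, not necessarily what is in the db right now.
    """
    if seen_hr_class not in (ANY_CLASS, host_class):
        return False                 # committed to another hardware class
    wu.sent_to.append(host_class)
    wu.hr_class = host_class         # last writer wins
    return True


def enough_work() -> Workunit:
    """The second request sees the hr_class set by the first one."""
    wu = Workunit()
    claim(wu, wu.hr_class, LINUX)
    claim(wu, wu.hr_class, WINDOWS)  # refused: hr_class is already LINUX
    return wu


def db_running_dry() -> Workunit:
    """Both requests read hr_class = 0 before either one writes it back."""
    wu = Workunit()
    seen_by_linux = wu.hr_class
    seen_by_windows = wu.hr_class
    claim(wu, seen_by_linux, LINUX)
    claim(wu, seen_by_windows, WINDOWS)  # accepted: its view is stale
    return wu


if __name__ == "__main__":
    print(enough_work().sent_to)       # [1]
    print(db_running_dry().sent_to)    # [1, 2]
```

In the second case the workunit ends up with results on two different hardware classes, and hr_class keeps whatever the later request wrote, which is the "slowest scheduler request" effect mentioned above.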

A bit clearer now?

m.
____________
M.F. Somers

Saenger
Joined: Feb 15, 2006
Posts: 18
ID: 179
Credit: 27,250
RAC: 8
Message 2505 - Posted 24 Oct 2009 8:17:02 UTC - in response to Message ID 2503.

A bit clearer now?

Yep, it is.
Is there a way to delay the sending of the second result of a WU, so that this cannot happen? (Probably not, otherwise you would probably have turned it on ;)

I don't know how fast you need the WUs back, but one possibility could be to create only one result at first and create the second one after the first has been sent with the right HR class. But I'm no programmer; I dunno whether and how that's possible.
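
A rough sketch of that idea, for what it's worth. It is a toy model under my own assumptions and not how the BOINC server actually creates results: the Workunit fields, request_work and the class numbers are all invented for the example, and a quorum of two is assumed.

```python
from dataclasses import dataclass, field
from typing import List

ANY_CLASS = 0            # hr_class 0: not yet committed to a hardware class
LINUX, WINDOWS = 1, 2    # hypothetical class numbers, for illustration only


@dataclass
class Workunit:
    target_results: int = 2          # assumed quorum size
    hr_class: int = ANY_CLASS
    unsent: int = 1                  # only one result unit exists at first
    sent_to: List[int] = field(default_factory=list)


def request_work(wu: Workunit, host_class: int) -> bool:
    """Hand out a result unit, creating the next one only after sending."""
    if wu.unsent == 0:
        return False                 # nothing available to hand out
    if wu.hr_class not in (ANY_CLASS, host_class):
        return False                 # committed to another hardware class
    wu.unsent -= 1
    wu.sent_to.append(host_class)
    wu.hr_class = host_class
    if len(wu.sent_to) < wu.target_results:
        wu.unsent += 1               # next result is created with hr_class fixed
    return True


def demo() -> List[bool]:
    wu = Workunit()
    return [
        request_work(wu, LINUX),     # first result goes out, class becomes LINUX
        request_work(wu, WINDOWS),   # refused, class is already fixed
        request_work(wu, LINUX),     # second result goes to a matching host
        request_work(wu, LINUX),     # refused, both results have been sent
    ]
```

Because a second result unit never exists while hr_class is still 0, two hosts cannot both see an uncommitted workunit. The cost is the one already mentioned in this thread: with little work in the db, other hosts may have to wait.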

Saenger
Joined: Feb 15, 2006
Posts: 18
ID: 179
Credit: 27,250
RAC: 8
Message 2506 - Posted 24 Oct 2009 12:08:25 UTC - in response to Message ID 2503.
Last modified: 24 Oct 2009 12:09:53 UTC

*IF*, however, both result units of a fresh workunit (with hr_class = 0) are present in the shmem-table, and two different hosts claim them at the same time (because hr_class = 0), the final hr_class value is determined by the host with the slowest scheduler request.


I just looked at some of my still pending results, and your explanation doesn't fit:
7166268:
Result ID  Computer  Sent                      Time reported or deadline  Server state  Outcome       Client state   CPU time (sec)  Claimed credit  Granted credit
17029650   68237     13 Oct 2009 18:39:28 UTC  13 Oct 2009 22:54:32 UTC   Over          Success       Done           3,142.88        15.41           pending
17029651   47289     15 Oct 2009 14:48:07 UTC  15 Oct 2009 18:10:31 UTC   Over          Success       Done           2,077.09        14.60           pending
17061804   68814     20 Oct 2009 13:42:48 UTC  20 Oct 2009 14:19:39 UTC   Over          Client error  Compute error  0.00            0.00            ---
17148681   65520     20 Oct 2009 14:21:09 UTC  24 Oct 2009 14:21:09 UTC   In Progress   Unknown       New            ---             ---             ---

My Linux computer is #47289. It got its result nearly 2 days after the Windows computer #68237 got its one, and even considerably after that one returned the crunched result. The second result should not have been sent to my computer if your theory were right.
The same has happened on other WUs as well.
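
To make the timing explicit, here is a small check using the three timestamps from the first two results listed above. The one-minute window for what counts as "at the same time" is purely my own assumption; the real window of the race is not stated anywhere in this thread.

```python
from datetime import datetime, timedelta

# Timestamps (UTC) copied from the first two results listed above.
first_sent = datetime(2009, 10, 13, 18, 39, 28)      # result 17029650, host 68237
first_reported = datetime(2009, 10, 13, 22, 54, 32)
second_sent = datetime(2009, 10, 15, 14, 48, 7)      # result 17029651, host 47289

gap_between_sends = second_sent - first_sent
gap_after_report = second_sent - first_reported


def could_be_simultaneous(gap: timedelta,
                          window: timedelta = timedelta(minutes=1)) -> bool:
    """True if two sends are close enough to plausibly race (window is assumed)."""
    return abs(gap) <= window


if __name__ == "__main__":
    print("second sent after first sent by:", gap_between_sends)
    print("second sent after first reported by:", gap_after_report)
    print("plausibly simultaneous:", could_be_simultaneous(gap_between_sends))
```

The second result went out more than a day and a half after the first one had already been reported, so the two requests were nowhere near simultaneous.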

HR is definitely not switched on, or at least switched on in an erroneous way.
____________
Gruesse vom Saenger


For questions about Boinc look in the BOINC-Wiki

m.somers
Joined: Nov 14, 2005
Posts: 641
ID: 1
Credit: 1,417,572
RAC: 2
Message 2507 - Posted 26 Oct 2009 9:20:33 UTC
Last modified: 26 Oct 2009 9:21:42 UTC

Hmmm... This I need to investigate. I'm definitely sure HR is turned on; otherwise we would have all sorts of other serious problems! It could be that this is a glitch that occurs when I temporarily stop and restart the project for maintenance and backups... I did have some issues at the beginning of the month when the DB ran dry and I had to shorten the delay time of jobs a bit... Got no clue yet, but I'm really sure HR is turned on and is working fine...

m.
____________
M.F. Somers

Phil Klassen
Joined: Nov 4, 2007
Posts: 1
ID: 10579
Credit: 1,008,804
RAC: 6
Message 2653 - Posted 7 Oct 2010 3:01:27 UTC

Just wondering what happened to this result:

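Result ID  Work unit  Sent                     Time reported or deadline  Server state  Outcome  Client state  CPU time (sec)  Claimed credit  Granted credit
25239302   10923457   5 Oct 2010 13:15:37 UTC  7 Oct 2010 1:27:11 UTC     Over          Success  Done          37,624.66       196.61          pending
25239285   10923439   5 Oct 2010 13:15:37 UTC  6 Oct 2010 2:33:11 UTC     Over          Success  Done          16,963.88       88.64           0.00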

The second result here was a success but was granted 0 credit. Just wondering if something is up with my hardware. Thanks



Copyright © 2014 Leiden University - Leiden Institute of Chemistry - Theoretical Chemistry Department