number of pending wu skyrocketing


ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 752 - Posted 28 Sep 2006 16:58:13 UTC

For the past day or two, my number of pending WUs has been increasing rapidly. Where it used to be about 2400 on average, it's now already at 4300 and still climbing. I haven't been adding special PCs or special platforms; it's still the same farm as it was before.
Is there an explanation for this? Is it a glitch on the server side, or has it to do with the units? Anyone else seeing the same thing?
____________

Odysseus
Joined: Sep 22, 2006
Posts: 26
ID: 1458
Credit: 56,637
RAC: 23
Message 754 - Posted 29 Sep 2006 2:14:03 UTC - in response to Message ID 752.
Last modified: 29 Sep 2006 2:17:37 UTC

Is there an explanation for this? Is it a glitch on the server side, or has it to do with the units? Anyone else seeing the same thing?

I just signed up, so I don't know what's usual here. I did have some results validated over the first few days. Now I notice that all my 'quorum partner' results are "Unsent", so these WUs could remain pending for a while.

P.S. The website seems rather slow. Overloaded server?
____________

m.somers
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: Nov 14, 2005
Posts: 662
ID: 1
Credit: 1,417,572
RAC: 2
Message 759 - Posted 29 Sep 2006 8:59:22 UTC

Because a lot of WU's are present to be processed at the moment and scheduler is scheduling 'old' WU's for other hardware first... You'll get your credit eventually when they are processed ;-)... As a spin-off it seems that the grid is doing less work now, but that's also not true actually ;-)... I'll try and see if I can fix/tweak some things...

m.
____________
M.F. Somers

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 764 - Posted 29 Sep 2006 16:48:59 UTC

Straight answer, we'll wait, fair enough ;-)
____________

Buster Gunn
Joined: Sep 17, 2006
Posts: 4
ID: 1352
Credit: 11,813
RAC: 117
Message 768 - Posted 2 Oct 2006 21:53:36 UTC - in response to Message ID 759.

Because a lot of WU's are present to be processed at the moment and scheduler is scheduling 'old' WU's for other hardware first... You'll get your credit eventually when they are processed ;-)... As a spin-off it seems that the grid is doing less work now, but that's also not true actually ;-)... I'll try and see if I can fix/tweak some things...

m.

Well, I'd hate to disagree with the moderator but my pending credits aren't even being sent out to be verified. I only have around 200 w/u pending but most of them (99%) have a "not sent" entry where there should be another computer from someone else. My total credits haven't moved in 4 days.
____________

Odysseus
Joined: Sep 22, 2006
Posts: 26
ID: 1458
Credit: 56,637
RAC: 23
Message 769 - Posted 3 Oct 2006 4:01:00 UTC - in response to Message ID 768.

Well, I'd hate to disagree with the moderator but my pending credits aren't even being sent out to be verified. I only have around 200 w/u pending but most of them (99%) have a "not sent" entry where there should be another computer from someone else. My total credits haven't moved in 4 days.

I'm in much the same boat, albeit on a much smaller scale: I have over 200 credits pending and all the WUs I've checked are still showing "Unsent".

Meantime I'm getting quite a lot of

Mon 2 Oct 20:07:16 2006|Leiden Classical|Message from server: No work sent
Mon 2 Oct 20:07:16 2006|Leiden Classical|Message from server: (there was work for other platforms)
Mon 2 Oct 20:07:16 2006|Leiden Classical|No work from project


in my BOINC message logs.

____________

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 770 - Posted 3 Oct 2006 17:39:40 UTC

Processing these old wu's takes quite long imho.
My current # pending = 10014 ... and climbing.

I'm running for #1 worldwide as a birthday gift to self, which is in approx 10 days. Will that be possible in any way? ;-)
____________

m.somers
Joined: Nov 14, 2005
Posts: 662
ID: 1
Credit: 1,417,572
RAC: 2
Message 774 - Posted 4 Oct 2006 7:39:45 UTC - in response to Message ID 769.


I'm in much the same boat, albeit on a much smaller scale: I have over 200 credits pending and all the WUs I've checked are still showing "Unsent".

Meantime I'm getting quite a lot of

Mon 2 Oct 20:07:16 2006|Leiden Classical|Message from server: No work sent
Mon 2 Oct 20:07:16 2006|Leiden Classical|Message from server: (there was work for other platforms)
Mon 2 Oct 20:07:16 2006|Leiden Classical|No work from project


in my BOINC message logs.


I see you are one of the few lucky/un-lucky ones with a Mac. Your machine will get work eventually, but you are really outnumbered by thousands of other machines running Windows... You will get your credit and there will be work for you, but not as much as for the others ;-).

m.

____________
M.F. Somers

m.somers
Joined: Nov 14, 2005
Posts: 662
ID: 1
Credit: 1,417,572
RAC: 2
Message 775 - Posted 4 Oct 2006 7:56:18 UTC - in response to Message ID 770.

Processing these old wu's takes quite long imho.
My current # pending = 10014 ... and climbing.

I'm running for #1 worldwide as a birthday gift to self, which is in approx 10 days. Will that be possible in any way? ;-)


Don't know... Seems that of the 150k WU's we still have 30k left... your pending result should start getting un-pended pretty soon ;-)...

m.
____________
M.F. Somers

Odysseus
Joined: Sep 22, 2006
Posts: 26
ID: 1458
Credit: 56,637
RAC: 23
Message 776 - Posted 4 Oct 2006 11:19:43 UTC - in response to Message ID 774.
Last modified: 4 Oct 2006 11:25:44 UTC

I see you are one of the few lucky/un-lucky ones with a Mac. Your machine will get work eventually, but you are really outnumbered by thousands of other machines running Windows... You will get your credit and there will be work for you, but not as much as for the others ;-).

Sorry, I don't understand what that has to do with either issue. Macs are a minority (~4%) in all BOINC projects that support them, but I've never seen a "there was work for other platforms" message from e.g. SETI@home (except when they switched to the Enhanced version; I was running an optimized app at the time, using the anonymous-platform mechanism, and got some of those messages before I updated). And how would my having a Mac prevent or delay the counterparts of the results I've submitted from being sent to other hosts to make a quorum? All the WUs I've looked at have "Unsent" results, with none in progress.

BTW, I'm still getting a lot of
MacOS Error -43 occured in Mac_Lib.c line 64 
messages in my output files, sprinkled with the occasional
GLUT: Fatal Error in screensaver: could not open display: 
But these errors don't seem to affect the exit status--I only hope these results will turn out to be valid ... if the counterparts ever get processed for comparison ...

Are there any plans to produce a screensaver for the Mac version?

____________

adrianxw
Joined: Mar 8, 2006
Posts: 24
ID: 731
Credit: 273,601
RAC: 167
Message 778 - Posted 4 Oct 2006 19:57:47 UTC
Last modified: 4 Oct 2006 20:00:12 UTC

No work for Windows either...

04/10/2006 21:46:28|Leiden Classical|Sending scheduler request to http://boinc.gorlaeus.net/Classical_cgi/cgi
04/10/2006 21:46:28|Leiden Classical|Reason: To fetch work
04/10/2006 21:46:28|Leiden Classical|Requesting 17280 seconds of new work
04/10/2006 21:46:33|Leiden Classical|Scheduler request succeeded
04/10/2006 21:46:33|Leiden Classical|Message from server: No work sent
04/10/2006 21:46:33|Leiden Classical|No work from project
04/10/2006 21:53:39|Leiden Classical|Sending scheduler request to http://boinc.gorlaeus.net/Classical_cgi/cgi
04/10/2006 21:53:39|Leiden Classical|Reason: Requested by user
04/10/2006 21:53:39|Leiden Classical|Requesting 17280 seconds of new work
04/10/2006 21:53:44|Leiden Classical|Scheduler request succeeded
04/10/2006 21:53:44|Leiden Classical|Message from server: No work sent

... etc. despite "Ready to send 111,298 ".
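For anyone who would rather tally these messages than eyeball them, here is a minimal sketch in Python. It assumes only the pipe-separated `timestamp|project|message` layout visible in the log lines above; the function name `count_messages` is made up for illustration and is not part of BOINC.

```python
from collections import Counter

def count_messages(log_text):
    """Count (project, message) pairs in pipe-separated BOINC-style log text."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        _timestamp, project, message = parts
        counts[(project, message.strip())] += 1
    return counts

sample = """04/10/2006 21:46:33|Leiden Classical|Message from server: No work sent
04/10/2006 21:46:33|Leiden Classical|No work from project
04/10/2006 21:53:44|Leiden Classical|Message from server: No work sent"""

for (project, message), n in count_messages(sample).items():
    print(f"{n}x {project}: {message}")
```

Counting "No work sent" replies over a day gives a rough idea of how often a host is being turned away.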
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

KSMarksPsych
Joined: Feb 15, 2006
Posts: 40
ID: 153
Credit: 5,050
RAC: 15
Message 779 - Posted 4 Oct 2006 23:43:44 UTC - in response to Message ID 776.

I see you are one of the few lucky/un-lucky ones with a Mac. Your machine will get work eventually, but you are really outnumbered by thousands of other machines running Windows... You will get your credit and there will be work for you, but not as much as for the others ;-).

Sorry, I don't understand what that has to do with either issue. Macs are a minority (~4%) in all BOINC projects that support them, but I've never seen a "there was work for other platforms" message from e.g. SETI@home (except when they switched to the Enhanced version; I was running an optimized app at the time, using the anonymous-platform mechanism, and got some of those messages before I updated). And how would my having a Mac prevent or delay the counterparts of the results I've submitted from being sent to other hosts to make a quorum? All the WUs I've looked at have "Unsent" results, with none in progress.

BTW, I'm still getting a lot of
MacOS Error -43 occured in Mac_Lib.c line 64 
messages in my output files, sprinkled with the occasional
GLUT: Fatal Error in screensaver: could not open display: 
But these errors don't seem to affect the exit status--I only hope these results will turn out to be valid ... if the counterparts ever get processed for comparison ...

Are there any plans to produce a screensaver for the Mac version?



I'm pretty certain that the scheduler uses homogeneous redundancy here, meaning that Macs are paired with Macs, etc. So you have a ton of pending WUs that are just waiting for another Mac to request work.
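To make the pairing idea concrete, here is a rough sketch of how such a rule could look. It is an illustration only: the `Host` type, the `hr_class` function and the choice to classify hosts purely by operating system are assumptions for the example, not the project's actual scheduler code or BOINC's real classification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Host:
    name: str
    os: str  # e.g. "mac", "windows", "linux"

def hr_class(host):
    """Hypothetical equivalence class: here, just the operating system."""
    return host.os

def can_take(workunit_class, host):
    """A host may take a result only if the workunit is still unassigned
    (class is None) or is already tied to the host's own class."""
    return workunit_class is None or workunit_class == hr_class(host)

# A workunit whose first result went to a Mac is now tied to the "mac" class.
first_host = Host("g5", "mac")
wu_class = hr_class(first_host)

print(can_take(wu_class, Host("xp-box", "windows")))  # False
print(can_take(wu_class, Host("g4", "mac")))          # True
```

Under a rule like this, the second result of a Mac workunit stays "Unsent" until another host of the same class asks for work.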
____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

m.somers
Joined: Nov 14, 2005
Posts: 662
ID: 1
Credit: 1,417,572
RAC: 2
Message 784 - Posted 5 Oct 2006 8:17:37 UTC

And that's absolutely correct ;-)

m.
____________
M.F. Somers

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 795 - Posted 6 Oct 2006 16:00:14 UTC

number of wu's pending = 13169 and still increasing
This is getting ridiculous ...
____________

Buster Gunn
Joined: Sep 17, 2006
Posts: 4
ID: 1352
Credit: 11,813
RAC: 117
Message 798 - Posted 7 Oct 2006 12:32:27 UTC - in response to Message ID 795.

number of wu's pending = 13169 and still increasing
This is getting ridiculous ...


Yup - mine are still going up also. Something is amiss.
____________

[B^S] Gamma^Ray
Joined: Sep 18, 2006
Posts: 5
ID: 1381
Credit: 10,135
RAC: 331
Message 799 - Posted 7 Oct 2006 16:03:02 UTC

I'm new to this project, so excuse my ignorance. But is it normal for this project to take sooo long to validate and award any type of credit once the WU is crunched? Or is it a recent problem that's popped up?

Just curious,
G^R
____________

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 800 - Posted 7 Oct 2006 17:06:59 UTC - in response to Message ID 799.
Last modified: 7 Oct 2006 17:08:32 UTC

I'm new to this project, so excuse my ignorance. But is it normal for this project to take sooo long to validate and award any type of credit once the WU is crunched? Or is it a recent problem that's popped up?

Just curious,
G^R


If you look at your results (*) and click on the workunit numbers, you will find three of the six outstanding WU have not yet been issued. This means you are waiting for another XP user to ask for work, and the scheduler to issue that work to them.

You can speed this up at present by asking for more work yourself -- as a temporary exception to the normal BOINC rule, LC currently lets the same computer get the same work to validate it. This is not guaranteed to work, but it has happened to me a couple of times. I expect this feature will be withdrawn once this project settles down.

The other three seem to be ghosts. Again click on the workunit numbers, and you can see the other computer they were issued to a minute or so before you got them. First bit of bad news, all three are issued to the same computer.

Now click on that computer number. It has been issued over 500 workunits. Click on the list, and scan down. You will see that some work issued recently has been returned by that computer, but older work is outstanding. This is very bad news.

Either those three tasks got lost at the point of issue (so called ghost tasks), or the other participant has had a serious problem with his box and has reset the client, making it forget those tasks. It makes no difference now which has happened. You will need to wait till the deadline expires on those tasks, then wait for the scheduler to re-issue them, then wait for whoever gets them to process them.

My suggestion is to go on getting work - I joined after you but have processed about 600 credits worth so far, and have actually received about 440 - around 75% actually received. At first though almost all of my work was held up just like yours is now. The secret is not to stop and wait for the first batch ;-)

River~~

(*) PS: for those who don't yet know:

To look at someone's results (mine perhaps), click on the link that gives my name by this posting and you will see a shorter version of my account page - this has a link to my computers, from which you can see my results.

____________

[B^S] Gamma^Ray
Joined: Sep 18, 2006
Posts: 5
ID: 1381
Credit: 10,135
RAC: 331
Message 801 - Posted 7 Oct 2006 19:13:05 UTC

Thanks for the detailed explanation. Is there any particular reason why the validation process is done like this here, as opposed to the way most other projects are done? It seems a bit strange to have to possibly wait until a work unit has passed its deadline and been reissued a second time, and only then, if it is returned valid, will you get any credit (unless someone else runs it and returns it). But it is what it is, I reckon. Hehe, was just curious.

Thanks !
G^R
____________

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 812 - Posted 8 Oct 2006 17:36:28 UTC - in response to Message ID 801.
Last modified: 8 Oct 2006 17:56:49 UTC

Is there any particular reason why the validation process is done like this here, as opposed to the way most other projects are done?


The short answer is that there is only limited freedom of choice - within the BOINC package you either waste computer resources calculating everything more than twice, or you accept the delays that happen when one task of a pair goes astray.

By "most other projects" do you mean most other BOINC projects, or most other Distributed Computing (DC) projects?

Most BOINC projects do suffer the same delays.

If you want validation at all within BOINC, you have to double-process everything and compare answers. You either send out more than two copies of the same work (so you hope at least two come back quickly), or you accept there will be delays when one half goes missing.

Einstein@home is an example of a major BOINC project that does exactly what LC does - it sends out only two copies of each WU, and accepts that sometimes it takes a long time to validate. They used to send out four copies, then three - but both times they changed policy they did so because they wanted to get the most processing out of the donated resources. By double-crunching instead of quad-crunching they get twice as much science out of the same pool of donors.
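The "twice as much science" point is simple arithmetic, sketched below. The pool size of 12000 results per day is a made-up figure for illustration, and the function name is hypothetical; the only assumption is that the donor pool returns a fixed number of results per day regardless of how many copies of each WU are issued.

```python
def workunits_completed(results_per_day, copies_per_workunit):
    """Distinct workunits finished per day if every workunit needs
    `copies_per_workunit` returned results."""
    return results_per_day // copies_per_workunit

pool = 12000  # hypothetical number of results the donors return per day

for copies in (4, 3, 2):
    print(copies, "copies ->", workunits_completed(pool, copies), "workunits/day")
```

With these numbers, going from four copies to two doubles the distinct workunits finished per day, at the cost of longer waits whenever one of the two copies goes missing.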

It is true that most non-BOINC DC projects do things differently - differently from BOINC and differently from each other. That is the luxury of writing your own code. But then the disadvantage is you need to actually write all that extra code - or the project staff do.

R~~

[B^S] Gamma^Ray
Joined: Sep 18, 2006
Posts: 5
ID: 1381
Credit: 10,135
RAC: 331
Message 815 - Posted 8 Oct 2006 18:39:56 UTC
Last modified: 8 Oct 2006 18:40:32 UTC

Great explanation! And yeah, I was talking about BOINC projects. So I guess the issue then becomes a matter of: do you send a bunch of new work units out to be crunched, knowing it will take longer for them to be sent a second time, since the majority of work units available are new ones, not reissued ones, thus causing the delay in credits granted? Or do you send fewer new ones out daily and reissue the older ones a second time before you move on to the next batch of new work units? That would mean quicker credit granted for users, but not as many "new or first time" WUs being crunched for the project. If I understand that correctly, it does make sense now. At least to me.

Thanks
G^R
____________

m.somers
Joined: Nov 14, 2005
Posts: 662
ID: 1
Credit: 1,417,572
RAC: 2
Message 824 - Posted 10 Oct 2006 8:11:57 UTC

First of all, River~~ thanx for the explanation ;-).

Second of all, credits are being credited and pending WU's should now gradually go down again...

The thing is, I *try* to make sure there is always 10% of unassigned WU's in the shmem tables for hosts connecting. The scheduler here tries to find assigned WU's for your hardware first in the shmem table, before accepting new unassigned ones. However, because the odd WU's are on fwnc7128 in shmem table and the even WU's on fwnc7129 in the shmem table, and the DNS should round robin both machines, it might be that you are on the wrong machine by accident. Try again and you might get the matching WU if you are allowed to process it already ;-)...
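A small sketch of the odd/even split described above, for readers who want to see why "try again" works about half the time. The server names come from the post; everything else is an assumption for illustration: that the split follows the parity of a numeric WU number, and that DNS round robin behaves like a fair coin flip on each request.

```python
import random

# Odd WUs on fwnc7128, even WUs on fwnc7129 (as stated in the post).
SERVERS = {1: "fwnc7128", 0: "fwnc7129"}

def server_holding(wu_id):
    """Server whose shmem table would hold this workunit."""
    return SERVERS[wu_id % 2]

def round_robin_pick(rng):
    """Stand-in for DNS round robin: either server with equal probability."""
    return rng.choice(sorted(SERVERS.values()))

def matching_wu_visible(wu_id, rng):
    """True if this request happens to land on the server holding the WU."""
    return round_robin_pick(rng) == server_holding(wu_id)

rng = random.Random(0)
tries = 10000
hits = sum(matching_wu_visible(12345, rng) for _ in range(tries))
print(hits / tries)  # close to 0.5 under these assumptions
```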

More info on the (very nasty) details can be read on:

http://boinc.gorlaeus.net/forum_thread.php?id=108
http://boinc.gorlaeus.net/forum_thread.php?id=85
http://boinc.gorlaeus.net/forum_thread.php?id=78
http://boinc.gorlaeus.net/forum_thread.php?id=71

m.

____________
M.F. Somers

[B^S] Gamma^Ray
Joined: Sep 18, 2006
Posts: 5
ID: 1381
Credit: 10,135
RAC: 331
Message 831 - Posted 10 Oct 2006 19:12:25 UTC

No worries mate! I was just curious as it seemed different than the other projects I had worked on. Now, thanks to River~~'s explanation, it all makes sense :-) (mostly anyway, hehe). The only problem I have now is the occasional "No Work From Project" one. :-((

Regards,
G^R
____________

Odysseus
Joined: Sep 22, 2006
Posts: 26
ID: 1458
Credit: 56,637
RAC: 23
Message 833 - Posted 11 Oct 2006 1:49:32 UTC - in response to Message ID 774.

I see you are one of the few lucky/un-lucky ones with a Mac. Your machine will get work eventually, but you are really outnumbered by thousands of other machines running Windows...

OK, understood in principle. But how do you explain the following?

Today I attached my three old G4 Macs to the project. They all show up in my account, but they're getting nothing but "no work from project" messages. Yet when I look at my G5's pending results, a large proportion of them show "Unsent" partners. These should be destined for other Macs, but they're apparently unavailable to my G4s. Why? The situation seems paradoxical to me.
____________

m.somers
Joined: Nov 14, 2005
Posts: 662
ID: 1
Credit: 1,417,572
RAC: 2
Message 835 - Posted 11 Oct 2006 6:59:22 UTC

Okay, to answer this... please read

http://boinc.gorlaeus.net/forum_thread.php?id=71#413

first...

The dynamic scheduling usually recomputes every hour which hosts might contact the server in the next two to three hours. If new types of hardware are present, the number of requests from them must first increase for a certain amount of time before the dynamic scheduling treats them as significant and actually schedules more work for them. The 10% unassigned WU's are only present if there is not too much other outstanding work. Usually these WU's are gone pretty fast too... Perhaps I should tweak the scheduler to increase the 10% level; however, I'm reluctant to do so, as it might delay the 'validating WU's' being scheduled and handled... again causing much pending of credits... To put it in other words, sometimes the old work needs to be done before new work is done ;-).


m.
____________
M.F. Somers

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 837 - Posted 11 Oct 2006 7:02:55 UTC - in response to Message ID 833.
Last modified: 11 Oct 2006 7:03:59 UTC

I see you are one of the few lucky/un-lucky ones with a Mac. Your machine will get work eventually, but you are really outnumbered by thousands of other machines running Windows...

OK, understood in principle. But how do you explain the following?

Today I attached my three old G4 Macs to the project. They all show up in my account, but they're getting nothing but "no work from project" messages. Yet when I look at my G5's pending results, a large proportion of them show "Unsent" partners. These should be destined for other Macs, but they're apparently unavailable to my G4s. Why? The situation seems paradoxical to me.


One explanation is the odd/even effect that Mark mentioned.

If your G5 pending results are all "odd", and by chance your G4s are always connecting to the server which has the "even" numbered work, you'd get none.

Another explanation is that the server recognises that G4 and G5 machines are different. This depends how tightly Mark has set the homogeneous redundancy. I know that on LHC (famous as the project with the most finicky validator) you need the hardware as well as the OS to match. For some apps (like LHC) the small differences between so-called identical processors can make a difference.

Even in that case, your G5 machines should be able to pick up the pending work for your G5 machines. Even if you have only one G5, the fact that this project currently allows self-validation means that if your G5 connects to the right server it should eventually get work.

So I am guessing that a third part of the answer is that not all the pending work is in the shmem on the servers? In which case a Mac user has to get lucky enough to connect just when there is Mac work in shmem?

If this is so, then Mark might like to tweak the code to ensure that shmem contains at least some work for each platform. I don't know how big a job this would be, but it might be worth doing at some time. Especially as rumour has it that Mark himself has a Mac ;-)

River~~

Odysseus
Joined: Sep 22, 2006
Posts: 26
ID: 1458
Credit: 56,637
RAC: 23
Message 838 - Posted 11 Oct 2006 8:24:59 UTC - in response to Message ID 837.

Another explanation is that the server recognises that G4 and G5 machines are different. This depends how tightly Mark has set the homogeneous redundancy.

I don't see how that can be, unless something has changed in the past few days (it's been a week since I received any work). In most of the validated results I looked at, my G5 was paired with a G4.
____________

Odysseus
Joined: Sep 22, 2006
Posts: 26
ID: 1458
Credit: 56,637
RAC: 23
Message 839 - Posted 11 Oct 2006 8:46:54 UTC - in response to Message ID 835.
Last modified: 11 Oct 2006 8:50:17 UTC

The dynamic scheduling usually recomputes every hour which hosts might contact the server in the next two to three hours. If new types of hardware are present, the number of requests from them must first increase for a certain amount of time before the dynamic scheduling treats them as significant and actually schedules more work for them.

Thanks for the background. I suppose the system could be characterized by the proverb, "The squeaky wheel gets the grease." My hosts won't be 'squeaking' much in a while, though: after having numerous requests for work declined, BOINC waits longer and longer before attempting another connection. AIUI the requests will continue to come further and further apart, making it less and less likely that they'll be successful -- I suppose that eventually the systems will be deemed "inactive" ...
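To illustrate how quickly the gaps between requests can grow, here is a sketch of a plain exponential backoff. This is not BOINC's actual formula: the 60-second starting delay, the doubling factor and the four-hour cap are all assumptions chosen for the example, and `backoff_delays` is a hypothetical name.

```python
def backoff_delays(failures, base_seconds=60, factor=2, cap_seconds=4 * 3600):
    """Delay before each retry after consecutive failed work requests.
    All parameter values are illustrative assumptions."""
    delays = []
    delay = base_seconds
    for _ in range(failures):
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays

print(backoff_delays(6))  # [60, 120, 240, 480, 960, 1920]
```

After only a handful of refusals the host is asking far less often, which in turn makes it look less active to a scheduler that weights hosts by recent requests.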

(BTW, regarding 'heterogeneously validating' projects, I'm pretty sure that both SETI@home and Einstein@home do a lot of floating-point arithmetic: trigonometry, Fourier transforms, &c. And I believe that although SzTAKI Desktop Grid is working with matrices of integers, it uses FP registers for at least some of the processing, likely to avoid having to break very large integers into, say, 32-bit pieces.)

P.S. Any ideas what might be causing the numerous "MacOS Error -43 occured in Mac_Lib.c line 64" messages I see in the output files for my results? They seem to be fairly harmless, but they may indicate a lurking bug of some kind ...
____________

Christian Diepold
Joined: Sep 16, 2006
Posts: 20
ID: 1321
Credit: 100,331
RAC: 414
Message 840 - Posted 11 Oct 2006 13:01:23 UTC

@ Odysseus:

If you do a manual "refresh" with Leiden, the intervals for contacting the server will start from 1 minute again. :-)
____________

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 842 - Posted 11 Oct 2006 19:41:13 UTC - in response to Message ID 838.

Another explanation is that the server recognises that G4 and G5 machines are different. This depends how tightly Mark has set the homogeneous redundancy.

I don't see how that can be, unless something has changed in the past few days (it's been a week since I received any work). In most of the validated results I looked at, my G5 was paired with a G4.

You are right - that completely eliminates this possibility

R~~

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 852 - Posted 13 Oct 2006 19:39:54 UTC - in response to Message ID 824.
Last modified: 13 Oct 2006 20:02:40 UTC

However, because the odd WU's are on fwnc7128 in shmem table and the even WU's on fwnc7129 in the shmem table, and the DNS should round robin both machines, it might be that you are on the wrong machine by accident.


Ouch!!!!

I have just realised what may be going wrong here. Or then again, my mind might have suddenly gone off on a wild goose chase...

If I remember rightly, the current boinc clients cache the resolved hostnames. I can't remember where I saw that info, but I believe I have seen it in the last month or so on a BOINC related board somewhere.

If I am right then this means that the round-robin effect is not on a per-update basis as you intend Mark, but only potentially changes on the client being restarted.

Users can test this - try several updates on LC without re-starting the boinc client when you are getting this message. Then try several cycles of (restart, update). If I am right then the latter approach will work about half the time.

If I am right, the real fix is needed in the boinc client to stop this happening - there should either be no caching of DNS info or time-limited caching (time limited to the TTL in the domain lookup).
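As a sketch of what time-limited caching could look like (this is not the BOINC client's code, just an illustration): the class `TtlDnsCache`, the resolver callable returning an `(ip, ttl_seconds)` pair, and the example addresses are all hypothetical.

```python
import time

class TtlDnsCache:
    """Cache resolved addresses, but only for as long as the TTL allows."""

    def __init__(self, resolver, clock=time.monotonic):
        self._resolver = resolver  # callable: hostname -> (ip, ttl_seconds)
        self._clock = clock
        self._entries = {}         # hostname -> (ip, expires_at)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]        # still fresh: serve from cache
        ip, ttl = self._resolver(hostname)
        self._entries[hostname] = (ip, now + ttl)
        return ip

# Demo with a fake resolver that alternates between two made-up addresses
# and a fake clock, so no real DNS traffic is involved.
answers = iter(["192.0.2.1", "192.0.2.2"])
fake_now = [0.0]
cache = TtlDnsCache(lambda host: (next(answers), 300), clock=lambda: fake_now[0])

print(cache.lookup("boinc.gorlaeus.net"))  # 192.0.2.1 (fresh lookup)
print(cache.lookup("boinc.gorlaeus.net"))  # 192.0.2.1 (served from cache)
fake_now[0] = 301.0
print(cache.lookup("boinc.gorlaeus.net"))  # 192.0.2.2 (TTL expired, resolved again)
```

With a cache like this, a long-running client would pick up a different round-robin answer once the TTL runs out, without needing a restart.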

It is just a thought, and apologies if I am off track on this.

River~~

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 853 - Posted 13 Oct 2006 21:20:16 UTC - in response to Message ID 852.

If I remember rightly, the current boinc clients cache the resolved hostnames. I can't remember where I saw that info, but I believe I have seen it in the last month or so on a BOINC related board somewhere.

No, they cache the IP address, see the Leiden homepage news section, Sep 11. Round robin shouldn't be a problem; other BOINC projects use a similar setup.

Now, can we get this thread back on topic please?
I started this thread 2 weeks ago because my number of pending credits was getting sky-high. Meanwhile I have about 11000 wu's pending. That's the number of wu's, not the amount of claimed credit.
I switched all Windows clients away from this project; only the Linux boxes still crunch LC. They normally output about 2000 credits per day, and that's about what I'm crunching the last days according to various stats sites. Also when I check some of the wu pages, the recent ones get credit, the old (pending) ones don't. Which means that these 11000 pending wu's have already been pending for a VERY long time. I'm most certainly not alone (but I do (did) crunch a lot for LC so the effect is bigger); the global effect is very noticeable on the graphs at http://www.boincstats.com/stats/project_graph.php?pr=leiden

On October 4th you wrote:
Don't know... Seems that of the 150k WU's we still have 30k left... your pending result should start getting un-pended pretty soon ;-)...

It's now October 13th. They're still pending.
So ... what is "pretty soon" ?

____________

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 854 - Posted 13 Oct 2006 22:03:18 UTC - in response to Message ID 853.
Last modified: 13 Oct 2006 22:18:36 UTC

If I remember rightly, the current boinc clients cache the resolved hostnames. I can't remember where I saw that info, but I believe I have seen it in the last month or so on a BOINC related board somewhere.

No, they cache the IP address, see the Leiden homepage news section, Sep 11. Round robin shouldn't be a problem; other BOINC projects use a similar setup.


That is exactly what I mean. Hostnames resolve to an IP, and the client caches the resolved name, which is the IP. Thanks for pointing me to the reference, I knew I'd seen it.

What it means in plain language is that which server you access only changes when the client is restarted. For many users this will only be at boot up. Therefore Mark's suggestion to "try again" would work if the client was restarted, but not if the user simply clicks the update button.

This is just like the way that when the IP addresses changed in September, clients only changed which server they talked to when they were restarted. If a client does not follow a changed IP address when it changes in the DNS, it will not be going back and re-resolving the hostname and getting the advantage of changing between servers.

Yes, other projects use Round Robin DNS, but as far as I know, other projects have not split up work between the two (or more) servers in a way that affects quorum building. It is the combination of this way of dividing the work together with the RR DNS that I was suggesting might be behind the observed build up, not the RR DNS on its own.
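The combination can be sketched as a toy simulation. Everything here is an assumption made for illustration: two servers split by WU parity, a client that either keeps its first resolved server for the whole session or gets a fresh random answer on every request, and the hypothetical function `reachable_workunits`.

```python
import random

def reachable_workunits(wu_ids, requests, re_resolve, rng):
    """Workunits whose server the client contacted at least once.
    Server 0 holds even-numbered WUs, server 1 holds odd-numbered WUs."""
    server = rng.randrange(2)          # address resolved at client start
    seen_servers = set()
    for _ in range(requests):
        if re_resolve:
            server = rng.randrange(2)  # fresh DNS answer on every request
        seen_servers.add(server)
    return {wu for wu in wu_ids if wu % 2 in seen_servers}

rng = random.Random(1)
wus = range(100)
pinned = reachable_workunits(wus, 20, re_resolve=False, rng=rng)
fresh = reachable_workunits(wus, 20, re_resolve=True, rng=rng)

print(len(pinned))  # 50: a pinned client only ever reaches one parity
print(len(fresh))   # almost always 100 when the address is re-resolved
```

In this toy model a pinned client can never fetch the partner results held on the other server, however often the user clicks update, which matches the build-up being described.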

You suggest this is off topic - it was an attempted diagnosis and intended to be relevant & helpful in addressing the issue you raised. I am sorry I did not say enough to make the relevance clear.

R~~

Angus
Joined: Feb 27, 2006
Posts: 14
ID: 507
Credit: 7,237
RAC: 37
Message 855 - Posted 14 Oct 2006 2:03:44 UTC
Last modified: 14 Oct 2006 2:30:04 UTC

I still think also that this thread has gotten off-topic.

I have a raft of results that are 'pending'. In every case, going back at least 3 weeks, the result has only been sent to one PC - mine. The second result in the workunit is still un-sent.

This is a Windows 2000, AMD XP class machine - hardly a rarity so homogeneous redundancy shouldn't be the issue (unless it's broken!).

Here is one workunit that has been sent only once and is well past deadline.

So - why are these not being sent to at least 2 PCs?

I'm going NNW until this is resolved. There's plenty of other projects that don't have trouble sending out enough results for a quorum and granting credit in a timely fashion.
____________

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 859 - Posted 14 Oct 2006 8:36:03 UTC

Maybe some more examples.
wu_164284800_1159171435_2759 created 25 Sep 2006 8:03:58 UTC
wu_898976128_1159171435_3869 created 25 Sep 2006 8:03:58 UTC
wu_78596477_1159171435_46887 created 25 Sep 2006 8:07:10 UTC




____________

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 860 - Posted 14 Oct 2006 9:02:34 UTC - in response to Message ID 855.

I still think also that this thread has gotten off-topic.


OK fair comment.

From a user perspective there are two distinct issues, one is that work is being issued from "new" WU before all half-complete WU have issued their tasks, in turn leading to a huge build up of pending work. This is the subject of this thread.
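A toy model of the first issue, in Python, may make the build-up easier to see. It is a sketch under simplifying assumptions and not the project's real scheduler: the function `simulate` and the policy names `new_first` and `oldest_first` are invented here, the quorum is taken to be 2, and every request is assumed to come from a different host that returns its result straight away.

```python
from typing import List


def simulate(policy: str, requests: int, quorum: int = 2) -> int:
    """Return how many returned results are still pending after `requests` requests.

    A result is "pending" while its workunit has fewer than `quorum` results.
    """
    issued: List[int] = []  # issued[i] = results sent out so far for workunit i
    for _ in range(requests):
        incomplete = [i for i, n in enumerate(issued) if n < quorum]
        if policy == "oldest_first" and incomplete:
            # Finish the oldest half-complete workunit before starting another.
            issued[incomplete[0]] += 1
        else:
            # "new_first" (or nothing half-complete): open a brand-new workunit.
            issued.append(1)
    return sum(n for n in issued if n < quorum)
```

Under these assumptions, `simulate("new_first", 10)` leaves all ten results pending, whereas `simulate("oldest_first", 10)` leaves none, because every workunit reaches its quorum before a new one is opened.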

There is also a second issue, that there is no work supplied to win boxes even when there is clearly work for win on at least one of the servers. There is a "no work for windows" thread for that issue, started by Clark about a week ago.

I still think that Mark is likely correct, that these two issues will turn out to be symptoms of the same underlying cause. However, while none of us know for sure, I'll post further comment on the "no work" issue on the other thread.

I apologise for not thinking of doing that earlier.

Also if anyone wants to comment on my Round Robin point, please follow me over to Clark's "no work" thread to do so.

R~~

KSMarksPsych
Joined: Feb 15, 2006
Posts: 40
ID: 153
Credit: 5,050
RAC: 15
Message 864 - Posted 15 Oct 2006 0:49:27 UTC - in response to Message ID 860.

Isn't Spinhenge having the same problem with unsent results piling up? I think it came down to the server code they were running. River~~, I know you read the projects mailing list. Am I remembering right?
____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 867 - Posted 15 Oct 2006 16:44:33 UTC - in response to Message ID 864.
Last modified: 15 Oct 2006 16:47:19 UTC

River~~, I know you read the projects mailing list. Am I remembering right?


I'm flattered that you remember me :) but sadly you are out of date.

Due to personal stuff I am doing very little at present, just following the Rosetta, LHC & LC boards, and no mailing lists (BOINC or others).

So I wouldn't know if Spinhenge are having the same problem. It would not surprise me though. The standards are clear: any server or any program that keeps a resolved hostname -> IP mapping for an extended time is expected to refresh the lookup, not keep it for ever. Round robin and server moves are the two classic scenarios where extended caching causes problems.

If you are currently on the mailing lists, maybe you'd like to pass this comment on (if it has not been made there already, of course).

R~~

KSMarksPsych
Joined: Feb 15, 2006
Posts: 40
ID: 153
Credit: 5,050
RAC: 15
Message 871 - Posted 16 Oct 2006 2:42:31 UTC - in response to Message ID 867.

[offtopic]Hi R~~

I read but don't post on the lists. Too shy and too dumb overall.

Hope things are going ok for you.[/offtopic]

On topic. Spinhenge is having problems with results not getting sent out, but I think it's for a totally different reason. They were trying to set a flag or something to prioritize results to go out. Turns out they were building their server software from the Stable CVS rather than Head. (What that means is Greek to me, though.)



Kathryn
____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

BobCat13
Joined: Oct 11, 2006
Posts: 5
ID: 1694
Credit: 173,598
RAC: 322
Message 885 - Posted 17 Oct 2006 15:42:07 UTC - in response to Message ID 752.

How long should it take for a WU to validate after both results have been returned? The following WU has been pending for over 100 hours since the second result was returned.

http://boinc.gorlaeus.net/workunit.php?wuid=1516252

____________

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 888 - Posted 18 Oct 2006 4:06:35 UTC

OK, 100 hours might seem long. Anyhow, I see on the WU page you linked to that credit has been granted.
Lucky you ... I'm still sitting on 10K+ WUs pending ... no, I'm NOT talking about claimed credit, it's the **number** of WUs, to be exact 10768 at the moment ... the oldest being from way back in September.

____________

Copyright © 2018 Leiden University - Leiden Institute of Chemistry - Theoretical Chemistry Department