Problem with longer WUs: Maximum disk usage exceeded


Christian Diepold
Joined: Sep 16, 2006
Posts: 20
ID: 1321
Credit: 100,331
RAC: 414
Message 1056 - Posted 9 Nov 2006 7:38:50 UTC
Last modified: 9 Nov 2006 7:56:21 UTC

Since the WUs got longer to crunch, I'm experiencing random crashes of WUs, on all of my machines, with the same error message: Maximum disk usage exceeded

I checked these WUs and it looks like a WU problem, not a problem with a particular PC:


WU 1

WU 2

WU 3

WU 4

WU 5



These are just 5 examples. In my current results I can find about 20 of these WUs.


What gives?



PS: You see, that's why keeping the results in the DB for a longer period of time is a good idea: debugging!!! ;-)
____________

m.somers
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1058 - Posted 9 Nov 2006 8:55:59 UTC

Well, you are absolutely right. I found a stupid typo in one of my template files in setting the disk usage bound for these jobs.

This typo has been fixed and the DB has been adapted so that results are sent out with the correct bound from now on. All new WUs named "wu_80398718_*" that are downloaded from this moment on should no longer be affected. The ones sent out between Monday 6 November, 10:00 am, and now, Thursday 9 November, 10:00 am (there are precisely 1327 of them ;-), are affected and will probably crash.

The hosts running these WUs will get credit for them, but please be patient, because there is probably a lot of manual work involved in that...

So, just cancel the "wu_80398718_*" jobs from before the 9th of November...

Sorry 'bout that... grrgrgrgrg stupid vi...


m.
____________
M.F. Somers
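
For readers unfamiliar with how such a bound is set, here is a minimal sketch of a pre-flight check that could catch a bound that is too small before jobs go out. It assumes the affected setting is BOINC's `rsc_disk_bound` element (a per-job disk limit in bytes) in a workunit template. The post above does not name the element or show the template, so the fragment, the helper `check_disk_bound`, and the 100 MB figure are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical template fragment; the real template is not shown in the thread.
TEMPLATE = """
<workunit>
    <rsc_disk_bound>1e7</rsc_disk_bound>
</workunit>
"""

def check_disk_bound(template_text, expected_bytes):
    """Return True if the template's disk bound covers the expected usage."""
    # Wrap in a dummy root so templates with several top-level elements parse.
    root = ET.fromstring("<root>" + template_text + "</root>")
    node = root.find("./workunit/rsc_disk_bound")
    if node is None or node.text is None:
        raise ValueError("template has no rsc_disk_bound element")
    bound = float(node.text)  # float() also accepts forms such as 1e7
    return bound >= expected_bytes

# Assumed expected usage of 100 MB per job (illustrative figure only).
ok = check_disk_bound(TEMPLATE, 100e6)
print("bound is large enough" if ok else "bound is too small")
```

With the assumed numbers, the 10 MB bound is below the 100 MB expectation, so the check reports the bound as too small.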

Jordan Bashir
Joined: Apr 19, 2006
Posts: 10
ID: 1138
Credit: 154,747
RAC: 4
Message 1061 - Posted 9 Nov 2006 10:01:56 UTC - in response to Message ID 1058.
Last modified: 9 Nov 2006 10:02:24 UTC

I have not seen "wu_80398718_*" on my work list, and I got work today (Nov. 9). Is this only on Windows PCs?

m.somers
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1064 - Posted 9 Nov 2006 12:51:16 UTC

10 credits have been granted for each job that a user has already run and crashed, or for each job that will crash, whether they already returned the result or not.

This affects only the 1327 jobs that were sent out within the 3-day period, with the names "wu_80398718_*".


m.
____________
M.F. Somers

Christian Diepold
Joined: Sep 16, 2006
Posts: 20
ID: 1321
Credit: 100,331
RAC: 414
Message 1068 - Posted 10 Nov 2006 21:16:26 UTC

Hehe, a typo. Oh well, we're all human, aren't we?

Will these 10 credits show up on the "results page" of a user, or on the "work unit" page, or will they be added manually without any further indication?
____________

aad
Joined: Feb 14, 2006
Posts: 35
ID: 134
Credit: 17,679,964
RAC: 11,243
Message 1069 - Posted 10 Nov 2006 22:35:32 UTC - in response to Message ID 1058.
Last modified: 10 Nov 2006 22:37:13 UTC


So, just cancel the "wu_80398718_*" from before the 9th of november...

m.


Mark,
By not reading your announcement correctly, I (and lots of other people) keep canceling the jobs beginning with "wu_80398718_".
You had better make a brief note that this behaviour will keep these jobs in the database for ever and ever and ever.......
Sorry for that, but please make the announcement clearer so that even I can understand it the first time. ;-))
http://boinc.gorlaeus.net/workunit.php?wuid=1554013
____________

m.somers
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1070 - Posted 11 Nov 2006 9:58:27 UTC

Hello aad,

The WUs should stay in the DB. Their values in the DB have been updated, so when a host requests a result for one of these WUs, it gets the new settings with the higher disk bound. So just cancel the results you received before the 9th. The ones you get afterwards should run fine, even though they belong to a WU that crashed previously. In other words, cancel only once. I'll put a post on the main page...

m.


____________
M.F. Somers

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 1071 - Posted 11 Nov 2006 12:18:47 UTC - in response to Message ID 1069.


So, just cancel the "wu_80398718_*" from before the 9th of november...

m.


Mark,
By not reading your announcement correctly, I (and lots of other people) keep canceling the jobs beginning with "wu_80398718_".
You had better make a brief note that this behaviour will keep these jobs in the database for ever and ever and ever.......


Not quite for ever.

There is a max number of errors allowed for a given WU, and a cancelled job counts as one error towards that max.

But I agree with your main point, the more explicit it is that only the earlier jobs need to be cancelled, the fewer people will make that mistake, and the less db space will be wasted.

So if it is issued 10th Nov or after, let it run.

Only abort if the name begins with "wu_80398718_" AND it was issued before 9th Nov.

R~~
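
The abort rule in the post above can be written as a small predicate. This is only a sketch: the function name `should_abort`, the availability of a send time as a `datetime`, and the cutoff at the start of 9 Nov 2006 are assumptions. The administrator's post mentions 10:00 am on the 9th, and the thread gives no time zone, so the cutoff is left as a parameter.

```python
from datetime import datetime

AFFECTED_PREFIX = "wu_80398718_"
# Assumed cutoff: start of 9 Nov 2006. The thread does not give a time zone.
CUTOFF = datetime(2006, 11, 9)

def should_abort(job_name, sent_at, cutoff=CUTOFF):
    """True only for affected jobs that were sent before the cutoff."""
    return job_name.startswith(AFFECTED_PREFIX) and sent_at < cutoff

# Hypothetical job names and send times, for illustration only.
print(should_abort("wu_80398718_0001", datetime(2006, 11, 7, 12, 0)))   # True
print(should_abort("wu_80398718_0001", datetime(2006, 11, 10, 12, 0)))  # False
print(should_abort("wu_12345678_0001", datetime(2006, 11, 7, 12, 0)))   # False
```

Both conditions must hold, so a job with the affected prefix that was sent after the fix is left to run.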

River~~
Joined: Oct 4, 2006
Posts: 76
ID: 1629
Credit: 17,661
RAC: 37
Message 1072 - Posted 11 Nov 2006 12:28:12 UTC - in response to Message ID 1064.

10 credits have been granted for each job that a user has already run and crashed, or for each job that will crash, whether they already returned the result or not..

This affects only the 1327 jobs that were sent out within the 3 day period, with the names "wu_80398718_*".


m.


If this happens again, I would suggest granting the user's claimed credit up to a maximum of 10, rather than a flat 10. The SQL is only slightly more complicated.

Users who abort the task before it starts should not expect any credit (and users of goodwill will abort it as they'd prefer to get credit for doing useful work).

Likewise users who abort the run while it is in progress deserve credit for what is crunched in good faith, but not (imo) for the uncrunched parts.

However, please don't change what you are doing this time: having promised 10 cobblestones per job, if you don't deliver then someone will be upset.

R~~
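
A sketch of the capped grant suggested above, assuming each result carries a claimed-credit number and that a job aborted before it started claims zero. The function name `granted_credit` and the sample claims are hypothetical; the thread does not show the project's actual SQL or credit data.

```python
def granted_credit(claimed, cap=10.0):
    """Grant what the host claimed, but never more than the cap or less than 0."""
    return max(0.0, min(claimed, cap))

# Hypothetical claims: aborted before start, aborted part-way, ran until crash.
for claimed in (0.0, 3.5, 42.0):
    print(claimed, "->", granted_credit(claimed))
```

Compared with a flat grant, this gives nothing for work that never started and partial credit for work that was aborted part-way.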

Copyright © 2017 Leiden University - Leiden Institute of Chemistry - Theoretical Chemistry Department