Max CPU time exceeded


[BAT] tutta55
Joined: Feb 26, 2006
Posts: 104
ID: 485
Credit: 4,072,468
RAC: 293
Message 1229 - Posted 11 Jan 2007 11:25:46 UTC
Last modified: 11 Jan 2007 11:26:24 UTC

I have a number of work units where the max CPU time is exceeded. It happens on different machines. See

http://boinc.gorlaeus.net/result.php?resultid=4569681
http://boinc.gorlaeus.net/result.php?resultid=4568534
http://boinc.gorlaeus.net/result.php?resultid=4568302
http://boinc.gorlaeus.net/result.php?resultid=4567176

Is this caused by WU getting stuck, or is the limit set too low? In any case, this is a whole lot of computing gone to waste :(

____________

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

Trog Dog
Joined: Sep 16, 2006
Posts: 15
ID: 1306
Credit: 20,711
RAC: 0
Message 1230 - Posted 11 Jan 2007 13:22:44 UTC

Why is your core2 consistently taking 8 hours to complete a wu?

And the 4th result is crunched using 5.5.0
____________

[BAT] tutta55
Message 1231 - Posted 11 Jan 2007 13:50:04 UTC - in response to Message ID 1230.

Why is your core2 consistently taking 8 hours to complete a wu?


Well..... that's what I'm wondering about too. It must be a pretty recent change in the application or work units, because I haven't seen it before. But it's not taking 8 hours consistently, only for these 3 work units until now.


And the 4th result is crunched using 5.5.0


Sorry 'bout that. It's running in a virtual machine I haven't updated yet <blush>. But that isn't the cause of this error.

____________

[BAT] tutta55
Message 1232 - Posted 11 Jan 2007 14:52:54 UTC
Last modified: 11 Jan 2007 14:57:53 UTC

I got another one running right now, on the Pentium D VM: http://boinc.gorlaeus.net/workunit.php?wuid=1756823

The work unit returned invalid for 2 other machines before. In my case it is running for over 5 hours now. It is not stuck. It progresses at about 0.001 % every 2-3 seconds. It seems to happen once the WU passes 70% ready. It is now at 72.766. At this rate it will take over 10 hours from now to finish, and surely exceed max CPU again before that. I'm going to abort it now and hope it isn't sent to me a second time.

EDIT: I aborted the WU. But before that I tried quitting Boinc and seeing if the WU resumes. It doesn't. It restarts at 0%, but leaves the CPU time intact (which was at 5 hours)
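
The "over 10 hours" estimate above follows from simple arithmetic on the figures in this post (72.766% done, 0.001% gained every 2-3 seconds). A minimal sketch of that calculation; `remaining_hours` is a hypothetical helper for illustration, not part of BOINC:

```cpp
#include <cassert>

// Hypothetical helper (not a BOINC API): hours left for a task that gains
// `step_percent` of progress every `seconds_per_step` seconds.
double remaining_hours(double percent_done, double step_percent,
                       double seconds_per_step) {
    assert(step_percent > 0.0);
    assert(percent_done >= 0.0 && percent_done <= 100.0);
    const double steps_left = (100.0 - percent_done) / step_percent;
    return steps_left * seconds_per_step / 3600.0;
}
```

With 72.766% done and 0.001% per step, this gives about 15 hours at 2 seconds per step and about 23 hours at 3 seconds per step, so "over 10 hours" is if anything on the low side.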
____________

[BAT] tutta55
Message 1233 - Posted 11 Jan 2007 17:15:33 UTC

And 1 more http://boinc.gorlaeus.net/workunit.php?wuid=1753303

Patience is running a bit thin now. A reaction from the project devs, please.
____________

[BAT]Krikke
Joined: Feb 27, 2006
Posts: 3
ID: 504
Credit: 392,941
RAC: 383
Message 1234 - Posted 11 Jan 2007 18:04:56 UTC

Hopefully this gets solved quickly, because this would be a serious waste of resources.
____________

m.somers
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1237 - Posted 12 Jan 2007 8:25:07 UTC

If you look more closely... First things first... How come these 'invalidated' results claim 10-fold more credit than the validated ones?!

m.
____________
M.F. Somers

[BAT] tutta55
Message 1239 - Posted 12 Jan 2007 8:34:53 UTC - in response to Message ID 1237.
Last modified: 12 Jan 2007 9:15:03 UTC

If you look more closely... First things first... How come these 'invalidated' results claim 10-fold more credit than the validated ones?!

m.


That would be because they have been running for more than 8 hours (in the case of my Core 2 Duo), where they normally take less than an hour to finish. The amount claimed by the C2D is usually 10-15% higher in this project than what it normally gets. So it is easy to see that besides the amount of lost work, I also missed out on more than 1000 credits due to this bug. Since it happens on more than 1 of my machines (and also to other people, see the Leiden Classical section), I can but conclude it is a bug.
____________

m.somers
Message 1241 - Posted 12 Jan 2007 8:45:04 UTC

See http://boinc.gorlaeus.net/forum_thread.php?id=196

Your host is doing fine now, as far as I can see.

m.
____________

[BAT] tutta55
Message 1244 - Posted 12 Jan 2007 8:59:47 UTC - in response to Message ID 1241.

See http://boinc.gorlaeus.net/forum_thread.php?id=196

Your host is doing fine now, as far as I can see.

m.


Mark, I doubt it has anything to do with the host itself. I have it also on my Pentium D. And I noticed that yesterday on my old Pentium III a work unit ran for a whopping 116000 seconds without erroring out on max CPU, but was still found invalid.
____________

phucco
Joined: Jan 2, 2007
Posts: 2
ID: 2716
Credit: 6,538
RAC: 34
Message 1245 - Posted 12 Jan 2007 13:03:19 UTC

http://boinc.gorlaeus.net/workunit.php?wuid=1752927

I've had to abort it as it was going nowhere, maybe someone could check it's not a dud?

Trog Dog
Message 1246 - Posted 12 Jan 2007 13:42:22 UTC - in response to Message ID 1231.

Why is your core2 consistently taking 8 hours to complete a wu?


Well..... that's what I'm wondering about too. It must be a pretty recent change in the application or work units, because I haven't seen it before. But it's not taking 8 hours consistently, only for these 3 work units until now.


And the 4th result is crunched using 5.5.0


Sorry 'bout that. It's running in a virtual machine I haven't updated yet <blush>. But that isn't the cause of this error.


Wouldn't be so sure about 5.5.0 not being the cause - this is also reported at QMC. 5.5.0 messes with the times etc, and this interacts with fpops estimate for the work unit which causes the max cpu time exceeded.
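
For readers unfamiliar with the mechanism referred to here: as far as I understand BOINC, each work unit carries an upper bound on floating-point operations (`rsc_fpops_bound`), and the client derives the CPU-time limit by dividing that bound by the host's benchmarked speed. A minimal sketch under that assumption; the function name and the numbers are purely illustrative and are not taken from this project's settings:

```cpp
#include <cassert>

// Assumption (labeled, not project code): CPU-time limit in seconds is the
// work unit's FLOP bound divided by the host's benchmarked FLOPS.
double max_cpu_seconds(double fpops_bound, double benchmark_flops) {
    assert(benchmark_flops > 0.0);
    return fpops_bound / benchmark_flops;
}
```

Under this model, a bound of 2.88e13 operations on a host benchmarked at 1e9 FLOPS gives an 8-hour limit, and a client that reports twice the benchmark halves that limit. A distorted benchmark could therefore trigger the error early, but it would not by itself make a task run longer.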
____________

[BAT] tutta55
Message 1247 - Posted 12 Jan 2007 13:49:20 UTC - in response to Message ID 1246.
Last modified: 12 Jan 2007 13:49:45 UTC

Wouldn't be so sure about 5.5.0 not being the cause - this is also reported at QMC. 5.5.0 messes with the times etc, and this interacts with fpops estimate for the work unit which causes the max cpu time exceeded.


Sorry, but this is completely beside the question. 5.5.0 may have an influence on the max CPU time calculation, but it doesn't cause the work unit to run 10 times longer than it should.

This is not a thread to be used for 5.5.0 bashing. It tries to address a problem that I have, and also a number of other people. Besides, my Core 2 Duo runs the standard 5.4.11.
____________

m.somers
Message 1248 - Posted 12 Jan 2007 14:09:43 UTC

Of the more than 5000 very similar WUs, only 85 reported back in the last week with this "CPU time limit exceeded" error. The rest validate normally, as is to be expected with homogeneous redundancy.

I'll run the WU myself here on my box to test even further... If it really bugs you, just abort it... But to be honest, looking at the results your hosts return now, the issue isn't really a big problem anymore...

m.
____________

[BAT] tutta55
Message 1249 - Posted 12 Jan 2007 16:06:03 UTC

Thanks for the reply, Mark. 85 out of 5000 is 1.7%, and considering that these WU ran about 10 times as long as they should, I think you can say that this last week at least 5% of the total resources were wasted. This is still a considerable amount I think. I hope you can locate the problem. It only occurs with Classical, and I have the impression it started with 5.36.
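
The "at least 5%" figure can be checked with a short calculation. A sketch, assuming every failed WU ran for `slowdown` times the normal length and every other WU ran for exactly the normal length (both are simplifications); `wasted_fraction` is a hypothetical name:

```cpp
#include <cassert>

// Fraction of total CPU time spent on failed work units, measured in units
// of one normal-length work unit. Simplified model, see assumptions above.
double wasted_fraction(int total_wus, int failed_wus, double slowdown) {
    assert(total_wus > 0);
    assert(failed_wus >= 0 && failed_wus <= total_wus);
    const double wasted = failed_wus * slowdown;
    const double useful = total_wus - failed_wus;
    return wasted / (wasted + useful);
}
```

For 85 failures out of 5000 at a 10x slowdown this comes to roughly 0.147, so "at least 5%" is a conservative estimate under these assumptions.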
____________

[BAT] tutta55
Message 1251 - Posted 12 Jan 2007 17:53:02 UTC
Last modified: 12 Jan 2007 17:53:43 UTC

I got another on the C2D that after 3 hours was at 40% and only progressing 0.001% per 2-3 seconds. It would never have made it in the max CPU deadline, so I aborted it :( I don't think it makes much sense for me to continue until this problem is solved.

See http://boinc.gorlaeus.net/result.php?resultid=4590405
____________

[BAT] tutta55
Message 1258 - Posted 13 Jan 2007 17:04:57 UTC

I got 4 more today on my Core 2 Duo:

These ran for 8 hours, then reached max CPU time:
http://boinc.gorlaeus.net/result.php?resultid=4582964
http://boinc.gorlaeus.net/result.php?resultid=4586288

These were also not making progress, so I aborted them:
http://boinc.gorlaeus.net/result.php?resultid=4587098
http://boinc.gorlaeus.net/result.php?resultid=4587092

That means another half day of work lost. This bug has now made me and my team lose almost 3000 credits this week, and that is counting only the errors on my own machines.

Mark, if there is anything I can do to help you track this down, tell me. Otherwise I am out of here. I will finish the trajtou WU I have in cache, all classicals I have aborted.
____________

m.somers
Message 1259 - Posted 13 Jan 2007 19:32:24 UTC

I am currently testing the exact WU you mention on a 2.6 GHz P4 Linux box with 512 MB RAM. It runs fine and only takes about 2.5 h. Next Monday I'll run the same task on a Windows XP box. I have already stopped the generation of more WUs like this, but the DB already contains a few and these have been sent out. It would be a waste to delete those... No?!


If the tests on Windows and Linux for this specific WU are okay, the scientific code of the client is okay and tested. The only two possible things left over are then the BOINC client and/or the specific host... They are out of my reach, as you can understand ;-).

What you could try is to download the standalone classical dynamics code from the site and run it a couple of times on the input file of the WU on your machine... Then you can see if there's a problem there...

Executable:
http://boinc.gorlaeus.net/download/DownLoads/Standalone/Executables/
Input file:
http://boinc.gorlaeus.net/download/classical.butane_molecule_BIG


Hope to have helped you...

m.
____________

[BAT] tutta55
Message 1261 - Posted 14 Jan 2007 0:10:00 UTC
Last modified: 14 Jan 2007 0:20:24 UTC

Mark, I am running the WU in a command window now. I will send the output when it finishes. Pity I probably won't see how much time it will have run.

One thing odd. I can't close the graphics window. The program stops too then. But it's all handled by my GPU anyway I suppose.

EDIT: Although the program is keeping my machine 50% busy, I notice that the graphics are not moving. Is this normal?
____________

[BAT] tutta55
Message 1263 - Posted 14 Jan 2007 10:13:21 UTC

The WU I started manually has used 10h30 of CPU time now on my Core 2 Duo. If running under Boinc it would have reached max CPU time already after about 8 hours. This indicates that the problem is not due to some random behaviour when running on this machine, but can be reproduced. I wonder if it makes sense to let it continue. There is no way for me to check its progress now. I will probably let it continue for a few more hours.
____________

m.somers
Message 1265 - Posted 14 Jan 2007 10:40:20 UTC

The graphics should be moving all the time... and the weird thing is that I run these WUs under Linux without a problem here...

So it can't be my scientific coding ;-)

Could it be that MS Visual C++ makes a mess of things when compiling for windowze?! The other apps, also compiled in such a way, do not show this... Next Monday I can check the win executables on some P4s...

m.
____________

[BAT] tutta55
Message 1266 - Posted 14 Jan 2007 11:08:08 UTC

Mark, does it make any sense for me to let it keep running? It's now at 11h30 CPU time.
____________

ColdRain
Joined: Feb 28, 2006
Posts: 52
ID: 548
Credit: 24,245,377
RAC: 13,162
Message 1268 - Posted 14 Jan 2007 11:46:44 UTC

I have the same problems with Leiden that Tutta55 is experiencing. It's random, but never with trajtou WUs, always the classical ones. I'm having the problems on both Windows and Linux platforms, with all kinds of CPUs (P3, P4, AMD X2, Xeon, Core2Duo, ...). It isn't Boinc version related either; I've had it on 3 different Boinc versions.
I'm suspending LC.
____________

[BAT] tutta55
Message 1269 - Posted 14 Jan 2007 12:17:32 UTC
Last modified: 14 Jan 2007 12:18:52 UTC

I stopped the program after 12h30. I doubt it would finish. Mark, if you want I can send you the file classical.out. There are some data in it, but it hadn't been updated since right after the program started. If you want it, what e-mail address can I send it to? If you don't want to put your e-mail address in the forum, you can contact me here: tutta55 at boinc dot be

m.somers
Message 1272 - Posted 14 Jan 2007 13:58:19 UTC

There are two different things going on here...

1) Tutta55, you could hit any key but ESC once in a while to get more data and to pause the run...

2) ColdRain... First of all, nice to have you back... Second, the Trajtou application WUs and the Classical application WUs return different types of results... The Trajtou WUs return a general reaction probability, whereas the Classical runs return the final coordinates of the run. The latter are far more sensitive to differences in FPUs... That means that the Classical WUs have more trouble validating than the Trajtou WUs do... There is also no real possibility to change that, because that would change the science being done too...
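
Why final coordinates are more fragile than an aggregate probability can be illustrated with a general floating-point fact: addition is not associative, so the same terms accumulated in a different order (or with different intermediate precision) can give different results. A minimal sketch, not taken from the Classical code; `sum_in_order` and `agree` are hypothetical names, and the tolerance check is only an assumption about how a validator might compare results:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Accumulate the terms strictly left to right.
double sum_in_order(const std::vector<double>& terms) {
    double total = 0.0;
    for (double t : terms) total += t;
    return total;
}

// Relative-tolerance comparison of the kind a validator might use
// (an assumption for illustration, not the project's actual validator).
bool agree(double a, double b, double rel_tol) {
    assert(rel_tol >= 0.0);
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

Summing {1e20, 1.0, -1e20} in that order gives 0, while {1e20, -1e20, 1.0} gives 1, because the 1.0 is lost when added to 1e20. In a long trajectory such small differences can grow step by step, which is one way two correct hosts can end up with coordinates that no longer agree within a tolerance.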

Approach to solve these two issues:

figuring out why these Classical WUs take so long and seem to get stuck, which I tend to think is Windows-specific now that I have run some tests here on exactly the same WU, and adding even more specific HR_classes to the server software.

Currently we do differentiate the CPU's with the following algorithm:

// Map a host's vendor and model strings onto a coarse CPU class.
inline int CPU(HOST& host) {
    if (strstr(host.p_vendor, "Intel") != NULL) {
        if (strstr(host.p_model, "Xeon") != NULL) return IntelXeon;
        if (strstr(host.p_model, "Celeron") != NULL) return IntelCeleron;
        if (strstr(host.p_model, "Pentium") != NULL) {
            // "III" is tested before "II" because "II" is a substring of "III".
            if (strstr(host.p_model, "III") != NULL) return IntelPentiumIII;
            if (strstr(host.p_model, "II") != NULL) return IntelPentiumII;
            if (strstr(host.p_model, " 4 ") != NULL) return IntelPentium4;
            if (strstr(host.p_model, " D ") != NULL) return IntelPentiumD;
            if (strstr(host.p_model, " M ") != NULL) return IntelPentiumM;
            return IntelPentium;
        }
        return Intel;
    }
    else if (strstr(host.p_vendor, "AMD") != NULL) {
        if (strstr(host.p_model, "Duron") != NULL) return AMDDuron;
        if (strstr(host.p_model, "Opteron") != NULL) return AMDOpteron;
        if (strstr(host.p_model, "Sempron") != NULL) return AMDSempron;
        if (strstr(host.p_model, "Athlon") != NULL) {
            if (strstr(host.p_model, "XP") != NULL) return AMDAthlonXP;
            if (strstr(host.p_model, "64") != NULL) return AMDAthlon64;
            return AMDAthlon;
        }
        return AMD;
    }
    else if (strstr(host.p_vendor, "Macintosh") != NULL) return Macintosh;
    else return nocpu;
}

Perhaps now you guys see how complicated it gets if different versions of Windows rename the CPU's and so on... Keep in mind that a P4 Celeron is not equivalent to a P4 Mobile or a regular P4...
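
To make the point concrete, here is a minimal, self-contained sketch of the same substring-matching idea applied to a model string of the kind a Windows client may report for a mobile P4. The enum and the function are simplified stand-ins written for this illustration, not the project's server code, and they cover only the " 4 " branch:

```cpp
#include <cassert>
#include <cstring>

// Simplified stand-in classes for illustration only.
enum CpuClass { kUnknown, kIntelOther, kIntelPentium, kIntelPentium4 };

// Classify by substring matching on the vendor and model strings.
CpuClass classify(const char* vendor, const char* model) {
    assert(vendor != nullptr && model != nullptr);
    if (std::strstr(vendor, "Intel") == nullptr) return kUnknown;
    if (std::strstr(model, "Pentium") == nullptr) return kIntelOther;
    if (std::strstr(model, " 4 ") != nullptr) return kIntelPentium4;
    return kIntelPentium;
}
```

A model string such as "Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz" and a (hypothetical) desktop string "Intel(R) Pentium(R) 4 CPU 2.60GHz" both land in `kIntelPentium4`, even though, as noted above, a mobile P4 and a regular P4 are not necessarily equivalent. Telling them apart needs yet another substring check, and that check depends on how each OS happens to name the CPU.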

Furthermore, here is the output of my tests on my laptop on the specific WU on Linux. Both took roughly 2.5 h and ran just fine:

mark::/home/mark/temp/Classical>time GLUT_ClassicalDynamics_Linux.x examples/but
Starting the conformational search:

Max number of steps: 10000
Max step size: 2.500000e+00
Max error for convergence: 1.000000e-04

Total potential energy of system at start of search: 1.685336e+01
Mean of distance squared of system at start of search: 2.916964e+01

Conformation searcher stopped at itteration: 3865

Total potential energy of system: 7.303259e-06

Center of mass of system: [-7.326446e-08,3.434790e-
Dipole moment of system: [-4.663713e+01,2.193912e+
Mean of distance squared of system: 2.613226e+01

Starting the dynamical simulation:

Single cell: [2.500000e+01,2.500000e+0
Start time: 0.000000e+00
End time: 2.067055e+07
Max time step: 4.134110e-01
Number of automatic snapshots to take in this run: 0
The boltzmann constant: 3.166829e-06
The temperature will be kept constant at: 2.930000e+02
The temperature rescaling time: 2.067055e+02

Dynamical simulation has finished:

End time: 2.067055e+07
Total potential energy of system: 5.416916e-04
Total kinetic energy of system: 1.948380e-02
The average potential energy of system: 9.833240e-04
The average kinetic energy of system: 1.948524e-02
The average temperature of the system: 2.929960e+02
The average pressure of the system: -8.314794e-07
The density of system: 6.774844e+00
Total mass of system: 1.058569e+05
Total charge of system: 3.400000e+01
Center of mass of system: [-7.312773e-08,3.386995e-08,5.749006e-09]
Dipole moment of system: [-4.662384e+01,2.189264e+01,3.463478e+00]
Mean of distance squared of system: 2.645044e+01

8566.835u 215.639s 3:27:15.94 70.6% 0+0k 0+0io 0pf+0w

mark::/home/mark/temp/Classical>time GLUT_ClassicalDynamics_Linux.x examples/butane_BIG
Starting the conformational search:

Max number of steps: 10000
Max step size: 2.500000e+00
Max error for convergence: 1.000000e-04

Total potential energy of system at start of search: 1.685336e+01
Mean of distance squared of system at start of search: 2.916964e+01

Conformation searcher stopped at itteration: 3865

Total potential energy of system: 7.303259e-06

Center of mass of system: [-7.326446e-08,3.434790e-08,5.303640e-09]
Dipole moment of system: [-4.663713e+01,2.193912e+01,3.420166e+00]
Mean of distance squared of system: 2.613226e+01

Starting the dynamical simulation:

Single cell: [2.500000e+01,2.500000e+01,2.500000e+01]
Start time: 0.000000e+00
End time: 2.067055e+07
Max time step: 4.134110e-01
Number of automatic snapshots to take in this run: 0
The boltzmann constant: 3.166829e-06
The temperature will be kept constant at: 2.930000e+02
The temperature rescaling time: 2.067055e+02

Dynamical simulation has finished:

End time: 2.067055e+07
Total potential energy of system: 5.416916e-04
Total kinetic energy of system: 1.948380e-02
The average potential energy of system: 9.833240e-04
The average kinetic energy of system: 1.948524e-02
The average temperature of the system: 2.929960e+02
The average pressure of the system: -8.314794e-07
The density of system: 6.774844e+00
Total mass of system: 1.058569e+05
Total charge of system: 3.400000e+01
Center of mass of system: [-7.312773e-08,3.386995e-08,5.749006e-09]
Dipole moment of system: [-4.662384e+01,2.189264e+01,3.463478e+00]
Mean of distance squared of system: 2.645044e+01

8572.332u 158.100s 3:48:20.02 63.7% 0+0k 0+0io 4pf+0w

So it's not the scientific code for sure, because then this test would fail too. What is left over is Windows effects and BOINC... I'm currently running the BOINC-enabled code, standalone, on the same WU on Linux... But this will take some time ;-)...

m.




____________

[BAT] tutta55
Message 1275 - Posted 14 Jan 2007 16:42:45 UTC

Mark, like I said in my previous post: I stopped the run after 12h30. Is there anything you can do with the output file?

As for the fp differences, I'm sure you thought of that before, but can't you use a platform-independent floating point library? Some other projects do that. I think the recently started wep-m+2 does.
____________

Brian B
Joined: Mar 21, 2006
Posts: 17
ID: 896
Credit: 13,715
RAC: 16
Message 1278 - Posted 15 Jan 2007 5:26:19 UTC
Last modified: 15 Jan 2007 5:29:41 UTC

Hi Mark. Restated from this thread.

I ended up aborting this wu as it was at a CPU time of 11:02:40 with 26.734% Progress and 09:57:02 and increasing for To completion. I opened the 'Show graphics' and the red box was there, but nothing was happening, i.e. there were no molecules floating around. It seemed to be stuck. After another project had run and the wu restarted, the graphics were showing the molecules again, but after a while (I'm not sure how long) it would hang again. I have also seen the same problem with 1759334 and 4598129, and I aborted these two as well.

Also, these two, 1758156/result and 1762384/result, had the error "No heartbeat from core client for 31 sec - exiting". I don't know if it's the same issue or not, but they both ran for about 3 hours before exiting, but it shows these two had an Outcome of Success.

I'm running Windows 2000 SP4 with the following:
01/13/2007 12:45:07 AM||Starting BOINC client version 5.4.11 for windows_intelx86
01/13/2007 12:45:07 AM||libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3
01/13/2007 12:45:07 AM||Data directory: C:\Program Files\BOINC
01/13/2007 12:45:08 AM||Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
01/13/2007 12:45:08 AM||Memory: 766.98 MB physical, 1.08 GB virtual
01/13/2007 12:45:08 AM||Disk: 9.76 GB total, 1.52 GB free

Hope this helps fix the issues with the new version of Classical. Good luck!
____________

m.somers
Message 1279 - Posted 15 Jan 2007 8:43:40 UTC

Some new info...

I ran the troublesome WU standalone and, as it turns out, when linking against the BOINC library this WU segfaults at the very end. The weird thing is that the numbers are okay and the segfault is located somewhere in the GlutMainLoop event handler. Even funnier is that the GLUT-enabled standalone application, thus without the BOINC lib but with GLUT, works just fine...

I'm beginning to suspect that there is a little bug in the BOINC library...

Currently running more tests, but as you can imagine, you have to give me some time on this... the runs take about 2.5 h each ;-P...

With respect to the emulation of the FPU... that would be very much a waste of time... But perhaps I can compile for i386 with 387 FPU instructions only, so not using mmx, sse and/or 3dnow... Maybe that doesn't cost too much in performance... Currently testing that too...

m.
____________

[BAT]Krikke
Message 1280 - Posted 15 Jan 2007 9:52:46 UTC

I'm seeing the same type of errors here.

I'll be suspending the crunching until the problem gets solved.

regards,
____________

m.somers
Message 1281 - Posted 15 Jan 2007 10:10:53 UTC

No need to stop all the crunching... just the WU's with the name wu_164284800_*... Just abort them in your list and other WU's will run fine... there are plenty of them ;-)...



m.


____________

[BAT] tutta55
Message 1282 - Posted 15 Jan 2007 10:37:51 UTC
Last modified: 15 Jan 2007 10:40:58 UTC

Mark, as this is an error that caused a major loss of credits, and I am talking thousands here for me and some of my teammates (I know, I know, credits are not important :p ), it would be a nice gesture if all work units that errored out with "Max CPU time" were awarded a fair amount of credit after all. Something similar happened at uFluids: they awarded credit for all work units that were stuck, with an upper limit of 500. 500 may be a bit steep here. In my case a quick calculation shows that 8 hours of work on 1 core of my C2D amounts to a conservative estimate of 175 credits.
____________

m.somers
Message 1283 - Posted 15 Jan 2007 13:06:35 UTC

I will at some point... Not this week though... Made a snapshot of the DB of all the affected hosts...

BTW, not all these WU's crash and certainly not on all hosts:

mysql> select count(id) from result where name like "wu_164284800_%" and outcome=3;
+-----------+
| count(id) |
+-----------+
| 1402 |
+-----------+
1 row in set (0.09 sec)

mysql>

I count 367 hosts only that are affected...

mark::/home/mark>cat result | sort | uniq | wc -l
367

Compare that to the total number:
mysql> select count(id) from result where name like "wu_164284800_%";
+-----------+
| count(id) |
+-----------+
| 5411 |
+-----------+
1 row in set (0.08 sec)

Which tells me that roughly 25% of the WU's are affected and thus that 75% are *not* affected by this weird thing...
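
The two figures above can be reproduced in a few lines. A sketch, assuming the file `result` holds one host ID per affected result (the `sort | uniq | wc -l` pipeline suggests this, but the file's format is not shown here); the function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Equivalent of `sort | uniq | wc -l`: number of distinct host IDs.
std::size_t distinct_count(const std::vector<long>& host_ids) {
    return std::set<long>(host_ids.begin(), host_ids.end()).size();
}

// Share of results with the error outcome among all results of the batch.
double affected_fraction(long affected, long total) {
    assert(total > 0);
    assert(affected >= 0 && affected <= total);
    return static_cast<double>(affected) / static_cast<double>(total);
}
```

With the counts from the two queries, `affected_fraction(1402, 5411)` is about 0.259, which matches the "roughly 25%" quoted above.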

I'll sort it out as soon as I have a bit more time for this....

m.






____________

[BAT] tutta55
Message 1289 - Posted 16 Jan 2007 9:34:07 UTC

Mark, if you need me to run a few tests outside Boinc while trying to solve this problem, just tell me. My e-mail address is about 9 posts down in this thread.
____________

m.somers
Message 1293 - Posted 16 Jan 2007 14:16:14 UTC

Nice offer, but there is not much more you can do... I'm currently running the app within a debugger on my 3.0 GHz EM64T machine with 4 GB RAM... It just takes time...

m.
____________

m.somers
Message 1294 - Posted 16 Jan 2007 22:17:29 UTC

FYI

http://www.boinc.be/vbulletin/showpost.php?p=27691&postcount=14

Me thinks that I am going to invite ColdRain for a visit to Leiden... Wanna come and give a little talk here on how you think things can be improved ?!

m.


____________

ColdRain
Message 1295 - Posted 17 Jan 2007 21:29:50 UTC

Je pense, donc je suis.
As long as you're thinking, you're still alive.
There's still hope for mankind.

____________

m.somers
Message 1297 - Posted 19 Jan 2007 9:26:30 UTC - in response to Message ID 1295.
Last modified: 19 Jan 2007 15:11:38 UTC

Yes I know I know... But does this mean that you are going to pay us a visit ?

m.

Edit:

Well, according to http://www.boinc.be/vbulletin/showthread.php?t=1835&page=3 you won't... You are just being jealous...

Shame really... I had hoped for a fruitful and constructive meeting with two sensible people, as I'm used to having with e.g. my fellow scientists...

BTW, in what sort of projects do you have 20 years of experience then?


Je pense, donc je suis.
As long as you're thinking, you're still alive.
There's still hope for mankind.


____________

[BAT] tutta55
Message 1301 - Posted 19 Jan 2007 22:14:13 UTC
Last modified: 19 Jan 2007 22:15:29 UTC

Since this is my thread being led off-topic - which I think is regrettable - I feel compelled to react to the previous remark.

ColdRain's remark is made in our team's own forum. While it is normal that team members discuss what is going on in the various projects, I find it odd that a project's forum is used to discuss what is said among a team's members. Our team is in competition with another, which is apparently promoted by this project's leader. If ColdRain's remark is a sign of anything, then it is not one of jealousy, but maybe a sign that project leaders may do well to remain neutral. Promoting one team, while ignoring another - which has been this project's largest contributor since September - is not a sign of neutrality.

The best thing to do, in my opinion, is to get back to business and work on solving the problem reported in this thread. As far as I am concerned, the messages with IDs 1294, 1295, 1297, and also this message here are best removed.

____________

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

m.somers User profile image
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar
private message
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1306 - Posted 20 Jan 2007 10:56:26 UTC

Putting the thread back on topic: look at http://boinc.gorlaeus.net/forum_thread.php?id=170, which will tell you how busy I am trying to solve the issue...

m.
____________
M.F. Somers

m.somers User profile image
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar
private message
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1319 - Posted 22 Jan 2007 10:22:57 UTC

Some extra news....

I have implemented a fix in the code and am currently testing it on some of the WUs that seemed to get stuck... It seems the trouble was related to taking the sqrt of a very small negative number that should have been zero... If the tests work out, a new version will be made active...
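
For readers wondering what such a fix can look like, here is a minimal sketch in Python. It is not the project's actual code, which is not shown in this thread; the function name safe_sqrt and the tolerance value are assumptions made purely for illustration. The idea is to treat a tiny negative argument, produced by floating-point round-off, as zero before taking the square root, while still rejecting arguments that are negative by more than the tolerance.

```python
import math


def safe_sqrt(x, tol=1e-12):
    """Square root that treats tiny negative round-off as zero.

    safe_sqrt and tol are hypothetical names for this illustration;
    a suitable tolerance depends on the scale of the quantities involved.
    """
    if x < 0.0:
        if x < -tol:
            # Clearly negative: a real error, not round-off.
            raise ValueError(f"sqrt of negative value {x!r}")
        # Slightly negative only because of rounding: clamp to zero.
        return 0.0
    return math.sqrt(x)


# Mathematically zero, but slightly negative in double precision.
residual = 0.3 - (0.1 + 0.2)
print(residual < 0.0)        # True
print(safe_sqrt(residual))   # 0.0
```

In Python, math.sqrt raises an error for a negative argument. In C or Fortran the result is typically NaN instead, which can silently propagate through later calculations; whether that is what made these WUs appear stuck is not stated here.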

Furthermore, wrt the hr_classes... A discussion is currently being held on the BOINC developer mailing list on how to solve this... The client probably needs changing before the server software can be changed... Working on it... In the meantime I'm trying to get a dirty hack done so that, for the time being, fewer problems are encountered...

Finally, wrt the lost credit... Probably at the end of the week, if I'm not too loaded with other work, the credits will be taken care of...

m.
____________
M.F. Somers

[BAT] tutta55 User profile image
Avatar
private message
Joined: Feb 26, 2006
Posts: 104
ID: 485
Credit: 4,072,468
RAC: 293
Message 1323 - Posted 22 Jan 2007 13:55:01 UTC - in response to Message ID 1319.

Some extra news....

I have implemented a fix in the code and am currently testing it on some of the WUs that seemed to get stuck... It seems the trouble was related to taking the sqrt of a very small negative number that should have been zero... If the tests work out, a new version will be made active...

Furthermore, wrt the hr_classes... A discussion is currently being held on the BOINC developer mailing list on how to solve this... The client probably needs changing before the server software can be changed... Working on it... In the meantime I'm trying to get a dirty hack done so that, for the time being, fewer problems are encountered...

Finally, wrt the lost credit... Probably at the end of the week, if I'm not too loaded with other work, the credits will be taken care of...

m.


Thanks, Mark. Well done. And that on a Monday ;)

____________

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

Chris
private message
Joined: Nov 28, 2006
Posts: 25
ID: 2224
Credit: 2,247
RAC: 71
Message 1326 - Posted 22 Jan 2007 21:37:01 UTC

Great, Mark! Keep us posted on when the new version is out.

Chris
____________

m.somers User profile image
Forum moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar
private message
Joined: Nov 14, 2005
Posts: 660
ID: 1
Credit: 1,417,572
RAC: 2
Message 1327 - Posted 23 Jan 2007 14:41:59 UTC

See the news... App has been updated...

m.
____________
M.F. Somers


Copyright © 2017 Leiden University - Leiden Institute of Chemistry - Theoretical Chemistry Department