|
Main
Date: 10 Nov 2006 00:46:42
From: [email protected]
Subject: Test suites and ply depth
|
Would it be useful to have test suites consisting of a set of positions which should be solvable in 12 ply, a set which should be solvable in 13 ply, another for 14 ply, and so on up as far as desirable? That way, one could quickly check a program by seeing how well it performed on the 12 ply set, for example, since searching takes much less time at lower ply depth. And one could see if there was a problem at a certain depth: say the program works well up to 16 ply but does poorly at higher ply. There could be both tactical sets and strategic sets which should be solvable by the time the search reaches a certain ply depth.

I don't know much about test suites (most likely the above has already been done if it is a good idea, as it seems obvious enough), but I was thinking about the issue after reading about the recent experiment where Capablanca did better than the other world champions when tested against Crafty limited to a search depth of 12 ply. Obviously it would be interesting to try the test again at greater ply depths and with other engines as well as that most admirable engine Crafty. More games by other strong players, in sufficiently strong tournaments and matches, could also be included, and a correlation might be established between such results against engines and Elo ratings or similar rating systems.

I was thinking an engine might be deliberately tuned to be consistent with the style of a given player, like Capablanca, but then realized the existing test suites probably already do even better than that at tuning an engine's playing style.

To compensate for opening book knowledge, perhaps a chess tree could be modified to show the date on which each move from a given position was first played in high-level competition, so that the engine test begins only with the first new move which hadn't been played before.
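To make the proposal concrete, here is a minimal sketch of a harness that groups positions by the depth at which they are claimed to be solvable and reports a solve rate per depth. Everything in it is hypothetical: `SuiteEntry`, `solve_rates`, and the `search(position, depth)` callable are illustrative names, not part of any existing suite or engine, and the positions are placeholders rather than real chess data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class SuiteEntry(NamedTuple):
    position: str     # opaque position identifier, e.g. a FEN or EPD string
    best_move: str    # the move the suite treats as the solution
    solve_depth: int  # ply depth at which the position is claimed solvable


def solve_rates(
    suite: Iterable[SuiteEntry],
    search: Callable[[str, int], str],
) -> Dict[int, float]:
    """Return the fraction of positions solved, keyed by claimed depth."""
    solved: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for entry in suite:
        total[entry.solve_depth] += 1
        # Search each position only to the depth its set is labelled with.
        if search(entry.position, entry.solve_depth) == entry.best_move:
            solved[entry.solve_depth] += 1
    return {depth: solved[depth] / total[depth] for depth in sorted(total)}


if __name__ == "__main__":
    # Placeholder suite and a stand-in "engine" that fails at 16 ply.
    suite = [
        SuiteEntry("pos-a", "e4", 12),
        SuiteEntry("pos-b", "Nf3", 12),
        SuiteEntry("pos-c", "Qh5", 16),
    ]
    answers = {"pos-a": "e4", "pos-b": "Nf3", "pos-c": "Qh5"}

    def fake_search(position: str, depth: int) -> str:
        return answers[position] if depth < 16 else "??"

    for depth, rate in solve_rates(suite, fake_search).items():
        print(f"{depth} ply: {rate:.0%} solved")
```

A drop in the solve rate at one depth label would be the "problem at a certain depth" signal described above, though it only means something for a single engine whose notion of depth stays fixed between runs.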
|
|
|
Date: 13 Nov 2006 22:34:17
From: Simon Waters
Subject: Re: Test suites and ply depth
|
> <[email protected]> wrote in message
> news:[email protected]...
>> Would it be useful to have test suites which consist of a set of
>> positions which should be solvable in 12 ply, a set which should be
>> solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?
>
> No, not really.

Although a similar test is done. There are published tables of the correct number of legal move sequences to specific depths from the initial position. Checking a program's counts against them is a reasonable way to detect flaws in the move generator.
|
| |
Date: 14 Nov 2006 10:10:36
From: Mr. Question
Subject: Re: Test suites and ply depth
|
"Simon Waters" <[email protected]> wrote in message news:4558f2e9.0@entanet...
>> <[email protected]> wrote in message
>> news:[email protected]...
>>> Would it be useful to have test suites which consist of a set of
>>> positions which should be solvable in 12 ply, a set which should be
>>> solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?
>>
>> No, not really.
>
> Although a similar test is done. There are published tables
> for the correct number of legal moves to specific depths from the
> initial position. A reasonable way to detect flaws in the move generator.

Yes, that's called "Perft".

And actually, the initial position isn't a good position to test. It misses a lot of types of moves (en passant, promotion, mate, etc.). Positions like "Kiwi Pete" are better to test with Perft.

There are some limitations to Perft. Although it tests the move generator, the make & unmake move routines, and the InCheck() stuff, that still leaves a lot untested.

Also, the guy was wanting to know about solving positions to find out how good the programs are. Perft doesn't help with that.

One final comment about Perft.... Many people use perft as a benchmark, but you should not do that. Its runtime behavior is *very* different from an actual search. You can't use it as a real benchmark. You can use it as a measure to determine if your core routines are faster or slower than before, but that doesn't relate well to actual search performance, because that depends mostly on your move ordering, not the low-level performance of individual routines.

(Perft is short for "perf"ormance "t"est. It was originally conceived as a way to measure your low-level routine performance, as well as to debug your core routines. The way it behaves is totally different from the way a search works, so you can't relate perft performance to search performance or to program strength.)

Also, you can't compare perft results among programs because they may be doing things differently than what you do. They may be updating more information (databases etc.) that they use in their evaluator or nifty search extensions, or whatever. So it's possible for their makemove() to be slower but their overall program to be faster and stronger.

(And as a side note, faster does not mean stronger. A program can be very fast but have a stupid evaluator and play poorly. And a slow program doesn't mean it's smart, either.)

Finally, many people *cheat* in perft tests. Because they treat it as a benchmark to compare against other programs, they will do things like not doing the makemove() on the final ply of the perft test. Or they'll use hash tables specifically designed for perft tests.

Perft is not designed as a benchmark. It's designed as a debugging aid and as a way to tell if your own performance improvements in your makemove() & unmakemove() are actually faster than before. It cannot be compared among programs.

Now, having said all of that, I'm quite willing to admit that I too have used perft as a benchmark to compare my program to others. Even though I know I shouldn't, I've done it anyway....

----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
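For readers who have not seen perft, here is a minimal sketch of the idea. To stay short and self-contained it uses tic-tac-toe as a stand-in for chess, so the `Board` class and its `legal_moves`, `make_move`, and `unmake_move` methods are illustrative assumptions, not any engine's API. `perft` makes and unmakes every move, while `perft_bulk` skips make/unmake on the final ply, which is the shortcut described above.

```python
from typing import List, Optional

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


class Board:
    """Tic-tac-toe stand-in for a chess board; squares are indexed 0..8."""

    def __init__(self) -> None:
        self.cells: List[Optional[str]] = [None] * 9
        self.to_move = "X"

    def has_winner(self) -> bool:
        c = self.cells
        return any(c[a] is not None and c[a] == c[b] == c[d] for a, b, d in LINES)

    def legal_moves(self) -> List[int]:
        if self.has_winner():
            return []  # game over, so there are no further moves
        return [i for i, cell in enumerate(self.cells) if cell is None]

    def make_move(self, square: int) -> None:
        self.cells[square] = self.to_move
        self.to_move = "O" if self.to_move == "X" else "X"

    def unmake_move(self, square: int) -> None:
        self.cells[square] = None
        self.to_move = "O" if self.to_move == "X" else "X"


def perft(board: Board, depth: int) -> int:
    """Count positions reached after exactly `depth` plies, making every move."""
    if depth == 0:
        return 1
    nodes = 0
    for move in board.legal_moves():
        board.make_move(move)
        nodes += perft(board, depth - 1)
        board.unmake_move(move)
    return nodes


def perft_bulk(board: Board, depth: int) -> int:
    """Same count, but the final ply is counted without make/unmake."""
    if depth == 0:
        return 1
    moves = board.legal_moves()
    if depth == 1:
        return len(moves)  # no make_move/unmake_move calls on the last ply
    nodes = 0
    for move in moves:
        board.make_move(move)
        nodes += perft_bulk(board, depth - 1)
        board.unmake_move(move)
    return nodes


if __name__ == "__main__":
    for depth in range(1, 6):
        print(depth, perft(Board(), depth), perft_bulk(Board(), depth))
```

Both functions return the same counts, so a mismatch against a known-good table still exposes generator or make/unmake bugs. They do very different amounts of work to get there, though, which is one reason timings from two programs' perft routines say little about either program.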
|
|
Date: 10 Nov 2006 22:09:46
From: Mr. Question
Subject: Re: Test suites and ply depth
|
<[email protected]> wrote in message news:[email protected]...
> Would it be useful to have test suites which consist of a set of
> positions which should be solvable in 12 ply, a set which should be
> solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?

No, not really.

Most programs do search depth differently. They don't simply search 'x' plies deep and stop.

There are a variety of pruning and reduction techniques that can be applied, including null moves, various reductions, etc.

Then there are the search extensions that can be done during the main search. For example, not counting a move that deals with being in check, or not counting a move that promotes a pawn. Or whatever.

Then there are the choices made during the q-search: which nodes to expand and which ones to ignore.

So search depth can't be compared among programs. One program's depth 8 may be another program's depth 10, even though they may take about the same length of time to search and seem to find the same moves.

Depth is only valid within a particular program, for comparison with other versions of itself, as a way of helping to gauge whether a modification is an improvement or not. And even then, the usefulness is limited.

Nor can you depend on search time. Many programs are tuned for specific types of architectures. Or are designed for SMP systems. Or certain types of parallel systems. Or whatever. So comparing their performance on non-native hardware isn't a reliable indicator of what they can do.

So even time-based tests are only valid when run on a program's preferred hardware, and you can't compare those results to another program's, even when running on the same hardware, because that hardware may not be what the other program was designed for.

There are some 'standard' test suites that some people use: Bratko-Kopec, Win At Chess, etc. But it's really hard to compare results from one program to the next. About all you can really say is something like "On my system (cpu=xyz, mhz=abc, board=123, ram=fgh, etc.) I got these results...." Whether you report the search time or the search depth, the results are only valid for that system and your program.

There is so much variation among programs that it's not really easy to compare their performance. The only reasonable way to do that is full tournaments where each program plays dozens of games against the others.

And even then, there's more that could be said.
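The point about extensions can be illustrated with a toy search. The sketch below is not chess: it runs a plain negamax over an artificial tree with three moves per node, and `gives_check` is a made-up rule that flags one move on every third ply as "check". All names here are hypothetical. With `extend_checks` enabled, a checking move does not use up any depth, so the same nominal depth reaches further down the tree and visits more nodes.

```python
from dataclasses import dataclass
from typing import Tuple

Path = Tuple[int, ...]  # the sequence of move indices played from the root


@dataclass
class Stats:
    nodes: int = 0
    max_ply: int = 0


def gives_check(path: Path, move: int) -> bool:
    # Toy rule (an assumption, not chess): move 0 is "check" on every third ply.
    return move == 0 and len(path) % 3 == 0


def evaluate(path: Path) -> int:
    # Arbitrary deterministic score from the side to move's point of view.
    return sum((i + 1) * m for i, m in enumerate(path)) % 7 - 3


def negamax(path: Path, depth: int, extend_checks: bool, stats: Stats) -> int:
    stats.nodes += 1
    stats.max_ply = max(stats.max_ply, len(path))
    if depth <= 0:
        return evaluate(path)
    best = -10**9
    for move in range(3):
        # With the extension on, a checking move costs no depth.
        cost = 0 if (extend_checks and gives_check(path, move)) else 1
        score = -negamax(path + (move,), depth - cost, extend_checks, stats)
        best = max(best, score)
    return best


if __name__ == "__main__":
    for extend in (False, True):
        stats = Stats()
        negamax((), 4, extend, stats)
        print(f"extend_checks={extend}: nodes={stats.nodes}, max ply={stats.max_ply}")
```

Both runs would report "depth 4" to the user, yet they explore differently shaped trees. That is the sense in which a depth figure only means something relative to other versions of the same program.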
|
|