|
Main
Date: 01 Aug 2008 10:40:57
From: Guest
Subject: How do you know if an engine is xyz% better???
|
One item shows up in this group regularly.... namely whether some engine is now xyz% stronger than a previous version.

Hyatt has been doing some tests over the past year or so, and he's posted occasionally on it. He's just posted some data again:

http://www.talkchess.com/forum/viewtopic.php?t=22731&postdays=0&postorder=asc&topic_view=&start=0

For those who don't want to read the thread, here's the basic situation.

Let's say you have an engine that plays at 'abc' ELO. (This score is determined by serious testing, rated games, etc.) And you make some changes. How can you tell if the changes make the program stronger or weaker?

The obvious answer is to play a few games and find out. (Often this is done either with a standard set of program opponents or on the various internet chess servers.) The question that has often been raised in the talkchess forums is "How many games do you really need to play?" Hyatt has published several responses to that question (often getting arguments in return).

Well, Hyatt has published some more data on answering that question. (And you can bet your home he still has the hard game data to back up the results. One thing you can always be sure of with him is that he does serious testing and gets hard data.)

To answer the question... even 800 games is *NOT* enough to determine if a new version of your program is actually stronger than your old version. (These were standard even opening positions, played from both sides, against a fixed set of program opponents.)

(This is about program changes, not performance / speed improvements like you'd get if you moved to faster hardware, etc.)

There's enough statistical variance and timing difference (clock skew, OS overhead, random stuff, etc.) that even a few NPS of difference can lead to wildly different results.

So if anybody is thinking that playing a handful of games is enough to say a new version of their program is better than the previous version, you might want to rethink.

Note that this does not mean that a handful of tests isn't enough to detect *massive* changes in program strength. If your program suddenly jumps 300 points, then that kind of change will be easier to detect. But smaller changes, like you would get from refining your program, can become very hard to detect.

Well, I just thought I should post this and give interested people a "heads up" to go over to TalkChess and follow the discussion. Doubt it'll do any good, but.... (shrug)
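If you want a feel for the numbers, here's a little back-of-the-envelope Python sketch. (My own assumptions, not Hyatt's actual test setup; the 0.32 draw rate is just a guess for computer-vs-computer play.) It estimates the 95% error bar, in Elo, that a match of a given size can resolve:

import math

def elo_from_score(p):
    # Elo difference implied by an expected score p (standard logistic model)
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_error_95(n_games, draw_ratio=0.32):
    # Half-width of the 95% confidence interval, in Elo, for a match of
    # n_games between roughly equal engines.  Treats games as independent
    # trials; draws shrink the per-game variance.  draw_ratio is a guessed
    # figure, not measured data.
    var_per_game = 0.25 - draw_ratio / 4.0     # variance of one game's score
    se = math.sqrt(var_per_game / n_games)     # standard error of mean score
    return elo_from_score(0.5 + 1.96 * se) - elo_from_score(0.5)

for n in (100, 800, 25000, 1000000):
    print("%8d games: +/- %.1f Elo" % (n, elo_error_95(n)))

On those assumptions you get roughly +/- 57 Elo at 100 games, +/- 20 at 800, +/- 3.5 at 25,000, and under 1 Elo only up around a million games. Which is exactly why 800 games can't resolve a small tweak.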
|
|
|
Date: 03 Aug 2008 14:02:28
From: johnny_t
Subject: Re: How do you know if an engine is xyz% better???
|
The engines get tested thousands of times when released. They get tested hundreds of times before they are released. Strength and variance is just math. People have been doing this correctly for a long time.

Try looking up CEGT or CCRL for the latest lists, methodologies, variances, and ELO.

Sheesh.

Guest wrote:
> One item shows up in this group regularly.... namely whether some engine
> is now xyz% stronger than a previous version.
> [snip]
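The math isn't exotic, either. The lists quote an error margin and a likelihood of superiority (LOS) with every result. A minimal sketch of the usual LOS statistic (normal approximation; the game counts below are made up):

import math

def los(wins, losses):
    # Probability that the match winner is genuinely stronger, given the
    # win/loss counts (draws carry no signal here).  Standard normal
    # approximation, as used by the rating lists.
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print("LOS = %.3f" % los(220, 180))   # 220 wins, 180 losses, rest drawn

Even a +40 margin over roughly 800 games only gets you to about 97.7% confidence.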
|
| |
Date: 03 Aug 2008 18:14:05
From: Guest
Subject: Re: How do you know if an engine is xyz% better???
|
"johnny_t" <[email protected] > wrote in message news:[email protected]... > The engines get tested thousands of times when released. They get tested > 100's of times before they are released. Strength and variance is just > math. It's "just math" if you have humans involved. When it's computer vs. computer testing though, things don't behave as expected. There is more involved than the normal asumptions implied in the "just math". What Hyatt has shown (in that thread and several others) is that when he plays matches consisting of hundreds of thousands (something very few people can do), the results he gets does not match the expected results. That there is so much variance that you can't depend on running a few hundred automated matches to test changes in your program. Thoughout chess programming history, people generally tune their engines one of two ways. 1) used some test positions and let your program generate a move and compare that to the 'predicted' move, then adjust weights various ways. 2) Play a few dozen to a few hundred games with a few (sometimes just one) other programs. If the added idea causes worse play, then toss it out or adjust the weight. Number 1 can be used for casual tuning but it's not the most accurate. Number 2 is what most people do. Hyatt has repeatedly shown (not just that one thread, but several past threads) that doing a few dozen or even a few hundred automated games is not enough to accurately determine if a modification is better or not. Until recently, very very few people had access to the computing clusters like he does now. So very few people have ever been in a position to run such massive testing. Not many people can run tests involving anywhere from 25,000 games on up to a couple million games. The results he's getting do not set well with many people. They violate expected behavior which is usually based on matches with humans involved, or on computer vs. computer testing with only a few hundred games played. But he is getting the results. > > People have been doing this correctly for a long time. People *think* they've been doing this correctly for a long time. That's not the same thing. When you get testing results from a few hundred tests that don't agree with the results you get from hundreds of thousands to millions of tests, then you've got problems if all you can do is a few hundred tests. > Try looking up CEGT or CCRL for the latest lists, methodologies, > variances, and ELO. > > Sheesh. I know about those kinds of tests. And those aren't the same kinds of tests that Hyatt is doing. He's doing it on a much more massive scale, and like you'd do if you were trying to determine if a program change is better or worse than an older version. Read the thread. And he's getting provable, repeatable results that do not agree with the small tests that others have been doing. People have been assuming that the results from a 'small' match involving a few hundreds computer vs. computer games are accurate. Hyatt's tests have repeatedly shown they aren't. There's enough randomness and computer vs. computer interaction to keep even large scale testing inaccurate enough to detect small changes in the playing quality. That's kind of the problem.... This is not the first thread he's done on this. He's been running tests like that for a couple years now, ever since his University added some cluster computers. Nobody had done this kind of massive testing before now. 
And these unexpected results are definitely pissing off a lot of people, because they just don't agree with their preferred beliefs and with how everybody has been doing their testing over the years.

Unfortunately, not too many other people have the resources to run such massive testing. So right now, nobody can repeat his experiment. We have to talk with Bob and try to determine what might have caused such results, and when nothing can reasonably explain it, assume they are right. Just like the other reports he's done on massive testing.
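You don't need a cluster to see the variance problem, by the way. Here's a toy Python simulation (my assumptions throughout: the draw rate, the 200-game match size, none of it is Bob's methodology). It plays repeated 200-game matches with an engine of *unchanged* strength and counts how often plain noise makes it look at least 10 Elo stronger:

import random

GAMES, TRIALS = 200, 20000
DRAW = 0.32                    # assumed draw rate, not a measured figure
WIN = (1.0 - DRAW) / 2.0       # equal strength: win rate == loss rate

def match_score(n):
    # Total score of one n-game match, as a fraction of n
    s = 0.0
    for _ in range(n):
        r = random.random()
        s += 1.0 if r < WIN else (0.5 if r < WIN + DRAW else 0.0)
    return s / n

# +10 Elo corresponds to an expected score of about 51.4%
threshold = 1.0 / (1.0 + 10.0 ** (-10.0 / 400.0))

fooled = sum(match_score(GAMES) >= threshold for _ in range(TRIALS))
print("looked >= 10 Elo stronger in %.0f%% of runs" % (100.0 * fooled / TRIALS))

On those assumptions it comes out around 30% of runs. And another roughly 30% of runs make the unchanged engine look 10 Elo *weaker*. That's the trap people fall into when they test with a couple hundred games.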
|
|
Date: 02 Aug 2008 06:00:28
From: Sanny
Subject: Re: How do you know if an engine is xyz% better???
|
> Note that this does not mean that a handful of tests isn't enough to
> detect *massive* changes in program strength. If your program suddenly
> jumps 300 points, then that kind of change will be easier to detect. But
> smaller changes, like you would get from refining your program, can become
> very hard to detect.

When GetClub Chess improved by 30% I was unable to detect the changes. But 4 months back, when the game doubled its strength, I was able to find the improvements by seeing a few games.

Bye
Sanny

Play Chess at: http://www.GetClub.com/Chess.html
|
| |
Date: 02 Aug 2008 09:43:06
From: Guest
Subject: Re: How do you know if an engine is xyz% better???
|
>"Sanny" <[email protected]> wrote in message >news:e20b61e1-c869-4676-8cd7-af84f9a946b9@r35g2000prm.googlegroups.com... >> Note that this does not mean that a hand full of tests isn't enough to >> detect *massive* changes in program strength. If your program suddenly >> jumps 300 points, then that kind of change will be easier to detect. But >> smaller changes, like you would get from refining your program can become >> very hard to detect. > >When GetClub Chess improved 30% better I was unable to detect the >changes. Then how the expletive do you know it actually improved by 30%??! Unless you mean as if you moved to hardware that was 30% faster. That doesn't mean it was 30% better, though. Just that it ran 30% faster. >changes. But when 4 months back the game used to double its strength I >was able to find the improvements by seeing a few games. It depends on what you define as "double in strength". If you mean going from 1000 to 2000 points, then yes, that would be easy to detect with a high degree of confidence. (Or similar doubling in scales that aren't linear.) If you mean program speed while getting the exact same result and searching the exact same tree (as if you moved to faster hardware with no program changes), then that too would be fairly easy to detect and no real need to actually test, since that would be pure speed improvement. If you mean any other kind of 'double in strength', then that's a worthless estimate. Saying that it predicts move XYZ in 30% less time or can hold its own against program ABC in 30% less time or plays games against program ABC that are 30% longer are utterly bogus. Pure crap. It's simply not a valid way to measure playing improvements. (The first two only say they get the same old result at 30% less time but say nothing about whether they would actually improve at the full time. It may change its mind to a worse move. The last one assumes there's a direct correlation between game length and strength, but there's not. A long game does not mean the program is strong, and an even longer game does not mean the program is even stronger. By that reasoning, a game that is dragged out for 200 moves would mean the programs are super super strong. Game length can be related to strength but it says nothing about the quality of the moves themselves.) Elo (or other) ratings and raw speed are the only two measurements that have any meaning. Now, test sets (like the WAC set or even the classic Bratko-Kopec set, and many others) have their uses. They can be used as a test to see if anything was broken by your latest changes or if it can 'see' something new. And it can be useful to keep track of improvements in those areas. There's nothing wrong if you report results from standardized tests as long as you report them as such. But only full games can be used to gauge a program's strenth with any sort of accuracy. And based on Hyatt's results over the past few years LOTS of full games are needed. (Some may ask why Hyatt is seeing these results and not anybody else. Certainly a valid question! It's possible there's a flaw in Hyatt's method. But it's more likely that few chess programmers have access to the kind of hardware he does and just can't / don't run hundreds of thousands of games to try to detect small improvements. Most programs go their entire life without ever playing a hundred thousand games. Hyatt, on the other hand, can do this kind of testing almost casually.) 
|
|