Tuesday, July 10, 2007

The Hutter Prize: AI pattern recognition developed through compression algorithm tests

To make sense of the Slashdot | Text Compressor 1% Away From AI Threshold story, first read the Wikipedia article on the Hutter Prize. That's the real story.

In essence, pattern recognition (search) has always been thought to be a key skill for any mind, biologic or abiologic. Pattern recognition is also a key component of compression algorithms (most people are familiar with .zip, .tiff, .jpg, and other kinds of compressed files). The Hutter Prize funders want to advance AI development (Google preserve us from well-intentioned fools!), so they fund efforts to improve pattern recognition technology by awarding prizes for compression algorithms. (Incidentally, "prizes" as incentives were big in the 19th century and have made a comeback in the past 10 years.)

The latest attempt to win the Hutter Prize was the subject of the Slashdot article. It suggests that this particular skill is close to the theoretical optimum - at least for text:
Slashdot | Text Compressor 1% Away From AI Threshold

Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]...
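(Checking the arithmetic behind that figure: 16,481,655 bytes × 8 bits/byte = 131,853,240 bits, spread over the 100,000,000 characters of input, is about 1.3185 - the 1.319 bits per character quoted above.)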
and from a later comment suggesting the text optimum may be misleading:
... Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.

We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), then present the resulting compressed data at the end. So far this beats the output of any compression program, because people are exceedingly clever at finding patterns.

Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially - than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
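To make the comment's substitution idea concrete, here is a toy dictionary-substitution compressor in Python. This is my own sketch of the general "pattern plus lookup table" idea - the names and scheme are invented, and it is nothing like how the prize-winning paq8hp12 actually works:

from collections import Counter

def toy_compress(text):
    # Find the most frequent repeated words; each one becomes a table entry.
    words = text.split(' ')
    patterns = [w for w, n in Counter(words).most_common(3) if n > 1]
    # Substitute each pattern with a one-byte token that never occurs in
    # ordinary text. A real compressor must handle collisions and encode
    # the table compactly; this only illustrates the principle.
    table = dict(zip('\x01\x02\x03', patterns))
    for token, pattern in table.items():
        text = text.replace(pattern, token)
    return text, table

def toy_decompress(text, table):
    # Reverse the substitutions using the lookup table.
    for token, pattern in table.items():
        text = text.replace(token, pattern)
    return text

original = "the cat sat on the mat because the cat liked the mat"
compressed, table = toy_compress(original)
assert toy_decompress(compressed, table) == original
print(len(original), len(compressed))  # 52 vs. 36, before counting the table

And to see how much mileage "stupid" pattern-finding buys, even an order-0 model - single-character frequencies, no context at all - already lands well under the raw 8 bits per character on English text; the 1.319 figure above comes from far stronger context modeling. A minimal entropy estimator (again, my own sketch, not the contest's methodology):

import math
from collections import Counter

def order0_bits_per_char(text):
    # Shannon entropy H = -sum(p * log2(p)) over single-character frequencies.
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())

sample = "the quick brown fox jumps over the lazy dog"
print(round(order0_bits_per_char(sample), 2))  # about 4.5 bits/char on this sample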
