A month ago, I observed that of three big magazines dedicated to literature, none had any recent discussion of AI and what it means for writers. Since then, Paul Taylor has published a piece on DeepSeek in the London Review of Books.
He mentions how the performance of AIs can be benchmarked by giving them puzzles to solve. Here is an example he provides:
In a population where red-green colour blindness (an X-linked recessive condition) affects 1 in 200 males, what is the likelihood that a phenotypically unaffected couple (male and female) will have a child with this condition? Answer choices: a) 1/50, b) 1/1000, c) 1/400, d) 1/100, e) 1/1600, f) 1/250, g) 1/600, h) 1/300, i) 1/800, j) 1/200
None of those options is correct. 1/400 comes closest, but the actual answer is 1/402. It can’t be 1/400: the overall male prevalence of 1/200 averages over sons of affected mothers (who are all affected) and sons of unaffected mothers, so the prevalence among sons of phenotypically unaffected mothers must be slightly less than 1/200, and halving that for the child’s sex gives slightly less than 1/400.
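For the curious, here is a minimal sketch of the arithmetic in Python (my own reconstruction, not from Taylor’s piece), assuming Hardy–Weinberg equilibrium so that the allele frequency q equals the 1/200 male prevalence:

    from fractions import Fraction

    q = Fraction(1, 200)  # allele frequency = male prevalence (X-linked)

    # An unaffected father passes a normal X to every daughter, so only sons
    # can be affected. Among phenotypically unaffected mothers,
    # P(carrier) = 2q(1-q) / (1 - q^2) = 2q / (1 + q).
    p_carrier_mother = 2 * q / (1 + q)

    # A carrier mother's child is a son with probability 1/2,
    # and an affected son with probability 1/2 of that.
    p_affected_child = p_carrier_mother * Fraction(1, 2) * Fraction(1, 2)

    print(p_affected_child)  # 1/402
    print(q / 2)             # 1/400 -- the naive answer, which skips the
                             # conditioning on the mother being unaffected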
An ideal AI’s answer would include several levels of nuance. For example: it’s approximately 1/400, and more precisely 1/402. In addition, the probability that a couple has a color-blind child is conditional on the number of children they have, which may be zero. When I asked GPT-4o, it reached the first two levels of nuance but missed the last point, and that may be because it’s taught to the test, so to speak. I suspect model training is susceptible to Goodhart’s Law, making the benchmarks useless after a certain point.
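On that last point, a short follow-on sketch under the same assumptions as above, with n the hypothetical number of children. Because the children share one mother, their outcomes are not independent: the probability of at least one affected child is P(mother is a carrier) × (1 − (3/4)^n), which is zero when n = 0:

    from fractions import Fraction

    p_carrier = Fraction(2, 201)  # P(mother carrier | couple unaffected), as above

    for n in range(4):  # n = number of children, possibly zero
        # Condition on the mother's carrier status first; given a carrier
        # mother, each child is affected independently with probability 1/4.
        p = p_carrier * (1 - Fraction(3, 4) ** n)
        print(n, p)  # 0 0, 1 1/402, 2 7/1608, 3 37/6432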