GPT-4 is as Dumb as GPT-3.5

Historical Sense
Mar 15, 2023


Well, GPT-4 came out today. Superficially, it is an impressive leap in capability over GPT-3.5. On the Uniform Bar Exam, it jumped from roughly the 10th percentile to the 90th percentile. On the AP Calculus BC exam, it soared from a 1 to a 4. On almost every test, GPT-4 improved, with a few critical exceptions.

Let me draw your attention specifically to its performance on the AP English Language and Composition Exam as well as the AP English Literature and Composition Exam.

Not only did GPT-4 fail to pass those exams, it also failed to improve on GPT-3.5’s performance.

The question is: why did it stagnate in these areas, and why does that matter? I contend that, more so than most standardized tests, the AP English exams measure an actual form of generalized intelligence: the ability to creatively analyze and insightfully interpret novel information.

Most standardized tests rely on either multiple choice or short answers. Crucially, standardized tests ask questions which already have answers.

The AP English Exam is not about measuring your ability to answer a question that has already been answered. It is not testing your knowledge of a subject matter. It asks you to look at a set of data you have probably never seen, synthesize it, take a clear position, formulate a compelling argument, choose relevant evidence in defense of that argument, and then analyze that evidence to explain how it substantiates your argument. To emphasize, because it is testing your skills in analysis, synthesis, and interpretation, it can provide you texts that have not been widely analyzed or studied.

It has been suggested that with prompt-tweaking, GPT-4 could perform better on AP English exams.

Perhaps there is some truth to that claim, but the AP exam prompts are thorough and unambiguous. If GPT-4 can’t answer those prompts, that doesn’t speak well of its generalized intelligence.

I was actually surprised by GPT-4’s scores on the AP English exams. I would have expected it to score at least a 3, as that would indicate it was at least able to construct coherent paragraphs, devise a basic thesis, and explain evidence, even if tritely. But apparently it couldn’t even do that.

There is something else I should address: why did GPT-4 perform well on subject-matter tests that contain essay-style questions similar to those on the AP English exams? For example, GPT-4 scored a 5 on the AP US History Exam, an improvement on GPT-3.5’s score of 4. Like the AP English exams, the AP US History Exam contains questions that require the test-taker to write an essay after examining a selection of evidence.

The difference is that an exam in a subject like history, political science, or law tests a student on her knowledge of a specific subject matter. Thus, the documents provided will be common documents that have been analyzed a thousand times by scholars in the field, and the student will be asked to answer a question that is vehemently contested and extensively covered in the field. Consider some of the documents from the sample AP US History Exam.

A report on the causes of the War of 1812, and a President’s annual message to Congress on internal improvements, a well-trod topic in the early Republic. Each of these documents appears in countless historical articles and books about the nineteenth-century United States. GPT-4 has voluminous examples in its training data related to these exam questions that it could paraphrase.

The difference between GPT-4’s performance on the AP English exams and on subject-matter exams such as AP US History illustrates what the skeptics have felt all along about both the grandiose claims made for AI and the disparaging claims made about the human mind. GPT-4 is great at aggregating, compositing, and paraphrasing what is in its training data. It is great at answering problems that are similar (if not identical) to those it has been trained on. But it is not able to efficiently adapt to novel problems. It does not truly “think” creatively and insightfully. And it did not improve in that regard from GPT-3.5 to GPT-4.

The claims that humans are also just aggregators, compositors, and paraphrasers are belied by the test results. Students can score 4s and 5s on the AP English exams because they are truly able to think creatively and insightfully. I’m not saying that their answers are necessarily novel in the history of the world. I’m saying that, to these students, their answers are unique: they did not simply parrot training data, because the likelihood that they had already seen the relevant texts and questions is exceedingly small.

Intelligence requires the ability to answer questions that have not yet been answered. There is little indication that “AI” is improving in that direction.
