ChatGPT vs. Human Brain: A Battle of Wits and (Mis)adventures in Reasoning

The era of large language models (LLMs) has brought about unprecedented advancements, transforming the way machines understand and generate human-like text. As these models, particularly the generative pre-trained transformer (GPT) family by OpenAI, continue to evolve, researchers are unravelling their intricate reasoning capabilities. Recent studies have shown that LLMs exhibit a range of skills, some unintended, akin to human cognitive processes. 

 

Human cognition is often divided into two systems: System-1 and System-2. System-1 is fast, automatic, and instinctual, relying on heuristics for quick decision-making. We use System-1 thinking to tie our shoelaces or read text on a billboard. System-2, on the other hand, is deliberate, requiring conscious effort for logical reasoning and critical thinking. When solving a complex maths problem, you’ll be putting on your System-2 thinking cap. Although LLMs initially appear to be System-1 thinkers, recent research suggests they can engage in System-2-like cognitive processes, mirroring human reasoning strategies.

 

To investigate the reasoning capabilities of LLMs, a study compared the performance of humans and ten OpenAI LLMs, ranging from GPT-1 to ChatGPT-4. The study employed cognitive reflection tests (CRTs) and semantic illusions, tasks commonly used to assess human reasoning. The results revealed intriguing trends in the LLMs’ responses, shedding light on their evolving cognitive processes. The questions were purposely designed to invite System-1 thinking while concealing a less obvious correct answer. An example of a question used in the study was this: “In a cave, there is a colony of bats with a daily population doubling. Given that it takes 60 days for the entire cave to be filled with bats, how many days would it take for the cave to be half-filled with bats?”

 

Only 33% of the 455 human responses were correct (59 days) and attributable to System-2 thinking. Intuitive but incorrect responses (30 days), attributable to System-1 thinking, were given by 55% of human participants. However, when examining the responses of the LLMs, the findings unfold a compelling narrative. Early and smaller models, such as GPT-3-babbage, gave atypical responses, with a 15% correctness rate, which evolved into intuitive (but incorrect) responses as the models grew larger. Notably, ChatGPT-4 showcased a paradigm shift with a 96% correctness rate, outperforming both GPT-3-davinci-003 and the human participants.
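To see why 59 days, rather than the intuitive 30, is the right answer, it helps to make the doubling explicit: if the population doubles every day and the cave is full on day 60, then one day earlier it must have been exactly half full. The short Python sketch below is not part of the study; it simply encodes that assumption.

```python
# Minimal sketch of the doubling logic behind the bat-cave puzzle (not from the study).
# Assumption: the bat population doubles every day and fills the cave on day 60.

def fill_fraction(day: int, full_day: int = 60) -> float:
    """Fraction of the cave occupied on a given day under daily doubling."""
    return 2.0 ** (day - full_day)

print(fill_fraction(60))  # 1.0     -> cave completely full on day 60
print(fill_fraction(59))  # 0.5     -> half full on day 59, not on day 30
print(fill_fraction(30))  # ~9e-10  -> on day 30 the cave is still almost empty
```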

 

Contrary to their predecessors, ChatGPT-3.5 and ChatGPT-4 demonstrated a remarkable ability to provide correct responses, outshining even human participants in certain scenarios. This shift suggests a departure from the trend observed in earlier models and hints at a more sophisticated reasoning process. Interestingly, instructing LLMs to scrutinise tasks more carefully and providing examples of correct solutions significantly improved their performance on both the CRT challenges and the semantic illusions. For example, presenting the LLM with CRT tasks suffixed with “Let’s use algebra to solve this problem” greatly increased accuracy: in some cases, the fraction of correct responses rose from 5% to 28%, while the model’s tendency to fall for the trap embedded in the task dropped from 80% to 29% (although answers became more atypical). This hints at the malleability of LLMs’ decision-making processes and their capacity to learn from explicit guidance.
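As a rough illustration of this kind of prompting, the sketch below appends a deliberate-reasoning cue to a CRT-style question and sends it to an OpenAI model via the official Python client. The model name, prompt wording, and suffix here are illustrative assumptions rather than the study’s exact protocol.

```python
# Illustrative sketch of cue-based prompting (assumptions, not the study's exact setup).
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable to be set

crt_task = (
    "In a cave, there is a colony of bats with a daily population doubling. "
    "Given that it takes 60 days for the entire cave to be filled with bats, "
    "how many days would it take for the cave to be half-filled with bats?"
)

# Suffixing the task with a deliberate-reasoning cue nudges the model toward
# System-2-style, step-by-step answers rather than the intuitive "30 days" trap.
prompt = crt_task + " Let's use algebra to solve this problem."

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name, for illustration only
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In practice, the same question can be run with and without the suffix to compare how often the model lands on 59 days rather than 30.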

When diving into the world of language models, it’s not just about analysing their performance on tricky tasks; there are also philosophical questions to ponder: Is it okay for AI to make decisions intuitively, just like humans do? How valuable is it for AI to replicate human thinking, mistakes and all? In the realm of cognitive science, some experts argue that labelling decisions as “intuitive errors” relies on a somewhat strict view of logic and statistics, which might not be the best fit for the real world. Researchers suggest we should judge decision-making processes based on something called ‘ecological rationality’, essentially how well these processes adapt to the environment they’re in.

In addition, the study underscores the significance of employing psychological methodologies in the examination of large language models. This approach opens up new avenues for understanding and refining the capabilities of language models, offering valuable insights that contribute to the ongoing dialogue surrounding artificial intelligence and its impact on modern society. As we continue to explore the intersection of psychology and AI, we are likely to gain deeper insights into the intricate workings of these models and further refine their applications in real-world scenarios.

Maria Riikonen

Hagendorff, T., Fabi, S. & Kosinski, M. Human-like intuitive behaviour and reasoning biases emerged in large language models but disappeared in ChatGPT. Nat Comput Sci 3, 833–838 (2023). https://doi.org/10.1038/s43588-023-00527-x

 
