
Notes on SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

This is a summary of an important research paper that provides a 28:1 time savings. It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.


Link to paper: https://arxiv.org/abs/2307.10635

Paper published on: 2023-07-20

Paper's authors: Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang

GPT3 API Cost: $0.06

GPT4 API Cost: $0.13

Total Cost To Write This: $0.19

Time Savings: 28:1

The ELI5 TLDR:

This research paper introduces a benchmark suite called SCIBENCH, which is used to test how well large language models (LLMs) can solve college-level scientific problems. SCIBENCH includes two datasets, one with problems from college textbooks and another with exam questions. The paper evaluates the performance of two LLMs, GPT-3.5 and GPT-4, using these datasets. The results show that the LLMs perform better when given specific prompts and external tools. The paper also analyzes the errors made by the LLMs and identifies areas where they struggle, such as assumptions and code conversion. The findings of this research could lead to improvements in LLMs and have applications in education, research, and industry.

The Deeper Dive:

Summary and Novel Aspects of the Research

This research paper introduces a benchmark suite named SCIBENCH, designed to evaluate the reasoning capabilities of large language models (LLMs) on college-level scientific problem-solving. It claims to offer a novel approach to understanding the abilities of LLMs in the scientific domain.

SCIBENCH comprises two datasets: an open set, with problems drawn from college-level textbooks in mathematics, chemistry, and physics, and a closed set, with problems from undergraduate exams in computer science and mathematics. The paper further provides an in-depth analysis of the performance of two LLMs, GPT-3.5 and GPT-4, on these datasets.

For instance, consider Planck's law, B(λ, T) = (2hc^2 / λ^5) · 1 / (e^(hc / (λ k_B T)) − 1), which gives the spectral radiance of a black body at a given wavelength (λ) and temperature (T). In a practical scenario, the research evaluates how well an LLM can use this formula to find the value of B at specific wavelengths and temperatures, such as B(450 nm, 298 K) and B(700 nm, 298 K).
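As a rough illustration (not code from the paper), Planck's law can be evaluated numerically for the two cases mentioned above; the constant values below are standard approximations:

```python
import math

# Physical constants in SI units (approximate CODATA values).
h = 6.626e-34    # Planck's constant, J*s
c = 2.998e8      # speed of light, m/s
k_B = 1.381e-23  # Boltzmann's constant, J/K

def spectral_radiance(wavelength_m, temperature_k):
    """Planck's law: spectral radiance B(lambda, T) of a black body."""
    prefactor = 2 * h * c**2 / wavelength_m**5
    exponent = h * c / (wavelength_m * k_B * temperature_k)
    return prefactor / (math.exp(exponent) - 1)

# The two evaluations mentioned above.
b_450 = spectral_radiance(450e-9, 298)  # B(450 nm, 298 K)
b_700 = spectral_radiance(700e-9, 298)  # B(700 nm, 298 K)
```

At room temperature both values are tiny, with the 700 nm radiance larger than the 450 nm one, since a 298 K black body peaks far into the infrared; multi-step evaluations like this are exactly what the benchmark asks the models to carry out.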

Understanding SCIBENCH

SCIBENCH is a unique benchmark suite comprising an open dataset of 695 problems from college textbooks and a closed dataset of midterm and final exam questions. The problems are open-ended and require multiple steps of reasoning and complex arithmetic operations.

For example, the ratio of u(λ2, T) to u(λ1, T) is given by (lambda1 / lambda2)**5 * (math.exp((h * c) / (lambda1 * k * T)) - 1) / (math.exp((h * c) / (lambda2 * k * T)) - 1). This formula calculates the ratio of the radiation energy density at two wavelengths for a given temperature. The variables lambda1 and lambda2 represent the two wavelengths, while T, h, c, and k represent temperature, Planck's constant, the speed of light, and Boltzmann's constant, respectively.
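That expression translates directly into Python; the sketch below is illustrative rather than the paper's own code:

```python
import math

# Physical constants in SI units (approximate values).
h = 6.626e-34   # Planck's constant, J*s
c = 2.998e8     # speed of light, m/s
k = 1.381e-23   # Boltzmann's constant, J/K

def energy_density_ratio(lambda1, lambda2, T):
    """Ratio u(lambda2, T) / u(lambda1, T) from Planck's energy-density law."""
    prefactor = (lambda1 / lambda2) ** 5
    numerator = math.exp((h * c) / (lambda1 * k * T)) - 1
    denominator = math.exp((h * c) / (lambda2 * k * T)) - 1
    return prefactor * numerator / denominator
```

A quick sanity check on the form of the expression: the ratio is exactly 1 when the two wavelengths coincide, and at room temperature the energy density at a longer wavelength exceeds that at a shorter one, so the ratio for lambda2 > lambda1 comes out greater than 1.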

Evaluating LLMs with SCIBENCH

The paper presents a detailed evaluation of two representative LLMs, GPT-3.5 and GPT-4, using the SCIBENCH datasets. The evaluation process includes various prompting strategies and the use of external tools. For example, the researchers used chain-of-thought (CoT) prompting, which encourages LLMs to generate detailed solution steps. They also tried prompting the models to use external tools like Python.
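A hypothetical sketch of what these two prompting strategies might look like (the paper's exact prompts are not reproduced here; the function names and wording below are illustrative):

```python
def cot_prompt(problem: str) -> str:
    """Chain-of-thought prompting: ask the model to reason step by step."""
    return (
        "Solve the following college-level science problem. "
        "Think through the solution step by step, then state the final answer.\n\n"
        f"Problem: {problem}\nSolution:"
    )

def tool_prompt(problem: str) -> str:
    """Tool-augmented prompting: ask the model to answer with runnable Python."""
    return (
        "Solve the following problem by writing a Python program "
        "whose printed output is the final numeric answer.\n\n"
        f"Problem: {problem}\nPython program:"
    )
```

The difference between the two is where the arithmetic happens: CoT asks the model to carry out the calculation in text, while the tool-use variant delegates it to an interpreter, which sidesteps the arithmetic errors the paper later catalogs.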

The results showed that the baseline LLMs had low accuracy scores on the open textbook dataset, but performance improved with CoT prompting and external tools. For instance, GPT-4 outperformed GPT-3.5 across all experimental settings on the textbook dataset, with the largest improvements in few-shot learning with CoT prompting and Python as an external tool.

Error Analysis and Problem-Solving Skills

The paper categorizes the errors made by LLMs into ten problem-solving abilities through a user study. This analysis is crucial to understand the limitations and potential improvements in the problem-solving capabilities of LLMs. For instance, the paper identifies "Identification of Assumptions" as an error reason when the model used the ideal gas law without information about the temperature of the air.

Similarly, "Code Conversion Skills" was identified as an error reason when the model's solution contained a syntax error in its Wolfram Language code, causing the program to terminate prematurely. Another error reason was "Spatial Perception", identified when the model's solution was incomplete, providing only equations with no visual representation.

Future Implications and Applications

The findings of this research could potentially drive further developments in the reasoning abilities of LLMs and contribute to scientific research and discovery. For instance, by understanding the specific areas where LLMs struggle, developers can focus on enhancing these areas, thereby improving the overall capabilities of these models.

The research also highlights the need for future research to enhance the problem-solving capabilities of LLMs in scientific domains. This could lead to the development of more advanced LLMs capable of solving complex scientific problems, which could be a game-changer in various fields, including education, research, and industry.

For example, in the field of education, these advanced LLMs could be used to develop intelligent tutoring systems capable of providing personalized learning experiences to students. In the field of research, these LLMs could assist researchers in solving complex scientific problems, thereby accelerating scientific discovery. And in the industry, these LLMs could be used to develop advanced AI-powered tools and applications that can solve complex problems in various domains, such as healthcare, finance, and energy.