The Black Spatula Project
WhatsApp - Discord - GitHub
Background
The Black Spatula Project is an open initiative to investigate the potential of large language models (LLMs) to identify errors in scientific papers. We seek to answer the following questions: How many errors can LLMs detect? How serious are those errors? Which model/prompt/pipeline performs the best? And ultimately, how can we use AI to improve scientific integrity?
The project was inspired by a scientific paper containing a simple math error, one that even an AI reviewer could have caught, which nevertheless led many people to toss out all of their black plastic kitchen implements. To learn more about the story of this project, please check out the initial post and latest update by Steve Newman, the initiator of the project.
If you’d like to get involved, join our currently very active WhatsApp group (for high-level discussion) and/or Discord (more focused on technical work). To contribute, check out the Ongoing Tasks section below.
Important Links
Worksheets
Other
Ongoing Tasks
- Manually test some papers, either ones you like or papers from the WithdrarXiv dataset.
- You can use whatever model you like or have access to, including but not limited to ChatGPT, GPT-o1, Gemini 2.0 Flash Thinking, and Claude 3.5.
- Please save your prompts and results in the WithdrarXiv Spreadsheet or Black Spatula Draft Analysis Prompts. This helps us track statistics and guide future directions.
- If you are a domain expert and would like to help verify LLMs’ results, please enter your name and expertise in the Volunteer Evaluators sheet.
- Figure out a way to export papers into a format that is easy for LLMs to read.
- Explore different methods for paper submission (check out #data-ingestion and the repository).
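One lightweight approach to the export task is to start from a paper's LaTeX source and strip the markup while keeping the math intact. A minimal sketch, assuming LaTeX source is available; the regexes and the function name `latex_to_plain_text` are illustrative, not part of any project pipeline:

```python
import re

def latex_to_plain_text(tex: str) -> str:
    """Roughly convert LaTeX source to plain text, keeping inline math intact."""
    # Drop comments (an unescaped % to end of line).
    tex = re.sub(r"(?<!\\)%.*", "", tex)
    # Remove figure/table environments, which rarely convert cleanly.
    tex = re.sub(r"\\begin\{(figure|table)\*?\}.*?\\end\{\1\*?\}", "", tex, flags=re.S)
    # Unwrap a few common formatting commands: \textbf{x} -> x, \section{x} -> x, etc.
    tex = re.sub(r"\\(?:textbf|textit|emph|section|subsection)\*?\{([^{}]*)\}", r"\1", tex)
    # Collapse runs of blank lines.
    tex = re.sub(r"\n{3,}", "\n\n", tex)
    return tex.strip()

paper = r"""
\section{Results}
We find $E = mc^2$ holds. % a comment
\textbf{Key point:} the mass estimate was off by a factor of ten.
"""
print(latex_to_plain_text(paper))
```

A real pipeline would need to handle many more commands (a tool like Pandoc or `pylatexenc` covers far more cases), but even a rough pass like this keeps formulas readable for the model.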
Website
- Website to post updates and important documents about the project
Resources & Ideas
Potential Paper Sources
Potential Error Types
- Math and numerical errors: data inconsistencies, calculation mistakes, etc.
- Methodology problems: problematic methods, inconsistent methods, etc.
- Writing and logical problems: incorrect interpretation of results, unsupported conclusions, etc.
- Figure and table discrepancies: labeling/formatting errors, mismatches with narrative, etc.
- Citation errors: invalid citations, missing citations, etc.
- Other minor problems: grammar issues, typos, incorrect table/figure numbers, etc.
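The taxonomy above can be dropped directly into a review prompt so the model reports findings category by category. A hedged sketch; the wording, the `ERROR_TYPES` mapping, and `build_review_prompt` are illustrative, and you would pass the result to whatever model client you use:

```python
# Error categories taken from the list above.
ERROR_TYPES = {
    "Math and numerical errors": "data inconsistencies, calculation mistakes",
    "Methodology problems": "problematic or inconsistent methods",
    "Writing and logical problems": "incorrect interpretation of results, unsupported conclusions",
    "Figure and table discrepancies": "labeling/formatting errors, mismatches with the narrative",
    "Citation errors": "invalid or missing citations",
    "Other minor problems": "grammar issues, typos, incorrect table/figure numbers",
}

def build_review_prompt(paper_text: str) -> str:
    """Assemble an error-hunting prompt that enumerates the categories above."""
    categories = "\n".join(
        f"- {name}: {examples}" for name, examples in ERROR_TYPES.items()
    )
    return (
        "Review the following paper and list every error you can find, "
        "grouped under these categories:\n"
        f"{categories}\n\n"
        "For each finding, quote the relevant passage and explain why it is wrong.\n\n"
        f"PAPER:\n{paper_text}"
    )

print(build_review_prompt("(paper text here)"))
```

Keeping the categories in one shared structure makes it easy to compare prompts across models and to tally which error types each model actually catches.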
Implementation Ideas
- Prioritize papers with LaTeX source to retain math formulas. Papers in other fields are often only available as PDFs, and PDF conversion may garble the math.
- From @mhmazur: Use TeX files; Azure’s Document Intelligence service for PDF OCR.
- From @Frecias: use Grobid to parse the text in PDFs while retaining links to references; don’t rely on it for math, tables, or figures.