The Black Spatula Project
WhatsApp - Discord - GitHub
Background
The Black Spatula Project is an open initiative to investigate the potential of large language models (LLMs) to identify errors in scientific papers. We seek to answer the following questions: How many errors can LLMs detect? How serious are those errors? Which model/prompt/pipeline performs the best? And ultimately, how can we use AI to improve scientific integrity?
The project was inspired by a scientific paper containing a simple math error, one that even an AI reviewer could have caught, which nevertheless led many people to toss out all of their black plastic kitchen implements. To learn more about the story of this project, please check out the initial post and latest update by Steve Newman, the initiator of the project.
If you’d like to get involved, join our currently very active WhatsApp group (for high-level discussion) and/or Discord (more focused on technical work). To contribute, check out the Ongoing Tasks section below.
Important Links
Worksheets
Other
Ongoing Tasks
- Manually test some papers, either ones you like or papers from the WithdrarXiv dataset.
- You can use whatever model you like or have access to, including but not limited to ChatGPT, GPT-o1, Gemini 2.0 Flash Thinking, and Claude 3.5.
- Please save your prompts and results in the WithdrarXiv Spreadsheet or Black Spatula Draft Analysis Prompts. This helps us track statistics and guide future directions.
- If you are a domain expert and would like to help verify LLMs’ results, please enter your name and expertise in the Volunteer Evaluators sheet.
- Figure out a way to export papers into a format that is easy for LLMs to read.
- Explore different methods for paper submission (check out #data-ingestion and the repository).
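One lightweight approach to the export task is to start from a paper's LaTeX source and strip the markup while keeping the math intact. A minimal sketch, assuming LaTeX source is available; the regexes and the function name `latex_to_plain_text` are illustrative, not part of any project pipeline:

```python
import re

def latex_to_plain_text(tex: str) -> str:
    """Roughly convert LaTeX source to plain text, keeping inline math intact."""
    # Drop comments (an unescaped % to end of line).
    tex = re.sub(r"(?<!\\)%.*", "", tex)
    # Remove figure/table environments, which rarely convert cleanly.
    tex = re.sub(r"\\begin\{(figure|table)\*?\}.*?\\end\{\1\*?\}", "", tex, flags=re.S)
    # Unwrap a few common formatting commands: \textbf{x} -> x, \section{x} -> x, etc.
    tex = re.sub(r"\\(?:textbf|textit|emph|section|subsection)\*?\{([^{}]*)\}", r"\1", tex)
    # Collapse runs of blank lines.
    tex = re.sub(r"\n{3,}", "\n\n", tex)
    return tex.strip()

paper = r"""
\section{Results}
We find $E = mc^2$ holds. % a comment
\textbf{Key point:} the mass estimate was off by a factor of ten.
"""
print(latex_to_plain_text(paper))
```

A real pipeline would need to handle many more commands (a tool like Pandoc or `pylatexenc` covers far more cases), but even a rough pass like this keeps formulas readable for the model.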
Website
- Website to post updates and important documents about the project
Resources & Ideas
Potential Paper Sources
Potential Error Types
- Math and numerical errors: data inconsistencies, calculation mistakes, etc.
- Methodology problems: problematic methods, inconsistent methods, etc.
- Writing and logical problems: incorrect interpretation of results, unsupported conclusions, etc.
- Figure and table discrepancies: labeling/formatting errors, mismatches with narrative, etc.
- Citation errors: invalid citations, missing citations, etc.
- Other minor problems: grammar issues, typos, incorrect table/figure numbers, etc.
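The taxonomy above can be dropped directly into a review prompt so the model reports findings category by category. A hedged sketch; the wording, the `ERROR_TYPES` mapping, and `build_review_prompt` are illustrative, and you would pass the result to whatever model client you use:

```python
# Error categories taken from the list above.
ERROR_TYPES = {
    "Math and numerical errors": "data inconsistencies, calculation mistakes",
    "Methodology problems": "problematic or inconsistent methods",
    "Writing and logical problems": "incorrect interpretation of results, unsupported conclusions",
    "Figure and table discrepancies": "labeling/formatting errors, mismatches with the narrative",
    "Citation errors": "invalid or missing citations",
    "Other minor problems": "grammar issues, typos, incorrect table/figure numbers",
}

def build_review_prompt(paper_text: str) -> str:
    """Assemble an error-hunting prompt that enumerates the categories above."""
    categories = "\n".join(
        f"- {name}: {examples}" for name, examples in ERROR_TYPES.items()
    )
    return (
        "Review the following paper and list every error you can find, "
        "grouped under these categories:\n"
        f"{categories}\n\n"
        "For each finding, quote the relevant passage and explain why it is wrong.\n\n"
        f"PAPER:\n{paper_text}"
    )

print(build_review_prompt("(paper text here)"))
```

Keeping the categories in one shared structure makes it easy to compare prompts across models and to tally which error types each model actually catches.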
Implementation Ideas
- Prioritize papers with LaTeX source to retain math formulas. Papers in other fields are often only available as PDFs, and PDF conversion may garble the math.
- From @mhmazur: Use TeX files; Azure’s Document Intelligence service for PDF OCR.
- From @Frecias: use Grobid to parse the text in PDFs while retaining links to references; don’t rely on it for math, tables, or figures.