Leveraging LLM feedback to enhance review quality
(This post is written by James Zou, Associate Program Chair for ICLR 2025, and Nitya Thakkar)
Peer review is a key element of research and innovation. However, it faces growing strain from the rapidly rising volume of paper submissions, particularly at AI conferences, and authors increasingly express dissatisfaction with low-quality reviews. This has spurred interest in how large language models (LLMs) can enhance the quality and usefulness of peer reviews for authors.
At ICLR 2025, we piloted an AI agent that leveraged multiple LLMs to provide reviewers with optional feedback on their reviews. The feedback agent was optimized to give suggestions that made reviews more informative, clear, and actionable. We also implemented multiple LLM-based guardrails, called reliability tests, which evaluated specific attributes of the AI feedback before it was posted. In total, the agent posted feedback on 18,946 reviews in the randomly selected feedback group (see Figure 1A). The system never changed a review directly: during the review period, reviewers could ignore the LLM feedback (Not updated) or revise their review in response (Updated). See our previous blog post, Assisting ICLR 2025 reviewers with feedback, for more details about the IRB-approved study setup.
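For readers curious about the general shape of such a system, the following is a minimal sketch, not the ICLR implementation: one LLM call drafts feedback on a review, and separate LLM-based reliability checks must all pass before the feedback is posted. The `call_llm` helper, the prompts, and the checks shown here are illustrative assumptions, not the prompts or models used by the actual agent.

```python
# Hypothetical sketch of an LLM feedback pipeline with guardrails.
# `call_llm` is an assumed helper that sends a prompt to some LLM API
# and returns the response text; it is not part of the ICLR system.
from typing import Callable, Optional

RELIABILITY_CHECKS = [
    # Each check asks an LLM to verify one attribute of the draft feedback.
    "Is this feedback specific and actionable for the reviewer? Answer YES or NO.",
    "Is this feedback respectful and free of judgments about the paper's merit? Answer YES or NO.",
]

def generate_review_feedback(review_text: str,
                             call_llm: Callable[[str], str]) -> Optional[str]:
    """Draft feedback on a review, then return it only if all checks pass."""
    draft = call_llm(
        "Suggest how to make the following peer review more informative, "
        f"clear, and actionable:\n\n{review_text}"
    )
    for check in RELIABILITY_CHECKS:
        verdict = call_llm(f"{check}\n\nFeedback:\n{draft}")
        if not verdict.strip().upper().startswith("YES"):
            return None  # a guardrail failed, so the feedback is withheld
    return draft
```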
Key Findings
Figure 1: (A) Among all ICLR 2025 reviews, 22,467 were randomly selected to receive feedback (feedback group) and 22,364 were randomly selected not to receive feedback (control group). Among reviewers who received feedback, 26.6% updated their reviews, incorporating a total of 12,222 feedback items. (B) (Top) Most reviews were submitted 2-3 days before the ICLR review deadline (November 4, 2024). (Bottom) Reviewers who received feedback were much more likely to update their reviews than those in the control group. Across both groups, reviewers were more likely to update their review if they submitted it early relative to the deadline.
Our findings reveal several significant impacts of this LLM-based system:
- Reviewers incorporated 12,222 specific suggestions from the feedback agent into their reviews, indicating that many reviewers found the LLM recommendations helpful. In total, 26.6% of reviewers who received feedback updated their reviews (Figure 1A).
- In a blinded preference assessment, machine learning researchers found that LLM feedback improved review quality in 89% of the cases.
- Reviewers who updated their reviews after receiving LLM feedback increased review length by an average of 80 words, suggesting more detailed reviews.
- LLM feedback led to more engaged discussions during the rebuttal period, evidenced by longer author rebuttals and reviewer responses (see Table 1). This could be due to clearer and more actionable reviews resulting from the feedback, leading to more productive rebuttals.
- There was no statistically significant difference in the acceptance outcomes of the final papers between the feedback and control groups. This is consistent with the feedback agent’s goal of enhancing the author-reviewer discussion rather than advocating for or criticizing the paper.
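To make the last point concrete, comparing acceptance outcomes between two randomized groups can be done with a simple contingency-table test. The sketch below is a hypothetical illustration with placeholder counts, not the actual ICLR numbers or the paper's analysis code.

```python
# Minimal sketch (not the study's analysis) of comparing acceptance
# rates between the feedback and control groups with a chi-squared test.
from scipy.stats import chi2_contingency

# Placeholder counts: replace with the real numbers of accepted and
# rejected papers in each group.
accepted_feedback, rejected_feedback = 1000, 3000  # hypothetical
accepted_control, rejected_control = 1000, 3000    # hypothetical

table = [
    [accepted_feedback, rejected_feedback],
    [accepted_control, rejected_control],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
# A p-value above the chosen threshold (e.g. 0.05) is consistent with
# no statistically significant difference in acceptance outcomes.
```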
Table 1: Average rebuttal and reply lengths in the control and feedback groups, and among reviewers who did or did not update their review after receiving feedback. Being selected to receive feedback causally increased the length of author rebuttals by an average of 48 words (***p ≤ 0.001). Reviewer replies to author rebuttals were also longer, on average, in the feedback group (***p ≤ 0.001). Note that the feedback group includes reviews for which the feedback was ignored, which can dilute the effect size.
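The word-count effects summarized in Table 1 amount to comparing mean lengths between randomized groups. The sketch below shows one standard way to run such a comparison, Welch's t-test; the arrays are placeholders rather than the study's data, and the paper's exact estimation procedure may differ.

```python
# Minimal sketch (assumed setup, not the study's analysis) of testing
# whether mean rebuttal length differs between the two groups.
import numpy as np
from scipy.stats import ttest_ind

# Placeholder word counts per author rebuttal; replace with real data.
rebuttal_len_feedback = np.array([512, 730, 640])  # hypothetical
rebuttal_len_control = np.array([480, 610, 590])   # hypothetical

# Welch's t-test does not assume equal variances across groups.
t_stat, p_value = ttest_ind(rebuttal_len_feedback, rebuttal_len_control,
                            equal_var=False)
diff = rebuttal_len_feedback.mean() - rebuttal_len_control.mean()
print(f"mean difference = {diff:.1f} words, t = {t_stat:.2f}, p = {p_value:.4f}")
```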
Our large randomized controlled study highlights the potential of a carefully designed LLM-based system to enhance peer review quality at scale. By providing targeted feedback to reviewers at ICLR 2025 and letting them choose how to incorporate it, we observed improvements in review specificity, engagement, and actionability. We provide a more detailed analysis and a discussion of the limitations of LLM feedback in our paper. As LLM capabilities continue to advance, more research and rigorous assessments are needed to understand how AI can responsibly enhance peer review.
Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, James Zou (Associate Program Chair)
The Review Feedback Agent Team
Carl Vondrick, Rose Yu, Violet Peng, Fei Sha, Animesh Garg
ICLR 2025 Program Chairs