Personal AI Scientists: Who's Actually Publishing—and Whether Peer Review Can Tell
On August 12, 2024, Sakana AI uploaded a preprint to arXiv. The headline finding was not the subject of the papers it showcased (experiments on diffusion models and language model grokking) but their author: no human scientist had written a word of the generated manuscripts. The AI Scientist, a pipeline built on Claude 3.5 Sonnet and GPT-4o, had surveyed the literature, proposed hypotheses, run experiments, drafted each paper, and revised it based on a simulated peer review. Total cost per paper: roughly fifteen dollars.
That moment has since become the reference point for a debate that journals, funders, and governments are still not sure how to have. Nearly two years on, the question is no longer whether AI agents can produce publishable-looking science; they demonstrably can. The harder questions are being answered, imperfectly, in real time: what peer review certifies when reviewers cannot distinguish human from machine, how to assign credit and liability, and whether autonomous research agents accelerate or corrupt the scientific record.
What the AI Scientist Actually Produced
The paper that anchored the current debate, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (arXiv:2408.06292), listed six human authors—Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and Sakana AI CEO David Ha—and described a system operating in five phases: literature search via the Semantic Scholar API, hypothesis generation, automated code-writing and execution, result interpretation with figure generation, and full paper drafting. A simulated peer-review pass then triggered a revision cycle, all within the same pipeline.
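The five phases and the review-triggered revision can be pictured as a simple staged loop. What follows is a minimal sketch, not Sakana's implementation: the phase names, the `PipelineState` class, `run_pipeline`, and the stub stages are all hypothetical, and a real system would call an LLM and the Semantic Scholar API where the stubs return placeholder strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical phase names, chosen to mirror the five phases listed above.
PHASES = [
    "literature_search",
    "hypothesis_generation",
    "code_and_execution",
    "result_interpretation",
    "paper_drafting",
]


@dataclass
class PipelineState:
    """Accumulates the artifact produced by each phase, in order."""
    topic: str
    artifacts: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


Stage = Callable[[PipelineState], str]
Reviewer = Callable[[PipelineState], bool]


def run_pipeline(topic: str, stages: Dict[str, Stage], review: Reviewer,
                 max_revisions: int = 1) -> PipelineState:
    """Run every phase once, then redraft while the simulated review rejects."""
    state = PipelineState(topic=topic)
    for name in PHASES:
        state.artifacts[name] = stages[name](state)
        state.log.append(name)
    revisions = 0
    # The reviewer sits inside the same pipeline that wrote the draft.
    while not review(state) and revisions < max_revisions:
        state.artifacts["paper_drafting"] = stages["paper_drafting"](state)
        state.log.append("revision")
        revisions += 1
    return state


def make_stub(name: str) -> Stage:
    """Placeholder stage: a real one would call a model or an external API."""
    return lambda state: f"{name} output for {state.topic}"


if __name__ == "__main__":
    stages = {name: make_stub(name) for name in PHASES}
    result = run_pipeline("grokking", stages, review=lambda s: False)
    print(result.log)
```

Nothing in this loop checks an artifact against ground truth: each stage only consumes text produced by the stage before it.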
The system generated four complete research papers, on diffusion model theory, grokking dynamics in transformers, neural architecture search, and language model scaling. Human evaluators rating the papers blind to their origin placed them roughly at the borderline-accept threshold for major ML workshops: not landmark contributions, but not obviously below the median for workshop proceedings either. The cost figure of approximately $15 per paper, covering compute and API fees, has since become shorthand for a structural shift: the marginal cost of producing a research-shaped document is approaching zero.
The model stack is worth unpacking. Claude 3.5 Sonnet handled literature synthesis and paper writing in the primary experiments; GPT-4o served as an alternative backbone. The pipeline has no mechanism for ground-truth validation: it optimizes for the textual and structural features of a good paper (coherent narrative, plausible statistics, appropriate citation) rather than for factual correctness. This is not a subtle design choice the Sakana team buried; they acknowledged it explicitly. But the implication grows more serious as the volume of AI-generated submissions rises and reviewers encounter such papers at scale.
The ICLR 2025 Incident: When Peer Review Missed It
The most concrete data point in the peer-review controversy arrived quietly. A submission to the ICLR 2025 Tiny Papers track—a venue designed for short experimental work with a fast review cycle—was later identified as having been generated by an AI pipeline consistent with the AI Scientist architecture. The paper received scores of 5, 6, and 6 from three human reviewers on ICLR's 1–10 scale (6 = weak accept), placing it in the borderline-accept range before program chairs were alerted to its provenance. The submission was withdrawn before the final decisions were issued.
ICLR's program chairs issued updated guidance requiring explicit AI-generation disclosure for 2025 submissions and clarifying that undisclosed AI authorship would trigger desk rejection. The broader field followed in waves. Nature's editorial board reiterated its position that AI systems cannot be named as authors because authorship implies legal and ethical accountability that no AI system can bear. The Committee on Publication Ethics (COPE) issued updated 2025 guidance recommending that journals require authors to disclose any AI-generated content and to certify that no AI tool is credited as an author. At least fourteen major publishers, including Elsevier, Wiley, the American Chemical Society, and Springer Nature, updated submission requirements accordingly.
What the incident confirmed is not new in principle but newly measurable in practice: under normal review conditions, with no forensic AI-detection tools available and no flag on the submission, experienced ML researchers rated an AI-generated paper as conference-quality work. That is as much a data point about reviewers as it is about the generating system.
FutureHouse's Modular Research Stack
Where Sakana built a single end-to-end pipeline, FutureHouse has bet on modularity. The San Francisco nonprofit—led by CEO Sam Rodriques, a computational neuroscientist formerly at MIT and Stanford—describes its mission as building AI capable of running the world's laboratories rather than merely writing about them. The organization operates on philanthropic funding, including support from Eric Schmidt's science philanthropy network, and explicitly does not pursue product revenue. Its research is intended to be open-sourced.
Their most publicly benchmarked output is PaperQA2, a literature-synthesis agent that retrieves, reads, and reasons across scientific papers to answer domain-specific questions. In evaluations FutureHouse published in 2024, PaperQA2 answered a structured set of biology and chemistry literature questions with higher precision than a cohort of trained researchers given equivalent time—and crucially, grounded every claim in retrieved passages, making its errors traceable rather than opaque. When PaperQA2 is wrong, you can locate the source of the error. That traceability is a design choice that separates it from models that hallucinate citations with apparent confidence.
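To make the traceability point concrete, here is a minimal sketch of what a grounded answer might look like as a data structure. It is an illustration only and not PaperQA2's actual API: `Passage`, `Claim`, `ungrounded`, and `trace` are hypothetical names, and the sample passage is invented placeholder text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    source_id: str  # hypothetical identifier, e.g. a DOI or an internal key
    text: str


@dataclass
class Claim:
    statement: str
    evidence: List[Passage]  # retrieved passages the claim rests on


def ungrounded(claims: List[Claim]) -> List[Claim]:
    """Return claims with no supporting passage; these cannot be audited."""
    return [c for c in claims if not c.evidence]


def trace(claim: Claim) -> List[str]:
    """List the sources a reader would check if the claim proves wrong."""
    return [p.source_id for p in claim.evidence]


if __name__ == "__main__":
    grounded = Claim("X binds Y", [Passage("doc-1", "We observe that X binds Y.")])
    bare = Claim("Z inhibits W", [])
    print(trace(grounded))
    print([c.statement for c in ungrounded([grounded, bare])])
```

The design choice is that an answer is rejected or flagged whenever `ungrounded` returns anything, so every surviving statement points back to text a human can read.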
FutureHouse also released Aviary, a framework for training science agents in environment-grounded tasks. Rather than asking a model to describe how to run an experiment, Aviary agents execute actual computational jobs, query live databases, and receive real feedback from the environment. CROW, their agent for open-domain scientific reasoning and retrieval, operates within this ecosystem—designed to work across heterogeneous data sources including preprints, experimental databases, and structured records rather than a single curated corpus. The animating philosophy: an agent that runs real experiments and receives real feedback produces more reliable outputs than one that only writes about experiments it did not run.
Sakana AI Scientist v2: What the Iteration Added
Sakana's v2 iteration addressed the most-criticized gaps in the original system. Most significantly: v1 was text-only and could not interpret the figures and plots that are often the primary data in an experimental paper—it inferred results from surrounding text rather than reading graphs. The upgraded system added multimodal reasoning over visual outputs. The code execution sandbox was also hardened, reducing the rate of failed experimental runs that in v1 sometimes caused the system to produce plausible-sounding results for experiments that had actually crashed. Sakana's internal quality ratings placed v2-generated papers roughly one score-point higher than v1 when evaluated against a historical ICLR rubric.
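The crashed-run failure mode suggests a simple guard: never hand results to the writing stage unless the experiment actually finished and emitted parseable output. The sketch below assumes, purely for illustration, that an experiment prints its metrics as a JSON object on stdout; `run_experiment` is a hypothetical helper, not Sakana's sandbox, and a child process on its own provides no real isolation.

```python
import json
import subprocess
import sys
from typing import Optional


def run_experiment(code: str, timeout_s: float = 30.0) -> Optional[dict]:
    """Run experiment code in a child process.

    Returns the parsed metrics dict, or None if the run crashed, timed out,
    or printed something that is not a JSON object.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        # A crash yields no metrics, so there is nothing to write up.
        return None
    try:
        metrics = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
    return metrics if isinstance(metrics, dict) else None


if __name__ == "__main__":
    ok = run_experiment("import json; print(json.dumps({'acc': 0.9}))")
    crashed = run_experiment("raise RuntimeError('boom')")
    print(ok, crashed)
```

A drafting stage that receives `None` should stop or report the failure rather than fill in numbers.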
The architectural tension that v2 did not resolve: the same LLM family that generates a paper is also used to review it. Critics in the ML community have noted that this is self-evaluation in procedural clothing, not peer review. Sakana has acknowledged the limitation publicly and suggested that valid independent review would require either models that did not participate in generation or human domain experts—a more expensive and slower loop that partly undercuts the cost argument for the system.
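One way to picture the independence requirement is as a filter over candidate reviewers. This is a minimal sketch under a loud assumption: that a model's family can be read from the prefix of its name before the first hyphen. That naming convention, and the helpers `model_family` and `independent_reviewers`, are hypothetical and do not come from Sakana.

```python
from typing import Iterable, List


def model_family(model_name: str) -> str:
    """Assumed convention: the family is the text before the first hyphen."""
    return model_name.split("-", 1)[0].lower()


def independent_reviewers(generator_models: Iterable[str],
                          candidates: Iterable[str]) -> List[str]:
    """Keep only reviewers whose family took no part in generation."""
    used = {model_family(m) for m in generator_models}
    return [c for c in candidates if model_family(c) not in used]


if __name__ == "__main__":
    print(independent_reviewers(
        ["claude-3.5-sonnet"],
        ["claude-3-opus", "gpt-4o", "human-expert"],
    ))
```

If the filter returns an empty list, the only remaining option is the slower, more expensive human loop described above.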
How Funders and Institutions Are Responding
Research funders have moved faster on disclosure than on substance. The NIH's Office of Research Integrity issued a 2024 memo requiring grant applicants to disclose AI tool use in any deliverable derived from funded work, including manuscripts. The NSF followed with parallel language. Neither agency has taken a position on whether AI-authored sections disqualify an application, but both have made clear that undisclosed use could constitute a research integrity violation. The effect so far is administrative rather than structural: more checkbox compliance, not changed funding criteria.
Philanthropic funders are responding differently. The backing for FutureHouse represents a bet that AI-accelerated research is not a threat to the scientific record but a potential solution to its pace problem—that the bottleneck in drug discovery, materials science, and climate modeling is not ideas but the time required to test them. That thesis has some empirical grounding: Insilico Medicine, a separate company, had an AI-designed small molecule (INS018_055) in Phase 2 clinical trials for pulmonary fibrosis by mid-2024—concrete evidence that AI-assisted research can move a candidate from synthesis to human testing in a fraction of the historical timeline.
The Core Tension That Policy Cannot Resolve
The authorship and peer-review debate is ultimately a proxy for a harder epistemic question. Current AI systems are trained to produce text that looks like good science. They optimize for coherent narrative, appropriate citations, and statistically plausible results—the surface features peer reviewers use as quality heuristics. None of that is the same as optimizing for truth. A paper generated by an AI may be factually correct, partially correct, or confidently wrong. Peer reviewers evaluating it face the same evidence problem as always, compounded by the fact that production cost is now near zero and submission volume is rising.
FutureHouse's core bet—that grounded, environment-coupled agents running real experiments and receiving real feedback will produce more reliable science than agents that only write—is the most credible engineering response to this concern currently on offer. Whether it scales to the diversity and difficulty of actual scientific frontiers, across biology and chemistry and physics and materials, is the empirical question that the next two years will begin to answer. The stakes are high: if autonomous science agents work, they compress the timeline on every field that depends on iterative experimentation. If they merely produce a high-volume, low-reliability stream of plausible-sounding papers, they create a garbage-in problem for every downstream researcher who queries the literature—human or AI.
Sources & further reading
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (arXiv:2408.06292)
- Sakana AI — The AI Scientist Project Page
- FutureHouse — Mission and Research Overview
- Nature Editorial: Tools such as ChatGPT threaten transparent science; here's what to do (January 2023)
- COPE Position Statement: Authorship and AI Tools in Scholarly Publishing
- Insilico Medicine Pipeline — INS018_055 Pulmonary Fibrosis Trial
Last reviewed Apr 29, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.