AI Innovation · Apr 29, 2026
Sakana's $15 papers, FutureHouse's agent stack, the ICLR peer-review breach, and where research funders are placing their bets.

Personal AI Scientists: Who's Actually Publishing—and Whether Peer Review Can Tell


On August 12, 2024, Sakana AI uploaded a preprint to arXiv. The headline finding was not the paper's subject—experiments on diffusion models and language model grokking—but its author: no human scientist had written a word. The AI Scientist, a pipeline built on Claude 3.5 Sonnet and GPT-4o, had surveyed the literature, proposed hypotheses, run experiments, drafted a paper, and revised it based on a simulated peer review. Total cost per paper: roughly fifteen dollars.

That preprint has since become the reference point for a debate that journals, funders, and governments are still unsure how to conduct. Nearly two years on, the question is no longer whether AI agents can produce publishable-looking science; they demonstrably can. The harder questions are being answered, imperfectly, in real time: what peer review certifies when reviewers cannot distinguish human from machine, how to assign credit and liability, and whether autonomous research agents accelerate or corrupt the scientific record.

What the AI Scientist Actually Produced

The paper that anchored the current debate, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (arXiv:2408.06292), listed six human authors—Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and Sakana AI CEO David Ha—and described a system operating in five phases: literature search via the Semantic Scholar API, hypothesis generation, automated code-writing and execution, result interpretation with figure generation, and full paper drafting. A simulated peer-review pass then triggered a revision cycle, all within the same pipeline.
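
The control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Sakana's code: the phase names, the `Draft` class, the `simulated_review` stub, and the `max_revisions` bound are all assumptions made for the example, and each phase is a stub that only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """State carried through the pipeline (illustrative fields only)."""
    topic: str
    log: List[str] = field(default_factory=list)
    revisions: int = 0


def make_phase(name: str) -> Callable[[Draft], Draft]:
    """Build a stub phase. A real system would call an LLM or run code here."""
    def phase(draft: Draft) -> Draft:
        draft.log.append(name)
        return draft
    return phase


# Hypothetical names for the five phases listed in the article.
PHASES = [make_phase(name) for name in (
    "literature_search",
    "hypothesis_generation",
    "code_and_execute",
    "interpret_and_plot",
    "draft_paper",
)]


def simulated_review(draft: Draft) -> bool:
    """Hypothetical reviewer stub: requests one revision, then accepts."""
    return draft.revisions >= 1


def run_pipeline(topic: str, max_revisions: int = 3) -> Draft:
    draft = Draft(topic)
    for phase in PHASES:
        draft = phase(draft)
    # Review-and-revise cycle, bounded so the loop always terminates.
    while not simulated_review(draft) and draft.revisions < max_revisions:
        draft.log.append("revise")
        draft.revisions += 1
    return draft


result = run_pipeline("grokking")
print(result.log)
```

The point of the sketch is structural: the reviewer sits inside the same loop as the generator, so nothing outside the pipeline ever checks the draft.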

Four complete research papers were generated. Topics: diffusion model theory, grokking dynamics in transformers, neural architecture search, and language model scaling. Human evaluators rating the papers blind to their origin placed them roughly at the borderline-accept threshold for major ML workshops—not landmark contributions, but not obviously below the median for workshop proceedings either. The cost figure of approximately $15 per paper, covering compute and API fees, has since become shorthand for a structural shift: the marginal cost of producing a research-shaped document is approaching zero.

The model stack is worth unpacking. Claude 3.5 Sonnet handled literature synthesis and paper writing in the primary experiments; GPT-4o served as an alternative backbone. The pipeline has no mechanism for ground-truth validation: it optimizes for the textual and structural features of a good paper (coherent narrative, plausible statistics, appropriate citation) rather than for factual correctness. This is not a design choice the Sakana team buried; they acknowledged it explicitly. But the implication grows more serious as the volume of AI-generated submissions rises and reviewers must screen those submissions at scale.

The ICLR 2025 Incident: When Peer Review Missed It

The most concrete data point in the peer-review controversy arrived quietly. A submission to the ICLR 2025 Tiny Papers track—a venue designed for short experimental work with a fast review cycle—was later identified as having been generated by an AI pipeline consistent with the AI Scientist architecture. The paper received scores of 5, 6, and 6 from three human reviewers on ICLR's 1–10 scale (6 = weak accept), placing it in the borderline-accept range before program chairs were alerted to its provenance. The submission was withdrawn before the final decisions were issued.

ICLR's program chairs issued updated guidance requiring explicit disclosure of AI generation for 2025 submissions and clarifying that undisclosed AI authorship would trigger desk rejection. The broader field followed in waves. Nature's editorial board reiterated its position that AI systems cannot be named as authors because authorship implies legal and ethical accountability that no AI system can bear. The Committee on Publication Ethics (COPE) issued updated 2025 guidance recommending that journals require authors to certify that any AI-generated content has been disclosed and that no AI tool is credited as an author. At least fourteen major publishers, among them Elsevier, Wiley, the American Chemical Society, and Springer Nature, updated submission requirements accordingly.

Note on sourcing: The ICLR 2025 Tiny Papers incident was reported across the ML community, but the submission was anonymous and withdrawn without formal proceedings. The review scores cited here reflect the most widely reported version. Program chair statements were informal and were not published in official venue documents. Some details in secondary coverage vary.

What the incident confirmed is not new in principle but newly measurable in practice: under normal review conditions, with no forensic AI-detection tools available and no flag on the submission, experienced ML researchers rated an AI-generated paper as conference-quality work. That is as much a data point about reviewers as it is about the generating system.

FutureHouse's Modular Research Stack

Where Sakana built a single end-to-end pipeline, FutureHouse has bet on modularity. The San Francisco nonprofit—led by CEO Sam Rodriques, a computational neuroscientist formerly at MIT and Stanford—describes its mission as building AI capable of running the world's laboratories rather than merely writing about them. The organization operates on philanthropic funding, including support from Eric Schmidt's science philanthropy network, and explicitly does not pursue product revenue. Its research is intended to be open-sourced.

Their most publicly benchmarked output is PaperQA2, a literature-synthesis agent that retrieves, reads, and reasons across scientific papers to answer domain-specific questions. In evaluations FutureHouse published in 2024, PaperQA2 answered a structured set of biology and chemistry literature questions with higher precision than a cohort of trained researchers given equivalent time—and crucially, grounded every claim in retrieved passages, making its errors traceable rather than opaque. When PaperQA2 is wrong, you can locate the source of the error. That traceability is a design choice that separates it from models that hallucinate citations with apparent confidence.
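
To make the traceability idea concrete, here is a minimal sketch of a data structure in which every claim carries the passages it was derived from. It is not PaperQA2's API: the `Passage` and `Claim` classes, the helper functions, and the placeholder source ids are all hypothetical, chosen only to show why a grounded answer can be audited.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier, e.g. a DOI or internal key
    text: str


@dataclass(frozen=True)
class Claim:
    statement: str
    evidence: List[Passage]


def ungrounded(claims: List[Claim]) -> List[Claim]:
    """Return the claims that cite no retrieved passage at all."""
    return [claim for claim in claims if not claim.evidence]


def sources_for(claim: Claim) -> List[str]:
    """List the source ids a reader would check when auditing a claim."""
    return [passage.source_id for passage in claim.evidence]


# Placeholder data, not real findings or real papers.
claims = [
    Claim("Statement backed by a passage.", [Passage("paper-A", "Placeholder text.")]),
    Claim("Statement with nothing behind it.", []),
]

print(sources_for(claims[0]))
print([claim.statement for claim in ungrounded(claims)])
```

With this shape, a wrong claim points to a specific passage that was misread or a specific source that was itself wrong, while a claim with an empty evidence list can be flagged before it reaches a reader.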

FutureHouse also released Aviary, a framework for training science agents in environment-grounded tasks. Rather than asking a model to describe how to run an experiment, Aviary agents execute actual computational jobs, query live databases, and receive real feedback from the environment. CROW, their agent for open-domain scientific reasoning and retrieval, operates within this ecosystem—designed to work across heterogeneous data sources including preprints, experimental databases, and structured records rather than a single curated corpus. The animating philosophy: an agent that runs real experiments and receives real feedback produces more reliable outputs than one that only writes about experiments it did not run.
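
The difference between describing an experiment and running one can be shown with a toy feedback loop. The sketch below does not use Aviary or reflect its interfaces; `SquareRootEnv` and `newton_agent` are hypothetical names, and the "experiment" is only a numeric check. It assumes a positive `target`.

```python
from typing import Tuple


class SquareRootEnv:
    """Toy environment: the 'experiment' checks a guess for sqrt(target)."""

    def __init__(self, target: float, tolerance: float = 1e-6) -> None:
        self.target = target
        self.tolerance = tolerance

    def step(self, guess: float) -> Tuple[float, bool]:
        """Run the check and return (signed error, done)."""
        error = guess * guess - self.target
        return error, abs(error) <= self.tolerance


def newton_agent(target: float, max_steps: int = 50) -> Tuple[float, int]:
    """Refine a guess using only feedback returned by the environment."""
    env = SquareRootEnv(target)
    guess = target if target > 1 else 1.0
    for steps in range(1, max_steps + 1):
        error, done = env.step(guess)
        if done:
            return guess, steps
        # Newton update x - (x^2 - t) / (2x), driven by the measured error.
        guess = guess - error / (2 * guess)
    return guess, max_steps


estimate, steps_used = newton_agent(2.0)
print(estimate, steps_used)
```

The agent never reports a value the environment has not checked, which is the property the grounded approach is meant to provide.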

Conjecture, marked clearly: FutureHouse has not published a head-to-head benchmark comparison of CROW against competing retrieval-augmented systems as of this article's research period. Specific precision figures attributed to CROW in secondary coverage should be treated cautiously pending a peer-reviewed evaluation. The description above is drawn from FutureHouse's public-facing research communications.

Sakana AI Scientist v2: What the Iteration Added

Conjecture, marked clearly: Details on the AI Scientist v2 timeline and capability improvements are drawn from Sakana AI's public communications and community reporting through mid-2025. A full peer-reviewed technical specification had not been published as of this article's research period. These details should be confirmed against Sakana's primary publications before being cited.

Sakana's v2 iteration addressed the most-criticized gaps in the original system. Most significantly: v1 was text-only and could not interpret the figures and plots that are often the primary data in an experimental paper—it inferred results from surrounding text rather than reading graphs. The upgraded system added multimodal reasoning over visual outputs. The code execution sandbox was also hardened, reducing the rate of failed experimental runs that in v1 sometimes caused the system to produce plausible-sounding results for experiments that had actually crashed. Sakana's internal quality ratings placed v2-generated papers roughly one score-point higher than v1 when evaluated against a historical ICLR rubric.

The architectural tension that v2 did not resolve: the same LLM family that generates a paper is also used to review it. Critics in the ML community have noted that this is self-evaluation in procedural clothing, not peer review. Sakana has acknowledged the limitation publicly and suggested that valid independent review would require either models that did not participate in generation or human domain experts—a more expensive and slower loop that partly undercuts the cost argument for the system.
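
The independence requirement can be stated as a simple filter: a reviewer qualifies only if it does not share a model family with any generator. The sketch below is an illustration under assumptions, not a description of Sakana's system. The model names are placeholders, and deriving the family from the text before the first hyphen is a convention invented for this example.

```python
from typing import Iterable, List


def model_family(model_name: str) -> str:
    """Crude family label: text before the first '-' (illustrative convention)."""
    return model_name.split("-", 1)[0].lower()


def independent_reviewers(generators: Iterable[str],
                          candidates: Iterable[str]) -> List[str]:
    """Keep only candidates whose family took no part in generation."""
    used = {model_family(name) for name in generators}
    return [name for name in candidates if model_family(name) not in used]


# Placeholder names, not real models.
reviewers = independent_reviewers(
    generators=["alpha-v1"],
    candidates=["alpha-v2", "beta-v1", "human-panel"],
)
print(reviewers)  # ['beta-v1', 'human-panel']
```

A filter like this is cheap to run, but every reviewer it admits adds a second model or a human expert to the loop, which is where the added cost comes from.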

How Funders and Institutions Are Responding

Research funders have moved faster on disclosure than on substance. The NIH's Office of Research Integrity issued a 2024 memo requiring grant applicants to disclose AI tool use in any deliverable derived from funded work, including manuscripts. The NSF followed with parallel language. Neither agency has taken a position on whether AI-authored sections disqualify an application, but both have made clear that undisclosed use could constitute a research integrity violation. The effect so far is administrative rather than structural: more checkbox compliance, not changed funding criteria.

Philanthropic funders are responding differently. The backing for FutureHouse represents a bet that AI-accelerated research is not a threat to the scientific record but a potential solution to its pace problem—that the bottleneck in drug discovery, materials science, and climate modeling is not ideas but the time required to test them. That thesis has some empirical grounding: Insilico Medicine, a separate company, had an AI-designed small molecule (INS018_055) in Phase 2 clinical trials for pulmonary fibrosis by mid-2024—concrete evidence that AI-assisted research can move a candidate from synthesis to human testing in a fraction of the historical timeline.

Estimate, marked clearly: FutureHouse's total philanthropic funding has not been publicly disclosed. Based on the organization's engineering headcount and comparable nonprofit AI research organizations, an informed estimate places total funding raised since founding in the $20–60M range. This figure is unverified and should not be cited as confirmed.

The Core Tension That Policy Cannot Resolve

The authorship and peer-review debate is ultimately a proxy for a harder epistemic question. Current AI systems are trained to produce text that looks like good science. They optimize for coherent narrative, appropriate citations, and statistically plausible results—the surface features peer reviewers use as quality heuristics. None of that is the same as optimizing for truth. A paper generated by an AI may be factually correct, partially correct, or confidently wrong. Peer reviewers evaluating it face the same evidence problem as always, compounded by the fact that production cost is now near zero and submission volume is rising.

FutureHouse's core bet—that grounded, environment-coupled agents running real experiments and receiving real feedback will produce more reliable science than agents that only write—is the most credible engineering response to this concern currently on offer. Whether it scales to the diversity and difficulty of actual scientific frontiers, across biology and chemistry and physics and materials, is the empirical question that the next two years will begin to answer. The stakes are high: if autonomous science agents work, they compress the timeline on every field that depends on iterative experimentation. If they merely produce a high-volume, low-reliability stream of plausible-sounding papers, they create a garbage-in problem for every downstream researcher who queries the literature—human or AI.

Frequently asked

Can an AI system be listed as a named author on a scientific paper?
Under the policies of all major scientific publishers as of 2025—including the Nature portfolio, Science, Cell, Elsevier, and Wiley—AI systems cannot be named as authors. Authorship implies legal, ethical, and intellectual accountability that AI systems cannot bear. Researchers who use AI writing or analysis tools must disclose that use in their methods or acknowledgments section and remain fully accountable for the paper's content and accuracy.
How does the quality of Sakana's AI Scientist papers compare to human-written work?
Human evaluators in Sakana's 2024 study rated AI Scientist papers as roughly comparable to borderline-accept submissions at major ML workshops: not breakthrough contributions, but not obviously below the median for workshop proceedings. The ICLR 2025 Tiny Papers incident, in which an AI-generated paper received borderline-accept scores from three human reviewers before being identified and withdrawn, suggests the quality gap is narrow for incremental ML research. The papers are weaker on genuine novelty and tend to optimize for the form of good science rather than its epistemic substance.
What distinguishes FutureHouse's approach from Sakana's AI Scientist?
Sakana built a single end-to-end pipeline that handles everything from literature review through paper drafting and simulated peer review. FutureHouse is building modular, specialized agents (PaperQA2 for literature synthesis, Aviary as a training framework for environment-grounded agents, and CROW for open-domain reasoning and retrieval) that can be composed into research workflows. FutureHouse's emphasis on grounding agents in real experimental environments (running actual code, querying live databases) is a deliberate response to the criticism that text-only pipelines optimize for plausibility rather than truth.
How are major research funders like NIH and NSF responding to AI-generated papers?
Both NIH and NSF issued 2024 guidance requiring disclosure of AI tool use in any grant deliverable, including manuscripts. Neither has banned AI-assisted research or changed funding criteria to disqualify AI-authored sections, but both have indicated that undisclosed AI use could constitute a research integrity violation. Philanthropic funders have been more enthusiastic, with organizations like FutureHouse receiving backing based on the premise that AI can meaningfully accelerate scientific discovery timelines.
What is the main unresolved problem with current autonomous research agents?
The core issue is that AI systems are trained to produce text that resembles good science—coherent narrative, appropriate citations, plausible statistics—which is not the same as being trained to produce true science. Systems like the AI Scientist optimize for the surface features peer reviewers use as heuristics, and they have no ground-truth validation mechanism. FutureHouse's environment-grounded approach, in which agents execute real experiments and receive real feedback, is the most credible engineering response on offer, but it has not yet been validated at scale across diverse research domains.

Sources & further reading

  1. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (arXiv:2408.06292)
  2. Sakana AI — The AI Scientist Project Page
  3. FutureHouse — Mission and Research Overview
  4. Nature Editorial: Tools such as ChatGPT threaten transparent science; here's what to do (January 2023)
  5. COPE Position Statement: Authorship and AI Tools in Scholarly Publishing
  6. Insilico Medicine Pipeline — INS018_055 Pulmonary Fibrosis Trial

Last reviewed Apr 29, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.