A New Conference Presented Scientific Papers Written and Reviewed by A.I. as an Experiment. Here's What Happened

While some researchers note the models made tasks more efficient, many scientists remain skeptical about using A.I. to author scientific work

Sara Hashemi - Daily Correspondent

Should scientists use artificial intelligence to help conduct research or write up the results? Most publishers prohibit listing A.I. as an author in a study; conferences don’t allow speakers to use A.I. tools.

But on October 22, an experimental—and controversial—conference flipped those rules: All of the work shared at the meeting was prepared and reviewed by A.I. Even if humans were involved with the papers, the A.I. models had to do enough to be considered similar to the lead author of the study.

“We’re seeing this interesting paradigm shift,” says James Zou, a computer scientist at Stanford University who co-organized the Agents4Science conference, to Kathryn Hulick at Science News. “People are starting to explore using A.I. as a co-scientist.”

The endeavor was an experiment in itself. Anyone can take a look at the submissions online and see how both A.I. and human reviewers assessed the work. “Those of us in the A.I. world need to do a better job at understanding what the strengths and weaknesses are of using systems in this way,” computer scientist Margaret Mitchell, who studies A.I. ethics at Hugging Face, tells Elizabeth Gibney at Nature. That matters especially because the technology can come up with false positive results, she adds. “How to evaluate A.I. agents at all is an open research area.”

The virtual event garnered 1,800 registrations, reports Jeffrey Brainard at Science. The conference received 315 A.I. submissions, which were then assessed by a panel of A.I. reviewers. Eighty papers made the initial cut, and human reviewers then helped organizers choose the final 48 papers presented at the conference.

The stated goal of the conference was to assess “if and how A.I. can independently generate novel scientific insights, hypotheses and methodologies while maintaining quality through A.I.-driven peer review.”

“There’s still some stigma about using A.I., and people are incentivized to hide or to minimize it,” Zou tells Science. The organizers wanted “to have this study in the open so that we can start to collect real data, to start to answer these important questions,” he adds.

Most of the papers were on the topic of artificial intelligence and human learning, according to the event’s website. They’re also mainly computational studies, Zou tells Nature, rather than based on physical experiments.

Three works took home the title of outstanding paper: one that looked at how artificial intelligence agents behave in economic marketplaces; one that examined how reduced towing fees in San Francisco affected low-income residents; and one that investigated whether an A.I. agent can fool A.I. reviewers with bad papers.

The conference’s ethos is not without critics, who are unconvinced that A.I. can meaningfully design and author studies. “If the authors and reviewers are A.I., then perhaps the conference attendees should be A.I., too, because no human should mistake this for scholarship,” Raffaele Ciriello, a digital innovation researcher at the University of Sydney, said in a statement released ahead of the conference.

“Science is not a factory that converts data into conclusions,” he added. “It is a collective human enterprise grounded in interpretation, judgment and critique. Treating research as a mechanistic pipeline where hypotheses, experiments and papers can be autonomously generated and evaluated by machines reduces science to empiricism on steroids.”

Even Min Min Fong, an economist at the University of California, Berkeley, who collaborated with A.I. on the car-towing study, urges caution when working with the technology. “A.I. was really great at helping us with computational acceleration,” she tells Science News. But it still made mistakes—the A.I. kept using the wrong date for when San Francisco’s new towing fees were implemented. “The core scientific work still remains human-driven,” she says to the outlet.

Previous research suggests A.I. reviewers cannot assess a study’s novelty and significance as well as humans can, Matthew Gombolay, a computer scientist at the Georgia Institute of Technology, tells Nature. Another issue: Many large language models are simply too nice. They are “not going to produce the level of conflict and diverse perspectives that are required for really pathbreaking work,” James Evans, a computational social scientist at the University of Chicago, tells Science.

A better experiment, Gombolay tells Nature, would be for an existing conference to randomly assign papers to either human or A.I. review, then evaluate which path results in more breakthroughs.