I am troubled by people using an LLM at all to write academic research papers.
It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.
I'd see a failure of the 'author' to catch hallucinations as more like a failure to hide the evidence of misconduct.
If academic venues are saying that using an LLM to write your papers is OK ("so long as you look it over for hallucinations"?), then those academic venues deserve every bit of operational pain and damaged reputation that will result.
Google Translate et al. were never good enough at this task to actually allow people to use the results for anything professional. Previous tools were limited to giving a rough gloss of what words in another language mean.
But LLMs can be used in this way, and are being used in this way; and this is increasingly allowing non-English-fluent academics to publish papers in English-language journals (thus engaging with the English-language academic community), where previously those academics may have felt "stuck" publishing in what few journals exist for their discipline in their own language.
Would you call the use of LLMs for translation "shoddy" or "irresponsible"? To me, it'd be no more and no less "shoddy" or "irresponsible" than it would be to hire a freelance human translator to translate the paper for you. (In fact, the human translator might be a worse idea, as LLMs are more likely to understand how to translate the specific academic jargon of your discipline than a randomly-selected human translator would be.)
(A friend has an old book translated a long time ago (by a human) from Russian to Spanish. Instead of "complex numbers", the book calls them "complicated numbers". :) )
Typically what happens is that translators are given an Excel sheet with the original text in a column, and the translated text must be put into the next column. Because there's no context, it's not necessarily clear to the translator whether the translation for plane should be avion (airplane) or plan (geometric plane). The translator might not ever see the actual software with their translated text.
This is because, even in countries with a different primary spoken language, many academic subjects, especially at a graduate level (masters/PhD programs — i.e. when publishing starts to matter), are still taught at universities at least partly in English. The best textbooks are usually written in English (with acceptably-faithful translations of these texts being rarer than you'd think); all the seminal papers one might reference are likely to be in English; etc. For many programs, the ability to read English to some degree is a requirement for attendance.
And yet these same programs are also likely to provide lectures (and TA assistance) in the country's own native language, with the native-language versions of the jargon terms used. And any collaborative work is likely to also occur in the native language. So attendees of such programs end up exposed to both the native-language and English-language terms within their field.
This means that academics in these places often have very little trouble in verifying the fidelity of translation of the jargon in their papers. It's usually all the other stuff in the translation that they aren't sure is correct. But this can be cheaply verified by handing the paper to any fluently-multilingual non-academic and asking them to check the translation, with the instruction to just ignore the jargon terms because they were already verified.
It depends on the country. Here in Argentina we use a lot of loanwords for technical terms, but I think in Spain they like to translate everything.
Why is their use more intense in English-speaking universities?
You need a way to validate the correctness of the translation, and to be able to stand behind whatever the translation says. And the translation should be disclosed on the paper.
That is a purely imaginary "error". Anywhere you can use 'although', you are free to use 'though' instead.
I'm an outsider to the academic system. I have cool projects that I feel push some niche application to SOTA in my tiny little domain, which is publishable based on many of the papers I've read.
If I can build a system that does a thing, and I can benchmark it to prove it's better than what previous papers report, then my main blocker is getting all my work and information into the "Arxiv PDF" format and tone. That seems like a good use of LLMs to me.
I don't actually mind putting Claude as a co-author on my github commits.
But for papers there are usually so many tools involved. It would be crowded to include each of Claude, Gemini, Codex, Mathematica, Grammarly, Translate etc. as co-authors, even though I used all of them for some parts.
Maybe just having a "tools used" section could work?
It reminds me of kids these days and their fancy calculators! Those newfangled doohickeys just aren't reliable, and the kids never realize that they won't always have a calculator on them! Everyone should just do it the good old-fashioned way with slide rules!
Or these darn kids and their unreliable sources like Wikipedia! Everyone knows that you need a nice solid reliable source that's made out of dead trees and fact checked by up to 3 paid professionals!
Sure, maybe someday LLMs will be able to report facts in a mostly reliable fashion (like a typical calculator), but we're definitely not even close to that yet, so until we are the skepticism is very much warranted. Especially when the details really do matter, as in scientific research.
Reproducibility and repeatability in the sciences?
Replication crisis > Causes > Problems with the publication system in science > Mathematical errors; Causes > Questionable research practices > In AI research, Remedies > [..., open science, reproducible workflows, disclosure, ] https://en.wikipedia.org/wiki/Replication_crisis#Mathematica...
Already, verifiable proofs run to impossibly many pages for human review.
There are "verify each Premise" and "verify the logical form of the Argument" (P therefore Q) steps that still the model doesn't do for the user.
For your domain, how insufficient is the output given a process-as-prompt like the following? (A minimal sketch of wiring these steps together appears after the list.)
Identify hallucinations from models prior to (date in the future)
Check each sentence of this: ```{...}```
Research ScholarlyArticles (and then their Datasets) which support and which reject your conclusions. Critically review findings and controls.
Suggest code to write to apply data science principles to proving correlative and causative relations given already-collected observations.
Design experiment(s) given the scientific method to statistically prove causative (and also correlative) relations
Identify a meta-analytic workflow (process, tools, schema, and maybe code) for proving what is suggested by this chat
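For concreteness, here is a minimal sketch of chaining those steps over a draft. Everything here is hypothetical: call_llm(prompt) -> str stands in for whatever model API you use, and the prompt wording is illustrative only, paraphrasing the list above.

```
from typing import Callable

# Hypothetical verification steps, paraphrasing the list above.
VERIFICATION_STEPS = [
    "Identify likely hallucinations. Check each sentence of the text below.",
    "List each Premise of the argument below and verify it.",
    "Verify the logical form of the Argument: do the premises entail the conclusion?",
    "Find scholarly articles (and their datasets) that support or reject these conclusions.",
    "Suggest code for testing the correlative and causative relations claimed below.",
]

def review(draft: str, call_llm: Callable[[str], str]) -> list[str]:
    """Run each verification step over the draft and collect the model's replies."""
    return [call_llm(step + "\n\n" + draft) for step in VERIFICATION_STEPS]
```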
LLMs do not work reliably; that's not their purpose.
If you use them that way, it's akin to using a butter knife as a screwdriver. You might get away with it once or twice, but then you slip and stab yourself. Better to go find a screwdriver if you need reliable results.
As a professional mathematician, I used Wikipedia all the time to look up quick facts before verifying them myself or elsewhere. A calculator? Well, I can use an actual programming language.
Up until this point, neither of those tools was advertised or used by people to entirely replace human input.
In a few cases, I've seen Terence Tao point out examples of LLMs actually finding proofs of open problems unassisted. Not necessarily problems anyone cared deeply about. But there's still the fact that if the proof holds, then it's valid no matter who or what came up with it.
So it's complicated I guess?
AI People: "AI is a completely unprecedented technology where its introduction is unlike the introduction of any other transformative technology in history! We must treat it totally differently!"
Also AI People: "You're worried about nothing, this is just like when people were worried about the internet."
Except they are (unlike a chatbot, a calculator is perfectly deterministic), and the unreliability of LLMs is one of the most widespread targets of criticism leveled at them, if not the most widespread.
Low effort doesn't even begin to describe your comment.
LLMs are supposed to be stochastic. That is not a bug; I can see why you find that disappointing, but it's just the reality of the tool.
However, as I mentioned elsewhere, calculators also have bugs, and those bugs make their way into scientific research all the time. Floating-point errors are particularly common, as are order-of-operations problems, because physical devices get it wrong frequently and are hard to patch. Worse, they are not SUPPOSED TO BE stochastic, so when they fail, nobody notices until it's far too late. [0 - PDF]
Further, spreadsheets are no better: a scan of ~3,600 genomics papers found that about 1 in 5 had gene-name errors (e.g., SEPT2 → "2-Sep") because that's how Excel likes to format things.[1] Again, this is much worse than a stochastic machine doing its stochastic job... because it's not SUPPOSED to be random; it's just broken, and on a truly massive scale. (A sketch of checking for this kind of mangling follows the links.)
[0] https://ttu-ir.tdl.org/server/api/core/bitstreams/7fce5b73-1...
[1] https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al...
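As a minimal illustration of the Excel issue above (my own sketch, not the scanner the cited study used), here's how you might flag date-mangled gene names in a CSV. It assumes a column named "gene" and only covers the commonly mangled SEPT/MARCH/DEC families:

```
import csv
import re

# Date-like shapes Excel silently turns gene symbols into,
# e.g. SEPT2 -> "2-Sep", MARCH1 -> "1-Mar", DEC1 -> "1-Dec".
MANGLED = re.compile(r"^(\d{1,2}-(Sep|Mar|Dec)|(Sep|Mar|Dec)-\d{1,2})$")

def find_mangled_genes(path: str, column: str = "gene") -> list[str]:
    """Return every value in `column` that looks like an Excel-mangled gene name."""
    with open(path, newline="") as f:
        return [row[column] for row in csv.DictReader(f)
                if MANGLED.match(row.get(column) or "")]
```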
Nobody can tell you what you are going to get when you run an LLM once. Nobody can tell you what you’re going to get when you run it N times. There are, in fact, no guarantees at all. Nobody even really knows why it can solve some problems and not others, except maybe that it memorized the answer at some point. But this is not how they are marketed.
They are marketed as wondrous inventions that can SOLVE EVERYTHING. This is obviously not true. You can verify it yourself, with a simple deterministic problem: generate an arithmetic expression of length N. As you increase N, the probability that an LLM can solve it drops to zero.
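That test is easy to run yourself. A minimal sketch, assuming a hypothetical ask_llm(prompt) -> str wrapper around the model you want to probe:

```
import random

def random_expression(n_terms: int) -> str:
    """Build something like '37 + 82 * 14 - 5' with n_terms operands."""
    parts = [str(random.randint(1, 99))]
    for _ in range(n_terms - 1):
        parts += [random.choice(["+", "-", "*"]), str(random.randint(1, 99))]
    return " ".join(parts)

def accuracy(ask_llm, n_terms: int, trials: int = 20) -> float:
    """Fraction of length-n expressions the model evaluates exactly right."""
    correct = 0
    for _ in range(trials):
        expr = random_expression(n_terms)
        truth = eval(expr)  # fine here: we generated the expression ourselves
        try:
            correct += int(ask_llm("Compute exactly; reply with one integer: " + expr)) == truth
        except ValueError:
            pass  # a non-numeric reply counts as wrong
    return correct / trials
```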
Ok, fine. This kind of problem is not a good fit for an LLM. But which is? And after you’ve found a problem that seems like a good fit, how do you know? Did you test it systematically? The big LLM vendors are fudging the numbers. They’re testing on the training set, they’re using ad hoc measurements and so on. But don’t take my word for it. There’s lots of great literature out there that probes the eccentricities of these models; for some reason this work rarely makes its way into the HN echo chamber.
Now I’m not saying these things are broken and useless. Far from it. I use them every day. But I don’t trust anything they produce, because there are no guarantees, and I have been burned many times. If you have not been burned, you’re either exceptionally lucky, you are asking it to solve homework assignments, or you are ignoring the pain.
Excel bugs are not the same thing. Most of those problems can be found trivially. You can find them because Excel is a language with clear rules (just not clear to those particular users). The problem with Excel is that people aren’t looking for bugs.
I do think they can be used in research but not without careful checking. In my own work I’ve found them most useful as search aids and brainstorming sounding boards.
Of course you are right. It is the same with all tools, calculators included: if you use them improperly, you get poor results.
In this case they're stochastic, which isn't something people are used to happening with computers yet. You have to understand that and learn how to use them or you will get poor results.
I made this a separate comment, because it's wildly off topic, but... they actually aren't. Especially for very large numbers or for high precision. When's the last time you did a firmware update on yours?
It's fairly trivial to find lists of calculator flaws and then identify them in research papers. I recall reading a research paper about it in the 00's.
I do think it can be used in research but not without careful checking. In my own work I've found it most useful as a search aid and for brainstorming.
^ this same comment 10 years ago
This is really just restating what I already said in this thread, but you're right. That's because wikipedia isn't a primary source and was never, ever meant to be. You are SUPPOSED to go read it then click through to the primary sources and cite those.
Lots of people use it incorrectly and get bad results because they still haven't realized this... all these years later.
Same thing with treating stochastic LLMs like sources of truth and knowledge. Those folks are just doing it wrong.
In an academic paper, you condense a lot of thinking and work, into a writeup.
Why would you blow off the writeup part, and impose AI slop upon the reviewers and the research community?
They should still review the final result though. There is no excuse for not doing that.
To me, this is a reminder of how much of a specific minority this forum is.
Nobody I know in real life, personally or at work, has expressed this belief.
I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.
Clearly, the authors in NeurIPS don't agree that using an LLM to help write is "plagiarism", and I would trust their opinions far more than some random redditor.
TBF, most people in real life don't even know how AI works to any degree, so using that as an argument that parent's opinion is extreme is kind of circular reasoning.
> I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.
I don't see parent's opinions as anti-AI. It's more an argument about what AI is currently, and what research is supposed to be. AI is existing ideas. Research is supposed to be new ideas. If much of your research paper can be written by AI, I call into question whether or not it represents actual research.
One would hope the authors are forming a hypothesis, performing an experiment, gathering and analysing results, and only then passing it to the AI to convert it into a paper.
If I have a theory that, IDK, laser welds in a sine wave pattern are stronger than laser welds in a zigzag pattern - I've still got to design the exact experimental details, obtain all the equipment and consumables, cut a few dozen test coupons, weld them, strength test them, and record all the measurements.
Obviously if I skipped the experimentation and just had an AI fabricate the results table, that's academic misconduct of the clearest form.
My brother in law is a professor, and he has a pretty bad opinion of colleagues that use LLMs to write papers, as his field (economics) doesn't involve much experimentation, and instead relies on data analysis, simulation, and reasoning. It seemed to me like the LLM assisted papers that he's seen have mostly been pretty low impact filler papers.
How about the authors who do research for NeurIPS? Do they know how AI works?
As the wise woman once said "Ain't nobody got time for that".
It seems to me like most of the LLM benchmarks wind up being gamed. So, even if there were a good benchmark there, which I do not believe there is, the validity of the benchmark would likely diminish pretty quickly.
> Plagiarism is using someone else's words, ideas, or work as your own without proper credit, a serious breach of ethics leading to academic failure, job loss, or legal issues, and can range from copying text (direct) to paraphrasing without citation (mosaic), often detected by software and best avoided by meticulous citation, quoting, and paraphrasing to show original thought and attribution.
> Plagiarism is using someone else's words,
It's right there. An LLM is not "someone else"; it's a very useful piece of software.
Where does this bizarre impulse to dogmatically defend LLM output come from? I don’t understand it.
If AI is a reliable and quality tool, that will become evident without the need to defend it - it’s got billions (trillions?) of dollars backstopping it. The skeptical pushback is WAY more important right now than the optimistic embrace.
Meanwhile, this entire comment thread is about what appears to be, as fumi2026 points out in their comment, a predatory marketing play by a startup hoping to capitalize on exactly the sort of anti-AI sentiment that you seem to think is important... just because there is pro-AI sentiment?
Naming and shaming everyday researchers on the theory that they let hallucinations slip into their paper, all because your own AI model decided it was AI, just so you can signal-boost your product, seems pretty shitty and exploitative to me, and it's only viable as a product and marketing strategy because of the visceral anti-AI sentiment in some places.
No that’s a straw man, sorry. Skepticism is not the same thing as irrational rejection. It means that I don’t believe you until you’ve proven with evidence that what you’re saying is true.
The efficacy and reliability of LLMs require proof. AI companies are pouring extraordinary, unprecedented amounts of money into promoting the idea that their products are intelligent and trustworthy. That marketing push absolutely dwarfs the skeptical voices, and that's what makes those voices more important at the moment. If the researchers named have claims made against them that aren't true, that should be a pretty easy thing for them to refute.
I don’t love ai either, but that’s the truth.
Or they didn't consider that it arguably fell within academia's definition of plagiarism.
Or they thought they could get away with it.
Why is someone behaving questionably the authority on whether that's OK?
> Nobody I know in real life, personally or at work, has expressed this belief. I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.
It's not "anti-AI extremism".
If no one you know has said, "Hey, wait a minute, if I'm copy&pasting this text I didn't write, and putting my name on it, without credit or attribution, isn't that like... no... what am I missing?" then maybe they are focused on other angles.
That doesn't mean that people who consider different angles than your friends do are "extremist".
They're only "extremist" in the way that anyone critical at all of 'crypto' was "extremist", to the bros pumping it. Not coincidentally, there's some overlap in bros between the two.
Because they are not. Using AI to help with writing is something literally every company is pushing for.