Enigma: GPT-2 trained on 10K Nature Papers: Can you spot the difference? (opens in new tab)

(stefanzukin.com)

183 pointsMrUssek4y ago105 comments

105 comments

The side-by-side display makes it pretty easy to distinguish the one from the other, simply compare them at a level where the one that makes the least sense is the one that is nonsense. Like that I score 10/11. But when looking at just the left side one suddenly the problem is much harder, and I'm happy to get better than even. Bits that don't help: not an English native writer. Seen too many real life papers with crappy writing that quite a few of these look plausible, especially when they are about fields that I know very little of.

Presumably when you're a native English speaker and have a broader interest the difficulty goes down a bit.

I like this project very much and would like to see some overall scores, and it might not hurt to allow for a verified result link to detect bragging rather than actual results (not that anybody on HN would ever brag about their score ;) ).

Overall: I'm not worried that generated papers will swamp the publications any day soon but for spam/click farms this must be a godsend and for sure it will cause trouble for search engines to classify real content from generated content.

userbinator4y ago

and for sure it will cause trouble for search engines to classify real content from generated content.

Fortunately, SEO spam is currently nowhere near as coherent as this, and often features some phrases that are a dead giveaway ("Are you looking for X? You've come to the right place!" or a strangely-thesaurised version thereof), but I am also worried about this new generation of manufactured deception.

zitterbewegung4y ago

I tried making fake tweets using GPT-2 two years ago. When I actually interviewed people to verify my model I got good results for people who didn't actively engage with twitter versus people who regularly engaged in twitter (note that this was an N ~ 10 people and it was limited to GPT-2 774M.

I found that people would also refuse the test and would believe whatever the output of the model was due to my choice of subject.

Others that did a similar exercise and tried to verify their results using reddit had a great deal of people who would be able to spot fakes quite easily.

The biggest issue would be someone using a system to deliberately fool a targeted set of people which is easy given how ad networks are run.

lainga4y ago

Having read your comment first (ooh, horribile dictu on HN) I decided to try playing by only looking at the left paper and deciding if it was fake. Luckily the model seems to have picked up that "last names can be units" too strongly and the 2nd fake paper was discussing a frequency of "10 Jones".

jacquesm4y ago

Good one :) What's your score across 10 in sequence just looking at one? I get about 7 out of 10 that way if I repeat it a number of times.

jvanderbot4y ago

Quite easy when you know one is fake. Flagging fake articles in a review queue, by abstract only, and when none may exist all the way up to all being fake ....

Now that's a challenge.

Also, if you train GPT on the whole corpus of Nature / Science / whatever articles up to, say, 2005, could you feed it leading text about discoveries after 2005 and see if it hypothesizes the justification for those discoveries in the same way that the authors did?

pjc504y ago

I find the whole thing ominous because there is no "there" there: there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text. This greatly increases the amount of plausible nonsense that can be used to drown out actual research. You could certainly replace a lot of pop-sci and start several political movements with GPT-2... all of which has no actual nutritional content.

jvanderbot4y ago

I find it intriguing for exactly that reason. I agree there's no fundamental "there", but I suggest that you may find that ominous because it implies there's no fundamental understanding anywhere. Only stories that survive scrutiny.

GPT can write "about" something from a prompt. This is not much different than me interpreting data that I'm analyzing. I'm constantly generating stories and checking them, until one story survives it all. How do I generate stories!? Seriously. I'm sure I have a GPT module in my left frontal cortex. I use it all the time when I think about actions I take, and it's what I try to ignore when I meditate. Its ongoing narrative is what feeds back into how I feel about things, which affects how I interact with things and what things I interact with ... not necessarily as a goal-driven decision process, more as a feedback-driven randomized selection. Isn't this kind of the basis of Cognitive Behavioral Therapy, meditation, etc. See [1,2]. If you stick GPT and sentiment analysis into a room, will they produce a rumination feedback like a depressed person?

Anyway, if you can tell a coherent story to justify a result (once presented with a result), one that is convincing enough for people to believe and internalize the result in their future studies, how is that different from understanding that result and teaching it to others? The act of teaching is itself story generation. Mental models are just story-driven hacks that allow people to generalize results in an insanely complex system.

1. Happiness Hypothesis Jonathan Haidt 2. Buddhism and modern psychology, coursera

jrumbut4y ago

You could probably ask an undergraduate or pop science fan to write a paper title and first sentence of an abstract that would turn heads and get good results.

Faking an entire 10 page paper with figures and citations is much harder. I'm sure it'll happen next week, but until then I can still say that's where real understanding is demonstrated.

sdenton44y ago

"there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text."

So, basically it's achieved undergraduate level skills.

sellyme4y ago

> there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text

Under some definition of "understanding". GPT understands how to link words and concepts in a broadly correct manner. As long as the training data is valid, it's very plausible that it could connect some concepts that it genuinely and correctly understands are compatible, but doing so in a way which humans had not considered.

It can't do research or verify truth, but I've seen several examples of it coming up with an idea that as far as I can tell had never been explored, but made perfect sense. It understood that those concepts fit together because it saw a chain connecting them through god-knows how many input texts, yet a human wouldn't have ever thought of it. That's still valuable.

As to how far that understanding can be developed... I'm not sure. It's hard to believe that algorithmically generated text would ever be able to somehow ensure that it produces true text, but then again ten years ago I would have scoffed at the idea of it getting as far as it already has.

cwkoss4y ago

Makes you wonder if humans also have less "there" than we give ourselves credit for. How much of sentence construction is just repeating familiar tropes in barely-novel ways?

To what extent have our brains already decided what to say while we still perceive ourselves as 'thinking about the wording'?

dnautics4y ago

No, the text is generally very implausible if you know anything about the science. For example, describing dna twists with protein folding descriptors, mixing up quantum computing with astronomy, inorganic chemistry with biochemistry, virology with bacteriology...

I was really impressed with gpt-2 but seeing this really gave me a feel for how much of a lack of understanding it has.

eMPee5844y ago

A striking argument, at this point in time. Quite likey this weakness will be overcome soon, when deep learning becomes integrated with [KR²]-spectrum methods...

[KR²] https://en.wikipedia.org/wiki/Knowledge_representation_and_r...

fock4y ago

this is already happening. maybe not in Nature, but all over the "small" papers of this world (and so also with any Nature derivatives). There's people milling the same article with nonsense "results" through 3 journals each in Elsevier, Springer, Wiley and RCS. And they stay up. Forever. And anyone looking into this topic will waste an hour because these 10x cited articles are just self-citations of the same crap.

lmm4y ago

What definition of "understanding" are you using? Why are you confident that the existing authors of pop-sci and political movements have it?

dilippkumar4y ago

Good point.

This challenge would be more interesting if there were "Neither is fake" and "Both are fake" buttons (and obviously, the test randomly showed two fake and two real articles in the mix)

brokencode4y ago

Absolutely. Especially once you factor in the fact that sometimes things are simply written poorly or don’t make sense, even if it was a real human being at the keyboard.

A lot of the communication I have with folks is subtly flawed in logic or grammar, but that doesn’t make me think I’m working with a bunch of androids.

It’s natural and often even necessary to try to figure out an author’s intent when their writing doesn’t fully make sense.

NorwegianDude4y ago

Considering the terrible quality of the writing in these examples, it's simple no matter how it's presented.

jcims4y ago

Excellent point. Serializing them would make it more difficult.

userbinator4y ago

Some of the fake ones are hilarious:

The chicken genome (the genome of a chicken that is the subject of much chicken-related activity) is now compared to its chicken chicken-to-pecking age: from a genome sequence of chicken egg, only approximately 70% of the chicken genome sequences match the chicken egg genome, which suggests that the chicken may have beenancreatic.

ronsor4y ago

Even hard mode isn't that hard because GPT-2 tends to ramble on while saying nothing substantive. If I can't figure out what a paper is supposed to be talking about, it's fake.

4/4 on hard. Never read a Nature paper before.

argvargc4y ago

Did way better on Hard mode than Easy. I think people get bored of doing this before we can see real indicative results.

Scores under 5 on what amounts to a coin flip doesn't strike me as so remarkable, especially when coupled with an incentivised reporting-bias as we see here. ("I got a high score! Proud to share!" Vs. "I got a low score, or an even score and look at all the people reporting high scores, think I might keep it to myself")

Being as it is, at this juncture, I think the AI may still have a chance to be strong with this one.

Also, were the AI to do well consistently, I'd think it might say more about the external unfamiliarity with, and the internal prevalence of, field-specific scientific jargon, than any AI's or human's innate intelligence.

meowface4y ago

(Easy/cliched joke, but)

>If I can't figure out what a paper is supposed to be talking about, it's fake.

Depends on the field...

thereddaikon4y ago

The engineering/materials science/physics ones were fairly easy to identify for me. Usually it would be one or two sentences that were grammatically cohesive but would make a statement that didn't make any sense if you had even a basic understanding of the topic. One that stood out to me was an astrophysics paper that said a planet was orbiting solar wind. I don't have to be a PhD to know that's BS.

The medical and biotech ones are much harder.

1 more reply

iamgopal4y ago

Same for me, hard mode is quite easy, our brain is pattern matching Engine and constantly try to make sense of this and hence it may look like legitimate, it wouldn’t if text were 2D.

anigbrowl4y ago

N=4

drenvuk4y ago

The hard version usually requires me to understand why a number, measurement or chemical or other substance doesn't make sense in the context of what each paragraph is describing. This means I can't just skim it in order to spot the fake, I need to figure out that what it's saying is wrong.

That's close enough for this to be a success if the purpose was to persuade or fool laymen.

lkbm4y ago

This was a hard-mode fake I just got:

> A new era of hyperaridididididemia revealed by single-cell RNA-seq

So some are better than others. :-)

neltnerb4y ago

I also got one that was a fake that was talking about measuring two actual quantum properties simultaneously by using a third state to probe it indirectly.

Which is absolutely a real thing except that the exact quantum properties in fact didn't commute while they claimed they did commute and said for some reason simultaneous measurement required a third state anyway.

I don't know how I would have been able to distinguish that from completely reasonable methods for quantum error correction without knowing ahead of time which quantum states commute and which don't... pretty cool.

If I were skimming or half asleep I definitely wouldn't have caught a lot of these on hard, abstracts are always so poorly written and usually trying too hard to be complicated sounding by using big words when small ones would do just fine!

quantum_mcts4y ago

I always wanted similar thing but for some philosophy texts. Notably Hegel - I'd love to see a philosopher trying to figure out which pile of gibberish is generated and which is the work of a father of modern dialectics.

carbocation4y ago

Easy mode is cake.

Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.

jmgao4y ago

> Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.

Yeah, I ran into at least one example which basically regurgitated a real paper. The "fake" article was:

    Efficient organic light-emitting diodes from delayed fluorescence

    A class of metal-free organic electroluminescent molecules is designed in which both singlet and triplet excitons contribute to light emission, leading to an intrinsic fluorescence efficiency greater than 90 per cent and an external electroluminescence efficiency comparable to that achieved in high-efficiency phosphorescence-based organic light-emitting diodes.

and the real one at https://www.nature.com/articles/nature11687 has:

    Highly efficient organic light-emitting diodes from delayed fluorescence

    Here we report a class of metal-free organic electroluminescent molecules in which the energy gap between the singlet and triplet excited states is minimized by design4, thereby promoting highly efficient spin up-conversion from non-radiative triplet states to radiative singlet states while maintaining high radiative decay rates, of more than 106 decays per second. In other words, these molecules harness both singlet and triplet excitons for light emission through fluorescence decay channels, leading to an intrinsic fluorescence efficiency in excess of 90 per cent and a very high external electroluminescence efficiency, of more than 19 per cent, which is comparable to that achieved in high-efficiency phosphorescence-based OLEDs

TuringTest4y ago

That's something I've been suspecting for a while. More powerful models produce overfitting, so it is likely that GPT-2 is simply memorizing whole texts and then regurgitating them in whole.

Knowing how they're generated, a sequence of sentences that make sense are likely copied almost verbatim from an article written by a human. Without understanding the concepts, the algorithm may simply repeat words that go well together - and what goes together better than sentences that were written together in the first place?

What the GPT model is really good at is at is identifying when a sentence makes sense in the current context. Given that it has half of the internet as its learning corpus, it is easy that it's simply returning a piece of text that we do not know about. The real achievement thus is finding ideas that are actually appropriate in relation to their input text.

kurthr4y ago

Yes, I got a really short astronomical one about the discovery of a metallic core planet circling a G-Type star, and I only knew it was fake, because I would have heard about it!

jandrese4y ago

Sometimes the articles are really short which makes it even harder to figure out which is fake. GPT's big weakness is that it tends to forget what it was talking about and wanders off after a couple of paragraphs. With just one sentence to examine it can be very hard to spot.

beforeolives4y ago

Cool demo.

With these GPT models, I don't get the appeal of creating fake text that at best can pass as real to someone who doesn't understand the topic and context. What's the use case? Generating more believable spam for social media? Anything else? Because there's no real knowledge representation or information extraction going on here.

drusepth4y ago

I use GPT-2/3 in creative writing to generate rough text that I then go back and edit/improve because it gives a good starting point (and often has a lot of high-quality factors).

I don't know if this translates to technical writing, but it's possible someone might complete a prompt on some specific topic(s) and then use that as a point to start from, especially if they're knowledgeable enough on the topic to correct the output. It's nice to be able to skip a lot of boilerplate words (how many words in this comment are actually the meat of this idea, and how many words are just there to tie all those morsels together?)

bulldog134y ago

What platform do you use ? Is it all run from home ?

drusepth4y ago

I built my own editor on top of the API, which I ran at home, yeah. You can see the code and/or run it yourself at https://github.com/indentlabs/gpt-3-writer, but it's very rough and not really ready for public use. Just make sure your OpenAI secret key is set in ENV['OPENAI_SK'] and you should be able to run it yourself.

I also built a more polished version to add to the Notebook.ai document editor (so writers can get some continuation prompts whenever they get a bit of writer's block), but the pricing made it unfeasible to actually release. Notebook.ai is also open source though, and you can see the GPT-3 functionality in the unmerged PR here: https://github.com/indentlabs/notebook/pull/739

parksy4y ago

Mainly at present I see it as a demonstration of progress in the capabilities of the models. A machine is writing coherent syntax and grammar in close to real-time, maintaining a semblance of context while it does so, without requiring excruciatingly complex and detailed semantic algorithms.

One use-case is as a creative writing assistant. Typical ways to beat writer's block are to do something else (read, walk, talk, dream). At some point, either consciously or subconsciously, the hope is that these activities will elicit inspiration in the form of an experience or idea that will connect with the central vision of the author's work, allowing the writing to continue. So too it could be with prompts generated from these models, just another way of prompting and filtering ideas from the sensory soup of reality.

(While the "dark" version of reality has bedroom "writers" pooping out entire forests of auto-generated pulp upon ever-jaded readers, being able to instantly mashup the entire literary works of mankind into small contextual prompts would be another tool in the belt for more measured and experienced authors.)

There are other more practical use-cases for these models. There's work being done on auto-generating working or near-working software components from human language descriptions. Personally I'd love to just write functional tests and let the "AI" keep at it until all the tests pass. So seeing the models improve over time is a sign that this may not be an impossible feat.

From a less utilitarian perspective, I'd love computer assistants to have a bit of "personality". "Hey Jeeves, tell me the story about the druid who tapdanced on the moon." and just let the word salad play out in the background. Yeah it's a toy, but it would jazz up the place a bit, add a bit of sass even.

I do think we're a ways off the first computer generated science paper being successfully peer reviewed and contributing something new to human understanding of nature, I'd have scoffed at the idea ten years ago but now I'm sure it's only a matter of time.

CamperBob24y ago

Here's an example I just ran into today. This is post #17 in a thread on a car forum where people are discussing whether or not to run the A/C periodically in the winter:

https://www.planet-9.com/threads/air-conditioning-compressor...

Screenshot including the spam post, in case it's removed: https://imgur.com/GgCkp1r

This is going to suck.

minimaxir4y ago

For fun.

aabhay4y ago

Fun? Show me the fun!

1 more reply

codeflo4y ago

5/5 on hard node, but it’s tough sometimes, I don’t actually know much about biology. But if you’ve played around with GPT before, you get better at spotting the subtle logical errors it tends to make. I wonder whether the ability to identify machine generated texts will become a useful skill at some point.

minkzilla4y ago

Or you just train a machine to do it and then generate a bunch and have this second machine sort out any it thinks are machine generated.

truth_4y ago

The best fake-detecting model detecting fakes generated by the best generator model will always lag behind the latter model.

minkzilla4y ago

I think I see what you’re saying, but why is this so?

1 more reply

toxik4y ago

You basically just described a GAN. Neat!

minimaxir4y ago

GANs work by feeding back the mistakes and forcing the generator model to improve its cheating. In this case, filtering out titles that are ambiguous would act as an independent filter.

1 more reply

ta9884y ago

I'm sure GPT2 abstracts would fly through many conferences screening processes. I've seen talks and posters that were utter non-sense but everybody was too polite to say anything to the person or advisors.

I've reviewed articles that were completely made up and the other reviewer didnt even detect that. Nor did the editor.

I've contacted editors about utterly wrong papers, criticized the article on pubpeer, and the article is still published... Because it would harm their notoriety. Thats one of the madenning ascpects of academic publishing.

matthewdgreen4y ago

There is a long tail of weak journals in just about any field. When you think about it, this is inevitable in any society that has freedom of the press and where there exist incentives (evaluated by non-experts) for publishing. You have to evaluate journals the same way that you would evaluate products purchased in a flea market.

ta9884y ago

I'm talking about top of the field journals unfortunately.

not2b4y ago

Abstracts are relatively short, so for the length of an abstract GPT-2 might just be fusing together the abstracts of two or three related papers, so the result might look legit. It tends to wander around when the length is increased, though, and if asked to go on for long enough it will lose the plot.

CSDude4y ago

How does one train GPT-2 with their own content and produce nice results at arbitrary lengths? I found a few libraries but I could not use them well, I get lost very quickly. I just want to train our internal Confluence and have fun with it.

bulldog134y ago

I am interested as well. I can't seem to find a decent end-to-end article on this.

supermatt4y ago

Even on hard, if you understand the terminology, the fake ones are mostly gibberish.

oceliker4y ago

If you don't understand the terminology, any paper is gibberish :) but I agree, I can detect fake ones fairly reliably in biology, but not in e.g. astronomy.

febed4y ago

You don't even really need to know the terminology to detect the fakes.

aidenn04y ago

Right, I could pick out all the hard ones for fields I'm comfortable with, but struggled for some of the easy ones in fields I was less familiar with.

hazeii4y ago

On easy, 7 correct and 0 wrong was enough to tell me that yes, I can (I have been reading Nature for years though).

caslon4y ago

On easy, ten correct and zero wrong was enough for me to realize the same (I have never read Nature).

Trying hard right now, will report results after.

Edit: Yeah, it's the same deal. Length becomes less of a giveaway but its errors become more obvious.

zwaps4y ago

Same here for hard mode.

Still interesting!

make34y ago

Even with GPT-3, where this "game" would be much harder, this would be kind of a weak demo, because the human reader doesn't understand what the text means in either case (most of the time), taking much if not all away the interesting part of whether the generated text makes any sense or not.

We have known for a while that language models can generate superficially good looking text. The real question is whether they can get to actually understand what is being said. As humans don't understand either, the exercise sadly moot.

anon_tor_123454y ago

STEM people love to bring up the sokal affair. the same STEM people also don't realize that many journals and conferences in STEM have been tricked by things like this (more specifically precursors using HMMs and etc).

https://en.wikipedia.org/wiki/List_of_scholarly_publishing_s...

edit: don't understand why i'm getting downvoted. is my comment not relevant to a post about the plausibility of abstracts generated by ML models?

Mordisquitos4y ago

The reason for the downvotes is probably the generalisation regarding what "STEM people love to bring up" and also "don't realize". It feels like an unprovoked strawman attack against an ambiguously defined group of people.

dnautics4y ago

10/10 on easy and 10/10 on hard. Hard selections seem mostly hard because they are short enough you don't see gpt-2 to go off the rails with something completely nonsensical.

Only one was convincing enough to be truly challenging, I got it right because the mechanism proposed was fishy, 1) I had domain expertise, and 2) the date of the paper made no sense relative to when that sort of a discovery would be made (2009 is too early)

jackcviers34y ago

6 and 2 on hard mode. The failure of the model to connect ideas in long paragraphs (or to make a succinct claim) is what gives it away. It introduces far too many terms with far too little repetition and far too much specificity in such a short span.

Suggested tweak - train it against papers written by people with an Erdos number < 3 (or Feynman contributors, etc.), so that the topics and fake topics are more closely related in style and content. Maybe even feed it some of their professional letters as well. That would produce some very hard to decipher fakes.

Another great corpus for complex writing is public law books. Have it compare real laws from the training set with fake laws. I bet it would be very difficult to figure out the fake laws.

Training one of these on an entire corpus of one author (Roger Ebert, Justice Ginsberg, Joyce, anyone with a large enough body of work), and having people spot the fake paragraphs from the real ones would be very, very difficult. An entire text, however, would likely be discernible.

It is getting really, really close to being able to fool any layman, though. Impressive work!

dougb54y ago

The generated abstracts may be gibberish but I wonder how often they contain little bits of brilliance, or make novel connections between ideas expressed in the training set. If we got a panel of domain experts to evaluate the snippets on this basis, thrir labels could be used to fine-tune the model in the direction of novel discovery. (This is almost certainly not a novel idea!)

th0ma54y ago

Now here is an application where the GPT stuff can really shine, which is trying to convince people that aren't domain experts that something is speaking from authority, even if the reader doesn't intend to get anything meaningful from the material either way.

anigbrowl4y ago

18/20 on hard mode. Sherter ones are more difficult, longer ones tend to have dangling clauses or circular claims. I suspect GPT-3 could produce convincing complete abstracts. But this was good enough that I don't feel bad about the two I missed.

karagenit4y ago

Seems like the model likes to repeat words in the title, particularly when hyphens are involved (I guess it considers them as different words?) e.g. "new dinosaur-like dinosaur" and "male-pattern traits in male rats" are a couple I saw.

minimaxir4y ago

That's mostly a GPT-2/Transformers quirk. Some approaches apply a repetition penalty to work around it.

varispeed4y ago

If we feed AI all the knowledge about the physics of the world, then will it be ever capable of giving answers without actually performing scientific research inferring it just from the laws that define the world?

Mordisquitos4y ago

I very much doubt it, at least with regards to GPT-n style models. In this particular example, it is not actually being fed any knowledge about the physics of the world. Rather, it is being fed the texts of a very specific subculture (that of scientific research and publishing) which is based not only on the prior sensory experiences of human beings, but also following the arbitrary agreements and expectations of the members of the scientific community that have developed over time. Even the most intelligent human minds would be unable to learn anything meaningful from scientific papers if they were expected to read them from scratch having been brought up completely isolated and without any prior knowledge.

On the other hand, an interesting possibility with well-designed text-mining and AI models would be for them to generate valid hypotheses that hadn't been contemplated earlier, based on the massive corpus of scientific publications. The model may be able to find possible correlations or interesting ideas by combining sources from different fields that would normally be ignored by the over-specialised research community. However, in that case the model wouldn't be valuable for providing answers—rather, it's value would be in providing questions.

finin4y ago

We've done recent work on using a transformer to generate fake cyber threat intelligence (CTI) and found that a set of cybersecurity experts could not reliably distinguish the fake CTI examples from real ones.

Priyanka Ranade, Aritran Piplai, Sudip Mittal, Anupam Joshi, and Tim Finin, Generating Fake Cyber Threat Intelligence Using Transformer-Based Models, Int. Joint Conf. on Neural Networks, IEEE, 2021. https://ebiq.org/p/969

f6v4y ago

The sad thing is that often there's an equal mental effort to read GPT articles and the real ones. It's as if people are trying to make their papers as incomprehensible as possible.

FredPret4y ago

Incomprehensible language = look how smart I am now give me grant money

mattkrause4y ago

Nah--If you're publishing in Nature, you're already well beyond that game.

The incomprehensibility comes from the fact that abstracts (and particularly NPG abstracts) are trying to do many things at once--and all in 200 words. In theory, the abstract should describe why your work is of broad general interest (so Nature's editors will publish it), while explaining the specific scientific question and answer(!) to a specialist audience of often-picky, sometimes-hostile peer reviewers, and conforming to a fairly specific style that doesn't reference the rest of the paper.

It's tough to do well, and even moreso for non-native English speakers.

kangalioo4y ago

At least in the samples I was presented, the more comprehensible articles were consistently the fake ones.

jpindar4y ago

Pretty easy, even in hard mode, and not due to any knowledge of the subject matter. I'm 15 - 0 so far.

I kept seeing certain types of grammatical error, such as constructs like "... and foo, despite foo, so..." or "with foo, but not foo..." where foo is the exact same word or phase appearing twice in a sentence.

I also kept seeing sentences with two clauses that should have agreed in number or tense but did not.

jpindar4y ago

"The structure of the HIV capsid is analysed by cryo-electron microscopy and cryo-electron microscopy at cryo-electron-microscopy resolution."

It really does like to repeat itself.

userbinator4y ago

This one made me laugh really really hard:

"This study presents the phylogenetic characterization of the beak and beak of beak whales; it is suggested that the beak and beak-toed beaks share common cranial bones, providing support for the idea that beaks are a new species of eutriconodont mammal."

1 more reply

writeslowly4y ago

I found it relatively easy to spot the fakes, but the titles on some of them were pretty good and made me wish they were real. Like reading science journals from a whimsical fantasy universe.

Some of my favorites: "A new onset of primeval black magic in magic-ring crystals"

"The genetic network for moderate religiosity in one thousand bespectacled twins"

"Thermal vestige of the '70s and '00s disco ball trend"

ivirshup4y ago

I initially hadn't realized these were meant to be abstracts (as the site doesn't say this). Knowing this makes hard mode much easier.

I'd been having trouble with ones which had a reasonable logical flow, but didn't communicate a complete idea.

Of course, pretty small N so YMMV

cblconfederate4y ago

It seems some of the fake ones could easily have been real (e.g. the one about the 3d structure of bound Ach receptor) . I guess the brevity of the text helps to make it make sense and to make it indistinguishable

NorwegianDude4y ago

Not hard at all. The fake ones doesn't make any sense from an English perspective. Looks like someone just picked the next word on a SwiftKey keyboard or something. "this word fits here... Right?"

hutzlibu4y ago

Nice advanced logic riddles.

blt4y ago

I wish there was a version of this for computer science. We don't have a broad flagship journal like Nature, so maybe it would need to be trained on a collection of IEEE and ACM venues.

gentleman114y ago

The major tell of these systems is writing something coherent over a span larger than a few paragraphs, so this is less impressive than it would have been 5 years ago. Still, well done

make34y ago

I'm surprised by how broken the english of GPT-2 is. A lot of sentences are just broken.

I would be curious to try again with GPT-3.

riquito4y ago

With discretion, php/bootstrap/jquery still do their job for the presentation layer

generalizations4y ago

I'm curious if the trained model is available. It would be very fun to play with.

arthurofcharn4y ago

Could we feed gpt-2 Turbo Encabulator? I want more Turbo Encabulator.

drusepth4y ago

This quick babble from 7 years ago [1] might scratch your itch while you train your own GPT-2 model. :)

[1] http://drusepth.com/series/how-to-speed-up-your-computer-usi...

thereddaikon4y ago

GPT-2 is Open Source. Nothing stopping you from training it on techno babble. GPT-3 is closed source.

et2o4y ago

I just got 10/10. This is not particularly difficult yet.

otabdeveloper44y ago

SaaS

Sokal-as-a-Service

f4304y ago

Progressively got tougher. I'm scared of the implications in like 20 years.

j / k navigate · click thread line to collapse

105 comments

jacquesm4y ago

Presumably when you're a native English speaker and have a broader interest the difficulty goes down a bit.

userbinator4y ago

and for sure it will cause trouble for search engines to classify real content from generated content.

zitterbewegung4y ago

I found that people would also refuse the test and would believe whatever the output of the model was due to my choice of subject.

Others that did a similar exercise and tried to verify their results using reddit had a great deal of people who would be able to spot fakes quite easily.

The biggest issue would be someone using a system to deliberately fool a targeted set of people which is easy given how ad networks are run.

lainga4y ago

jacquesm4y ago

Good one :) What's your score across 10 in sequence just looking at one? I get about 7 out of 10 that way if I repeat it a number of times.

jvanderbot4y ago

Quite easy when you know one is fake. Flagging fake articles in a review queue, by abstract only, and when none may exist all the way up to all being fake ....

Now that's a challenge.

pjc504y ago

jvanderbot4y ago

1. Happiness Hypothesis Jonathan Haidt 2. Buddhism and modern psychology, coursera

jrumbut4y ago

You could probably ask an undergraduate or pop science fan to write a paper title and first sentence of an abstract that would turn heads and get good results.

Faking an entire 10 page paper with figures and citations is much harder. I'm sure it'll happen next week, but until then I can still say that's where real understanding is demonstrated.

sdenton44y ago

"there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text."

So, basically it's achieved undergraduate level skills.

sellyme4y ago

> there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text

cwkoss4y ago

Makes you wonder if humans also have less "there" than we give ourselves credit for. How much of sentence construction is just repeating familiar tropes in barely-novel ways?

To what extent have our brains already decided what to say while we still perceive ourselves as 'thinking about the wording'?

dnautics4y ago

I was really impressed with gpt-2 but seeing this really gave me a feel for how much of a lack of understanding it has.

eMPee5844y ago

A striking argument, at this point in time. Quite likey this weakness will be overcome soon, when deep learning becomes integrated with [KR²]-spectrum methods...

[KR²] https://en.wikipedia.org/wiki/Knowledge_representation_and_r...

fock4y ago

lmm4y ago

What definition of "understanding" are you using? Why are you confident that the existing authors of pop-sci and political movements have it?

dilippkumar4y ago

Good point.

This challenge would be more interesting if there were "Neither is fake" and "Both are fake" buttons (and obviously, the test randomly showed two fake and two real articles in the mix)

brokencode4y ago

Absolutely. Especially once you factor in the fact that sometimes things are simply written poorly or don’t make sense, even if it was a real human being at the keyboard.

A lot of the communication I have with folks is subtly flawed in logic or grammar, but that doesn’t make me think I’m working with a bunch of androids.

It’s natural and often even necessary to try to figure out an author’s intent when their writing doesn’t fully make sense.

NorwegianDude4y ago

Considering the terrible quality of the writing in these examples, it's simple no matter how it's presented.

jcims4y ago

Excellent point. Serializing them would make it more difficult.

userbinator4y ago

Some of the fake ones are hilarious:

ronsor4y ago

Even hard mode isn't that hard because GPT-2 tends to ramble on while saying nothing substantive. If I can't figure out what a paper is supposed to be talking about, it's fake.

4/4 on hard. Never read a Nature paper before.

argvargc4y ago

Did way better on Hard mode than Easy. I think people get bored of doing this before we can see real indicative results.

Being as it is, at this juncture, I think the AI may still have a chance to be strong with this one.

meowface4y ago

(Easy/cliched joke, but)

>If I can't figure out what a paper is supposed to be talking about, it's fake.

Depends on the field...

thereddaikon4y ago

The medical and biotech ones are much harder.

1 more reply

iamgopal4y ago

Same for me, hard mode is quite easy, our brain is pattern matching Engine and constantly try to make sense of this and hence it may look like legitimate, it wouldn’t if text were 2D.

anigbrowl4y ago

N=4

drenvuk4y ago

That's close enough for this to be a success if the purpose was to persuade or fool laymen.

lkbm4y ago

This was a hard-mode fake I just got:

> A new era of hyperaridididididemia revealed by single-cell RNA-seq

So some are better than others. :-)

neltnerb4y ago

I also got one that was a fake that was talking about measuring two actual quantum properties simultaneously by using a third state to probe it indirectly.

quantum_mcts4y ago

carbocation4y ago

Easy mode is cake.

Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.

jmgao4y ago

> Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.

Yeah, I ran into at least one example which basically regurgitated a real paper. The "fake" article was:

    Efficient organic light-emitting diodes from delayed fluorescence

    A class of metal-free organic electroluminescent molecules is designed in which both singlet and triplet excitons contribute to light emission, leading to an intrinsic fluorescence efficiency greater than 90 per cent and an external electroluminescence efficiency comparable to that achieved in high-efficiency phosphorescence-based organic light-emitting diodes.

and the real one at https://www.nature.com/articles/nature11687 has:

    Highly efficient organic light-emitting diodes from delayed fluorescence

    Here we report a class of metal-free organic electroluminescent molecules in which the energy gap between the singlet and triplet excited states is minimized by design4, thereby promoting highly efficient spin up-conversion from non-radiative triplet states to radiative singlet states while maintaining high radiative decay rates, of more than 106 decays per second. In other words, these molecules harness both singlet and triplet excitons for light emission through fluorescence decay channels, leading to an intrinsic fluorescence efficiency in excess of 90 per cent and a very high external electroluminescence efficiency, of more than 19 per cent, which is comparable to that achieved in high-efficiency phosphorescence-based OLEDs

TuringTest4y ago

That's something I've been suspecting for a while. More powerful models produce overfitting, so it is likely that GPT-2 is simply memorizing whole texts and then regurgitating them in whole.

kurthr4y ago

Yes, I got a really short astronomical one about the discovery of a metallic core planet circling a G-Type star, and I only knew it was fake, because I would have heard about it!

jandrese4y ago

beforeolives4y ago

Cool demo.

drusepth4y ago

I use GPT-2/3 in creative writing to generate rough text that I then go back and edit/improve because it gives a good starting point (and often has a lot of high-quality factors).

bulldog134y ago

What platform do you use ? Is it all run from home ?

drusepth4y ago

parksy4y ago

CamperBob24y ago

Here's an example I just ran into today. This is post #17 in a thread on a car forum where people are discussing whether or not to run the A/C periodically in the winter:

https://www.planet-9.com/threads/air-conditioning-compressor...

Screenshot including the spam post, in case it's removed: https://imgur.com/GgCkp1r

This is going to suck.

minimaxir4y ago

For fun.

aabhay4y ago

Fun? Show me the fun!

1 more reply

codeflo4y ago

minkzilla4y ago

Or you just train a machine to do it and then generate a bunch and have this second machine sort out any it thinks are machine generated.

truth_4y ago

The best fake-detecting model detecting fakes generated by the best generator model will always lag behind the latter model.

minkzilla4y ago

I think I see what you’re saying, but why is this so?

1 more reply

toxik4y ago

You basically just described a GAN. Neat!

minimaxir4y ago

GANs work by feeding back the mistakes and forcing the generator model to improve its cheating. In this case, filtering out titles that are ambiguous would act as an independent filter.

1 more reply

ta9884y ago

I've reviewed articles that were completely made up and the other reviewer didnt even detect that. Nor did the editor.

matthewdgreen4y ago

ta9884y ago

I'm talking about top of the field journals unfortunately.

not2b4y ago

CSDude4y ago

bulldog134y ago

I am interested as well. I can't seem to find a decent end-to-end article on this.

supermatt4y ago

Even on hard, if you understand the terminology, the fake ones are mostly gibberish.

oceliker4y ago

If you don't understand the terminology, any paper is gibberish :) but I agree, I can detect fake ones fairly reliably in biology, but not in e.g. astronomy.

febed4y ago

You don't even really need to know the terminology to detect the fakes.

aidenn04y ago

Right, I could pick out all the hard ones for fields I'm comfortable with, but struggled for some of the easy ones in fields I was less familiar with.

hazeii4y ago

On easy, 7 correct and 0 wrong was enough to tell me that yes, I can (I have been reading Nature for years though).

caslon4y ago

On easy, ten correct and zero wrong was enough for me to realize the same (I have never read Nature).

Trying hard right now, will report results after.

Edit: Yeah, it's the same deal. Length becomes less of a giveaway but its errors become more obvious.

zwaps4y ago

Same here for hard mode.

Still interesting!

make34y ago

anon_tor_123454y ago

https://en.wikipedia.org/wiki/List_of_scholarly_publishing_s...

edit: don't understand why i'm getting downvoted. is my comment not relevant to a post about the plausibility of abstracts generated by ML models?

Mordisquitos4y ago

dnautics4y ago

10/10 on easy and 10/10 on hard. Hard selections seem mostly hard because they are short enough you don't see gpt-2 to go off the rails with something completely nonsensical.

jackcviers34y ago

Another great corpus for complex writing is public law books. Have it compare real laws from the training set with fake laws. I bet it would be very difficult to figure out the fake laws.

It is getting really, really close to being able to fool any layman, though. Impressive work!

dougb54y ago

th0ma54y ago

anigbrowl4y ago

karagenit4y ago

minimaxir4y ago

That's mostly a GPT-2/Transformers quirk. Some approaches apply a repetition penalty to work around it.

varispeed4y ago

Mordisquitos4y ago

finin4y ago

f6v4y ago

The sad thing is that often there's an equal mental effort to read GPT articles and the real ones. It's as if people are trying to make their papers as incomprehensible as possible.

FredPret4y ago

Incomprehensible language = look how smart I am now give me grant money

mattkrause4y ago

Nah--If you're publishing in Nature, you're already well beyond that game.

It's tough to do well, and even moreso for non-native English speakers.

kangalioo4y ago

At least in the samples I was presented, the more comprehensible articles were consistently the fake ones.

jpindar4y ago

Pretty easy, even in hard mode, and not due to any knowledge of the subject matter. I'm 15 - 0 so far.

I also kept seeing sentences with two clauses that should have agreed in number or tense but did not.

jpindar4y ago

"The structure of the HIV capsid is analysed by cryo-electron microscopy and cryo-electron microscopy at cryo-electron-microscopy resolution."

It really does like to repeat itself.

userbinator4y ago

This one made me laugh really really hard:

1 more reply

writeslowly4y ago

I found it relatively easy to spot the fakes, but the titles on some of them were pretty good and made me wish they were real. Like reading science journals from a whimsical fantasy universe.

Some of my favorites: "A new onset of primeval black magic in magic-ring crystals"

"The genetic network for moderate religiosity in one thousand bespectacled twins"

"Thermal vestige of the '70s and '00s disco ball trend"

ivirshup4y ago

I initially hadn't realized these were meant to be abstracts (as the site doesn't say this). Knowing this makes hard mode much easier.

I'd been having trouble with ones which had a reasonable logical flow, but didn't communicate a complete idea.

Of course, pretty small N so YMMV

cblconfederate4y ago

NorwegianDude4y ago

Not hard at all. The fake ones doesn't make any sense from an English perspective. Looks like someone just picked the next word on a SwiftKey keyboard or something. "this word fits here... Right?"

hutzlibu4y ago

Nice advanced logic riddles.

blt4y ago

I wish there was a version of this for computer science. We don't have a broad flagship journal like Nature, so maybe it would need to be trained on a collection of IEEE and ACM venues.

gentleman114y ago

The major tell of these systems is writing something coherent over a span larger than a few paragraphs, so this is less impressive than it would have been 5 years ago. Still, well done

make34y ago

I'm surprised by how broken the english of GPT-2 is. A lot of sentences are just broken.

I would be curious to try again with GPT-3.

riquito4y ago

With discretion, php/bootstrap/jquery still do their job for the presentation layer

generalizations4y ago

I'm curious if the trained model is available. It would be very fun to play with.

arthurofcharn4y ago

Could we feed gpt-2 Turbo Encabulator? I want more Turbo Encabulator.

drusepth4y ago

This quick babble from 7 years ago [1] might scratch your itch while you train your own GPT-2 model. :)

[1] http://drusepth.com/series/how-to-speed-up-your-computer-usi...

thereddaikon4y ago

GPT-2 is Open Source. Nothing stopping you from training it on techno babble. GPT-3 is closed source.

et2o4y ago

I just got 10/10. This is not particularly difficult yet.

otabdeveloper44y ago

SaaS

Sokal-as-a-Service

f4304y ago

Progressively got tougher. I'm scared of the implications in like 20 years.

j / k navigate · click thread line to collapse