An AI-generated fairy tale: the road to spam hell

Author: Piotr Czapla

Since yesterday, we have had a deep learning model that can generate fairy tales that are consistent and almost completely logical. You could easily read such stories to your kids, and that is terrifying news.

I bet you have heard the stories of fake scientific papers generated by AI. They were rubbish: written in good English and sounding convincing, but without a logical train of thought. This kind of text was easy to generate, but it wasn’t much of a threat. Unless it hid behind fancy terms, it was easy to spot, because it was at times illogical and incoherent.

If you haven’t seen such text before, here is an example generated by a language model that we trained in 2018 on ~15% of English Wikipedia:

John Wood , Jr . , president of The National Association for the Re - Creation of the World ( NAVY ) and World War ii Victory Director, from which he also made a successful article. During the "Red Scare", he stated that "You shall be to America's free West Side", and that "[t]he French , Russian, , Indian, Canadian, Australian, Japanese, German, and Italian alike must be taken seriously. " Daniel Craig said of Smith, " The Americans are a great deal for a country of our own, and are very proud of us …..

You can see that it is readable English, but it lacks any coherence, so it is easy to spot that it is fake. Here is another example that is a bit trickier, as it comes from a model trained on medical texts:

…of bilateral leg veins that was ruled out by inline) it. had at thistime obtained w/o sign of carotid stenosis which represents anterior ventricular serosanguinous values, no obvious source of infection seen. Cxr suggestive of normal superior months.disp:*60 toxins. 2057 and active vs, pulses normal, no focal deficits. per team thought to be due to bloodculture, that is not consistent with an ischemic event and considering slow outpatient monitoring was felt to be secondary to her recent 4:42 (right sided thigh hypotonic). iv access was ….

This text was generated by fastai student Binal Patel.

Given the number of complex words, you might be fooled into thinking that this diagnosis was written by a real person.

But what OpenAI presented two days ago is in a completely different league, even though it uses concepts very similar to the models above. Their model, called GPT-2, also uses deep learning. The difference is the amount of data they trained it on (8 million web pages) and the complexity of the architecture: they used very heavy models called Transformers. But have a look yourself. Here is the input that the model was given as the starting point for a news article (this text was written by a human):

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

And here is the story generated by GPT-2, with some of my thoughts below:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them — they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

Have you noticed how consistent the model is in the way it talks about Dr. Jorge Pérez’s discovery? It started with his full name and later used just the last name, it knew that unicorns have horns, and it imagined interbreeding between humans and unicorns in the past to explain why they speak English. It lost its train of thought a bit, but I didn’t notice that when I read the text for the first time.

This blew my mind and terrified me at the same time. Working with this kind of model, I obviously knew that a day was coming when we would have models able to generate news with coherent thoughts. But I wasn’t expecting to see it so soon.

Fortunately, OpenAI, an NGO co-founded by Elon Musk, is taking its job of mitigating the risks of AGI very seriously. Knowing how much damage the model could cause, they haven’t published the weights of the full model. And if we consider how much such models cost to train (GPT-1 cost ~17k USD, and GPT-2 is 10x larger and trained on 10x more data), we won’t see spammers using them any time soon.
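To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It starts from the figures above (~17k USD for GPT-1, 10x larger model, 10x more data) and rests on one simplifying assumption that is mine, not OpenAI’s: that training cost grows proportionally with both model size and data size. The function name is hypothetical, and the result is an order-of-magnitude guess, not a published number.

```python
def estimate_training_cost(base_cost_usd, model_scale, data_scale):
    """Scale a known training cost to a bigger model.

    ASSUMPTION: cost is proportional to model size times data size.
    Real costs depend on hardware, efficiency and pricing.
    """
    return base_cost_usd * model_scale * data_scale


# Figures quoted in the text above.
gpt1_cost_usd = 17_000   # approximate GPT-1 training cost
model_scale = 10         # GPT-2 is about 10x larger
data_scale = 10          # and trained on about 10x more data

gpt2_estimate_usd = estimate_training_cost(gpt1_cost_usd, model_scale, data_scale)
print(f"Rough GPT-2 training cost estimate: {gpt2_estimate_usd:,} USD")
```

Under that assumption the estimate lands in the millions of dollars, which is why training such a model from scratch is out of reach for a typical spammer, at least for now.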

But I’m still worried about what kind of world we are going to live in.

The good news is that text generation is only one of many ways to use this kind of model. You can use such models to train question-answering systems, to classify text, and to extract information from unstructured data. Hopefully, we can also use them to check the sources of news articles.

If you want to know more about less terrifying and more useful ways of using such models, feel free to get in touch with us. We are working on adapting such breakthroughs to business tasks in all European languages.

The article was based on this OpenAI blog post