What are the limits of ChatGPT and current natural language processing?

Taking inspiration from the century old Gödel's completeness theorem to define the limits of natural language processing and peer into what might the next breakthroughs be…

Dec 23, 2022

In the last 5 years, there have been tremendous research breakthroughs in the field of natural language processing. We are starting to reach near human level accuracy in many language tasks including question answering, summarization, dialogue, translation, etc. But what are the limits of natural language processing?

Benchmark performance over time with initial and human performance are normalized to -1 and 0 respectively (Kiela et al., 2021).

Before we get to the bottom of what the limits of current AI algorithms are given the types of data and models - let's talk about something marvelous that we do as humans. Let's say I show this clock below. Now if you are like me, you've probably never seen this language before - but still, you can tell the time up to nearly 1-5 minutes accuracy.

Mongolian Clock | https://www.deviantart.com/eye-of-tengri/art/Mongolian-Clock-549818536

Pause and give yourself a pat on the back! Think about how remarkable you are - you have never read or even seen Mongolian script before, yet you have no trouble reading a Mongolian clock! But this might not work for an AI algorithm that's been trained on one specific task. A computer vision algorithm trained to specifically recognize time based on images of clocks along with time labels, might do horribly when applied to a Mongolian clock. For example, due to lack of training data - this Tesla mistook a yellow moon for a traffic light. Those of us who have never seen a yellow moon, would never mistake it for anything other than the Moon (or at least something in the sky).

Tesla Autopilot Mistakes Moon For Traffic Light | Jordan Nelson

Recent Breakthroughs in Large Language Models and Prompting

Recently, there have been major breakthroughs in NLP. Large language mode are trained on large corpuses including Wikipedia articles and books to understand languages; rather than perform specific NLP tasks right at the beginning. These models have been shown to perform better than the previous state-of-the-art task based models; when applied to such tasks downstream.

“Few-shot Learning” has take the NLP world by storm - basically the idea that large models with enough parameters can reach high accuracies when trained only on a few examples. (Going back to the idea of telling time from Mongolian clocks even though you have seen few/no such clocks!)

What is also impressive is that in the last year or so, is that these models have shown that they can “reason” through complex problems that they have never really seen before, and give correct answers.

I don't know about you - but when I see a computer program reasoning so logically, it is quite awe inspiring and even scary at times! Awe inspiring because there are so many applications in various domains like education, business, etc. and scary because this model seems to articulate reasoning better than I would!

Enter ChatGPT

Over the last 2 weeks, ChatGPT has been taking the world of natural language processing by storm. It is a large language model (GPT-3) that has been “fine-tuned” on a small amount of data to engage in dialogue as a chat bot. It is pretty amazing as it can seem to do multiple tasks like answer questions, generate code, you name it!

So what are the limits of ChatGPT? Surely it cannot answer every question in the world right? It can try to, but sometimes it goes horribly wrong:

https://www.theatlantic.com/technology/archive/2022/12/openai-chatgpt-chatbot-messages/672411/

To formalize the limitations of ChatGPT and language models in general, let's take inspiration from Gödel's Theorems.

Gödel's Completeness Theorem

Gödel's Completeness theorem states that:

The completeness theorem says that if a formula is logically valid then there is a finite deduction (a formal proof) of the formula.

How does this apply to natural language processing? Well let's ask ChatGPT the very same question:

So does this mean that future advancements in large language models and the field of NLP in general can lead to AI models that can basically answer any question? Not quite, because of Gödel's Incompleteness Theorem.

Gödel's Incompleteness Theorem

First Incompleteness Theorem: "Any consistent formal system F within which a certain amount of elementary arithmetic can be carried out is incomplete; i.e., there are statements of the language of F which can neither be proved nor disproved in F." (Raatikainen 2020)

And this is what ChatGPT says about the Incompleteness Theorem:

On the surface it seems like the statements about the Completeness and Incompleteness theorems are contradictory. In one case, the completeness theorem appears to say that in principle one can develop NLP systems that can prove all truths in a language. On the other hand, the incompleteness theorem appears to say there are limitations to what such NLP systems can do. Choose a side Gödel!

If you look closely, there is the mention of “first-order logic” in ChatGPT's discussion on the incompleteness theorem as it pertains to NLP. So what is first-order logic? Essentially for NLP, first-order logic refers to the idea of pulling basic information from non-logical objects. For example, first-order logic exercises establish the equivalence between the following expressions:

"Socrates is a man" "there exists x such that x is Socrates and x is a man"

So if I trained a question-answering model to answer “Who is Socrates?” solely based on this sentence - I should get the answer "Socrates is a man"- or something of the sort. And if I trained a model to answer “Who is x” from the second sentence, I should get “X is Socrates, who is a man,” or something of the sort.

However, if I asked non first-order logic questions like “How frequently does S appear in the English language?" I would get a biased answer based on the training set.

You can see that the first answer ("Socrates is a man") is technically correct, whereas the second answer ("S appears blah% of the time") cannot be proved or disproved. This is where the limitations of NLP models come into play. Here's ChatGPT explaining its limitations:

Future Breakthroughs

We've seen that ChatGPT is very powerful, but has it's own limitations. While data is one of the biggest limitations, even if we have a model trained on nearly all types of data - there is still a mathematical limit that is reached, as pointed out by the Incompleteness theorem.

My take on this is that basically - let's say after a decade we have a super-huge model trained on nearly all text data would perform just as well as a human would perform, if they have never seen anything else apart from reading books and articles on the Internet (no Mongolian clocks - no visual stimuli apart from reading, no ears, no sense of touch, no taste, and no smell). If you asked such a model to even tell you about traffic, it would still say something like this:

I don't think the same line of research - improving large language model size, training data, computational efficiency would lead to any vast improvements over this. Rather, spilling these models over multiple sensory tasks would yield huge benefits in my opinion. Right now the largest models have 100's of billions of parameters. If we move to 100's of trillions in the next decade or however long it takes, we could start spreading across multiple sensory tasks.

The idea of few-shot learning is extremely promising and does not apply only to language tasks. Think about the potential for having a device (let's call it DeviceX) that can do multiple intelligent tasks based on few inputs. You could say “DeviceX plan me an entire week's family vacation in a European country.” Utilizing few-shot learning by training on your individualized data could be as good as you or your spouse planning an entire family vacation. It would take into account your monetary status, kids schedules, food preferences, time preferences, etc. and do the necessary bookings.

Or if you say “DeviceX I'm sick, do the needful-” it could login to your employer HR system, register for sick leave, make hot teas through the day, tell your significant other if they are out, make a doctor's appointment, etc.

If we continue rapid progress in AI research and application and the spillover across multiple domains, the impacts could be as big as when the Internet entered society two decades ago.

If you enjoyed this post, please share on social media or even just one person you think might enjoy holistic perspectives on the interconnections between technology and modern societies. Feel free to also post any comments in the post discussions on the cyber-physical substack page. This is a small, but growing effort and I hope that I can share in my journey in understanding and building resilient societies.

Real-World AI

Discussion about this post