
Large Language Models in AI: A Guide for Beginners


Written By Cole Ingraham


Edited By Jess Feldman

Last updated on November 15, 2023


Large Language Models (LLMs) are the cool new kid on the block. Ever since ChatGPT took the world by storm in November 2022, you cannot even walk down the street without overhearing mentions of it – some positive and some negative. There is no question that we are living in a changed world. However, in order to better understand the LLMs of today, it is important to understand where they came from.

In this guide, we’ll trace the evolution of LLMs back to their earliest ancestors, looking at what they are, what they are not, and how they can be used. LLMs, especially the instruction-tuned variety, are impressive tools that many people thought were still decades away, but they are also not the end of the story. As long as you understand what they are, how to use them properly, and keep an eye on the output, they can make some tasks that were once nearly impossible almost trivial. (If you're ready to go further with LLMs, check out the Data Science Bootcamp with Machine Learning at NYC Data Science Academy!)

A Brief History of Natural Language Processing

There is a long history of trying to understand language mechanically. Noam Chomsky's work on formal grammars (which led to tools such as regular expressions, pushdown automata, and others) allows us to detect structure in text beyond simply searching for exact matches to some query. This work forms the backbone of parsers, which, among other things, allow programming languages like C, Python, and others that are better suited for people to work in to be translated into machine code for execution. In general, this research allows for understanding the structure of text.

Beyond purely structural methods, natural language processing (NLP) generally seeks to allow computers to detect and understand the semantics of language. For example:

  • Part-of-speech tagging to extract the structure of sentences;
  • Named-entity recognition, which aims to detect proper nouns, even ones never seen before;
  • Text classification such as sentiment analysis, so that we can tell whether a statement expresses a positive or negative feeling;
  • Clustering and embeddings to measure the similarity between words or documents; chatbots; and much more.

These methods are focused on autonomous understanding of language.
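As a concrete illustration of a couple of these tasks, here is a minimal sketch of part-of-speech tagging and named-entity recognition using the spaCy library (the library choice and example sentence are my own; any NLP toolkit would do, and the small English model must be downloaded first):

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    # Part-of-speech tag for each token
    print([(token.text, token.pos_) for token in doc])

    # Named entities detected in the sentence
    print([(ent.text, ent.label_) for ent in doc.ents])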

What are Language Models?

Language models are a subset of NLP that aims to learn a probabilistic model of how language works. The earliest were relatively simple statistical models based on n-grams, which assume that the probability of the next word depends only on the previous n-1 words. This is effectively an (n-1)th-order Markov chain: given the most recent words, you can make an educated guess as to what the next word will be, based on the statistics of the text corpus used to train the model.

Even with such a crude approach, simple language models can be useful. For example, the autocomplete on your phone has traditionally been powered by n-gram models, and spell checkers can use the same models to flag low-probability words.
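To make the idea concrete, here is a toy bigram (n = 2) model in Python that counts which word follows which in a tiny corpus and predicts the most frequent continuation. This is a minimal sketch for illustration only:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate the fish".split()

    # Count how often each word follows each other word (bigram counts)
    following = defaultdict(Counter)
    for current_word, next_word in zip(corpus, corpus[1:]):
        following[current_word][next_word] += 1

    def predict_next(word):
        # Return the most frequent continuation seen in the training corpus
        counts = following[word]
        return counts.most_common(1)[0][0] if counts else None

    print(predict_next("the"))  # 'cat' (seen twice, vs 'mat' and 'fish' once each)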

Autoregressive vs Masked Language Models

For language models, there are basically two types: autoregressive and masked.

  • Autoregression basically means using the input to predict one step into the future – the step can be a character, word, or other token – then adding that prediction back onto the original input and repeating until you decide to stop. GPT-like models and the sentence suggestions in Google Workspace are examples of autoregressive models.
  • Masked language models, on the other hand, take some input, "mask off" (replace with some generic value) a portion of it, and use the model to predict what goes in the masked portion. This task aims to build a better understanding of words and how they relate to each other in context, regardless of whether they fall before or after the mask. One of the most famous examples of masked language models is BERT and its variants. Grammar suggestions that appear after you have written a full sentence can be powered by this approach.

The appeal of autoregressive language models is that they can generate more text from a short input. This, however, means that imperfections in the choice of each next token (prediction errors) accumulate over time. The more tokens the model has produced, the more they influence future predictions, potentially causing the output to become incoherent, or worse, incorrect (referred to as hallucination).

Masked language models, on the other hand, are typically used for tasks such as making text searchable by meaning, called semantic search, because they can encode long passages into representations that capture how similar their contents are.
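Both behaviors are easy to see with the Hugging Face transformers library (an assumed dependency here, along with a backend such as PyTorch; the model names are just common defaults). The first pipeline extends a prompt autoregressively, the second fills in a masked word using context on both sides. A minimal sketch:

    from transformers import pipeline

    # Autoregressive: keep predicting the next token and appending it to the input
    generator = pipeline("text-generation", model="gpt2")
    print(generator("The weather today is", max_new_tokens=10)[0]["generated_text"])

    # Masked: predict what belongs in the [MASK] position
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    print(unmasker("The capital of France is [MASK].")[0]["token_str"])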

Large Language Models

The modern language models we have all become familiar with are built from artificial neural networks, thanks to breakthroughs in deep learning that allow large models to be trained on large amounts of data, which translates into improved quality and accuracy. However, the larger the model gets, the more data is required to train it. Most traditional approaches relied purely on supervised learning directly on the task in question. For example, if you want to perform machine translation from English to French, you need a dataset consisting of pairs of the same sentence in both languages, in sufficient quantity for the model to generalize well. How many examples are sufficient increases with the size of the model.

The scarcity of labeled data is not limited to natural language problems. Many domains have been working on ways to train larger models on smaller supervised datasets. In computer vision, the primary method is image augmentation, where multiple variations of a single image are created using transformations that are simple but preserve the overall integrity of the image, such as slight rotations, random crops from various locations, subtle color adjustments, and so on.
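For reference, this is roughly what such augmentation looks like in code. A minimal sketch using torchvision (an assumption on my part; the specific transforms and parameters are illustrative):

    from torchvision import transforms

    # Each pass through this pipeline yields a slightly different version of the image,
    # effectively multiplying the size of a labeled dataset.
    augment = transforms.Compose([
        transforms.RandomRotation(degrees=10),                  # slight rotations
        transforms.RandomResizedCrop(size=224),                 # random crops from various locations
        transforms.ColorJitter(brightness=0.2, contrast=0.2),   # subtle color adjustments
        transforms.RandomHorizontalFlip(),
    ])

    # augmented = augment(original_image)  # original_image would be a PIL image or tensor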

With language, these same augmentations make no sense. However, a plentiful source of unlabeled data is any text you can find, for example the internet. What can you do with unlabeled text? Language modeling. Although language modeling may not be the task we ultimately want to solve, the expectation is that a model with sufficient capacity, trained on enough general language, should learn enough about the underlying characteristics of language to adapt more quickly to specific tasks later, and with less labeled data.

The go-to neural network for NLP had up until then been the recurrent neural network (RNN), which worked well but had one big problem: it processes text one token at a time, in order. While there is nothing theoretically wrong with this, it makes training slow and makes it difficult to fully take advantage of accelerators such as GPUs, which excel at parallel computation.

A Brief History of LLMs

In 2017, the Transformer architecture was introduced in the "Attention Is All You Need" paper, enabling models to transform inputs into outputs by learning to "pay attention" to specific parts of the input. In 2018, Google released BERT and OpenAI introduced GPT, both leveraging Transformers for language understanding. GPT-1 was trained largely on fiction books. Scaling up data, compute, and model size was found to improve performance, leading to the release of GPT-2 in 2019. GPT-3, released in 2020, had 175 billion parameters and demonstrated few-shot learning, allowing it to perform tasks given only a handful of examples. This era also saw the emergence of "prompting" as the way to interact with models effectively.

The name that everyone at this point has likely heard of, ChatGPT, came out in 2022. Suddenly there was a model that you could ask to perform various tasks using natural language. What is notable is that it is really still GPT-3 with some modifications to make the now ubiquitous chat-style interaction work. Shortly after, in 2023, OpenAI released GPT-4, which is generally a more capable version of ChatGPT. Not much is known about the exact scale of GPT-4, as OpenAI has chosen to protect its competitive position rather than be open about its research.

Natural Language as a User Interface: From GPT-3 to InstructGPT

If most people had to name one thing that makes language models like ChatGPT useful, they would probably point to the ability to converse with them in natural language. At first this seems like something a language model should be able to do fairly well, but it turns out that, even at sufficient scale, pure language modeling is not enough for that behavior to emerge on its own. The language modeling task is great at uncovering the structure of language, but not all examples of language are conversational.

  • Language models like GPT-3 are intended for prompt completion (take a look at OpenAI's documentation). As the name implies, the model will take whatever prompt you provide and add to it, as if completing a partial text. Considering that this is the exact task these models were trained to do, that makes perfect sense. Although many people successfully used GPT-3 for various tasks, it required figuring out how to construct a prompt that would cause the model to do what you want in a purely "complete the sentence" fashion, which is not the most natural thing to do and is sometimes extremely difficult or impossible (a sketch contrasting the two prompting styles follows this list).
  • Between GPT-3 and ChatGPT, OpenAI was working on a family of variants called InstructGPT. This research aimed to explore how to make large language models respond to instructions rather than only extending the prompt. There is a great video by Andrej Karpathy explaining the mechanisms used to do this: supervised fine tuning and reinforcement learning from human feedback (RLHF). Supervised fine tuning is simply taking a pre-trained LLM and fine tuning it on a dataset of instruction/response pairs. RLHF is basically the idea that you can ask people to rate how good responses are, train a separate model to predict those ratings, then use that model to fine tune the LLM to give better responses.
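To make the contrast from the first bullet concrete, here is a rough sketch of how the same translation task might be posed to a pure completion model versus an instruction-tuned one (the prompts below are my own illustrations, not taken from OpenAI's documentation):

    # Pure completion model (GPT-3 style): frame the task so that simply continuing
    # the text produces the answer, often with a few worked examples ("few-shot").
    completion_prompt = (
        "Translate English to French.\n"
        "English: Where is the library?\n"
        "French: Où est la bibliothèque ?\n"
        "English: I would like a coffee.\n"
        "French:"
    )

    # Instruction-tuned model (InstructGPT/ChatGPT style): just ask directly.
    instruction_prompt = "Translate 'I would like a coffee.' into French."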

Use Cases: How to Use LLMs

Now that we have AI that can respond to natural language questions and requests, what can we do with it? Most people have heard of, if not used, ChatGPT, Bing Chat, Google Bard, Claude, or one of the other chat assistant-style uses of LLMs out there. Whether you are interacting with them from a user interface or through an API, there are many tasks you can use these models for, and various ways to augment them.

Any instruction-tuned LLM should be good at everything the original base model was, which includes things like in-context learning. This means that tasks like summarizing a document or answering questions based on information you provide work well. Asking about anything that existed in the training data should also give reasonable results, although there are limitations and caveats.
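For example, answering a question from information you supply in the prompt (in-context learning) might look like the following sketch, written against OpenAI's pre-1.0 Python client interface (the model name, snippet of text, and question are all illustrative assumptions):

    import openai  # pre-1.0 interface; assumes OPENAI_API_KEY is set in the environment

    # A made-up snippet of text supplied directly in the prompt
    document = "Acme Corp reported revenue of $12M in Q3, up 20% year over year."

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided text."},
            {"role": "user", "content": f"Text: {document}\n\nQuestion: How much did revenue grow?"},
        ],
    )
    print(response["choices"][0]["message"]["content"])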

2 Ways to Adapt an LLM to Your Data

What happens when you have questions about a very large document that the model was not trained on? Currently, all LLMs have a maximum context length, which caps the total number of tokens they can process, counting both the input and the output. There are a couple of methods for dealing with this:

  1. One approach is to fine tune the model on your data. If you have the ability to do so, and your data does not change very frequently, this can be a great option. Not exactly a fine tune, but in early 2023 Bloomberg announced BloombergGPT, about which very little is still known. Some speculate that it will be a natural language interface to the Bloomberg Query Language (BQL). Another example is Amazon CodeWhisperer, a code generation LLM specifically trained on AWS code, making it ideal for those working in that ecosystem. Since neither BQL nor most of the AWS software development kit code changes extremely frequently (some would argue with me regarding the latter, I'm sure), fine tuning a model makes sense and allows input prompts to be shorter, since you do not need to provide examples.
  2. When fine tuning is not an option, because the data you need to interact with changes frequently or you do not have enough of it, you can lean on in-context learning and put whatever you need into the prompt along with your question or task. Due to the context limit, however, you often cannot use some documents simply because they are too long. For this, you can use something called Retrieval Augmented Generation (RAG), which effectively means looking up relevant information and only putting that in the prompt. This can be done through traditional keyword search, or through semantic search, which matches things based on their meaning rather than their exact content. RAG effectively extends a language model with the ability to look up information from a database, which can be helpful when questions can be answered by summarizing multiple passages of a long text, for example (a sketch of the retrieval step follows this list).
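Here is a minimal sketch of the retrieval half of RAG using the sentence-transformers library for semantic search (the library choice, model name, and passages are my own assumptions for illustration); the retrieved passage is then pasted into the prompt along with the question:

    from sentence_transformers import SentenceTransformer, util

    passages = [
        "The warranty covers manufacturing defects for two years.",
        "Returns are accepted within 30 days with a receipt.",
        "Shipping typically takes 3 to 5 business days.",
    ]
    question = "How long do I have to return an item?"

    # Embed the passages and the question into the same vector space
    model = SentenceTransformer("all-MiniLM-L6-v2")
    passage_vecs = model.encode(passages, convert_to_tensor=True)
    question_vec = model.encode(question, convert_to_tensor=True)

    # Retrieve the passage whose meaning is closest to the question
    scores = util.cos_sim(question_vec, passage_vecs)[0]
    best = passages[int(scores.argmax())]

    # The prompt sent to the LLM now contains only the relevant context
    prompt = f"Context: {best}\n\nQuestion: {question}\nAnswer using only the context."
    print(prompt)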

👩‍💻 NYC Data Science Academy students learn about and use LLMs during the bootcamp! NYC Data Science Academy students have used their knowledge of LLMs to found a company called intrinsica.ai. Find more exciting use cases from these students!

Open Source Models vs Closed Models

When choosing an LLM to build with, there are a number of practical things to consider. Both open source and closed models require hardware capable of handling the model size, and clustering may be necessary for extremely large models. The comparison below weighs open source models against closed models in terms of cost, performance, transparency, and control over updates.

Cost and Accessibility
  • Open source models: Some open source models and research are accessible to the broader community, promoting transparency and collaboration.
  • Closed models: Closed models and research may be kept private to gain a competitive edge, limiting accessibility.

Performance
  • Open source models: Open source models are catching up in performance but may not surpass some closed models like GPT-4.
  • Closed models: Closed models are currently the best-performing options available.

Transparency and Data Privacy
  • Open source models: Open source models provide full access to model details, training data, and deployment options, enhancing transparency and addressing data privacy concerns.
  • Closed models: Closed models offer limited information about their training data and data usage, raising data privacy concerns.

Control and Updates
  • Open source models: Users can deploy and maintain open source models themselves, ensuring control and the ability to address potential issues.
  • Closed models: Parent companies of closed models can make changes in offerings and terms without user input, potentially causing disruptions, and they routinely update the underlying model, potentially affecting existing applications without prior notice.

3 Limitations of LLMs

LLMs have gotten a ton of attention and hype, but it is important to understand what they are not good at. One of the main reasons LLM adoption has been so rapid is their generality. The fact that you can interact with them in your own language, ask them questions across almost any domain, and request that the response be provided in a particular format makes them very easy and practical to use: if the only tool you have is an LLM, then every problem looks like prompt engineering. They can answer questions directly, translate natural language into structured data, and extract and synthesize information, all with the same intuitive interface.

But LLMs are not perfect. Even Yann LeCun, a pioneer of convolutional neural networks, has criticized autoregressive language models! Here are 3 notable drawbacks of LLMs:

  1. Hallucinations, meaning plausible but inaccurate output. This is a direct result of the way these models generate text: by choosing the next token from among the most likely options. If for whatever reason the model gets into a state where the chosen token pushes the output in a strange direction, the model will keep following it with the most likely continuation. While there is no known way to completely prevent this, there are ways to reduce the chances of it happening. For example, if you are using RAG, since you are asking questions about passages you put directly in the prompt, the model will pay closer attention to the prompt than to anything else, reducing the chance of hallucination.
  2. LLMs are, by themselves, horrible at math. This has to do partly with how limited the examples of math are in the training data, which is mostly the open internet. If you look around you will probably find multiplication tables, double-digit addition, and other elementary school resources, but since these models produce numbers digit by digit, the result is again based on maximum likelihood rather than an intrinsic understanding of math. This has been mitigated in some cases by extending the LLM with something that is good at math, such as a calculator or Wolfram Mathematica (a toy sketch of this idea follows the list).
  3. LLMs generally only know what they were trained on. For ChatGPT the "knowledge cutoff" was December 2021, as that is the most recent data they gathered before training it. This means that the model would not be able to answer questions about anything that happened after that time without being retrained, or by having access to more up to date information. Again, techniques like RAG, or even having access to search the internet can get around this limitation.
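Here is that toy sketch of the calculator idea from point 2: instead of trusting the model's arithmetic, you prompt it to emit a plain expression, evaluate the expression exactly in code, and use that result. Everything below (the example expression, the helper function) is hypothetical and for illustration only:

    import ast
    import operator

    # Safely evaluate a plain arithmetic expression instead of trusting the LLM's arithmetic
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expression):
        def walk(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expression, mode="eval").body)

    # Imagine the LLM was prompted to answer with an expression such as "1234 * 5678"
    llm_output = "1234 * 5678"
    print(evaluate(llm_output))  # 7006652, computed exactly rather than guessed token by token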

Are LLMs dangerous?

All advancements come with their share of fears and criticisms, and LLMs are no exception. These generally come in three flavors: "will it kill us?", "what did you train on?", and "what do you do with my data?"

Earlier this year, the "Godfather of AI" Geoffrey Hinton left Google due to his concerns about the dangers of models such as LLMs as they become increasingly powerful. Much like with LeCun, when the Godfather of AI says something, it is worth listening, but I would caution against immediately jumping to Terminator, The Matrix, or other sci-fi dramatizations. AI safety has been a discipline for much longer than LLMs, or even deep learning, have existed. You also have companies like OpenAI, Anthropic, and others putting up "guardrails" to make sure their LLMs behave "in an appropriate manner." Why a company that will not disclose any specific details about how its model works or was trained should be in charge of deciding the "right" way to use it is a different discussion.

Where and how training data is gathered is another hot topic. Universal Music sued Anthropic over Claude being able to reproduce song lyrics without rights (although you can find them using Google just as easily). Reddit dramatically increased its API prices to stop LLM makers from profiting off of its data for free. Twitter (or X now, I guess) did the same. Pretty much everywhere you turn, there is someone trying to recoup a piece of the pie that products like ChatGPT have created off the back of their data (data that was likely acquired illegally).

For users who work with sensitive information, an important question is "what do you do with the prompts I send?" OpenAI says it only retains chats for use as your history, and any that you delete will be removed within 30 days. However, many people in fields such as medicine, law, and finance are hesitant to send anything too private to ChatGPT, because we do not really know for sure that it is not trained on. They say they do not, but you never really know.

About The Author

Cole Ingraham

Dr. Cole Ingraham is a musician and composer turned software engineer and data scientist who currently teaches at NYC Data Science Academy as Lead AI Instructor.
