Am I doing this right? Three factors that make learning with and about LLMs difficult

When I started learning about large language models (LLMs), I hit a wall almost immediately. I decided to follow an introductory coding tutorial to experience LLMs in action. This “first-step tutorial” focused on quickly getting code running and bypassed deep explanations of concepts in favor of a series of examples. I figured I could simply copy-and-paste snippets and follow instructions verbatim. But the first example made me think otherwise.

The first code example prompted a Google LLM with a sports question: “Which NFL team won the Super Bowl in the 2010 season?” According to the tutorial, I should see the answer “Green Bay Packers” if I did things correctly. I copied the code snippet (a minimal reconstruction appears below), ran the script on my local machine, and got nothing back. In fact, I waited and waited and eventually had to interrupt the Python script. Even though I made no modifications to the code, the script got stuck. After some poking around, I realized that the model used in the code snippet (Google’s flan-t5-xl) was no longer accessible through HuggingFace’s Inference API. Although this took digging, the solution seemed simple enough: I swapped out the model for another Google model that could be accessed free of charge. I tried again and got a response: “San Francisco 49ers.”
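For context, the tutorial’s exact code is not reproduced here, but a first-step snippet of this kind typically looks something like the sketch below. The model ID and expected answer come from the tutorial; the rest (including the placeholder token) is my own reconstruction using HuggingFace’s hosted Inference API.

```python
import requests

# Hypothetical reconstruction of a first-step tutorial snippet:
# send a prompt to a hosted model through HuggingFace's Inference API.
API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xl"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder access token

def query(prompt):
    response = requests.post(API_URL, headers=headers, json={"inputs": prompt})
    return response.json()

print(query("Which NFL team won the Super Bowl in the 2010 season?"))
# The tutorial's expected answer: something like [{"generated_text": "Green Bay Packers"}]
```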

Hmm, okay. Maybe the issue is Google…

I swapped models again, this time opting for a Big Science model because I could access it free of charge and it seemed sufficiently large (1.1 billion parameters!). Although that was my justification at the time, I was really just guessing this model would work, because the documentation was, well, hard to understand. I tried again: “Buffalo Bills.” Again, I swapped models. This time I tried a VMware model that further fine-tuned Google’s model: “The New England Patriots won the Super Bowl in the 2010 season.” At this point, I stopped and wondered:

Am I doing this right?

In this post I will outline three factors that impact learning about LLMs through code: (1) LLMs are dependably unexpected, (2) the domain is in rapid flux, and (3) documentation is highly variable and written for the most expert audiences. Together, these factors make learning about and with LLMs feel finicky, slippery, and difficult. Although all learners are affected by these factors, those seeking functional competence rather than engineering mastery are hit particularly hard, because these factors presume that individuals are learning code to become coders rather than to achieve some other end. I conclude by offering some advice to those writing tutorials, and to those using them, on how to approach an imperfect and shifting domain.

Three factors that make learning LLMs difficult

Large language models are complicated and complex, and so present innumerable ways in which learning is difficult. I want to call out three factors, all of which I have encountered myself, that make early learning difficult:

  1. LLMs are dependably unexpected because they are non-deterministic technologies.
  2. The AI ecosystem that surrounds LLMs is very much in flux.
  3. The documentation of LLMs and related technologies presumes a degree of understanding far beyond early learners and is geared toward those seeking domain expertise rather than other forms of comprehension.

1. “Dependably unexpected”

The first of these factors I have written about here, but I’ll rehash:

A simple way to understand LLMs is to think of them as very sophisticated predictive text keyboards. When presented with a prompt—whether a question, truncated sentence, or passage of text—LLMs generate words (or numbers, symbols, or whatever is appropriate) that are likely to follow the given prompt. Prompting an LLM with a question such as “What is 2+2?” generates a response that is based on examples of problems of this and similar types. In this particular case, the most likely response contains “4” or “four,” and may be accompanied by pleasantries or richer explanations. By the same token, and given the underlying data and other factors, the response may also be “5” or “What kind of question is that?!” These responses may seem unexpected, but all responses are generated in the same manner: the LLM predicts and then responds based on the likelihood that the output is appropriate for the given context.

I have characterized LLMs as dependably unexpected. A more precise way to characterize LLMs is as non-deterministic technologies. Non-deterministic means that inputs (prompts) do not map one-to-one to outputs (responses). Although inputs and outputs are related to a degree (that is, the relationship is not entirely random), inputs do not necessarily determine outputs (and vice versa).
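To make this concrete, here is a minimal sketch (my own illustration, not from the tutorial) that sends the same prompt to a small instruction-tuned model twice with sampling enabled. Nothing about the input changes between runs, yet the outputs are not guaranteed to match.

```python
from transformers import pipeline

# A minimal illustration of non-determinism: the same prompt, sampled twice.
# google/flan-t5-small is used only because it is small enough to run locally.
generator = pipeline("text2text-generation", model="google/flan-t5-small")

prompt = "Which NFL team won the Super Bowl in the 2010 season?"
for run in range(2):
    result = generator(prompt, do_sample=True, temperature=1.2, max_new_tokens=20)
    print(f"Run {run + 1}: {result[0]['generated_text']}")
# With sampling enabled, the two runs may well return different answers.
```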

In many ways, the non-deterministic quality of LLMs—and generative AI writ large—is what makes them so exciting and, depending on the context, such a liability. LLMs can generate coherent and convincing responses to limitless prompts. Just as LLMs can respond to simple math questions, so too can they respond to existential prompts like “What happens at the end of the world?” or brand brainstorms like “What would be a good company name for a company that makes colorful socks?” For all these types of prompts, LLMs respond to language with language, based on plentiful examples of language. A colleague of mine put this more bluntly: “It is hard to make an LLM not respond.” Many reports of LLMs hallucinating (or confabulating) are instances of the LLM simply working as designed: when given language, LLMs respond with language that is likely. No one said the response would be accurate.

In short, LLMs are language engines that are dependably unexpected in their outputs. As such, outputs cannot be fully predicted, although many people are working to make LLMs more predictable.

In the introductory example, this lack of predictability is one factor that produced the string of answers I was not expecting. However, unpredictability is by no means the only factor contributing to my experience.

2. Ecosystem in flux

Although the product of decades of research and development, the contemporary generative AI ecosystem is, by and large, very new. Released as a demonstration in 2022, ChatGPT took hold of the popular imagination and became the fastest growing application in the history of the internet, reaching 100 million users in just two months. Alongside the growth of OpenAI’s users was (and continues to be) a deluge of products, startups, and demos that make use of its technologies and services. The success and performance of OpenAI accelerated the ecosystem: Google scrambled to productize its models; the investment and ambitions of companies like HuggingFace, Anthropic, and Langchain swelled; associated concerns spread into labor strikes, court cases, and open letters. Needless to say, a lot has been happening.

Part of this wave is a pair of mutually arising activities: (1) the development of new and extended technologies, and so increased demand for key infrastructures; and (2) the development of new and extended business models, and so the monetization of the supply of key infrastructures. Take, for example, HuggingFace.

Prior to December 2020, HuggingFace offered two tiers of access ($199/month and $599/month) geared to different types of organizations exploring AI, namely research labs and enterprises. These plans reflected that generative AI was concentrated in academic and industry settings due to specialized knowledge and resources. In December 2020, HuggingFace created new offerings focused on individuals working on AI (free and $9/month) and maintained its organizational offerings. This shift to individuals contributing to a community reflected that new participants needed access. HuggingFace continued to change. As demand increased, HuggingFace created pay-as-you-go models for labs and custom enterprise pricing (“call for a quote”). In 2023, HuggingFace began offering team accounts ($20 per user per month), seemingly to capture organizations scaling their AI competency but also wanting increased security and control.

Underlying these pricing changes were also service changes. In 2022, HuggingFace shifted organizations and higher-parameter models from its Inference API (in operation since 2020) to its managed solutions, called Inference Endpoints and Pro Accounts. In particular, these managed solutions gave organizations dedicated access to specific (and private) models that could serve as a testbed as well as production-level infrastructure. In doing so, HuggingFace moved models previously accessible through the free Inference API to paid plans.

When my introductory example timed out, it was because the tutorial was out of date relative to HuggingFace’s current offerings.
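One small defensive habit would have surfaced the problem sooner: give the request a timeout and check the HTTP status instead of waiting indefinitely. A rough sketch, assuming the same Inference API endpoint and a placeholder token:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xl"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder access token

try:
    # Fail fast instead of hanging: give up after 30 seconds.
    response = requests.post(
        API_URL,
        headers=headers,
        json={"inputs": "Which NFL team won the Super Bowl in the 2010 season?"},
        timeout=30,
    )
    response.raise_for_status()  # surfaces 404s, 403s, and other errors loudly
    print(response.json())
except requests.exceptions.RequestException as err:
    print(f"Request failed; the model may no longer be served this way: {err}")
```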

3. Documentation for experts rather than learners

The final factor that makes learning with and about LLMs difficult is that documentation presumes a level of knowledge that is beyond early learners. Take, for example, model cards. Model cards are a form of documentation used to explain the creation, use, and limitations of generative AI models. Model cards are supposed to help individuals pick between models. However, model cards—when they exist at all—tend to be sparse and conceptually dense.

Consider the four model cards of the models I mentioned in the introduction (two from Google, one from Big Science, and one from VMware).

Despite being different models, the two Google models have the exact same content on their model cards, all of which is excerpted from the technical paper that explains the models in detail. Each model card begins with a diagrammatic depiction of the types of prompts/tasks and their potential outputs. This diagram is lifted from the original paper. The text that follows includes technical descriptions of the model itself, all of which is summarized or excerpted from the original paper. Additionally, the model card includes code snippets for implementing the model on CPUs, GPUs, and other technical hardware (not exactly early-learner stuff). To me—and, I would say, to any early learner who is a non-developer—these explanations are not particularly helpful when it comes to picking one model over another.
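For a sense of what those hardware-oriented snippets look like, here is a simplified paraphrase of the kind of “run it on a CPU” example such a card provides (my own version, not a verbatim copy of the card):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# The kind of snippet a model card offers: load the tokenizer and model,
# encode a prompt, and generate a completion on a CPU.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

input_ids = tokenizer("translate English to German: How old are you?",
                      return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```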

The information in this model card is emblematic of other model cards. For example, the section titled “Direct Use and Downstream Use” reads as follows:

The authors write in the original paper’s model card that:

The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

See the research paper for further details.

For someone just starting out, terms like “zero-shot NLP” or “in-context few-shot learning NLP tasks” do not hold much meaning. Although these terms describe categories of tasks within the domain, understanding these tasks may be far beyond those just starting to prompt models. Even more, the linked research paper is less accessible and rather intimidating in its writing and format.

Other model cards are similar. Big Science’s BLOOMZ model card points readers to dense and technical webpages after providing a high-level summary of the model using terms similar to those above. VMware’s Flan T5 Alpaca card is even less fleshed out: it effectively only shows how to use the model on various types of hardware, without any real explanation. This model card links back to Google’s FLAN T5 model card to fill in what is missing.

In the simplest terms, these cards are not designed for early learners. They demand a great deal of domain knowledge and subject-matter confidence to be used effectively. Although it is important to have this type of expert-focused documentation, I would argue that these model cards are often so sparse or so dense that they are not particularly effective even for those best suited to use them. But more to the point of this post, no documentation exists for those who are still learning and yet need to make decisions about which models to use.

Making learning with and about LLMs easier

So what can improve learning with and about LLMs? Here is a quick set of closing thoughts about what writers and learners can do today.

Writers

  • Explain your choices: Picking the right model is important. The few explanations I have seen (for example) have emphasized performance above all else and eschewed explanations of trade-offs or other considerations. Those writing tutorials should explain their choices, even if those explanations are not particularly sophisticated. Setting the expectation that not all models work the same teaches that LLMs are tools rather than intelligences.
  • Pick tasks that make sense: The prompt in the introduction is a bad learning prompt because it is a task that has no tolerance for variation. Those writing tutorials should be aware of the issues of asking LLMs about facts, and generally pick tasks that illustrate the appropriate points.
  • Set an expiration date: If you are writing a tutorial, assume it will break because of one of the factors listed here or for other reasons altogether. As such, either update your tutorial periodically or, more simply, add a disclaimer at the beginning. With regard to the latter, stating that a tutorial should be considered tenuous at best beyond a certain date gives readers a heads-up about the quality of the content at the present moment. Given how quickly things change, the shelf life of a tutorial is likely pretty short.

Learners

  • Pick tutorials from reputable sources: Try to find tutorials from people or platforms that are interested in teaching rather than warming a lead (that is, selling a product). Check to see what else the author has written and whether something newer exists.
  • Start with a task you actually want to accomplish: Many first-step tutorials begin with examples that illustrate functionality rather than a real task. If you own a sock company, I doubt you’ll use an LLM to come up with the name for your company. Approach examples with the filter of a task—even a small one—that you actually need to accomplish. As such, your learning will be geared toward that goal rather than debugging someone’s fictional example.
  • Read about LLM concepts first: Conceptual overviews are useful for getting your bearings on terms and expectations. Although these can be dense, they give you something to ground your tutorials in, and they may make model cards a little more understandable.
Thomas Lodato @deptofthomas