  • My colleague Mike Heavers made a nice overview of llamafile. llamafile allows you to run a #chatbot on your own computer with a single file. If you aren’t a developer, Mike’s video makes the setup v easy to understand.

    Watch the overview here: https://www.youtube.com/watch?v=GjP7y3AiFWc

    → 1:44 PM, Dec 7
  • My colleague Stephen Hood and MIECO fellow Justine Tunney released llamafile this week. “llamafile lets you turn large language model (LLM) weights into executables” so that you can run a model on various operating systems without installing anything. Big win for giving access to models!

    Blogpost explaining llamafile here: https://github.com/Mozilla-Ocho/llamafile

    Check out the repo here: https://github.com/Mozilla-Ocho/llamafile

    → 10:50 AM, Dec 1
  • A lot of the reporting about Sam Altman’s return to OpenAI buries the fact that both of the women on the board were replaced by men. The only board member who remains from the previous batch is a man. Assuming Brockman and Sutskever are still on it, here is what the board looks like now.

    → 10:43 AM, Nov 22
  • After writing a post on how LLMs answer math questions (tl;dr: answering math questions is not the same as doing math), I found this post by Gary Marcus on the topic. The chart he created of expected results from a calculator cracked me up (screenshot here).

    → 3:37 PM, Nov 21
  • From Nadia Eghbal’s “Working in Public: the making and maintenance of open source software”: of the four types of open source projects, Eghbal notes that stadiums (high user growth, low contributor growth) are “increasingly commonplace” despite how many people narrativize OSS.

    → 10:17 AM, Nov 18
  • A colleague shared a project to run an LLM on iOS. The project sounds rad; the name is not (and maybe NSFW). In any case, I decided to do some work for the community with a prompt.

    → 10:20 AM, Nov 15
  • I decided to explore Replit’s AI coding feature and in the process made a thing:

    Do I need AI for that? is a tool to determine if you need AI for that.

    (Write up of coding with AI forthcoming.)

    → 3:52 PM, Nov 13
  • Simon Wardley (part of the Mozilla Internet Ecosystem (MIECO) program) shared some thoughts on open source AI models. Building on the idea that training data are source code for foundation models, he argues that open source models exist only when those data (not just weights) are open to inspection.

    → 7:52 PM, Nov 10
  • Notes on multimodal models, AI for infrastructure, fine tuning, and soft prompting

    I have been taking a lot of notes on readings and other sources, and am planning on posting them more frequently. Here are some notes from the backlog.

    —

    From Matt Rickard:

    Quick explanation of CLIP

    CLIP is a text-image neural net that combines word tokens and pixels (pixel tokens?) in a single vector space. It is interesting because it can do zero-shot learning from text descriptions of images. CLIP underlies Stable Diffusion and a 2021 OpenAI model.
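
    For a sense of what zero-shot classification with CLIP looks like in practice, here is a minimal sketch using Hugging Face's transformers implementation (my own example; the image URL and candidate labels are placeholders):

    import requests
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Placeholder image (a common example photo) and candidate labels.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a pizza"]

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Embed the text labels and the image in the same vector space and compare them.
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)

    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")  # the highest score is the zero-shot "classification"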

    Why is this interesting? Multimodal models are already here. This means that text-to-text may only be a type of input and output rather than the constraint of a model itself. OpenAI is certainly pushing in this direction with GPT-4 Turbo, as is Meta with ImageBind and others such as Llava. This makes me wonder how we might be able to take existing models and combine them rather than simply retraining new models. Not being an engineer, I could be suggesting something that is very harebrained.

    AI tech stack half-life and AI writing infrastructure

    These posts suggest that AI will do interop by dealing with fuzzy mappings between different infrastructures that rule-based deployments can’t handle due to the wide variety of possibilities. As a non-dev, I am intrigued by this because the constraints on UX and features due to infrastructure (or just tech debt) may be more easily overcome. I am also skeptical of any claims that things of this sort will be easier, because DevOps can be really complicated.

    From Sebastian Raschka:

    Two posts on fine-tuning (overview and PEFT) (in combination with this light post by IBM on prompt tuning). Lots in here, but my takeaways focus on prompt tuning (TIL) and fine-tuning (still learning):

    • Soft prompt tuning is a strategy for optimizing a model’s behavior without changing its weights (e.g. when full fine-tuning is not possible). Not sure how this works in practice (may try to run through a tutorial sometime and write about it). Might be something here to help folks do this with less work (again, I haven’t done it, so maybe unrealistic)
    • Fine-tuning does increase performance of models, but the increases in performance don’t scale with investment in time/energy. A lot can be done with methods of fine tuning that only change the weights on the final layers of a model or add additional layers for classification. In terms of the decision to fine tune: need to decide what threshold is acceptable given the context (I have lots of questions about what performance means in these contexts tbh)

    According to Raschka, one version of fine-tuning takes a pre-trained model and adjusts the weights by using a new dataset and refining (fine-tuning) the model for a new task. This type of fine tuning requires a lot of data and a lot of compute. Other approaches are far less intensive.

    Raschka also describes fine-tuning some of the layers of a model but not all of them. These methods include fine-tuning only a few final layers of the model (what Raschka refers to as fine-tuning I) and parameter-efficient fine-tuning (PEFT, such as soft prompting, prefix tuning, and LoRA), which use other techniques to modify the model without adjusting the pretrained weights (keeping the pretrained model frozen in full or for the most part). Some of these approaches are:

    • Soft prompting attaches trainable embeddings to the prompt to make the model more effective at a given task (albeit embeddings that are not human readable). (read about it here)
    • Prefix tuning attaches embeddings and a “trainable tensor” to improve performance over soft prompting. (read about it here)
    • LoRA adds a new set of weights on top of a pretrained model (read about it here); a rough sketch of what this looks like in code follows below.
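
    As referenced above, here is a rough sketch of what LoRA looks like in code, using Hugging Face's peft library and a small stand-in model (my own example, not from Raschka's posts); the pretrained weights stay frozen and only the small adapter matrices are trained:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Stand-in base model; in practice this would be a much larger pretrained LLM.
    base_model = AutoModelForCausalLM.from_pretrained("gpt2")

    lora_config = LoraConfig(
        r=8,                        # rank of the added low-rank matrices
        lora_alpha=16,              # scaling factor for the adapter weights
        lora_dropout=0.05,
        target_modules=["c_attn"],  # which modules get adapters (GPT-2's attention projection)
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # only a tiny fraction of parameters are trainable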
    → 12:06 PM, Nov 10
  • [Field Report] Silicon Valley Generative AI Paper Reading Event: RWKV

    Preamble: learning about AI through events

    One of my approaches to learning about the industry, technology, practices, and culture of artificial intelligence is to attend various events on the topic. I am just starting out and so far have attended just three events. One event was on how to prompt ChatGPT and seemed to draw people using ChatGPT for marketing. Another event was a discussion hosted by IxDA Atlanta where designers of various sorts talked about their use of, hopes for, and concerns about generative AI. Most recently, I went to an event where the authors of a machine learning paper presented their research (described below). As I have found in other work, events are especially useful for finding an entry point to expansive domains and, when attended repeatedly, for surfacing differences amongst groups or types of attendees. Given that AI is a massive and rapidly changing domain, attending a smattering of events is one way I am getting a handle on things.

    I am finding events by searching Meetup, and doing so results in a lot of different events, these three being emblematic of what I have seen. I would say there are two camps thus far. Many—maybe most—events focus on building with the technology and introduce concepts, techniques, and tools useful for those in software development roles. Other events focus on understanding and using AI in different roles—marketing, design, product and project management, finance, and so forth. Although I am particularly interested in this second type of event—namely, events for non-builders1—I am planning on attending a mix of events to understand the differences between these two broad camps, especially with regards to the type and distribution of information and understanding.

    As I attend more events, I plan to share what I experience—sometimes just thoughts; other times more substantive thinking. I’ll be using the title “Field Report” to distinguish these posts from other posts.

    —

    Field Report

    Event: Generative AI Paper Reading - RWKV (Presented by RWKV Team)

    Host: Silicon Valley Generative AI

    Additional resources

    • Paper: https://arxiv.org/abs/2305.13048
    • Website: https://www.rwkv.com/
    • Hugging Face Blog: https://huggingface.co/blog/rwkv
    • Github repo: https://github.com/BlinkDL/RWKV-LM

    On 11/6/2023, I attended an event hosted by Silicon Valley Generative AI. The group does a series where individuals present their papers related to the AI field. This month, the presentation focused on research on rearchitecting a specific kind of neural network called a recurrent neural network or RNN.

    This was my first time attending the series, and it will likely not be my last. Based on just this event, the intended audience is pretty technical. This may be expected given the paper itself, but it was also made explicit: one presenter glossed over explaining what RNNs were or the architecture of transformer models by remarking that he assumed the audience was sufficiently familiar with the field. Not being a machine learning or AI expert, I found the discussion hard to parse. However, I had just enough understanding of the topic to follow the main points at a high level, though nowhere near enough to have any questions. After the event, I did my fair share of reading Wikipedia articles to clarify terms that the presenters used.

    Despite being challenging to follow, I found the RWKV event very useful, especially as a point of comparison to the events for non-builders I have attended. In those latter events, terms like prompt were defined upfront and most people conflated generative AI writ large with OpenAI’s product ChatGPT. In other words, the gulf between builders and non-builders is very wide in terms of what they are discussing. From the events I have attended and conversations I have had, non-builder audiences have been thinking about how these systems, once made, impact their lives and jobs in both positive and negative ways. At the RWKV event, the focus was on how these systems become made. This is not to say non-builders are unsophisticated—in fact, these audiences are simply not talking about the same things. At the forefront of non-builder events were real concerns about day-to-day impacts and ethics, which were altogether missing from the RWKV event (honestly, I am not sure where these kinds of concerns would fit, which is my point). The RWKV event did show how fast ideas are emerging and how focused builders are on AI as a technical problem—GPU saturation, memory loss, and scaling—a focus that leaves little room for the sociopolitical world of AI.

    On to the topic itself.

    The paper presented focused on a revised architecture to recurrent neural networks (RNNs), a precursor to transformers. Transformers underlie the wave of models currently driving advancements and productization of AI. RNNs have largely been left by the wayside because they lack the ability to scale effectively due to a bottleneck in their architecture (Wikipedia tells me that this bottleneck results in what is called “the vanishing gradient problem”).

    As I understand it, RNNs “read” sentences in order, one word followed by the next. Transformers, on the other hand, use an “attention mechanism” that allows the model to focus on key words in a sentence (many claim attention mechanisms compare to how people actually read 🤷). The result is that transformers allow for parallel processing of tasks RNNs serialize. The issue with RNN architecture is that RNNs struggle with longer input sequences due to how they store (or really lose) information over time. The first diagram is how the presenters explained the bottleneck of RNNs (see the red arrows).
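
    As a toy illustration of that serial reading (my own sketch, not the paper's RWKV formulation), an RNN folds every token, in order, into a single hidden-state vector; whatever the model "remembers" about the whole input has to fit in that one vector, which is where long inputs start to lose information:

    import numpy as np

    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(8, 8))   # input-to-hidden weights
    W_h = rng.normal(size=(8, 8))    # hidden-to-hidden weights

    hidden = np.zeros(8)
    token_embeddings = rng.normal(size=(20, 8))  # a pretend 20-token input sequence

    for x in token_embeddings:       # strictly sequential: step t depends on step t-1
        hidden = np.tanh(W_in @ x + W_h @ hidden)

    print(hidden)  # everything the model retains about 20 tokens lives in these 8 numbers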

    The architectural update to RNNs described in the paper is called RWKV (it has a pronunciation that I can’t remember). According to the presenters, the reworked architecture has better memory by shifting tokens to address issues of memory loss and increase parallel processing. The subsequent diagram shows strategies for preserving memory and increasing efficiency by shifting tokens, sharing states, and mixing outputs as inputs.

    The authors claim that, while RWKV models require data comparable to other models in order to assess how they perform, the architecture can be used to train models at sizes well beyond previous RNNs. At these sizes, the RWKV architecture compares to transformer architecture (on what terms that comparison happens I am not exactly sure…I need to do a deeper dive on the paper if I can even understand it).

    Why is any of this interesting to me as a non-builder? The authors explain that RNNs are far less resource intensive, scaling linearly with input size rather than quadratically as transformers do. If this approach can use data that is comparable to existing models (say the data underlying Llama), the upfront cost and ongoing resource use would be significantly lower as models get bigger. This means that the RWKV architecture might be an alternative to transformers in some use cases and at some scales. Basically, it would require less compute time and less storage for the same input and similar outputs. I am looking forward to seeing someone bring this to a workable demo—the paper presented lots of graphs that left me wondering how this would be any different or better from the perspective of a user.

    A few final thoughts

    • The presenters were very careful about their claims, and made several caveats about what the findings indicate. Because the paper was hard to understand, I did not know the caveats existed until others asked questions to clarify diagrams and claims. All this is to say, the implications of this type of highly technical research require a great deal of knowledge and translation.
    • Though I am not a builder, I knew a little about RNNs from just knowing the history of AI developments. Plenty of timelines exist that place RNNs to the left, before the transformer era. I had assumed RNNs were dead. Yet this paper made me realize that those making these technologies are distinct from those narrativizing them as products and opportunities. Claiming transformers are the future is an opportunity narrative (”bigger, better, faster…now in titanium!”) as much as a claim about comparative performance. In addition, the paper made it clear that builders are cross-pollinating ideas in ways that are not obvious to non-builders in this space, and especially to consumers of these services. There is a lot of room to make these ideas accessible and valuable beyond their technical accomplishment.

    1. I am not sure what the right term is here. I’ll use the term builders to refer to engineers, developers, data scientists, and the like who are deeply immersed in the mechanics of AI. These roles are radically different and barely hang together. Likewise, I use the term non-builders to refer to everyone else. In discussions with peers, I have heard technical and non-technical or expert and non-expert. I don’t like these terms because what is deemed technical and expert is highly debatable, and entrenched in the history and market of valuing certain types of knowledge, production, and work. Admittedly, builder/non-builder is too, but these terms feel more fitting to describe what different people are expected to do relative to AI. If someone has a concise and easy-to-use distinction, please send it my way. ↩︎

    → 3:03 PM, Nov 8
  • More for less: OpenAI's larger and cheaper context window

    Yesterday (11/6/2023) OpenAI announced GPT-4 Turbo, an update to their state-of-the-art model GPT-4. Amongst many features, GPT-4 Turbo has a huge context window. Context windows are the amount of text (counted in a unit called tokens1) that can be input into a model. At 128000 tokens, GPT-4 Turbo can take in about 98.5k words or ~200 pages of text.2

    Prior to this announcement, OpenAI’s model with the largest context window was GPT-4-32k. This model has a context window of 32768 tokens, or roughly 50 pages (approx. 25k words). This model quadrupled GPT-4’s context window of 8192 tokens or 12 pages (approx. 6k words).

    Performance aside, the increased context window of GPT-4 Turbo allows for use cases that previously required complicated workarounds with OpenAI’s GPT-4 models. What is more notable is the cost of this context window.

    With the introduction of GPT-4 and GPT-4-32k, the context window was a premium feature. Compared to its predecessor, GPT-3.5-turbo-1106 (context window: 16385), GPT-4 is 30 times more expensive to use at $0.03 per 1000 input tokens and GPT-4-32k 60 times more expensive at $0.06 per 1000 input tokens.3 In other words, inputting the same 12 pages (roughly 8,000 tokens) to these GPT-4 models would cost about $0.25 and $0.49, respectively. GPT-3.5-turbo-1106 would cost less than a penny. Arguably, GPT-4-32k would only be necessary for inputs that exceeded GPT-4’s context window and GPT-3.5-turbo-1106’s response quality, meaning it would be more costly—and maybe used less frequently—for specific kinds of use cases. In either case, OpenAI was charging users for the compute required to run a model with a larger context window.

    GPT-4 Turbo, however, is a sixth as expensive as GPT-4-32k, a third as expensive as GPT-4 at $0.01 per 1000 input tokens, and ten times as expensive as GPT-3.5-turbo-1106. Using our example of a 12-page document, GPT-4 Turbo would cost about $0.08 to process the same prompt.
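
    To make the arithmetic concrete, here is a quick back-of-the-envelope sketch using the per-1,000-token input prices cited above and the ~8,192-token estimate for a 12-page prompt (a rough estimate; it ignores output tokens):

    # Rough input-cost comparison for a ~12-page (~8,192-token) prompt.
    # Prices are USD per 1,000 input tokens, as cited in this post.
    PRICES_PER_1K = {
        "gpt-3.5-turbo-1106": 0.001,
        "gpt-4": 0.03,
        "gpt-4-32k": 0.06,
        "gpt-4-turbo": 0.01,
    }

    input_tokens = 8_192  # roughly 12 pages at ~500 words/page and ~1.3 tokens/word

    for model, price in PRICES_PER_1K.items():
        cost = input_tokens / 1000 * price
        print(f"{model}: ${cost:.3f}")  # roughly $0.008, $0.25, $0.49, and $0.08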

    With a lower cost, a bigger context window, and a slew of other features, OpenAI seems to want developers and businesses to move to GPT-4 Turbo. This could be a strategy for OpenAI to increase its market share while taking a hit on operating cost and/or for consolidating its numerous offerings as its models become multimodal.4 Reflecting on the keynote and announcements, Ben Thompson speculates that OpenAI’s strategy is to become “the most important consumer-facing product since the iPhone”. As much as OpenAI presented these features as important for developers, Thompson thinks the company is leaning into the fact that it is an “accidental consumer technology company”. As such, according to Thompson, the features—context window included—seem positioned to get more consumers using products built on OpenAI’s technology rather than to get developers excited about building.

    Regardless, OpenAI is moving extraordinarily fast and using its partnership with Microsoft to dominate the ecosystem. I’ll be curious to see what else follows in the coming weeks around pricing, access, and developer & consumer resources.


    1. Tokens are language units that enable an LLM to represent natural language. LLMs represent language through associations. For example, the word “jumping” may resolve into two tokens: “jump” and “ing”. By doing so, an LLM may associate jumping with other physical movement words (e.g. leaping, bounding, running) by way of the root verbs (e.g. jump, leap, bound, run). Likewise, an LLM may also be able to associate the word with parts of speech (e.g. gerund, present participle) by way of its suffix (-ing). (LLMs also create many other associations that are harder to explain.) This article explains more about tokens and how tokenization (the process by which words are turned into tokens) occurs. ↩︎
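
    If you want to see tokenization in practice, OpenAI's tiktoken library shows how a given word or phrase gets split (my own example; the exact split depends on the tokenizer, so "jumping" may come back as one token or several):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models
    tokens = enc.encode("jumping")

    print(tokens)                               # the token IDs
    print([enc.decode([t]) for t in tokens])    # the text pieces those IDs map back to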

    2. I am using the ratio of 1.3:1 for tokens to words found here. This ratio is not hard and fast, but an approximation based on tokenization strategies and actual model performance. I am also using 500 words (single spaced) per page, which is an estimation I have used for years and I am not sure where it comes from or how accurate it is. ↩︎

    3. Pricing of input and output tokens—that is, what is provided as a prompt and what is served as a response—are different. See the pricing here. ↩︎

    4. OpenAI maintains 9 different families of models with multiple models therein, some for image generation, some for audio, some for text. Also announced during the keynote was a version of GPT-4 that includes image-to-text (”vision”) capabilities. ↩︎

    → 5:21 PM, Nov 7
  • Casey Newton’s latest post about OpenAI’s agents (“GPTs”) is fascinating, especially the way making an agent requires no coding, just a chat interface. This is how, in many cases, chatbots have been doing some tasks: creating code unbeknownst to the user. Now it’s a feature.

    → 10:27 AM, Nov 7
  • Notes on Eric Jang's "How to Understand ML Papers Quickly"

    I found “How to Understand ML Papers Quickly” by Eric Jang (VP @ 1X, ex-Google) recently in my learning about generative AI. The post was written in 2021, and I wonder what might be updated about these points given the current state of knowledge.

    Here is my summary of the already-pretty-concise original post:

    • Determine the inputs & outputs of the ML problem & whether these inputs can even produce these outputs
    • “ML models are formed from combining biases and data” so figure out which is being added/subtracted
    • Determine to what extent the model can be generalized & whether it requires learning (as opposed to hard-coding) to get there
    • Make sure the claims can be falsified (is it science?)
    → 5:10 PM, Oct 31
  • Some reading from this week on AI

    • Hacker News thread on the difference between and rationale for using embeddings vs. fine-tuning an LLM. Takeaway: embeddings are good for retrieval of specific information (e.g. RAG) and fine-tuning is good for synthesizing responses (though subject to confabulation) and tone & voice (e.g. “style”); a small retrieval sketch follows this list
    • Glossary entry about AI PMs. Takeaway: An AI PM is a PM who has a deep enough knowledge of AI to drive product strategy and delivery for AI-enabled products. Often folks with this knowledge have a background in data science or ML. (I have lots of questions about this category fwiw)
    • How to Use Your Product Management Experience to Ride the AI Wave. Takeaways: AI PMs bring together “strategic”, “tactical”, and “technical” factors to effectively deliver products. Strategic factors focus on understanding the actual value added by AI (rather than hype) and what it takes to implement before beginning. Tactical factors focus on breaking down design and development into milestones and tasks, and being able to effectively communicate to leadership. Technical factors focus on the fact that “the ability to discuss model trade-offs, experimentation approaches, infrastructure choices, and technology stacks, for example, is integral to the role”. In short, AI PMs have a deep enough knowledge to be effective. An important consideration–and why subject matter expertise matters–is that an AI PM needs to be aware of the skills necessary for the multidisciplinary teams needed to deliver AI.
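
    As a minimal sketch of the embeddings-for-retrieval takeaway above (my own example, assuming OpenAI's embeddings endpoint and the v1 Python client), retrieval is just nearest-neighbor search over vectors, with no fine-tuning involved:

    import numpy as np
    from openai import OpenAI  # assumes the v1 openai Python client and OPENAI_API_KEY set

    client = OpenAI()

    def embed(text: str) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
        return np.array(resp.data[0].embedding)

    docs = [
        "Our refund window is 30 days from delivery.",
        "Support hours are 9am-5pm Eastern, Monday through Friday.",
    ]
    doc_vectors = [embed(d) for d in docs]

    query = embed("How long do I have to return an item?")
    scores = [v @ query / (np.linalg.norm(v) * np.linalg.norm(query)) for v in doc_vectors]
    print(docs[int(np.argmax(scores))])  # retrieve the most relevant snippet verbatim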
    → 4:16 PM, Oct 27
  • The Mozilla AI Guide is now live!

    The AI Guide is an MVP resource we created for new and existing developers who want to learn about generative AI, especially about local models, open-source software, and responsible use. Check it out, share your feedback, and look for more updates soon!

    → 1:41 PM, Oct 26
  • Am I doing this right? Three factors that make learning with and about LLMs difficult

    When I started learning about large language models (LLMs), I hit a wall almost immediately. I decided to follow an introductory coding tutorial to experience LLMs in action. This “first-step tutorial” focused on quickly getting code running and bypassed deep explanations of concepts in favor of a series of examples. I figured I could simply copy-and-paste snippets and follow instructions verbatim. But the first example made me think otherwise.

    The first code example prompted a Google LLM with a sports question: “Which NFL team won the Super Bowl in the 2010 season?” According to the tutorial, I should see the answer “Green Bay Packers” if I do things correctly. I copied the code snippet, initialized the script on my local machine, and I got nothing back. In fact, I waited and waited and eventually had to interrupt the Python script. Despite making no modifications to the code, the script got stuck. After some poking around, I realized that the model used in the code snippet (Google’s flan-t5-xl) was no longer accessible through HuggingFace’s Inference API. Although this took digging, the solution seemed simple enough: I swapped out the model for another Google model that could be accessed free-of-charge. I tried again and got a response: “San Francisco 49ers.”

    Hmm, okay. Maybe the issue is Google…

    I swapped models again, this time I opted for a Big Science model because I could access it free-of-charge and it seemed sufficiently large (1.1 billion parameters!). Although this was my justification at the time, I ultimately guessed this model would work because the documentation was, well, hard to understand. I tried again: “Buffalo Bills.” Again, I swapped models. This time I decided to try a VMware model that further fine-tuned Google’s model: “The New England Patriots won the Super Bowl in the 2010 season.” At this point, I stopped and wondered:

    Am I doing this right?

    In this post I will outline three factors that impact learning about LLMs through code. These factors are (1) LLMs are dependably unexpected, (2) the domain is in rapid flux, and (3) documentation is highly variable and written for the most expert audiences. Together, these factors make learning about and with LLMs feel finicky, slippery, and difficult. Although all learners are impacted by each of these factors, those seeking functional competence rather than engineering mastery are particularly impacted because these factors presume individuals are learning code to be coders rather than to achieve some other end. I conclude by offering some advice to those writing and those using tutorials on how to approach an imperfect and shifting domain.

    Three factors that make learning LLMs difficult

    Large language models are complicated and complex, and so present enumerable ways that learning is difficult. I want to call out three factors that make early learning difficult that I have encountered myself:

    1. LLMs are dependably unexpected because they are non-deterministic technologies.
    2. The AI ecosystem that surrounds LLMs is very much in flux.
    3. The documentation of LLMs and related technologies presumes a degree of understanding far beyond early learners and geared toward those seeking domain expertise rather than other forms of comprehension.

    1. “Dependably unexpected”

    The first of these factors I have written about here, but I’ll rehash:

    A simple way to understand LLMs is to think of them as very sophisticated predictive text keyboards. When presented with a prompt—whether a question, truncated sentence, or passage of text—LLMs generate words (or numbers, symbols, or whatever is appropriate) that are likely to follow the given prompt. Prompting an LLM with a question such as “What is 2+2?” generates a response that is based on examples of problems of this and similar types. In this particular case, the most likely response contains “4” or “four,” and may be accompanied by pleasantries or richer explanations. By the same token and given the underlying data and other factors, the response may also be “5” or “What kind of question is that?!”. These may seem unexpected, but all responses are generated in the same manner: the LLM predicts and then responds based on the likelihood that the output is appropriate for the given context.

    I have characterized LLMs as dependably unexpected. A more precise way to characterize LLMs is as non-deterministic technologies. Non-deterministic means that inputs (prompts) do not correlate one-to-one with outputs (responses). Although they are related to a degree (that is, not entirely random), inputs do not necessarily determine outputs (and vice versa).

    In many ways, the non-deterministic quality of LLMs—and generative AI writ large—is what makes them so exciting and, depending on the context, a liability. LLMs can generate coherent and convincing responses to limitless prompts. Just as LLMs can respond to simple math questions, so too can they respond to existential prompts like “What happens at the end of the world?” or brand brainstorms like “What would be a good company name for a company that makes colorful socks?” For all these types of prompts, LLMs respond to language with language based on plentiful examples of language. A colleague of mine put this more bluntly: “It is hard to make an LLM not respond.” Many of the reports of LLMs hallucinating (or confabulating) are instances when the LLM is simply working as it is designed: when given language, LLMs respond with language that is likely. No one said the response would be accurate.

    In short, LLMs are language engines that are dependably unexpected in their outputs. As such, outputs cannot be fully predicted, although many are working to make LLMs more predictable.
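
    One concrete dial for that predictability is sampling temperature. Here is a small sketch (my own, assuming the v1 openai Python client): at temperature 0 the same prompt tends to return nearly the same response, while higher temperatures produce more varied wording and answers.

    from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

    client = OpenAI()

    def ask(prompt: str, temperature: float) -> str:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,  # 0 = near-deterministic, higher = more varied
        )
        return resp.choices[0].message.content

    for t in (0.0, 1.5):
        # Run the same prompt at each temperature to see how much the answers drift.
        print(t, ask("What is 2+2? Explain how you got an answer.", temperature=t))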

    In the introductory example, this lack of predictability is one factor that resulted in the bunch of answers I was not expecting. However, unpredictability is by no means the only factor contributing to my experience.

    2. Ecosystem in flux

    Although the product of decades of research and development, the contemporary generative AI ecosystem is, by and large, very new. Released as a demonstration in 2022, ChatGPT took hold of the popular imagination and became the fastest growing application in the history of the internet, reaching 100 million users in just two months. Alongside the growth of OpenAI’s users was (and continues to be) a deluge of products, startups, and demos that make use of its technologies and services. The success and performance of OpenAI accelerated the ecosystem: Google scrambled to productize its models; the investment and ambitions of companies like HuggingFace, Anthropic, and Langchain swelled; associated concerns spread into labor strikes, court cases, and open letters. Needless to say, a lot has been happening.

    Part of this wave are two mutually arising activities: (1) the development of new and extended technologies, and so increased demand for key infrastructures; (2) the development of new and extended business models, and so monetizing the supply of key infrastructures. Take, for example, HuggingFace.

    Prior to Dec 2020, HuggingFace offered two tiers of access ($199/month and $599/month) geared to different types of organizations exploring AI, namely, research labs and enterprises. These plans reflected that generative AI was concentrated in academic and industry settings due to specialized knowledge and resources. In Dec 2020, HuggingFace created new offerings focused on individuals working on AI (free and $9/month) and maintained their organizational offerings. This shift to individuals contributing to a community reflected that new participants needed access. HuggingFace continued to change. As demand increased, HuggingFace created pay-as-you-go models for labs and custom enterprise pricing (”call for a quote”). In 2023, HuggingFace began offering team accounts ($20 per user per month) to seemingly capture organizations scaling their AI competency but also wanting increased security and control.

    Underlying these pricing changes were also service changes. In 2022, HuggingFace shifted organizations and higher parameter models from their Inference API (in operation since 2020) to its managed solution called Inference Endpoints or Pro Accounts. In particular, this managed solution gave organizations the ability to have dedicated access to specific (and private) models that could be used as a testbed as well as production-level infrastructure. In doing so, models previously accessible through the free Inference API were moved to paid plans.

    When my introductory example timed out, it was due to the fact that the tutorial was now out-of-date given HuggingFace’s current offerings.
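
    For context, the failing call in my example had roughly this shape (a simplified sketch, not the tutorial's actual code; the model ID and token are placeholders). When a model is no longer hosted on the free Inference API, a request like this hangs or returns an error instead of an answer:

    import os
    import requests

    # Hosted Inference API endpoint for a (placeholder) model.
    API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-base"
    headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

    payload = {"inputs": "Which NFL team won the Super Bowl in the 2010 season?"}
    response = requests.post(API_URL, headers=headers, json=payload, timeout=60)

    print(response.status_code, response.json())  # an unhosted model yields an error, not an answer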

    3. Documentation for experts rather than learners

    The final factor that makes learning with and about LLMs difficult is that documentation presumes a level of knowledge that is beyond early learners. Take, for example, model cards. Model cards are a form of documentation used to explain the creation, use, and limitations of generative AI models. Model cards are supposed to help individuals pick between models. However, model cards—when they exist at all—tend to be sparse and conceptually dense.

    Here are the four model cards I mentioned in the introduction:

    • Google’s FLAN-T5 XL
    • Google’s FLAN-T5 Base
    • Big Science’s BLOOMZ 1b1
    • VMWare’s FLAN-T5 Alpaca

    Despite being different models, the two Google models have the exact same content on their model cards, all of which is excerpted from the technical paper that explains the models in detail. The model card begins with a diagrammatic depiction of the types of prompts/tasks and their potential outputs. This diagram is lifted from the original paper. The text that follows includes technical descriptions of the model(s), all of which is summarized or excerpted from the original paper. Additionally, the model card includes code snippets for implementing the model on CPUs, GPUs, and other hardware (not exactly early learner stuff). To me—and I would say to any early learner who is a non-developer—these explanations are not particularly helpful when it comes to deciding to pick one model or another.

    The information in this model card is emblematic of other model cards. For example, the section titled “Direct Use and Downstream Use” reads as follows:

    The authors write in the original paper’s model card that:

    The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

    See the research paper for further details.

    For someone just starting out, terms like “zero-shot NLP” or “in-context few-shot learning NLP tasks” do not hold much meaning. Although these terms describe categories of tasks within the domain, understanding these tasks may be far beyond those just starting to prompt models. Even more, the linked research paper is less accessible and rather intimidating in its writing and format.

    Other model cards are similar. Big Science’s BLOOMZ model card points readers to dense and technical webpages after providing a high-level summary of the model using similar terms as above. VMware’s FLAN-T5 Alpaca model card is even less fleshed out: it effectively only shows how to use the model on various types of hardware without any real explanation. This model card links back to Google’s FLAN-T5 model card to fill in what is missing.

    In the simplest terms, these cards are not designed for early learners. They expect a great deal of domain knowledge and subject matter confidence to use them effectively. Although it is important to have this type of expert-focused documentation, I would argue that these model cards are often so sparse or dense as to not be particularly effective for those most suited to use them. But more to the point of this post, no documentation exists to address those still learning and needing to make decisions about which models to use.

    Making learning with and about LLMs easier

    So what can improve learning with and about LLMs? Here is a quick set of closing thoughts about what writers and learners can do today.

    Writers

    • Explain your choices: Picking the right model is important. The few explanations I have seen (for example) have emphasized performance above all else, and eschewed explanations of trade-offs or other considerations. Those writing tutorials should explain their choices, even if those explanations are not particularly sophisticated. Setting expectations that not all models work the same teaches that LLMs are tools rather than intelligences.
    • Pick tasks that make sense: The prompt in the introduction is a bad learning prompt because it is a task that has no tolerance for variation. Those writing tutorials should be aware of the issues of asking LLMs about facts, and generally pick tasks that illustrate the appropriate points.
    • Set an expiration date: If you are writing a tutorial, assume your tutorial will break for one of the factors listed here or for other reasons altogether. As such, either update your tutorial periodically, or more simply include a disclaimer at the beginning. With regards to the latter, stating that a tutorial should be considered tenuous at best beyond a certain date gives the reader a heads up about the quality of the content at the present moment. Given how quickly things change, the shelf life of a tutorial is likely pretty short.

    Learners

    • Pick tutorials from reputable sources: Try to find tutorials from people or on platforms that are interested in teaching rather than warming a lead (that is, selling a product). Check to see what else the author has written and whether something newer exists.
    • Start with a task you actually want to accomplish: Many first-step tutorials begin with examples that illustrate functionality rather than a real task. If you own a sock company, I doubt you’ll use an LLM to come up with the name for your company. Approach examples with the filter of a task—even a small one—that you actually need to accomplish. As such, your learning will be geared toward that goal rather than debugging someone’s fictional example.
    • Read about LLM concepts first: Conceptual overviews are useful to get your bearings on terms and expectations. Although these can be dense, they provide you with something to ground your tutorials in, and may make model cards a little more understandable.
    → 3:41 PM, Oct 23
  • TIL from this paper that although LLMs can translate text from one language to another, the cost of doing so varies greatly, making LLMs more costly for some languages because of how words are tokenized.

    Screenshot of an academic paper titled “Language Model Tokenizers Introduce Unfairness Between Languages” by Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, and Adel Bibi. The abstract reads: Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair subword tokenizers.
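
    To see the disparity the paper describes, you can count tokens for the same sentence in different languages with a tokenizer such as OpenAI's tiktoken (the sample sentences and counts here are my own illustration; exact numbers depend on the tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "English": "How are you today?",
        "German": "Wie geht es dir heute?",
        "Thai": "วันนี้คุณเป็นอย่างไรบ้าง",
    }

    for language, text in samples.items():
        # Same meaning, very different token counts (and therefore cost and context usage).
        print(language, len(enc.encode(text)))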
    → 12:04 PM, Oct 23
  • I love getting sassed by ChatGPT (GPT-3.5) to just Google something with the attitude of “you know the tech under here is pretty advanced, right?” Also, it doesn’t mean use something “like Google”–it means Google because Bing is not good.

    → 2:41 PM, Oct 16
  • Catching up in “Hard Fork”: this segment on prompt engineering with Riley Goodside is pretty fascinating esp because it sounds a lot more like animal training than pen testing

    → 8:33 AM, Oct 13
  • I am working on a bit of writing about the challenge of learning about LLMs by using them, and here is a prime example. The image is a screencap of Google’s flan-t5-base model on HuggingFace. The diagram gives an answer to a question that, when demoed live, does not produce the expected answer.

    → 3:14 PM, Oct 10
  • Oh, and one more: When using PAL chain the result translated the plus sign (+) into the word “plus” to explain the final answer without mentioning anything about the addition operation.

    Entering new  chain...
    def solution():
        'What is 2+2? Explain how you got an answer.'
        2+2 is 4. This is because 2+2 is the same as 2 plus 2, which is 4.
    
    Finished chain.
    
    → 8:53 AM, Oct 9
  • Appendix to "Show Your Work"

    In the previous post I used a plan-and-execute agent to discuss the use of an LLM with an orchestration framework. Admittedly, an action agent is more elegant. Using a variant prompt (“What is 2+2? Provide an explanation of how the final answer was obtained.”), the output is more condensed and, honestly, makes the point a lot more concisely:

    Entering new  chain...
     I need to use a calculator to solve this math problem.
    Action: Calculator
    Action Input: 2+2
    Observation: Answer: 4
    Thought: I now know the final answer.
    Final Answer: 2+2 is equal to 4. The final answer was obtained by using a calculator to solve the math problem.
    

    The explanation at the end is a perfect encapsulation of LLMs answering math questions without doing math.

    You can find the Python file used to run this here.

    → 4:49 PM, Oct 6
  • Show your work: what to expect from how LLMs “do” math

    After finishing my undergrad in mathematics, I taught algebra and geometry to high schoolers. I often used the refrain show your work to stress that getting the right answer required the right process. The reason being that the steps one takes to solve math problems illustrate an understanding of the mathematical system at work, and that mathematical system is what students are ultimately learning in any math discipline. My experience studying and teaching math came rushing back when I recently started learning about large language models (LLMs). Amongst examples of these technologies summarizing text, answering trivia, and generating names for sock companies were examples of LLMs answering math questions. My interest was piqued: LLMs “do” math…show me your work!

    In this post, I want to explain two approaches for how LLMs handle math questions in order to reflect on LLMs and generative AI more broadly. As we will see, LLMs can respond with seemingly correct answers to math problems without necessarily doing math. Because LLMs are designed around natural language relationships and predictions, they answer math problems through language prediction rather than mathematical algorithms. The result is that LLMs use unexpected work to do math. Whether the answers and explanations are right or wrong from the perspective of math sidesteps a deeper question: what should we expect from LLMs in the first place? I suggest that LLMs and generative AI should be thought of as dependably unexpected, and be judged, used, and integrated according to that mindset.

    What is 2+2? Explain how you got an answer.

    Conveniently enough, simple math questions provide all we need in order to unpack the math capabilities of LLMs. The focus from here on out is on one prompt containing a math question:

    Prompt: What is 2+2? Explain how you got an answer.

    I sent the prompt to three different models and, as we will see, got similar but not the same answers. Two of these models (Llama2 [7B] and Orca Mini [7B]) were running on my local machine (an Apple Macbook Pro with an M1 chip). The other model (GPT-3.5) was running on OpenAI’s servers accessed through the web interface (ChatGPT) and OpenAI’s paid API using a local Python script (the final example appears later on).

    To set the stage, here are three responses to this prompt:

    Llama2 [7B] running locally with the llm command-line tool and the MLC plug-in

    Hello! I’m here to help you with your question. However, I noticed that the question “what is 2+2” is a bit nonsensical as the answer is obvious and doesn’t require any explanation. The correct answer is simply 4.

    I understand that you might have asked this question in a playful or rhetorical manner, but I’m programmed to provide accurate and helpful responses. If you have any other questions or concerns, please feel free to ask!

    Orca Mini [7B] running locally through the llm command-line tool

    The answer to the equation 2+2 is four. This is because when we add two and two together, we are essentially combining two quantities to make a new total. In this case, we start with two units and then add another two units to that, resulting in a total of four units. Therefore, the answer is four.

    GPT-3.5 model access through the OpenAI’s ChatGPT web interface

    2 + 2 equals 4.

    To explain how I got this answer, we can break it down step by step:

    Start with the number 2. Add another 2 to it. When you combine these two 2s, you get 4.

    So, by performing the addition operation, we have combined two quantities of 2, resulting in a total of 4.

    As we can see in all these cases, the model answers with 4 either as a numeral (”4”) or as a word (”four”). The response includes descriptions of how the value is obtained that vary in depth, language, and tone. For our purposes and with regards to the math itself, the LLM produced an expected response and a sufficient explanation given the question.

    These responses demonstrate one approach to LLMs handling math, but not the only approach.

    Approach 1: The LLM predicts what should come next

    The simplest version is that what an LLM does to produce a response is determine the sequence of words most likely to follow from the prompt itself. In this, LLMs generate a sequence of words using a sophisticated version of next word prediction.

    Two side-by-side screenshots of Apple’s native iPhone keyboard suggesting words through predictions of likelihood. The left image shows the word “what’s” typed in the text field and suggestions for the words “the”, “you”, and “up”. The right image shows the word “good” typed in the text field and suggestions for the words “morning”, “night”, and “luck”.
    Screenshots of Apple’s native iPhone keyboard suggesting words through predictions of likelihood. Screenshots taken by Thomas Lodato (10.3.2023)

    For those using a smartphone, a version of next word prediction can be found on (in?) your predictive keyboard. Predictive keyboards display words above the smartphone keyboard to speed along typing by trying to guess what might be useful next. These predictions are based on the likelihood that the words might be useful next. “The”, “your”, and “up” are likely to follow “what’s”; “morning”, “night”, and “luck” are likely to follow “good”. These words tend to be called suggestions. These suggestions are based on predictions of the typer’s intent, or, what someone is ultimately trying to communicate. Instead of “what’s up?” you may write “what’s your ETA?”. In this way the prediction of “up” was still highly likely, albeit not what was needed by a particular individual at that moment. LLMs are similar to predictive text, but depend on far more data and produce many words rather than just one.

    LLMs are trained on lots of examples of writing. These examples allow LLMs to develop a schema of predicting which words, symbols, and sentences follow one another based on what comes before. These training data cover all sorts of examples of written language, some of which are math explanations, but also literature, recipes, historical accounts, and computer code, to name just a few. Because these—and so many other explanations—are fed into LLMs, the models have the ability to take in a prompt and predict a sequence of words that are likely to follow from it. If we use a simplified version of our prompt—prompt: “2+2=”—an LLM may predict ”4”, “___” (a blank line), or “?”, all of which are possible sequences of symbols following the prompt. The single response that is given is determined based on a variety of factors, but in many cases is the most likely next symbol, e.g. “4”. In this way, LLMs do not compute quantities, but compute likelihoods: the likelihood that something should follow something else.
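
    As a toy illustration of "computing likelihoods" (my own sketch, using the small GPT-2 model from Hugging Face's transformers rather than the models discussed in this post), you can inspect the probabilities a model assigns to whatever token comes next after a prompt like "2+2=":

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("2+2=", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # scores for the next token only

    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    for p, idx in zip(top.values, top.indices):
        # The model ranks likely continuations; it is predicting text, not computing a sum.
        print(repr(tok.decode([int(idx)])), round(float(p), 3))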

    LLMs use machine learning to create these relationships (”learn”) and a technique called attention to parse the important parts of the prompt. Combined, these allow LLMs to generate predictions based on the focal points of a prompt to determine intent. Some refer to attention as an LLM knowing what is important in a prompt. More accurately, attention allows models to prioritize parts of prompts using examples from training. This assessment of prompt intent using the attention mechanism narrows in on subtleties that we as humans perceive in language but that previous natural language processing approaches failed to capture. The result is a coherent response that, when it works, feels very much not like a machine.

    Although in our example the math problem is likely to be found on the web, LLMs do not need the exact phrasing to make a coherent and seemingly accurate response. As mentioned, LLMs synthesize responses using relationships in language as much as the language itself. For example, an LLM may ingest examples that include questions like “What is 3+4=? Please explain” and “Show your work for: 10+2=”. These are not the same as our prompt but similar. Paired with their answers, these similar questions provide a structural relationship for how to answer questions of this type. If no example of our exact prompt was used in training, the LLM can leverage structural similarities. As such, responses can extend beyond what is found immediately and exactly in the training data for an LLM.

    The result is an answer to a math question through language relationships. The expectedness and sufficiency of the answer—what one might say is correct or accurate—is derived from the content and tone of the data therein, and the receiver of the message. If you train an LLM on question-answer pairs where the answers are wrong, the model will generate responses that reflect that data. The answers may be seen as wrong, but the model is working perfectly. Likewise, if you train a model using an online community that has the potential to include derisive responses to simple questions—as in the case of Llama2 which uses user data from Facebook amongst other sources—then questions like “what is 2+2?” elicit a response commensurate with that data. Again, unexpected (and rude!), but working as designed.

    From these techniques, LLMs can do math because they are trained on written explanations of math, and can extend that language knowledge if need be. Other factors influence the perceived accuracy and sufficiency of the response, such as the degree of randomness or length of response the system allows. Regardless, the approach is ultimately the same. The underlying work (not the explanation given) is closer to asking a lot of friends for the answer than doing the numerical calculation yourself. The answer is correct, but the work is not math-work per se.

    Approach 2: The LLM predicts what tools can do the job

    In Approach 1, LLMs generate responses based on predictions. Prediction is the core of what an LLM does, but it is not the only means by which an LLM handles math problems. A second approach involves identifying and applying specific tools (such as calculators) to effectively handle a prompt using the ReAct approach. Introduced by Yao et al. (2022), ReAct is short for Reason and Action and is the basis for frameworks such as Langchain and simpleaichat. ReAct depends on the language capabilities of LLMs to create so-called “chains-of-thought” that break down complex prompts into steps. Within these steps, the LLM handles tasks using native capabilities (namely, language prediction) with the added feature of being able to, when appropriate, invoke additional functions and tools (e.g. calculators, web search, encyclopedias). As such, the capabilities of LLMs are extended by the orchestration capabilities of a framework, all of which allows for sophisticated problem solving.

    In the abstract, this approach may seem lofty. To ground the explanation, below is what this approach looks like in the command line. Using a Python script, the following output uses OpenAI’s GPT-3.5 model and Langchain to answer the original prompt:

    Entering new  chain...
    steps=[Step(value='Add 2 and 2 together.'), Step(value='The sum of 2 and 2 is 4.'), Step(value="Given the above steps taken, respond to the user's original question.\n\n")]
    
    Entering new  chain...
    Thought: To add 2 and 2 together, I can use a calculator tool.
    
    Action:
    {"action": "Calculator", "action_input": "2 + 2"}
    
    Entering new  chain...
    numexpr.evaluate("2 + 2")
    
    Answer: 4
    Finished chain.
    
    Observation: Answer: 4
    Thought:To add 2 and 2 together, the result is 4.
    
    Action:
    {
      "action": "Final Answer",
      "action_input": "The sum of 2 and 2 is 4."
    }
    
    Finished chain.
    
    Step: Add 2 and 2 together.
    
    Response: The sum of 2 and 2 is 4.
    
    Entering new  chain...
    Action:
    {
      "action": "Final Answer",
      "action_input": "The sum of 2 and 2 is 4."
    }
    
    Finished chain.
    
    Step: The sum of 2 and 2 is 4.
    
    Response: The sum of 2 and 2 is 4.
    
    Entering new  chain...
    Action:
    {
      "action": "Final Answer",
      "action_input": "The sum of 2 and 2 is 4."
    }
    
    Finished chain.
    
    Step: Given the above steps taken, respond to the user's original question.
    
    Response: The sum of 2 and 2 is 4.
    Finished chain.
    

    So, what is happening here?

    First, the script sends the prompt to GPT-3.5 along with additional orchestration instructions called a meta-prompt. The meta-prompt is a written description of how the LLM should parse and approach the prompt itself. Langchain uses the following meta-prompt at each step and does so without explicitly sharing it in the command line output.

    Assistant has just had the below interactions with a User. Assistant followed their "system: Instructions" closely. Your job is to critique the Assistant's performance and then revise the Instructions so that Assistant would quickly and correctly respond in the future.
     
    ####
    {hist}
    ####
     
    Please reflect on these interactions.
    
    You should first critique Assistant's performance. What could Assistant have done better? What should the Assistant remember about this user? Are there things this user always wants? Indicate this with "Critique: ...".
    
    You should next revise the Instructions so that Assistant would quickly and correctly respond in the future. Assistant's goal is to satisfy the user in as few interactions as possible. Assistant will only see the new Instructions, not the interaction history, so anything important must be summarized in the Instructions. Don't forget any important details in the current Instructions! Indicate the new Instructions by "Instructions: ...".
    

    This code snippet contains two things of note. First, the meta-prompt is considered a type of prompt engineering. Prompt engineering is a technique for constraining an LLM to respond in specific and hopefully expected ways. Second, {hist} is how an LLM is given memory. The variable {hist}, which stands for “history”, is a way to store previous steps and outputs. These previous steps are fed forward into subsequent steps to give the process continuity. As such, the LLM can refer to what has been done before to inform subsequent actions and does so in a constrained manner to focus on answering the question sufficiently. Combined, past actions and constrained responses seek to make the LLM an engine for predictable process automation.

    Using this meta-prompt, the GPT-3.5 model constructs a set of steps based on the prompt “What is 2+2? Explain how you got an answer.” The output is this:

    Entering new  chain...
    steps=[Step(value='Add 2 and 2 together.'), Step(value='The sum of 2 and 2 is 4.'), Step(value="Given the above steps taken, respond to the user's original question.\n\n")]
    

    These steps are then fed back into another request to GPT-3.5 using the same meta-prompt. The model processes the meta-prompt (now with more history!) and prompt, and generates a “Thought” for what to do next. From the “Thought,” the LLM determines that there is a tool for the task at hand. In our case, the determination is that “2+2” is a math problem, and so the LLM predicts that the calculator is an appropriate tool:

    Thought: To add 2 and 2 together, I can use a calculator tool.
    

    When the calculator performs the simple calculation, the model cycles through another time to determine what to do next.
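
    The cycle itself can be pictured as a small loop. The sketch below is not Langchain's implementation, only an illustration of the Reason and Act cycle; call_llm and calculator are hypothetical stubs:

    # Illustrative Reason/Act loop with stub functions; not Langchain's code.
    def call_llm(prompt: str) -> dict:
        # Stand-in for a model call that returns a structured Thought/Action.
        return {"thought": "To add 2 and 2 together, I can use a calculator tool.",
                "action": "Calculator", "action_input": "2+2"}

    def calculator(expression: str) -> str:
        return str(eval(expression))  # stand-in for the math chain

    tools = {"Calculator": calculator}
    history = []

    def run_step(step: str) -> str:
        decision = call_llm(f"History: {history}\nStep: {step}")
        if decision["action"] in tools:   # the "Thought" points at a tool
            observation = tools[decision["action"]](decision["action_input"])
        else:                             # e.g. "Final Answer"
            observation = decision["action_input"]
        history.append((step, observation))
        return observation

    print(run_step("Add 2 and 2 together."))  # -> 4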

    Just as the meta-prompt is hidden above, so too are the ways frameworks extend the capabilities of LLMs with tools. To determine that a tool is applicable in a given step, the Python script must define upfront the conditions for when a tool is useful. For example, in the script used here, the calculator tool is described in the following manner:

    tools = [
        Tool(
            name="Calculator",
            func=llm_math_chain.run,
            description="useful for when you need to answer questions about math",
        )
    ]
    

    As the LLM works through the meta-prompt and prompt, this tool description augments what answers are possible. If the LLM “reflects” on a “Thought” that a tool may be useful (say, the calculator), then the tool can be applied. In our case, once the calculator tool is invoked, the script performs the appropriate action (e.g. calculating), and then moves on to another language-based step (e.g. composing a response, determining the response is a final response). As may be apparent, if the LLM makes an unexpected prediction—say, determining that a web search is applicable for 2+2—something else might happen instead, such as providing an answer about the Bob Seger System’s anti-war song “2+2=?”.
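
    For context, wiring a calculator tool into an agent looked roughly like the following in the Langchain versions current when this was written. This is a hedged sketch rather than the exact script used here (the output above appears to come from Langchain's plan-and-execute agent), and the API has likely changed since:

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import LLMMathChain
    from langchain.agents import Tool, initialize_agent, AgentType

    # Assumes an OpenAI API key is available in the environment.
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    llm_math_chain = LLMMathChain.from_llm(llm=llm)

    tools = [
        Tool(
            name="Calculator",
            func=llm_math_chain.run,
            description="useful for when you need to answer questions about math",
        )
    ]

    # The agent consults each tool's description when deciding what to do next.
    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
    )
    agent.run("What is 2+2? Explain how you got an answer.")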

    The cycle of Reason (“Thought”) and Action concludes when the LLM, using the meta-prompt to assess the state of affairs, determines that the final answer has been obtained. Upon this determination, the LLM generates an answer (“Response:”):

    Response: The sum of 2 and 2 is 4.
    

    In the end, the combination of a framework and an LLM provides an answer to the math problem through language—here using language to determine steps and assess next steps—and through the application of a tool—here a calculator to compute the answer. Notably, the response generated by the LLM does not explain much about why 2+2 = 4, as the responses in the previous approach did. Instead, we see the LLM show work for its own logic without explaining much about the underlying and requested math work. In other words, the final answer is correct, but the explanation is insufficient.

    In tension

    These approaches illustrate that when an LLM produces an answer to a math question, the underlying work is different than we might expect. As such, I want to expand on a key tension that underlies LLMs and generative AI more broadly. This tension is found in the context of LLMs doing math problems, but is by no means limited to that context. The tension is this:

    LLMs seem very human, but also are non-human computing technologies.

    The tension stems from an inherent characteristic of LLMs, namely, these technologies display emergent capabilities. “Emergent” means that LLMs demonstrate capabilities for things they weren’t explicitly built to do. In our example, being able to seemingly solve math questions is an emergent capability.

    These capabilities are one reason LLMs have drawn so much attention. For example, the “eureka moment” for Microsoft CEO Satya Nadella was when OpenAI’s GPT-4 model “preserved the sovereignty of [Rumi’s] poetry across two language[s]” (Wired, 2023). Nadella points out that machine translation is by no means new. For him, what felt new was the ability of an LLM to capture meaning that exceeded mere word-for-word translation. This and other emergent capabilities are uncanny and, importantly, unexplainable to many researchers in the field. Despite building the techniques of machine learning, experts are unable to pin down how LLMs are capable of things previous models were not. The main explanation given is that more data is used to train LLMs, but this explanation only describes the conditions rather than the mechanisms for emergence. Regardless of why, experiencing LLMs and generative AI doing new things results in a flywheel: “what else can these things do?!” And therein is the tension.

    On one hand of this tension is the attribution of humanness to non-humans, or anthropomorphization. Many assign human qualities and descriptors to LLMs because they observe LLMs acting in ways and in contexts that are familiar to us as human actors in those contexts. For example, a back-and-forth exchange in WhatsApp is not that dissimilar in form and content from an exchange with ChatGPT. LLMs present a compelling illusion of humanness by producing responses that look and feel human. Responses use sentences and paragraphs, ideas and concepts, and what has been written previously (e.g. ”memory” or “history”). Combined, the effect feels truly astounding and very human-like. In turn, we come to expect LLMs to be like us.

    On the other hand of this tension is the attribution of controllable causality to LLMs because they are made-things. I’ll refer to this as hubristic determinacy (though I am open to a better term). Despite “chatting with us,” LLMs are computing technologies. As such, we also expect LLMs to work how computers work both in terms of actions (e.g. compute, calculate) and outcomes (e.g. produce what we are looking for). The context and underlying knowledge that these are made-things means we assume at some level they are also things-to-be-controlled, even if we cannot control them just yet. When LLMs do things we do not expect or feel off, the presumption is that the technology can be reined in with current or future techniques and knowledge.

    Terminology is particularly revealing of this tension. Take the term “hallucination”, which entered the popular lexicon in recent years. “Hallucination” is used to describe instances when LLMs generate unexpected outputs, especially in relation to known facts. For example, an LLM may cite legal precedents that do not exist when prompted for citations. As a term, “hallucinations” captures that LLMs are not doing what we want them to do—namely, be controlled for our specific ends. Though obviously anthropomorphic because it attributes human-like agency to the actions of LLMs, the term “hallucinations” is tellingly invoked as a metaphor to help us understand how a machine resists our will. As such, the term itself is anthropomorphic and used to cope when expectations of control are undermined.

    The trouble is that this tension is a condition of generative AI rather than a resolvable bug. LLMs are technologies and tools with particular dispositions and affordances. These dispositions and affordances include not being especially predictable from the perspective of an observer and being able to interface with people through natural language. Anthropomorphization helps us understand these systems, but does not represent them. Hubristic determinacy expresses our intentions, but does not manifest them. Viewing this tension as something to overcome leads to faulty expectations, and, I would argue, undercuts our ability to use LLMs effectively—in the right context, for the right purpose, with the appropriate mindset. In sum, the tension sidesteps a different line of questioning: what should we actually expect from LLMs?

    Dependably unexpected

    The two approaches of how an LLM handles a math question are likely not the only two. The reason for focusing on these approaches has been to unpack the central role of LLMs as mediators of language when prompted with math questions. The approaches have similar punchlines: either an LLM predicts what might come next or predicts whether a tool is needed next to do a task. In the first approach, the outputs are produced through the use of language examples of the central math question. In the second approach, the outputs are produced through the use of language to determine the work needed to produce an answer.

    For someone learning math, how these systems actually answer math questions may matter very little when they are right. In our prompt, the explanations provided are sufficient and the calculations are correct. The issue comes in when LLMs are seemingly wrong. In the first approach, an incorrect answer stems from the way the prompt is parsed, data are leveraged to respond, and the alchemy of responding. In the second approach, an incorrect answer stems from similar factors, albeit at the various steps along the way: the prompt is parsed in a way that no tool is selected; the wrong tool is selected; the “Thoughts” generated at a step are unexpected and gum up the ReAct sequence; the final answer is generated in an unexpected way, even if the right tool is used and the calculation is made correctly (this tutorial documents just this scenario).

    In instances when LLMs fall short of our expectations, those expectations come into focus, even though they have been there all along. So LLMs use different work to get to math explanations, so what? The issue is that without critical reflection and attenuated expectations based on the realities of these technologies, those building and using them fail to consider whether these technologies are appropriate for a given context, problem, or circumstance.

    The main lesson, then, relates to the tension of these technologies: despite how human LLMs feel or how much we know they are made, they are neither of those things. On one side, we can easily confuse the human-feeling inputs and outputs with acting and thinking like a person. Yes, LLMs can respond to math questions with math answers. Observers can judge those answers as right or wrong, justified or fabricated, accurate or inaccurate. But LLMs don’t do what we think they do, rendering our evaluations inaccurate. Similarly, despite humans making these technologies, they are not controllable in the same way other technologies are. They surprise us both with compelling and unnerving things, but also with things that don’t correspond to reality. In short, we need a different mindset.

    What can be reasonably asserted is that LLMs—and generative AI overall—are dependably unexpected in how they do what they do. That unexpectedness is a risk and liability in instances and applications when LLMs and other generative AI are expected to mediate circumstances or conditions that humans are far more adept at, or when inherent biases in data trickle into responses. More optimistically, the dependable unexpectedness of LLMs offers us a glimpse at pathways we might never consider because we as humans respond given our history and education, personal and professional expectations, and social mores that may foreclose certain considerations. In other words, many times we expect particular outcomes to questions, and LLMs might surprise us (but caveat emptor).

    Considering LLMs as dependably unexpected is maybe more useful in the inverse. When expectedness is critical, LLMs and generative AI are likely not the technology to use. Approaching the application of LLMs in terms of a tolerance for unexpectedness is, then, a way to see where they add value, and where they detract when safety, stability, and assurance are preferred. Because LLMs are given human qualities, a prevailing assumption is that they can be taught to be good actors in a human world. Regardless, those using LLMs for tasks or in products have an obligation to think about the human world LLMs might create, and whether that world is one people are eager to expect.

    → 12:00 PM, Oct 6
  • The generative AI paradox

    Note: After publishing this post, I have made some changes to my thinking. I will leave the below but want to mention:

    • tension is a better word than paradox
    • hubristic determinacy is a better term than over-determinacy

    These changes can be found in the subsequent post.

    Original Post

    I have been working on a forthcoming and longer post (article?) about LLMs doing math. I cut a bunch (because I am verbose!), but thought the following two excerpts were worth sharing on their own. They demonstrate a key paradox about generative AI, where LLMs are a type of generative AI:

    Generative AI feels human and so is attributed human qualities (anthropomorphized), but is not human and so is attributed deterministic qualities (over-determinacy).

    Each of these excerpts captures in a single written breath the paradox of generative AI.

    The first excerpt comes from a course created by Khan Academy for teachers, students, administrators, and parents grappling with the role of AI in the classroom. In a section advising students about the importance of critical thinking skills when using AI for educational purposes, we find this excerpt:

    “In addition to the obvious potential for letting an AI do your thinking for you, we know that AIs aren’t always right! They can give you inaccurate information, and they don’t have any judgment! […] They sometimes calculate math incorrectly, or in an inefficient way.” [source: Khan Academy]

    The second excerpt comes from a tutorial about the framework Langchain published by the company Pinecone. In this excerpt, the following explanation is provided about how an LLM handles a math question:

    “LLMs are generally bad at math, but that doesn’t stop them from trying to do math. The problem is due to the LLM’s overconfidence in its mathematical ability.” [source: Pinecone]

    These excerpts demonstrate the two sides of the paradox. We see anthropomorphization in the words calculate and overconfidence. The presumption here is LLMs respond to questions in a manner a human would respond if presented with the same questions. As such, asking an LLM to answer a math problem does not necessarily result in the LLM doing calculations, even though we ourselves may do precisely that when asked the same question. Without getting into the debate of whether machines can think, this side of the paradox equates the question with the approach, and so overlooks that the answer can be obtained by other means.

    We see over-determinacy in the use of the words inaccurate, bad, and incorrect. The output—and so LLM—is evaluated on the wrong criteria. Better words to describe the results would be unexpected or unanticipated. LLMs produce unexpected or unanticipated responses to prompts because they are non-deterministic. Although we might say “2+2=5” is incorrect, doing so doesn’t capture the important nuance that the 5 is not calculated but generated as a prediction when an LLM is in the mix. Characterizing LLMs as wrong is, in my opinion, a category error akin to saying a possibility is incorrect, inaccurate, or bad. Instead, a possibility has a degree of likelihood. The issue, then, is confusing a prediction with a certainty, and that falls on the observer rather than the prediction itself. As such, the objects and processes in question are held to the wrong standards.
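
    To make the prediction-versus-calculation point concrete, here is a toy sketch of next-token sampling. The probabilities are invented for illustration and do not come from any real model:

    import random

    # Invented next-token distribution for the prompt "2+2=".
    next_token_probs = {"4": 0.92, "5": 0.05, "four": 0.02, "22": 0.01}

    tokens, weights = zip(*next_token_probs.items())
    completion = random.choices(tokens, weights=weights, k=1)[0]
    print("2+2=" + completion)  # usually "4", occasionally something else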

    —

    This paradox makes talking and writing about LLMs complicated, as these excerpts illustrate. More broadly, I see the paradox underscoring the voracious appetite many companies and people have to apply LLMs everywhere possible and then work to constrain their innate non-deterministic qualities. I view the impulse to apply and constrain LLMs to be a contemporary version of Foucault’s idea of “docile bodies.” In the case of generative AI, actual human bodies are transmuted by data and subsumed into a model. The application and fine-tuning of that model constrains outputs to make a predictable worker (or assistant, or task-completer). As such, generative AI makes bodies docile twice over: first in the model creation and then in the model application, and makes docile both human bodies and anthropomorphic machines.

    I’ll be publishing the larger piece from which this was cut soon (this week, maybe next), and probably will be writing more about docile bodies thinking another time (maybe when I am not just wedging it into a scrap of paper).

    I would love feedback or thoughts. I’m sure I am off base with some of my thinking or explanations. Reach out!

    → 4:58 PM, Oct 4
  • Prompt: What happens at the end of the world?

    As part of some writing, I sent the prompt “What happens at the end of the world?” to Llama2, Orca Mini, and GPT-3.5. Not one responded with a mention of AI as being involved in the demise of Earth. This has me wondering if this situation is similar to asking if someone is good at driving and they respond “Absolutely, and everyone else is the problem!”

    Llama 2 (7B)

    Thank you for your question! I’m just an AI, I don’t have personal beliefs or opinions, but I can provide you with some general information on the topic.

    The end of the world is a topic that has been debated and explored in various cultures, religions, and scientific theories. However, it’s important to note that the concept of the end of the world is often associated with catastrophic events, such as asteroid impacts, massive earthquakes, or global nuclear wars. These scenarios are highly unlikely and are not supported by scientific evidence.

    In reality, the end of the world as we know it is unlikely to happen in the near future. The Earth is a dynamic and ever-changing planet, and it has faced numerous challenges throughout its history, including natural disasters, climate change, and the impact of human activities. However, these challenges are not necessarily indicative of the end of the world.

    Instead, it’s important to focus on addressing the pressing environmental, social, and economic issues that we face today. By working together to address these challenges, we can create a more sustainable and equitable future for all.

    In conclusion, the end of the world is not a likely scenario in the near future. Instead, let’s work together to create a better future for ourselves and future generations.

    Orca Mini (7B):

    As an AI, I cannot predict what will happen at the end of the world since it is a hypothetical scenario. However, if we are to consider scientific predictions, there are various theories about what could happen at the end of the world. Some believe that it could be caused by natural disasters such as asteroid impacts or climate change, while others speculate that it could be due to supernatural events like divine intervention or alien visitations. These are just a few possibilities and they vary depending on different interpretations of religious beliefs or scientific evidence.

    GPT-3.5 (through ChatGPT web app):

    The idea of “the end of the world” can be interpreted in various ways, depending on the context. Here are a few different scenarios that people often consider when thinking about the end of the world:

    • Natural Catastrophes: Some envision the end of the world as a result of natural disasters, such as a massive asteroid impact, supervolcano eruption, or a catastrophic climate change event. In these scenarios, the Earth itself may still exist, but human civilization could be severely disrupted or even wiped out.
    • Nuclear Apocalypse: Another popular apocalyptic scenario involves a global nuclear war that could lead to the end of human civilization. This scenario often involves widespread destruction and a significant reduction in the human population.
    • Pandemics: A pandemic with a highly contagious and deadly pathogen could also be seen as an apocalyptic event. Such a scenario could lead to a collapse of society as we know it, with mass illness and death.
    • Environmental Collapse: Some argue that ongoing environmental degradation, such as deforestation, pollution, and resource depletion, could eventually lead to a point where Earth can no longer support human life in the same way it does today.
    • Cosmic Events: The end of the world could also be associated with cosmic events like the heat death of the universe, which is a theory in astrophysics suggesting that the universe will eventually reach a state of maximum entropy, making life unsustainable.
    • Religious or Mythological Beliefs: Many religious and mythological traditions have their own beliefs about how the world will end. For example, in Christianity, there is the concept of the Apocalypse or Armageddon, which involves divine judgment and the ultimate fate of humanity.

    It’s important to note that these scenarios are often the stuff of science fiction, religious beliefs, or speculative discussions. Predicting the actual end of the world with certainty is impossible, and many of these scenarios are highly unlikely in the near term. Human societies have faced various challenges throughout history and have shown resilience in adapting and surviving.

    Efforts to address issues like climate change, nuclear disarmament, pandemic preparedness, and environmental conservation are important steps toward mitigating the risks associated with potential global catastrophes. Humanity’s ability to address these challenges will play a significant role in determining the future of our planet.

    → 4:53 PM, Oct 2
  • Here is a lesson in: 1/ the difference between less and few, and 2/ how much context matters to the definition of words. (Also the use of “quantum” to describe a unit of work does not make me think lesser was precisely an accident.)

    → 4:16 PM, Sep 22
  • I feel seen in that attempting to log into LinkedIn whatsoever is suspicious. It’s me, begrudgingly.

    → 12:29 PM, Sep 22
  • I am sort of surprised I haven’t read an explanation of hyperparameters for LLMs that uses Family Feud as the metaphor. img src

    → 12:14 PM, Sep 22
  • Getting Started but Going Where? What getting started guides tell us about the emerging shape of the generative AI domain

    Following the release of OpenAI’s ChatGPT, a tide of new educational resources washed in aimed at educating newcomers about large language models (LLMs). Some of these resources were overviews of concepts and companies; others were tips and tricks guides (”this one prompt will save you hours!”); and still others were hands-on tutorials of working with APIs, packages, and various models themselves. As introductions to the terms, concepts, techniques, and technologies, these resources—what I call here “getting started guides”—serve as first steps in the journey of some to understand and use LLMs.

    After some hesitation, I myself was pulled in. I am not a developer, but a designer and researcher with some very dusty front-end development experience. As such, I decided to start at the very beginning with fairly basic tutorials and conceptual overviews because, frankly, I was deeply intimidated by the domain. A lot of folks seemed to know a lot about what was going on—and I felt at a loss.

    The getting started guides varied in form, content, and effectiveness, yet had similar currents. I noticed that certain names repeated: OpenAI, Google, Langchain, HuggingFace. I noticed the same diagrams of how models worked, and similar explanations and steps of how an LLM makes sense of prompts. I noticed that companies often wrote these resources, often promoting companies on which their own offerings depended. I also noticed the frequent absence of considerations such as the data underlying the models, their potential for bias, and the ways things go wrong. (Those concerns seemed left for reporters.) Most of all, the guides emphasized getting started fast.

    Although my intention was to learn about LLMs, I found myself learning about something else too: the guides themselves. After looking a little deeper and shoring up hunches, I decided to catalog and review getting started guides for LLMs in order to understand what getting started guides tell us about the emerging shape and bearing of the generative AI domain.

    The following post provides a review of what I found. First, I provide a personal anecdote about what led me to look at the guides rather than just what the guides were teaching. This is effectively the preamble to a recipe—if that’s not your thing, jump ahead. Second, I explain the methodology and its limitations. Third, I detail the findings. Finally, I discuss the implications of what has been shared.

    The thesis of this post is that getting started guides illustrate that closed and proprietary models and services are becoming default choices for early learning, while alternative open models and services are positioned as conceptual alternatives. The relative positioning of these two camps reflects and reinforces wider industry activities, and so shows a site where commonsensical dispositions and expectations circulate that influence what it means to make (with) generative AI.

    Definitional note: getting started guides are fuzzy

    I use the term “getting started guides” to refer to a fuzzy category of educational materials that are intended to quickly introduce individuals to a new domain and set these individuals up for continued learning. Getting started guides exist for all sorts of topics, but I myself see a great deal of these working in technology. The reason I think getting started guides are so common in tech is that technology supposedly changes quickly, speed is operative in capitalism, and continued professional development is a technology worker’s response to the precariousness of these forces.

    Regardless of the topic and reason getting started guides exist, three characteristics tend to define the getting started guide category in tech and elsewhere:

    • Self-directed: Getting started guides assume that learners are learning on their own. As such, guides tend to be self-contained in their content and on-demand in their access.
    • Informal: Getting started guides assume that learning is happening in an informal context, and often a temporally constrained context (hence needing to be quick guides). As such, the tone is not overly formal, though at times it can be jargony. The introduction of jargon is an important contribution of getting started guides as jargon is a marker of cultural affiliation, and these guides aim to onboard someone into a domain so they can be effective in that domain.
    • Pragmatic: Getting started guides emphasize—even when giving an overview of concepts—useful information. The point is not knowing what something is, but knowing how to make use of something. The emphasis on know-how means reflections, debates, and discussions are footnotes, if included at all.

    For the sake of reiterating, in the context of this post, when referring to getting started guides henceforth, I mean in particular getting started guides for LLMs.

    Anecdote: why are getting started guides important?

    The thought that getting started guides are useful artifacts to understand the culture and industry of generative AI came to me while reading my first tutorial. The tutorial, titled “Getting Started with LangChain: A Beginner’s Guide to Building LLM-Powered Applications” (25 Apr 2023), is written by Leonie Monigatti for the publication Towards Data Science, which is published on the platform Medium. (Candidly, the tutorial was way over my head at this initial reading.) In this tutorial, the author includes “a personal note” that stood out as more than just personal and more than just a note. Instead, the paragraph contained what felt like a signpost for where generative AI might be headed. The note appears about a fifth of the way into the article, just past the introduction and first code snippets that connect a newbie to an API. The note reads:

    Let’s be honest here for a second: Of course, you can experiment with open-source foundation models here. I tried to make this tutorial only with open-source models hosted on Hugging Face available with a regular account (google/flan-t5-xl and sentence-transformers/all-MiniLM-L6-v2). It works for most examples, but it is also a pain to get some examples to work. Finally, I pulled the trigger and set up a paid account for OpenAI as most examples for LangChain seem to be optimized for OpenAI’s API. Overall running a few experiments for this tutorial cost me about $1. (source)

    Simply, the note is practical advice: it shares a path of least resistance given the author’s expertise. As anyone getting started with a new endeavor knows, early bumps can be frustrating and disruptive, and may even derail further learning. In that way, the personal note seems like a kind arm around the learner’s shoulder: “Hey, I’ve been there and here’s what I’d do if I could do it again…”

    But this advice is more than just about avoiding missteps. The note advocates against specific alternatives and for specific companies, as well as suggests that moving quickly with the least effort or resistance should trump other considerations. As a newcomer myself and working for Mozilla—a company with a history of open-source software development—I wondered: “Can this be true? Certainly there is a place for open-source here, right?”

    What I did not know when I first encountered it is that the tutorial itself is a microcosm of getting started guides and, seemingly, of the emerging shape of generative AI. In the subsequent code examples, HuggingFace is abruptly absent, and so alternatives cede the floor to OpenAI. The examples teach how to use Langchain with OpenAI rather than, as the title suggests, how to use Langchain in general. The result is a tutorial that nudges readers to continue down the path of depending on OpenAI because, well, that is all they learned how to use.

    I finished this tutorial with a number of questions about the conditions that surrounded its existence. For me, this tutorial illustrated that getting started guides are opinionated in ways that, regardless of how explicit they are, reflect and contribute to dispositions, tendencies, and dependencies around making (with) generative AI. As such, the goal became finding a means to zoom out from one tutorial to many, and share that vantage point.

    Methodology

    Research question

    The impetus for this research was to provide an answer to the following question: what do getting started guides tell us about the emerging shape of the generative AI domain? Although a starting point, this initial question is hard to answer directly. As such, the following research is more narrowly focused on the following question:

    As artifacts of the material culture of generative AI, what values, dependencies, and interests are contained in getting started guides about the ecosystem and industry that is emerging around generative AI and specifically large language models?

    Thinking of getting started guides as cultural artifacts helps us answer questions about how generative AI is not just made up of technologies and technical practices, but also made of dispositions, attitudes, expectations, and values that are shared, learned, localized, and morphed over time.

    Data gathering

    The data for this research were gathered as follows:

    Queries

    To gather resources, I did four different web searches. I used Google search in a private window of the Firefox browser. These web searches were:

    • “getting started with LLMs”
    • “beginner’s guide to AI”
    • “LLM introduction”
    • “LLMs intro”

    I ran the web searches at the end of July 2023.

    Collecting

    From each query, I gathered the top 20 links. Because I used Google, these results included promoted links (actually only one promoted link was found), YouTube videos, and links organized by Google’s relevancy ranking.

    Analysis

    After constructing a database of getting started guides, I visited each link and read what it contained. To begin, I used a half-open coding schema to catalog observations; that is, I started with a few pre-existing codes (e.g. mentions of OpenAI and HuggingFace; existence of code snippets) and added additional emergent codes (e.g. the type and format of the guide) as I read and reviewed documents. Codes were applied in subsequent iterations of reviewing documents, until a final pass was done to ensure consistency in their use.

    The coding schema included descriptions of the documents in terms of their type and format, the location and production of the guide, and the guide’s contents (e.g. mentions of OpenAI and HuggingFace, inclusion of code snippets).
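
    As a purely hypothetical sketch of what tallying such codes can look like (the guide records below are invented, not the actual dataset):

    # Hypothetical coding tally; the records below are invented examples.
    guides = [
        {"id": "guide-01", "type": "conceptual overview", "mentions_openai": True,  "mentions_huggingface": False},
        {"id": "guide-02", "type": "first-step tutorial", "mentions_openai": True,  "mentions_huggingface": True},
        {"id": "guide-03", "type": "learning repository", "mentions_openai": False, "mentions_huggingface": False},
    ]

    openai_only = sum(g["mentions_openai"] and not g["mentions_huggingface"] for g in guides)
    print(f"{openai_only} of {len(guides)} guides mention OpenAI without HuggingFace")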

    The analysis seeks to answer the research question through observable instances and their interpretation. All claims refer to the data (those guides gathered) rather than a generalization of all guides, yet I use these data to provide provisional claims about the emerging AI industry and ecosystem. I do so by assuming that getting started guides are cultural artifacts. As such, each guide is made because of and contributes to wider milieu that can be called the domain and industry of generative AI. I use numeric data—percentages and quantities—not to refer to the overall distribution of getting started guides, but to the distribution of the guides that were found.

    In total, this methodology produced sixty (60) unique getting started guides out of a maximum of 80 potential guides. Accounting for this difference is that eight (8) of the original set were deemed irrelevant and twelve (12) were found in more than one query (duplicates). These guides are the basis for the subsequent findings.

    Limitations

    The data gathered in this approach are but one snapshot of what is available. As such, the same queries conducted in the same way would likely produce different results now due to changes in the available resources and their relevancy ranking. Moreover, using different search engines (e.g. DuckDuckGo, Bing, Brave) or specific user preferences would produce different outcomes (though I am not sure in what ways browser or search history impacts results precisely in this domain). Likewise, the presumption that learning begins with a web search characterizes only some learning pathways, and excludes individuals that start using, say, social media, communities, or learning platforms (e.g. Substack, Reddit, Mastodon, Udemy, DataCamp).


    (As much as comparative analysis would be useful, I am not certain if I will iterate on this study. I would encourage others to contribute to the body of knowledge around getting started guides by performing the same, derivative, or altogether new methodologies with respect to getting started guides, as well as analyzing these and other aspects of the material culture of AI through other lenses and artifacts. If you take on the task of collecting getting started guides—even with a list of links or screenshot of what you find—please reach out in the event I find renewed energy!)

    Findings

    Describing guides as artifacts: types and formats

    Types

    According to the data collected, getting started guides for LLMs fall into three types: conceptual overviews (63%, 38 of 60), first-step tutorials (28%, 17 of 60), and learning repositories (6%, 4 of 60). These types of guides are mutually exclusive, though they have overlaps in their composite parts.

    Conceptual overviews

    Conceptual overviews explain the ideas, technology, and processes of LLMs using key terms and jargon, reference points (such as papers, companies, and historical events), and presumed knowledge given the intended audience (such as ideas related to software engineering like abstraction). Together, these features enable subsequent learning by introducing concepts found within other (presumably subsequent) in-domain learning resources. These guides use textual and video explanations; diagrams, illustrations, and animations; and narratives (examples, metaphors) to accomplish this broad overview.

    The core of conceptual overviews for LLMs is breaking down processes into steps and technologies into components. The explanations in conceptual overviews are relatively high-level—and authors sometimes acknowledge this fact—and so presume subsequent learning is needed in order to develop and use LLMs. The acquisition of language (both terminology and jargon), concepts, and resources (including companies to know and products to consider) are the primary learning outcomes of conceptual overviews. All told, the main purpose of conceptual overviews is demystifying LLMs, and so contributing practical knowledge with regards to grounding and navigating subsequent learning.

    First-step tutorials

    First-step tutorials are distinct from conceptual overviews in one key feature: these guides provide code snippets or code notebooks to support experiential learning with LLMs. Many of the features of conceptual overviews are part of first-step tutorials. The introduction of terms, jargon, concepts, and processes fill the preambles of first-step tutorials. Introducing companies-to-know and offerings-to-use happens upfront as well as through coding examples, where the offerings and tools are used for various tasks.

    After preambles, first-step tutorials guide learners through setting up prerequisites. These prerequisites include, amongst other tasks, setting up API access (such as with OpenAI and HuggingFace to access models) and installing packages from the command line (such as Python packages for storing embeddings, accessing models, and structuring prompts). The reason for this starting point is presumably because the audience has not worked with LLMs in this manner yet. Establishing this baseline for development (e.g. APIs, packages, etc.) is a key contribution of first-step tutorials as it makes subsequent learning-through-coding less cumbersome upfront.

    Following the prerequisites, first-step tutorials walk through coding examples that illustrate the ways LLMs operate. Some of these examples include sentence completion, answering trivia, coming up with recipes or company names, summarizing texts, and determining the sentiment of a sentence. Within these tasks are gestures that reinforce using specific offerings (e.g. API keys, function calls to particular models or services) and domain techniques for making LLMs workable. Domain techniques include using hyperparameters (typically just temperature), few-shot text classification, prompt templates, and fine-tuning, all of which are specific to generative AI. These coding examples illustrate through experience what an LLM is and does, and provide an initial foundation for subsequent experiential learning.
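
    For flavor, the opening snippet many of these first-step tutorials share looked roughly like the following at the time the guides were written. This is sketched from the then-current OpenAI Python library; the prompt and parameters are my own, and the API has since changed:

    import openai

    openai.api_key = "sk-..."  # prerequisite: an API key, usually from a paid account

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,  # often the only hyperparameter a tutorial introduces
        messages=[{"role": "user", "content": "Suggest three names for an ice cream company."}],
    )
    print(response["choices"][0]["message"]["content"])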

    Learning Repositories

    Learning repositories (or “learning repos”) comprise the remaining getting started guides. Given how few were found—only 4 of 60 guides fall into this category—claims beyond high-level descriptions lack substantiation. What seems to be the main feature that distinguishes learning repos from other guides—and the reason to mark distinction with so few instances—is that learning repos provide outlines of materials and relationships of those materials rather than the content of the learning itself. In other words, learning repos index learning. Of the learning repos found, some features include grouped lists of books, links, and other resources; explanations of what one could expect to learn and why (similar to an annotated bibliography); and sequencing and stages of advancement through material.

    Encompassed in learning repos are what might be called curricula that structure learning progressively through levels or stages of knowledge and skill acquisition. Also encompassed in learning repos are organized collections of materials (e.g. link lists grouped by topic or type) that lack the progressive learning structure of a curriculum, but nonetheless comprise a means to build domain understanding. In these ways and like the other guides, learning repos exist to accelerate domain knowledge, albeit by providing a foreshortened means to find resources for that accelerated learning.

    Formats

    From the guides found, several distinct formats exist. These formats are listed based on the number of observations:

    Single-page (37 of 60, 62%): The most observed format is single-page. As the name states, single-page guides are a single webpage that flows from top to bottom. Conceptual overviews and first-step tutorials are found in single-page format.

    Video (9 of 60, 15%): The second most observed format is video, namely YouTube videos. Given the means of data collection, the number of videos is a feature of how Google prioritizes YouTube within its search results pages. As such, in the subsequent discussion of where getting started guides are found, YouTube videos are excluded so as to remove a known bias from the means of data collection. That being said, Pew Research Center has found YouTube is a common site where learning occurs. Even still, to determine whether the video format in general and YouTube in particular are significant to learning about LLMs, data collection would need to be done through a different methodology.

    Course (7 of 60, 12%): The third most observed format is the course. The course format breaks down learning into sequential and progressive modules or stages. To deliver learning in this fashion, courses are hosted on courseware sites that stage progressive learning and host content, including text, videos, and interactive activities (e.g. quizzes, clickable examples). Courses may include some form of measured evaluation, and almost always provide a certificate of completion. Conceptual overviews and first-step tutorials are found as courses. No instances of learning-repos-as-courses were observed, though learning repos were a component of some courses in the form of reading lists.

    Index (4 of 60, 6%): The next format of note is the index. The distinction between type and format is one of content and structure, respectively. In the case of the index and learning repository, type and format are coupled as the learning repo is defined as an index of learning resources.

    Ebooks (2 of 60, 3%): The last format of note is the electronic book or ebook. The two ebooks are distinct in that they are not native web formats, but instead are portable document files (PDFs). The format itself and the writing, in terms of content and style, seem to presume a distinctly different audience than other getting started guides. For example, one of the ebooks is produced by the company Databricks, requires sign-up on a landing page to download, and includes a pitch for technical consulting with the company. The other two-part ebook is produced by Nvidia and includes industry case studies and the benefits of LLMs for enterprises. I would venture to say that these ebooks are lead generation tools for companies seeking to court executives and product managers through business intelligence as much as they are conceptual overviews of LLMs.

    Location and production: where are they and who’s making them

    According to the data collected, getting started guides are found on two main types of website (location) when excluding YouTube (see: Formats>Video), and several minor types of websites.

    Learning sites (23 of 51, 45%): Learning sites include courseware sites (e.g. Data Camp), publishing platforms (e.g. Medium), and peer learning communities (e.g. Kaggle). Learning sites are framed as places of learning, whether formal or informal. Medium itself is the most prevalent single site/platform amongst the guides collected.

    Company sites (21 of 51, 41%): Company sites include all guides found on domains or subdomains controlled by a company, including blogs and landing pages.

    The remaining guides were found on LinkedIn (type: professional social network), GitHub (type: code repository), and websites run by individuals (type: miscellaneous).

    For the guides found, eleven (11) were access controlled, ten (10) of which were on learning sites. Access was managed with payment (e.g. a Medium membership) or registration (e.g. creating an account or sharing personal information). When accounting for access, company websites and blogs account for the majority of freely accessible guides about LLMs.

    When considered from a different perspective, we see production is entangled in the companies working in this domain regardless of where a getting started guide is found. Of the 60 guides, 28 (47%) were written by a company. The majority of these guides were housed on company sites, but others were found on learning sites and YouTube. Moreover, 16 (27%) of all the guides collected promoted a product, service, or resource with which the producer/author was affiliated at the time of publishing. In this, we see that a large number of guides blur the line between education and promotion, and so make learners into potential customers.

    The use of getting started guides as marketing and promotion is by no means surprising, but worth mentioning explicitly. Content marketing has long been established as a means of drumming up business for companies, whether in the form of generating leads or promoting offerings through education. In the case of LLMs, getting started guides are opportunities to catch those just beginning to learn about the domain. In that, getting started guides are amongst “top of the funnel” tools for companies in this domain. Moreover, guides are used to funnel potential users, customers, and clients in two ways: through promoting products themselves and through name recognition amongst those learning in this space.

    In the next section, name recognition is focused on in a different way: regardless of who produces a getting started guide, what names frequent its contents.

    Technology stack: entrenchments and alternatives

    An important feature of getting started guides is what technologies and companies are promoted therein. As individuals develop skills and understanding, the particulars of that knowledge are important. When a guide includes an API, for instance, learning may depend exclusively on capabilities, expectations, and gestures unique to that company or offering. As such, the inclusion of specific technologies, frameworks, techniques, and their accumulation—what is typically called the “tech stack”—cannot be separated from the knowledge gained in some—if not many—instances. As such, a commonsensical choice starts as an initial suggestion and is then reinforced through use.

    To focus the inquiry on what is becoming commonsense, I look at two camps and specific proxies within each.

    Closed camp: One camp is closed-source and proprietary LLMs and services. The focus of the closed camp is on mentions of OpenAI and use of OpenAI’s services and models. In this camp are other companies, including Google, Cohere, Nvidia, and Databricks. I only call out OpenAI and Google in the analysis because they capture the other proprietary mentions.

    Open camp: The other camp is open-source and openly-accessed LLMs and services. The focus of the open camp is on mentions of HuggingFace and use of HuggingFace’s Inference API and Transformers tools. Other companies and models exist—such as BigScience and TII’s Falcon—which are grouped together within the open camp.

    A clear theme is that LLMs are dominated by the closed camp. Of the 60 getting started guides, 46 (77%) explicitly mention OpenAI in both name and in code instances. Four (4) additional guides—totaling 50 guides (83%)—mention OpenAI implicitly by way of reference to OpenAI’s ChatGPT. These mentions are in name only; that is, they do not include code examples. 53 (88%) of all guides mention products, services, and models from the closed camp. The three additional guides are written by Google and exclusively mention models Google created, and none of these include code snippets. As such, OpenAI constitutes the bulk of individual mentions. No guides are produced by OpenAI.

    The open camp is less dominant. HuggingFace is mentioned most frequently as an alternative to OpenAI. Across the 60 guides, 15 (25%) mention HuggingFace explicitly. An additional three (3) guides—totaling 18 (30%) guides—mention or link to HuggingFace or other open-source models (e.g. LMSYS Org’s Vicuna, TII’s Falcon, Big Science’s BLOOM). No guides are produced by HuggingFace.

    Given the shape of the computational technology industry and ecosystem in general, the dominance of the closed camp is maybe not too surprising. What is more telling is where and how the open camp is found within specific types of guides. Of the 38 conceptual overviews, nine (9) mention HuggingFace and 27 mention OpenAI, with 23 of 38 (60%) of conceptual overviews only mentioning OpenAI or another proprietary model. No conceptual overviews mention HuggingFace or open-source models exclusively. Only one guide mentions an open-source model—and particularly, an open-source local LLM—as an explicit goal rather than simply an alternative to closed systems.

    Of the 17 first-step tutorials, only 4 guides mention HuggingFace in coding examples, though HuggingFace is mentioned in the copy elsewhere in 3 additional tutorials. Inversely, 13 of the 17 first-step tutorials (76%) use code examples based solely on OpenAI’s offerings. When HuggingFace is mentioned in code, only one instance uses an open-source model, namely, Big Science’s BLOOM LLM. In the other three (3) instances, HuggingFace is used to interact with a Google model hosted on HuggingFace and accessed through HuggingFace’s Inference API.

    In total, we see the closed camp overwhelm the open camp in terms of mentions. In no instances is HuggingFace mentioned without mention of OpenAI. Conversely, 28 of the 60 guides (47%) mention OpenAI without a mention of HuggingFace or an alternative, and 35 of 60 (58%) mention a proprietary model without a mention of an alternative from the open camp.

    Discussion

    Conceptual alternatives

    These findings show that while open-source models and openly-accessible services exist, closed-source models and proprietary services are dominating the generative AI industry. As cultural artifacts, getting started guides illustrate that open-source models and openly-accessible services are framed within the industry and domain as conceptual alternatives to closed-source models and proprietary services. By conceptual alternative I mean that the open technologies are mentioned to make the point that closed technologies are not the only options, but that tangible examples of how to make open and closed comparable are not yet realized in the building of systems, products, services, and even coding examples. HuggingFace as a proxy for the open camp is name-checked, and then effectively disregarded by many first-step tutorials. Even in conceptual overviews, where HuggingFace is mentioned more often, HuggingFace is often literally mentioned in parentheses and rendered marginal when compared to the volume and centering of OpenAI. ChatGPT, for example, has become a stand-in for generative AI in general, and OpenAI reaps the benefits of that.

    This division between open and closed camps is similar to the technology industry in general, where open-source alternatives exist but not at the same scale as proprietary offerings. As such, open-source models and systems seem to lack equivalently substantive onboarding in the form of first-step tutorials that are useful for developers, and similar coding examples at higher levels of learning. The reason for this lack of coding examples is, as the personal note calls out, that the performance of these open alternatives is not comparable and that they are not as easy to use. The question, then, is how the means of comparison got established in the first place. Though I cannot say in general, in the case of getting started guides, the expectations that certain types of performance and use are desirable are established through the cultural narratives of technology production.

    Narrowing in on first-step tutorials provides a glimpse of what is at work. The prerequisites and coding examples render subsequent learning less cumbersome by establishing a technical baseline for learning more. The examples teach the expectations of use and development through tasks and techniques, and so first-step tutorials also nudge learners forward on a path with particular APIs and tools, and so with particular companies. In that, first-step tutorials reflect and produce wider dispositions because their producers learned and then teach—and, in the case of companies, depend on—those dispositions as commonsense and the companies as defaults. The technology industry also tends toward a critical technical practice: alternatives are not valued simply as possible in the abstract, but as workable when materialized concretely. As such, the taught-because-learned dispositions toward specific comparisons depend on the currently fashionable techniques, technologies, and products. As conceptual alternatives, open-source models and openly-accessed systems have ground to make up to be substantive alternatives, both in the onboarding itself and in that which supports others building with that open tech stack.

    What’s missing?

    Until now the focus has been on what has been mentioned in getting started guides. But what seems absent?

    A major absence from getting started guides is mention of the unpredictability of LLMs, as well as their bias, the data the models were trained on, the impact of that training, and a host of other considerations. Although the social and ethical considerations are paramount, I realize that expecting a getting started guide to delve into these issues is misplaced. Without the requisite understanding, these issues are hard to understand, let alone do anything about besides avoiding use of LLMs altogether. The unpredictability of LLMs does, however, seem like an appropriate thing to mention in getting started guides as it is very apparent and has been reported on widely.

    In early code examples, LLMs do not work as one might expect. Seemingly simple questions—e.g. what is the capital of France?—may not be answered correctly, leading one to think the system is not working or the code written is faulty. In fact, as I have learned, if an LLM produces any words at all, that should be considered working, because LLMs are not answering a question but providing a statistical prediction of what word should follow from the preceding words. If the output is factually wrong, that is beside the point from the perspective of an LLM. Yet this inherent, expected unpredictability goes largely unmentioned in the guides.

    Mentions of the correctness of answers are few across the getting started guides collected. Of the 60 guides, I found only three (3) instances that call out that LLMs will get answers “wrong.” In other instances, getting started guides use language to frame unpredictability without explicit mention, e.g. LLMs “generate new text that is coherent and grammatically correct” (source). Implicit is that coherent and grammatically correct does not mean factually correct.

    The lack of mention of these wrinkles is emblematic of the implicit goal of getting started guides. Besides teaching people about LLMs, getting started guides aim to convince newcomers that LLMs are exciting and worthy of their investment. In other words, getting started guides are material artifacts of domain boosterism. Some producers of guides are clear about the drawbacks, yet in many cases the complexities are omitted, vaguely mentioned, or dismissed as workable. For example, bias is a major issue of LLMs, yet bias is mentioned in only five (5) guides. Most of these mentions simply say bias exists in these systems, which is an efficient dismissal of bias as an issue one might consider before learning more.

    In sum, getting started guides lack important information so individuals can approach learning about LLMs critically. As unreasonable as it may be to expect a deep discussion of unpredictability and bias in a getting started guide, an appropriate level of detail about the drawbacks of these systems is needed.

    Conclusion

    Getting started guides for LLMs are artifacts of AI’s material culture. As has been described above, they are products of and contribute to the cultural milieu of generative AI by producing and reproducing features of an ecosystem within which they exist. As such, getting started guides illustrate what those working in and around AI may see and feel elsewhere.

    In particular, this post describes that getting started guides document the growing entrenchment and consolidation of the LLM technology “stack” around proprietary and closed-source models, services, and systems. Open-source alternatives, while they exist, are conceptual alternatives rather than viable, substantive alternatives—that is, alternatives ready to be used in products. With the role of staging learning, getting started guides blur the boundary between content marketing and educational resources, seemingly warming leads through name recognition and practical adoption as much as priming subsequent knowledge acquisition. These guides set expectations about what LLMs can do (and who and what can provide that doing), while kicking the can on aspects that complicate their adoption and mire expectations.

    → 7:57 PM, Sep 19
  • TFW you think you know what you are doing after one tutorial

    → 8:20 PM, Sep 11
  • Maybe just don’t take oil money AND use the term responsible.

    → 2:02 PM, Sep 11
  • TWIL: Been reading about LLM context windows & came across this article: “We find that language models often struggle to use information in the middle of long input contexts, and that performance decreases as the input context grows longer.” Research couldn’t explain why though.

    → 1:18 PM, Sep 8
  • Current personal reading: “On the Aesthetic Education of Man” by Friedrich Schiller

    Noteworthy thus far: Schiller argues that play unifies the rational/durable and the material/transient, and thus is the path to conceiving Beauty.

    → 8:00 AM, Sep 8
  • After reading through a bunch of LLM getting started tutorials and resources, here are a few observations:

    • Simple code is fragile because the ecosystem is changing so fast (e.g. basic calls to a model on HuggingFace don’t work because of access changes)
    • LLMs are unpredictable in unpredictable ways. LLMs are, by design, non-deterministic, but there are a host of other factors that seem to make LLMs behave in unexpected ways (e.g. other processes running while using a local install).
    • Getting started resources tend to be confusing because the people writing them seem confused too. Am I an expert? No. But I can tell that others are struggling to make sense of what these things are doing.
    • Implications and ethics in what people choose to use are altogether left out, save a few instances.
    • Some of it is just substanceless. This “course” from Google feels like it was made by AI to game SEO (which Google runs so…)
    → 11:57 AM, Sep 1
  • Y’all see this clip of Sam Altman inventing the Orb?

    → 11:52 AM, Sep 1
  • TWIL (this week I learned) I recently got llama.cpp running on my desktop (tutorial forthcoming!) only to find out that the GGML format is dead and has been replaced with the GGUF format. I have yet to move over to that format, but I learned it exists!

    → 11:45 AM, Sep 1