Notes on multimodal models, AI for infrastructure, fine-tuning, and soft prompting

I have been taking a lot of notes on readings and other sources, and am planning on posting them more frequently. Here are some notes from the backlog.

From Matt Rickard:

Quick explanation of CLIP

CLIP is a text-image neural net that embeds word tokens and pixels (pixel-tokens?) in a single vector space. It is interesting because it can do zero-shot learning from text descriptions of images. CLIP underlies Stable Diffusion and a 2021 OpenAI model (DALL·E).
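
I haven't run this myself, but here is a minimal sketch of what CLIP's zero-shot classification looks like in practice, using Hugging Face's transformers library (the image URL and label set are placeholders, not anything from Rickard's post):

```python
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image works; this URL is just a placeholder.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# Candidate classes expressed as plain text. The model was never trained on
# this particular label set, which is what makes it "zero-shot".
labels = ["a photo of a cat", "a photo of a dog", "a photo of a truck"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image scores the image against each text in the shared space;
# softmax turns the scores into zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```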

Why is this interesting? Multimodal models are already here. This means that text-to-text may be just one type of input and output rather than a constraint of the model itself. OpenAI is certainly pushing in this direction with GPT-4 Turbo, as is Meta with ImageBind, and others such as LLaVA. This makes me wonder how we might take existing models and combine them rather than simply training new ones. Not being an engineer, I could be suggesting something that is very harebrained.
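
From what I understand, this kind of combining is roughly what LLaVA-style systems already do: a frozen vision encoder and a frozen LLM joined by a small trainable projection. A very rough PyTorch sketch of the idea (the dimensions and the module itself are illustrative assumptions, not any particular system's code):

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Projects frozen vision-encoder features into the LLM's embedding space."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)  # -> (batch, num_patches, llm_dim)

# Only the adapter is trained; both pretrained models stay frozen, so
# "combining" is far cheaper than training a new multimodal model from scratch.
adapter = VisionToLLMAdapter()
fake_patches = torch.randn(1, 196, 768)   # stand-in for a CLIP ViT's patch features
llm_tokens = adapter(fake_patches)        # prepend these to the LLM's text embeddings
print(llm_tokens.shape)                   # torch.Size([1, 196, 4096])
```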

AI tech stack half-life and AI writing infrastructure

These posts suggest that AI will handle interop by dealing with fuzzy mappings between different infrastructures, mappings that rule-based deployments can't handle given the wide variety of possibilities. As a non-dev, I am intrigued because constraints on UX and features imposed by infrastructure (or just tech debt) may become easier to overcome. I am also skeptical of any claims that things of this sort will be easy, because DevOps can be really complicated.
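
To make the fuzzy-mapping idea concrete for myself: it might look like handing a model one infrastructure format and asking for another, where a rule-based converter would need a rule for every edge case. A purely speculative sketch using OpenAI's Python client (the prompt, model choice, and whole approach are my assumptions, not a tested pipeline):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

compose_snippet = """
services:
  web:
    image: nginx:1.25
    ports:
      - "8080:80"
"""

# Ask the model to do the fuzzy mapping a rule-based converter would struggle
# to cover exhaustively. Prompt and model name are illustrative assumptions.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Translate this Docker Compose service into equivalent "
                   "Kubernetes Deployment and Service manifests:\n" + compose_snippet,
    }],
)
print(response.choices[0].message.content)
```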

From Sebastian Raschka:

Two posts on fine-tuning (overview and PEFT), read in combination with this light post by IBM on prompt tuning. Lots in here, but my takeaways focus on prompt tuning (TIL) and fine-tuning (still learning):

  • Soft prompt tuning is a strategy for optimizing a model when you can't change its weights (e.g. when fine-tuning is not possible). Not sure how this works in practice (I may try to run through a tutorial sometime and write about it). There might be something here to help folks do this with less work (again, I haven't done it, so maybe that's unrealistic)
  • Fine-tuning does increase the performance of models, but the increases don't scale with the investment of time/energy. A lot can be done with methods that only change the weights of the final layers of a model or add additional layers for classification (see the sketch after this list). In terms of the decision to fine-tune: you need to decide what threshold is acceptable given the context (I have lots of questions about what performance means in these contexts, tbh)

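Here is a minimal sketch of what that final-layers approach might look like with a Hugging Face model, freezing the pretrained backbone and training only the classification head (the model choice is my own illustration, not Raschka's code):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Freeze every parameter in the pretrained backbone...
for param in model.distilbert.parameters():
    param.requires_grad = False

# ...leaving only the freshly initialized classification head trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```
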
According to Raschka, one version of fine-tuning takes a pre-trained model and adjusts all of its weights by training on a new dataset, refining (fine-tuning) the model for a new task. This type of fine-tuning requires a lot of data and a lot of compute. Other approaches are far less intensive.
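
For contrast, full fine-tuning in practice often looks something like this sketch with Hugging Face's Trainer, where every weight gets updated; that is where the data and compute costs come from (dataset and model choices are illustrative):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative choices: a small sentiment dataset and a small pretrained model.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# All weights are trainable here, which is what makes full fine-tuning
# data- and compute-hungry compared to the partial approaches below.
args = TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```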

Raschka describes fine-tuning some of the layers of a model but not all of them. These include methods like fine-tuning only a few final layers of the model (what Raschka refers to as fine-tuning I) and parameter-efficient fine-tuning (PEFT, such as soft prompting, prefix tuning, and LoRA), which use other techniques to modify the model without adjusting the pretrained weights (keeping the pretrained model frozen in full or for the most part). Some of these approaches are below (a rough sketch using the peft library follows the list):

  • Soft prompting attaches learned embeddings (soft prompts) to the input to make the model perform better on a given task, albeit embeddings that are not human-readable. (read about it here)
  • Prefix tuning attaches a “trainable tensor” to every layer of the model rather than only the input embeddings, which can improve performance over soft prompting. (read about it here)
  • LoRA adds a new, small set of low-rank weight matrices on top of a pretrained model’s frozen weights (read about it here)
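
To ground all three, here is a rough sketch of how they look via Hugging Face's peft library (the base model and hyperparameters are my own illustrative choices):

```python
from peft import (LoraConfig, PrefixTuningConfig, PromptTuningConfig,
                  TaskType, get_peft_model)
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")

# Soft prompting: learn 20 "virtual token" embeddings prepended to the input.
prompt_cfg = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# Prefix tuning: learn trainable prefix tensors injected at every layer.
prefix_cfg = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# LoRA: learn small low-rank matrices added alongside the frozen weights.
lora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                      lora_dropout=0.05)

# Wrapping the base model freezes the pretrained weights and adds only the
# small trainable pieces; swap in prompt_cfg or prefix_cfg to compare.
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of parameters
```
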
Thomas Lodato @deptofthomas