Glossary

AI terms, defined.

Plain-language definitions for the terminology used across ctx and the broader AI ecosystem.

A

Agent
A system that uses an LLM to plan, take actions, observe results, and iterate toward a goal. Agents combine a control loop with tool use, memory, and safety checks to complete multi-step tasks autonomously or with human oversight. Value chain layer: L07 Orchestration.
AI Fluency
The ability to work with AI systems effectively, efficiently, ethically, and safely. Goes beyond knowing how to prompt — includes understanding model capabilities, limitations, appropriate use cases, and responsible deployment.
AI red teaming
Adversarial testing of AI systems using autonomous AI agents that reason iteratively about novel attack surfaces — RAG poisoning, prompt extraction, vector store exfiltration — rather than matching known vulnerability signatures. Value chain layer: L10 Eval and Safety.
Alignment
The practice of making AI systems behave according to human intentions and values. Includes techniques like RLHF and constitutional AI applied during post-training.
API gateway
A proxy layer that sits between applications and AI providers. Handles routing, rate limiting, caching, and fallback logic. Value chain layer: L05 Routing.
Attention
The mechanism in transformer models that lets each token consider every other token in the input. Self-attention is what gives LLMs their ability to understand context across long sequences.
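As a sketch of the mechanism, here is scaled dot-product self-attention in plain Python. This is purely illustrative: real implementations use batched matrix multiplications across many heads on GPUs, not lists of floats.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over a single sequence.

    Q, K, V: lists of d-dimensional vectors, one per token.
    Each output is a weighted mix of all value vectors, which is
    how every token gets to consider every other token.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # similarity of this query against every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```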
Automation / Augmentation / Agency
Three modes of human-AI interaction. Automation: AI executes instructions with minimal oversight. Augmentation: human and AI collaborate, with the human making final decisions. Agency: AI works independently within configured boundaries, escalating when uncertain.
Autoregressive model
A model that generates output one token at a time, where each new token depends on all previously generated tokens. GPT, Claude, and Llama are all autoregressive.

B

Batch processing
Running multiple inference requests together to maximize GPU utilization. Continuous batching dynamically adds new requests to an active batch without waiting for all current requests to finish.
Benchmarking
Standardized evaluation of model performance on specific tasks. Common benchmarks include MMLU, HumanEval, MATH, and arena-style human preference rankings. In a post-training workflow, benchmarking confirms that a fine-tuned model improves on the target task without regressing on general capability. Value chain layers: L03 Post-Training, L10 Eval and Safety.
BERT
Bidirectional Encoder Representations from Transformers. An encoder-only model architecture used primarily for embeddings, classification, and search — not text generation.
Bias (AI)
Systematic patterns in AI outputs that unfairly favor or disadvantage certain groups, often reflecting biases present in training data, labeling decisions, or evaluation criteria.
BitNet
A class of 1-bit language models where weights are quantized to {-1, 0, 1} during training rather than post-hoc. Uses ~78% less memory than 16-bit models, enabling billion-parameter inference and LoRA fine-tuning on consumer GPUs and smartphones. Developed by Microsoft Research, with cross-platform implementations by Tether's QVAC framework. Value chain layers: L02 Pre-Training, L04 Inference.

C

Chain-of-thought
A prompting technique where the model is asked to show its reasoning step by step before giving a final answer. Improves accuracy on complex reasoning tasks.
Chunking
Splitting documents into smaller pieces for embedding and retrieval. Chunk size, overlap, and boundary strategy directly affect RAG quality. Value chain layer: L08 Context.
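A minimal character-based chunker showing the overlap mechanic. Production pipelines usually count tokens rather than characters and split on sentence or heading boundaries; this sketch only illustrates why overlap matters.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    Overlap keeps content that straddles a chunk boundary present
    in both neighboring chunks, so retrieval does not lose it.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```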
Citation grounding
Attaching specific source passages to each claim in an AI-generated response so users can verify accuracy. Turns opaque summaries into auditable, traceable answers. NotebookLM's core mechanism. See also: Grounding, RAG.
Claude
Anthropic's family of AI models (Opus, Sonnet, Haiku). Known for long context windows, strong instruction following, and tool use capabilities.
Completion
The text generated by a model in response to a prompt. Also called a response or generation.
Context engineering
The discipline of designing what information enters the context window, how it is structured, and when it is refreshed. Value chain layer: L08 Context.
Context injection
Retrieving external knowledge — documentation, search results, database records, or memory — and inserting it into a model's prompt at query time. Reduces hallucination by grounding responses in current, authoritative sources rather than relying solely on training data. RAG is the most common form. Value chain layer: L08 Context.
Context window
The maximum number of tokens a model can process in a single request, including both input and output. Ranges from 4K tokens to 2M+ tokens depending on the model.
Copilot
An AI assistant embedded in an existing workflow — typically an IDE, editor, or productivity tool. The AI suggests, the human accepts or rejects. Value chain layer: L11 Products.
CUDA
NVIDIA's parallel computing platform. The dominant software ecosystem for GPU-accelerated AI workloads. Most training and inference frameworks depend on CUDA.

D

Diffusion model
A generative model that creates images, video, or audio by starting from noise and iteratively removing it. Stable Diffusion, DALL-E 3, and Midjourney use diffusion-based architectures.
Distillation
Training a smaller model to replicate the behavior of a larger model. Produces cheaper, faster models that retain most of the teacher model's capability. Value chain layer: L03 Post-Training.
DPO (Direct Preference Optimization)
A post-training technique that aligns models with human preferences without needing a separate reward model. Simpler alternative to RLHF.

E

Edge inference
Running models on local devices (phones, laptops, edge servers) rather than cloud APIs. Offers lower latency and data privacy at the cost of model size constraints.
Embedding
A numerical vector representation of text, images, or other data. Embeddings capture semantic meaning — similar concepts have similar vectors. Used for search, RAG, and classification.
Encoder-decoder
A model architecture with two parts — an encoder that processes input and a decoder that generates output. T5 and BART are encoder-decoder models. Most modern LLMs are decoder-only.
Eval (evaluation)
Systematic testing of AI system outputs against defined criteria. Includes golden test sets, scoring rubrics, and regression testing. Value chain layer: L10 Eval and Safety.
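The Embedding entry in this section says similar concepts have similar vectors; the standard way to measure that is cosine similarity. A minimal sketch (real embedding vectors come from a model and have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors.

    1.0 means same direction (similar meaning), 0.0 unrelated,
    -1.0 opposite. Only direction matters, not magnitude.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```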

F

Few-shot learning
Providing a model with a small number of examples in the prompt to demonstrate the desired task format and behavior. No model weight changes — it is purely in-context.
Fine-tuning
Continuing the training of a pre-trained model on a specific dataset to adapt it for a particular task or domain. Value chain layer: L03 Post-Training.
Formal verification
Using mathematical proof to guarantee that code or a system satisfies its specification. Unlike testing, which samples behavior, a formal proof covers all possible inputs. Theorem provers like Lean 4 and verification-aware languages like Dafny check proofs mechanically. LLMs are changing the economics by drafting proofs that proof checkers accept or reject. Value chain layer: L10 Eval and Safety.
Foundation model
A large model trained on broad data that can be adapted to many downstream tasks. GPT-4, Claude, Llama, and Gemini are foundation models.
Function calling
A model capability where the model outputs structured JSON to invoke external functions or tools, rather than generating free-form text. The protocol side of tool use — how models express intent to act. See also: Tool use.
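As an illustration of the shape, a function-calling model emits JSON naming a tool and its arguments, and the application dispatches it. The tool name, argument schema, and dispatch code here are hypothetical, not any specific provider's API.

```python
import json

# Hypothetical tool registry: the application, not the model,
# owns the actual implementations.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# What a function-calling model emits: structured JSON naming a
# function and its arguments, instead of free-form text.
model_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'

def dispatch(raw: str) -> str:
    """Parse the model's tool call and execute the named function."""
    call = json.loads(raw)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])
```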

G

GGUF
A file format for quantized models used by llama.cpp. Stores model weights in compressed formats (Q4, Q5, Q8) that run on consumer hardware.
GPU
Graphics Processing Unit. The primary hardware for AI training and inference due to its massive parallelism. NVIDIA dominates with the H100 and B200 series. Value chain layer: L01 Compute.
GQA (Grouped-Query Attention)
An attention variant that shares key-value heads across groups of query heads, reducing KV cache size and accelerating inference with minimal quality loss. Used in Llama 3, Mistral, and most modern open-weight models. Sits between full multi-head attention (MHA) and multi-query attention (MQA) on the efficiency–quality spectrum. Value chain layer: L02 Pre-Training.
Grounding
Connecting model outputs to verifiable sources of truth. A grounded response cites specific documents, data, or evidence rather than relying solely on the model's training.
Guardrails
Runtime safety checks that validate model outputs before they reach users. Can filter harmful content, enforce format constraints, or apply business logic. Value chain layer: L10 Eval and Safety.

H

Hallucination
When a model generates confident but factually incorrect information. A fundamental challenge with autoregressive models that produce statistically plausible text regardless of truth.
Harness
Everything that is NOT the model — the runtime infrastructure that makes a single agent effective. Includes execution environments (sandboxes), state and continuity (crash recovery, session bridging), tool infrastructure (MCP, code execution), verification loops (write-test-fix), context management (compaction, progressive disclosure), and constraints (linters, structural tests). Agent = Model + Harness. The model is a commodity; the harness is the durable asset. Value chain layer: L06 Harness.
Harness engineering
The discipline of designing, building, and maintaining agent harnesses — the runtime infrastructure that turns a model API call into a functioning agent. Distinct from prompt engineering (single interaction), context engineering (context window), and agent engineering (internal agent design). Catalyzed by OpenAI's Codex team building a 1M-line codebase with 3 engineers using only agents, proving the bottleneck is infrastructure, not model intelligence. Key insight: better models amplify the need for better harnesses. Value chain layer: L06 Harness.
HITL (human-in-the-loop)
A workflow design where human judgment is required at specific decision points. Used for high-stakes actions, quality review, and safety-critical outputs.
Human oversight spectrum
The range of human involvement in AI workflows. Human-in-the-loop: human approves every significant action. Human-on-the-loop: agent executes, human monitors and can intervene. Bounded autonomy: agent acts within pre-defined boundaries, escalates edge cases. The right position depends on reversibility, stakes, and confidence. Value chain layer: L07 Orchestration.
Hyperparameter
A configuration value set before training that affects how the model learns — learning rate, batch size, number of training steps, temperature. Not learned from data.

I

In-context learning
A model's ability to perform new tasks based on examples and instructions provided in the prompt, without any weight updates. The basis for few-shot and zero-shot prompting.
Inference
The process of running a trained model to generate outputs. The primary cost driver for AI applications in production. Value chain layer: L04 Inference.
Instruction tuning
A form of fine-tuning where a model is trained on instruction-response pairs to follow human directions. What makes a base model into a chat model.

J

JSON mode
A model output setting that constrains generation to valid JSON. Supported by OpenAI, Anthropic, and most inference providers. Essential for structured tool use and API integration.

K

Knowledge cutoff
The date after which a model has no built-in knowledge, determined by the recency of its training data. Information after this date requires retrieval augmentation or tool use.
Knowledge graph
A structured representation of entities and their relationships. Used to provide factual context to models and reduce hallucination.
KV cache
Key-Value cache. During inference, the model caches computed attention keys and values so they don't need to be recomputed for each new token. Critical for fast autoregressive generation.

L

Latency
The time between sending a request and receiving the first token of the response (time to first token, TTFT) or the complete response. A key metric for user-facing AI applications.
LLM (Large Language Model)
A neural network with billions of parameters trained on text data to generate and understand language. The core technology behind modern AI assistants and tools.
Local-first
A design philosophy that keeps data and computation on local infrastructure by default, using cloud services as an option rather than a dependency.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that trains small adapter matrices instead of modifying all model weights. Drastically reduces memory and compute requirements for fine-tuning. Value chain layer: L03 Post-Training.

M

MCP (Model Context Protocol)
An open standard for connecting AI models to external tools and data sources. Defines how models discover, authenticate with, and call tools through a standardized interface. Value chain layer: L09 Integrations.
Mixture of experts (MoE)
A model architecture where only a subset of parameters is active for each input. Routes tokens to specialized "expert" sub-networks, allowing larger total model size without proportional compute cost. Mixtral and DeepSeek V3 use MoE.
MLA (Multi-Head Latent Attention)
An attention mechanism that compresses key-value representations into a low-rank latent space, dramatically reducing KV cache memory during inference. Introduced by DeepSeek-V2 and used in DeepSeek-V3. Achieves similar quality to standard multi-head attention at a fraction of the memory cost. Value chain layer: L02 Pre-Training.
Model routing
The practice of directing inference requests to different models based on task type, cost, quality requirements, or provider availability. Value chain layer: L05 Routing.
Multimodal
A model or system that processes multiple input types — text, images, audio, video. GPT-4V, Gemini, and Claude 3+ are multimodal models.
Multi-token prediction
A training objective where the model predicts several future tokens simultaneously rather than just the next one. Improves training efficiency and can accelerate inference via speculative decoding. Used in Meta's Llama training and other modern architectures. Value chain layer: L02 Pre-Training.

N

Neural network
A computational model inspired by biological neurons. Composed of layers of connected nodes that learn patterns from data through training.
NVIDIA NIM
NVIDIA's inference microservice platform for deploying optimized AI models. Packages models as containerized services with GPU acceleration.

O

Observability
Monitoring and tracing AI system behavior in production — request logs, latency metrics, cost tracking, error rates, and quality scores. Value chain layer: L10 Eval and Safety.
Ollama
An open-source tool for running LLMs locally. Provides a simple CLI and API for downloading and serving models on consumer hardware. Value chain layer: L04 Inference.
Open-source vs open-weight
Open-source AI means the full training code, data, and weights are available. Open-weight means only the trained model weights are released — you can run the model but cannot reproduce the training.
Orchestration
Coordinating multiple agents and multi-step workflows — delegation, handoffs, human oversight gates, and structured pipelines. Distinct from harness engineering (L06), which makes a single agent effective. Orchestration assumes capable agents exist and focuses on how they work together. Value chain layer: L07 Orchestration. See also: Harness.
Overfitting
When a model memorizes training data patterns too specifically and performs poorly on new inputs. A common risk during fine-tuning, mitigated by regularization and validation.

P

Parameter
A single learnable value in a neural network (a weight or bias). Model size is measured in parameters — GPT-4 has an estimated 1.76 trillion parameters.
Post-training
All training done after pre-training — SFT, RLHF, DPO, distillation. Shapes the base model into an instruction-following, safe, useful assistant. Value chain layer: L03 Post-Training.
Pre-training
The initial phase of training where a model learns from massive datasets. Produces a base model with broad knowledge but no instruction-following ability. Value chain layer: L02 Pre-Training.
Product layer
The top of the value chain where AI capabilities become user-facing applications — chatbots, copilots, agents, search, creative tools. Value chain layer: L11 Products.
Prompt caching
Reusing computed KV states for repeated prefixes so the model skips redundant computation on the shared portion of a prompt. Supported natively by Anthropic, OpenAI, and inference gateways like Cloudflare AI Gateway and Portkey. Value chain layer: L04 Inference.
Prompt engineering
The practice of crafting inputs to get desired outputs from a model. Includes system prompts, few-shot examples, chain-of-thought instructions, and structured output formatting.
Prompt injection
An attack that manipulates model behavior by inserting adversarial instructions through user input, retrieved context, or tool outputs. Direct injection targets the prompt itself; indirect injection hides instructions in data the model retrieves. A primary security concern for any AI system that processes untrusted input. See also: AI red teaming.
Pruning
Removing unnecessary parameters from a trained model to reduce size and increase inference speed while preserving most of the model's capability.

Q

QLoRA
Quantized LoRA — fine-tuning a quantized model using LoRA adapters. Enables fine-tuning of large models on a single consumer GPU. Value chain layer: L03 Post-Training.
Quantization
Reducing the precision of model weights (e.g., from 16-bit to 4-bit) to decrease memory usage and increase inference speed. Enables large models to run on consumer hardware with minimal quality loss.
Query (in attention)
In the transformer attention mechanism, the query is the representation of the current token looking for relevant information across all other tokens (keys) in the sequence.
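The Quantization entry in this section can be made concrete with a minimal symmetric quantizer that maps floats to low-precision integers and back. Real quantizers work per-channel or per-block and handle outliers; this sketch shows only the core scale-and-round step.

```python
def quantize(weights, bits=8):
    """Symmetric quantization: map floats to signed integers.

    Returns (ints, scale); dequantize by multiplying back.
    Lower bit widths shrink memory at the cost of rounding error.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(ints, scale):
    """Recover approximate float weights from the integers."""
    return [i * scale for i in ints]
```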

R

RAG (Retrieval-Augmented Generation)
A technique that retrieves relevant documents from an external knowledge base and includes them in the model's context before generation. Reduces hallucination and provides up-to-date information. Value chain layer: L08 Context.
RAG poisoning
An attack that inserts adversarial content into a knowledge base so the model retrieves and trusts it during generation. Effective because RAG pipelines treat retrieved documents as authoritative context. Defenses include input validation, provenance tracking, and retrieval-time anomaly detection. See also: RAG, Prompt injection.
Reasoning model
AI models designed to think step-by-step through complex problems, showing improved logical, mathematical, and scientific reasoning. Examples include OpenAI o1/o3 and DeepSeek-R1.
Re-ranking
A second-stage retrieval step that scores and reorders search results by relevance before they enter the model's context. Improves RAG quality by filtering out low-relevance chunks.
RLHF (Reinforcement Learning from Human Feedback)
A post-training technique where human preference judgments train a reward model, which then guides the LLM toward preferred outputs. Value chain layer: L03 Post-Training.
RoPE (Rotary Position Embedding)
A positional encoding method that encodes absolute position using rotation matrices, enabling the model to learn relative position relationships. The dominant positional encoding in modern LLMs (Llama, Mistral, Qwen). Extended by YaRN to support longer context windows without retraining. Value chain layer: L02 Pre-Training.
Routing
See Model routing.
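A toy end-to-end sketch of the retrieval step in RAG: embed the query, rank documents by similarity, and assemble the retrieved context into a prompt. The letter-frequency `embed` stands in for a real embedding model purely to keep the example runnable.

```python
import math

def embed(text):
    """Toy embedding: a 26-dim letter-frequency vector. A real
    system would call an embedding model here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def retrieve(query, docs, k=2):
    """Rank documents by similarity to the query, return top k."""
    qv = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(qv, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Insert retrieved context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```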

S

Safety layer
The combination of guardrails, content filters, and validation checks that prevent harmful or incorrect model outputs from reaching users. Value chain layer: L10 Eval and Safety.
Scaling law
Empirical relationships between model size, training data volume, compute budget, and model performance. Larger models trained on more data generally perform better, with predictable returns.
Semantic caching
Caching model responses by meaning similarity rather than exact string match. When a new prompt is semantically close to a previously cached one, the cached response is returned without running inference. Reduces cost and latency for repeated or near-duplicate queries. Value chain layer: L04 Inference.
Serving
Running models in production to handle user requests. Includes the infrastructure for loading models, managing GPU memory, batching requests, and scaling with demand. Value chain layer: L04 Inference.
Sliding-window attention
An attention pattern that restricts each token to attending only within a fixed window of nearby tokens in lower layers, while upper layers retain full attention. Reduces compute and memory for long sequences. Used in Mistral and other efficiency-focused architectures. Value chain layer: L02 Pre-Training.
Structured output
Model outputs constrained to a specific format — JSON, XML, typed schemas. Enables reliable parsing and integration with downstream systems.
System prompt
Instructions provided to a model at the start of a conversation that define its behavior, constraints, and persona. Persists across all user messages in the session.
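The semantic caching entry in this section can be sketched as a nearest-embedding lookup with a similarity threshold. The embeddings would come from a real embedding model; plain vectors are used here so the sketch runs standalone.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache hit = some stored prompt embedding is within a
    similarity threshold of the new one, not an exact string match."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # (embedding, cached response)

    def get(self, embedding):
        """Return the closest cached response, or None on a miss."""
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = cosine(embedding, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```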

T

Temperature
A parameter that controls randomness in model output. Low temperature (0-0.3) produces deterministic, focused responses. High temperature (0.7-1.0) produces more creative, varied output.
Token
The basic unit of text that models process. A token is roughly 3-4 characters in English. Models have limits on total tokens (context window) and are priced per token.
Token optimization
Umbrella term for techniques that reduce token usage without sacrificing output quality. Includes prompt caching, semantic caching, prompt compression, structured output constraints, and KV cache management. Value chain layer: L04 Inference.
Tokenizer
The algorithm that converts text into tokens and back. Different model families use different tokenizers (BPE, SentencePiece). The tokenizer determines what the model "sees."
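The Temperature entry above describes the effect; mechanically, the sampler divides the logits by the temperature before the softmax. A minimal sketch (real samplers treat temperature 0 as greedy argmax, which would divide by zero here):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax over next-token logits.

    Low temperature sharpens the distribution toward the top
    token; high temperature flattens it toward uniform.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```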
Tool use
The ability of an AI model to call external tools — APIs, databases, code interpreters, file systems, web browsers — to take actions beyond text generation. Function calling is the protocol; tool use is the broader capability. Effective tool use requires the model to decide which tool to call, construct valid arguments, interpret results, and decide whether to call another tool or return a final answer. The protocol layer lives in L09 Integrations (MCP, function calling specs); the tool infrastructure and execution layer lives in L06 Harness (tool registries, sandboxed execution); the coordination layer lives in L07 Orchestration (agent loops, tool selection, error handling). Value chain layers: L06 Harness, L07 Orchestration, L09 Integrations.
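A minimal sketch of the decide-call-interpret loop described above. `fake_model` is a stand-in for a real LLM API call, and the tool registry is hypothetical; the point is the control flow, not the model.

```python
def fake_model(messages):
    """Stand-in for an LLM API call. Emits one tool call, then a
    final answer once a tool result appears in the transcript."""
    if any(m["role"] == "tool" for m in messages):
        return {"type": "final", "content": "It is 7 degrees in Oslo."}
    return {"type": "tool_call", "name": "get_temperature",
            "arguments": {"city": "Oslo"}}

# Hypothetical tool registry owned by the harness.
TOOLS = {"get_temperature": lambda city: f"{city}: 7 C"}

def agent_loop(user_prompt, max_steps=5):
    """Core tool-use loop: call the model, execute the requested
    tool, append the result, repeat until a final answer (or cap)."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        if reply["type"] == "final":
            return reply["content"]
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": result})
    return "step limit reached"
```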
TPU
Tensor Processing Unit. Google's custom AI accelerator chip, designed for both training and inference. Used internally at Google and available via Google Cloud. Value chain layer: L01 Compute.
Transformer
The neural network architecture behind all modern LLMs. Uses self-attention mechanisms to process sequences in parallel. Introduced in the 2017 paper "Attention Is All You Need."

V

Validation gate
A checkpoint in a workflow where outputs must pass automated checks before proceeding. Format validation, safety filtering, and quality scoring are common validation gates.
Vector database
A database optimized for storing and searching high-dimensional vectors (embeddings). Used in RAG pipelines for semantic search. Pinecone, Weaviate, Chroma, and Qdrant are vector databases. Value chain layer: L08 Context.
vLLM
An open-source high-throughput LLM serving engine that uses PagedAttention for efficient memory management. A leading choice for self-hosted inference. Value chain layer: L04 Inference.

W

Weight
A single numerical value in a neural network that is learned during training. The collection of all weights defines what a model knows and how it behaves.
Window (context)
See Context window.
Workflow
A structured sequence of steps — planning, execution, review, approval — that coordinates human and AI actions toward a goal. Value chain layer: L07 Orchestration.

Z

Zero-shot learning
Asking a model to perform a task without any examples — relying entirely on the model's pre-trained knowledge and the instruction itself.