Academy

Open curriculum for teams building and scaling with AI. Practical, portable, and built to be reused.

AI in production is not a research project — it is a systems problem. This module builds the baseline: what models exist, what they are good at, where they fail, and how to think about AI as infrastructure rather than magic. Teams leave with a shared vocabulary and a clear picture of the landscape they are operating in.

  • Model categories: completion models, chat models, embedding models, image models, audio models, and multimodal models. What each category does, what it costs, and where it fits in a production stack.
  • Context window mechanics: how tokens work, what fits in a context window, how truncation happens, and why token budgeting is a core engineering skill — not an afterthought.
  • Local-first vs cloud-first: when to keep data and computation local, when to use cloud endpoints, and how to make that decision based on sensitivity, latency, cost, and reliability requirements.
  • Tooling survey: IDEs (VS Code, Cursor, Windsurf), terminals (iTerm2, Warp), orchestration frameworks (LangChain, LlamaIndex, custom), and agent runners (Claude Code, Codex, OpenHands). What each tool does and when to use it.
  • Production vs prototype: the gap between a demo that works and a system that works reliably at scale. Error rates, latency budgets, cost projections, and failure mode analysis.
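Token budgeting from the list above can be sketched in a few lines. This is a rough illustration, not a production tokenizer: the four-characters-per-token heuristic is an approximation, and real systems should count tokens with the provider's own tokenizer.

```python
# Token budgeting sketch. The 4-chars-per-token rule is a rough
# heuristic for English prose, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fit_to_budget(chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in priority order until the budget is exhausted.
    Later chunks are dropped whole rather than truncated mid-sentence."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Dropping whole chunks rather than truncating is deliberate: a half-cut document in context is a common source of confident nonsense.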

Deliverables: AI foundations decision document, glossary of terms the team will use consistently, and baseline prompt templates for common tasks.

Most teams treat prompts as informal text. This module moves prompts from ad-hoc instructions to versioned contracts with predictable behavior. The goal is to reduce variance in outputs, standardize formats, and build prompt libraries that teams can share, test, and improve over time.

  • Prompt design patterns: system prompts, user prompts, few-shot examples, chain-of-thought reasoning, and structured output instructions. When to use each pattern and how to combine them for complex tasks.
  • Template versioning: prompts are code. They get version-controlled, tested against golden sets, and reviewed before deployment. A prompt change is a behavior change — treat it with the same rigor as a code change.
  • Structured output formats: JSON schemas, XML structures, typed responses, and validation rules that ensure model outputs are machine-parseable and contract-compliant.
  • Context hygiene: what to include in context, what to omit, when to reset. Stale context produces stale outputs. Redundant context wastes tokens.
  • 4D fluency framework: Delegation (what to ask), Description (how to ask), Discernment (how to evaluate), and Diligence (how to improve).
  • Acceptance checks: every prompt has pass/fail criteria. Measure, iterate, and set minimum quality bars.
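A versioned prompt template with a pass/fail acceptance check might look like the sketch below. The template text, key names, and priority values are illustrative assumptions, not a standard format.

```python
import json

# Minimal sketch of a prompt as a versioned contract. The template
# fields and the contracted JSON keys are illustrative examples.

TEMPLATE = {
    "id": "summarize-ticket",
    "version": "1.2.0",
    "system": "You are a support analyst. Reply with JSON only.",
    "user": "Summarize this ticket in JSON with keys 'summary' and 'priority':\n{ticket}",
}

def render(template: dict, **fields) -> list[dict]:
    """Render the template into chat messages."""
    return [
        {"role": "system", "content": template["system"]},
        {"role": "user", "content": template["user"].format(**fields)},
    ]

def acceptance_check(output: str) -> bool:
    """Pass/fail: output must be valid JSON with exactly the contracted keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return set(data) == {"summary", "priority"} and data["priority"] in ("low", "med", "high")
```

Because the template carries a version, a change to its text bumps the version and re-runs the golden set before deployment, the same as any code change.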

Deliverables: context contracts document, test prompt suite with acceptance checks, and prompt template library.

Breaking work into reliable agent tasks is the core skill of AI-native teams. This module teaches task decomposition that works in production — not toy examples from blog posts. The focus is on boundaries: where planning ends and execution begins, where machine autonomy stops and human judgment takes over.

  • Task decomposition: take a complex workflow and break it into atomic agent tasks. Each task has one owner, clear inputs, defined outputs, and acceptance criteria.
  • Planning, execution, and review boundaries: separate roles with separate context — never combined.
  • Delegation rules: what can be delegated to an agent, what stays with a human. Risk level, reversibility, and cost of error are the deciding factors.
  • Multi-agent coordination: patterns for running agents in parallel, sequencing agent chains, and handling conflicts.
  • Validation gates: automated checks between workflow stages. Format validation, content verification, safety filters, and business logic checks.
  • Escalation contracts: when an agent cannot complete a task, it escalates with full context — what it tried, why it failed, and what it recommends.
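The task and escalation shapes above can be made concrete as data structures. The field names below are an illustrative sketch, not a standard schema.

```python
from dataclasses import dataclass

# Sketch of an atomic agent task and its escalation record.
# Field names are illustrative, not a fixed standard.

@dataclass
class AgentTask:
    owner: str               # exactly one owner, human or agent
    inputs: dict             # everything the task needs, nothing more
    expected_output: str     # what "done" looks like
    acceptance: list[str]    # pass/fail criteria checked after execution

@dataclass
class Escalation:
    task: AgentTask
    attempts: list[str]      # what the agent tried
    failure_reason: str      # why it failed
    recommendation: str      # what the agent suggests next

def escalate(task: AgentTask, attempts: list[str], reason: str, rec: str) -> Escalation:
    """Package full context for the human reviewer: tried, failed, recommended."""
    return Escalation(task=task, attempts=attempts, failure_reason=reason, recommendation=rec)
```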

Deliverables: workflow map for one high-value use case, handoff checklist, and validation gate definitions.

When control and sensitivity matter, architecture choices get real. This module covers the full spectrum from on-device models to hybrid cloud-local deployments. The goal is to make deployment decisions based on data classification, latency needs, and compliance requirements rather than defaults.

  • Security boundaries: classify data into tiers (public, internal, confidential, regulated) and map each tier to infrastructure that matches its policy.
  • Deployment options: on-device, local server, private cloud, and hybrid. Each option has trade-offs in cost, latency, capability, and operational complexity.
  • Private model integration: running local models via llama.cpp, vLLM, or Ollama for tasks that cannot leave controlled infrastructure.
  • Fallback patterns: when a local model cannot handle a task, how to route it to a cloud model with appropriate redaction.
  • Redaction pipelines: automated removal of sensitive information before context crosses trust boundaries.
  • Compliance boundaries: data residency, retention policies, audit trails, and access controls.
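Tier-based routing with redaction at the trust boundary can be sketched as below. The tier-to-target map, the single email pattern, and the replacement token are illustrative assumptions; a real redaction pipeline covers many more identifier types.

```python
import re

# Sketch: route by data tier, redact anything that crosses the trust
# boundary to the cloud. Tiers and patterns here are illustrative.

TIER_ROUTES = {
    "public": "cloud",
    "internal": "cloud",
    "confidential": "local",
    "regulated": "local",
}

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    """Remove obvious identifiers before text leaves controlled infra."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)

def route(tier: str, text: str) -> tuple[str, str]:
    """Pick a deployment target by data tier; redact anything cloud-bound."""
    target = TIER_ROUTES[tier]
    return (target, redact(text) if target == "cloud" else text)
```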

Deliverables: local architecture decision document, deployment option matrix, and security posture decision table.

Context is what the system knows. When context is wrong, every output is wrong — confidently. This module builds the discipline of treating context as a product with its own lifecycle, versioning, and quality standards.

  • Context capsules: scoped packages of state — per domain, per role, per session. Each capsule defines what it contains, what it excludes, how long it lives.
  • Volatile vs durable state: conversation history is volatile. Decisions and architecture docs are durable. Mixing these is the most common context bug.
  • Versioning and rollback: every context mutation is tracked. When outputs degrade, trace back and recover.
  • Cross-project continuity: lessons and frameworks transfer between projects. Project-specific noise does not.
  • State lifecycle: creation, mutation, archival, and deletion. Every piece of context has a defined lifecycle.
  • Context auditing: regular review of what is in the context window. What contributes to output quality? What is noise?

Deliverables: context index, decision history page structure, and context capsule templates.

Retrieval-augmented generation turns internal documents, code, emails, and tickets into queryable knowledge. The challenge is keeping retrieval quality high, auditable, and deterministic in production.

  • Embedding pipelines: converting documents and code into vector representations. Chunking strategies and embedding model selection.
  • Vector store selection: Pinecone, Weaviate, Chroma, pgvector, and local options. Performance, hosting, and cost trade-offs.
  • Retrieval strategies: keyword search, semantic search, hybrid search, and re-ranking. When to use each approach.
  • Quality gates: every retrieved chunk gets a relevance score. Low-relevance chunks are filtered out before entering the model's context.
  • Knowledge base maintenance: update cycles that keep the index fresh without full rebuilds.
  • Audit trails: every retrieval query is logged with results, scores, and usage. The foundation for debugging and improvement.
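The quality gate and audit trail combine naturally into one function. The 0.75 threshold and the audit-record fields below are illustrative assumptions; scores are whatever the retriever produces.

```python
# Sketch of a relevance quality gate over retrieved chunks, with an
# audit log entry per query. The 0.75 default threshold is illustrative.

AUDIT_LOG: list[dict] = []

def quality_gate(query: str, results: list[tuple[str, float]],
                 threshold: float = 0.75) -> list[str]:
    """Filter low-relevance results before they enter model context,
    and log the query, scores, and what survived."""
    kept = [chunk for chunk, score in results if score >= threshold]
    AUDIT_LOG.append({
        "query": query,
        "scores": [score for _, score in results],
        "kept": len(kept),
    })
    return kept
```

Logging the scores of everything, including what was filtered, is what makes retrieval debuggable later.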

Deliverables: retrieval pipeline specification, query quality checks, and embedding strategy document.

Every AI system fails. The question is whether failures are designed for or discovered in production. This module builds safety infrastructure: failure taxonomies, validation chains, escalation paths, and human-approval gates.

  • Failure taxonomy: silent failures, hallucinations, drift, and cascading failures. Each type requires different detection and response.
  • Validation chains: multi-step verification — format, accuracy, safety, business logic. Validation is layered, not single-point.
  • Human-in-the-loop escalation: defined triggers for when a task must be reviewed by a human before proceeding.
  • Incident response: detect, contain, communicate, fix, and prevent recurrence. Same discipline as software incident response.
  • Rollback policies: every automated action is reversible until explicitly committed.
  • Safety boundaries: content that should never be generated, actions that should never be automated. Hard limits, not suggestions.

Deliverables: safety matrix, rollback policy document, and incident response playbook.

If you cannot measure it, you cannot improve it. This module builds evaluation and observability infrastructure that turns AI operations from guesswork into engineering.

  • Golden test sets: curated input-output pairs that represent expected behavior. Every change is evaluated against golden sets before deployment.
  • Scoring rubrics: multi-dimensional evaluation across accuracy, relevance, safety, format compliance, and cost efficiency.
  • Dashboard design: real-time visibility into request volume, error rates, latency, cost per task, and quality scores.
  • Regression detection: automated comparison of current performance against historical baselines. Alert before users notice.
  • A/B evaluation: side-by-side comparison of models, prompts, and routing rules against production configuration.
  • Cost tracking: per-task, per-team, per-model cost attribution. Teams see exactly what they spend and where.
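Golden-set scoring and regression alerting reduce to two small functions. The exact-match scorer and the two-point alert margin below are illustrative choices; production rubrics are usually multi-dimensional, as the list above notes.

```python
# Sketch of golden-set evaluation plus regression detection against a
# historical baseline. Exact-match scoring and the 2-point margin are
# illustrative simplifications.

def evaluate(model_fn, golden_set: list[tuple[str, str]]) -> float:
    """Score a model against curated input/output pairs (0-100)."""
    hits = sum(1 for prompt, expected in golden_set if model_fn(prompt) == expected)
    return 100.0 * hits / len(golden_set)

def regressed(current: float, baseline: float, margin: float = 2.0) -> bool:
    """Alert when the current score falls more than `margin` points below baseline."""
    return current < baseline - margin
```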

Deliverables: evaluation set, quality dashboard specification, and regression alerting configuration.

AI systems connect to codebases, CRMs, ticketing systems, communication platforms, databases, and file systems. This module builds the connector architecture with standardized patterns and audit trails.

  • Connector design: standardized patterns for API, database, and file system connectors. Defined schema, error handling, rate limiting, retry logic.
  • Sync contracts: explicit rules for frequency, conflict resolution, and idempotency. No silent data loss, no duplicate processing.
  • Audit trails: every data movement is logged with source, destination, timestamp, and outcome.
  • Adapter patterns: isolate source system changes so the rest of the system does not need to know when APIs update.
  • Legacy system integration: wrapping older systems behind modern interfaces without requiring the legacy system to change.
  • Data quality enforcement: validation rules at ingestion boundaries. Bad data in means bad outputs out — catch it at the boundary.
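The retry logic a connector needs can be sketched as a bounded exponential backoff wrapper. The `TransientError` class and the delay values are illustrative assumptions; the key property is that exhausted retries re-raise rather than fail silently.

```python
import time

# Sketch of a connector call wrapper with bounded retries and
# exponential backoff. The error class and delays are illustrative.

class TransientError(Exception):
    """Raised by connectors for retryable failures (timeouts, rate limits)."""

def call_with_retry(fn, retries: int = 3, base_delay: float = 0.5):
    """Retry on transient errors with exponential backoff; re-raise once
    retries are exhausted so failures are never silent."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```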

Deliverables: connector map, sync contract document, and adapter specification for priority integrations.

Technology without a team operating model is shelf-ware. This module defines how teams adopt, operate, and improve AI systems over time.

  • Role-based adoption paths: engineering, operations, leadership, sales, and support each have different entry points and learning priorities.
  • Capability ladders: defined progression from basic prompt use to advanced workflow design. Skill-based checkpoints, not time-based milestones.
  • Handoff standards: when work passes between human and AI, the handoff includes context, status, decision history, and next steps.
  • Ownership cadence: weekly cycles — who reviews outputs, who ships, who escalates, who updates context.
  • Shared context documents: living documents that keep the team aligned on decisions and priorities.
  • Measurement and feedback: regular reviews of AI system performance against team goals.

Deliverables: role matrix, runbook v1 with ownership cadence, and capability ladder.

Every engagement produces artifacts that can help other teams. This module builds the pipeline from internal work product to public, reusable learning material while keeping proprietary information isolated.

  • Skill authoring: turning internal processes into reusable skill definitions with clear triggers, inputs, steps, and outputs.
  • Template design: creating prompt, workflow, and evaluation templates that work across projects.
  • Redaction pipelines: automated separation of public and private content. The boundary is explicit and enforced.
  • Open academy contribution: published artifacts follow the open academy format — module structure, skill companion, and validation checklist.
  • Quality review: published artifacts go through the same validation gates as production outputs.
  • Maintenance commitments: published artifacts have owners who keep them current.

Deliverables: one public skill, one public module contribution, and a de-risked template pack.

Building the system is the first quarter. Keeping it running and improving it over time is every quarter after that. This module creates the patterns for sustainable operation.

  • Monitoring loops: automated checks on output quality, cost, latency, and error rates. Alert on anomalies, not user complaints.
  • Upgrade cadence: scheduled evaluation of new model versions, tool updates, and prompt improvements. Test in staging, evaluate, deploy with rollback.
  • Context drift detection: flag context that has not been reviewed recently. Trigger maintenance cycles.
  • Operational playbooks: documented procedures for model regression, cost spikes, quality drops, and new use case adoption.
  • Quarterly reviews: structured assessment of performance, capability growth, and alignment with goals.
  • Scale planning: patterns for expanding to new teams and use cases. Assess, pilot, measure, expand.
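Context drift detection in its simplest form is an age check against a review policy. The 30-day window and the record fields below are illustrative defaults.

```python
from datetime import datetime, timedelta, timezone

# Sketch of a context drift check: flag any context document whose last
# review is older than the policy window. 30 days is an illustrative default.

def drift_report(docs: list[dict], max_age_days: int = 30) -> list[str]:
    """Return IDs of context docs overdue for review."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [d["id"] for d in docs if d["last_reviewed"] < cutoff]
```

A report like this run on a schedule is what triggers the maintenance cycle, rather than waiting for output quality to degrade.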

Deliverables: scale plan, context drift policy, and operational playbook for top three scenarios.

The final module maps the entire provider ecosystem end-to-end. Not just which models to use, but how to connect providers, routing layers, terminals, coding agents, and observability tooling into a coherent stack.

  • Provider comparison: Claude, GPT, Gemini, Grok, Llama, Mistral, Qwen evaluated on capabilities, pricing, rate limits, and API stability.
  • Routing policy implementation: OpenRouter, LiteLLM, or custom routing layers that enforce task-type routing, cost limits, and fallback chains.
  • Agent integration: Claude Code, Codex CLI, Cursor, Windsurf wired into the workflow system with defined permissions and tool capabilities.
  • Multimodal coverage: image generation, vision analysis, audio transcription, and document processing across providers.
  • Observability implementation: logs, traces, metrics, and cost tracking wired into dashboards.
  • Migration planning: when providers change terms or deprecate models, the stack has a migration path. Provider lock-in is minimized through abstraction.

Deliverables: full provider catalog, terminal and agent command map by role, and multimodal fallback playbook.