
LLM Application Architecture: From Prompts to Engineering Systems

Traces how LLM apps evolve from basic prompts to vector search, caching and orchestration layers, and shows how to build a local stack with Ollama and LangChain.

2024-02-01 · ~5 min read
#node/agentic #LLM #Prompt #RAG

Over the past year, AIGC (AI-generated content) has moved from conceptual hype to engineering hype. Base models keep getting stronger, and higher-level frameworks keep lowering the barrier to building applications.

I started building LLM apps systematically via tools like Ollama: first get a minimal chain working locally, then gradually add engineering capabilities.

How LLM Architectures Evolved

Early on, context length was tiny. The only usable pattern was:

  • Give the model a short prompt
  • Maybe add some inline context
  • Let it "continue the text"

This simple setup already solved a bunch of tasks.
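The early "prompt plus inline context, then let the model continue" pattern comes down to string assembly. Below is a minimal sketch. The function name and the `Context:` / `Question:` / `Answer:` layout are illustrative assumptions, not a fixed format.

```python
from typing import Optional


def build_completion_prompt(question: str, context: Optional[str] = None) -> str:
    """Assemble a completion-style prompt: optional inline context, then the
    question, ending with an open "Answer:" slot for the model to continue."""
    parts = []
    if context:
        # Inline context goes first so the model reads it before the question.
        parts.append(f"Context:\n{context.strip()}")
    parts.append(f"Question: {question.strip()}")
    # Leaving the answer open lets a completion model "continue the text".
    parts.append("Answer:")
    return "\n\n".join(parts)
```

For example, `build_completion_prompt("Which port?", "The service listens on 8080.")` yields a three-part prompt that ends in `Answer:`. The whole approach is limited by how much context fits in the window.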


But the model’s knowledge is bounded by its training data. Ask about recent events and it may hallucinate. You can inject extra text into the context, but that:

  • Doesn’t scale to large docs (PDF/Word, etc.)
  • Is brittle and expensive in tokens

The next step was vectorization: split supplemental documents into chunks, convert the chunks into embeddings, retrieve the chunks most relevant to each query, and attach them to the prompt.
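The chunk, embed, retrieve, attach loop can be sketched with the standard library alone. In a real system, the embeddings would come from an embedding model and be stored in a vector database. Here, a lowercase bag-of-words count vector stands in for an embedding so the example runs anywhere. All names (`chunk_text`, `embed`, `retrieve`, `build_rag_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Naive chunker: split text into overlapping windows of words."""
    assert 0 <= overlap < max_words, "overlap must be smaller than the window"
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a lowercase bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_rag_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Attach only the retrieved chunks to the prompt, not the whole corpus."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The design point is that prompt size now depends on `k` and the chunk size, not on the size of the source documents.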


As usage grows (bigger context, more users, higher QPS), we add databases and caches:

  • Vector DB stores embeddings and supports similarity search
    • Even with a limited context window, you can still feed in the most relevant content
  • Cache stores previous questions and answers
    • Reduces model calls for similar queries
    • Provides conversational history to avoid "amnesia"
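Below is a minimal sketch of the cache layer described above. Questions are looked up by similarity, so near-duplicates skip the model call, and a short rolling history is prepended to new prompts. Several parts are assumptions. `difflib`'s character-level ratio stands in for embedding similarity. The `call_model` callable represents any client for the model tier. The `CachedChat` class, its 0.9 threshold, and its 5-turn history are hypothetical choices.

```python
import difflib
import re
from collections import deque
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivially different questions match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


class CachedChat:
    """Hypothetical wrapper: similarity-keyed answer cache plus short conversation history."""

    def __init__(self, call_model: Callable[[str], str], threshold: float = 0.9, max_turns: int = 5):
        self.call_model = call_model        # assumed: prompt string in, answer string out
        self.threshold = threshold          # minimum similarity for a cache hit
        self.cache: dict[str, str] = {}     # normalized question -> answer
        self.history: deque = deque(maxlen=max_turns)  # recent (question, answer) pairs
        self.model_calls = 0

    def _lookup(self, key: str) -> Optional[str]:
        # Linear scan over cached questions; a real system would query a vector index.
        best, best_score = None, 0.0
        for cached_q, answer in self.cache.items():
            score = difflib.SequenceMatcher(None, key, cached_q).ratio()
            if score > best_score:
                best, best_score = answer, score
        return best if best_score >= self.threshold else None

    def ask(self, question: str) -> str:
        key = normalize(question)
        answer = self._lookup(key)
        if answer is None:
            # Cache miss: include recent turns so the model does not "forget" the conversation.
            turns = [f"User: {q}\nAssistant: {a}" for q, a in self.history]
            turns.append(f"User: {question}\nAssistant:")
            answer = self.call_model("\n".join(turns))
            self.model_calls += 1
            self.cache[key] = answer
        self.history.append((question, answer))
        return answer
```

Asking `"What is RAG?"` and then `"what is rag"` calls the model only once. The threshold trades hit rate against the risk of returning an answer to a question that merely looks similar.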


Developing and Deploying LLM Apps


Reference: The architecture of today’s LLM applications

From a traditional perspective, an LLM app is still:

  • UI
  • Server
  • DB

The difference is that the model runtime brings heavy dependencies of its own (model weights, GPUs, inference runtimes), so it is often deployed as a separate tier.
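A sketch of how the classic tiers relate once the model becomes its own tier follows. The server depends only on a small interface, so the runtime behind it (a local Ollama instance, a hosted API, or a test stub) can be swapped without touching business logic. `ModelTier`, `Server`, and `EchoModel` are hypothetical names, and a plain dict stands in for the DB tier. The UI tier would be whatever calls `Server.handle`.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ModelTier(Protocol):
    """Anything that turns a prompt into text: a local runtime, a remote API, or a stub."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class Server:
    """Server tier: owns the request logic and talks to the DB tier and the model tier."""

    model: ModelTier
    db: dict = field(default_factory=dict)  # stand-in for the DB tier: user_id -> turns

    def handle(self, user_id: str, question: str) -> str:
        answer = self.model.generate(f"Question: {question}\nAnswer:")
        self.db.setdefault(user_id, []).append((question, answer))
        return answer


class EchoModel:
    """Stub model tier for local testing: echoes the question back."""

    def generate(self, prompt: str) -> str:
        return prompt.splitlines()[0].removeprefix("Question: ")
```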

Tools like Ollama smooth this out by providing:

  • A local model registry and runtime
  • A simple CLI and HTTP API over many models
```sh
Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command
```
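Besides the CLI, Ollama serves an HTTP API, by default on `localhost:11434`. The following sketch makes a non-streaming call to its `/api/generate` endpoint using only the Python standard library. It assumes the server is running (`ollama serve`) and that a model has already been pulled. `llama2` is only an example model name, and the helper function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build a non-streaming /api/generate request for the given model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama2", timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON `response` field."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With a local server up, `generate("Why is the sky blue?")` returns the model's answer as a single string. Without `"stream": False`, Ollama streams a sequence of JSON objects instead, which suits chat UIs but needs incremental parsing.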

With Ollama in place, you don’t need to stand up your own model server. Just call Ollama and focus on orchestration, where LangChain is still the de facto first choice.
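Orchestration frameworks such as LangChain compose a prompt template, a model, and an output parser into one chain. LangChain's expression language writes this as `prompt | llm | parser`. The sketch below is not LangChain's API. It reimplements the composition idea in a few lines of plain Python, with a hypothetical `Step` class and a fake model step where a real client, such as an HTTP call to Ollama, would go.

```python
from typing import Any, Callable


class Step:
    """Minimal composable step: `a | b` runs a, then feeds its output to b."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, value: Any) -> Any:
        return self.fn(value)


# Prompt template: dict of variables -> prompt string.
prompt = Step(lambda v: f"Summarize in {v['n']} words:\n{v['text']}")

# Hypothetical model step; replace with a real model client.
fake_model = Step(lambda p: "  Local models via Ollama.  ")

# Output parser: clean up the raw model text.
parser = Step(lambda s: s.strip())

chain = prompt | fake_model | parser
```

`chain.invoke({"n": 5, "text": "..."})` runs all three steps in order. Keeping each stage a separate step is what makes it easy to swap models or parsers without touching the rest of the chain.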


Building LLM-Powered Web Apps with Client-Side Technology

If you already know how to build a 3-tier web app, you’re 80% of the way there. The remaining 20% is about:

  • Picking the right model(s)
  • Designing prompts and chains
  • Managing context, latency and cost
  • Observability for prompts and responses

Everything else is just engineering.
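For the context and observability items in the list above, a minimal sketch has two parts. It trims retrieved chunks to a token budget, and it wraps each model call to log prompt size, response size, and latency. The token count here is a crude whitespace-word estimate, not a real tokenizer, and every function name is hypothetical.

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")


def rough_tokens(text: str) -> int:
    """Crude token estimate (whitespace words); real tokenizers count differently."""
    return len(text.split())


def fit_to_budget(chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in the given (priority) order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = rough_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


def observed(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model call to log approximate prompt/response sizes and latency."""
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = call_model(prompt)
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("prompt_tokens~%d response_tokens~%d latency_ms=%.1f",
                 rough_tokens(prompt), rough_tokens(response), elapsed_ms)
        return response
    return wrapper
```

Logging sizes and latency per call gives a first view of cost and performance. A fuller setup would also record the prompts and responses themselves.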