Running LLMs Locally With Ollama: A Developer's Practical Guide

Running a local LLM API server with Ollama lets you replace OpenAI calls with self-hosted inference that carries no per-token cost. I built a TypeScript and Express.js service that mimics the OpenAI API, allowing existing applications to switch to local models like Llama 3 with minimal code changes. This guide breaks down the architecture, key decisions, and pitfalls from my implementation.

Architecture Overview

The system is designed as a drop-in replacement. An Express.js server receives HTTP requests formatted for the OpenAI Chat Completions API. Instead of forwarding these to OpenAI, it transforms the request and sends it to a local Ollama instance running Llama 3. The Ollama server, often containerized with Docker for easy setup, handles the actual model inference and streams the response back.

flowchart LR
    A[Client App] -->|OpenAI-formatted<br>HTTP Request| B[Express.js API Server]
    B -->|Transforms & Routes Request| C[Local Ollama Server]
    C -->|Runs Inference<br>Llama 3 Model| D[(Ollama Model Library)]
    C -->|Streams JSON Response| B
    B -->|Returns OpenAI-formatted<br>Response| A

This simple proxy architecture means you can point an existing app's baseURL to your local server and, for basic chat completions, it just works.
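
To make the baseURL swap concrete, here is a minimal client-side sketch. The helper names buildChatRequest and ask are hypothetical, and the base URL http://localhost:3000/v1 assumes the proxy listens on port 3000. It also assumes the proxy answers non-streaming requests with a single JSON body.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical helper: builds the request an OpenAI-style client would send,
// aimed at whatever base URL you configure.
function buildChatRequest(baseURL: string, messages: ChatMessage[], model = 'gpt-3.5-turbo') {
  return {
    // Strip trailing slashes so both ".../v1" and ".../v1/" work.
    url: `${baseURL.replace(/\/+$/, '')}/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

// Hypothetical helper: sends one prompt and returns the assistant's reply text.
async function ask(baseURL: string, prompt: string): Promise<string> {
  const { url, init } = buildChatRequest(baseURL, [{ role: 'user', content: prompt }]);
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`Proxy returned HTTP ${response.status}`);
  const data = (await response.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

Switching between the hosted API and the local proxy then comes down to the string passed as baseURL, for example ask('http://localhost:3000/v1', 'Hello').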

Key Technical Decisions

The first critical decision was to implement response streaming from day one. Many AI applications rely on streaming for user experience, and Ollama supports it natively. My endpoint had to consume the Ollama stream and reformat each chunk into an OpenAI-compatible Server-Sent Events (SSE) stream.

import express, { Request, Response } from 'express';

const app = express();
app.use(express.json());

app.post('/v1/chat/completions', async (req: Request, res: Response) => {
  // 1. Transform OpenAI request to Ollama format
  const { messages, model } = req.body;
  const ollamaRequest = {
    model: 'llama3', // Map to local model name
    messages: messages,
    stream: true,
  };

  // 2. Fetch stream from Ollama
  const ollamaResponse = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(ollamaRequest),
  });
  if (!ollamaResponse.ok || !ollamaResponse.body) {
    res.status(502).json({ error: { message: 'Ollama request failed' } });
    return;
  }

  // 3. Set up SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  // 4. Transformer: one Ollama JSON line -> one OpenAI SSE event
  const created = Math.floor(Date.now() / 1000); // OpenAI timestamps are Unix seconds
  const writeChunk = (delta: object, finishReason: string | null) => {
    const openAIChunk = {
      id: 'chatcmpl-local',
      object: 'chat.completion.chunk',
      created,
      model: model || 'llama3',
      choices: [{ index: 0, delta, finish_reason: finishReason }],
    };
    res.write(`data: ${JSON.stringify(openAIChunk)}\n\n`);
  };
  const handleLine = (line: string) => {
    if (!line.trim()) return;
    try {
      const ollamaChunk = JSON.parse(line);
      if (ollamaChunk.done) {
        writeChunk({}, 'stop'); // Final chunk carries the finish reason
        res.write('data: [DONE]\n\n');
      } else {
        writeChunk({ content: ollamaChunk.message?.content || '' }, null);
      }
    } catch (e) {
      console.error('Skipping malformed Ollama chunk:', e);
    }
  };

  // 5. Read the stream. Ollama sends newline-delimited JSON, not SSE,
  //    so split on newlines and keep any trailing partial line buffered.
  const reader = ollamaResponse.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';
    lines.forEach(handleLine);
  }
  handleLine(buffer);
  res.end();
});

The second decision was to hardcode the model mapping initially. While Ollama's API accepts a model name, I wanted my server's /v1/chat/completions endpoint to accept any model parameter (like gpt-3.5-turbo) for compatibility. I started by mapping all incoming requests to llama3, which simplified the initial proxy logic. A model registry came later.
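
The registry itself isn't shown in this post. As a rough sketch of the idea, a lookup table can translate OpenAI-style names into local Ollama tags and fall back to a default. The names MODEL_MAP, DEFAULT_MODEL, and resolveModel are hypothetical, and the specific mappings are illustrative assumptions rather than recommendations.

```typescript
// Hypothetical registry: OpenAI-style names on the left, local Ollama tags on the right.
const MODEL_MAP: Record<string, string> = {
  'gpt-3.5-turbo': 'llama3',
  'gpt-4': 'mistral',
};

// Used when the client sends no model, or one the registry doesn't know.
const DEFAULT_MODEL = 'llama3';

function resolveModel(requested: unknown): string {
  if (typeof requested !== 'string' || requested.length === 0) {
    return DEFAULT_MODEL;
  }
  return MODEL_MAP[requested] ?? DEFAULT_MODEL;
}
```

Falling back instead of rejecting unknown names keeps the "accept any model parameter" behavior described above, at the cost of hiding typos in client configuration.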

What Broke and How I Fixed It

The first major breakage was output-length mismanagement. The OpenAI API caps generation with a max_tokens parameter, but Ollama's native API calls the equivalent setting num_predict and expects it inside the options object. I initially forwarded max_tokens directly, which Ollama silently ignored. As a result, responses were unpredictably long and sometimes timed out. The fix was explicit parameter translation in the request transformation layer.

// In the request transformation logic
const { messages, model, max_tokens, temperature } = req.body;

const ollamaRequest = {
  model: 'llama3',
  messages: messages,
  stream: true,
  options: { // Ollama-specific configuration
    num_predict: max_tokens ?? 512, // Map max_tokens to num_predict, capped at 512 by default
    temperature: temperature ?? 0.7, // Honor the client's value when it sends one
  },
};

The second issue was handling non-streaming requests. My initial implementation assumed all clients wanted streaming. When I integrated a simple script that expected a single JSON response, it hung waiting for the stream to close. The solution was to check the request's stream boolean and implement two response paths: one for SSE streaming and one for buffering the complete response before sending a standard JSON reply.
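
For the buffered path, the proxy can call Ollama with stream set to false and reshape the single reply. The sketch below is one way to do that mapping, not the exact code from my service: toOpenAICompletion is a hypothetical name, the interface lists only the subset of Ollama's response fields it uses, and the fixed 'stop' finish reason is a simplifying assumption.

```typescript
// Subset of the fields Ollama returns from /api/chat when stream is false.
interface OllamaChatResponse {
  model: string;
  message: { role: string; content: string };
  done: boolean;
  prompt_eval_count?: number; // tokens in the prompt
  eval_count?: number;        // tokens generated
}

// Hypothetical helper: reshapes one complete Ollama reply into an
// OpenAI-style chat.completion object.
function toOpenAICompletion(ollama: OllamaChatResponse, requestedModel?: string) {
  const promptTokens = ollama.prompt_eval_count ?? 0;
  const completionTokens = ollama.eval_count ?? 0;
  return {
    id: 'chatcmpl-local',
    object: 'chat.completion',
    created: Math.floor(Date.now() / 1000), // Unix seconds
    model: requestedModel ?? ollama.model,
    choices: [
      {
        index: 0,
        message: { role: 'assistant', content: ollama.message.content },
        finish_reason: 'stop', // Assumption: length-limited stops are not distinguished
      },
    ],
    usage: {
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      total_tokens: promptTokens + completionTokens,
    },
  };
}
```

In the route handler, the branch would look at req.body.stream: when it is true, use the SSE path, and otherwise await the full Ollama JSON and send res.json(toOpenAICompletion(...)).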

How to Build Something Similar

Start by getting Ollama running locally. It's a single command. Then build the thinnest possible proxy that works for your primary use case.

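# 1. Install Ollama, then download a model
ollama pull llama3
# The Ollama server listens on localhost:11434. The desktop app and the Linux
# service start it for you; otherwise run `ollama serve` in a separate terminal.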

# 2. Initialize your Node.js project
mkdir local-llm-proxy && cd local-llm-proxy
npm init -y
npm install express eventsource-parser
npm install --save-dev typescript @types/express @types/node

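# 3. Create a basic server.ts file with the streaming code above,
#    ending with an app.listen(3000) call.
# 4. Compile and run it.
npx tsc --target es2022 --module commonjs --esModuleInterop --outDir dist server.ts
node dist/server.js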

Your server will now be running on the port you passed to app.listen, 3000 in these examples. Test it by sending a curl request mimicking an OpenAI client to http://localhost:3000/v1/chat/completions, or point a compatible application at the base URL http://localhost:3000/v1. You can find a more complete starter template on my portfolio at suhailroushan.com.

Would I Build It the Same Way Again?

For a personal project or small team prototype, absolutely. This approach gets you from zero to functional in an afternoon. The simplicity is its strength. For a production system serving multiple users, I would change two things. First, I'd add a proper model router to dynamically pull models (e.g., llama3:8b, mistral) based on the request. Second, I'd implement a basic request queue and timeout handler. Awaiting fetch does not block the Express.js event loop, but the model can only work on a limited number of generations at once, so a single long-running inference can stall every request waiting behind it.
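
As a rough sketch of that second change, the class below runs jobs one at a time and aborts any job that exceeds its deadline. SerialQueue is a hypothetical helper, not code from my service, and it assumes one generation at a time is the right limit for your hardware.

```typescript
// Hypothetical helper: runs queued jobs strictly one after another,
// giving each an AbortSignal that fires once its time budget is used up.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(job: (signal: AbortSignal) => Promise<T>, timeoutMs: number): Promise<T> {
    const run = async (): Promise<T> => {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeoutMs);
      try {
        return await job(controller.signal);
      } finally {
        clearTimeout(timer);
      }
    };
    // Start after the previous job settles, whether it succeeded or failed.
    const result = this.tail.then(run, run);
    // Swallow errors on the internal chain so one failure doesn't poison the queue.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

The timeout only takes effect if the job honors the signal, for example by passing it along as fetch(url, { ...init, signal }) when calling Ollama. A job that ignores the signal will run to completion regardless.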

The one thing you should know before starting is that local inference is a trade-off: you gain privacy and eliminate per-request costs, but you must manage performance, memory, and model compatibility yourself. Start by profiling how much RAM your chosen model needs (Llama 3 8B needs about 8 GB) and ensure your hardware can handle it.
