Running LLMs Locally with Ollama

Setting up Ollama, choosing a model (Llama 3, Mistral), Q4/Q8 quantization, and local vs API tradeoffs.

35 min read2 questions in quizReady prompt includedIn progress

Practical exercise

What to do after this lesson

Install Ollama, run `llama3`, and call it via the Python OpenAI client. Compare response speed for Q4_K_M vs Q8_0 on the same prompt.

Install Ollama, run `llama3`, and call it via the Python OpenAI client. Compare response speed for Q4_K_M vs Q8_0 on the same prompt.

Your answer

Ready-to-use prompt

Template for this lesson

Copy and adapt to your context. Text in angle brackets should be replaced.

You are a developer assistant. Answer concisely with code examples. Question: {{user_question}}

Prompt

Common mistakes

What people get wrong

Forgetting Ollama only listens on localhost by default — set `OLLAMA_HOST=0.0.0.0` for remote access. 2. Using F16 on a GPU with insufficient VRAM — the model crashes; use quantization.

Pro tips

What works but no one documents

`ollama show llama3 --modelfile` — inspect the model's system prompt and parameters. 2. `OLLAMA_NUM_PARALLEL=4` — enable parallel requests on multi-core CPU.

Discussion