Running LLMs Locally with Ollama
Setting up Ollama, choosing a model (Llama 3, Mistral), Q4/Q8 quantization, and local vs API tradeoffs.
Install Ollama, run `llama3`, and call it via the Python OpenAI client. Compare response speed for Q4_K_M vs Q8_0 on the same prompt.
Copy and adapt to your context. Text in angle brackets should be replaced.
You are a developer assistant. Answer concisely with code examples. Question: {{user_question}}- Forgetting Ollama only listens on localhost by default — set `OLLAMA_HOST=0.0.0.0` for remote access. 2. Using F16 on a GPU with insufficient VRAM — the model crashes; use quantization.
- `ollama show llama3 --modelfile` — inspect the model's system prompt and parameters. 2. `OLLAMA_NUM_PARALLEL=4` — enable parallel requests on multi-core CPU.