How to run LLaMA 3 locally
How to Run LLaMA 3 Locally: Step-by-Step Guide for 2026
LLaMA 3 is Meta's open-source AI model — the same class of technology as ChatGPT, running entirely on your own computer. No subscription. No API costs. No data leaving your machine. Every prompt you type stays on your hardware and gets processed there.
Ollama has accumulated over 112 million model pulls for Llama 3.1 alone, making it the most popular local LLM runtime in the developer community. That number reflects a genuine shift: running capable AI locally is no longer experimental. It is a practical, everyday workflow for developers, writers, researchers, and anyone who values privacy.
This guide walks you through everything — checking your hardware, installing Ollama, downloading LLaMA 3, running your first prompt, and optimising for your specific setup. By the end, you will have a working local Llama instance and a clear understanding of when self-hosting makes financial and operational sense versus using a managed API.
Total setup time: under 15 minutes on most machines.
What Is LLaMA 3?
LLaMA (Large Language Model Meta AI) is Meta's family of open-source AI language models. Unlike GPT-4 or Claude, which are closed and only accessible via paid APIs, LLaMA models are released with open weights — meaning anyone can download the model files and run them.
Meta Llama 3 features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases, demonstrating state-of-the-art performance on a wide range of industry benchmarks.
In plain terms: LLaMA 3 is a highly capable AI model that handles writing, summarisation, Q&A, coding, analysis, and conversation — and you can run it for free on your own computer.
The two main versions:
- LLaMA 3.1 8B — 8 billion parameters. Runs on most laptops with 8GB RAM. Fast, capable, best for everyday tasks.
- LLaMA 3.1 70B — 70 billion parameters. Requires 32–48GB RAM or a high-end GPU. Near GPT-4 quality. Best for complex reasoning and demanding tasks.
For most beginners, start with the 8B model. It runs comfortably on standard hardware and handles the vast majority of real-world tasks well.
Step 1 — Check Your Hardware
Before downloading anything, verify your machine meets the minimum requirements.
- RAM: 8GB
- Storage: 10GB free space
- GPU: Not required (but significantly improves speed)
- OS: Windows 10+, macOS 11+, Ubuntu 20.04+
- RAM: 16GB
- GPU: NVIDIA with 8GB VRAM (RTX 3060, RTX 4060, or better) OR Apple Silicon (M1/M2/M3/M4 — any variant)
- Storage: 10–20GB free
- RAM: 40GB+ (or GPU with 40GB+ VRAM)
- These are not beginner setups — start with 8B and scale up
Q4_K_M quantisation gives you the best balance of quality and speed. Use this for most applications. The 8B model in Q4_K_M quantisation is a 4.7GB download — manageable on any modern machine.
Apple Silicon note: Every M-series Mac — M1, M2, M3, M4, in any configuration — handles the 8B model smoothly. For M3 Pro/Max chips with 18+ GPU cores, Metal acceleration delivers 28–35 tokens per second on Llama 3.1 8B — genuinely conversational speed. Even an M1 MacBook Air with 8GB unified memory runs the 8B model at 15–20 tokens per second. That is fast enough for real-time use.
Step 2 — Install Ollama
Ollama is the easiest way to get LLaMA 3 running locally. It handles the download, quantisation, and configuration, allowing you to run Llama 3 with a single command. Think of it as Docker for AI models.
Option A — Homebrew (recommended for developers):
brew install ollama
Option B — Direct download: Go to ollama.com, download the .dmg file, drag to Applications, and open it. Approve the security prompt if macOS asks (System Settings → Privacy & Security → Allow).
One command installs everything:
curl -fsSL https://ollama.com/install.sh | sh
This downloads the Ollama binary and sets it up as a background service. It works on Ubuntu 20.04+, Debian, and most mainstream distributions.
Download the .exe installer from ollama.com and run it. For best results, make sure WSL2 is set up beforehand.
The native Windows installer works without WSL2, but GPU acceleration and overall performance are better with WSL2 enabled. If you are on Windows and have not set up WSL2, the Microsoft documentation covers it in a few steps.
Open your terminal and run:
ollama --version
If you see a version number, Ollama is installed correctly and running as a background service.
Step 3 — Download LLaMA 3
With Ollama installed, downloading LLaMA 3 is a single command.
ollama pull llama3.1
The download is 4.9GB. On a 200 Mbps connection, this takes roughly 3–4 minutes. You will see a progress bar. Ollama downloads the Q4_K_M quantised version by default — the best balance of size, speed, and quality for most hardware.
ollama pull llama3.2
LLaMA 3.2 is Meta's most recent release as of 2026 and performs slightly better than 3.1 on most benchmarks while remaining the same size.
For 70B quality on capable hardware
ollama pull llama3.1:70b
This is a 40GB+ download and requires substantial RAM or VRAM. Only attempt this if your hardware qualifies.
ollama list
This shows all models currently stored on your machine with their sizes.
Step 4 — Run Your First Prompt
ollama run llama3.1
This loads the model and opens an interactive chat session directly in your terminal. Type your prompt and press Enter. Type /bye to exit.
>>> Write a short email declining a meeting politely
The model streams its response token by token — you will see it generating in real time.
ollama run llama3.1 "Summarise the key differences between supervised and unsupervised learning in 3 bullet points"
The model generates its response and exits. Useful for scripting and automation.
Ollama automatically runs a REST API server on port 11434. Once a model is pulled, you can call it from any application:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "What is the capital of India?",
"stream": false
}'
Point your existing OpenAI code at http://localhost:11434/v1 and switch the model name. Most LLM libraries work without modification.
Step 5 — Optimise for Your Hardware
Ollama automatically detects NVIDIA GPUs and uses CUDA acceleration. Verify GPU is being used:
ollama run llama3.1
# In another terminal while the model is running:
nvidia-smi
If you see Ollama processes using VRAM, GPU acceleration is working. If the model is running on CPU only, check that your CUDA drivers are up to date:
nvidia-smi # Should show driver version 525+ for CUDA 12
To force all layers to the GPU:
export OLLAMA_NUM_GPU=999
ollama run llama3.1
Metal GPU acceleration is enabled automatically on all M-series Macs. No configuration needed. To verify and maximise it:
export OLLAMA_METAL_ENABLED=1
export OLLAMA_NUM_GPU=1
ollama run llama3.1
For M3 Pro/Max Apple Silicon chips with 18+ GPU cores, Metal acceleration delivers 28–35 tokens per second on Llama 3.1 8B. That is genuinely conversational speed — fast enough for real-time use.
Memory tip for Apple Silicon: On Apple Silicon, Ollama uses unified memory shared by CPU and GPU. Close other apps to free memory. With 16GB unified memory, Llama 3.1 8B runs well. With 32GB+, 13B–30B models are comfortable.
Local AI works on CPU-only hardware — just more slowly. CPU-only inference of 70B models takes minutes per token. For 7B models on a modern CPU, expect 2–8 tokens/second. CPU-only mode is practical for non-time-sensitive tasks like document summarisation, where you can leave it running and come back.
Optimise CPU inference by closing all other applications before running the model and choosing the smallest quantisation that meets your quality needs.
Quantisation Options Explained
When you pull LLaMA 3, Ollama downloads the Q4_K_M version by default. You can specify different quantisation levels for different hardware situations:
| Quantisation | File Size | Speed | Quality | Use When |
|---|---|---|---|---|
| Q2_K | 2.7GB | Fastest | Noticeably lower | Very constrained hardware |
| Q4_K_M | 4.7GB | Fast | Excellent | Most users — default choice |
| Q5_K_M | 5.7GB | Good | Slightly better | Extra VRAM headroom |
| Q8_0 | 8.5GB | Slower | Near full precision | 12GB+ VRAM available |
To pull a specific quantisation:
ollama pull llama3.1:8b-instruct-q4_K_M # Explicit Q4_K_M
ollama pull llama3.1:8b-instruct-q8_0 # Higher quality, more VRAM
For most setups, the default Q4_K_M is the right choice and requires no additional configuration.
Adding a Chat Interface (Optional)
The terminal works fine, but if you want a visual interface similar to ChatGPT — running entirely locally — Open WebUI is the standard solution.
Install Open WebUI with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open your browser and go to http://localhost:3000. Create an account (stored locally), connect to your Ollama instance, and you have a full chat interface running entirely on your machine.
No Docker? LM Studio provides a similar experience without any setup. See the Ollama vs LM Studio comparison for a full breakdown of both options.
Troubleshooting Common Issues
The model needs more RAM than is currently available. Solutions:
- Close all other applications to free RAM
- Pull a smaller quantisation:
ollama pull llama3.1:8b-q2_K(2.7GB, less RAM) - Use a smaller model:
ollama pull phi4-mini(strong reasoning, smaller footprint)
The model is running on CPU instead of GPU. Check:
- NVIDIA: run
nvidia-smiand verify CUDA drivers are installed - Apple Silicon: run
export OLLAMA_METAL_ENABLED=1before starting - Close other applications to free system memory
ollama list # See exactly what is downloaded
ollama pull llama3.1 # Re-pull if needed
If the download stalls or fails, it is usually a network timeout. Ollama does not resume partial downloads by default. Delete the incomplete model with ollama rm llama3.1 and try again. On slow connections, consider pulling a smaller model first, such as ollama run phi3 (2.3GB).
Delete and restart:
ollama rm llama3.1
ollama pull llama3.1
What to Do With LLaMA 3 Once It Is Running
Use it exactly like you would use ChatGPT. Draft blog posts, emails, summaries, social captions, or any other text — all processed locally without cloud exposure. For bloggers specifically, the practical guide to local AI for content creators covers specific workflows for writing with local models.
For developers working on proprietary codebases, local LLaMA 3 is a genuine alternative to GitHub Copilot — with complete privacy. Connect it to VS Code using the Continue extension:
- Install Continue from the VS Code marketplace
- Open Continue settings
- Set model:
llama3.1, API base:http://localhost:11434 - Use AI code assistance without any data leaving your machine
Feed LLaMA 3 long documents, reports, or research papers via the API. For a dedicated document chat experience, AnythingLLM connects to your local Ollama instance and lets you chat with PDFs and documents privately.
Because Ollama exposes a standard REST API, it integrates with automation tools. Connect it to Make.com, n8n, or custom scripts to process documents, generate content, or handle classification tasks at scale — with zero per-token cost.
LLaMA 3 vs Other Local Models
LLaMA 3 is the right starting point for most users, but it is not the only option. Here is how it compares to other popular local models:
| Model | Best For | Size (8B Equivalent) | Notes |
|---|---|---|---|
| LLaMA 3.1 / 3.2 | General use, writing, Q&A | 4.7GB | Best all-rounder for beginners |
| Mistral 7B | Fast responses, document analysis | 4.1GB | 22% faster than LLaMA 3, slightly lower quality |
| Qwen3 8B | Coding, multilingual | 4.9GB | Excellent for code, strong on Hindi and other languages |
| DeepSeek R1 8B | Complex reasoning | 4.7GB | Better reasoning than LLaMA 3 on difficult problems |
| Phi-4 Mini | Constrained hardware | 2.3GB | Strong reasoning in a smaller footprint |
| Gemma 3 4B | Very low RAM (4–6GB) | 2.5GB | Google's model offers impressive quality for its size |
For a first-time setup, Llama 3.3 8B is the most balanced choice: 4.9GB download, requires 8GB RAM, and produces high-quality responses across most tasks, including conversation, writing, and code.
Frequently Asked Questions: How to run LLaMA 3 locally
Q1. Is LLaMA 3 free to use commercially?
Meta's Llama models allow commercial use for companies with fewer than 700 million monthly active users. For the vast majority of individuals, freelancers, and small businesses, yes — LLaMA 3 is free for commercial use.
Q2. Do I need the internet to run LLaMA 3 once downloaded?
No. After the initial download, LLaMA 3 runs entirely offline. No internet connection required for inference.
Q3. How is LLaMA 3 different from ChatGPT?
LLaMA 3 is open-source and runs on your hardware. ChatGPT runs on OpenAI's servers and requires an internet connection and a subscription for full access. LLaMA 3 is roughly 80–90% as capable as GPT-4 on most everyday tasks, with complete privacy and no ongoing cost.
Q4. Can I run LLaMA 3 on a laptop without a dedicated GPU?
Yes. The 8B model runs on CPU-only hardware with 8GB RAM. Expect 2–8 tokens/second — slower than with a GPU, but functional for non-time-critical tasks.
Q5. How do I update to a newer LLaMA version?
ollama pull llama3.2 # Pull the newer version
ollama rm llama3.1 # Remove the old one to free disk space
Q6. Can I run multiple models at the same time?
Yes. Ollama supports concurrent model loading. Set OLLAMA_MAX_LOADED_MODELS=2 to keep two models in memory simultaneously.
Q7. What is the difference between the 8B and 70B models?
The 70B model is significantly more capable — better reasoning, more nuanced responses, stronger performance on complex tasks. But it requires 32–48GB of RAM and a high-end GPU. For most users, the 8B model handles 90% of real-world tasks well.
The Bottom Line
Running LLaMA 3 locally is a 15-minute setup that gives you a capable, private AI assistant with no ongoing cost and no data leaving your machine.
Start with Llama 3.1 8B via Ollama. It runs on most modern GPUs, downloads in minutes, and gives you a working mental model of local inference before scaling up.
Install Ollama. Pull llama3.1. Run your first prompt. The rest — optimisation, interfaces, integration — you can layer in as you go.
For the broader picture of what is possible with local AI — including hardware recommendations and model comparisons — see the complete guide to running AI locally.
.webp)