How to run LLaMA 3 locally

How to run LLaMA 3 locally

How to Run LLaMA 3 Locally: Step-by-Step Guide for 2026

LLaMA 3 is Meta's open-source AI model — the same class of technology as ChatGPT, running entirely on your own computer. No subscription. No API costs. No data leaving your machine. Every prompt you type stays on your hardware and gets processed there.

Ollama has accumulated over 112 million model pulls for Llama 3.1 alone, making it the most popular local LLM runtime in the developer community. That number reflects a genuine shift: running capable AI locally is no longer experimental. It is a practical, everyday workflow for developers, writers, researchers, and anyone who values privacy.

This guide walks you through everything — checking your hardware, installing Ollama, downloading LLaMA 3, running your first prompt, and optimising for your specific setup. By the end, you will have a working local Llama instance and a clear understanding of when self-hosting makes financial and operational sense versus using a managed API.

Total setup time: under 15 minutes on most machines.

What Is LLaMA 3?

LLaMA (Large Language Model Meta AI) is Meta's family of open-source AI language models. Unlike GPT-4 or Claude, which are closed and only accessible via paid APIs, LLaMA models are released with open weights — meaning anyone can download the model files and run them.

Meta Llama 3 features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases, demonstrating state-of-the-art performance on a wide range of industry benchmarks.

In plain terms: LLaMA 3 is a highly capable AI model that handles writing, summarisation, Q&A, coding, analysis, and conversation — and you can run it for free on your own computer.

The two main versions:

  • LLaMA 3.1 8B — 8 billion parameters. Runs on most laptops with 8GB RAM. Fast, capable, best for everyday tasks.
  • LLaMA 3.1 70B — 70 billion parameters. Requires 32–48GB RAM or a high-end GPU. Near GPT-4 quality. Best for complex reasoning and demanding tasks.

For most beginners, start with the 8B model. It runs comfortably on standard hardware and handles the vast majority of real-world tasks well.

Step 1 — Check Your Hardware

Before downloading anything, verify your machine meets the minimum requirements.

Minimum (8B model, CPU-only)
  • RAM: 8GB
  • Storage: 10GB free space
  • GPU: Not required (but significantly improves speed)
  • OS: Windows 10+, macOS 11+, Ubuntu 20.04+
Recommended (8B model with GPU acceleration)
  • RAM: 16GB
  • GPU: NVIDIA with 8GB VRAM (RTX 3060, RTX 4060, or better) OR Apple Silicon (M1/M2/M3/M4 — any variant)
  • Storage: 10–20GB free
For 70B models
  • RAM: 40GB+ (or GPU with 40GB+ VRAM)
  • These are not beginner setups — start with 8B and scale up

Q4_K_M quantisation gives you the best balance of quality and speed. Use this for most applications. The 8B model in Q4_K_M quantisation is a 4.7GB download — manageable on any modern machine.

Apple Silicon note: Every M-series Mac — M1, M2, M3, M4, in any configuration — handles the 8B model smoothly. For M3 Pro/Max chips with 18+ GPU cores, Metal acceleration delivers 28–35 tokens per second on Llama 3.1 8B — genuinely conversational speed. Even an M1 MacBook Air with 8GB unified memory runs the 8B model at 15–20 tokens per second. That is fast enough for real-time use.

Step 2 — Install Ollama

Ollama is the easiest way to get LLaMA 3 running locally. It handles the download, quantisation, and configuration, allowing you to run Llama 3 with a single command. Think of it as Docker for AI models.

macOS

Option A — Homebrew (recommended for developers):

brew install ollama

Option B — Direct download: Go to ollama.com, download the .dmg file, drag to Applications, and open it. Approve the security prompt if macOS asks (System Settings → Privacy & Security → Allow).

Linux

One command installs everything:

curl -fsSL https://ollama.com/install.sh | sh

This downloads the Ollama binary and sets it up as a background service. It works on Ubuntu 20.04+, Debian, and most mainstream distributions.

Windows

Download the .exe installer from ollama.com and run it. For best results, make sure WSL2 is set up beforehand.

The native Windows installer works without WSL2, but GPU acceleration and overall performance are better with WSL2 enabled. If you are on Windows and have not set up WSL2, the Microsoft documentation covers it in a few steps.

Verify the installation

Open your terminal and run:

ollama --version

If you see a version number, Ollama is installed correctly and running as a background service.

Step 3 — Download LLaMA 3

With Ollama installed, downloading LLaMA 3 is a single command.

For most users — LLaMA 3.1 8B (recommended)
ollama pull llama3.1

The download is 4.9GB. On a 200 Mbps connection, this takes roughly 3–4 minutes. You will see a progress bar. Ollama downloads the Q4_K_M quantised version by default — the best balance of size, speed, and quality for most hardware.

For machines with 16GB+ RAM — LLaMA 3.2 (latest stable)
ollama pull llama3.2

LLaMA 3.2 is Meta's most recent release as of 2026 and performs slightly better than 3.1 on most benchmarks while remaining the same size.

For 70B quality on capable hardware

ollama pull llama3.1:70b

This is a 40GB+ download and requires substantial RAM or VRAM. Only attempt this if your hardware qualifies.

Check what you have downloaded
ollama list

This shows all models currently stored on your machine with their sizes.

Step 4 — Run Your First Prompt

Option A — Interactive chat in terminal
ollama run llama3.1

This loads the model and opens an interactive chat session directly in your terminal. Type your prompt and press Enter. Type /bye to exit.

>>> Write a short email declining a meeting politely

The model streams its response token by token — you will see it generating in real time.

Option B — Single prompt, no interactive session
ollama run llama3.1 "Summarise the key differences between supervised and unsupervised learning in 3 bullet points"

The model generates its response and exits. Useful for scripting and automation.

Option C — API call (for developers)

Ollama automatically runs a REST API server on port 11434. Once a model is pulled, you can call it from any application:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is the capital of India?",
  "stream": false
}'

Point your existing OpenAI code at http://localhost:11434/v1 and switch the model name. Most LLM libraries work without modification.

Step 5 — Optimise for Your Hardware

NVIDIA GPU (Windows / Linux)

Ollama automatically detects NVIDIA GPUs and uses CUDA acceleration. Verify GPU is being used:

ollama run llama3.1
# In another terminal while the model is running:
nvidia-smi

If you see Ollama processes using VRAM, GPU acceleration is working. If the model is running on CPU only, check that your CUDA drivers are up to date:

nvidia-smi  # Should show driver version 525+ for CUDA 12

To force all layers to the GPU:

export OLLAMA_NUM_GPU=999
ollama run llama3.1
Apple Silicon (M1 / M2 / M3 / M4)

Metal GPU acceleration is enabled automatically on all M-series Macs. No configuration needed. To verify and maximise it:

export OLLAMA_METAL_ENABLED=1
export OLLAMA_NUM_GPU=1
ollama run llama3.1

For M3 Pro/Max Apple Silicon chips with 18+ GPU cores, Metal acceleration delivers 28–35 tokens per second on Llama 3.1 8B. That is genuinely conversational speed — fast enough for real-time use.

Memory tip for Apple Silicon: On Apple Silicon, Ollama uses unified memory shared by CPU and GPU. Close other apps to free memory. With 16GB unified memory, Llama 3.1 8B runs well. With 32GB+, 13B–30B models are comfortable.

CPU-only machines (no dedicated GPU)

Local AI works on CPU-only hardware — just more slowly. CPU-only inference of 70B models takes minutes per token. For 7B models on a modern CPU, expect 2–8 tokens/second. CPU-only mode is practical for non-time-sensitive tasks like document summarisation, where you can leave it running and come back.

Optimise CPU inference by closing all other applications before running the model and choosing the smallest quantisation that meets your quality needs.

Quantisation Options Explained

When you pull LLaMA 3, Ollama downloads the Q4_K_M version by default. You can specify different quantisation levels for different hardware situations:

Quantisation File Size Speed Quality Use When
Q2_K 2.7GB Fastest Noticeably lower Very constrained hardware
Q4_K_M 4.7GB Fast Excellent Most users — default choice
Q5_K_M 5.7GB Good Slightly better Extra VRAM headroom
Q8_0 8.5GB Slower Near full precision 12GB+ VRAM available

To pull a specific quantisation:

ollama pull llama3.1:8b-instruct-q4_K_M  # Explicit Q4_K_M
ollama pull llama3.1:8b-instruct-q8_0    # Higher quality, more VRAM

For most setups, the default Q4_K_M is the right choice and requires no additional configuration.

Adding a Chat Interface (Optional)

The terminal works fine, but if you want a visual interface similar to ChatGPT — running entirely locally — Open WebUI is the standard solution.

Install Open WebUI with Docker

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open your browser and go to http://localhost:3000. Create an account (stored locally), connect to your Ollama instance, and you have a full chat interface running entirely on your machine.

No Docker? LM Studio provides a similar experience without any setup. See the Ollama vs LM Studio comparison for a full breakdown of both options.

Troubleshooting Common Issues

"Error: model requires more system memory."

The model needs more RAM than is currently available. Solutions:

  1. Close all other applications to free RAM
  2. Pull a smaller quantisation: ollama pull llama3.1:8b-q2_K (2.7GB, less RAM)
  3. Use a smaller model: ollama pull phi4-mini (strong reasoning, smaller footprint)
Slow response speed (2–5 tokens/second)

The model is running on CPU instead of GPU. Check:

  • NVIDIA: run nvidia-smi and verify CUDA drivers are installed
  • Apple Silicon: run export OLLAMA_METAL_ENABLED=1 before starting
  • Close other applications to free system memory
"Model not found" error
ollama list  # See exactly what is downloaded
ollama pull llama3.1  # Re-pull if needed

If the download stalls or fails, it is usually a network timeout. Ollama does not resume partial downloads by default. Delete the incomplete model with ollama rm llama3.1 and try again. On slow connections, consider pulling a smaller model first, such as ollama run phi3 (2.3GB).

Download stalled mid-way

Delete and restart:

ollama rm llama3.1
ollama pull llama3.1

What to Do With LLaMA 3 Once It Is Running

Writing and content drafting

Use it exactly like you would use ChatGPT. Draft blog posts, emails, summaries, social captions, or any other text — all processed locally without cloud exposure. For bloggers specifically, the practical guide to local AI for content creators covers specific workflows for writing with local models.

Coding assistance without sending your code to the cloud

For developers working on proprietary codebases, local LLaMA 3 is a genuine alternative to GitHub Copilot — with complete privacy. Connect it to VS Code using the Continue extension:

  1. Install Continue from the VS Code marketplace
  2. Open Continue settings
  3. Set model: llama3.1, API base: http://localhost:11434
  4. Use AI code assistance without any data leaving your machine
Document summarisation and analysis

Feed LLaMA 3 long documents, reports, or research papers via the API. For a dedicated document chat experience, AnythingLLM connects to your local Ollama instance and lets you chat with PDFs and documents privately.

Batch processing and automation

Because Ollama exposes a standard REST API, it integrates with automation tools. Connect it to Make.com, n8n, or custom scripts to process documents, generate content, or handle classification tasks at scale — with zero per-token cost.

LLaMA 3 vs Other Local Models

LLaMA 3 is the right starting point for most users, but it is not the only option. Here is how it compares to other popular local models:

Model Best For Size (8B Equivalent) Notes
LLaMA 3.1 / 3.2 General use, writing, Q&A 4.7GB Best all-rounder for beginners
Mistral 7B Fast responses, document analysis 4.1GB 22% faster than LLaMA 3, slightly lower quality
Qwen3 8B Coding, multilingual 4.9GB Excellent for code, strong on Hindi and other languages
DeepSeek R1 8B Complex reasoning 4.7GB Better reasoning than LLaMA 3 on difficult problems
Phi-4 Mini Constrained hardware 2.3GB Strong reasoning in a smaller footprint
Gemma 3 4B Very low RAM (4–6GB) 2.5GB Google's model offers impressive quality for its size

For a first-time setup, Llama 3.3 8B is the most balanced choice: 4.9GB download, requires 8GB RAM, and produces high-quality responses across most tasks, including conversation, writing, and code.

Frequently Asked Questions: How to run LLaMA 3 locally

Q1. Is LLaMA 3 free to use commercially? 

Meta's Llama models allow commercial use for companies with fewer than 700 million monthly active users. For the vast majority of individuals, freelancers, and small businesses, yes — LLaMA 3 is free for commercial use.

Q2. Do I need the internet to run LLaMA 3 once downloaded?

No. After the initial download, LLaMA 3 runs entirely offline. No internet connection required for inference.

Q3. How is LLaMA 3 different from ChatGPT? 

LLaMA 3 is open-source and runs on your hardware. ChatGPT runs on OpenAI's servers and requires an internet connection and a subscription for full access. LLaMA 3 is roughly 80–90% as capable as GPT-4 on most everyday tasks, with complete privacy and no ongoing cost.

Q4. Can I run LLaMA 3 on a laptop without a dedicated GPU? 

Yes. The 8B model runs on CPU-only hardware with 8GB RAM. Expect 2–8 tokens/second — slower than with a GPU, but functional for non-time-critical tasks.

Q5. How do I update to a newer LLaMA version?

ollama pull llama3.2  # Pull the newer version
ollama rm llama3.1    # Remove the old one to free disk space

Q6. Can I run multiple models at the same time? 

Yes. Ollama supports concurrent model loading. Set OLLAMA_MAX_LOADED_MODELS=2 to keep two models in memory simultaneously.

Q7. What is the difference between the 8B and 70B models? 

The 70B model is significantly more capable — better reasoning, more nuanced responses, stronger performance on complex tasks. But it requires 32–48GB of RAM and a high-end GPU. For most users, the 8B model handles 90% of real-world tasks well.

The Bottom Line

Running LLaMA 3 locally is a 15-minute setup that gives you a capable, private AI assistant with no ongoing cost and no data leaving your machine.

Start with Llama 3.1 8B via Ollama. It runs on most modern GPUs, downloads in minutes, and gives you a working mental model of local inference before scaling up.

Install Ollama. Pull llama3.1. Run your first prompt. The rest — optimisation, interfaces, integration — you can layer in as you go.

For the broader picture of what is possible with local AI — including hardware recommendations and model comparisons — see the complete guide to running AI locally.

Author Image

Hardeep Singh

Hardeep Singh is a tech and money-blogging enthusiast, sharing guides on earning apps, affiliate programs, online business tips, AI tools, SEO, and blogging tutorials. About Author.

Previous Post