
How to Run Ollama Local AI Chat on the Raspberry Pi 5

Learn how to deploy and run Ollama, a local large language model (LLM) server, on your Raspberry Pi 5. This guide covers the complete setup process with optimizations specific to ARM64 architecture and the Pi 5's performance characteristics for AI workloads.

In this comprehensive guide, we'll walk you through running Ollama for local LLM inference on ARM architecture. Whether you're building a home lab or managing an edge computing environment, this tutorial will help you leverage your ARM-based PicoCluster Desktop Datacenter for private AI workloads.

Prerequisites

Before we begin, ensure you have the following:

  • Hardware: Raspberry Pi 5 with 8GB RAM, 128GB+ high-speed storage (SSD strongly recommended)
  • Software: Raspberry Pi OS 64-bit or Ubuntu 22.04+ for ARM64
  • Network: Internet connection for model downloads, local network access
  • Knowledge: Basic Linux command line experience
  • Containers (optional): Docker is used for the web interface in Step 6 - see our guide for installing Docker on Raspberry Pi 5; if you plan to orchestrate workloads across a cluster, see our K3s and MicroK8s installation guides

Background & Context

Ollama supports the ARM64 architecture, making it possible to run language models on the Raspberry Pi 5. While performance is more limited than on typical x86 systems, the Pi 5's faster CPU and 8GB RAM option make it viable for smaller, quantized models and development use cases. This setup is well suited to learning AI concepts, building edge AI solutions, or creating private chat systems on ARM hardware.

Your Desktop Datacenter provides the perfect environment for building ARM-based AI solutions, creating development environments for edge AI applications, or learning AI technologies without the overhead and privacy concerns of cloud-based services.

Step-by-Step Implementation

Step 1: ARM64 Prerequisites Verification

Verify your ARM64 system is ready for Ollama deployment:

bash
# Check architecture and system resources
uname -m  # Should show aarch64
free -h   # Check available RAM (8GB recommended)
df -h     # Check available disk space

# Check CPU information for Pi 5
lscpu | grep -E "(Model name|CPU\(s\)|Thread|Architecture)"

# Verify we're on 64-bit system
getconf LONG_BIT  # Should show 64

# Check thermal status (important for AI workloads)
vcgencmd measure_temp
vcgencmd get_throttled

Step 2: Install Ollama on ARM64

Install Ollama with ARM64-specific considerations:

bash
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install required dependencies
sudo apt install -y curl wget jq

# Install Ollama using the official script (supports ARM64)
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation and architecture
ollama --version

# Check if Ollama service is running
systemctl status ollama

# Start and enable Ollama service if needed
sudo systemctl start ollama
sudo systemctl enable ollama
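
Before moving on to tuning, you can optionally confirm the HTTP API itself is responding. This is a minimal sanity check that assumes Ollama is listening on its default port, 11434:

bash
# Query the local API for installed models (an empty list is expected on a fresh install)
curl -s http://localhost:11434/api/tags | jq .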

Step 3: Configure Ollama for ARM64 Optimization

Configure Ollama with ARM64-specific optimizations:

bash
# Create Ollama service configuration directory
sudo mkdir -p /etc/systemd/system/ollama.service.d

# Configure Ollama for Pi 5 ARM64 optimization
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_QUEUE=5"
Environment="OLLAMA_FLASH_ATTENTION=0"
EOF

# Set memory limits for stability on Pi 5
sudo tee -a /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
MemoryMax=6G
MemoryHigh=5G
EOF

# Create the models directory and give the Ollama service user ownership
sudo mkdir -p /var/lib/ollama/models
sudo chown -R ollama:ollama /var/lib/ollama

# Reload systemd and restart Ollama
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify Ollama is listening
sudo ss -tlnp | grep :11434
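
To confirm systemd actually picked up the drop-in, inspect the merged unit and its environment. This is an optional check using standard systemctl commands, not Ollama-specific tooling:

bash
# Show the unit file together with the override drop-in
sudo systemctl cat ollama

# Confirm the environment overrides are active
sudo systemctl show ollama -p Environment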

Step 4: Download ARM64-Optimized Models

Download smaller, ARM64-friendly models suitable for the Pi 5:

bash
# Start with the smallest viable chat model
ollama pull llama2:7b-chat-q4_0  # 4-bit quantized for better ARM performance

# Alternative lightweight models for Pi 5
# ollama pull mistral:7b-instruct-q4_0  # Quantized Mistral
# ollama pull phi:2.7b                  # Smaller model, good for Pi 5
# ollama pull tinyllama:1.1b           # Very small, fast model

# List downloaded models and their sizes
ollama list

# Check model file sizes on disk
sudo du -sh /var/lib/ollama/models/* | sort -hr

# Test the model (be patient, ARM64 inference is slower)
echo "Testing model - this may take a minute on ARM64..."
ollama run llama2:7b-chat-q4_0
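
If you prefer a non-interactive smoke test, ollama run also accepts a prompt as an argument and exits once the reply is printed. The prompt below is only an example:

bash
# One-shot prompt that returns to the shell when generation finishes
ollama run llama2:7b-chat-q4_0 "Explain what a Raspberry Pi is in one sentence."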

Step 5: ARM64 Performance Testing

Test and benchmark Ollama performance on Pi 5 ARM64:

bash
# Create ARM64 performance test script
mkdir -p ~/ollama-scripts
tee ~/ollama-scripts/test-arm64-performance.sh << 'EOF'
#!/bin/bash

echo "=== Ollama ARM64 Performance Test ==="
echo "Testing on: $(uname -m)"
echo "Date: $(date)"
echo

# System status before test
echo "Pre-test system status:"
vcgencmd measure_temp
free -h
echo

# Simple inference test with timing
echo "Starting inference test..."
start_time=$(date +%s)

curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b-chat-q4_0",
    "prompt": "What is artificial intelligence?",
    "stream": false
  }' > /tmp/ollama_response.json

end_time=$(date +%s)
duration=$((end_time - start_time))

echo "Inference completed in ${duration} seconds"
echo

# System status after test
echo "Post-test system status:"
vcgencmd measure_temp
vcgencmd get_throttled
echo

# Show response
echo "Generated response:"
jq -r '.response' /tmp/ollama_response.json | head -n 10
EOF

chmod +x ~/ollama-scripts/test-arm64-performance.sh
~/ollama-scripts/test-arm64-performance.sh

Step 6: Set Up Lightweight Web Interface

Install a lightweight web UI optimized for Pi 5 ARM64:

bash
# Install Docker for ARM64 if not already installed
sudo apt update
sudo apt install -y docker.io
sudo systemctl start docker
sudo systemctl enable docker

# Add user to docker group
sudo usermod -aG docker $USER
# Log out and back in for group changes

# Run lightweight Open WebUI for ARM64
# host.docker.internal is mapped to the Pi itself so the container can reach Ollama on the host
docker run -d \
  --name open-webui-pi5 \
  --restart unless-stopped \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e WEBUI_SECRET_KEY="$(openssl rand -base64 32)" \
  -v open-webui:/app/backend/data \
  --memory=1g \
  --cpus=2 \
  ghcr.io/open-webui/open-webui:main

# Monitor container startup (may take time on ARM64)
echo "Waiting for Web UI to start..."
sleep 30
docker logs open-webui-pi5 | tail -10

# Check if service is accessible
curl -s http://localhost:3000 > /dev/null && echo "Web UI is ready!" || echo "Web UI still starting..."
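
To use the interface from another machine on your network, browse to port 3000 on the Pi's address. The placeholder below stands for whatever address hostname -I reports for your Pi:

bash
# Find the Pi's LAN address, then open http://<pi-ip>:3000 in a browser on any machine on the same network
hostname -I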

Step 7: ARM64 Monitoring and Optimization

Set up monitoring specific to Pi 5 ARM64 performance:

bash
# Create comprehensive Pi 5 monitoring script
tee ~/ollama-scripts/monitor-pi5.sh << 'EOF'
#!/bin/bash

echo "=== Raspberry Pi 5 Ollama Monitor ==="
echo "Date: $(date)"
echo

echo "Pi 5 Hardware Status:"
echo "CPU Temperature: $(vcgencmd measure_temp)"
echo "Throttling Status: $(vcgencmd get_throttled)"
echo "CPU Frequency: $(vcgencmd measure_clock arm)"
echo "GPU Memory Split: $(vcgencmd get_mem gpu) / $(vcgencmd get_mem arm)"
echo

echo "System Resources:"
echo "Memory Usage:"
free -h
echo
echo "CPU Usage:"
top -bn1 | grep "Cpu(s)" | head -1
echo

echo "Storage Usage:"
df -h /var/lib/ollama
echo

echo "Ollama Status:"
systemctl is-active ollama
curl -s http://localhost:11434/api/ps | jq -r '.models[]? | "Model: \(.name) - Loaded size: \(.size / 1048576 | floor)MB"'
echo

echo "Network Connections:"
sudo ss -tlnp | grep :11434
EOF

chmod +x ~/ollama-scripts/monitor-pi5.sh

# Create model management script for ARM64
tee ~/ollama-scripts/manage-models-arm64.sh << 'EOF'
#!/bin/bash

echo "=== ARM64 Model Management ==="

case $1 in
  "list")
    echo "Available models:"
    ollama list
    echo
    echo "Disk usage:"
    sudo du -sh /var/lib/ollama/models/* 2>/dev/null | sort -hr
    ;;
  "unload")
    echo "Unloading all models to free memory..."
    curl -X POST http://localhost:11434/api/generate -d '{"model": "", "keep_alive": 0}'
    echo "Models unloaded. Memory freed:"
    free -h
    ;;
  "temp")
    echo "Current system temperature:"
    vcgencmd measure_temp
    vcgencmd get_throttled
    ;;
  *)
    echo "Usage: $0 {list|unload|temp}"
    echo "  list   - Show models and disk usage"
    echo "  unload - Unload models to free memory"
    echo "  temp   - Show temperature and throttling status"
    ;;
esac
EOF

chmod +x ~/ollama-scripts/manage-models-arm64.sh

# Run initial monitoring
~/ollama-scripts/monitor-pi5.sh
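
If you want the monitor to run on a schedule, a simple cron entry is enough. The 15-minute interval and log path below are arbitrary choices rather than requirements:

bash
# Append a cron job that logs Pi 5 status every 15 minutes
mkdir -p ~/ollama-logs
(crontab -l 2>/dev/null; echo "*/15 * * * * $HOME/ollama-scripts/monitor-pi5.sh >> $HOME/ollama-logs/pi5-monitor.log 2>&1") | crontab -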

Desktop Datacenter Integration

Home Lab Applications:

  • Edge AI experimentation and development
  • IoT device integration with local AI processing
  • Private chatbot for home automation systems (see the API sketch after this list)
  • Educational AI projects without cloud dependencies
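
As an illustration of the chatbot and IoT ideas above, any script or device on your network can talk to the Pi through Ollama's chat endpoint. This is a minimal sketch run on the Pi itself; a remote device would substitute the Pi's address for localhost, and the model and prompt are placeholders:

bash
# Minimal chat request against the local Ollama API
curl -s http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b-chat-q4_0",
    "messages": [
      {"role": "user", "content": "Summarize in one line: all sensors online, garage door closed."}
    ],
    "stream": false
  }' | jq -r '.message.content'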

Educational Benefits:

  • Understanding ARM64 AI deployment challenges and solutions
  • Learning about model quantization and optimization
  • Hands-on experience with edge AI computing
  • Resource-constrained system optimization skills

Professional Development:

  • ARM64 deployment experience for edge AI applications
  • Understanding of AI model performance on embedded systems
  • Cost-effective AI development and testing environment
  • Experience with thermal and resource management in AI workloads

Troubleshooting

Models fail to load or run very slowly: Check available memory with free -h, consider switching to a smaller model such as tinyllama:1.1b, and monitor temperature with vcgencmd measure_temp.

System becomes unresponsive during inference: Reduce OLLAMA_NUM_PARALLEL to 1, ensure adequate cooling, and check for thermal throttling with vcgencmd get_throttled.

ARM64 Docker images fail to run: Verify you are on a 64-bit OS with getconf LONG_BIT and ensure Docker is pulling ARM64 (aarch64) images rather than x86_64 ones.

High CPU temperature and performance throttling: Improve the cooling setup, reduce concurrent operations, and consider an active cooling solution for sustained AI workloads.
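
The hex value reported by vcgencmd get_throttled is a bitmask. The snippet below decodes the commonly checked bits as documented for Raspberry Pi firmware; treat it as a convenience helper rather than an exhaustive decoder:

bash
# Decode the throttled bitmask into human-readable flags
value=$(( $(vcgencmd get_throttled | cut -d= -f2) ))
(( value & 0x1 ))     && echo "Under-voltage detected (now)"
(( value & 0x2 ))     && echo "ARM frequency capped (now)"
(( value & 0x4 ))     && echo "Currently throttled"
(( value & 0x8 ))     && echo "Soft temperature limit active (now)"
(( value & 0x10000 )) && echo "Under-voltage has occurred since boot"
(( value & 0x20000 )) && echo "ARM frequency capping has occurred since boot"
(( value & 0x40000 )) && echo "Throttling has occurred since boot"
(( value & 0x80000 )) && echo "Soft temperature limit has been reached since boot"
(( value == 0 ))      && echo "No under-voltage or throttling events recorded"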

Performance Optimization

For optimal ARM64 AI performance, use quantized models (q4_0 or q5_0), monitor CPU temperature to prevent throttling, and use fast storage (SSD over SD card). The Pi 5's ARM64 architecture requires careful resource management but can handle smaller models effectively for development and edge AI use cases.
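
To put a number on generation speed, the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), which combine into tokens per second. A rough sketch, assuming the quantized llama2 model from Step 4 is installed:

bash
# Report generation speed in tokens per second for a short prompt
curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2:7b-chat-q4_0", "prompt": "Name three uses for a Raspberry Pi.", "stream": false}' \
  | jq -r '"Generated \(.eval_count) tokens at \(.eval_count / (.eval_duration / 1e9) | floor) tokens/sec"'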

Conclusion

You now have Ollama running efficiently on your Raspberry Pi 5's ARM64 architecture. This setup demonstrates the potential of ARM processors for edge AI workloads while providing a cost-effective platform for AI development and experimentation.

Your PicoCluster Desktop Datacenter is well suited to this kind of local LLM work. Compared to cloud AI services it avoids recurring costs, keeps your data entirely on your own hardware, and gives you hands-on experience with the same tooling used in larger AI deployments.

Related Products & Resources

Explore our range of Desktop Datacenter solutions. For additional support and documentation, visit our support center.
