In the world of AI inference, performance optimization is crucial, especially when running large language models on local hardware. This article explores a custom build script for llama.cpp that leverages Intel’s OneAPI toolkit to maximize performance on Intel processors.

What is Llama.cpp?

Llama.cpp is a lightweight, open-source C/C++ inference engine that runs large language models (LLMs) such as Llama, Vicuna, and other compatible models locally on the CPU. Unlike cloud-based solutions, llama.cpp allows you to run sophisticated AI models on your own hardware without internet connectivity or API costs.

Why Intel Optimizations Matter

Intel processors power most modern computers, but their full potential often remains untapped without proper optimization. The script we’re examining uses:

  • Intel MKL (Math Kernel Library): Highly optimized mathematical routines
  • Intel C++ Compiler (icpx/icx): Advanced compiler optimizations
  • Intel OneAPI: Unified programming model for Intel hardware
  • Flash Attention: Memory-efficient attention mechanism
  • Context Shifting: Dynamic context management for better memory usage

The Build Script Deep Dive

Here’s the complete build script with detailed explanations:

#!/bin/bash
# build-llama-intel.sh
set -e

BUILD_DIR=/home/jp/PERURI-ID/llama.cpp
BUILD_TARGET_DIR=build
INSTALL_TARGET_DIR=/home/jp/.local/share/llama.cpp

Script Configuration

The script begins by setting essential paths:

  • BUILD_DIR: Location of your llama.cpp source code
  • BUILD_TARGET_DIR: Build directory (created fresh each time)
  • INSTALL_TARGET_DIR: Final installation location
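
These paths are specific to the author's machine; if your source checkout or install location differs, only these three variables need to change. A small sketch of a more flexible variant (the fallback values below are just examples, not the script's actual defaults):

# Allow the paths to be overridden from the environment, with example fallbacks
BUILD_DIR="${BUILD_DIR:-$HOME/src/llama.cpp}"
BUILD_TARGET_DIR="${BUILD_TARGET_DIR:-build}"
INSTALL_TARGET_DIR="${INSTALL_TARGET_DIR:-$HOME/.local/share/llama.cpp}"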

Intel OneAPI Environment Setup

# Load Intel OneAPI environment
if [ -z "$MKLROOT" ]; then
    echo "[INFO] MKLROOT not set, sourcing Intel OneAPI setvars.sh..."
    source /opt/intel/oneapi/setvars.sh
else
    echo "[INFO] MKLROOT already set: $MKLROOT"
fi

This section ensures Intel OneAPI is properly loaded. The MKLROOT environment variable indicates whether the Intel environment is already active.
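
Before the first build it is worth confirming that the toolchain actually resolves once setvars.sh has been sourced. A quick manual sanity check (run once in a shell, not part of the script):

source /opt/intel/oneapi/setvars.sh
which icx icpx        # both compilers should resolve under /opt/intel/oneapi
echo "$MKLROOT"       # should point at the MKL installation directory
icx --version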

Build Directory Preparation

# Go to the source directory
cd "$BUILD_DIR"

# Update the repository
git pull

# Recreate the build directory from scratch and enter it
rm -rf "$BUILD_TARGET_DIR"
mkdir -p "$BUILD_TARGET_DIR"
cd "$BUILD_TARGET_DIR"

The script:

  1. Navigates to the source directory
  2. Updates the repository with git pull (see the note after this list)
  3. Creates a fresh build directory
  4. Changes into the build directory
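
One thing to keep in mind: git pull will fail if local modifications conflict with upstream changes. A slightly more defensive update step (an optional sketch, not part of the original script) could look like this:

# Optional: force the checkout to match upstream (discards local changes, use with care)
git fetch origin
git reset --hard origin/master    # assumes the default branch is master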

CMake Configuration

# Configure cmake
cmake .. \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=Intel10_64lp \
    -DCMAKE_C_COMPILER=icx \
    -DCMAKE_CXX_COMPILER=icpx \
    -DGGML_NATIVE=ON \
    -DGGML_USE_FLASH_ATTENTION=ON \
    -DCTX_SHIFT=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-ljemalloc"

Key configuration options:

  • GGML_BLAS=ON: Enables BLAS (Basic Linear Algebra Subprograms)
  • GGML_BLAS_VENDOR=Intel10_64lp: Specifies Intel MKL as the BLAS provider
  • CMAKE_C_COMPILER=icx: Uses Intel C compiler
  • CMAKE_CXX_COMPILER=icpx: Uses Intel C++ compiler
  • GGML_NATIVE=ON: Enables native CPU optimizations
  • GGML_USE_FLASH_ATTENTION=ON: Enables Flash Attention for better memory efficiency
  • CTX_SHIFT=ON: Enables context shifting for dynamic memory management
  • CMAKE_EXE_LINKER_FLAGS="-ljemalloc": Links against jemalloc for improved memory allocation (jemalloc must be installed, as noted below)
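
Two quick checks related to this configuration: the jemalloc linker flag only works if the library is present on the system, and it is worth confirming after configuring (from inside the build directory) that a BLAS backend was actually picked up. The package name below is the Debian/Ubuntu one and may differ on other distributions:

# jemalloc must be available at link time (Debian/Ubuntu package name; may differ elsewhere)
sudo apt install libjemalloc-dev

# Run from the build directory after configuring: confirm the BLAS settings were applied
grep -i blas CMakeCache.txt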

Build Process

echo "[INFO] Building..."
# Build with all threads minus 2 (to keep laptop responsive)
NUM_THREADS=$(($(nproc) - 2))
cmake --build . --config Release -j$NUM_THREADS
echo "[INFO] Build complete!"

The build process:

  1. Calculates available CPU threads minus 2 to maintain system responsiveness (see the caveat after this list)
  2. Builds in Release mode with parallel compilation
  3. Provides clear status messages
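
One caveat with the thread calculation: on a machine with only one or two cores, nproc minus 2 would be zero or negative, which -j does not accept. A small guard (an optional tweak to the script) avoids that:

# Keep at least one build thread even on machines with one or two cores
NUM_THREADS=$(($(nproc) - 2))
if [ "$NUM_THREADS" -lt 1 ]; then
    NUM_THREADS=1
fi
cmake --build . --config Release -j"$NUM_THREADS"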

Installation

# Remove old installation
echo "[INFO] Removing old files"
rm -rf "$INSTALL_TARGET_DIR"
mkdir -p "$INSTALL_TARGET_DIR"

# Install
echo "[INFO] Installing..."
cp -a ./bin "$INSTALL_TARGET_DIR/bin"

echo "[INFO] Install complete! Llama.cpp ready to be used."

The installation:

  1. Removes any previous installation
  2. Creates the installation directory
  3. Copies the built binaries to the target location
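
A quick way to confirm the installation worked is to list the installed binaries and run one of them (binary names can vary between llama.cpp versions):

ls "$HOME/.local/share/llama.cpp/bin"                    # should list llama-cli, llama-server, etc.
"$HOME/.local/share/llama.cpp/bin/llama-cli" --version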

Prerequisites

Before running this script, ensure you have:

  1. Intel OneAPI Base Toolkit installed
  2. llama.cpp source code cloned to the specified directory (see the example after this list)
  3. Git for repository updates
  4. CMake and build tools
  5. Intel processor (obviously!)
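
If the source tree is not yet in place, cloning it to the path the script expects is a one-time step (the destination below is just the example path used in this article):

git clone https://github.com/ggml-org/llama.cpp /home/jp/PERURI-ID/llama.cpp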

Usage Instructions

  1. Make the script executable:
    chmod +x build-llama-intel.sh
    
  2. Run the script:
    ./build-llama-intel.sh
    
  3. Add to PATH (optional):
    export PATH="$HOME/.local/share/llama.cpp/bin:$PATH"
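
Once the binaries are on your PATH, a first smoke test looks something like this (the model path is a placeholder; any local GGUF model file works):

llama-cli -m ~/models/your-model.Q4_K_M.gguf -p "Hello, world" -n 64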
    

Performance Benefits

With these optimizations, you can expect:

  • Up to 2-3x faster inference compared to a generic build (actual gains vary by model, quantization, and CPU)
  • Better memory utilization with Flash Attention
  • Improved context handling with dynamic shifting
  • Native CPU optimizations for your specific Intel processor

Troubleshooting

Common Issues:

  1. MKLROOT not found: Ensure Intel OneAPI is properly installed
  2. Compiler errors: Verify icx/icpx are in your PATH
  3. Build failures: Check that all dependencies are installed

Performance Tuning:

  • Adjust NUM_THREADS based on your system’s thermal limits
  • Experiment with different BLAS vendors if Intel MKL isn’t optimal
  • Monitor CPU temperatures during intensive inference tasks
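
For temperature monitoring during long inference runs, one common option is the lm-sensors package (assuming it is installed and configured):

watch -n 2 sensors    # refresh CPU temperature readings every two seconds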

Conclusion

This build script transforms llama.cpp from a basic CPU inference tool into a highly optimized powerhouse for Intel hardware. The combination of Intel’s mathematical libraries, advanced compilers, and memory optimizations makes it possible to run sophisticated AI models efficiently on consumer-grade Intel processors.

Whether you’re building a local AI assistant, running research models, or just experimenting with LLMs, this optimized build provides the performance edge needed for smooth, responsive AI interactions.

The script demonstrates the importance of hardware-specific optimizations in AI inference and serves as a template for building other performance-critical applications on Intel platforms.