LLM Performance on Mac: Native vs Docker Ollama Benchmark
LLMs run slowly in Docker on macOS
I have started generating a daily RSS digest using Matcha and summarizing it with Ollama; you can find more details in my previous article. However, when there are many articles to summarize, the process becomes quite slow: once I had to wait almost an hour for the daily digest. This made me think about how to make it faster. My first guess was that the GPU was not being used at all, and I found evidence for this in the Ollama GitHub repository:
When you run Ollama as a native Mac application on M1 (or newer) hardware, we run the LLM on the GPU.
Docker Desktop on Mac does NOT expose the Apple GPU to the container runtime; it only exposes an ARM CPU (or a virtual x86 CPU via Rosetta emulation), so when you run Ollama inside that container, it is running purely on CPU, not utilizing your GPU hardware.
On PCs, NVIDIA and AMD have support for GPU pass-through into containers, so it is possible for Ollama in a container to access the GPU, but this is not possible on Apple hardware.
So let’s install Ollama natively on macOS and run a benchmark to compare the results.
Run an LLM natively on macOS
- Install Ollama:
brew install ollama
- Start Ollama:
ollama serve
- Make sure Ollama is up and running:
curl http://localhost:11434
- Pull a model (e.g., granite3.3 by IBM):
ollama pull granite3.3:latest
- Run a model:
ollama run granite3.3:latest
- Adding the --verbose flag gives you helpful information about performance:
ollama run granite3.3:latest --verbose
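Ollama also exposes an HTTP API on localhost:11434 (the same port the curl check above hits), which is handy if you want to script summarization, for example for the RSS digest, instead of typing into the interactive prompt. Here is a minimal sketch using the /api/generate endpoint; the prompt is just a placeholder:
# Send a single non-streaming generation request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "granite3.3:latest",
  "prompt": "Summarize this article in three sentences: <article text here>",
  "stream": false
}'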

Run a benchmark
We live in remarkable times. Whenever I face a challenge, I first search online, as there is a high chance that someone else has already encountered and solved a similar problem. I was curious if there was a benchmarking tool for LLMs that I could use, and I discovered llm.aidatatools.com.
Install:
pip install llm-benchmark
Setup:
Total memory size : 32.00 GB
cpu_info: Apple M1 Pro
gpu_info: Apple M1 Pro
os_version: macOS 14.6 (23G80)
ollama_version: 0.9.1
It can automatically pick and pull models based on the RAM available on your machine, but you can also create a config file with the models you’d like to test:
file_name: "custombenchmarkmodels.yml"
version: 2.0.custom
models:
- model: "granite3.3:8b"
- model: "phi4:14b"
- model: "deepseek-r1:14b"
Now you can run the benchmark with the models of your choice:
llm_benchmark run --custombenchmark=path/to/custombenchmarkmodels.yml
Here is a sample of what the benchmark output looks like:
model_name = granite3.3:8b
prompt = Summarize the key differences between classical and operant conditioning in psychology.
eval rate: 24.49 tokens/s
prompt = Translate the following English paragraph into Chinese and elaborate more -> Artificial intelligence is transforming various industries by enhancing efficiency and enabling new capabilities.
eval rate: 24.93 tokens/s
prompt = What are the main causes of the American Civil War?
eval rate: 24.00 tokens/s
prompt = How does photosynthesis contribute to the carbon cycle?
eval rate: 24.14 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game.
eval rate: 23.93 tokens/s
--------------------
Average of eval rate: 24.298 tokens/s
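The eval rate here is simply the number of generated tokens divided by the generation time. Ollama returns both values itself, as eval_count and eval_duration (in nanoseconds), in its non-streaming API response, so you can spot-check a single prompt without the benchmark tool. A rough sketch, assuming jq is installed (brew install jq):
# Ask for one completion and compute tokens/s = eval_count / eval_duration * 1e9
curl -s http://localhost:11434/api/generate -d '{
  "model": "granite3.3:8b",
  "prompt": "How does photosynthesis contribute to the carbon cycle?",
  "stream": false
}' | jq '.eval_count / .eval_duration * 1e9'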
During the native run, GPU utilization was consistently close to 100%, confirming that Ollama was able to leverage the Apple M1 Pro’s GPU for accelerated inference.
The same benchmark was performed with Ollama running inside a Docker container, where GPU usage was not detected. As expected, the evaluation rates were significantly lower, and the models relied solely on CPU resources, resulting in much slower inference times.
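For reference, running Ollama in a container typically looks like the commands below; this is the standard setup from the official Docker image rather than a transcript of my exact test, and the volume and container names are just the usual defaults:
# Start the official Ollama image (CPU only on Apple Silicon)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Run a model inside the container with performance stats
docker exec -it ollama ollama run granite3.3:8b --verbose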
Benchmark results
The results of benchmarking Ollama running natively and in a Docker container:
Model | Avg. Eval Rate (tokens/s) | GPU Utilization | Notes |
---|---|---|---|
granite3.3:8b | 24.3 | ~100% | Native, Apple M1 Pro |
phi4:14b | 14.5 | ~100% | Native, Apple M1 Pro |
deepseek-r1:14b | 13.7 | ~100% | Native, Apple M1 Pro |
granite3.3:8b | 4.3 | 0% | Docker, Apple M1 Pro |
phi4:14b | 2.3 | 0% | Docker, Apple M1 Pro |
deepseek-r1:14b | 2.4 | 0% | Docker, Apple M1 Pro |
The benchmark confirms the initial suspicion: running Ollama natively on a Mac with Apple Silicon delivers roughly 5–6 times faster LLM inference than running it in Docker, because the native runs fully utilize the GPU while the Docker runs are limited to the CPU and are therefore significantly slower.