Run Local Deep Research with Ollama on Mac (2026 Guide)

Published 03 Apr 2025 • Updated 31 May 2026 • 12 min read

Last updated April 2026 — refreshed for current model/tool versions.

Running a private, iterative web research assistant entirely on your Mac is practical today: Ollama v0.22.0 (released April 28, 2026) handles local inference, and two mature open-source projects — langchain-ai/local-deep-researcher and LearningCircuit/local-deep-research v1.6.6 — wire it all together into a fully automated research loop. This guide covers both tools, current model recommendations, Apple Silicon performance benchmarks, and every pitfall you are likely to hit.

What changed in 2026: read this first if you set this up in 2025

Ollama hit v0.22.0 (April 28, 2026). The MLX backend preview (0.19, March 2026) nearly doubled decode speed on Apple Silicon, from ~58 to ~112 tok/s on Qwen3.5-35B-A3B. The current stable release also adds Gemma 4 Metal shader fixes and Llama 4 Scout/Maverick support.
Model landscape overhauled. Replace Llama 3.2 with Llama 4 Scout (17B MoE), Gemma 3 with Gemma 4 (released April 2, 2026), and Qwen 2.5 with Qwen 3.5 or Qwen 3.6-35B-A3B. DeepSeek V3 → V4 is available but not yet in Ollama's library as of late April 2026.
LearningCircuit/local-deep-research v1.6.6 (April 29, 2026) is the actively maintained, full-featured fork: web UI, Docker Compose, REST API, MCP server for Claude Desktop, AES-256 encrypted databases, and 10+ search sources (arXiv, PubMed, Semantic Scholar, SearXNG, Tavily, and others).
langchain-ai/local-deep-researcher is the original minimal prototype. Still useful for LangGraph Studio visualization; now supports GPT-OSS tool calling. Default model in its README changed from llama3.2 to deepseek-r1:8b.
LangChain integration replaced by LangGraph. The original post recommended pip install langchain; the correct dependency is now langgraph-cli[inmem] for the LangChain variant.
Python requirement bumped. local-deep-research now requires Python ≥ 3.12 (was 3.10 in 2025 builds).

TL;DR — Which Tool Should You Use?

Scenario	Recommended tool	Install command
Quick start, LangGraph Studio visualization	langchain-ai/local-deep-researcher	`uvx --from "langgraph-cli[inmem]" langgraph dev`
Full-featured web UI + multiple search sources	LearningCircuit/local-deep-research	`pip install local-deep-research`
Team/server deployment with Docker	LearningCircuit/local-deep-research	`docker compose up -d`
Academic research (arXiv, PubMed, Semantic Scholar)	LearningCircuit/local-deep-research v1.6+	pip or Docker
Claude Desktop integration via MCP	LearningCircuit/local-deep-research v1.6+	pip or Docker

What Is Local Deep Research?

Local Deep Research is an AI-powered web research assistant that automates iterative query generation, web scraping, summarization, and gap analysis, with all model inference running on your own hardware. Unlike cloud tools (Perplexity, ChatGPT search, Gemini Deep Research), every LLM call hits your local Ollama server. Your prompts and summaries never leave your machine; only the search queries go out to the search engines you configure.

The research loop works in five stages:

Topic input: you provide a research question.
Query generation: the local LLM generates optimized web search queries.
Data collection: results are fetched from configured search engines (DuckDuckGo, SearXNG, Tavily, arXiv, PubMed, and others).
Summarization & reflection: the LLM summarizes findings and identifies knowledge gaps.
Iteration: new queries fill the gaps; the cycle repeats for a configurable number of rounds (default: 3).

The final output is a markdown report with cited sources.
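
The five stages above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the code either project actually runs: the name run_research_loop and the three callables (generate_query, search, summarize) are hypothetical stand-ins for the local LLM and the search engine.

```python
from typing import Callable, List, Tuple


def run_research_loop(
    topic: str,
    generate_query: Callable[[str, str], str],        # (topic, summary so far) -> search query
    search: Callable[[str], List[Tuple[str, str]]],   # query -> [(url, snippet), ...]
    summarize: Callable[[str, List[str]], str],       # (summary so far, new snippets) -> summary
    max_loops: int = 3,                               # the default number of rounds cited above
) -> str:
    """Hypothetical sketch of the query -> search -> summarize -> iterate cycle."""
    summary = ""
    sources: List[str] = []
    for _ in range(max_loops):
        # Stages 2 and 5: ask the model for a query aimed at the current gaps.
        query = generate_query(topic, summary)
        # Stage 3: collect results from the configured search engine.
        results = search(query)
        for url, _snippet in results:
            if url not in sources:
                sources.append(url)
        # Stage 4: fold the new findings into the running summary.
        summary = summarize(summary, [snippet for _url, snippet in results])
    # Final output: a markdown report with the sources that were consulted.
    cited = "\n".join(f"- {url}" for url in sources)
    return f"# {topic}\n\n{summary}\n\n## Sources\n{cited}\n"
```

Passing the model and search calls in as plain functions keeps the loop itself easy to test with stubs before wiring it to a real Ollama server.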

For a broader look at running AI agents locally, see the OpenClaw + Ollama setup guide for running local AI agents — it covers the full local-agent stack including web search integration, tool calling, and persistent memory.

Prerequisites

macOS 13 Ventura or later (Apple Silicon or Intel)
16 GB unified memory minimum; 32 GB recommended for 14B+ models
Ollama v0.22.0 installed (see below)
Python ≥ 3.12 (for pip-based installs)
~10–20 GB free disk space per model

Step 1 — Install Ollama (v0.22.0)

Ollama is the model runtime that handles downloading, quantization, and serving local models via an OpenAI-compatible REST API on port 11434.

Option A — Homebrew (recommended)

brew install --cask ollama

Option B — Direct download

Go to ollama.com and download the macOS installer.
Extract the ZIP and drag Ollama.app to your Applications folder.
Launch Ollama.app — it runs as a menu bar app and starts the local server.

Verify the install

ollama --version
# Expected: ollama version is 0.22.0

curl http://localhost:11434
# Expected: Ollama is running

If localhost:11434 is unreachable, open Ollama.app from your Applications folder first — it must be running before any client can connect.
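
The same check can be scripted, which is handy before starting a long research run. The sketch below uses only the Python standard library; the function name check_ollama is an illustrative choice, and it relies on the plain-text "Ollama is running" banner shown in the curl example.

```python
import urllib.error
import urllib.request
from typing import Tuple


def check_ollama(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> Tuple[bool, str]:
    """Return (is_running, detail) for the Ollama server at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip()
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False, f"unreachable: {exc}"
    # The root endpoint answers with the plain-text banner shown above.
    return "Ollama is running" in body, body
```

Calling check_ollama() returns a (bool, str) pair, so a wrapper script can print the detail and exit early when the server is down.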

Step 2 — Choose and Pull a Model

The original post recommended Meta Llama 3.2. As of April 2026, better options exist for research workloads. The table below uses real benchmark data from LLMCheck.net (Q4_K_M quantization, standardized 256-token prompt → 512-token generation, averaged over 3 runs on M4 Pro 24 GB).

2026 Model Comparison for Research Workloads

Model	Size on disk	VRAM needed	Tok/s M4 Pro 24GB	Context window	Strengths
Gemma 4 (12B default)	~9.6 GB	12 GB	~45	128K	Best all-round on 16 GB Macs; native tool calling
Qwen 3.5 9B	~8.1 GB	10 GB	~55	128K	Fastest in class; strong reasoning; best for 8–16 GB Macs
Qwen 3.6-35B-A3B (MoE)	~22 GB	24 GB	~55	128K	Top GPQA (86.0%); only 3B active params per token; needs 32 GB Mac
Llama 4 Scout 17B (MoE)	~14 GB	16 GB	~40	10M	10 million token context window; great for long-doc research
Gemma 4 26B-A4B (MoE)	~18 GB	22 GB	~50	256K	Benchmark leader for coding; 97% of 31B quality at 8× less compute
deepseek-r1:8b	~5 GB	6 GB	~65	32K	Strong reasoning; default in langchain-ai variant; limited JSON reliability

Tok/s figures are approximate; performance varies 5–15% between inference engines. MLX backend (Ollama 0.19+ preview) typically runs 15–30% faster than the default llama.cpp backend on the same chip.
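
To reproduce a tok/s figure on your own machine, one option is to read the timing fields that Ollama returns from a non-streaming /api/generate call: eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below only does the arithmetic; the helper names are illustrative, and it is not a description of how LLMCheck.net measures its numbers.

```python
from statistics import mean
from typing import Iterable, Mapping


def tokens_per_second(response: Mapping[str, int]) -> float:
    """Decode speed from one /api/generate response (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def average_tok_s(responses: Iterable[Mapping[str, int]]) -> float:
    """Mean decode speed across several runs, for example three runs as described above."""
    return mean(tokens_per_second(r) for r in responses)
```

Averaging several runs smooths out one-off effects such as the model still being loaded into memory on the first request.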

Pull your chosen model

# Good default for 16 GB Macs
ollama pull gemma4

# Best for 32 GB Macs with research + reasoning tasks
ollama pull qwen3.6:35b-a3b

# Maximum context window (useful for long document research)
ollama pull llama4:scout

# Lightest recommended model (tight on 8 GB Macs; see the memory column above)
ollama pull qwen3.5:9b

# Confirm the model was downloaded
ollama list

Context window note: on 16 GB machines, set the context length (num_ctx in Ollama) to 32,768 tokens. On 32 GB+, you can safely run 131,072 (128K). Larger windows increase memory pressure and slow generation.
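
As a rough sketch of how that rule could be applied per request, the helper below maps unified memory to a context length and builds a JSON body for Ollama's /api/generate endpoint, which accepts num_ctx under options. The function names are hypothetical, and the thresholds simply restate the note above.

```python
import json
from typing import Optional


def recommended_num_ctx(memory_gb: int) -> Optional[int]:
    """Context length suggested by this guide for a given amount of unified memory."""
    if memory_gb >= 32:
        return 131_072   # 128K on 32 GB and above
    if memory_gb >= 16:
        return 32_768    # 32K on 16 GB machines
    return None          # the guide gives no figure below 16 GB


def generate_payload(model: str, prompt: str, memory_gb: int) -> str:
    """JSON body for POST /api/generate with an explicit per-request num_ctx."""
    body = {"model": model, "prompt": prompt, "stream": False}
    num_ctx = recommended_num_ctx(memory_gb)
    if num_ctx is not None:
        body["options"] = {"num_ctx": num_ctx}
    return json.dumps(body)
```

Setting the value per request avoids editing the model itself, at the cost of having to pass it on every call.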

Tool A — langchain-ai/local-deep-researcher (Minimal, LangGraph Studio)

This is the original minimal prototype from the LangChain team. It requires uv (the Rust-based Python package manager) and opens a LangGraph Studio interface in your browser.

Install and run

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/langchain-ai/local-deep-researcher.git
cd local-deep-researcher

# Copy environment template
cp .env.example .env

# Edit .env and set your model name
# LOCAL_LLM=gemma4   (or qwen3.5:9b, llama4:scout, etc.)

# Launch the LangGraph dev server (Python 3.11 required for this variant)
uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.11 langgraph dev

The dev server listens on http://127.0.0.1:2024 and opens the LangGraph Studio UI in your browser, connected to that local server. Type a research question in the input field, hit run, and watch the iterative loop execute in real time.

Key configuration options (.env)

LLM_PROVIDER=ollama
LOCAL_LLM=gemma4
OLLAMA_BASE_URL=http://localhost:11434
MAX_WEB_RESEARCH_LOOPS=3        # increase for deeper research; each loop ~30-90s
SEARCH_API=duckduckgo           # or: searxng, tavily, perplexity

Variable names follow the project's .env.example; check that file in your checkout if a setting appears to be ignored.

Important: gpt-oss models in Ollama do not support JSON mode. If you use a gpt-oss variant, set USE_TOOL_CALLING=true in your .env. DeepSeek R1 at 7B and 1.5B also has difficulty producing the required JSON output — Gemma 4 or Qwen 3.5 are more reliable choices for this tool.
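
For a quick sanity check of the .env settings above, the sketch below parses KEY=VALUE lines and turns MAX_WEB_RESEARCH_LOOPS into a rough wall-clock range using the 30-90 s per loop figure. It is an illustrative helper with hypothetical names (parse_env, estimate_runtime_seconds); the project loads its configuration through its own code.

```python
from typing import Dict, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines, full-line and trailing comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        # Drop a trailing "  # comment" (values containing " #" are not supported here).
        line = raw.split(" #", 1)[0].strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip("\"'")
    return values


def estimate_runtime_seconds(env: Dict[str, str]) -> Tuple[int, int]:
    """(low, high) estimate in seconds, assuming 30-90 s per research loop."""
    loops = int(env.get("MAX_WEB_RESEARCH_LOOPS", "3"))
    return 30 * loops, 90 * loops
```

With the default of 3 loops the estimate is 90 to 270 seconds, which gives a baseline before raising the loop count.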

Tool B — LearningCircuit/local-deep-research v1.6.6 (Full-Featured)

The LearningCircuit fork is the production-ready option. It ships with a web UI, REST API, MCP server, per-user encrypted databases (AES-256 via SQLCipher), and support for 10+ search sources including arXiv, PubMed, Semantic Scholar, Wayback Machine, and private document collections.

Performance benchmarks (SimpleQA accuracy)

Configuration	SimpleQA accuracy
GPT-4.1-mini + SearXNG (focused-iteration strategy)	90–95%
GPT-4.1-mini + Tavily	90–95%
Gemini-2.0-flash-001 + SearXNG	~82%
Local Ollama model + DuckDuckGo	70–80% (varies by model)

Note: cloud model benchmarks are shown for comparison; cloud models require API keys and incur per-token costs. Local Ollama configurations run inference fully offline after the initial model download, although web search itself still needs a connection.

Install via Docker Compose (recommended)

git clone https://github.com/LearningCircuit/local-deep-research.git
cd local-deep-research
docker compose up -d

The Docker Compose file includes all three services: the research app, SearXNG (self-hosted search), and the Ollama service. Once running, the web UI is at http://localhost:5000.

Network note: all three containers must be on the same Docker network. If you run Ollama separately (e.g., as a desktop app), set OLLAMA_BASE_URL=http://host.docker.internal:11434 in the compose file.
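
The base-URL choice in the network note can be expressed as a small helper. This is a sketch under assumptions: it reads the OLLAMA_BASE_URL variable named above, treats the presence of /.dockerenv as a hint that the code runs inside a container, and assumes Ollama runs on the host when nothing is set. The function name is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_ollama_base_url(
    env: Optional[Mapping[str, str]] = None,
    in_container: Optional[bool] = None,
) -> str:
    """Pick the Ollama URL: explicit setting first, then a container-aware default."""
    env = os.environ if env is None else env
    explicit = env.get("OLLAMA_BASE_URL", "").strip()
    if explicit:
        return explicit.rstrip("/")
    if in_container is None:
        # Docker creates /.dockerenv inside containers; use it as a heuristic only.
        in_container = os.path.exists("/.dockerenv")
    # Inside a container, "localhost" is the container itself, not the Mac.
    host = "host.docker.internal" if in_container else "localhost"
    return f"http://{host}:11434"
```

If Ollama runs as a Compose service instead of the desktop app, set OLLAMA_BASE_URL explicitly to that service's address so the fallback is never used.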

Install via pip (developers)

# Requires Python ≥ 3.12
pip install local-deep-research

# Launch the web UI
local-deep-research

Configure to use Ollama

In the web UI, navigate to Settings → LLM Provider → select Ollama. Set:

Ollama base URL: http://localhost:11434
Model: gemma4 (or your pulled model name)
Context window: 32768 for 16 GB Macs; 131072 for 32 GB+

For search sources, SearXNG (self-hosted) gives the best privacy guarantee. If you need academic sources, enable arXiv, PubMed, and Semantic Scholar from the search engine settings panel.

Claude Desktop / Claude Code integration (MCP)

v1.6+ ships an MCP server, letting you call local deep research from within Claude Desktop or Claude Code:

pip install local-deep-research
local-deep-research --mcp-server

Add the MCP server endpoint (http://localhost:5000/mcp) to your Claude Desktop settings under MCP servers. You can then use /research [topic] commands directly from the Claude interface.

Performance & Apple Silicon Benchmarks

These figures are sourced from LLMCheck.net (April 2026), using Q4_K_M quantization on standardized prompts.

Chip	Unified memory	Gemma 4 12B tok/s	Qwen 3.5 9B tok/s	Llama 4 Scout 17B tok/s
M1 (base)	16 GB	~22	~35	~18
M2 Pro	16 GB	~30	~45	~25
M3 Pro/Max	18–36 GB	~38	~55	~32
M4 Pro	24 GB	~45	~60	~40
M4 Max	36–128 GB	~65	~85	~55

MLX backend (Ollama 0.19+ preview): With Qwen3.5-35B-A3B, decode speed nearly doubled from ~58 tok/s to ~112 tok/s after enabling the MLX runner. The preview requires more than 32 GB of unified memory and is currently limited to select models. Enable it with:

OLLAMA_RUNNER=mlx ollama run qwen3.6:35b-a3b

Memory bandwidth is the primary bottleneck. In the table above, the M4 Max (~546 GB/s) generates tokens roughly 2.4–3× faster than a base M1 (~68 GB/s) on the same model. For teams running research tasks at scale, Codersera maintains a vetted pool of remote AI engineers who can help architect and deploy local AI infrastructure that matches your hardware constraints.
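
The comparison between the two chips can be worked through directly from the numbers in this section. The snippet below copies the tok/s values from the benchmark table and the bandwidth figures from the paragraph above; nothing else is assumed.

```python
# Figures copied from the benchmark table (tok/s) and the bandwidth numbers in the text.
M1 = {"Gemma 4 12B": 22, "Qwen 3.5 9B": 35, "Llama 4 Scout 17B": 18}
M4_MAX = {"Gemma 4 12B": 65, "Qwen 3.5 9B": 85, "Llama 4 Scout 17B": 55}
BANDWIDTH_GBPS = {"M1": 68, "M4 Max": 546}


def speedups(slow: dict, fast: dict) -> dict:
    """Per-model ratio of fast-chip to slow-chip decode speed, rounded to 2 places."""
    return {model: round(fast[model] / slow[model], 2) for model in slow}


token_ratio = speedups(M1, M4_MAX)
bandwidth_ratio = round(BANDWIDTH_GBPS["M4 Max"] / BANDWIDTH_GBPS["M1"], 2)
print(token_ratio)       # {'Gemma 4 12B': 2.95, 'Qwen 3.5 9B': 2.43, 'Llama 4 Scout 17B': 3.06}
print(bandwidth_ratio)   # 8.03
```

On these figures the token-rate gain is about 2.4x to 3.1x while the bandwidth ratio is about 8x, so the bandwidth number is better read as an upper bound than as a prediction of decode speed.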

How to Choose Your Setup

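8 GB Mac: Qwen 3.5 9B is the only model in this guide worth trying, but at ~8.1 GB on disk (about 10 GB of memory per the comparison table) it will not fit comfortably in 8 GB, so expect swapping and slow output. Use langchain-ai/local-deep-researcher with DuckDuckGo. Do not run Docker Compose (too much memory overhead).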
16 GB Mac: Gemma 4 12B or Qwen 3.5 9B. Both tools work. SearXNG via Docker requires ~2 GB RAM; keep that in mind.
32 GB Mac: Qwen 3.6-35B-A3B (MoE) with LearningCircuit/local-deep-research is the sweet spot. The MLX backend preview gives the best speed, but it requires more than 32 GB of unified memory, so it only applies to larger configurations.
64 GB+ Mac (M4 Max, M3 Ultra, M4 Ultra): Llama 4 Maverick (128K context) or Qwen 3.6-35B-A3B with 128K context window. Run Docker Compose with all three services.
Need academic sources (arXiv, PubMed): use LearningCircuit v1.6+; the LangChain variant does not support those sources.
Want LangGraph visualization: use langchain-ai/local-deep-researcher with LangGraph Studio.
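
The decision list above can be condensed into a lookup function. This is a sketch with a hypothetical name (choose_setup); where the list says both tools work, it defaults to the LearningCircuit project, which is an assumption made here for the tie-break.

```python
from typing import Dict


def choose_setup(memory_gb: int, need_academic: bool = False, want_langgraph: bool = False) -> Dict[str, str]:
    """Map hardware and needs to the model/tool pairing suggested in the list above."""
    if memory_gb >= 64:
        model = "Llama 4 Maverick or Qwen 3.6-35B-A3B"
    elif memory_gb >= 32:
        model = "Qwen 3.6-35B-A3B"
    elif memory_gb >= 16:
        model = "Gemma 4 12B or Qwen 3.5 9B"
    else:
        model = "Qwen 3.5 9B"

    if need_academic:
        # arXiv and PubMed are only available in the LearningCircuit project.
        tool = "LearningCircuit/local-deep-research"
    elif want_langgraph or memory_gb < 16:
        # LangGraph Studio, or the lighter option when Docker overhead is a concern.
        tool = "langchain-ai/local-deep-researcher"
    else:
        # Assumed default when both tools would work.
        tool = "LearningCircuit/local-deep-research"
    return {"model": model, "tool": tool}
```

For example, choose_setup(16, want_langgraph=True) selects the langchain-ai variant, while choose_setup(32) selects Qwen 3.6-35B-A3B with the LearningCircuit project.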

What Was Removed and Why

LM Studio as primary alternative: LM Studio is still functional but Ollama now covers most use cases with a simpler CLI and better model management. LM Studio remains useful for its GUI model browser and for running GGUF models that Ollama's library doesn't yet index.

Llama 3.2 as the recommended model: Llama 3.2 (3B and 11B vision variants) is superseded for text research by Llama 4 Scout and Gemma 4. Llama 3.2 still works with both tools but delivers lower accuracy on research tasks at equivalent quantization levels.

pip install langchain as the install step: the langchain-ai variant now uses langgraph-cli (LangGraph, not LangChain). Installing only langchain will not launch the research loop; you will get import errors at runtime.

Common Pitfalls & Troubleshooting

Ollama not responding on port 11434

Ollama.app must be running in the menu bar before any client connects. Run curl http://localhost:11434 to confirm. If you installed the Homebrew formula (brew install ollama) instead of the cask, start the server with brew services start ollama.

JSON mode errors / KeyError: 'query'

Certain models (gpt-oss variants, DeepSeek R1 1.5B and 7B) do not reliably produce JSON-structured output required by the research loop. Switch to Gemma 4 or Qwen 3.5. For gpt-oss, set USE_TOOL_CALLING=true in the langchain-ai variant.

Context window exhaustion mid-research

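The default Ollama context is 2,048 tokens for some models, and prompts longer than the window are truncated. Raise it for the whole server with the OLLAMA_CONTEXT_LENGTH environment variable (quit the menu bar app first so the port is free):

OLLAMA_CONTEXT_LENGTH=8192 ollama serve

Or set it permanently in a Modelfile:

FROM gemma4
PARAMETER num_ctx 8192

Build the variant with ollama create gemma4-8k -f Modelfile (gemma4-8k is an example name), then use that name as the model in your research tool.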

Docker container network isolation

If the research container cannot reach Ollama, confirm all services share the same Docker network. Check with docker network inspect. If Ollama runs as a desktop app (not in Docker), use http://host.docker.internal:11434 as the base URL.

Slow performance on large models

Reduce MAX_WEB_RESEARCH_LOOPS to 2 or use a lower-bit quantization (Q3_K_M instead of Q4_K_M). On 16 GB Macs, avoid models larger than about 10 GB on disk, because the remaining memory is needed for the OS and search cache. Closing other memory-heavy apps (browsers with 30+ tabs, VS Code with large projects) frees memory for the model.

SearXNG returning empty results

Self-hosted SearXNG sometimes trips rate limits on upstream engines (especially Google). In the SearXNG settings, reduce simultaneous engine count and add a 1–2 second delay between queries. DuckDuckGo is more permissive for testing but less comprehensive for academic queries.

Use Cases

Literature review: enable arXiv and PubMed in LearningCircuit v1.6+; the tool fetches, parses, and cites papers directly.
Competitive intelligence: run iterative loops on competitor product names; set 5–7 loops for deep coverage.
Regulatory / compliance research: privacy-sensitive organizations benefit from zero-cloud LLM processing; prompts and summaries stay on local hardware, and only search queries reach the configured search engines.
Student research: combine with Llama 4 Scout's 10M token context window to ingest entire PDF corpora before the research loop begins.
Developer tooling: the REST API and MCP server in v1.6+ allow integration into CI pipelines for automated documentation research or changelog summarization.

FAQ

Is Local Deep Research free to use?

Both tools are open-source (MIT licensed). Ollama is free. You pay only for the hardware to run models locally. If you enable cloud providers (OpenAI, Anthropic, Google) inside local-deep-research for higher accuracy, those API calls are billed by the respective vendor at standard token rates.

Does it require an internet connection?

For web research (the primary use case), yes — it fetches search results from the internet. The LLM inference is entirely local. If you configure only local document search (private PDF or knowledge base mode), no internet connection is required post-setup.

Does it run on older M1 Macs?

Yes. On an M1 with 16 GB, Qwen 3.5 9B delivers ~35 tok/s — slow but usable for short research loops. Expect 3–5 minutes per complete research cycle. 8 GB M1 Macs are too constrained for models that perform well enough to be useful; the 7B models that fit tend to hallucinate search queries.

How does this compare to Perplexity or ChatGPT search?

Cloud tools are faster and more accurate out of the box (Perplexity and GPT-4.1 Search consistently outperform local 7–14B models on SimpleQA). The local setup wins on privacy, cost, and customizability: you control the search sources, can index private documents, and your research history is never logged to a third-party server.

Which model is best for research specifically?

On 16 GB Macs: Gemma 4 12B (best all-round) or Qwen 3.5 9B (fastest). On 32 GB+ Macs: Qwen 3.6-35B-A3B (top benchmark accuracy in the sub-40B weight class, 86.0% GPQA Diamond). For long-document research where context window matters most: Llama 4 Scout with its 10M token context.

langchain-ai variant vs. LearningCircuit — which should I pick?

Start with LearningCircuit/local-deep-research v1.6.6 unless you specifically need LangGraph Studio's visual debugging interface. The LearningCircuit project is more actively maintained (weekly releases in April 2026), has a web UI, and supports far more search sources. The langchain-ai variant is a simpler starting point for developers who want to read and modify the research graph code.

Is the research data stored securely?

In LearningCircuit v1.6+, each user's research history is stored in an AES-256-encrypted SQLCipher database locally. No telemetry or usage data is transmitted to any external service. In the langchain-ai variant, research state is in-memory only (not persisted between sessions).

Can I use DeepSeek V4 locally?

DeepSeek V4 weights are not yet available in Ollama's library as of late April 2026. The DeepSeek R1 distills (1.5B, 7B, 8B, 14B, 32B, and 70B) are available. For research tasks, DeepSeek R1 14B+ performs well, but the smaller variants (1.5B, 7B) have known issues producing reliable JSON output in the research loop; use Gemma 4 or Qwen 3.5 instead.
