Running Gemma 4 Locally for Your Next Feature: A Practical Decision Guide
11 Apr 2026
I built a commit message generator locally as a playground first—instead of reaching for a hosted API—so the model stays on my Mac (M4 these days), it costs nothing to run, and I never think about bills, rate limits, or someone else’s outage.
The point of running Gemma 4 locally is not winning benchmark Twitter. It is that for a bunch of boring dev workflows, local is the right default and cloud was just habit.
Quick model cheat sheet: Gemma 4 comes in small (E2B, E4B), big MoE (26B), and dense 31B. Old blog posts talk about Gemma 3 12B/27B—that is a different lineup. For “run it on a laptop and forget it,” I use gemma4:e4b in Ollama.
When local actually makes sense
Anything that runs on a dev machine and eats code: commit messages, PR blurbs, test summaries, log triage. You usually want okay output and no copy-paste of proprietary diffs to a third party.
E4B is fine for that. I threw a few hundred real diffs at it; most commit messages were good enough without edits. The ugly cases were huge diffs—same as any small model. Bigger cloud models do better there; they also cost money and see your diff.
Lazy fix for giant diffs: Tell it the main intent in one line before you paste the diff. Sounds silly, saves edits.
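A minimal sketch of that intent-first trick. buildDiffPrompt and the "Main intent:" label are names made up for this example, not something the model or Ollama expects.

```typescript
// Hypothetical helper: put a one-line statement of intent ahead of the diff.
function buildDiffPrompt(diff: string, intent?: string): string {
  const fence = '`'.repeat(3); // a markdown code fence, built without typing it literally
  const lines: string[] = [];
  const trimmed = intent?.trim();
  if (trimmed) {
    // Collapse whitespace so the intent really stays on one line.
    lines.push(`Main intent: ${trimmed.replace(/\s+/g, ' ')}`);
  }
  lines.push('Git diff:', fence, diff, fence);
  return lines.join('\n');
}
```

The result goes in as the user message. When no intent is given, the prompt is just the fenced diff.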
If the thing runs on every save or keystroke, local also wins on vibes: no network, no “why is this slow today.”
Speed and RAM without the spreadsheet
On an M4 it feels quick enough for interactive stuff. On older Apple silicon it still works; you wait a bit more. Don’t overthink it—if it feels sluggish, use a smaller context or a smaller quant, not a lecture on tensor cores.
Ollama’s default tags are typically 4-bit GGUF quants; if answers feel mushy, try a higher-precision tag (e.g. Q8) before you jump to a bigger model.
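For the smaller-context route, here is a rough sketch using Ollama's native /api/chat endpoint, which accepts a num_ctx option per request. As far as I know the OpenAI-compatible route does not take that option, which is why this goes native. The 4096 default and both function names are assumptions for illustration.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface OllamaChatRequest {
  model: string;
  messages: ChatMessage[];
  stream: boolean;
  options: { num_ctx: number };
}

// Build the JSON body for Ollama's native chat endpoint with a capped context window.
function buildChatRequest(prompt: string, numCtx = 4096): OllamaChatRequest {
  return {
    model: 'gemma4:e4b',
    messages: [{ role: 'user', content: prompt }],
    stream: false, // ask for one JSON response instead of a stream
    options: { num_ctx: numCtx },
  };
}

// Send the request to a local Ollama server and return the reply text.
async function chatWithSmallContext(prompt: string, numCtx?: number): Promise<string> {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildChatRequest(prompt, numCtx)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { message?: { content?: string } };
  return data.message?.content ?? '';
}
```

A smaller window means less memory and less prompt to chew through, at the cost of truncating long diffs, so pick the number based on what you actually send.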
RAM in one breath: E4B fits a normal Mac; 26B/31B want a lot of unified memory and are for when you actually need the quality. MoE still has to load the whole checkpoint, so don’t assume “small active params” means “tiny RAM.” Details live in the official Gemma 4 docs if you care.
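If you do want one number, the back-of-the-envelope version is total parameters times bits per weight, divided by eight. A minimal sketch: estimateWeightGB is a made-up helper, it assumes the 26B and 31B labels are total parameter counts, and it ignores KV cache and runtime overhead, so treat the result as a floor rather than a spec.

```typescript
// Rough size of the weights alone, in decimal gigabytes.
function estimateWeightGB(totalParamsBillions: number, bitsPerWeight: number): number {
  // params * bits / 8 gives bytes; "billions" of params and "giga" bytes cancel out.
  return (totalParamsBillions * bitsPerWeight) / 8;
}

// Nominal 4-bit and 8-bit estimates for the two large variants.
const rows = [26, 31].map((paramsB) => ({
  paramsB,
  q4: estimateWeightGB(paramsB, 4),
  q8: estimateWeightGB(paramsB, 8),
}));
```

For the MoE model the input is the total parameter count, not the active one, because every expert has to be resident. Real quant formats also spend a bit more than their nominal bits per weight, so actual files run larger than this.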
Data residency and the privacy argument
Lots of places can’t ship source to random APIs without paperwork. Local means the diff never leaves the box. That is the whole pitch.
Wiring up the function calling interface
Ollama already ships Gemma 4 tags (gemma4:e4b, etc.), speaks OpenAI-ish HTTP on localhost, and does tools. Pull and go—no Modelfile unless you are doing something weird.
ollama pull gemma4:e4b
Point the OpenAI Node SDK at http://localhost:11434/v1 and swap the model name. Current openai on npm is v6.x; same idea as always.
One tool and one function call: usually fine. A dozen tools and a plan: use a bigger model or stop pretending. If it answers in plain text instead of a tool call, tighten the system prompt and retry; worst case, parse the text.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required by the SDK, ignored by Ollama
});

const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: 'function',
    function: {
      name: 'generate_commit_message',
      description: 'Generate a conventional commit message from a git diff',
      parameters: {
        type: 'object',
        properties: {
          type: {
            type: 'string',
            enum: ['feat', 'fix', 'refactor', 'chore', 'docs', 'test'],
          },
          scope: { type: 'string' },
          subject: { type: 'string', maxLength: 72 },
          body: { type: 'string' },
        },
        required: ['type', 'subject'],
      },
    },
  },
];

async function generateCommitMessage(diff: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: 'gemma4:e4b',
    messages: [
      {
        role: 'system',
        content:
          'You are a commit message generator. Analyse the provided git diff and call the generate_commit_message function with an appropriate conventional commit.',
      },
      { role: 'user', content: `Git diff:\n\`\`\`\n${diff}\n\`\`\`` },
    ],
    tools,
    // Ollama's OpenAI-compatible endpoint may ignore tool_choice; the system prompt does the steering.
    tool_choice: { type: 'function', function: { name: 'generate_commit_message' } },
  });

  const message = response.choices[0].message;
  const toolCall = message.tool_calls?.[0];
  // Narrow to a function tool call; recent SDK versions type tool calls as a union.
  if (!toolCall || toolCall.type !== 'function') {
    const text = message.content;
    throw new Error(
      text ? `No tool call; model returned text: ${text.slice(0, 200)}` : 'No tool call in response',
    );
  }

  // Arguments arrive as a JSON string; a small model can emit invalid JSON, so this may throw.
  const args = JSON.parse(toolCall.function.arguments);
  const scope = args.scope ? `(${args.scope})` : '';
  const body = args.body ? `\n\n${args.body}` : '';
  return `${args.type}${scope}: ${args.subject}${body}`;
}
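
For the "worst case, parse the text" path mentioned earlier, something like the following could work as a last resort. It is a sketch, not what my generator ships with: parseCommitText and ParsedCommit are hypothetical names, and it only recognises a plain type(scope): subject header using the same type list as the tool schema.

```typescript
interface ParsedCommit {
  type: string;
  scope?: string;
  subject: string;
  body?: string;
}

const COMMIT_TYPES = ['feat', 'fix', 'refactor', 'chore', 'docs', 'test'];

// Find the first line that looks like a conventional commit header and
// treat everything after it as the body. Returns null if no header is found.
function parseCommitText(text: string): ParsedCommit | null {
  const header = new RegExp(`^(${COMMIT_TYPES.join('|')})(?:\\(([^)]+)\\))?:\\s+(.+)$`);
  const lines = text.split(/\r?\n/).map((line) => line.trim());
  for (let i = 0; i < lines.length; i++) {
    const match = header.exec(lines[i]);
    if (!match) continue;
    const body = lines.slice(i + 1).join('\n').trim();
    return {
      type: match[1],
      scope: match[2] || undefined,
      subject: match[3],
      body: body || undefined,
    };
  }
  return null;
}
```

It skips any chatty preamble before the header, which is the usual shape of a small model's plain-text answer. If it returns null, retrying with a tighter prompt is probably a better move than guessing.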
Before and after: commit message generation in practice
Before: Haiku over the network. Cheap enough to ignore on paper, still annoying: API keys on every machine, a hard dependency on the network, and “is this client okay with diffs in the cloud.”
After: gemma4:e4b in Ollama. Slightly more manual cleanup on nasty diffs, zero marginal cost, no keys, no network drama. Good trade for us; maybe not for you.
Where it does not make sense
Hard reasoning, big refactors, “why is this flaky”—use something serious or you will get confident nonsense. Smaller quants make that worse; bump quality before you blame the model.
Non-dev end users will not install Ollama. For “runs in the browser” demos, WebGPU stacks (Transformers.js, MLC LLM) exist; you still pay the download.
Getting started
Install Ollama, run ollama pull gemma4:e4b, and point your SDK at http://localhost:11434/v1. The code change is tiny; the only real question is whether local actually fixes your problem or you just want a new toy.
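A quick sanity check before wiring anything else: ask the local server which models it has. This sketch uses Ollama's /api/tags endpoint; hasModel and checkOllama are names invented here, and the response type only covers the one field the check needs.

```typescript
interface TagsResponse {
  models?: { name: string }[];
}

// True if the list of local models contains the exact tag we want.
function hasModel(tags: TagsResponse, wanted: string): boolean {
  return (tags.models ?? []).some((m) => m.name === wanted);
}

// Returns false when the server is down, errors out, or the model was never pulled.
async function checkOllama(wanted = 'gemma4:e4b'): Promise<boolean> {
  try {
    const res = await fetch('http://localhost:11434/api/tags');
    if (!res.ok) return false;
    return hasModel((await res.json()) as TagsResponse, wanted);
  } catch {
    return false; // server not running or not reachable
  }
}
```

Calling checkOllama() at startup turns a confusing connection error later into a clear "start Ollama and pull the model" message up front.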
For commit messages and similar: it fixed mine. Start there if the shoe fits.