We recently released the Transformers.js demo browser extension powered by Gemma 4 E2B to assist users with web navigation.
During the build, I collected some practical observations about the Manifest V3 runtime, model loading, and messaging that are worth sharing.
Who is this for
This guide is for developers who want to run local AI functions in Chrome extensions using Transformers.js under Manifest V3 constraints.
The end result is the same architecture used in this project, including a background service worker to host the model, a chat UI in the side panel, and content scripts for page-level actions.
What we build
This guide recreates the core architecture of the Transformers.js Gemma 4 Browser Assistant using publicly available extensions as references and the open source codebase as an implementation map.
1) Chrome Extension Architecture (MV3)
Before we get into the main topic, let me give you a quick overview. We won’t go into detail about React UI layers or Vite build configurations. The focus here is on high-level architectural decisions: what each Chrome runtime does and how those parts are coordinated.
If you’re new to Manifest V3, start by reading this short overview: What is Manifest V3?
1.1 Runtime context and entry points
In MV3, the architecture starts at public/manifest.json. This project defines three entry points: a background service worker (the model host and coordinator), a side panel page (the chat UI), and a content script (injected into web pages for DOM access).
The background service worker also handles chrome.action.onClicked to open the side panel for the active tab. A related entry point worth knowing: popups can be defined with action.default_popup and suit quick actions. This project uses a side panel for persistent chat, but the orchestration pattern is the same.
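To make the wiring concrete, a minimal manifest for these entry points might look like the following sketch (values are illustrative and the file names are taken from the build output discussed later; the project's actual public/manifest.json differs in detail):

```json
{
  "manifest_version": 3,
  "name": "Browser Assistant (sketch)",
  "version": "0.1.0",
  "background": { "service_worker": "background.js", "type": "module" },
  "side_panel": { "default_path": "sidebar.html" },
  "action": { "default_title": "Open assistant" },
  "content_scripts": [
    { "matches": ["http://*/*", "https://*/*"], "js": ["content.js"] }
  ],
  "permissions": ["sidePanel", "storage", "scripting", "tabs"],
  "host_permissions": ["http://*/*", "https://*/*"]
}
```

Each of the three entry points maps to one artifact here, which is exactly the contract the build step has to satisfy.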
1.2 What is done and where?
A key design decision is to keep most of the orchestration in the background and keep the UI and page logic thin.
The background (src/background/background.ts) is the control plane: it hosts shared services such as the agent lifecycle, model initialization, tool execution, and feature extraction. The side panel (src/sidebar/*) is the interaction layer: chat input/output, streaming updates, and setup controls. The content script (src/content/content.ts) is the page bridge: DOM extraction and highlighting actions.
One practical consequence of this split is that conversation history lives in the background (Agent.chatMessages). The UI sends events such as AGENT_GENERATE_TEXT; the background appends messages, performs inference, and then sends MESSAGES_UPDATE back to the side panel.
This split avoids duplicate loading of models, keeps the UI responsive, and respects Chrome’s security boundaries regarding DOM access.
1.3 Messaging contract
Once the runtimes are separated, messaging becomes the backbone. In this project, all message types are declared as enums in src/shared/types.ts.
Side panel -> Background (BackgroundTasks): CHECK_MODELS, INITIALIZE_MODELS, AGENT_INITIALIZE, AGENT_GENERATE_TEXT, AGENT_GET_MESSAGES, AGENT_CLEAR, EXTRACT_FEATURES
Background -> Side panel (BackgroundMessages): DOWNLOAD_PROGRESS, MESSAGES_UPDATE
Background -> Content (ContentTasks): EXTRACT_PAGE_DATA, HIGHLIGHT_ELEMENTS, CLEAR_HIGHLIGHTS
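As a sketch of what such a contract can look like, using the message names listed above (the exact declarations in src/shared/types.ts may differ):

```typescript
// Message names shared by all runtimes. Declaring them as string enums gives
// compile-time safety and keeps payloads readable in DevTools.
export enum BackgroundTasks {
  CHECK_MODELS = "CHECK_MODELS",
  INITIALIZE_MODELS = "INITIALIZE_MODELS",
  AGENT_INITIALIZE = "AGENT_INITIALIZE",
  AGENT_GENERATE_TEXT = "AGENT_GENERATE_TEXT",
  AGENT_GET_MESSAGES = "AGENT_GET_MESSAGES",
  AGENT_CLEAR = "AGENT_CLEAR",
  EXTRACT_FEATURES = "EXTRACT_FEATURES",
}

export enum BackgroundMessages {
  DOWNLOAD_PROGRESS = "DOWNLOAD_PROGRESS",
  MESSAGES_UPDATE = "MESSAGES_UPDATE",
}

export enum ContentTasks {
  EXTRACT_PAGE_DATA = "EXTRACT_PAGE_DATA",
  HIGHLIGHT_ELEMENTS = "HIGHLIGHT_ELEMENTS",
  CLEAR_HIGHLIGHTS = "CLEAR_HIGHLIGHTS",
}

// A common envelope keeps dispatch uniform across runtimes.
export interface ExtensionMessage<T = unknown> {
  type: BackgroundTasks | BackgroundMessages | ContentTasks;
  payload?: T;
}
```

A shared envelope type like ExtensionMessage is a hypothetical addition here, but some uniform shape is what makes a single background dispatcher practical.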
The orchestration rules are simple: the background is the single coordinator, while the side panel and content scripts are specialized workers that request actions and render results.
Typical request flow:
The side panel sends AGENT_GENERATE_TEXT. The background appends it to Agent.chatMessages and runs the model/tool step. The background then emits MESSAGES_UPDATE. The side panel re-renders from the updated message list.
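This request flow can be sketched as a small background dispatcher. The Agent class and handler below are simplified assumptions, not the project's actual code in src/background/background.ts; the point is the shape of the routing, with the chrome.* wiring kept in a register function so the logic stays testable:

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Hypothetical minimal agent: the real one runs a model/tool loop.
class Agent {
  chatMessages: ChatMessage[] = [];

  async generate(userText: string): Promise<ChatMessage[]> {
    this.chatMessages.push({ role: "user", content: userText });
    // Real code: run inference and stream tokens here.
    this.chatMessages.push({ role: "assistant", content: `echo: ${userText}` });
    return this.chatMessages;
  }
}

declare const chrome: any; // provided by the extension runtime

// Route a side-panel request and return the payload to broadcast back.
async function handleMessage(agent: Agent, msg: { type: string; payload?: string }) {
  switch (msg.type) {
    case "AGENT_GENERATE_TEXT":
      return { type: "MESSAGES_UPDATE", payload: await agent.generate(msg.payload ?? "") };
    case "AGENT_GET_MESSAGES":
      return { type: "MESSAGES_UPDATE", payload: agent.chatMessages };
    default:
      return null;
  }
}

// Called once from the service worker's top level in the real extension.
function registerBackground(agent: Agent) {
  chrome.runtime.onMessage.addListener((msg: any) => {
    handleMessage(agent, msg).then((update) => {
      if (update) chrome.runtime.sendMessage(update); // side panel re-renders
    });
    return true; // keep the channel open for async work
  });
}
```

The single Agent instance in the background is what gives every UI surface the same canonical conversation state.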
2) Transformers.js integration details
2.1 Model and responsibility
In src/shared/constants.ts, this extension uses two model roles.
This division is intentional: Gemma 4 handles inference and tool decisions, while MiniLM generates vector embeddings for the semantic similarity searches behind ask_website and find_history.
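To make the embedding role concrete, here is the kind of similarity helper a semantic search over MiniLM vectors needs (a sketch; the project's actual ranking code may differ):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank candidates (e.g. history entries) by similarity to a query vector.
function rankBySimilarity(query: number[], candidates: { id: string; vector: number[] }[]) {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

If the feature-extraction pipeline is asked for normalized embeddings, the cosine computation reduces to a plain dot product, which matters when scanning a large history store.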
2.2 Where inference is performed
All inference is performed in the background (src/background/background.ts).
Text generation via pipeline("text-generation", …), with persistent KV caching enabled through a DynamicCache instance.
Embedding and vector normalization via pipeline("feature-extraction", …).
This provides a single model host for all tabs/sessions, avoids duplicate memory usage, and keeps the side panel UI responsive. Because the model is loaded from a background service worker, artifacts are cached in the extension origin (chrome-extension://) rather than in a per-website origin, providing one shared cache for the entire extension installation.
A note about the MV3 lifecycle: Because service workers can be paused and restarted, the runtime state of the model should be treated as recoverable and reinitialized as needed.
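One way to treat model state as recoverable is a memoized async initializer: every handler awaits it, concurrent callers share one in-flight initialization, and if the worker was restarted, the first call simply initializes again. A generic sketch (the project's real initialization in src/background/background.ts may be structured differently):

```typescript
// Memoize an async factory so concurrent callers share one in-flight
// initialization, and a restarted service worker just re-runs the factory.
function lazySingleton<T>(factory: () => Promise<T>): () => Promise<T> {
  let instance: Promise<T> | null = null;
  return () => {
    if (instance === null) {
      instance = factory().catch((err) => {
        instance = null; // allow a retry after a failed download/init
        throw err;
      });
    }
    return instance;
  };
}

// Usage sketch (model ID and options from this article's setup):
// const getGenerator = lazySingleton(() =>
//   pipeline("text-generation", "onnx-community/gemma-4-E2B-it-ONNX",
//            { dtype: "q4f16", device: "webgpu" }));
```

Because the cached artifacts survive a worker restart (see the cache notes above), re-running the factory after a pause is cheap compared to the first download.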
2.3 Download and cache lifecycle
The model lifecycle is explicit.
CHECK_MODELS examines what is already cached and estimates the remaining download size. INITIALIZE_MODELS downloads/initializes the models and reports DOWNLOAD_PROGRESS to the UI. The long-lived pipeline instances are reused after setup.
Permissions and privacy are part of the architecture, not a final checkbox. In this project, public/manifest.json requests host_permissions for http(s)://*/* in addition to sidePanel, storage, scripting, and tabs.
sidePanel: required to open and control the side panel UX.
storage: required to persist tool/settings state between sessions.
tabs + scripting: required for tab-aware tools and page-level actions.
host_permissions for http(s)://*/*: required because content extraction/highlighting is designed to work on any website.
Why keep this list narrow: permissions define user trust and Chrome Web Store review risk. Request only what the feature actually needs, and state clearly that inference runs locally in the extension runtime so users understand where their data is processed.
3) Agent and tool execution loop
3.1 Tool invocation basics (why this layer exists)
Before looking at the run loop, it helps to understand how model tool calling works (the fundamentals of the agent workflow). You pass the messages and tool schemas (name, description, parameters), and Transformers.js uses the model's chat template to format the actual prompt from those inputs. Chat templates are model-specific, so the exact tool-calling format depends on the model you use. With Gemma 4 style templates, the model generates a special tool-call token block when it decides to call a tool.
```javascript
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/gemma-4-E2B-it-ONNX",
  {
    dtype: "q4f16",
    device: "webgpu",
  },
);

const messages = [
  { role: "user", content: "What's the weather like in Bern?" },
];

const output = await generator(messages, {
  max_new_tokens: 128,
  do_sample: false,
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the weather at a given location",
        parameters: {
          type: "object",
          properties: {
            location: {
              type: "string",
              description: "The location to get the weather for",
            },
          },
          required: ["location"],
        },
      },
    },
  ],
});
```
At generation time, the model can emit output like the following:
call: get_weather {"location": "Bern"}
This is exactly why the project includes a normalization layer (webMcp) and a parser (extractToolCalls): model output must be translated into deterministic tool executions.
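A minimal parser in the spirit of extractToolCalls might look like this. The token format (lines like `call: tool_name {"arg": "value"}`) and the regex are assumptions for illustration, not the project's actual grammar, and nested JSON objects are not handled:

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Split model output into plain assistant text plus structured tool calls.
function extractToolCalls(output: string): { message: string; toolCalls: ToolCall[] } {
  const toolCalls: ToolCall[] = [];
  const pattern = /call:\s*(\w+)\s*(\{.*?\})/g;
  const message = output
    .replace(pattern, (match, name: string, json: string) => {
      try {
        toolCalls.push({ name, args: JSON.parse(json) });
      } catch {
        // Malformed JSON: keep the text rather than drop it silently.
        return match;
      }
      return "";
    })
    .trim();
  return { message, toolCalls };
}
```

The important property is determinism: whatever text survives parsing is shown to the user, and everything recognized as a tool call is executed through the normalized tool layer.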
3.2 Tool interface for this project
src/background/agent/webMcp.tsx normalizes the extension's tools into a shape suited to the model:
name, description, inputSchema, execute
Examples of tools include get_open_tabs, go_to_tab, open_url, close_tab, find_history, ask_website, and highlight_website_element.
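Sketching that normalized shape (the field names come from the list above; the execute signature and the example implementation are assumptions):

```typescript
// Normalized tool shape the agent layer consumes.
interface WebMcpTool {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema for the arguments
  execute: (args: Record<string, unknown>) => Promise<string>;
}

declare const chrome: any; // provided by the extension runtime

// Example: a tab-listing tool; the real get_open_tabs may return richer data.
const getOpenTabs: WebMcpTool = {
  name: "get_open_tabs",
  description: "List the titles and URLs of all open tabs",
  inputSchema: { type: "object", properties: {}, required: [] },
  execute: async () => {
    const tabs = await chrome.tabs.query({});
    return JSON.stringify(tabs.map((t: any) => ({ title: t.title, url: t.url })));
  },
};
```

Returning a string from execute keeps the tool result trivially embeddable in the next prompt turn.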
3.3 Loop design (Agent.runAgent)
The main design choice here is to separate internal model messages from UI-oriented chat messages.
Internal model transcript (messages): the system/user/tool/assistant turns passed to generator(…). UI transcript (chatMessages): what the user sees, containing streamed assistant text, tool execution metadata, and performance metrics.
Execution flow:
Append the user input to chatMessages, create a placeholder assistant message, and stream tokens.
Parse the streaming/final model output into { message, toolCalls } using extractToolCalls.ts.
Run tool calls in the background while the assistant message the user sees remains plain text.
Attach tool results to the assistant message's tool metadata and feed them back as the next prompt turn.
Iterate until no tool calls remain, then finalize the assistant's content and metrics.
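The flow above can be sketched as a loop with injected generate and tool-execution functions. This is a simplified, hypothetical version of Agent.runAgent with streaming, metrics, and the UI transcript omitted:

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }
interface ParsedOutput { message: string; toolCalls: ToolCall[]; }
type Turn = { role: "system" | "user" | "assistant" | "tool"; content: string };

async function runAgent(
  userInput: string,
  generate: (transcript: Turn[]) => Promise<ParsedOutput>, // model + parser
  runTool: (call: ToolCall) => Promise<string>,            // tool layer
  maxSteps = 5, // hard cap so a confused model cannot loop forever
): Promise<string> {
  const transcript: Turn[] = [{ role: "user", content: userInput }];
  let finalText = "";
  for (let step = 0; step < maxSteps; step++) {
    const { message, toolCalls } = await generate(transcript);
    transcript.push({ role: "assistant", content: message });
    finalText = message;
    if (toolCalls.length === 0) break; // no more tools: answer is final
    for (const call of toolCalls) {
      const result = await runTool(call);
      transcript.push({ role: "tool", content: result }); // feed results back
    }
  }
  return finalText;
}
```

Injecting generate and runTool keeps the loop testable without a model, and the maxSteps cap is a practical safeguard worth having in any tool loop.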
This keeps the user-facing conversation clean while the deterministic tool loop stays in the background.
4) Data boundaries and persistence
State placement is another very important architectural decision in MV3. In this implementation, state is partitioned by lifecycle and access pattern.
Conversation state: background memory for fast turn-by-turn orchestration (Agent.chatMessages).
Tool settings: chrome.storage.local so settings persist between sessions.
Semantic history vectors: IndexedDB (VectorHistoryDB) for larger local search data.
Extracted page content: background cache (WebsiteContentManager) keyed by active URL.
As explained in Section 1.2, keeping conversation history in the background provides one canonical state across UI updates. This puts short-lived state in memory, persistent settings in extension storage, and large amounts of captured data in a local database.
5) Notes on building and packaging
Although complex build configurations are not required, MV3 requires predictable output at each runtime.
Multi-entry build in vite.config.ts: keep output names/paths (sidebar.html, background.js, content.js) matching the manifest.
Keep the content script as a self-contained output to avoid chunk-loading issues at runtime.
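As a sketch, the multi-entry configuration can look like this. Output names are chosen to match the manifest; the project's actual vite.config.ts may differ, especially around keeping the content-script bundle self-contained:

```typescript
import { defineConfig } from "vite";
import { resolve } from "path";

export default defineConfig({
  build: {
    rollupOptions: {
      input: {
        sidebar: resolve(__dirname, "sidebar.html"),
        background: resolve(__dirname, "src/background/background.ts"),
        content: resolve(__dirname, "src/content/content.ts"),
      },
      output: {
        // Stable names so public/manifest.json can reference them directly.
        entryFileNames: "[name].js",
      },
    },
  },
});
```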
The goal is simple. Place one artifact for each Chrome entry point exactly where public/manifest.json expects it.
Final point
The architectural choice that unlocks this entire project is a clear separation of concerns: orchestration and model execution happen in the background, the UI surface stays thin, and content scripts handle page access.
This project uses a side panel, but the same approach will work for other setups.
Popup-first assistant: use action.default_popup for quick interactions, while conversation state and model execution stay in the background.
Side panel copilot: maintain long-running conversations in a persistent panel while tool loops and caching are handled in the background.
Agent per tab: if each tab needs its own context, maintain one agent state per tabId in the background.
Hybrid UI (popup + side panel + options page): all UI entry points communicate with the same background coordinator and reuse the same message contract.
The practical rule is simple: decide where state lives (global, per tabId, or site-scoped), keep that state and model inference in the background (effectively a background service), and treat the UI and content runtimes as thin clients.

