We recently released the Transformers.js demo browser extension powered by Gemma 4 E2B to assist users with web navigation.
During the build, I collected some practical observations about the Manifest V3 runtime, model loading, and messaging that are worth sharing.
Who is this for
This guide is for developers who want to run local AI functions in Chrome extensions using Transformers.js under Manifest V3 constraints.
The end result is the same architecture used in this project, including a background service worker to host the model, a chat UI in the side panel, and content scripts for page-level actions.
What we build
This guide recreates the core architecture of the Transformers.js Gemma 4 Browser Assistant using publicly available extensions as references and the open source codebase as an implementation map.
1) Chrome Extension Architecture (MV3)
Before we get into the main topic, let me give you a quick overview. We won’t go into detail about React UI layers or Vite build configurations. The focus here is on high-level architectural decisions: what each Chrome runtime does and how those parts are coordinated.
If you’re new to Manifest V3, start by reading this short overview: What is Manifest V3?
1.1 Runtime context and entry points
In MV3, the architecture starts at public/manifest.json. This project defines three entry points: a background service worker (the model host and coordinator), a side panel page (the chat UI), and a content script (injected into web pages for DOM access).
The background service worker also handles chrome.action.onClicked to open the side panel for the active tab. A related entry point worth knowing: popups can be defined with action.default_popup and suit quick actions. This project uses a side panel for persistent chat, but the orchestration pattern is the same.
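To make the wiring concrete, a minimal manifest for these entry points might look like the following sketch (values are illustrative and the file names are taken from the build output discussed later; the project's actual public/manifest.json differs in detail):

```json
{
  "manifest_version": 3,
  "name": "Browser Assistant (sketch)",
  "version": "0.1.0",
  "background": { "service_worker": "background.js", "type": "module" },
  "side_panel": { "default_path": "sidebar.html" },
  "action": { "default_title": "Open assistant" },
  "content_scripts": [
    { "matches": ["http://*/*", "https://*/*"], "js": ["content.js"] }
  ],
  "permissions": ["sidePanel", "storage", "scripting", "tabs"],
  "host_permissions": ["http://*/*", "https://*/*"]
}
```

Each of the three entry points maps to one artifact here, which is exactly the contract the build step has to satisfy.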
1.2 What is done and where?
A key design decision is to keep most of the orchestration in the background and keep the UI and page logic thin.
The background (src/background/background.ts) is the control plane: it hosts shared services such as the agent lifecycle, model initialization, tool execution, and feature extraction. The side panel (src/sidebar/*) is the interaction layer: chat input/output, streaming updates, and setup controls. The content script (src/content/content.ts) is the page bridge: DOM extraction and highlighting actions.
One practical consequence of this split is that conversation history lives in the background (Agent.chatMessages). The UI sends events such as AGENT_GENERATE_TEXT; the background appends messages, performs inference, and then sends MESSAGES_UPDATE back to the side panel.
This split avoids duplicate loading of models, keeps the UI responsive, and respects Chrome’s security boundaries regarding DOM access.
1.3 Messaging contract
Once the runtimes are separated, messaging becomes the backbone. In this project, all message types are declared as enums in src/shared/types.ts.
Side panel -> Background (BackgroundTasks): CHECK_MODELS, INITIALIZE_MODELS, AGENT_INITIALIZE, AGENT_GENERATE_TEXT, AGENT_GET_MESSAGES, AGENT_CLEAR, EXTRACT_FEATURES
Background -> Side panel (BackgroundMessages): DOWNLOAD_PROGRESS, MESSAGES_UPDATE
Background -> Content (ContentTasks): EXTRACT_PAGE_DATA, HIGHLIGHT_ELEMENTS, CLEAR_HIGHLIGHTS
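As a sketch of what such a contract can look like, using the message names listed above (the exact declarations in src/shared/types.ts may differ):

```typescript
// Message names shared by all runtimes. Declaring them as string enums gives
// compile-time safety and keeps payloads readable in DevTools.
export enum BackgroundTasks {
  CHECK_MODELS = "CHECK_MODELS",
  INITIALIZE_MODELS = "INITIALIZE_MODELS",
  AGENT_INITIALIZE = "AGENT_INITIALIZE",
  AGENT_GENERATE_TEXT = "AGENT_GENERATE_TEXT",
  AGENT_GET_MESSAGES = "AGENT_GET_MESSAGES",
  AGENT_CLEAR = "AGENT_CLEAR",
  EXTRACT_FEATURES = "EXTRACT_FEATURES",
}

export enum BackgroundMessages {
  DOWNLOAD_PROGRESS = "DOWNLOAD_PROGRESS",
  MESSAGES_UPDATE = "MESSAGES_UPDATE",
}

export enum ContentTasks {
  EXTRACT_PAGE_DATA = "EXTRACT_PAGE_DATA",
  HIGHLIGHT_ELEMENTS = "HIGHLIGHT_ELEMENTS",
  CLEAR_HIGHLIGHTS = "CLEAR_HIGHLIGHTS",
}

// A common envelope keeps dispatch uniform across runtimes.
export interface ExtensionMessage<T = unknown> {
  type: BackgroundTasks | BackgroundMessages | ContentTasks;
  payload?: T;
}
```

A shared envelope type like ExtensionMessage is a hypothetical addition here, but some uniform shape is what makes a single background dispatcher practical.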
The orchestration rules are simple: the background is the single coordinator, while the side panel and content scripts are specialized workers that request actions and render results.
Typical request flow:
The side panel sends AGENT_GENERATE_TEXT. The background appends it to Agent.chatMessages and runs the model/tool step. The background then emits MESSAGES_UPDATE. The side panel re-renders from the updated message list.
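This request flow can be sketched as a small background dispatcher. The Agent class and handler below are simplified assumptions, not the project's actual code in src/background/background.ts; the point is the shape of the routing, with the chrome.* wiring kept in a register function so the logic stays testable:

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Hypothetical minimal agent: the real one runs a model/tool loop.
class Agent {
  chatMessages: ChatMessage[] = [];

  async generate(userText: string): Promise<ChatMessage[]> {
    this.chatMessages.push({ role: "user", content: userText });
    // Real code: run inference and stream tokens here.
    this.chatMessages.push({ role: "assistant", content: `echo: ${userText}` });
    return this.chatMessages;
  }
}

declare const chrome: any; // provided by the extension runtime

// Route a side-panel request and return the payload to broadcast back.
async function handleMessage(agent: Agent, msg: { type: string; payload?: string }) {
  switch (msg.type) {
    case "AGENT_GENERATE_TEXT":
      return { type: "MESSAGES_UPDATE", payload: await agent.generate(msg.payload ?? "") };
    case "AGENT_GET_MESSAGES":
      return { type: "MESSAGES_UPDATE", payload: agent.chatMessages };
    default:
      return null;
  }
}

// Called once from the service worker's top level in the real extension.
function registerBackground(agent: Agent) {
  chrome.runtime.onMessage.addListener((msg: any) => {
    handleMessage(agent, msg).then((update) => {
      if (update) chrome.runtime.sendMessage(update); // side panel re-renders
    });
    return true; // keep the channel open for async work
  });
}
```

The single Agent instance in the background is what gives every UI surface the same canonical conversation state.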
2) Transformers.js integration details
2.1 Model and responsibility
In src/shared/constants.ts, this extension uses two model roles.
This division is intentional: Gemma 4 handles inference and tool decisions, while MiniLM generates vector embeddings for the semantic similarity searches behind ask_website and find_history.
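To make the embedding role concrete, here is the kind of similarity helper a semantic search over MiniLM vectors needs (a sketch; the project's actual ranking code may differ):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank candidates (e.g. history entries) by similarity to a query vector.
function rankBySimilarity(query: number[], candidates: { id: string; vector: number[] }[]) {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

If the feature-extraction pipeline is asked for normalized embeddings, the cosine computation reduces to a plain dot product, which matters when scanning a large history store.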
2.2 Where inference is performed
All inference is performed in the background (src/background/background.ts).
Text generation via pipeline("text-generation", …), with persistent KV caching enabled through a DynamicCache instance.
Embedding and vector normalization via pipeline("feature-extraction", …).
This provides a single model host for all tabs/sessions, avoids duplicate memory usage, and keeps the side panel UI responsive. Because the model is loaded from a background service worker, artifacts are cached in the extension origin (chrome-extension://) rather than in a per-website origin, providing one shared cache for the entire extension installation.
A note about the MV3 lifecycle: Because service workers can be paused and restarted, the runtime state of the model should be treated as recoverable and reinitialized as needed.
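One way to treat model state as recoverable is a memoized async initializer: every handler awaits it, concurrent callers share one in-flight initialization, and if the worker was restarted, the first call simply initializes again. A generic sketch (the project's real initialization in src/background/background.ts may be structured differently):

```typescript
// Memoize an async factory so concurrent callers share one in-flight
// initialization, and a restarted service worker just re-runs the factory.
function lazySingleton<T>(factory: () => Promise<T>): () => Promise<T> {
  let instance: Promise<T> | null = null;
  return () => {
    if (instance === null) {
      instance = factory().catch((err) => {
        instance = null; // allow a retry after a failed download/init
        throw err;
      });
    }
    return instance;
  };
}

// Usage sketch (model ID and options from this article's setup):
// const getGenerator = lazySingleton(() =>
//   pipeline("text-generation", "onnx-community/gemma-4-E2B-it-ONNX",
//            { dtype: "q4f16", device: "webgpu" }));
```

Because the cached artifacts survive a worker restart (see the cache notes above), re-running the factory after a pause is cheap compared to the first download.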
2.3 Download and cache lifecycle
The model lifecycle is explicit.
CHECK_MODELS examines what is already cached and estimates the remaining download size. INITIALIZE_MODELS downloads/initializes the models and reports DOWNLOAD_PROGRESS to the UI. The long-lived pipeline instances are reused after setup.
Permissions and privacy are part of the architecture, not a final checkbox. In this project, public/manifest.json requests host_permissions for http(s)://*/* in addition to sidePanel, storage, scripting, and tabs.
sidePanel: required to open and control the side panel UX.
storage: required to persist tool/settings state between sessions.
tabs + scripting: required for tab-aware tools and page-level actions.
host_permissions for http(s)://*/*: required because content extraction/highlighting is designed to work on any website.
Why keep this list narrow: permissions define user trust and Chrome Web Store review risk. Request only what the feature actually needs, and state clearly that inference runs locally in the extension runtime so users understand where their data is processed.
3) Agent and tool execution loop
3.1 Tool invocation basics (why this layer exists)
Before looking at the run loop, it helps to understand how model tool calling works (the fundamentals of the agent workflow). You pass the messages and tool schemas (name, description, parameters), and Transformers.js uses the model's chat template to format the actual prompt from those inputs. Chat templates are model-specific, so the exact tool-calling format depends on the model you use. With Gemma 4 style templates, the model generates a special tool-call token block when it decides to call a tool.
```javascript
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/gemma-4-E2B-it-ONNX",
  {
    dtype: "q4f16",
    device: "webgpu",
  },
);

const messages = [
  { role: "user", content: "What's the weather like in Bern?" },
];

const output = await generator(messages, {
  max_new_tokens: 128,
  do_sample: false,
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the weather at a given location",
        parameters: {
          type: "object",
          properties: {
            location: {
              type: "string",
              description: "The location to get the weather for",
            },
          },
          required: ["location"],
        },
      },
    },
  ],
});
```
At generation time, the model can emit output like the following:
call: get_weather {"location": "Bern"}
This is exactly why the project includes a normalization layer (webMcp) and a parser (extractToolCalls): model output must be translated into deterministic tool executions.
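A minimal parser in the spirit of extractToolCalls might look like this. The token format (lines like `call: tool_name {"arg": "value"}`) and the regex are assumptions for illustration, not the project's actual grammar, and nested JSON objects are not handled:

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Split model output into plain assistant text plus structured tool calls.
function extractToolCalls(output: string): { message: string; toolCalls: ToolCall[] } {
  const toolCalls: ToolCall[] = [];
  const pattern = /call:\s*(\w+)\s*(\{.*?\})/g;
  const message = output
    .replace(pattern, (match, name: string, json: string) => {
      try {
        toolCalls.push({ name, args: JSON.parse(json) });
      } catch {
        // Malformed JSON: keep the text rather than drop it silently.
        return match;
      }
      return "";
    })
    .trim();
  return { message, toolCalls };
}
```

The important property is determinism: whatever text survives parsing is shown to the user, and everything recognized as a tool call is executed through the normalized tool layer.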
3.2 Tool interface for this project
src/background/agent/webMcp.tsx normalizes the extension's tools into a shape suited to the model:
name, description, inputSchema, execute
Examples of tools include get_open_tabs, go_to_tab, open_url, close_tab, find_history, ask_website, and highlight_website_element.
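Sketching that normalized shape (the field names come from the list above; the execute signature and the example implementation are assumptions):

```typescript
// Normalized tool shape the agent layer consumes.
interface WebMcpTool {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema for the arguments
  execute: (args: Record<string, unknown>) => Promise<string>;
}

declare const chrome: any; // provided by the extension runtime

// Example: a tab-listing tool; the real get_open_tabs may return richer data.
const getOpenTabs: WebMcpTool = {
  name: "get_open_tabs",
  description: "List the titles and URLs of all open tabs",
  inputSchema: { type: "object", properties: {}, required: [] },
  execute: async () => {
    const tabs = await chrome.tabs.query({});
    return JSON.stringify(tabs.map((t: any) => ({ title: t.title, url: t.url })));
  },
};
```

Returning a string from execute keeps the tool result trivially embeddable in the next prompt turn.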
3.3 Loop design (Agent.runAgent)
The main design choice here is to separate internal model messages from UI-oriented chat messages.
Internal model transcript (messages): the system/user/tool/assistant turns passed to generator(…). UI transcript (chatMessages): what the user sees, containing streamed assistant text, tool execution metadata, and performance metrics.
Execution flow:
Append the user input to chatMessages, create a placeholder assistant message, and stream tokens.
Parse the streaming/final model output into { message, toolCalls } using extractToolCalls.ts.
Run tool calls in the background while the assistant message the user sees remains plain text.
Attach tool results to the assistant message's tool metadata and feed them back as the next prompt turn.
Iterate until no tool calls remain, then finalize the assistant's content and metrics.
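The flow above can be sketched as a loop with injected generate and tool-execution functions. This is a simplified, hypothetical version of Agent.runAgent with streaming, metrics, and the UI transcript omitted:

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }
interface ParsedOutput { message: string; toolCalls: ToolCall[]; }
type Turn = { role: "system" | "user" | "assistant" | "tool"; content: string };

async function runAgent(
  userInput: string,
  generate: (transcript: Turn[]) => Promise<ParsedOutput>, // model + parser
  runTool: (call: ToolCall) => Promise<string>,            // tool layer
  maxSteps = 5, // hard cap so a confused model cannot loop forever
): Promise<string> {
  const transcript: Turn[] = [{ role: "user", content: userInput }];
  let finalText = "";
  for (let step = 0; step < maxSteps; step++) {
    const { message, toolCalls } = await generate(transcript);
    transcript.push({ role: "assistant", content: message });
    finalText = message;
    if (toolCalls.length === 0) break; // no more tools: answer is final
    for (const call of toolCalls) {
      const result = await runTool(call);
      transcript.push({ role: "tool", content: result }); // feed results back
    }
  }
  return finalText;
}
```

Injecting generate and runTool keeps the loop testable without a model, and the maxSteps cap is a practical safeguard worth having in any tool loop.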
This keeps the user-facing conversation clean while the deterministic tool loop stays in the background.
4) Data boundaries and persistence
State placement is another very important architectural decision in MV3. In this implementation, state is partitioned by lifecycle and access pattern.
Conversation state: background memory for fast turn-by-turn orchestration (Agent.chatMessages).
Tool settings: chrome.storage.local so settings persist between sessions.
Semantic history vectors: IndexedDB (VectorHistoryDB) for larger local search data.
Extracted page content: background cache (WebsiteContentManager) keyed by active URL.
As explained in Section 1.2, keeping conversation history in the background provides one canonical state across UI updates. This puts short-lived state in memory, persistent settings in extension storage, and large amounts of captured data in a local database.
5) Notes on building and packaging
Although complex build configurations are not required, MV3 requires predictable output at each runtime.
Multi-entry build in vite.config.ts: keep output names/paths (sidebar.html, background.js, content.js) matching the manifest.
Keep the content script as a self-contained output to avoid chunk-loading issues at runtime.
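As a sketch, the multi-entry configuration can look like this. Output names are chosen to match the manifest; the project's actual vite.config.ts may differ, especially around keeping the content-script bundle self-contained:

```typescript
import { defineConfig } from "vite";
import { resolve } from "path";

export default defineConfig({
  build: {
    rollupOptions: {
      input: {
        sidebar: resolve(__dirname, "sidebar.html"),
        background: resolve(__dirname, "src/background/background.ts"),
        content: resolve(__dirname, "src/content/content.ts"),
      },
      output: {
        // Stable names so public/manifest.json can reference them directly.
        entryFileNames: "[name].js",
      },
    },
  },
});
```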
The goal is simple. Place one artifact for each Chrome entry point exactly where public/manifest.json expects it.
Final point
The architectural choice that unlocks this entire project is a clear separation of concerns: orchestration and model execution happen in the background, the UI surface stays thin, and content scripts handle page access.
This project uses a side panel, but the same approach will work for other setups.
Popup-first assistant: use action.default_popup for quick interactions, while conversation state and model execution stay in the background.
Side panel copilot: maintain long-running conversations in a persistent panel while tool loops and caching are handled in the background.
Agent per tab: if each tab needs its own context, maintain one agent state per tabId in the background.
Hybrid UI (popup + side panel + options page): all UI entry points communicate with the same background coordinator and reuse the same message contract.
The practical rule is simple: decide where state lives (global, per tabId, or site-scoped), keep that state and model inference in the background (effectively a background service), and treat the UI and content runtimes as thin clients.

