Snehal Patel

I love to build things ✨

Building a Browser Agent with Gemini and Playwright

February 15, 2026

Browser Agent Logo

If you have written Selenium or Playwright scripts for more than a few months, you know the ritual. You spend an afternoon automating some workflow, everything works perfectly, and then the site ships a new build and your carefully crafted XPath dissolves into a NoSuchElementException. You patch it. A week later it breaks again. At some point the maintenance cost quietly exceeds the value of the automation itself.

The deeper problem is that traditional browser automation is imperative: you are encoding a specific sequence of actions against a specific DOM structure. Any change to the site's structure invalidates your script. For one-off tasks or fast-moving targets, this is a bad trade. What you actually want is to describe intent ("go to this site and do this thing") and have something else figure out the mechanics.

That is the premise behind browser-agent: a lightweight Python tool that wires Gemini’s function-calling API to a live Playwright browser, turning plain-English instructions into real browser actions.


The Problem with Traditional Browser Automation

CSS selectors and XPaths are brittle by design. They are references into a tree structure that nobody promised would stay stable. When developers refactor markup, add a wrapper div, or switch from IDs to data attributes for better test hygiene, your automation silently breaks.

Beyond fragility, scripts are task-specific. The script that logs into your dashboard and exports a CSV cannot be repurposed for a different site without a full rewrite. Each new workflow requires a fresh encoding of element locations and action sequences. For any organization running a dozen automations, this becomes a part-time maintenance job.

The combination of LLMs and tool-use changes this equation. A model that can read a snapshot of the current page and decide which element to click next does not care what the site looked like last month. It reasons about current state, not encoded assumptions. The script stops being a rigid plan and becomes a responsive loop.


What I Built

browser-agent is an open-source Python project that gives Gemini eight browser tools and a Playwright Chromium session, then steps back and lets the model drive.

It supports navigating to URLs, capturing page snapshots, clicking elements, typing into fields, exporting pages as PDFs, going back in history, and waiting, plus two signaling tools for marking a task complete or failed. There is both a Python API and a CLI. The one-liner that motivated the whole thing:

$ browser-agent "Go to Hacker News and save the front page as a PDF"

No selectors. No page objects. Just the task.


Under the Hood: The Agentic Loop

The core architecture is a tight loop between Gemini and a live browser. Here is how a single run unfolds:

1. Task input. The user provides a plain-English instruction, either via CLI argument or the Python run_browser_agent() function.

2. Task refinement (optional). Before the main loop starts, task_generator.py makes a separate Gemini call at temperature 0.3. It converts open-ended prompts into 3–10 numbered steps. This decouples intent clarification from execution: the main loop receives a concrete plan rather than an ambiguous request. If the generator fails for any reason, the original input passes through unchanged. Graceful degradation, not a hard error.
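That refine-with-fallback behavior can be sketched in a few lines. Here `call_gemini` is a stand-in for the actual API call task_generator.py makes at temperature 0.3, and `broken_call` simulates a failed one; neither is the project's real code.

```python
def refine_task(prompt, call_gemini):
    """Turn an open-ended prompt into a numbered plan, or pass it through."""
    try:
        steps = call_gemini(prompt)  # expected: a list of 3-10 step strings
        if not steps:
            raise ValueError("empty plan")
        return "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    except Exception:
        return prompt  # graceful degradation: the original input passes through

# A successful refinement call (canned output for illustration)
plan = refine_task(
    "Summarize the top stories on Hacker News",
    lambda p: [
        "Navigate to news.ycombinator.com",
        "Snapshot the page",
        "Read the top 5 story titles",
        "Report a summary",
    ],
)

# A failed refinement call: the raw prompt survives
def broken_call(p):
    raise RuntimeError("model error")

fallback = refine_task("Summarize the top stories on Hacker News", broken_call)
```

The `try/except Exception` is deliberately broad: any pre-processing failure, from a network error to malformed output, should fall back rather than halt the run.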

3. Browser and client initialization. run_browser_agent() launches a Playwright Chromium instance with anti-bot configuration applied (more on this below), initializes the Gemini client, and prepares the tool schema for all eight browser functions.

4. The loop (≤ 50 iterations). Each iteration follows the same pattern:

  • Send the full conversation history plus the eight-function tool schema to Gemini
  • Gemini returns a function_call response naming one of the eight tools and its arguments
  • Execute that tool call against the live browser
  • Append the result to conversation history
  • Repeat

5. Exit conditions. The loop exits when Gemini calls task_complete() (success) or task_failed() (the model has determined it cannot finish the task). The 50-iteration cap is a safety rail against runaway loops.
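The loop and both exit conditions can be sketched like this. `FakeModel` and the toy `tools` dict are illustrative stand-ins, not the real project classes; the real loop sends `history` plus the tool schema to Gemini and parses a function_call out of the response.

```python
MAX_ITERATIONS = 50  # the safety rail against runaway loops

class FakeModel:
    """Stand-in for the Gemini client: returns a canned function call per turn."""
    def __init__(self, calls):
        self.calls = iter(calls)

    def next_call(self, history):
        return next(self.calls)

def run_loop(model, tools):
    history = []
    for _ in range(MAX_ITERATIONS):
        name, args = model.next_call(history)
        if name in ("task_complete", "task_failed"):
            return name, history                  # model signaled an outcome
        result = tools[name](**args)              # execute against the browser
        history.append((name, args, result))      # feed the result back
    return "iteration_cap", history               # hit the safety rail

# Toy tools standing in for the real browser bindings
tools = {
    "browser_navigate": lambda url: f"loaded {url}",
    "browser_snapshot": lambda: {"0": "link: Hacker News"},
}

model = FakeModel([
    ("browser_navigate", {"url": "https://news.ycombinator.com"}),
    ("browser_snapshot", {}),
    ("task_complete", {}),
])
outcome, history = run_loop(model, tools)
```

The important structural point survives the simplification: every tool result is appended to history, so each Gemini turn sees everything that has happened so far.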

The eight tools available to the model:

Tool                                What it does
browser_navigate(url)               Load a URL
browser_snapshot()                  Capture interactive elements + page text
browser_click(index)                Click the nth element
browser_type(index, text, submit)   Fill a field
browser_pdf(filename)               Export the page as a PDF
browser_back()                      Go back in history
browser_wait(seconds)               Pause
task_complete / task_failed         Signal outcome

The whole thing is stateless across runs but stateful within one: Gemini sees the full history of what it has done at every step.
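A single entry in that eight-function schema looks roughly like this, in the JSON-schema declaration format that Gemini's function-calling API accepts. The description strings here are my own wording, not necessarily the project's:

```python
# Function declaration for browser_click, in the OpenAPI-style schema
# Gemini function calling uses. The other seven tools follow the same shape.
browser_click_decl = {
    "name": "browser_click",
    "description": "Click the interactive element with the given snapshot index.",
    "parameters": {
        "type": "object",
        "properties": {
            "index": {
                "type": "integer",
                "description": "Element index from the most recent browser_snapshot().",
            },
        },
        "required": ["index"],
    },
}
```

Because the schema is just data, adding a ninth tool means adding one declaration and one Python function; the loop itself does not change.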


Three Design Decisions Worth Calling Out

Indexed element references instead of raw DOM

The most important implementation choice is what the model actually sees when it reads a page. Serializing a full DOM tree into the prompt is expensive in tokens and noisy in signal. Most of the DOM is irrelevant to any given action.

Browser.snapshot() takes a different approach: it runs a single JavaScript evaluation that finds all visible interactive elements on the page (links, buttons, inputs, selects), assigns each an integer index starting from 0, and returns a compact JSON map of index → element description. The prompt stays small. Gemini only needs to say browser_click(index=7) to click the eighth interactive element on the page. It never has to reason about XPaths, CSS selectors, or DOM hierarchy.

This also means the index space is stable within a single snapshot. The model can call browser_snapshot(), read the result, and immediately reference any element by its index in the next step. Deterministic and cheap.
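The resulting snapshot shape looks roughly like this. In the real Browser.snapshot() the element list comes from one JavaScript evaluation inside the page; here it is hard-coded for illustration:

```python
import json

def build_snapshot(elements):
    """elements: (tag, label) pairs in DOM order for visible interactive nodes.

    Returns the compact index -> description map the model reads.
    """
    return {str(i): f"{tag}: {label}" for i, (tag, label) in enumerate(elements)}

snapshot = build_snapshot([
    ("link", "Hacker News"),
    ("input", "Search"),
    ("button", "Go"),
])
prompt_fragment = json.dumps(snapshot)  # this JSON is what lands in the prompt
```

Against this snapshot, `browser_type(1, "query", submit=True)` targets the search input; no selector ever crosses the model boundary.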

Task generator for vague inputs

Vague prompts are the hardest input for an action loop. “Summarize the top stories on Hacker News” could mean five different things depending on how you interpret “top” and “summarize.” If that ambiguity hits the main loop directly, the model has to resolve it under execution pressure, while also managing browser state.

Separating the clarification step from the execution step keeps both simpler. The task generator makes one focused call: take this open-ended request, return numbered steps. A temperature of 0.3 keeps the output consistent. The output, a clean numbered list, is what the main loop actually runs against.

The fallback matters: if the generator call fails or returns garbage, the original prompt passes through. The system degrades gracefully rather than halting on a pre-processing failure.

Anti-bot configuration without third-party packages

Browser fingerprinting is a real concern for any automation tool. The typical response in the Selenium ecosystem is to reach for a stealth library: a package that patches browser internals to hide automation signals. These libraries tend to be fragile, lag behind browser releases, and add dependency surface area.

browser-agent takes a lighter approach: a custom User-Agent string, the --disable-blink-features=AutomationControlled Chromium launch flag, and a one-line JavaScript snippet on page load that deletes navigator.webdriver. Three lines of configuration, no extra package, minimal maintenance burden. The README is honest that this won’t defeat every detection system, but for most casual use cases it is sufficient and far easier to maintain than a full stealth wrapper.
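The three-part setup can be expressed with Playwright's sync API like this. The exact User-Agent string and the exact webdriver-hiding snippet below are my guesses, not the project's values (the prototype-delete variant is used here because a plain `delete navigator.webdriver` is a no-op in Chromium):

```python
# Assumed placeholder UA; the real project's string may differ.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]
# One-line init script that removes the navigator.webdriver automation signal.
HIDE_WEBDRIVER = "delete Object.getPrototypeOf(navigator).webdriver"

def launch_browser(playwright):
    """Launch Chromium with the anti-bot configuration applied.

    `playwright` is a started sync_playwright() handle; not invoked here.
    """
    browser = playwright.chromium.launch(args=LAUNCH_ARGS)
    context = browser.new_context(user_agent=USER_AGENT)
    context.add_init_script(HIDE_WEBDRIVER)  # runs before any page script
    return browser, context
```

add_init_script is the key piece: it runs in every new page before the site's own JavaScript, so detection code never observes the webdriver flag.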


A Quick Walkthrough

Here is a concrete example: “Go to DuckDuckGo, search for ‘best Python libraries 2025’, and save the results as a PDF.”

After the task generator breaks this into steps, the main loop produces something like:

Step 1 → browser_navigate("https://duckduckgo.com")
Step 2 → browser_snapshot()          # finds search box at index 4
Step 3 → browser_type(4, "best Python libraries 2025", submit=True)
Step 4 → browser_wait(2)
Step 5 → browser_pdf("ddg_results")
Step 6 → task_complete("Saved results to ddg_results.pdf")

Gemini decided every one of those steps. The code just executed them, captured results, and fed them back. The browser_snapshot() call in step 2 is what gave the model enough information to know the search box was at index 4: it read the element map and made an inference.

Six steps, zero selectors written by hand.


Closing Thoughts

The interesting lesson from building browser-agent is that the LLM integration was not the hard part. Gemini’s function-calling API is well-designed; once you define the tool schema, the model uses it reliably. The hard part was building a clean, token-efficient interface between the model and the browser.

Indexed snapshots instead of raw DOM. Graceful fallback on pre-processing failures. Simple anti-bot configuration that does not require external packages. None of those are AI problems: they are interface design problems. Getting that layer right made everything else fall into place.

If you want to try it, extend it with your own tools, or just poke at the internals, the code is on GitHub: github.com/spate141/browser-agent. Stars and PRs welcome.