Spectre Lens: Giving Agents Eyes Without Handing Them the Whole Browser

This is the second article in the Spectre series.

The first one introduced Spectre Kinetic, the small planning layer that lets a model express intent while Elixir decides what reality accepts.

Now I want to move from action to perception, because this is another part of agent systems that sounds simple until you try to make it work in a real browser.

An agent can have tools, plans, and actions, but if it cannot understand what is on the page, it is still kind of blind. It can click things, sure. But clicking is not the same as seeing.

The Browser Is Not the Interface Agents Need

Humans need browsers. We need buttons, scrollbars, tabs, cookie banners, modals, menus, and all the little visual details that help us understand a page.

Agents don't need the browser in that same human way. They need perception.

By perception I mean something practical, not poetic. The agent needs to know what is on the page, what changed, what can be clicked, what can be submitted, what looks important, and what is just layout or decoration.

The usual approach is to throw raw HTML, a DOM dump, or sometimes a screenshot at the model and hope it understands enough. This can work for small things, but it becomes messy very fast.

The DOM was built for browsers, not for language models. It contains everything. Navigation, forms, sidebars, hidden inputs, tracking scripts, hydration leftovers, marketing text, nested divs, and a lot of things that are technically true but not really helpful for the agent.

That is why I started building Spectre Lens. I did not want the model to work directly with raw DOM unless it really has to. I wanted a layer that turns browser state into something the agent can actually read and use.

The idea is simple: the browser stays the browser, but the agent receives a shaped view of the page.
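To make "a shaped view" concrete, here is a minimal sketch of the filtering idea. It is not how Spectre Lens works internally: the module name, the flat list of node maps, and the two tag lists are all assumptions made up for illustration.

```elixir
defmodule ShapedViewSketch do
  @moduledoc "Illustrative only. Hypothetical module, not part of Spectre Lens."

  # Tags that are valid DOM but rarely help an agent decide anything.
  @noise_tags ~w(script style noscript template)
  # Tags an agent could plausibly act on.
  @interactive_tags ~w(a button input select textarea)

  # Takes a flat list of node maps and returns a smaller, agent-facing view.
  def shape(nodes) do
    kept = Enum.reject(nodes, &noise?/1)

    %{
      # Non-empty text of every node that survived the noise filter.
      text: kept |> Enum.map(&Map.get(&1, :text, "")) |> Enum.reject(&(&1 == "")),
      # Surviving nodes the agent could click, fill, or follow.
      actions: Enum.filter(kept, &(&1[:tag] in @interactive_tags))
    }
  end

  defp noise?(%{tag: tag}) when tag in @noise_tags, do: true
  defp noise?(%{tag: "input", attrs: %{"type" => "hidden"}}), do: true
  defp noise?(_node), do: false
end
```

The browser still holds every node. The agent only gets the readable text and the elements it could act on.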

After Kinetic, Lens Is the Next Piece

Spectre is not meant to be one giant agent framework that tries to own everything.

I prefer smaller libraries that each handle one part of the problem. Kinetic handles tool planning. Lens handles browser perception. Other pieces can handle memory, orchestration, and whatever else becomes painful later.

This separation matters. Tool planning is not browsing. Browsing is not memory. Memory is not execution. When all of these things are mixed into one abstraction, the demo can look nice, but debugging it later becomes difficult.

Spectre Lens is the part that says: before the agent clicks anything, it should first understand the page.

It needs a readable view. It needs actions it can refer to. It needs forms, links, semantic structure, screenshots, page maps, and exported artifacts. It should be able to ask, “What does this page mean?” before asking, “What can I click?”

Browser automation clicks buttons. Spectre Lens tries to understand what the button is for.

That is the difference I care about. Automation is movement. Perception is knowing enough about the page before moving.

Lightpanda Drives, Spectre Lens Translates

Spectre Lens currently controls Lightpanda through CDP, which is the Chrome DevTools Protocol. CDP is the low-level machinery used for browser inspection and automation.

CDP is useful, but I don't want the agent thinking in CDP terms. That is too low-level. The agent should not care about browser internals unless the application really needs that.

The public contract is not “here is CDP, good luck.” The public contract is SpectreLens.Protocol: page views, actions, exports, page maps, watchers, and agent context.

Lightpanda is the current driver. Spectre Lens is the perception layer. That distinction is important, because drivers can change, but the agent-facing view should stay stable.

Here is the basic flow:

{:ok, lens} = SpectreLens.open(instances: 2)
{:ok, tab} = SpectreLens.new_tab(lens, url: "https://example.com")

{:ok, view} =
  SpectreLens.look(tab,
    include: [:markdown, :semantic_tree, :interactive, :forms, :links, :structured_data]
  )

view.markdown
view.actions
view.llms_context

The important part is that the agent does not receive “the browser”. It receives a shaped view.

Markdown gives readable content. Actions describe possible interactions. LLM context gives the page information that is already meant for machines. The model does not need to parse every piece of frontend noise by itself.

This is not magic. It is just a better boundary between the browser and the agent.

Page Maps Make the Page Easier to Reason About

Raw page content is useful, but pages are not only text. They have regions. A hero section, a pricing block, a form, a sidebar, a footer, navigation links. Humans understand this structure visually almost without thinking.

An agent needs help with that.

Spectre Lens gives page maps through zoom_out and zoom_in. The names sound a bit fancy, but the idea is simple. The agent can first ask for the big picture, then focus on one part of the page.

{:ok, map} = SpectreLens.zoom_out(tab)
map.description

{:ok, focused} = SpectreLens.zoom_in(tab, "#contact")

A zoomed-out map can describe the page in words: navigation at the top, hero section first, content in the middle, forms near the bottom, footer after that.

A zoom-in can focus on a section like #contact or #pricing. This matters because agents usually work better when the context is shaped around the task than when the whole page is dumped into the prompt.

There is also goal-scoped discovery. This is useful when the agent needs to find something specific, like API documentation or a pricing page.

{:ok, discovery} = SpectreLens.discover(tab, goal: "api reference")
discovery.text
discovery.candidates

The point is not to crawl the whole internet. It is to explore just enough same-origin pages to find useful context.

That is what I want from this layer: small, controlled movement. Not a crawler running around like it drank too much coffee.
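For readers who want to see what "same-origin, and just enough" can look like, here is a small sketch built on Elixir's standard URI module. It is not the discovery logic inside Spectre Lens: the module name, the function names, and the default limit of 5 are assumptions for illustration.

```elixir
defmodule SameOriginSketch do
  @moduledoc "Illustrative only. Hypothetical module, not part of Spectre Lens."

  # Two URLs share an origin when scheme, host, and port all match.
  def same_origin?(a, b), do: origin(a) == origin(b)

  # Resolves links against the base URL, keeps only same-origin ones,
  # removes duplicates, and caps the result at `limit` entries.
  def candidates(base, links, limit \\ 5) do
    links
    |> Enum.map(&(base |> URI.merge(&1) |> URI.to_string()))
    |> Enum.filter(&same_origin?(base, &1))
    |> Enum.uniq()
    |> Enum.take(limit)
  end

  defp origin(url) do
    %URI{scheme: scheme, host: host, port: port} = URI.parse(url)
    {scheme, host, port}
  end
end
```

The cap is the important part: the exploration stops after a handful of pages even when the site links to hundreds.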

The Agent Can Act, But It Still Does Not Own Reality

At some point perception needs to become action.

The agent sees a search field. It sees a submit button. It wants to fill something and click. That's fine. Let it ask. Lens can translate that into browser actions.

But I still think it is important to keep the action explicit.

:ok = SpectreLens.act(tab, {:fill, ref: "#q", value: "spectre"})
:ok = SpectreLens.act(tab, {:click, ref: "button[type=submit]"})

The model does not need to invent a browser session in its head. The runtime has a tab. The tab has a page. The page has references. The agent proposes an action, and the library performs it against real state.
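One way an application could keep that proposal step explicit is to check each action before it reaches the browser. The sketch below assumes the `{kind, opts}` tuple shape from the calls above. Everything else is hypothetical: the module name, the allowlist, and the `perform` callback (which would wrap the real call in practice) are not part of Spectre Lens.

```elixir
defmodule ActionGateSketch do
  @moduledoc "Illustrative only. Hypothetical module, not part of Spectre Lens."

  # Action kinds this hypothetical application lets an agent request.
  @allowed_kinds [:fill, :click]

  # Validates a proposed action, then hands it to `perform`, a one-argument
  # function supplied by the caller. Rejected actions never reach `perform`.
  def run(action, perform) when is_function(perform, 1) do
    case validate(action) do
      :ok -> perform.(action)
      {:error, _reason} = error -> error
    end
  end

  defp validate({kind, opts}) when is_atom(kind) and is_list(opts) do
    cond do
      kind not in @allowed_kinds -> {:error, {:kind_not_allowed, kind}}
      not is_binary(Keyword.get(opts, :ref)) -> {:error, :missing_ref}
      true -> :ok
    end
  end

  defp validate(_other), do: {:error, :malformed_action}
end
```

A call such as `ActionGateSketch.run({:click, ref: "#go"}, perform)` reaches `perform`, while a `{:navigate, ...}` proposal comes back as `{:error, {:kind_not_allowed, :navigate}}` without touching the tab.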

This is also why exports matter. In production, you need evidence. You want the screenshot. You want the markdown. You want artifacts you can save, inspect, debug, attach to logs, or compare later.

When the agent says it saw something, I want a way to check what it actually saw.

{:ok, "screenshots/example.png"} =
  SpectreLens.export(tab, :screenshot, path: "screenshots/example.png")

:ok = SpectreLens.close(lens)

And yes, close the lens. Processes should end. Browser instances should be supervised. Your machine should not suffer because an agent forgot to clean up after itself.

The Runtime Should Stay Calm Even When the Page Is Messy

Lens gives the agent eyes, but it does not pretend that eyes are the same thing as judgment.

This distinction matters. A model can observe, describe, choose, and suggest. Your application still decides what is allowed, what is logged, what is retried, and what gets blocked.

There is also support for llms.txt and llms-full.txt, which I find very practical. If a site exposes agent-oriented documentation, Lens can discover it and include that context during look.

That means the agent can read the page, but also receive the site’s own explanation of how machines should understand it. I like this direction. Websites are starting, slowly, to expose something more useful for agents than just visual layout and SEO text.
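As a rough illustration of where such files are usually looked for, the sketch below builds the root-level URLs for both file names from any page URL. How Lens actually discovers them is not shown in this article, so the module and function names here are assumptions.

```elixir
defmodule LlmsTxtSketch do
  @moduledoc "Illustrative only. Hypothetical module, not part of Spectre Lens."

  @files ["llms.txt", "llms-full.txt"]

  # Returns the site-root URLs where these files are conventionally published.
  def candidate_urls(page_url) do
    # Reset the page URL to the site root, dropping its path, query, and fragment.
    root = %{URI.parse(page_url) | path: "/", query: nil, fragment: nil}

    Enum.map(@files, &(root |> URI.merge(&1) |> URI.to_string()))
  end
end
```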

Errors also need to be readable. If an element is not found, the system should say that. If an export is unsupported, say that. If something is retryable, make that visible. If the target is known, return it.

This is the difference between an agent that can recover and one that fails because a button moved or the page loaded differently than expected.
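Here is a minimal sketch of what a readable, recoverable error could look like in Elixir. The struct and its fields (`reason`, `target`, `retryable`) are hypothetical and chosen to mirror the points above. They are not the error type Spectre Lens returns.

```elixir
defmodule ReadableErrorSketch do
  @moduledoc "Illustrative only. Hypothetical module, not part of Spectre Lens."

  # Hypothetical error shape: what failed, on which target, and whether
  # trying again could help.
  defstruct [:reason, :target, retryable: false]

  # Calls `fun` up to `attempts` times. Only errors marked retryable are
  # retried; any other result is returned to the caller unchanged.
  def with_retry(fun, attempts) when is_function(fun, 0) and attempts > 0 do
    case fun.() do
      {:error, %__MODULE__{retryable: true}} when attempts > 1 ->
        with_retry(fun, attempts - 1)

      result ->
        result
    end
  end
end
```

An error like "element not found at #q" comes straight back so the agent can look at the page again, while a retryable one is retried a bounded number of times.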

This is the Spectre pattern again. Let the model be expressive, but keep the runtime boring and controlled.

Kinetic makes tool intent inspectable. Lens makes browser state readable. One handles action planning. The other handles perception.

Together they make an agent a little less blind before it touches something expensive.

Yuriy Zhar
