A mechanical answer

What robotic process automation is, in three layers you can open in source.

Most explanations of robotic process automation stop at an analogy: a software worker that imitates a human. That answer is fine for a slide and useless for a buyer. The mechanical answer is three layers stacked on top of each other. A capture filter that admits exactly six event types. A processing pipeline with four named stages. A replay engine with four match strategies tried in order. Every working RPA system has all three. Every difference between vendors lives inside one of them. This page walks each layer with the open-source files that implement it, so you can read along.

Matthew Diakonov · 10 min read

Why the analogy is not enough

The standard one-line definition of robotic process automation is accurate but underspecified. Yes, the bot performs a defined business task by interacting with the same applications a human would. Yes, it works through screen, keyboard, mouse, and accessibility surfaces. That tells you what the bot looks like from the outside. It does not tell you what it does on the inside. The questions a buyer actually needs answered, the ones that decide whether a workflow ships in three weeks or three quarters, all live below that one-liner.

Which keystrokes and clicks count as a step worth recording, and which are filtered out as noise? After capture, what gets done to the raw event stream before it becomes a runnable file? When the bot is replaying that file two months later and the underlying app has shipped a small UI update, what does the bot try in order to keep the run going? These three questions map onto three concrete pieces of machinery: capture, processing, and replay. The next three sections walk each one with a real file path you can open.

The implementation is the public Mediar codebase, an open-source Rust executor and a Windows desktop recorder published under MIT at github.com/mediar-ai/terminator. The point of citing specific files is not to praise the project; it is to let you verify a definition rather than take one on trust.

Layer one

Capture: only six event types ever count as a step.

The first thing a robotic process automation system does, while a human runs the target task once, is record events. Not all events. A modern Windows session emits hundreds of events per second between mouse moves, hover transitions, focus shifts, scroll wheels, raw keystrokes, idle window updates, and so on. If every one of those landed in the recording, the synthesis stage would have nothing useful to reason about and the workflow file would be unreadable. So every RPA system applies a filter. The Mediar filter lives in apps/desktop/src-tauri/src/recording_processor.rs, in a function called is_meaningful_event_type on line 250. It admits six event types and rejects everything else.

The six event types admitted into the pipeline

1. button_click: native Win32 click on a control
2. browser_click: click inside a Chromium DOM
3. text_input_completed: user finished typing in a field
4. browser_tab_navigation: URL change inside a tab
5. application_switch: foreground process changed
6. file_opened: an OS-level file open

The list looks short because it is. A button click on a Win32 control, a click inside a Chromium DOM, a completed text input after the user stops typing, a tab URL change, a foreground application switch, an OS-level file open. That is the universe of recordable user intent, as far as the synthesis stage is concerned. Mouse moves are filtered explicitly fifty lines away in workflow_recorder.rs with a comment that reads “too noisy”. Hovers and scrolls never make it into the meaningful index. Single keystrokes are coalesced into a single text_input_completed event once the user moves on.

Why six and not eight or four? Because each of these six maps cleanly onto a unit of business intent the rest of the pipeline can describe in a sentence: I clicked something, I typed something, I navigated, I switched apps, I opened a file. The sub-events that compose those (the focus shift before the click, the keystroke that ends the typing, the WM_ACTIVATE before the switch) carry information but only as supporting context. They are recorded into the surrounding UI tree snapshot, not into the event index. That separation is what keeps a one-hour recording session from producing a one-million-line synthesis prompt.
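The keystroke coalescing mentioned above is the clearest example of how sub-events become supporting context rather than index entries. A minimal sketch, using an illustrative event enum rather than the actual recorder types: keystrokes accumulate in a buffer, and moving on flushes them as one text_input_completed event.

// Illustrative coalescing sketch. RawEvent and the buffering policy are
// assumptions for exposition, not the workflow_recorder.rs types.
enum RawEvent {
    Keystroke(char),
    FocusChange, // the user moved on: clicked elsewhere, tabbed away
}

struct MeaningfulEvent {
    event_type: &'static str,
    text: String,
}

fn coalesce(raw: impl IntoIterator<Item = RawEvent>) -> Vec<MeaningfulEvent> {
    let mut out = Vec::new();
    let mut buffer = String::new();
    for event in raw {
        match event {
            // Individual keystrokes never reach the index on their own.
            RawEvent::Keystroke(c) => buffer.push(c),
            // A focus change flushes the buffer as one completed input.
            RawEvent::FocusChange if !buffer.is_empty() => {
                out.push(MeaningfulEvent {
                    event_type: "text_input_completed",
                    text: std::mem::take(&mut buffer),
                });
            }
            RawEvent::FocusChange => {}
        }
    }
    out
}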

The first thing to notice about a real RPA system, then, is that it is a filter as much as it is a recorder. The bot does not see what the human did; it sees a curated subset of what the human did. Every downstream guarantee about reliability rides on that subset being well chosen. If the filter were narrower, real intent would be missed. If it were wider, the synthesis stage would drown in noise. Six event types is the answer this codebase landed on, and you can read it.

Layer two

Processing: four stages turn an event stream into a typed file.

Once the recorder has a stream of admitted events, the system has to turn it into something a deterministic runtime can replay. That transformation is the single largest piece of machinery in any modern robotic process automation product, and it is the place where vendors differ the most. The Mediar pipeline, declared in the ProcessingProgress struct on line 94 of recording_processor.rs, is exactly four stages. The doc-comment on the struct's total_stages field reads “always 4” and names them in order: step_analysis, labeling, synthesis, generation.
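In sketch form, that progress struct carries a stage name and a fixed stage count. Only total_stages is pinned by the doc-comment; the other fields here are assumptions:

// Sketch of the processing-progress shape described above.
struct ProcessingProgress {
    stage: String,      // "step_analysis" | "labeling" | "synthesis" | "generation"
    stage_index: u8,    // 1-based position in the fixed pipeline
    total_stages: u8,   // "always 4", per the doc-comment
}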

  1. step_analysis. For each meaningful event, the pipeline calls Gemini Vertex AI with the event itself, the accessibility tree before and after the action, and a small screenshot. The model returns a structured object with eight named fields per step: step_title, step_summary, events_that_happened, how_content_changed, results_if_any, what_was_clicked, what_was_typed, and user_intent (sketched as a struct after this list). This is the only stage where the model interprets pixels; after this point, all reasoning runs on the structured output.

  2. labeling. A second pass tags each step with a coarse label so the synthesis stage knows which steps belong to a navigation preamble, which to the data-entry body, and which to the cleanup tail. Labels are short strings, not free text. They are the seam the pipeline uses to split a long session into multiple shorter workflows.

  3. synthesis. The labeled steps are grouped into one or more SynthesizedWorkflow objects. A workflow has a title, a description, and a list of WorkflowStep objects, each of which contains substeps with explicit inputs, outputs, and business-logic notes. The granularity is whatever the model found in the recording. A single recording session often produces several workflows, not one giant macro.

  4. generation. Each synthesized workflow is rendered into a TypeScript file with the eight semantic fields per step preserved as strings. That file is the deliverable. It can be reviewed in a pull request, diffed against the previous version of the workflow, and replayed by the executor crate without ever calling the model again. After generation finishes, the AI is gone from the loop.
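Those eight fields are worth seeing as a type. A hedged sketch using the field names from the step_analysis stage above; the struct name and field types are assumptions, not the codebase's:

// The eight semantic fields produced per step in stage one. The name
// StepAnalysis is illustrative; only the field names come from the pipeline.
struct StepAnalysis {
    step_title: String,
    step_summary: String,
    events_that_happened: String,
    how_content_changed: String,
    results_if_any: String,
    what_was_clicked: String,
    what_was_typed: String,
    user_intent: String,
}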

The reason there are four stages and not one big monolithic call is failure isolation. If the synthesis stage produces a workflow that does not run, the system can re-synthesize without re-analyzing the events. If the labeling pass produces a bad split, the analyses are still good. If a model call hits a rate limit mid-session, the pipeline picks back up at the stage that failed instead of starting over. That property is hard to retrofit into a single-shot prompt and easy to keep when the stages are explicit fields on a progress struct.
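That pick-back-up behavior can be pictured as a checkpoint scan. A minimal sketch, assuming each stage writes its output to a cache file named after the stage; the file layout is hypothetical, not Mediar's:

use std::path::Path;

// The four stages, in pipeline order.
const STAGES: [&str; 4] = ["step_analysis", "labeling", "synthesis", "generation"];

// Resume at the first stage whose cached output is missing; everything
// before it is already done and does not need to be recomputed.
fn next_stage_to_run(cache_dir: &Path) -> Option<&'static str> {
    STAGES
        .iter()
        .copied()
        .find(|stage| !cache_dir.join(format!("{stage}.json")).exists())
}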

The other property worth naming is that the model only touches the four authoring stages. After generation, the deterministic runtime takes over and there is no inference call in the hot path of the bot. Most enterprise compliance teams will not accept a model on the runtime path because it makes the bot non-reproducible. Splitting authoring (AI) from replay (deterministic) is the design choice that lets a bot pass a SOC 2 audit and still benefit from frontier models during the workflow-writing step.

Layer three

Replay: four match strategies, tried in order, before the bot stops.

Replay is where most older RPA installations break. A workflow is recorded against the version of an enterprise app that exists in April. Two months later the vendor ships a small update; a button moves five pixels; a panel reorders; the recorded selector no longer matches the live element. Selector-based runtimes have one answer to that case, which is to fail and queue a developer ticket. Modern RPA systems have a cascade. Mediar's cascade is in apps/desktop/src-tauri/src/focus_state.rs between lines 168 and 196. It tries four strategies in order before it gives up.

Replay: one selector versus a four-strategy cascade

A selector-based runtime stores one selector per step: an XPath, a Win32 control id, or an image hash. At replay, it asks the live UI for that exact selector. If the selector matches, the click runs. If it does not match, the run fails and queues a maintenance ticket.

  • Single selector per step, no fallback
  • Brittle: a small layout change breaks the run
  • Maintenance scales with the number of workflows times the rate of UI updates

The four-strategy cascade replaces that single lookup with an ordered fallback chain: the recorded automation id first, the window plus bounds second, the visible text third, and finally a re-focus of the parent window so the next step can retry. Only after all four miss does the run pause.

  • Four ordered strategies per step, each a fallback for the previous
  • Resilient: an element that moved is still found by its id, its bounds, or its label
  • Maintenance is triggered by genuine workflow changes, not cosmetic UI updates

None of the four strategies call a model. All four read the live UI Automation tree, which is the same tree screen readers use to describe Windows apps to a blind user. That tree exposes role, name, automation id, bounding box, and parent chain for every visible control in every running application. The four strategies are the four most informative slices of that data: the most specific identifier first, the spatial fingerprint second, the human-readable label third, and the application container last as a coarse fallback.
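In sketch form, the cascade is a chain of or_else lookups. The types and method names below are illustrative assumptions, not the focus_state.rs API; only the ordering follows the code:

// Illustrative cascade; names are assumptions for exposition.
struct RecordedStep {
    automation_id: String,
    window_title: String,
    bounds: (i32, i32, i32, i32),
    visible_text: String,
}

struct UiElement;

trait LiveUiTree {
    fn by_automation_id(&self, id: &str) -> Option<UiElement>;
    fn by_window_and_bounds(&self, window: &str, bounds: (i32, i32, i32, i32)) -> Option<UiElement>;
    fn by_visible_text(&self, text: &str) -> Option<UiElement>;
    fn focus_window(&self, window: &str) -> Option<UiElement>;
}

fn resolve(step: &RecordedStep, tree: &dyn LiveUiTree) -> Option<UiElement> {
    // 1. The most specific identifier: the recorded automation id.
    tree.by_automation_id(&step.automation_id)
        // 2. The spatial fingerprint: same window, same bounding box.
        .or_else(|| tree.by_window_and_bounds(&step.window_title, step.bounds))
        // 3. The human-readable label: the element's visible text.
        .or_else(|| tree.by_visible_text(&step.visible_text))
        // 4. The coarse fallback: focus the parent window, let the next step retry.
        .or_else(|| tree.focus_window(&step.window_title))
    // None here means the step queues for re-recording; no model
    // improvises a "close enough" match.
}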

The semantic guarantee that comes out of this is different from the 2003 selector-only semantics. The old guarantee was binary: the script ran or it failed. The new guarantee is graceful: the bot tried four ways to find the same element, and only after all four failed did it pause for re-recording. That single shift is what makes automation rates above ninety-five percent on legacy desktop apps possible without an army of developers patching selectors after every release. It is also why a Mediar workflow can run unattended for weeks through normal application updates while a Blue Prism-era workflow needs maintenance every time a vendor ships an update.

One nuance worth being explicit about: the cascade does not improvise. There is no “if all four fail, the model picks something close enough.” If all four fail, the step queues for re-recording the next time a human walks the path, and the AI authoring layer takes over to write a fresh step. That separation preserves the original promise of the word automation: predictable, repeatable execution. A model in the hot path would compromise that, and most enterprise compliance teams will not accept it.

6 + 4 + 4

The capture layer admits six event types. The processing layer always runs four stages. The replay layer tries four strategies before it stops. These three numbers, taken together, are the closest you can get to a mechanical definition of robotic process automation in 2026, and you can verify each one in apps/desktop/src-tauri/src/recording_processor.rs and focus_state.rs.

Figure: the capture, processing, and replay layers in the open-source Mediar codebase.

Reading the capture filter, in code

The single most uncopyable fact on this page is the actual six-line match statement. Vendor pages will tell you their bot “learns your workflow”. None of them tell you what counts as something the bot will learn from. Here is the function that does:

fn is_meaningful_event_type(event_type: &str) -> bool {
    matches!(
        event_type,
        "button_click"
            | "browser_click"
            | "text_input_completed"
            | "browser_tab_navigation"
            | "application_switch"
            | "file_opened"
    )
}

That function is the gate between the raw event stream and the processing pipeline. Anything outside the six listed types is dropped on the floor. When you ask “what does the bot pay attention to”, this match is the answer for this codebase. A competing vendor's answer might be a longer list, or a shorter one; the difference matters more than the marketing claims, because it directly bounds what the bot can learn from one demonstration.
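Applied to a raw stream, the gate is a one-line filter. A usage sketch, assuming the function above is in scope:

fn main() {
    let raw = ["mouse_move", "button_click", "hover", "text_input_completed", "scroll"];
    // Everything outside the six admitted types is dropped on the floor.
    let meaningful: Vec<&str> = raw
        .iter()
        .copied()
        .filter(|e| is_meaningful_event_type(e))
        .collect();
    assert_eq!(meaningful, ["button_click", "text_input_completed"]);
}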

The mechanical definition, then

Robotic process automation is software that does three things in order. It captures a curated subset of UI events while a human runs a business task once. It compiles that event stream into a deterministic, file-based workflow through a fixed number of processing stages, possibly with help from a frontier model during authoring. It replays the workflow later by walking the live accessibility tree through a small ordered list of match strategies, falling back gracefully when the underlying interface has shifted. The bot is the binary that runs the third step. The process is the file produced by the second step. The automation is the property that the third step runs unattended.
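Compressed into signatures, that definition reads like a three-function pipeline. Every name below is illustrative, not the codebase's:

struct RawEventStream;
struct MeaningfulEvents; // the six admitted event types
struct WorkflowFile;     // the generated TypeScript deliverable
struct RunReport;

// Capture: the six-type filter over everything the session emits.
fn capture(_session: RawEventStream) -> MeaningfulEvents { MeaningfulEvents }

// Processing: four stages, model-assisted, run once at authoring time.
fn compile(_events: MeaningfulEvents) -> WorkflowFile { WorkflowFile }

// Replay: the four-strategy cascade, deterministic, no model in the loop.
fn replay(_workflow: &WorkflowFile) -> RunReport { RunReport }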

That definition does work the standard analogy does not. It tells a buyer which questions to ask a vendor: how many event types does your recorder admit, what is the shape of your processing pipeline, how many match strategies does your runtime try before it pauses. None of those questions are answered by “our software robot mimics a human worker”. All of them are answered by source code.

The reason RPA as a category survived twenty years and two architectural shifts is that the user-facing promise (a software bot runs a business process unattended) stayed steady while the implementation underneath kept getting better. The 2026 version has three layers; the 2003 version had one. The next version will probably have a different number again. Open the files, count the layers, and you will know which version you are buying.

See the three layers run against your own workflow.

Bring a Windows process, ideally one that has stalled in UiPath or Power Automate. We will record it live, walk through the four-stage synthesis on screen, and replay the generated TypeScript file in the same call.

Frequently asked questions

What does robotic process automation actually do, in one sentence?

It watches a small set of UI events while a human runs a business task once, compiles those events into a deterministic script, and replays the script later by walking the live accessibility tree to find the same elements again. The 'robot' is the replay engine. The 'process' is the script file. The 'automation' is that the engine runs without a human present.

Which UI events does an RPA system count as worth recording?

In Mediar, six event types are admitted into the pipeline by the is_meaningful_event_type function in apps/desktop/src-tauri/src/recording_processor.rs at line 250: button_click, browser_click, text_input_completed, browser_tab_navigation, application_switch, file_opened. Mouse moves are filtered explicitly at workflow_recorder.rs line 404 because they would drown the recording in noise. Hovers, scrolls, and idle keystrokes never reach the synthesis stage. Every legacy RPA recorder makes a similar choice; this is the modern, source-visible version of it.

How many stages does a recording go through before it becomes a runnable script?

Four. The ProcessingProgress struct in apps/desktop/src-tauri/src/recording_processor.rs comments that total_stages is always 4 and names them: step_analysis, labeling, synthesis, generation. Step analysis runs Gemini Vertex AI over each meaningful event with the surrounding accessibility tree to extract eight semantic fields. Labeling tags each step with a coarse role. Synthesis groups steps into one or more workflows. Generation emits the TypeScript file the executor will later run. After generation, the AI is gone from the loop entirely.

How does the bot find the same button later if the UI has shifted?

It walks four strategies in order from focus_state.rs lines 168 to 196. Strategy 1: match the recorded automation/accessibility id. Strategy 2: match the window plus bounds. Strategy 3: match the visible text. Strategy 4: focus the parent window and let the next step retry. Only after all four fail does the run pause. None of the four strategies call a model. They all read the live UI Automation tree, which is the same tree screen readers consume.

Is RPA the same as a desktop macro recorder?

No. A macro recorder stores keystrokes and mouse coordinates and replays them at the same coordinates. RPA stores semantic descriptions of what was clicked and what was typed, then at replay time walks the live UI tree to find the matching element by id, by bounds, or by text. Coordinates are recomputed every run, so a window move, a DPI change, or a small UI tweak does not break the workflow. That is the entire reason the category exists.

Where does the AI live in this picture?

Only during the four processing stages of recording: a Gemini Vertex AI call reads the meaningful events with surrounding UI context and produces the eight-field semantic record per step. After generation finishes, the TypeScript workflow file is checked into storage and the executor takes over. The production executor crate at crates/executor in github.com/mediar-ai/terminator has zero references to gemini, claude, openai, or any inference library. It only talks to a Windows session through MCP. So the bot is AI at authoring time and plain Rust at runtime.

Does the system mimic a human, or call APIs underneath?

It mimics a human, but at the operating system layer rather than at the pixel layer. Every action goes through Windows UI Automation, which is the same accessibility surface a screen reader uses. There is no API integration with the underlying app, no SAP RFC, no Oracle SQL, no FIS REST endpoint. That is the point: the same approach works for an SAP GUI screen, a Jack Henry green-screen, a Cerner intake form, or a regular web page in Chrome. The bot is application agnostic because the accessibility tree is application agnostic.

Why does this matter for a buyer?

Because the three numbers above (six event types, four stages, four match strategies) are what the bot's per-run reliability comes down to. Selector-based RPA in the 2003 lineage stops at one match strategy, which is why those scripts break when an interface ships a small update. The four-strategy cascade is why a Mediar workflow can run for weeks unattended through normal application updates, and the bounded six-event capture is why the recording stays small enough for a frontier model to reason about per session.