The two-layer stack — Physea Wiki

Local AI is two jobs. A runtime (the engine) loads the model file and does the actual math; an app on top gives you a chat window and a model browser. Sometimes they are two separate programs, sometimes one download bundles both.

Running a model on your own computer breaks into two jobs, and it helps to keep them separate in your head.

The first job is the runtime, sometimes called the engine. This is the piece that opens the model file, loads its weights into memory, and does the actual math that turns your prompt into words. It is the part that talks to your hardware, whether that is a graphics card or the chip in a Mac. On its own, a runtime is often just a background service or a command-line program. It works, but it is not pretty.

The second job is the app. This is the part you see and click: a chat window, a box to type into, a list of models you can download. The app does not do the heavy math itself. It sends your message down to the runtime and shows you what comes back, the same way a web browser shows you a page that some server actually produced.
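To make the split concrete, here is a toy sketch in Python. It is an illustration only, not how any particular tool is built: the names ToyRuntime and ChatApp are made up for this page, and the "math" is a placeholder that reverses the prompt instead of running a real model. The point is the shape: the app never produces the reply itself, it hands the message to the runtime and shows what comes back.

```python
# Toy sketch of the two layers. All names are hypothetical and
# no real model or runtime is involved.


class ToyRuntime:
    """Stands in for the engine: owns the 'model' and does the work."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.loaded = False

    def load(self):
        # A real runtime would read the weights from disk into memory here.
        self.loaded = True

    def generate(self, prompt):
        if not self.loaded:
            raise RuntimeError("model is not loaded")
        # Placeholder for the actual math: just reverse the prompt.
        return prompt[::-1]


class ChatApp:
    """Stands in for the app: takes input and shows output, does no math."""

    def __init__(self, runtime):
        self.runtime = runtime
        self.history = []

    def send(self, message):
        reply = self.runtime.generate(message)  # hand off to the engine
        self.history.append((message, reply))   # the app only records and displays
        return reply


runtime = ToyRuntime("toy-model")
runtime.load()
app = ChatApp(runtime)
reply = app.send("hello")
print(reply)  # prints "olleh"
```

If the runtime has not loaded its model, the call to send fails inside generate, not inside the app. That matches the idea that loading problems belong to the runtime layer.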

One download or two? Sometimes these are two separate programs you install and connect. Sometimes a single app quietly bundles a runtime inside it, so installing one thing gives you both layers at once. Either way, the two jobs are still happening underneath.

Knowing which layer is which makes the rest of this topic easier. When something is slow or a model will not load, that is usually a runtime question about your hardware. When you are choosing how chatting should look and feel, that is an app question. The next two pages look at each layer and the common tools for it.
