# Tokenomics from first principles

“Tokenomics” is a portmanteau of tokens and economics, expressing the relationship between dollars and the primary unit of computation in large language models (LLMs): tokens. We will explain where the cost of serving/hosting LLMs comes from, how many tokens a GPU can produce, and why this is the case. We will build a (simplified) world model of LLM inference arithmetic, based on the popular open-source model Llama 3.3. The goal is to develop an accurate intuition for LLM inference.

The topic of LLM inference economics has far-reaching implications beyond technical considerations. As AI capabilities rapidly advance, inference efficiency directly shapes both industry economics and accessibility. For AI labs, token production costs fundamentally determine profit margins (and also the cost of generating synthetic training data): more efficient inference means higher returns on a fixed investment in hardware, returns that can fuel further research and development cycles. For users, lower token costs democratize access to these powerful tools, potentially transforming AI from a premium resource into an everyday utility available even for routine tasks. Understanding these cost structures isn't merely academic: it provides insight into one of the key economic forces that will shape AI development in the coming years as we approach increasingly capable systems.

The primary cost behind a generated token boils down to the cost of compute: you need to buy or rent a GPU. In both cases, there is a fixed cost associated with running a GPU per hour, and each GPU can produce only a limited number of tokens in an hour. Dividing the cost of hardware per hour by the number of tokens produced per hour gives the unit cost of generating a single token. This is how most LLM providers price their API offerings, and this is the model we will explore.
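
To make this arithmetic concrete, here is a minimal sketch of the unit-cost calculation. The rental price and throughput below are illustrative placeholders, not measured figures; the realistic values depend on the GPU, the model, and the serving setup, which the rest of this guide works through.

```python
# Minimal sketch of the unit-cost arithmetic described above.
# Both inputs are assumed placeholder values, for illustration only.
gpu_cost_per_hour_usd = 2.00   # assumed hourly rental price for one GPU
tokens_per_hour = 1_000_000    # assumed tokens one GPU produces per hour

# Unit cost: hardware cost per hour divided by tokens produced per hour.
cost_per_token = gpu_cost_per_hour_usd / tokens_per_hour

# Providers typically quote prices per million tokens.
print(f"Cost per token: ${cost_per_token:.8f}")
print(f"Cost per million tokens: ${cost_per_token * 1_000_000:.2f}")
```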
