A deep dive into how LLM prompt caching works under the hood, focusing on the transformer attention mechanism and exactly which data providers reuse between requests. This is also one of the most accessible explanations of how LLMs work that I’ve encountered: the visuals and the step-by-step walkthrough are both impressively clear. Via Simon Willison.
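
For context, the data being reused is the per-layer key/value tensors (the "KV cache") computed for a shared prompt prefix, so a second request with the same prefix can skip that work. A minimal NumPy sketch of the idea, not the article's code, with illustrative shapes and a hypothetical prefix-keyed cache:

```python
# Single-head attention with a reusable KV cache (illustrative assumptions only).
import numpy as np

d = 8                      # model/head dimension (assumed)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attention(q, K, V):
    # Scaled dot-product attention: newest query against all cached keys/values.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

kv_cache = {}              # hypothetical: prefix id -> (K, V) from a prior request

def run(prefix_embeddings, new_embeddings, prefix_key):
    # Reuse K/V for the shared prefix if an earlier request already computed them.
    if prefix_key in kv_cache:
        K, V = kv_cache[prefix_key]
    else:
        K, V = prefix_embeddings @ W_k, prefix_embeddings @ W_v
        kv_cache[prefix_key] = (K, V)
    # Only the new tokens need fresh keys, values, and a query.
    K = np.vstack([K, new_embeddings @ W_k])
    V = np.vstack([V, new_embeddings @ W_v])
    q = new_embeddings[-1] @ W_q
    return attention(q, K, V)

prefix = rng.standard_normal((100, d))   # stand-in for a long shared system prompt
out_a = run(prefix, rng.standard_normal((3, d)), "system-prompt-v1")
out_b = run(prefix, rng.standard_normal((5, d)), "system-prompt-v1")  # cache hit
```

The point of the sketch: the expensive prefix computation happens once, and the second call only pays for its new tokens, which is why providers can discount cached-prefix input.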