In our previous post, “Inside LLM serving (1),” we explored how LLMs work and why tokens and GPUs are fundamental to their operation. In this second installment, we’ll trace the complete path of a user’s prompt as it flows through the server, examine how responses are generated, and understand why memory (specifically the KV cache) becomes the central bottleneck in LLM serving.