Tokenizer

How Language is Translated into Tokens for AI Models

Introduction
Large Language Models (LLMs) like GPT-4 don't understand text the way humans do. Instead, they break it down into smaller pieces called tokens — which can be words, parts of words, punctuation, or even spaces. Each token is assigned a unique token ID (integer), which the model uses internally to understand and generate language.
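The idea that text becomes a sequence of integer IDs can be sketched with a toy tokenizer. This is only an illustration with a tiny hand-made vocabulary — real models use byte-pair encoding (BPE) over vocabularies of roughly 100,000 learned tokens — but the text-to-IDs mapping works the same way:

```python
# Toy tokenizer: greedy longest-match against a tiny, hand-made vocabulary.
# Real tokenizers use byte-pair encoding (BPE) with ~100k learned tokens;
# this only illustrates how text becomes a sequence of integer token IDs.
VOCAB = {"token": 0, "ize": 1, "izer": 2, "r": 3, "s": 4, " ": 5}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        match = next(
            (t for t in sorted(VOCAB, key=len, reverse=True)
             if text.startswith(t, i)),
            None,
        )
        if match is None:
            raise ValueError(f"no token covers {text[i]!r}")
        ids.append(VOCAB[match])
        i += len(match)
    return ids

def detokenize(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)

# "tokenizers" splits into "token" + "izer" + "s" -> [0, 2, 4]
print(tokenize("tokenizers"))
```

Note how "tokenizers" is not looked up as a whole word: the tokenizer greedily covers it with the longest pieces it knows, which is why LLM token counts rarely equal word counts.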

Instructions
This demo uses TiktokenSharp, a C# library that replicates the tokenization logic of OpenAI's language models (including GPT-4). Enter any text below to see how it is split into tokens, with each token's numeric ID and text shown. The colorized view highlights where the model draws the boundaries between tokens.
