TOON Format Specification: Complete Guide to Grammar and Syntax
Discover the official TOON (Token-Oriented Object Notation) specification, ABNF grammar, data types, and key syntax rules of this modern serialization format.
Status: Draft Standard (v2.0)
This document defines the technical specification for TOON (Token-Oriented Object Notation), a data serialization format optimized for Large Language Model (LLM) context windows. See the official project page for updates.
The key goals of this specification are:
- Maximizing Token Density: Representing data in the fewest possible BPE (Byte Pair Encoding) tokens.
- Human Readability: Maintaining a structure that is intuitive for humans and easy for LLMs to reason about.
- Lossless JSON Conversion: Ensuring 1:1 mapping with standard JSON types.
1. Document Structure
A TOON document MUST be encoded in UTF-8. A TOON document represents exactly one root node, which may be an Object or an Array.
2. Whitespace and Indentation
TOON uses Significant Whitespace to denote hierarchy.
- Indent Unit: The standard indentation is 2 spaces (U+0020). Tabs (U+0009) are forbidden to ensure consistent tokenization across different models.
- Newlines: Lines must end with `\n` (LF) or `\r\n` (CRLF).
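The two rules above can be checked line by line before any parsing happens. The following is a minimal sketch in Python, not part of any official TOON tooling: the helper name `check_whitespace` is hypothetical, and it assumes that the tab restriction only needs to be enforced in leading whitespace.

```python
def check_whitespace(text: str, indent_unit: int = 2) -> list[str]:
    """Return a list of problems found; an empty list means the text passed."""
    problems = []
    # Normalize CRLF to LF so both permitted line endings are handled the same way.
    for lineno, line in enumerate(text.replace("\r\n", "\n").split("\n"), start=1):
        if not line.strip():
            continue  # blank lines carry no indentation information
        # Leading run of spaces and tabs (assumption: tabs matter only here).
        leading = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in leading:
            problems.append(f"line {lineno}: tab in indentation")
        elif len(leading) % indent_unit != 0:
            problems.append(
                f"line {lineno}: indentation is not a multiple of {indent_unit}"
            )
    return problems
```

Running such a check first lets a parser assume that every indentation width divides evenly by the indent unit.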
3. Objects
An Object is a collection of key-value pairs.
Syntax: key: value
user:
  name: Alice
  role: admin
- The colon `:` is mandatory. It must be followed by at least one space if a value follows on the same line.
- If the value is a nested object/array, the newline follows immediately after the colon (or key).
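To illustrate the two rules, here is a small Python sketch that writes a nested dictionary in object syntax. The function name `encode_object` is hypothetical, and the sketch assumes scalar values are strings or numbers that need no quoting; booleans, null, and arrays are not handled.

```python
def encode_object(obj: dict, level: int = 0) -> list[str]:
    """Render a dict as TOON object lines, indenting nested dicts by 2 spaces per level."""
    lines = []
    pad = "  " * level
    for key, value in obj.items():
        if isinstance(value, dict):
            # Nested object: the newline follows the colon, children go one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(encode_object(value, level + 1))
        else:
            # Scalar on the same line: colon followed by one space.
            # Assumption: the value needs no quoting.
            lines.append(f"{pad}{key}: {value}")
    return lines


doc = {"user": {"name": "Alice", "role": "admin"}}
print("\n".join(encode_object(doc)))
```

For the sample dictionary this prints `user:` followed by the indented lines `name: Alice` and `role: admin`.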
4. Arrays
Arrays can be represented in two ways: List Style and Table Style.
4.1 List Style (Heterogeneous)
Used when array items have different structures or are primitives.
tags:
  - featured
  - new
  - on-sale
The hyphen `-` denotes a list item.
4.2 Table Style (Homogeneous)
Used when array items are objects sharing the same keys. This is the primary optimization feature of TOON.
Syntax header: key[Count]{field1, field2}:
users[3]{id, name, score}:
  1, Alice, 99
  2, Bob, 85
  3, Charlie, 42
Rules:
- Count: The `[N]` declares the number of rows that follow, so a reader (or an LLM) knows how many rows to expect before reading them.
- Fields: The `{a, b}` defines the schema for the rows.
- Separator: Values are separated by comma `,`.
Note: Using pipes `|` for visual alignment is allowed by some parsers but discouraged in the strict spec as it consumes extra tokens.
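The header and row rules can be sketched as a small Python encoder. The names `is_tabular` and `encode_table` are hypothetical. The sketch assumes that "same keys" also means the same key order, and that cell values need no quoting (a value containing a comma would need extra handling).

```python
def is_tabular(items: list) -> bool:
    """True when every item is a dict and all dicts have the same keys in the same order."""
    if not items or not all(isinstance(item, dict) for item in items):
        return False
    first_keys = list(items[0].keys())
    return all(list(item.keys()) == first_keys for item in items)


def encode_table(key: str, items: list) -> list[str]:
    """Render homogeneous dicts as a Table Style array: one header line, one row per item."""
    if not is_tabular(items):
        raise ValueError("items do not share one schema; use List Style instead")
    fields = list(items[0].keys())
    field_list = ", ".join(fields)
    # Header: key[Count]{field1, field2}:
    header = f"{key}[{len(items)}]{{{field_list}}}:"
    # Rows: one indent unit, then comma-separated values in field order.
    rows = ["  " + ", ".join(str(item[f]) for f in fields) for item in items]
    return [header] + rows
```

Because the field names appear once in the header instead of once per item, the token savings grow with the number of rows.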
5. Values and Types
5.1 Strings
Strings can be Unquoted or Quoted.
Unquoted Strings (Bare Words): Any sequence of characters that does not start with a special character (`-`, `[`, `{`, `"`, `#`) and does not contain newlines or delimiters.
status: active
color: light blue
Quoted Strings: Double quotes `"` are required if the string contains special characters, starts/ends with whitespace, or resembles a boolean/number/null.
greeting: "Hello, World!"
empty: ""
number_string: "123"
5.2 Numbers
Follows standard JSON number format (integer, float, exponent).
count: 42
temp: 36.6
avogadro: 6.022e23
5.3 Booleans
Literals `true` and `false` (lowercase).
5.4 Null
Literal `null`.
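Taken together, the rules in this section suggest a simple order for classifying a value token: quoted string, then the three literals, then number, then bare string. The following Python sketch shows one way to do that. The name `parse_scalar` is hypothetical, and decoding quoted strings with JSON escape rules is an assumption, since the text above does not define escapes.

```python
import json
import re

# JSON number grammar: optional minus, integer part, optional fraction, optional exponent.
_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def parse_scalar(token: str):
    """Map one TOON value token to a Python value."""
    token = token.strip()
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        # Quoted string. Assumption: escapes follow JSON rules.
        # A malformed quoted string raises json.JSONDecodeError.
        return json.loads(token)
    if token == "true":
        return True
    if token == "false":
        return False
    if token == "null":
        return None
    if _NUMBER.fullmatch(token):
        # Keep integers as int; anything with a fraction or exponent becomes float.
        return float(token) if any(c in token for c in ".eE") else int(token)
    # Anything else is a bare (unquoted) string.
    return token
```

This ordering is why `"123"` stays a string while `123` becomes a number: the quote check runs before the number check.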
6. Comments
Comments start with `#` and extend to the end of the line.
# This is a comment
config:
  timeout: 5000 # milliseconds
7. ABNF Grammar (Excerpt)
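The following is a simplified ABNF definition. Rules such as `key`, `integer`, `field-list`, and the scalar types are not expanded in this excerpt.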
TOON = object / list
NL = %x0A / %x0D.0A
Indent = 2SP
object = 1*pair
pair = key ":" [SP value] NL
list = 1*list-item / table-array
list-item = "- " value NL
table-array = key "[" integer "]" "{" field-list "}" ":" NL *row
row = Indent value *("," SP value) NL
value = string / number / boolean / null / object / list
8. Parsing Implementation Guide
When implementing a TOON parser, the primary challenge is Context Tracking. Since structure is defined by indentation, the parser must maintain a stack of current indentation levels.
Algorithm Sketch:
- Read line.
- Calculate indentation level (count leading spaces / 2).
- If level > current_level: Push new container (Object/Array).
- If level < current_level: Pop containers until match.
- Parse content (Key-Value or List Item).
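The steps above can be sketched in Python as follows. This is a deliberately reduced illustration, not a reference parser: the name `parse_objects` is hypothetical, only nested objects are handled (no arrays, comments, or quoting), and scalar values are kept as raw strings. Instead of pushing when the level increases, it pushes a new container as soon as a key has no value on its line, which has the same effect.

```python
def parse_objects(text: str) -> dict:
    """Parse nested `key: value` lines into dicts using an indentation stack."""
    root: dict = {}
    # stack[i] is the container that receives keys written at indentation level i.
    stack = [root]
    for raw in text.replace("\r\n", "\n").split("\n"):
        if not raw.strip():
            continue  # skip blank lines
        stripped = raw.lstrip(" ")
        indent = len(raw) - len(stripped)
        if indent % 2:
            raise ValueError(f"odd indentation: {raw!r}")
        level = indent // 2
        if level > len(stack) - 1:
            raise ValueError(f"unexpected indentation: {raw!r}")
        # Dedent: pop containers until the stack depth matches this line's level.
        del stack[level + 1:]
        key, sep, rest = stripped.partition(":")
        if not sep:
            raise ValueError(f"missing colon: {raw!r}")
        value = rest.strip()
        if value:
            stack[-1][key.strip()] = value
        else:
            # No value on this line: open a nested object and push it.
            child: dict = {}
            stack[-1][key.strip()] = child
            stack.append(child)
    return root
```

Keeping the stack indexed by level makes the dedent case a single slice deletion, and an indentation jump of more than one level is reported as an error rather than silently accepted.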
9. Test Suite
To ensure interoperability, all implementations should pass the TOON Core Test Suite (available on GitHub). It covers:
- Deep nesting limits (default: 100).
- Unicode handling (emojis, CJK characters).
- Corner cases (empty keys, empty strings, trailing commas).