Principle:ClickHouse ClickHouse MIME Multipart Processing
ClickHouse_ClickHouse
Implementation:ClickHouse_ClickHouse_Poco_MultipartReader
Purpose
Defines the principles behind parsing and processing MIME multipart message bodies as specified by RFC 2046. A multipart message lets a single message body carry multiple distinct data sections, each separated by a boundary string. This is essential for features such as HTTP file uploads (`multipart/form-data`), mixed-content responses, and email attachments.
Theoretical Basis
The MIME multipart format (RFC 2046, Section 5.1) structures a message body as a sequence of body parts delimited by a boundary string. The key invariants are:
- The boundary string is declared in the `Content-Type` header as a parameter (e.g., `Content-Type: multipart/form-data; boundary=----WebKitFormBoundary`).
- Each boundary delimiter starts at the beginning of a line and consists of `--` followed by the boundary string.
- The final boundary line ends with an additional `--` suffix, indicating no further parts remain.
- Each part carries its own set of MIME headers (e.g., `Content-Disposition`, `Content-Type`) followed by a blank line and the part body.
- The preamble (text before the first boundary) and epilogue (text after the closing boundary) are to be ignored by conforming parsers.
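As a minimal sketch of the first three invariants, the snippet below derives the delimiter and closing lines from a `Content-Type` header value. It relies on Python's standard `email.message.Message` only to extract the `boundary` parameter; the helper name `boundary_lines` is hypothetical and is not part of ClickHouse or Poco.

```python
from email.message import Message


def boundary_lines(content_type):
    """Return (delimiter, closing_delimiter) for a Content-Type header value.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    boundary = msg.get_boundary()  # None when the parameter is absent
    if boundary is None:
        raise ValueError("no boundary parameter in Content-Type")
    delimiter = "--" + boundary     # line that starts each part
    return delimiter, delimiter + "--"  # closing line adds a "--" suffix


delimiter, closing = boundary_lines(
    "multipart/form-data; boundary=----WebKitFormBoundary"
)
print(delimiter)
print(closing)
```

Note that the example boundary already starts with four dashes, so the resulting delimiter line starts with six.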
The parsing algorithm is fundamentally a state machine over an input stream. It proceeds as follows:
- Locate the first boundary line.
- For each subsequent part, parse the part headers and stream the part body until the next boundary.
- Detect the closing boundary (with `--` suffix) to signal end of multipart content.
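The steps above can be sketched as a small line-based state machine. This is an illustrative simplification under stated assumptions, not the Poco `MultipartReader` implementation: it processes whole lines rather than single characters, ignores folded headers, and the names `iter_parts` and `strip_final_newline` are hypothetical.

```python
import io


def strip_final_newline(data):
    """Drop the line break that precedes a delimiter; it is not part of the body."""
    if data.endswith(b"\r\n"):
        return data[:-2]
    if data.endswith(b"\n"):
        return data[:-1]
    return data


def iter_parts(stream, boundary):
    """Yield (headers, body) per part from a binary stream (illustrative sketch)."""
    delimiter = b"--" + boundary.encode("ascii")
    closing = delimiter + b"--"
    state = "preamble"            # preamble -> headers -> body -> headers ...
    headers, body = {}, []
    for raw in stream:            # binary streams iterate line by line
        marker = raw.rstrip(b" \t\r\n")
        if marker in (delimiter, closing):
            if state != "preamble":
                yield headers, strip_final_newline(b"".join(body))
            if marker == closing:
                return            # anything after this is epilogue and is ignored
            state, headers, body = "headers", {}, []
        elif state == "headers":
            if marker == b"":
                state = "body"    # blank line ends the part headers
            else:
                name, _, value = raw.decode("latin-1").partition(":")
                headers[name.strip().lower()] = value.strip()
        elif state == "body":
            body.append(raw)
        # lines seen while still in "preamble" are skipped


message = (
    b"preamble text\r\n"
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="a"\r\n'
    b"\r\n"
    b"first\r\nvalue\r\n"
    b"--XyZ\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"second\r\n"
    b"--XyZ--\r\n"
    b"epilogue\r\n"
)
for part_headers, part_body in iter_parts(io.BytesIO(message), "XyZ"):
    print(part_headers, part_body)
```

For brevity the sketch collects each part body before yielding it. A streaming reader would instead hand out body chunks as they arrive and stop at the next delimiter.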
Key Properties
- Boundary-delimited framing: Parts are separated by a unique boundary token that must not appear within the body content itself.
- Streaming capability: Parts can be read incrementally without buffering the entire message, enabling processing of large payloads.
- Self-describing parts: Each part carries its own headers, making the format composable and extensible.
- Boundary guessing: If the boundary is not provided up front, it can be inferred from the first line of the body (must begin with `--`).
- RFC 2046 length limit: RFC 2046 restricts boundary strings to between 1 and 70 characters, though lenient implementations may accept up to 128.
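The boundary-guessing and length-limit properties can be illustrated together. The sketch below assumes the body has no preamble, so its first line is the first delimiter. The function name `guess_boundary` and both constants are hypothetical; the 128 figure simply mirrors the lenient cap mentioned above.

```python
import io

RFC_2046_MAX_BOUNDARY = 70   # limit stated by RFC 2046
LENIENT_MAX_BOUNDARY = 128   # assumed lenient cap, taken from the text above


def guess_boundary(stream, max_len=LENIENT_MAX_BOUNDARY):
    """Infer the boundary from the first line of a binary stream (sketch)."""
    first = stream.readline().rstrip(b"\r\n")
    if not first.startswith(b"--") or len(first) == 2:
        raise ValueError("body does not start with a boundary line")
    boundary = first[2:]         # drop the leading "--"
    if len(boundary) > max_len:
        raise ValueError("boundary exceeds the accepted length")
    return boundary.decode("ascii")


body = io.BytesIO(b"--abc123\r\nContent-Type: text/plain\r\n\r\nhello\r\n--abc123--\r\n")
print(guess_boundary(body))
```

Passing `max_len=RFC_2046_MAX_BOUNDARY` makes the check strict instead of lenient.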
Related RFCs
- RFC 2046 -- MIME Part Two: Media Types (Section 5.1: Multipart Media Type)
- RFC 7578 -- Returning Values from Forms: multipart/form-data
- RFC 2045 -- MIME Part One: Format of Internet Message Bodies