LSEG: A Segment-Based Protocol for Data Interpretation
ORCID: 0009-0002-7724-5762
02 December 2025
Original language of the article: Russian
Abstract
This paper presents LSEG (Language Segment Encoding), a minimal and extensible segment-based protocol for interpreting data streams. Each segment begins with the byte 0x00, followed by a LANG_ID that selects the parser for the bytes that follow. The protocol does not constrain the internal structure of tables (alphabets) and permits arbitrary interpretation mechanisms, ranging from simple single-byte tables to full-fledged Unicode decoders, binary formats, DSLs (JSON, XML, EDF), and AST representations.
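The segment framing described above can be sketched in a few lines of Python. This is only an illustration under assumptions the abstract does not state: LANG_ID is taken to be a single byte, a payload is taken to run until the next 0x00 byte (so payloads never contain 0x00), and the LANG_ID values 0x01 (ASCII) and 0x02 (UTF-8) as well as the names `iter_segments`, `PARSERS`, and `decode` are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Tuple

SEGMENT_MARKER = b"\x00"  # per the abstract: every segment starts with 0x00


def iter_segments(stream: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (lang_id, payload) pairs from an LSEG-style byte stream.

    Assumptions (not taken from the paper): LANG_ID is one byte and a
    payload extends to the next 0x00 byte or to the end of the stream.
    """
    # Jump to the first marker; this also resynchronizes a reader
    # that starts in the middle of a segment.
    pos = stream.find(SEGMENT_MARKER)
    while pos != -1 and pos + 1 < len(stream):
        lang_id = stream[pos + 1]
        end = stream.find(SEGMENT_MARKER, pos + 2)
        payload = stream[pos + 2:] if end == -1 else stream[pos + 2:end]
        yield lang_id, payload
        pos = end


# Hypothetical LANG_ID assignments, chosen only for this example.
PARSERS: Dict[int, Callable[[bytes], str]] = {
    0x01: lambda payload: payload.decode("ascii"),
    0x02: lambda payload: payload.decode("utf-8"),
}


def decode(stream: bytes) -> List[str]:
    """Dispatch each segment's payload to the parser selected by LANG_ID."""
    return [PARSERS[lang_id](payload) for lang_id, payload in iter_segments(stream)]
```

For example, `decode(b"\x00\x01hello\x00\x02" + "привет".encode("utf-8"))` yields one ASCII string and one UTF-8 string. Because the reader looks for the next 0x00 byte, it can also start in the middle of a stream and pick up at the following segment boundary, which is one plausible reading of the self-synchronization property.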
LSEG provides:
high data compactness (size savings of up to 50% without compression),
improved compressibility (up to 70–80% with gzip/zstd),
stream self-synchronization,
clear separation of structure and interpretation mechanism.
The recommended file extension for files using this protocol is .lseg, and the corresponding MIME type is application/lseg.
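As a small illustration of the recommended designation, the mapping between extension and MIME type can be registered locally with Python's standard `mimetypes` module. This is a sketch only: it assumes the host platform does not already know the type, and the file name `sample.lseg` is hypothetical.

```python
import mimetypes

# Register the extension/MIME pair recommended in the abstract for this process.
mimetypes.add_type("application/lseg", ".lseg")

# guess_type returns a (type, encoding) tuple; the encoding is None here.
mime_type, encoding = mimetypes.guess_type("sample.lseg")
```

The registration applies only to the running process; system-wide association would need to be configured separately.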