= OML Measurement Stream Protocol (OMSP) =

[[TOC(heading=General Documentation, General/*, depth=1)]]

The OML Measurement Stream Protocol is used to describe and transport measurement tuples between Injection Points and Processing/Collection Points. All data injected in a Measurement Point (MP) (with omlc_inject) is timestamped and sent to the destination as a Measurement Stream (MS).

liboml2 provides an API to generate these MSs using this protocol, and either send them to a remote host or store them in a local file.

Upon connection to a collection point, a set of headers is first sent, describing the injection point (protocol version, name, application, local timestamp), along with the schemata of the transported MSs.

Once done, timestamped measurement data is serialised, either using a binary encoding, or a text encoding.

== Generalities ==
There are 5 versions of the OML protocol:

* OMSP V1 was the initial protocol, inherited from OML (version 1!);
* OMSP V2 introduced more precise types (commit:28daef3f), and was released with OML 2.4.0;
* OMSP V3 introduced changes to the binary protocol to support blobs and, incidentally, longer marshalled packets (commit:6d8f0597), and was released with OML 2.5.0;
* OMSP V4 was introduced with OML 2.10.0; its main additions are the support for the definition of new Measurement Points (and Measurement Stream Schemas) at any time, and the ability to inject metadata.
* OMSP V5 is the current, most recent version, implemented since OML 2.11; its main additions are the support for vectors, and the introduction of DOUBLE64_T, an IEEE 754 binary64 type giving more precision when representing doubles in binary mode (vectors only).

The protocol is loosely modelled after HTTP. The client first starts with a few textual headers, then switches into either the text or binary protocol for the serialisation of tuples, following a previously advertised schema. Both modes include contextual information with each tuple. There is no feedback communication from the server.

The client first opens a connection to an OML server and sends a header followed by a sequence of measurement tuples. The header consists of a sequence of key/value pairs representing parameters valid for the whole connection, each terminated by a new line. The headers are also used to declare the schema following which measurement tuples will be streamed. The end of the header section is identified by an empty line. Each measurement tuple is then serialised following the mode selected in the headers. For the text mode, this is a series of newline-terminated strings containing tab-separated text-based serialisations of each element, while the binary mode encodes the data following a specific marshalling. Clients end the session by simply closing the TCP connection.

=== !Key/Value Parameters ===
The connection is initially configured by setting the values of a few properties, using a key/value model. The properties (and their keys) are the following.

* protocol: OMSP version, as specified in this document. The oml2-server currently supports 1–4;
* domain (experiment-id in V<4): string identifying the experimental domain (should match /[-_A-Za-z0-9]+/);
* start-time (start_time in V<4): local UNIX time in seconds taken at the time the header is being sent (see gettimeofday(3)); the server uses this information to rebase timestamps within its own timeline;
* sender-id: string identifying the source of this stream (should match /[_A-Za-z0-9]+/);
* app-name: string identifying the application producing the measurements (should match /[_A-Za-z0-9]+/); in the storage backend, this may be used to identify specific measurement collections (e.g., tables in SQL);
* content: encoding of forthcoming tuples, can be either binary for the binary protocol or text for the text protocol.
* schema: describes the schema of each measurement stream.

These parameters can only be set as part of the headers, and are not valid once the server expects serialised measurements (V<4).
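The header block described above can be sketched as follows. This is a minimal illustration, not the liboml2 implementation: the helper name `build_headers` is hypothetical, and the `key: value` formatting of each line is an assumption modelled on HTTP-style headers. For V<4, the `domain` key would be `experiment-id` instead.

```python
import time


def build_headers(domain, sender_id, app_name, schemas,
                  content="text", protocol=4, start_time=None):
    """Return an OMSP-style header block, ending with the empty line.

    Assumption: each header is written as "key: value" on its own line.
    """
    if content not in ("text", "binary"):
        raise ValueError("content must be 'text' or 'binary'")
    if start_time is None:
        start_time = int(time.time())  # local UNIX time, in seconds
    lines = [
        f"protocol: {protocol}",
        f"domain: {domain}",
        f"start-time: {start_time}",
        f"sender-id: {sender_id}",
        f"app-name: {app_name}",
    ]
    # One schema header per measurement stream
    lines += [f"schema: {schema}" for schema in schemas]
    lines.append(f"content: {content}")
    # The empty line marks the end of the header section
    return "\n".join(lines) + "\n\n"
```

The returned string would be written to the TCP connection before any serialised tuple.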

Since V>=4, key/value metadata can be sent along with tuples using schema 0. The other key/value parameters presented here are all invalid in schema 0 and will be rejected by the server, except for the key schema itself, which allows schemata to be (re)defined (XXX not including schema 0?).

=== Time-stamping and book-keeping ===
Regardless of the mode (binary or text), each measurement tuple is prefixed with some per-sample metadata.

Prior to serialising tuples according to their schema, three elements are inserted.

* timestamp: a double timestamp in seconds relative to the start-time sent in the headers, the server uses this information to rebase timestamps within its own timeline;
* stream_id: an integer (marshalled specifically as a uint8_t in binary mode) indicating which previously defined schema this tuple follows;
* seq_no: an int32 monotonically increasing sequence number in the context of this measurement stream.

The order of these fields varies depending on the mode (text or binary).
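As an illustration of this per-sample metadata, the sketch below keeps one sequence counter per stream and computes timestamps relative to the start-time header. The class name `SampleStamper` is hypothetical, and starting sequence numbers at 0 is an assumption.

```python
import time


class SampleStamper:
    """Produce the (timestamp, stream_id, seq_no) triple for each sample."""

    def __init__(self, start_time):
        self.start_time = start_time  # value sent in the start-time header
        self.next_seq = {}            # one counter per measurement stream

    def stamp(self, stream_id, now=None):
        if now is None:
            now = time.time()
        # Assumption: sequence numbers start at 0 for each stream
        seq_no = self.next_seq.get(stream_id, 0)
        self.next_seq[stream_id] = seq_no + 1
        # The timestamp is relative to start-time, in seconds
        return (now - self.start_time, stream_id, seq_no)
```

The triple is returned in one fixed order here; a serialiser would reorder the fields as required by the text or binary mode.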

== OMSP Binary Marshalling ==
Marshalling is done directly into Mbuffers. A marshalled packet is structured as follows.

A header is first populated using marshal_init(), depending on the expected length. Both header types start with two SYNC_BYTEs.

A short header (OMB_DATA_P), created by marshal_header_short(), is 5 bytes long.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   SYNC_BYTE   ||   SYNC_BYTE   ||  OMB_DATA_P   ||   msg-len-H   ||
||   msg-len-L  || || || ||


A long header (OMB_LDATA_P), created by marshal_header_long(), allows two more bytes for the length.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   SYNC_BYTE   ||   SYNC_BYTE   ||  OMB_LDATA_P  ||   msg-len-HH  ||
||   msg-len-HL  ||   msg-len-LH  ||   msg-len-LL  || ||
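The two header layouts can be sketched with Python's `struct` module. The numeric values of SYNC_BYTE, OMB_DATA_P and OMB_LDATA_P are not given in this document, so the ones below are placeholders, and the function name `packet_header` is hypothetical. Switching to the long header once the length no longer fits in 16 bits is an assumption derived from the field widths.

```python
import struct

# Placeholder values: the real constants are defined by liboml2
SYNC_BYTE = 0xAA
OMB_DATA_P = 0x01
OMB_LDATA_P = 0x02


def packet_header(msg_len):
    """Return a short (5-byte) or long (7-byte) packet header."""
    if msg_len <= 0xFFFF:
        # Two sync bytes, packet type, 16-bit length, most significant byte first
        return struct.pack(">BBBH", SYNC_BYTE, SYNC_BYTE, OMB_DATA_P, msg_len)
    # Two sync bytes, packet type, 32-bit length, most significant byte first
    return struct.pack(">BBBI", SYNC_BYTE, SYNC_BYTE, OMB_LDATA_P, msg_len)
```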

This header is followed by metadata about its content: one byte indicating the number of measurements, and one byte giving the MS number (identifying the schema). The former is updated at the end by marshal_finalize(); that function also takes care of promoting a short header to a long one if too much data was written into the packet.

||  0  ||  1  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 ||
||   num-meas    ||   ms-index    ||

Then, num-meas marshalled values follow, respecting the schema defined by ms-index. The values are marshalled after a one-byte header identifying their type (OmlValueT).

The first two values are always a 32-bit integer representing the message sequence number, and a double representing the injection timestamp. marshal_measurements() should be called to add these two elements. Then, marshal_values() is used to marshal the array of OmlValue; it marshals each of them with marshal_value().

32-bit integers (INT32_T and UINT32_T; and longs, LONG_T) are put on the wire verbatim, in network byte order.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||  (U)INT32_T   ||  int-byte-HH  ||  int-byte-HL  ||  int-byte-LH  ||
||  int-byte-LL  || || || ||

The same goes for 64-bit integers (INT64_T and UINT64_T). GUIDs, introduced with OMSPv4, are marshalled in the same way, but use the GUID_T type.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||  (U)INT64_T    ||  int-byte-HHH  ||  int-byte-HHL  ||  int-byte-HLH  ||
||  int-byte-HLL  ||  int-byte-LHH  ||  int-byte-LHL  ||  int-byte-LLH  ||
||  int-byte-LLL  || || || ||
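A minimal sketch of the integer layouts above: one type byte followed by the value in network byte order. The type codes are hypothetical placeholders (the real OmlValueT codes are defined by liboml2), as is the function name `pack_int`.

```python
import struct

# Hypothetical type codes, for illustration only
INT32_T, UINT32_T, INT64_T, UINT64_T = 0x01, 0x02, 0x03, 0x04

# ">" selects network (big-endian) byte order without padding
_FORMATS = {
    INT32_T: ">Bi",
    UINT32_T: ">BI",
    INT64_T: ">Bq",
    UINT64_T: ">BQ",
}


def pack_int(type_code, value):
    """Return the type byte followed by the big-endian integer."""
    return struct.pack(_FORMATS[type_code], type_code, value)
```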

Doubles (DOUBLE_T) are represented with a 4-byte mantissa $M$ and a one-byte exponent $x$, so that $v=\frac{2^xM}{2^{30}}$. In case the conversion fails, DOUBLE_NAN is used as a type instead of DOUBLE_T.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   DOUBLE_T    ||  mant-byte-HH ||  mant-byte-HL ||  mant-byte-LH ||
||  mant-byte-LL ||   exponent    || || ||
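The mantissa/exponent representation can be illustrated as below. This is a sketch under assumptions: the mantissa is taken to be a signed 32-bit integer and the exponent a signed byte, and `math.frexp` is used to split the value. The function names are hypothetical. Returning `None` stands in for the case where DOUBLE_NAN would be used.

```python
import math
import struct


def double_to_mant_exp(v):
    """Return (M, x) with v approximately 2**x * M / 2**30, or None."""
    if not math.isfinite(v):
        return None
    frac, exp = math.frexp(v)  # v == frac * 2**exp, with 0.5 <= |frac| < 1
    if not -128 <= exp <= 127:
        return None            # exponent does not fit in one signed byte
    return int(frac * (1 << 30)), exp


def mant_exp_to_double(mant, exp):
    """Inverse conversion: 2**exp * mant / 2**30."""
    return math.ldexp(mant, exp - 30)


def pack_double_payload(v):
    """Return the 5 bytes following the type byte, or None on failure."""
    mant_exp = double_to_mant_exp(v)
    if mant_exp is None:
        return None
    return struct.pack(">ib", *mant_exp)
```

With a 30-bit mantissa, the round trip loses precision compared to a native double, which is the motivation for DOUBLE64_T in vectors.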

Strings (STRING_T) and blobs (BLOB_T) are serialised as bytes, with the second byte (i.e., the first after the type) being their length.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||  STRING_T or BLOB_T  ||       n       ||   1st byte    ||      ...      ||
||      ...      ||      ...      ||      ...      ||      ...      ||
||      ...      ||      ...      ||   nth byte    || ||

Boolean values are encoded as a single byte, with a different type depending on their truth value (BOOL_FALSE_T or BOOL_TRUE_T). They were introduced with OMSPv4.

||  0  ||
|| 0 1 2 3 4 5 6 7 ||
||  BOOL_xxx_T   ||
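The string and boolean layouts above could be produced as follows. The type codes are hypothetical placeholders, the function names are illustrative, and encoding strings as UTF-8 is an assumption. The one-byte length field limits the payload to 255 bytes in this sketch.

```python
# Hypothetical type codes, for illustration only
STRING_T, BOOL_FALSE_T, BOOL_TRUE_T = 0x10, 0x12, 0x13


def pack_string(text):
    """Return type byte, one-byte length, then the string bytes."""
    data = text.encode("utf-8")  # assumption: strings are sent as UTF-8
    if len(data) > 255:
        raise ValueError("string does not fit a one-byte length")
    return bytes([STRING_T, len(data)]) + data


def pack_bool(value):
    """Booleans carry no payload: the type byte itself is the value."""
    return bytes([BOOL_TRUE_T if value else BOOL_FALSE_T])
```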

Vectors (VECTOR_T) are represented by the type of the vector elements, then the size of the vector (a 16-bit unsigned integer in network byte order), followed by the values themselves. Vectors were introduced in OMSPv5.

The vector elements are marshalled depending on their type. For vectors of INT32_T or UINT32_T integers, the elements are packed in network byte order as shown below:

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   VECTOR_T    ||   (U)INT32_T    ||      n-H      ||      n-L      ||
||  int[0]-byte-HH  ||  int[0]-byte-HL   ||  int[0]-byte-LH  ||  int[0]-byte-LL  ||
||  int[1]-byte-HH  ||  int[1]-byte-HL   ||  int[1]-byte-LH  ||  int[1]-byte-LL  ||

Similarly for the INT64_T and UINT64_T the elements are packed in network byte order (i.e., with the most significant octet first).

For vectors of boolean values, the elements are represented by one-octet values, each of which must be either BOOL_TRUE_T or BOOL_FALSE_T.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   VECTOR_T    ||     BOOL_T      ||      n-H      ||      n-L      ||
||    bool[0]    ||     bool[1]     ||    bool[2]    ||    bool[3]    ||
||    bool[4]    ||  ||  ||  ||

For vectors of doubles, each element is transferred as an IEEE 754 binary64 value, and the bytes of that value must be in network byte order (IEEE 754 does not specify byte ordering, but Wikipedia suggests that it is reasonable to assume that, for a given host, the endianness of doubles is the same as that of integers).

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   VECTOR_T    ||   DOUBLE64_T    ||      n-H      ||      n-L      ||
||  dbl[0]-MS-byte ||  dbl[0]-byte-7  ||  dbl[0]-byte-6  ||  dbl[0]-byte-5   ||
||  dbl[0]-byte-4  ||  dbl[0]-byte-3  ||  dbl[0]-byte-2  ||  dbl[0]-LS-byte  ||
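A vector of doubles, as laid out above, could be marshalled like this. The type codes are hypothetical placeholders and `pack_double_vector` is an illustrative name. The `>` prefix in the `struct` format gives network byte order for both the 16-bit size and each binary64 element.

```python
import struct

# Hypothetical type codes, for illustration only
VECTOR_T, DOUBLE64_T = 0x20, 0x21


def pack_double_vector(values):
    """Return VECTOR_T, element type, 16-bit size, then binary64 elements."""
    n = len(values)
    if n > 0xFFFF:
        raise ValueError("vector size does not fit in 16 bits")
    # ">" disables padding, so the result is exactly 4 + 8 * n bytes long
    return struct.pack(f">BBH{n}d", VECTOR_T, DOUBLE64_T, n, *values)
```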

See also marshal_init, marshal_header_short, marshal_header_long, marshal_measurements, marshal_values, marshal_finalize

== OMSP Schema Specification ==

Schemas describe the name, type and order of the values defining a sample in a measurement stream.

Schema declarations are a space-delimited sequence of name/type pairs. The name and type in each pair are separated by a colon (:).

Valid types in OMSP are the following.

* int32 (V>=1)
* uint32 (V>=2)
* int64 (V>=2)
* uint64 (V>=2)
* double (V>=2)
* string (V>=1)
* blob (V>=3)
* guid (V>=4)
* bool (V>=4)

OMSP also supports vector types (V>=5), in the form [t] where t is any valid type except for string, blob, or guid.

Additionally, some deprecated values are kept for backwards compatibility, and interpreted in the latest version as indicated. They should not be used in new implementations.

* int (V<2, mapped to int32 in V>=3)
* integer (V<2, mapped to int32 in V>=3)
* long (V<2, clamped and mapped to int32 in V>=3)
* float and real (V<2, mapped to double in V>=3)

A full schema also has a name, prepended to its definition and separated by a space. This must consist of only alphanumeric characters and underscores, and must start with a letter or an underscore, i.e., match /[_A-Za-z][_A-Za-z0-9]*/. The same rule applies to the names of the elements of the schema. Each schema is also associated with a numeric MS identifier, which is used to link it to all associated measurement tuples sent later. In ABNF, a schema is defined as follows.

{{{
schema = ms-id ws schema-name ws field-definition 0*63(ws field-definition)

ms-id = integer
schema-name = letter-or-underscore *letter-or-decimal-or-underscore
field-definition = field-name ":" oml-type

field-name = letter-or-underscore *letter-or-decimal-or-underscore
oml-type = current-oml-type / vector-type / deprecated-oml-type

current-oml-type = vectorisable-oml-type / "string" / "blob" / "guid"
vectorisable-oml-type = "int32" / "uint32" / "int64" / "uint64" / "double" / "bool"
vector-type = "[" vectorisable-oml-type "]"
deprecated-oml-type = "int" / "integer" / "long" / "float" / "real"

integer = 1*decimal
letter-or-decimal-or-underscore = letter / decimal / "_"
letter-or-underscore = letter / "_"

decimal = "0"-"9"
letter = "a"-"z" / "A"-"Z"
ws = " "
}}}

Each client should number its measurement streams sequentially starting from 1 (not 0), and prepend that number to their schema definition. It will later be used to label tuples following this schema, which allows them to be grouped together in the storage backend.
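The schema rules above can be checked with a small parser. This is a sketch rather than the server's parser: `parse_schema` and the constant names are hypothetical, and the limit of 64 fields is read from the ABNF.

```python
import re

NAME_RE = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")
VECTORISABLE = {"int32", "uint32", "int64", "uint64", "double", "bool"}
SCALAR = VECTORISABLE | {"string", "blob", "guid"}
DEPRECATED = {"int", "integer", "long", "float", "real"}


def parse_schema(text):
    """Parse 'ms-id name field:type ...' into (ms_id, name, fields)."""
    parts = text.split(" ")
    if len(parts) < 3:
        raise ValueError("expected an id, a name and at least one field")
    ms_id, name, fields = parts[0], parts[1], parts[2:]
    if not re.fullmatch(r"[0-9]+", ms_id):
        raise ValueError(f"invalid MS identifier: {ms_id!r}")
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid schema name: {name!r}")
    if len(fields) > 64:
        raise ValueError("too many fields")
    parsed = []
    for field in fields:
        field_name, sep, field_type = field.partition(":")
        if not sep or not NAME_RE.fullmatch(field_name):
            raise ValueError(f"invalid field: {field!r}")
        if field_type.startswith("[") and field_type.endswith("]"):
            valid = field_type[1:-1] in VECTORISABLE  # vector types (V>=5)
        else:
            valid = field_type in SCALAR or field_type in DEPRECATED
        if not valid:
            raise ValueError(f"invalid type: {field_type!r}")
        parsed.append((field_name, field_type))
    return int(ms_id), name, parsed
```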

=== Example ===

The following declarations number three measurement streams sequentially from 1.

* `1 generator_sin label:string phase:double value:double`
* `2 generator_lin label:string counter:uint64`
* `3 generator_spectrum label:string distribution:[uint64]`

== Schema 0 (OMSP V>=4) ==
Schema 0 is a specific hard-coded stream for metadata. Its core elements are two fields, named key and value. Data from this stream is stored in the same way as any other data, but its semantics are different in that it only describes and adds information about other measurement streams. Metadata follows a Subject-Key-Value model where the key/value pair is an attribute of a specific subject. Subjects are expressed in dotted notation. The default subject, ., is the experiment itself. Schemas are at the second level, and their fields at the third level (e.g., .a refers to all of schema a, while .a.f refers only to its field f).

To support this, schema 0 is therefore:

{{{
0 _experiment_metadata subject:string key:string value:string
}}}

On the server side, everything gets stored in the _experiment_metadata table. However, additional processing might happen. For example, if key schema is defined for subject . (the experiment root), a new schema is defined at the collection point so new MSs can be sent.
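The dotted subject notation and the resulting schema 0 tuple can be sketched as follows. Both function names are hypothetical, and only the subject/key/value fields are built here; the per-sample metadata is left out.

```python
def metadata_subject(schema=None, field=None):
    """Build a dotted subject: '.', '.schema' or '.schema.field'."""
    if schema is None:
        if field is not None:
            raise ValueError("a field requires a schema")
        return "."  # the experiment itself
    if field is None:
        return f".{schema}"
    return f".{schema}.{field}"


def metadata_tuple(key, value, schema=None, field=None):
    """Return the (subject, key, value) fields of a schema 0 tuple."""
    return (metadata_subject(schema, field), key, value)
```

For instance, `metadata_tuple("schema", "2 generator_lin label:string counter:uint64")` yields a tuple with subject `.` and key `schema`, matching the case described above where a new schema is declared at the experiment root.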

In case of reconnection, the client MUST re-send the headers, as well as all schema 0 metadata with key schema (see OML User-visible API). Other metadata MAY be retransmitted as well. The server MAY store duplicate metadata if this happens.

== OMSP Text Protocol ==
The text protocol is meant to simplify sourcing of measurement streams from applications written in languages which are not supported by the OML library, or where the OML library is considered too heavy. It is primarily envisioned for low-volume streams which do not require additional client-side filtering. There are native instrumentation libraries (liboml2, OML4R, OML4Py), but implementing the protocol from scratch in any language of choice should be straightforward.

The text protocol simply serialises metadata and values of a tuple as one newline-terminated (\n), tab-separated (\t) line per sample.

The textual representation of the types defined above is as follows:

* All numeric types are represented as decimal strings suitable for strtod(3) and siblings; using snprintf(3), with the relevant PRIuN format if needed, should provide good functionality (at least V>=2; as of V<=3, there is no guarantee for the interpretation of non-decimal notations)
* Strings are represented directly (except for the nil-terminator) but some character values require special processing;
  * As the text protocol assigns special meaning to the tab and newline characters, they would confuse the parser if they appeared verbatim. To avoid this, a simple backslash encoding is used: tab characters are represented by the string "`\t`", newlines by the string "`\n`" and backslash itself by the string "`\\`" (V>=4; no other backslash expansion is made TODO what if \whatever is input?);
* BLOBs are encoded using BASE64 encoding and the resulting string is sent. No line breaks are permitted within the BASE64-encoded string (V>=4);
* GUIDs are globally unique IDs used to link different measurements. These are treated as large numbers and thus represented as UINT64, unsigned decimal strings. (V>=4);
* booleans are encoded as any case-insensitive stem of FALSE or TRUE (e.g., fAL, trUe, but generally F and T will suffice), being respectively False or True; any other value is considered True, including '0' (V>=4);
* vectors are encoded as a space-separated list in which the first element is the size of the vector followed by the vector elements themselves. Each vector entry is encoded according to its type as above. (V>=5).
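The encoding rules listed above can be sketched in a few lines. The function names are hypothetical, the mapping from Python types to OMSP types is an illustrative choice, and formatting the timestamp with six decimals is an assumption.

```python
import base64


def encode_text_value(value):
    """Encode one value following the text protocol rules (V>=5)."""
    if isinstance(value, bool):  # tested before int: bool is an int subclass
        return "T" if value else "F"
    if isinstance(value, (bytes, bytearray)):
        # Blobs: BASE64 without line breaks
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, str):
        # Escape the backslash first so later escapes are not doubled
        return (value.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n"))
    if isinstance(value, (list, tuple)):
        # Vectors: size first, then the space-separated elements
        items = [str(len(value))] + [encode_text_value(v) for v in value]
        return " ".join(items)
    return repr(value) if isinstance(value, float) else str(value)


def encode_text_line(timestamp, stream_id, seq_no, values):
    """Return one tab-separated, newline-terminated sample line."""
    fields = [f"{timestamp:.6f}", str(stream_id), str(seq_no)]
    fields += [encode_text_value(v) for v in values]
    return "\t".join(fields) + "\n"
```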

=== Example ===
This example shows two streams, matching the schema and headers examples.

{{{
0.903816 2 0 sample-1  1
0.903904 1 0 sample-1  0.000000  0.000000
1.903944 2 1 sample-2  2
1.903961 1 1 sample-2  0.628319  0.587785
2.460049 2 3 sample-3  3
2.460557 1 3 sample-3  1.256637  0.951057
3.461064 2 4 sample-4  4
3.461103 1 4 sample-4  1.884956  0.951056
}}}