Changes between Initial Version and Version 1 of General/3aProtocol


Timestamp:
Feb 18, 2019, 11:58:43 PM (3 years ago)
Author:
seskar

= OML Measurement Stream Protocol (OMSP) =

The OML Measurement Stream Protocol is used to describe and transport measurement tuples between Injection Points and Processing/Collection Points. All data injected into a Measurement Point (MP) (with omlc_inject) is timestamped and sent to the destination as a Measurement Stream (MS).

liboml2 provides an API to generate these MSs using this protocol and either send them to a remote host or store them in a local file.

Upon connection to a collection point, a set of headers is first sent, describing the injection point (protocol version, name, application, local timestamp), along with the schemata of the transported MSs.

Once this is done, timestamped measurement data is serialised using either a binary encoding or a text encoding.

== Generalities ==
There are five versions of the OML protocol:

* OMSP V1 was the initial protocol, inherited from OML (version 1!);
* OMSP V2 introduced more precise types (commit:28daef3f), and was released with OML 2.4.0;
* OMSP V3 introduced changes to the binary protocol to support blobs and, incidentally, longer marshalled packets (commit:6d8f0597), and was released with OML 2.5.0;
* OMSP V4 was introduced with OML 2.10.0; its main additions are support for defining new Measurement Points (and Measurement Stream Schemas) at any time, and the ability to inject metadata;
* OMSP V5 is the current, most recent version, implemented since OML 2.11; its main additions are support for vectors, and the introduction of DOUBLE64_T, an IEEE 754 binary64 type giving more precision when representing doubles in binary mode (vectors only).

The protocol is loosely modelled after HTTP. The client first sends a few textual headers, then switches to either the text or the binary protocol for the serialisation of tuples, following a previously advertised schema. Both modes include contextual information with each tuple. There is no feedback communication from the server.

The client first opens a connection to an OML server and sends a header followed by a sequence of measurement tuples. The header consists of a sequence of key/value pairs representing parameters valid for the whole connection, each terminated by a new line. The headers are also used to declare the schemata that the streamed measurement tuples will follow. The end of the header section is identified by an empty line. Each measurement tuple is then serialised following the mode selected in the headers. In text mode, this is a series of newline-terminated strings containing tab-separated text-based serialisations of each element, while the binary mode encodes the data following a specific marshalling. Clients end the session by simply closing the TCP connection.

=== !Key/Value Parameters ===
The connection is initially configured by setting the values of a few properties, using a key/value model. The properties (and their keys) are the following.

* protocol: OMSP version, as specified in this document. The oml2-server currently supports 1–4;
* domain (experiment-id in V<4): string identifying the experimental domain (should match /[-_A-Za-z0-9]+/);
* start-time (start_time in V<4): local UNIX time in seconds taken at the time the header is being sent (see gettimeofday(3)); the server uses this information to rebase timestamps within its own timeline;
* sender-id: string identifying the source of this stream (should match /[_A-Za-z0-9]+/);
* app-name: string identifying the application producing the measurements (should match /[_A-Za-z0-9]+/); in the storage backend, this may be used to identify specific measurement collections (e.g., tables in SQL);
* content: encoding of forthcoming tuples, can be either binary for the binary protocol or text for the text protocol;
* schema: describes the schema of each measurement stream.

These parameters can only be set as part of the headers, and are not valid once the server expects serialised measurements (V<4).

Since V>=4, key/value metadata can be sent along with tuples using schema 0. The other key/value parameters presented here are all invalid in schema 0 and will be rejected by the server, except for the key schema itself, which allows schemata to be (re)defined (XXX not including schema 0?).
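
The header block described above can be sketched as follows. This is a minimal illustration assuming the V4 key names; the helper name `build_headers` and the `key: value` separator are assumptions of this sketch, since the text above only states that each pair is terminated by a new line.

```python
import time

def build_headers(domain, sender_id, app_name, schemas, content="text",
                  protocol=4, start_time=None):
    """Build the textual header block sent once at connection time.

    `schemas` is a list of full schema strings such as
    "1 generator_sin label:string phase:double value:double".
    The "key: value" separator is an assumption of this sketch.
    """
    if content not in ("text", "binary"):
        raise ValueError("content must be 'text' or 'binary'")
    if start_time is None:
        start_time = int(time.time())  # local UNIX time in seconds
    lines = [
        f"protocol: {protocol}",
        f"domain: {domain}",
        f"start-time: {start_time}",
        f"sender-id: {sender_id}",
        f"app-name: {app_name}",
    ]
    lines += [f"schema: {s}" for s in schemas]
    lines.append(f"content: {content}")
    # An empty line marks the end of the header section.
    return "\n".join(lines) + "\n\n"
```

The returned string would be written to the TCP connection before any measurement tuple.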

=== Time-stamping and book-keeping ===
Regardless of the mode (binary or text), each measurement tuple is prefixed with some per-sample metadata.

Prior to serialising tuples according to their schema, three elements are inserted.

* timestamp: a double timestamp in seconds relative to the start-time sent in the headers; the server uses this information to rebase timestamps within its own timeline;
* stream_id: an integer (marshalled specifically as a uint8_t in binary mode) indicating which previously defined schema this tuple follows;
* seq_no: an int32 monotonically increasing sequence number in the context of this measurement stream.

The order of these fields varies depending on the mode (text or binary).

== OMSP Binary Marshalling ==
Marshalling is done directly into Mbuffers. A marshalled packet is structured as follows.

A header is first populated using marshal_init(), depending on the expected length. Both header types start with two SYNC_BYTEs.

A short header (OMB_DATA_P), created by marshal_header_short(), is 5 bytes long.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   SYNC_BYTE   ||   SYNC_BYTE   ||  OMB_DATA_P   ||   msg-len-H   ||
||   msg-len-L  || || || ||

A long header (OMB_LDATA_P), created by marshal_header_long(), allows two more bytes for the length.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   SYNC_BYTE   ||   SYNC_BYTE   ||  OMB_LDATA_P  ||   msg-len-HH  ||
||   msg-len-HL  ||   msg-len-LH  ||   msg-len-LL  || ||
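
The two header layouts can be sketched with Python's `struct` module. The numeric values of `SYNC_BYTE`, `OMB_DATA_P` and `OMB_LDATA_P` below are placeholders chosen for illustration (the real constants are defined by liboml2 and are not given here), and `pack_header` is a hypothetical helper, not part of the library.

```python
import struct

# Placeholder values: the actual constants are defined by liboml2
# and are not given in this document.
SYNC_BYTE = 0xAA
OMB_DATA_P = 0x01
OMB_LDATA_P = 0x02

def pack_header(msg_len):
    """Return a short (5-byte) or long (7-byte) packet header."""
    if msg_len < 0 or msg_len > 0xFFFFFFFF:
        raise ValueError("message length out of range")
    if msg_len <= 0xFFFF:
        # SYNC, SYNC, type, 16-bit length, most significant byte first
        return struct.pack("!BBBH", SYNC_BYTE, SYNC_BYTE, OMB_DATA_P, msg_len)
    # SYNC, SYNC, type, 32-bit length, most significant byte first
    return struct.pack("!BBBI", SYNC_BYTE, SYNC_BYTE, OMB_LDATA_P, msg_len)
```

Choosing the header from the final length mirrors the promotion from a short to a long header when a packet grows too large.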

This header is followed by metadata about its content, in the form of one byte indicating the number of measurements, and one byte giving the MS number (identifying the schema). The former is updated at the end by marshal_finalize(); that function also takes care of promoting a short header to a long one if too much data was written into the packet.

||  0  ||  1  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 ||
||   num-meas    ||   ms-index    ||

Then, num-meas marshalled values follow, respecting the schema defined by ms-index. The values are marshalled after a one-byte header identifying their type (OmlValueT).

The first two values are always a 32-bit integer representing the message sequence number, and a double representing the injection timestamp. marshal_measurements() should be called to add these two elements. Then, marshal_values() is used to marshal the array of OmlValue; it marshals each of them with marshal_value().

32-bit integers (INT32_T and UINT32_T; and longs, LONG_T) are put on the wire verbatim, in network byte order.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||  (U)INT32_T   ||  int-byte-HH  ||  int-byte-HL  ||  int-byte-LH  ||
||  int-byte-LL  || || || ||

The same goes for 64-bit integers (INT64_T and UINT64_T). GUIDs, introduced with OMSPv4, are marshalled in the same way, but use the GUID_T type.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||  (U)INT64_T    ||  int-byte-HHH  ||  int-byte-HHL  ||  int-byte-HLH  ||
||  int-byte-HLL  ||  int-byte-LHH  ||  int-byte-LHL  ||  int-byte-LLH  ||
||  int-byte-LLL  || || || ||
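
As an illustration of the integer layouts, the sketch below packs a one-byte type code followed by the value in network byte order. The numeric type codes are placeholders (the real OmlValueT values are not listed in this document), and `marshal_int` is a hypothetical helper.

```python
import struct

# Placeholder type codes: the real OmlValueT values are not listed here.
TYPE_CODES = {"INT32_T": 0x01, "UINT32_T": 0x02,
              "INT64_T": 0x03, "UINT64_T": 0x04}
# "!" selects network byte order without padding.
FORMATS = {"INT32_T": "!Bi", "UINT32_T": "!BI",
           "INT64_T": "!Bq", "UINT64_T": "!BQ"}

def marshal_int(type_name, value):
    """Pack a one-byte type code followed by the big-endian integer."""
    return struct.pack(FORMATS[type_name], TYPE_CODES[type_name], value)
```

A 32-bit value therefore occupies 5 bytes on the wire and a 64-bit value 9 bytes, matching the tables above.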

Doubles (DOUBLE_T) are represented with a 4-byte mantissa $M$ and a one-byte exponent $x$, so that $v=\frac{2^xM}{2^{30}}$. In case the conversion fails, DOUBLE_NAN is used as a type instead of DOUBLE_T.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
||   DOUBLE_T    ||  mant-byte-HH ||  mant-byte-HL ||  mant-byte-LH ||
||  mant-byte-LL ||   exponent    || || ||
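
The mantissa/exponent formula can be worked through in a few lines. This sketch assumes that both the mantissa and the exponent are signed and that the mantissa is obtained by truncation; the function names are hypothetical and the code is not taken from liboml2.

```python
import math
import struct

def encode_double(v):
    """Return (mantissa, exponent) with v ~= mantissa * 2**exponent / 2**30.

    Returns None when the value cannot be represented (NaN, infinity, or
    an exponent that does not fit in one signed byte); a marshaller would
    then fall back to DOUBLE_NAN.
    """
    if math.isnan(v) or math.isinf(v):
        return None
    frac, exp = math.frexp(v)      # v == frac * 2**exp, 0.5 <= |frac| < 1
    if not -128 <= exp <= 127:
        return None
    return int(frac * (1 << 30)), exp

def decode_double(mantissa, exponent):
    """Invert encode_double using v = 2**x * M / 2**30."""
    return math.ldexp(mantissa, exponent - 30)

def pack_double_body(mantissa, exponent):
    """Pack the 4-byte mantissa and 1-byte exponent (type byte omitted)."""
    return struct.pack("!ib", mantissa, exponent)
```

For example, 1.0 splits into a fraction of 0.5 and an exponent of 1, giving a mantissa of 2^29; decoding yields 2^29 * 2^1 / 2^30 = 1.0. With a 30-bit mantissa, the round trip only preserves roughly nine significant decimal digits.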

Strings (STRING_T) and blobs (BLOB_T) are serialised as bytes, with the second byte (i.e., first after the type), being their length.

||  0  ||  1  ||  2  ||  3  ||
|| 0 1 2 3 4 5 6 7 || 8 9 0 1 2 3 4 5 || 6 7 8 9 0 1 2 3 || 4 5 6 7 8 9 0 1 ||
--+---------------+---------------+---------------+---------------+
  |STRING_T|BLOB_T|       n       |   1st byte    |               |
--+---------------+---------------+---------------+---------------+
  |               |              ...              |               |
  +---------------+---------------+---------------+---------------+
  |              ...              |   nth byte    |
  +---------------+---------------+---------------+--
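
A length-prefixed payload as drawn above can be sketched as follows. The type codes are placeholders, the UTF-8 encoding of strings is an assumption of this sketch, and `marshal_bytes` and `marshal_string` are hypothetical helpers.

```python
import struct

# Placeholder type codes (actual values are defined by liboml2).
STRING_T = 0x10
BLOB_T = 0x11

def marshal_bytes(type_code, data):
    """Pack the type byte, one length byte, then the payload bytes."""
    if len(data) > 255:
        raise ValueError("payload does not fit a one-byte length")
    return struct.pack("!BB", type_code, len(data)) + data

def marshal_string(text):
    """Marshal a string without any terminator (UTF-8 is an assumption)."""
    return marshal_bytes(STRING_T, text.encode("utf-8"))
```

With a single length byte, a payload longer than 255 bytes cannot be described by this layout, which is why the sketch rejects it.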
Boolean values are encoded as a single byte, with a different type depending on their truth value (BOOL_FALSE_T or BOOL_TRUE_T). They were introduced with OMSPv4.

--+---------------+--
  |  BOOL_xxx_T   |
--+---------------+--

Vectors (VECTOR_T) are represented by the type of the vector elements, then the size of the vector (a sixteen-bit unsigned integer in network byte order), followed by the vector values themselves. Vectors were introduced in OMSPv5.

The vector elements are marshalled depending on their type. For vectors of INT32_T or UINT32_T integers, the elements are packed in network byte order as shown below:

--+---------------+-----------------+---------------+---------------+
  |   VECTOR_T    |   (U)INT32_T    |      n-H      |      n-L      |
--+---------------+-----------------+---------------+---------------+
  |int[0]-byte-HH |int[0]-byte-HL   |int[0]-byte-LH |int[0]-byte-LL |
  +---------------+-----------------+---------------+---------------+--
  |int[1]-byte-HH |int[1]-byte-HL   |int[1]-byte-LH |int[1]-byte-LL |
  +---------------+-----------------+---------------+---------------+--

Similarly, for INT64_T and UINT64_T, the elements are packed in network byte order (i.e., with the most significant octet first).

For vectors of boolean values, the vector elements are represented by one-octet values which must be either BOOL_TRUE_T or BOOL_FALSE_T.

--+---------------+-----------------+---------------+---------------+
  |   VECTOR_T    |     BOOL_T      |      n-H      |      n-L      |
--+---------------+-----------------+---------------+---------------+
  |    bool[0]    |     bool[1]     |    bool[2]    |    bool[3]    |
  +---------------+-----------------+---------------+---------------+
  |    bool[4]    |
  +---------------+--

For vectors of doubles, an IEEE 754 binary64 value is transferred for each element, and its bytes are required to be in network byte order (IEEE 754 does not specify byte ordering, but Wikipedia suggests that it is reasonable to assume that, for a given host, the endianness of doubles is the same as for integers).

--+---------------+-----------------+---------------+---------------+
  |   VECTOR_T    |   DOUBLE64_T    |      n-H      |      n-L      |
--+---------------+-----------------+---------------+---------------+
  |dbl[0]-MS-byte |  dbl[0]-byte-7  | dbl[0]-byte-6 | dbl[0]-byte-5 |
  +---------------+-----------------+---------------+---------------+--
  | dbl[0]-byte-4 |  dbl[0]-byte-3  | dbl[0]-byte-2 |dbl[0]-LS-byte |
  +---------------+-----------------+---------------+---------------+--
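
The vector layouts share one shape: the vector type, the element type, a 16-bit count, then the packed elements. A minimal sketch, with placeholder type codes and a hypothetical helper name:

```python
import struct

# Placeholder type codes (actual values are defined by liboml2).
VECTOR_T = 0x20
UINT32_T = 0x02
DOUBLE64_T = 0x21

def marshal_vector(elem_code, elem_fmt, values):
    """Pack VECTOR_T, the element type, a 16-bit count, then the elements.

    elem_fmt is a struct format character such as "I" (uint32) or "d"
    (IEEE 754 binary64); the "!" prefix forces network byte order.
    """
    if len(values) > 0xFFFF:
        raise ValueError("too many elements for a 16-bit size")
    header = struct.pack("!BBH", VECTOR_T, elem_code, len(values))
    return header + struct.pack(f"!{len(values)}{elem_fmt}", *values)
```

For instance, `marshal_vector(DOUBLE64_T, "d", [1.0])` yields the 4-byte vector header followed by the eight bytes `3f f0 00 00 00 00 00 00`, most significant byte first.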

See also: marshal_init, marshal_header_short, marshal_header_long, marshal_measurements, marshal_values, marshal_finalize

== OMSP Schema Specification ==

Schemas describe the name, type and order of the values defining a sample in a measurement stream.

Schema declarations are a space-delimited sequence of name/type pairs. The name and type in each pair are separated by a colon (:).

Valid types in OMSP are the following.

* int32 (V>=1)
* uint32 (V>=2)
* int64 (V>=2)
* uint64 (V>=2)
* double (V>=2)
* string (V>=1)
* blob (V>=3)
* guid (V>=4)
* bool (V>=4)

OMSP also supports vector types (V>=5), in the form [t] where t is any valid type except for string, blob, or guid.

Additionally, some deprecated type names are kept for backwards compatibility, and interpreted in the latest version as indicated. They should not be used in new implementations.

* int (V<2, mapped to int32 in V>=3)
* integer (V<2, mapped to int32 in V>=3)
* long (V<2, clamped and mapped to int32 in V>=3)
* float and real (V<2, mapped to double in V>=3)

A full schema also has a name, prepended to its definition and separated by a space. This must consist of only alphanumeric characters and underscores and must start with a letter or an underscore, i.e., matching /[_A-Za-z][_A-Za-z0-9]*/. The same rule applies to the names of the elements of the schema. Each schema is also associated with a numeric MS identifier, which is used to link it to all associated measurement tuples sent later. In ABNF, a schema is defined as follows.

{{{
schema = ms-id ws schema-name ws field-definition 0*63(ws field-definition)

ms-id = integer
schema-name = 1*letter-or-decimal-or-underscore
field-definition = field-name ":" oml-type

field-name = 1*letter-or-decimal-or-underscore
oml-type = current-oml-type / vector-type / deprecated-oml-type

current-oml-type = vectorisable-oml-type / "string" / "blob" / "guid"
vectorisable-oml-type = "int32" / "uint32" / "int64" / "uint64" / "double" / "bool"
vector-type = "[" vectorisable-oml-type "]"
deprecated-oml-type = "int" / "integer" / "long" / "float" / "real"

integer = 1*decimal
letter-or-decimal-or-underscore = letter / decimal / "_"

decimal = "0"-"9"
letter = "a"-"z" / "A"-"Z"
ws = " "
}}}

Each client should number its measurement streams sequentially starting from 1 (not 0), and prepend that number to their schema definition. It will later be used to label tuples following this schema, and allows them to be grouped together in the storage backend.
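
A schema string following these rules can be parsed with a short routine. This is a sketch only: it follows the prose rules (names start with a letter or underscore; string, blob and guid are not vectorisable; at most 64 fields), and the function names are hypothetical.

```python
import re

NAME_RE = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")
SCALAR_TYPES = {"int32", "uint32", "int64", "uint64", "double",
                "string", "blob", "guid", "bool"}
VECTORISABLE = SCALAR_TYPES - {"string", "blob", "guid"}
DEPRECATED = {"int": "int32", "integer": "int32", "long": "int32",
              "float": "double", "real": "double"}

def parse_type(text):
    """Validate one type name, mapping deprecated names to current ones."""
    if text.startswith("[") and text.endswith("]"):
        if text[1:-1] not in VECTORISABLE:
            raise ValueError(f"invalid vector element type: {text[1:-1]}")
        return text
    if text in DEPRECATED:
        return DEPRECATED[text]
    if text not in SCALAR_TYPES:
        raise ValueError(f"unknown type: {text}")
    return text

def parse_schema(line):
    """Parse "ms-id name field:type ..." into (ms_id, name, fields)."""
    ms_id, name, *pairs = line.split(" ")
    if (not re.fullmatch(r"[0-9]+", ms_id) or not NAME_RE.fullmatch(name)
            or not 1 <= len(pairs) <= 64):
        raise ValueError(f"malformed schema: {line}")
    fields = []
    for pair in pairs:
        field, _, type_name = pair.partition(":")
        if not NAME_RE.fullmatch(field):
            raise ValueError(f"invalid field name: {field}")
        fields.append((field, parse_type(type_name)))
    return int(ms_id), name, fields
```

Mapping deprecated names at parse time keeps the rest of a collector free of legacy type handling.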

== Example ==

1 generator_sin label:string phase:double value:double
2 generator_lin label:string counter:uint64
3 generator_spectrum label:string distribution:[uint64]

== Schema 0 (OMSP V>=4) ==
Schema 0 is a specific hard-coded stream for metadata. Its core elements are two fields, named key and value. Data from this stream is stored in the same way as any other data, but its semantics differ in that it only describes and adds information about other measurement streams. Metadata follows a Subject-Key-Value model where the key/value pair is an attribute of a specific subject. Subjects are expressed in dotted notation. The default subject, ., is the experiment itself. Schemas are at the second level, and their fields at the third level (e.g., .a refers to all of schema a, while .a.f refers only to its field f).

To support this, schema 0 is therefore:

{{{
0 _experiment_metadata subject:string key:string value:string
}}}

On the server side, everything gets stored in the _experiment_metadata table. However, additional processing might happen. For example, if key schema is defined for subject . (the experiment root), a new schema is defined at the collection point so new MSs can be sent.

In case of reconnection, the client MUST re-send the headers, as well as all schema 0 metadata with key schema (see OML User-visible API). Other metadata MAY be retransmitted as well. The server MAY store duplicate metadata if this happens.
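
The dotted subject notation can be illustrated with a small helper that builds the three values of a schema 0 tuple. The function names are hypothetical and only the subject/key/value layout comes from the description above.

```python
def metadata_subject(schema=None, field=None):
    """Build a dotted subject: "." (experiment), ".a" (schema), ".a.f" (field)."""
    if schema is None:
        if field is not None:
            raise ValueError("a field subject requires a schema")
        return "."
    return f".{schema}" if field is None else f".{schema}.{field}"

def metadata_tuple(key, value, schema=None, field=None):
    """Return the (subject, key, value) values of a schema 0 tuple."""
    return (metadata_subject(schema, field), key, value)
```

Declaring a new schema at run time would then amount to sending `metadata_tuple("schema", "2 generator_lin label:string counter:uint64")` on stream 0, since that key is attached to the experiment root.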

== OMSP Text Protocol ==
The text protocol is meant to simplify sourcing of measurement streams from applications written in languages which are not supported by the OML library or where the OML library is considered too heavy. It is primarily envisioned for low-volume streams which do not require additional client-side filtering. There are native instrumentation libraries (liboml2, OML4R, OML4Py), but implementing the protocol from scratch in any language of choice should be very straightforward.

The text protocol simply serialises the metadata and values of a tuple as one newline-terminated (\n), tab-separated (\t) line per sample.

The textual representation of the types defined above is as follows:

* All numeric types are represented as decimal strings suitable for strtod(3) and siblings; using snprintf(3), with the relevant PRIuN format if needed, should provide good functionality (at least V>=2; as of V<=3, there is no guarantee for the interpretation of non-decimal notations);
* Strings are represented directly (except for the nil-terminator) but some character values require special processing;
  * As the text protocol assigns special meaning to the tab and newline characters, they would confuse the parser if they appeared verbatim. To avoid this, a simple backslash encoding is used: tab characters are represented by the string "`\t`", newlines by the string "`\n`" and backslash itself by the string "`\\`" (V>=4; no other backslash expansion is made TODO what if \whatever is input?);
* BLOBs are encoded using BASE64 encoding and the resulting string is sent. No line breaks are permitted within the BASE64-encoded string (V>=4);
* GUIDs are globally unique IDs used to link different measurements. These are treated as large numbers and thus represented as UINT64, i.e., unsigned decimal strings (V>=4);
* Booleans are encoded as any case-insensitive stem of FALSE or TRUE (e.g., fAL, trUe, but generally F and T will suffice), being respectively False or True; any other value is considered True, including '0' (V>=4);
* Vectors are encoded as a space-separated list in which the first element is the size of the vector followed by the vector elements themselves. Each vector entry is encoded according to its type as above (V>=5).
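
The encoding rules listed above fit in a few lines of Python. This is a sketch under assumptions: the mapping from Python types to OML types is illustrative, the function names are hypothetical, and the field order (timestamp, stream_id, seq_no) is the one assumed for text mode in this sketch.

```python
import base64

def escape_string(s):
    """Backslash-encode backslash, tab and newline (V>=4)."""
    return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

def encode_value(value):
    """Encode one Python value using the textual rules listed above."""
    if isinstance(value, bool):            # must be tested before int
        return "T" if value else "F"
    if isinstance(value, (int, float)):
        return repr(value)                 # decimal string for strtod(3)
    if isinstance(value, str):
        return escape_string(value)
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, (list, tuple)):   # vector: size, then elements
        return " ".join([str(len(value))] + [encode_value(v) for v in value])
    raise TypeError(f"unsupported type: {type(value).__name__}")

def encode_line(timestamp, stream_id, seq_no, values):
    """Serialise one sample as a tab-separated, newline-terminated line."""
    fields = [repr(float(timestamp)), str(stream_id), str(seq_no)]
    fields += [encode_value(v) for v in values]
    return "\t".join(fields) + "\n"
```

Escaping the backslash first matters: doing it last would double the backslashes just introduced for tabs and newlines.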

=== Example ===
This example shows two streams, matching the schema and headers examples. Fields are tab-separated on the wire.

{{{
0.903816 2 0 sample-1  1
0.903904 1 0 sample-1  0.000000  0.000000
1.903944 2 1 sample-2  2
1.903961 1 1 sample-2  0.628319  0.587785
2.460049 2 3 sample-3  3
2.460557 1 3 sample-3  1.256637  0.951057
3.461064 2 4 sample-4  4
3.461103 1 4 sample-4  1.884956  0.951056
}}}
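
On the receiving side, a line such as those in the example can be split back into its metadata and fields. This sketch assumes tab-separated input with the per-sample fields in the order timestamp, stream_id, seq_no; the function names are hypothetical, and unknown backslash sequences are simply kept verbatim.

```python
def unescape_string(s):
    """Reverse the backslash encoding of tab, newline and backslash."""
    mapping = {"t": "\t", "n": "\n", "\\": "\\"}
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s) and s[i + 1] in mapping:
            out.append(mapping[s[i + 1]])
            i += 2
        else:
            out.append(s[i])   # other sequences are kept verbatim
            i += 1
    return "".join(out)

def parse_bool(text):
    """False only for a non-empty, case-insensitive stem of FALSE."""
    return not (text != "" and "FALSE".startswith(text.upper()))

def parse_line(line):
    """Split one sample into (timestamp, stream_id, seq_no, raw_fields)."""
    timestamp, stream_id, seq_no, *fields = line.rstrip("\n").split("\t")
    return float(timestamp), int(stream_id), int(seq_no), fields
```

The raw fields would then be converted one by one according to the schema identified by the stream ID.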