Changes between Initial Version and Version 1 of Developer/6aFuture/Meehua

Jul 15, 2020, 4:37:11 PM



= Mēhua Design Document =
(Spelled “Meehua” with academic macrons [1])

 * [wiki:Meehua/Minutes 2013-05-01 Meeting Minutes]
 * [wiki:Meehua/Types Meehua Types]
 * [wiki:Meehua/Structure Meehua Structure]

== Generic Architecture ==

=== Generic reporting architecture for Meehua ===
The key difference of Meehua compared to OML is in the Processing Points (PP). While OML filtering can only happen at the source, with the results sent to a terminal Collection Point (CP, the oml2-server), Processing Points allow tapping into existing Measurement Streams (MS) and generating new streams. Another key difference is the ability to provide control feedback, either to the reporting chain or to the system itself, based on the output of some processing.
An MS is a series of subsequent tuples following a given schema. Several MSs, coming from the same or different Injection Points (IP), can follow the same schema (e.g., a packet-capture tool running on different nodes, or two TCP streams from different sources to a single Iperf instance).

A snapshot in time of an MS can be seen as an SQL table, and we can envision running queries on it.

== Measurement Points, Schemata and Measurement Streams ==
Each MP outputs samples into its associated MS following a given, static schema. For each new stream, the sender generates a unique identifier.

A schema is defined as a named tuple of elements. An element has a name, a type, and an optional unit. Both schemata and elements also have a storage-dependent name, which can be used as, e.g., valid table and field names for database backends, while retaining the ability to map to and from human-readable names.

Schemata are instantiated as tables, with at least a stream identifier and a timestamp, alongside their primary key. Each row in such a table is a sample tuple from a stream corresponding to that schema.
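The element/schema model above could be sketched as C structures. This is a hypothetical illustration (type and field names such as MeehuaElement are assumptions, not part of the design), populated with the GPS schema used in the example further down.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the metamodel above: an element has a name,
 * a type, an optional unit, and a storage-dependent column name; a
 * schema is a named tuple of elements, also with a storage name. */
typedef enum { MH_INT32, MH_DOUBLE, MH_STRING } MeehuaType;

typedef struct {
  const char *name;        /* human-readable name, e.g. "Longitude" */
  MeehuaType  type;
  const char *unit;        /* optional; NULL when unitless */
  const char *column_name; /* storage-dependent, valid as a DB column */
} MeehuaElement;

typedef struct {
  const char *name;        /* human-readable schema name */
  const char *table_name;  /* storage-dependent, valid as a DB table */
  size_t      n_elements;
  const MeehuaElement *elements;
} MeehuaSchema;

/* Example instantiation: the GPS schema used later in this document. */
static const MeehuaElement gps_elements[] = {
  { "Longitude", MH_DOUBLE, NULL, "Longitude" },
  { "Latitude",  MH_DOUBLE, NULL, "Latitude"  },
  { "Elevation", MH_DOUBLE, NULL, "Elevation" },
};
static const MeehuaSchema gps_schema =
  { "Example GPS", "ExampleGPS", 3, gps_elements };
```

The storage-dependent names double here as column names, keeping the human-readable/storage mapping explicit in both directions.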
'''OM question:''' Does the schema definition in the Schema table include ID, StreamID and TS as explicit elements?

Schemata are defined when receiving new streams. Schema and element identifiers are local to the receiver (but stream IDs are not). The mapping between StreamID and table is done when the schema is defined.

'''OM question:''' How do we do this mapping? By human-readable names? (Probably.) Or by actual schema, regardless of the name? (This might confuse people; e.g., in/out bytes would have the same schema, but not the same semantics.)
== Metadata ==
=== Schema ===
By default, one schema is known: that of Metadata streams. Each metadata sample refers to a specific stream (maybe even its own!), and optionally to a specific element of that stream, and provides a Key-Type-Value piece of information. Metadata can include: domain, node-id, application, command-line parameters, etc.
'''OM comment:''' Should the Metadata schema be declared like all other schemata, only when the storage is initialised? On the one hand, it would support discovery as for the other schemata; on the other hand, it also implies information duplication (but also easy migration and no assumption as to what is available). I think it makes sense to declare it like any other schema.

'''OM question:''' What is the "span" of metadata? It would make sense to have it last until it is replaced by a later (based on timestamp) sample with the same StreamID/ElementID. This might be tricky to capture with SQL statements (SELECT s.field, m.precision FROM stream s, metadata m WHERE s.ts <= m.ts ...?). Or could this be left to the application description?

'''OM comment:''' If metadata is a separate stream, then we have both its StreamID and that of the stream it refers to.
== Relationship to other streams, and propagation ==
One metadata stream might cover several data streams (e.g., one application with multiple MPs).

Metadata are separate streams; they might or might not be propagated alongside the stream(s) they refer to. It might, however, be a good idea to do so when setting up the reporting pipe. In any case, forwarding of the relevant metadata subset is at the discretion of the PPs along the way.
== API ==
The basic API (which OML will also implement) should be provided with a single function:

{{{
  const OmlMP *mp,         /* Measurement point to which the metadata is related (can be NULL) */
  const char *key,         /* Attribute described */
  const OmlValueU *value,  /* Value of that attribute */
  OmlValueT type,          /* Type of that value */
  const char *fname);      /* Optional field to which that metadata relates */
}}}
The mp would be used to set the right StreamID for the tuple created in the metadata stream, while the fname would be passed to a new internal function, {{{int fname2idx(const OmlMP *mp, const char *fname)}}}, to determine the field index in the schema; fname can be NULL (i.e., metadata referring to the whole MP).
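A minimal sketch of the fname2idx helper mentioned above is given below, assuming a simplified stand-in OmlMP that only carries its field names (the real OML structure is richer; this is for illustration only):

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for OML's measurement-point type; the real
 * OmlMP carries more state (streams, filters, ...). */
typedef struct {
  const char  *name;         /* MP name */
  size_t       n_fields;     /* number of fields in the MP's schema */
  const char **field_names;  /* field (element) names */
} OmlMP;

/* Map a field name to its index in the MP's schema; returns -1 when
 * fname is NULL (metadata about the whole MP) or not found. */
static int fname2idx(const OmlMP *mp, const char *fname)
{
  size_t i;
  if (mp == NULL || fname == NULL)
    return -1;
  for (i = 0; i < mp->n_fields; i++)
    if (strcmp(mp->field_names[i], fname) == 0)
      return (int)i;
  return -1;
}
```

A NULL fname maps to -1, matching the "metadata about the whole MP" case above.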
=== Example ===
Taking the example of a GPS location schema, and two streams providing data following this schema, we would end up with the following layout in the storage backend.

Schema table:
|| ID || Name || !TableName ||
|| mdID || Metadata Table || Metadata ||
|| gpsID || Example GPS || ExampleGPS ||
Element table:

|| ID || SchemaID || Name || Type || Unit || ColumnName ||
|| elID1 || mdID || AboutStreamID || int || || AboutStreamID ||
|| elID2 || mdID || ElementID || int || || ElementID ||
|| elID3 || mdID || Key || string || || Key ||
|| elID4 || mdID || Type || type || || Type ||
|| elID5 || mdID || Value || string || || Value ||
|| elID6 || gpsID || Longitude || double || || Longitude ||
|| elID7 || gpsID || Latitude || double || || Latitude ||
|| elID8 || gpsID || Elevation || double || || Elevation ||
Stream table:
|| ID || SchemaID ||
|| sID1 || gpsID ||
|| sID2 || gpsID ||
|| sID3 || mdID ||
|| sID4 || mdID ||
!ExampleGPS table:
|| ID || StreamID || TS || Longitude || Latitude || Elevation ||
|| x1 || sID1 || t1 || long11 || lat11 || el11 ||
|| x2 || sID2 || t2 || long21 || lat21 || el21 ||
|| x3 || sID1 || t3 || long12 || lat12 || el12 ||
|| x4 || sID2 || t4 || long22 || lat22 || el22 ||
|| x5 || sID1 || t5 || long13 || lat13 || el13 ||
And some Metadata:
|| ID || StreamID || AboutStreamID || TS || ElementID || Key || Type || Value ||
|| y1 || sID3 || sID1 || t1 || || sender || string || ||
|| y2 || sID4 || sID2 || t2 || || sender || string || ||
|| y3 || sID4 || sID2 || t6 || || fix || int || 0 ||
|| y4 || sID3 || sID1 || t7 || elID6 || noise || double || no11 ||
== Processing Point ==
The core of a Processing Point consists of three successive functions: an s2t function converting MSs to tables, which periodically triggers the filtering function proper, f, running the query on the table(s), and a final t2s function re-serialising the output as a new MS.
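The three-stage pipeline above can be sketched as a struct of function pointers, with a driver invoking them in order. All type and function names here are assumptions for illustration; the tiny demo stages only record the order in which they are called.

```c
#include <stddef.h>

/* Hypothetical sketch of a PP as the s2t -> f -> t2s pipeline
 * described above; all names are assumptions. */
typedef struct MeehuaTable  MeehuaTable;   /* tabular snapshot of MS samples */
typedef struct MeehuaStream MeehuaStream;  /* a Measurement Stream */

typedef struct {
  void (*s2t)(MeehuaTable *table, const MeehuaStream *in);   /* MS -> table */
  MeehuaTable *(*f)(const MeehuaTable *table);               /* query/filter */
  void (*t2s)(const MeehuaTable *result, MeehuaStream *out); /* table -> MS */
} MeehuaPP;

/* One trigger of the PP (every N samples or every time window):
 * tabulate pending input, run the filter, emit the result stream. */
static void meehua_pp_fire(const MeehuaPP *pp, MeehuaTable *table,
                           const MeehuaStream *in, MeehuaStream *out)
{
  pp->s2t(table, in);
  MeehuaTable *result = pp->f(table);
  pp->t2s(result, out);
}

/* Demo stages that only record the order they were called in. */
static char   pp_call_log[4];
static size_t pp_call_n;
static void demo_s2t(MeehuaTable *t, const MeehuaStream *in)
{ (void)t; (void)in; pp_call_log[pp_call_n++] = 's'; }
static MeehuaTable *demo_f(const MeehuaTable *t)
{ pp_call_log[pp_call_n++] = 'f'; return (MeehuaTable *)t; }
static void demo_t2s(const MeehuaTable *r, MeehuaStream *out)
{ (void)r; (void)out; pp_call_log[pp_call_n++] = 't'; }
```

Keeping the three stages as separate function pointers lets the table format and the trigger policy (count- or time-based) vary independently of the filter itself.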
'''OM question:''' Do we allow filtering across MSs from different domains? Can a PP create a stream for a different domain? (Probably.)

'''OM question:''' Can different applications generate the same schema? (Probably.)
MSs (or subsets of their columns) are aggregated into tables. Every so often (after a number of elements, or a time window), a filter is run on these joined tables, creating new tuples which are then reconverted into an output MS.

A unique identifier is generated for each stream, which can be used to specifically select data from a given stream within a pool of several matching the same schema.
'''OM comment:''' We could just use their sender-id/domain, but this is not specific enough to uniquely identify them; perhaps the lib should give a UUID to each stream when it is created. This would, however, create the problem of not knowing stream IDs in advance, or after a restart of the sender.
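One possible sketch of such identifier generation, combining sender-id, domain, and a local monotonic counter (illustrative only; the UUID approach discussed above would replace the counter, and the counter variant indeed suffers from the restart problem noted):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-stream identifier: domain/sender/counter. A real
 * implementation might use a UUID instead, as discussed above. */
static unsigned long mh_stream_counter;

static void mh_make_stream_id(char *buf, size_t len,
                              const char *domain, const char *sender)
{
  snprintf(buf, len, "%s/%s/%lu", domain, sender, ++mh_stream_counter);
}
```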
'''OM comment:''' It seems better to me to group all tuples from MSs with the same schema into the same table, and add columns identifying their source, to support GROUP BY constructs if needed, or simple aggregation otherwise.
Source streams and their metadata are listed as metadata of the created stream, for provenance management.

'''OM comment:''' I'm not sure it makes sense to only expose the columns manipulated by a given filter, as this would either require storing the filtered data subsets separately in ad hoc tables, or doing the column-filtering just before providing the data to the filter, which would incur an additional SELECT-like construct that the filter could very well do on its own.

'''OM comment:''' We might need a specific OML_TIMESTAMP type which, similarly to the OML_KEY_XXX types, would allow carrying semantics about the use of the field and allow automatic filtering, particularly when we have period-based filtering, as the time when, and pace at which, the PP receives measurements is not guaranteed to be correlated with those at which they were created (e.g., with a proxy on the way). PPs should probably add a timestamp by default. Also, I don't think we should rely on protocol-level timestamps (oml_ts_*).
== Control Language ==
 1. Create a new stream (à la StreamSQL [2]) with parameters (as metadata)
 1. Send parameter values as a specific schema to control the PP (as a data stream)

The example above, triggered either when the number of input samples has reached a threshold or every given time period (in seconds), and sending data streams to both a configurable next hop and a backup CP (which could be another PP), could be done as follows.
{{{
CREATE Sx (ws:int, period:double:s, collect:string) \
SELECT A, hist(B), avg(C) FROM SchemaX \
GROUP BY Sid \
WINDOW $ws OR \
PERIOD $period \
COLLECT $collect AND \
COLLECT tcp:BACKUP:3003
}}}
With no additional filtering specification, a first() filter is applied.

The (ws:int, period:double:s, collect:string) part declares parameters configurable at run time by, e.g., sending a data stream with one tuple to the Processing Point with schema Sx_params.
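The run-time reconfiguration described above could look like the following sketch, where a one-tuple Sx_params stream updates the PP's live parameters (struct and function names are illustrative assumptions, not part of the design):

```c
#include <string.h>

/* Hypothetical C view of the Sx_params schema declared by
 * CREATE Sx (ws:int, period:double:s, collect:string). */
typedef struct {
  int    ws;           /* window size, in samples */
  double period;       /* trigger period, in seconds */
  char   collect[64];  /* next-hop collection URI */
} SxParams;

typedef struct {
  SxParams params;     /* live configuration of the Sx PP */
} SxPP;

/* Apply one tuple received on the Sx_params control stream. */
static void sx_apply_params(SxPP *pp, const SxParams *tuple)
{
  pp->params = *tuple;
}
```

Treating the parameter tuple as just another measurement sample keeps the control path on the same transport and schema machinery as the data path.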
== Reporting Chain Instantiation ==
How the chain of PPs is instantiated and controlled is deliberately left out of the scope of Meehua, as this depends on the use cases. This can, for example, be left to a Resource Proxy in OMF.

However, some corner cases are not clear-cut as to whether they belong to the PP (and its control language) or to the control framework. For example, for sample-based measurement, it might be necessary to count only those samples which match a specific criterion (WHERE clause) before triggering a query based on this criterion, to get the desired number of matching samples (1). Another example is the case of a PP only interested in a limited subset of an MS (e.g., to save capacity), in which case filtering should be done at the upstream PP generating the stream, rather than at the downstream one (2). Some syntactic sugar for these purposes can be introduced, such as extending the FROM clause to specify filtering criteria and/or subsets of an MS deemed relevant.

'''OM/MO questions:''' The question lies in the fact that such upstream communication is not currently envisioned, and would probably require a new, different protocol, as well as create scalability issues. How do we do this? Should the control framework extend the filtering language to properly instantiate upstream PPs (e.g., ... FROM stream(b<1))?

This is also a concern for authorisation and authentication. The former should probably be part of the PP, to support the latter for the control framework.
== API ==
The Meehua API should be conceptually compatible with OML's (i.e., we should be able to write an oml-comp library in a few lines of code, to support easy migration).

However, it should be reentrant and thread-safe. In particular, it should manipulate an initial context, to which connections, MP definitions, and buffers would be attached. Also, the parametrisation of the library should be more modular than omlc_init(), to avoid having to create fake command-line arguments, though helper functions to parse those should still be available.
{{{
meehua_context ctx = meehua_init(app_name);
meehua_config_argv(ctx, &argc, &argv); // also does meehua_config_env, and internally calls meehua_set_[nodeid,domain,...]
meehua_start(ctx); // probably cannot set nodeid and others after that, but can declare new MPs

meehua_terminate(ctx); // does the same as the two following commands
}}}
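The lifecycle above can be exercised with stub implementations; everything below is a stand-in to illustrate the intended call sequence and the handle-style context, not the real library:

```c
#include <stdlib.h>

/* Stub context: meehua_context is assumed to be a handle type. */
typedef struct meehua_context_s {
  const char *app_name;
  int         started;
} *meehua_context;

static meehua_context meehua_init(const char *app_name)
{
  meehua_context ctx = malloc(sizeof *ctx);
  ctx->app_name = app_name;
  ctx->started  = 0;
  return ctx;
}

/* Would parse command-line options and the environment; no-op here. */
static void meehua_config_argv(meehua_context ctx, int *argc, char ***argv)
{ (void)ctx; (void)argc; (void)argv; }

static void meehua_start(meehua_context ctx) { ctx->started = 1; }

static void meehua_terminate(meehua_context ctx) { free(ctx); }
```

Attaching all state to the context, rather than to library globals, is what makes the API reentrant and thread-safe as required above.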
== High Level Use Cases ==
What do people want of Meehua?

'''OM comment:''' This section is still very young and needs refinement.

=== Stakeholders ===
 * Platform provider
=== Requirements ===
 * Platform provider
   * Monitor the platform
   * Give access to a subset of the information to the experimenter
     * Limited scope
     * Aggregated data
 * Experimenter/User
   * Collect contextual data (relevant platform health)
   * Get own data
   * Keep experimental data to themselves
== References ==
* architecture.png - Generic reporting architecture for Mēhua (4.54 KB) Olivier Mehani, 14/01/2013 05:56 PM
* pp.png (18 KB) Olivier Mehani, 15/01/2013 04:16 PM
* schemata.png (21.2 KB) Olivier Mehani, 16/01/2013 06:45 PM
* ExampleTableGPS.png (9.17 KB) Olivier Mehani, 16/01/2013 06:45 PM
* meehua_design_IMG_20130116_112220.jpg (1 MB) Olivier Mehani, 04/02/2013 05:22 PM
* meehua_pp_control_IMG_20130130_123507.jpeg (68.9 KB) Olivier Mehani, 04/02/2013 05:22 PM
* meehua_pp_control_IMG_20130130_123513.jpeg (64.7 KB) Olivier Mehani, 04/02/2013 05:22 PM