Reading variable-length event sources

Some data sources, such as mod_udp, only know how to fetch data in fixed-size blocks. The modules that take those data and dissect them into OrchIDS events, on the other hand, may require a variable number of bytes. The mod_utils.[ch] files provide a general-purpose API to solve this impedance mismatch problem: the blox API.

There are three simple ways the length can be specified, in principle:

  • implicitly: all required data blocks have exactly the same size (this seems to be the case of no event record format known to OrchIDS);
  • explicitly: the first few bytes, say, contain the number of bytes to be read (surprisingly, no event record format known to OrchIDS does something as simple as that);
  • by end of record character: every byte until the terminator character is taken to form the data block (e.g., the mod_bintotext module works this way, considering the newline character '\n' as terminator).

The blox API was initially meant to solve the problem in the context of the mod_openbsm module, for which finding the length is a bit more complicated. However, it is suited to solve the length problem in all three cases above as well, as we shall see, partly, at the end of this post.

Automata

Let us explain the mod_openbsm case in more detail: the first byte is a type tag, and depending on that type, we find the length in different ways. There are 5 legal tag values. In the first 4 cases, the length is given by the next 4 bytes, in big-endian format (this length includes the 5 bytes already read). In the fifth case, the next 8 bytes are a time value, and the following 2 bytes hold the length of the subsequent file name (excluding the 11 bytes already read) in big-endian format.
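The two length rules above boil down to decoding big-endian integers at fixed offsets. Here is a minimal sketch of that arithmetic; the helper names be32(), be16(), header_record_length() and file_record_length() are made up for illustration and are not part of mod_openbsm or of the blox API.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit big-endian unsigned integer starting at p. */
static uint32_t be32(const unsigned char *p)
{
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
       | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode a 16-bit big-endian unsigned integer starting at p. */
static uint16_t be16(const unsigned char *p)
{
  return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/* First 4 cases: 1 tag byte, then a 4-byte length that already
   counts the 5 bytes read so far. */
static size_t header_record_length(const unsigned char *rec)
{
  return (size_t)be32(rec + 1);
}

/* Fifth case: 1 tag byte, 8 bytes of time, then a 2-byte file name
   length m that does not count the 11 bytes read so far. */
static size_t file_record_length(const unsigned char *rec)
{
  return (size_t)be16(rec + 9) + 11;
}
```

Both functions return the total size of the record, so the caller can compare it directly with the number of bytes it has buffered.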

Reading the length, and in fact the whole data block, can be described by the following automaton:

  • There are five states: BLOX_INIT, BLOX_NOT_ALIGNED, BLOX_FINAL (those three are all predefined in the blox library, as numbers 0, 1, 2 respectively), STATE_HEADER, and STATE_FILE_EXPECT_FILENAMELEN (the last two are defined in mod_openbsm).
  • The initial state is BLOX_INIT. In that state, we have read the first byte of the data block.
  • When in state BLOX_INIT, we look at the first byte. There are 5 legal values for this byte.  In the first 4 cases, we go to state STATE_HEADER, and request to read 5 bytes (that is, 4 extra bytes: we have already read 1 byte). In the fifth case, we go to state STATE_FILE_EXPECT_FILENAMELEN, and request to read 11 bytes. If the character read does not match any of the previous cases, we go to state BLOX_NOT_ALIGNED, requesting to resynchronize the data: throw away whatever we have read, re-read one byte and go back to state BLOX_INIT.  (Resynchronizing is done automatically by the blox engine, and is implemented in the provided function blox_dissect().  However, you must describe the other actions.)
  • When in state STATE_HEADER, we have read 5 bytes.  We interpret the last 4 bytes as a length n: we request to read n bytes (including the 5 bytes we have already read), and go to state BLOX_FINAL: we have finished our task, the blox engine will make sure that we have read n bytes, and pass it on (to the subdissector, see below).
  • When in state STATE_FILE_EXPECT_FILENAMELEN, we have read 11 bytes.  We interpret the last 2 bytes as a length m: we request to read m more bytes (excluding the 11 bytes we have already read: so we request to read m+11 bytes in total), and go to state BLOX_FINAL, again.
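The automaton just described can be sketched as a single switch on the current state. This is a reconstruction from the description, not the actual source of openbsm_compute_length: the tag values TAG_HEADER_A to TAG_HEADER_D and TAG_FILE are placeholders, the numeric values of the two module states are assumptions, and so is the value returned when going to BLOX_NOT_ALIGNED.

```c
#include <stddef.h>

/* Predefined by the blox library (values as stated in the text). */
#define BLOX_INIT        0
#define BLOX_NOT_ALIGNED 1
#define BLOX_FINAL       2
/* Module states; these numeric values are assumptions. */
#define STATE_HEADER                  3
#define STATE_FILE_EXPECT_FILENAMELEN 4

/* Placeholder tag values, for illustration only. */
#define TAG_HEADER_A 0xA1
#define TAG_HEADER_B 0xA2
#define TAG_HEADER_C 0xA3
#define TAG_HEADER_D 0xA4
#define TAG_FILE     0xB1

static size_t sketch_compute_length(unsigned char *first_bytes,
                                    size_t n_first_bytes,
                                    size_t available_bytes,
                                    int *state, void *sd_data)
{
  (void)n_first_bytes; (void)available_bytes; (void)sd_data;

  switch (*state) {
  case BLOX_INIT:                 /* 1 byte read: the type tag */
    switch (first_bytes[0]) {
    case TAG_HEADER_A: case TAG_HEADER_B:
    case TAG_HEADER_C: case TAG_HEADER_D:
      *state = STATE_HEADER;
      return 5;                   /* tag + 4-byte length */
    case TAG_FILE:
      *state = STATE_FILE_EXPECT_FILENAMELEN;
      return 11;                  /* tag + 8-byte time + 2-byte length */
    default:
      *state = BLOX_NOT_ALIGNED;  /* the engine resynchronizes */
      return 0;                   /* assumption: value is ignored */
    }
  case STATE_HEADER: {            /* 5 bytes read */
    size_t n = ((size_t)first_bytes[1] << 24) | ((size_t)first_bytes[2] << 16)
             | ((size_t)first_bytes[3] << 8)  |  (size_t)first_bytes[4];
    *state = BLOX_FINAL;
    return n;                     /* n includes the 5 bytes read */
  }
  case STATE_FILE_EXPECT_FILENAMELEN: {   /* 11 bytes read */
    size_t m = ((size_t)first_bytes[9] << 8) | (size_t)first_bytes[10];
    *state = BLOX_FINAL;
    return m + 11;                /* m excludes the 11 bytes read */
  }
  default:                        /* should not happen */
    *state = BLOX_NOT_ALIGNED;
    return 0;
  }
}
```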

The blox API

This automaton is described by a function of the following type, which you must provide (in the case of mod_openbsm, this function is called openbsm_compute_length):

typedef size_t (*compute_length_fun) (unsigned char *first_bytes,
                                      size_t n_first_bytes,
                                      size_t available_bytes,
                                      int *state, /* pointer to blox state */
                                      void *sd_data);

When your function, of that type, is called, first_bytes will point to a memory zone that holds available_bytes bytes; this number is always larger than or equal to the previously requested number of bytes, n_first_bytes. (If you request 11 bytes, the mod_udp module may decide to read 1024 bytes instead, for example.) The integer pointed to by state is the current state. Your function’s task will be to update the state by storing the new state into *state, and return the requested number of bytes to read (5, or 11, for example, in the mod_openbsm case). The pointer sd_data contains private data that is passed to each invocation of your function: do whatever you please with it.
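To make the calling convention concrete, here is a toy driver that calls a length function repeatedly until it reaches BLOX_FINAL. It is emphatically not the real blox engine (which also handles buffering and resynchronization); toy_record_length() and fixed8_compute_length() are hypothetical names, and the latter handles the simplest possible case, where every record is exactly 8 bytes long.

```c
#include <stddef.h>

#define BLOX_INIT        0
#define BLOX_NOT_ALIGNED 1
#define BLOX_FINAL       2

typedef size_t (*compute_length_fun) (unsigned char *first_bytes,
                                      size_t n_first_bytes,
                                      size_t available_bytes,
                                      int *state, void *sd_data);

/* Toy length function: all records have the same size, 8 bytes. */
static size_t fixed8_compute_length(unsigned char *first_bytes,
                                    size_t n_first_bytes,
                                    size_t available_bytes,
                                    int *state, void *sd_data)
{
  (void)first_bytes; (void)n_first_bytes;
  (void)available_bytes; (void)sd_data;
  *state = BLOX_FINAL;
  return 8;
}

/* Toy driver, NOT the real engine.  Returns the length of the complete
   record at the start of buf, or 0 if more data would be needed or the
   stream is not aligned.  Assumes f eventually reaches BLOX_FINAL or
   requests more bytes than are available. */
static size_t toy_record_length(compute_length_fun f,
                                unsigned char *buf, size_t buflen,
                                size_t n_first_bytes, void *sd_data)
{
  int state = BLOX_INIT;
  size_t reqlen = n_first_bytes;

  while (reqlen <= buflen) {
    /* buflen plays the role of available_bytes: it is always at
       least the previously requested number of bytes, reqlen. */
    reqlen = f(buf, reqlen, buflen, &state, sd_data);
    if (state == BLOX_NOT_ALIGNED)
      return 0;
    if (state == BLOX_FINAL)
      return reqlen <= buflen ? reqlen : 0;
  }
  return 0;   /* not enough data yet */
}
```

With a 16-byte buffer, toy_record_length(fixed8_compute_length, buf, 16, 1, NULL) yields 8; with only 4 bytes available it yields 0, meaning that more input is needed before a record can be passed on.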

Once we have reached the BLOX_FINAL state (by storing it into *state), the blox engine will call a subdissector function, which you must provide, too, and is of the following type (in the openbsm case, this is openbsm_subdissect()):

typedef void (*subdissect_fun) (orchids_t *ctx, mod_entry_t *mod,
                                event_t *event,
                                ovm_var_t *delegate,
                                unsigned char *stream,
                                size_t stream_len,
                                void *sd_data,
                                int dissector_level);

When your subdissector is called, stream will hold a pointer to stream_len bytes, holding a complete data block. You must now chop this block in pieces, enriching the event (list of field/value pairs) event. This works just like an ordinary dissector.

The mod value points to the current module (mod_openbsm in our example), sd_data is the same pointer to private data that we mentioned above.

The delegate value is a bit more mysterious.  In the mod_openbsm example again, the data block will be part of an OrchIDS binary string str (or a virtual binary string).  Some of the field/value pairs will include substrings of it.  It is interesting to create those substrings as virtual strings.  For that, the ovm_vstr_new() and ovm_vbstr_new() functions require a delegate: this is the delegate value. Most often, delegate will be the string str. However, if str is itself virtual, delegate might be its own delegate instead.

Finally, the dissector_level value holds the number of nested dissectors called until now on the current data source.  You don’t need to know anything about it, except that you should pass it along to calls to further subdissectors, and to post_event() and REGISTER_EVENTS() (which themselves may call subdissectors, and will do so with a value of dissector_level+1).  The blox API uses it to register itself into the rtactionlist priority queue, with a priority equal to dissector_level*128.  This ensures that blox dissectors consume their input faster than this input is produced.

Only two small tasks remain.  We must write our dissector: this will just be a simple call to the following function, provided by the blox API:

int blox_dissect(orchids_t *ctx, mod_entry_t *mod, event_t *event,
                 void *sd_data, int dissector_level);

For example, the dissector of the mod_openbsm module is:

static int openbsm_dissect (orchids_t *ctx, mod_entry_t *mod,
                            event_t *event, void *data, int dissector_level)
{
  return blox_dissect (ctx, mod, event, data, dissector_level);
}

And we must also make sure that we have initialized an instance of the blox API for each possible matching source, using the following function:

blox_hook_t *init_blox_hook(orchids_t *ctx,
                            blox_config_t *bcfg,
                            char *tag,
                            size_t taglen);

This returns a pointer to a blox_hook_t structure, which holds various buffers and flags.  Each pair of a source and a blox dissector should have its own blox_hook_t structure.  It is therefore natural to call init_blox_hook() for each DISSECT directive.  This is done by installing a pre-dissection hook in the input_module_t structure describing the module we are creating.  For example, the pre-dissector of the mod_openbsm module is:

static void *openbsm_predissect(orchids_t *ctx, mod_entry_t *mod,
                                char *parent_modname,
                                char *cond_param_str,
                                int cond_param_size)
{
  blox_hook_t *hook;

  hook = init_blox_hook (ctx, mod->config, cond_param_str, cond_param_size);
  return hook;
}

The OrchIDS engine will make sure that the hook returned by the pre-dissector will be passed on to the blox API, so that it knows which buffers and flags pertain to which input/dissector pair.

Finally, the bcfg argument to init_blox_hook() holds configuration information for the whole dissector module (not for each one of its instances).  You obtain it by calling:

blox_config_t *init_blox_config(orchids_t *ctx,
                                mod_entry_t *mod,
                                size_t n_first_bytes,
                                compute_length_fun compute_length,
                                subdissect_fun subdissect,
                                void *sd_data
                                );

Here, n_first_bytes is the number of bytes that should be read each time blox_dissect() is called. In the case of the mod_openbsm module, we only need to read one byte. For other formats, we may need to read 4 bytes holding a length, for example.

The compute_length and subdissect function arguments are those we have described above, and this is how we inform the blox engine what those functions are.  Finally, sd_data is the private pointer that will be passed to both.
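As an illustration of a format where n_first_bytes would be 4, consider a hypothetical record layout that starts with a 4-byte big-endian length counting the whole record, itself included. No such format is claimed to exist in OrchIDS; lenprefix_compute_length() is a made-up name, the minimum-length check is our own addition, and the value returned when going to BLOX_NOT_ALIGNED is an assumption.

```c
#include <stddef.h>

#define BLOX_INIT        0
#define BLOX_NOT_ALIGNED 1
#define BLOX_FINAL       2

/* Hypothetical format: 4-byte big-endian total length, then payload.
   Called in state BLOX_INIT once the first 4 bytes are available. */
static size_t lenprefix_compute_length(unsigned char *first_bytes,
                                       size_t n_first_bytes,
                                       size_t available_bytes,
                                       int *state, void *sd_data)
{
  size_t n;

  (void)n_first_bytes; (void)available_bytes; (void)sd_data;
  n = ((size_t)first_bytes[0] << 24) | ((size_t)first_bytes[1] << 16)
    | ((size_t)first_bytes[2] << 8)  |  (size_t)first_bytes[3];
  if (n < 4) {                  /* cannot be shorter than its own header */
    *state = BLOX_NOT_ALIGNED;
    return 0;                   /* assumption: value is ignored */
  }
  *state = BLOX_FINAL;          /* a single step suffices here */
  return n;
}
```

Such a function would presumably be registered by passing 4 as n_first_bytes, and the function itself as compute_length, to init_blox_config().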

Again in the case of the mod_openbsm module, this initialization is done in the preconfiguration function below (the call to register_fields() is meant to register all fields known to mod_openbsm, and is not directly relevant to this post):

static void *openbsm_preconfig(orchids_t *ctx, mod_entry_t *mod)
{
  blox_config_t *bcfg;

  register_fields(ctx, mod, openbsm_fields, OPENBSM_FIELDS);
  bcfg = init_blox_config (ctx, mod, 1,
                           openbsm_compute_length,
                           openbsm_subdissect,
                           NULL);
  return bcfg;
}

Returning bcfg makes sure it will be stored into the config field of the module mod: we retrieve it as mod->config in the call we have made above to init_blox_hook().

Other uses of the blox API

We have said that the blox API could be used for more general purposes.  Let us give the example of the mod_bintotext module, which converts blocks of binary data into sequences of lines terminated by the newline character \n.

In state BLOX_INIT, the bintotext_compute_length() function looks for a newline character \n inside the first_bytes array, of length available_bytes. (We reuse the same argument names as in the length computing function of the mod_openbsm module.) Note that we only request 1 byte, just as in the mod_openbsm case, but available_bytes may be much larger: typically 1024 bytes will be available for binary data coming from a binary file or a UDP socket.

If the newline character was found, then bintotext_compute_length() goes to the BLOX_FINAL state, and returns the offset of the first character past the newline character.  This provides the subdissector with the first line of text, including the final newline.

If the newline character was not found, then bintotext_compute_length() goes to a new state, BLOX_NSTATES + available_bytes. This funny state number is guaranteed to be larger than or equal to BLOX_NSTATES, the number of reserved states (BLOX_INIT, BLOX_FINAL, and BLOX_NOT_ALIGNED). Adding available_bytes to BLOX_NSTATES allows us to remember the number of bytes we had available in which no newline could be found.  (We could also have used the sd_data pointer for that purpose.)  The bintotext_compute_length() function then requests just one byte more, i.e., returns available_bytes+1.

When control is returned to bintotext_compute_length() in some state other than BLOX_INIT, BLOX_FINAL, and BLOX_NOT_ALIGNED (say, BLOX_NSTATES + 1024), with a new value of available_bytes (say, 2048), we look for a newline in the yet unexplored part of the character array first_bytes (i.e., between offsets 1024 included and 2048 excluded, in our example), and we proceed as above.
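The newline search just described can be sketched as follows. This is a reconstruction from the prose, not the actual source of bintotext_compute_length(): the function name is changed to make that clear, BLOX_NSTATES is taken to be 3 (the three reserved states), and the clamping of the start offset is a defensive addition of ours.

```c
#include <stddef.h>
#include <string.h>

#define BLOX_INIT        0
#define BLOX_NOT_ALIGNED 1
#define BLOX_FINAL       2
#define BLOX_NSTATES     3   /* number of reserved states */

static size_t sketch_bintotext_compute_length(unsigned char *first_bytes,
                                              size_t n_first_bytes,
                                              size_t available_bytes,
                                              int *state, void *sd_data)
{
  size_t start = 0;   /* offset of the first byte not yet explored */
  unsigned char *nl;

  (void)n_first_bytes; (void)sd_data;

  /* In a state BLOX_NSTATES + k, the first k bytes have already been
     searched without success: resume the search at offset k. */
  if (*state >= BLOX_NSTATES)
    start = (size_t)(*state - BLOX_NSTATES);
  if (start > available_bytes)
    start = available_bytes;

  nl = memchr(first_bytes + start, '\n', available_bytes - start);
  if (nl != NULL) {
    *state = BLOX_FINAL;
    /* Offset of the first character past the newline. */
    return (size_t)(nl - first_bytes) + 1;
  }

  /* Not found: remember how far we searched, and ask for one more byte. */
  *state = BLOX_NSTATES + (int)available_bytes;
  return available_bytes + 1;
}
```

Encoding the search offset in the state keeps the function free of private data, at the cost of assuming that available_bytes fits in an int.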

When BLOX_FINAL is reached, the blox API will then call our subdissector.  This merely takes the stream_len first bytes of the stream character array and makes them a virtual string, associated with the .bintotext.line field. (We take the same variable names as in the mod_openbsm example; note that we do not subtract 1 from stream_len, so that the trailing newline is kept.)