Some data sources, such as mod_udp, only know how to fetch data in fixed-size blocks. The modules that take that data and dissect it into OrchIDS events, on the other hand, may require a variable number of bytes. The mod_utils.[ch] files provide a general-purpose API to solve this impedance mismatch problem: the blox API.
There are three simple ways the length can be specified, in principle:
- implicitly: all required data blocks have exactly the same size (this seems to be the case of no event record format known to OrchIDS);
- explicitly: the first few bytes, say, contain the number of bytes to be read (surprisingly, no event record format known to OrchIDS does something as simple as that);
- by end-of-record character: every byte up to the terminator character is taken to form the data block (e.g., the mod_bintotext module works this way, with the newline character '\n' as terminator).
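The three conventions above can be sketched as three small length functions. This is only an illustration, not code from OrchIDS: the names implicit_length, explicit_length and terminated_length, the 16-byte record size and the 4-byte big-endian prefix are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record size for the "implicit" case. */
#define FIXED_RECORD_SIZE 16

/* Implicit: every record has the same size, whatever its contents. */
size_t implicit_length(void)
{
  return FIXED_RECORD_SIZE;
}

/* Explicit: the first 4 bytes hold the record length, big-endian
   (an assumed layout). Returns 0 if fewer than 4 bytes are available. */
size_t explicit_length(const unsigned char *buf, size_t avail)
{
  assert(buf != NULL);
  if (avail < 4)
    return 0;
  return (size_t)(((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
                  | ((uint32_t)buf[2] << 8) | (uint32_t)buf[3]);
}

/* Terminator: the record runs up to and including the first '\n'.
   Returns 0 if no terminator is present in the available bytes. */
size_t terminated_length(const unsigned char *buf, size_t avail)
{
  const unsigned char *nl;

  assert(buf != NULL);
  nl = memchr(buf, '\n', avail);
  return nl != NULL ? (size_t)(nl - buf) + 1 : 0;
}
```

In all three functions, a return value of 0 is this sketch's own way of saying "not enough data yet"; the blox API expresses that differently, through states.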
The blox API was initially meant to solve the problem in the context of the mod_openbsm module, for which finding the length is a bit more complicated. However, it is also suited to solving the length problem in all three cases above, as we shall partly see at the end of this post.
Automata
Let us explain the mod_openbsm case in more detail: the first byte is a type tag, and depending on that type, we find the length in different ways. In the first 4 cases, the length is given by the next 4 bytes, in big-endian format (this length includes the 5 bytes already read). In the fifth case, the next 8 bytes are a time value, and the following 2 bytes hold the length of the subsequent file name (excluding the 11 bytes already read) in big-endian format.
Reading the length, and in fact the whole data block, can be described by the following automaton:
- There are five states: BLOX_INIT, BLOX_NOT_ALIGNED, BLOX_FINAL (those three are all predefined in the blox library, as numbers 0, 1, 2 respectively), STATE_HEADER, and STATE_FILE_EXPECT_FILENAMELEN (the last two are defined in mod_openbsm).
- The initial state is BLOX_INIT. In that state, we have read the first byte of the data block.
- When in state BLOX_INIT, we look at the first byte. There are 5 legal values for this byte. In the first 4 cases, we go to state STATE_HEADER, and request to read 5 bytes (that is, 4 extra bytes: we have already read 1 byte). In the fifth case, we go to state STATE_FILE_EXPECT_FILENAMELEN, and request to read 11 bytes. If the byte read does not match any of the previous cases, we go to state BLOX_NOT_ALIGNED, requesting to resynchronize the data: throw away whatever we have read, re-read one byte and go back to state BLOX_INIT. (Resynchronizing is done automatically by the blox engine, and is implemented in the provided function blox_dissect(). However, you must describe the other actions.)
- When in state STATE_HEADER, we have read 5 bytes. We interpret the last 4 bytes as a length n: we request to read n bytes (including the 5 bytes we have already read), and go to state BLOX_FINAL: we have finished our task, and the blox engine will make sure that we have read n bytes, and pass them on (to the subdissector, see below).
- When in state STATE_FILE_EXPECT_FILENAMELEN, we have read 11 bytes. We interpret the last 2 bytes as a length m: we request to read m more bytes (excluding the 11 bytes we have already read, so we request to read m+11 bytes in total), and go to state BLOX_FINAL again.
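The automaton above can be sketched as a single C function. This is a minimal illustration, not the actual openbsm_compute_length: the tag values are placeholders rather than real OpenBSM token identifiers, the numeric values of STATE_HEADER and STATE_FILE_EXPECT_FILENAMELEN are assumed, and returning 0 when going to BLOX_NOT_ALIGNED is a guess, since the text does not say what should be returned in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Predefined by the blox library, with the values quoted in the text. */
enum { BLOX_INIT = 0, BLOX_NOT_ALIGNED = 1, BLOX_FINAL = 2 };

/* Assumed values: anything distinct from the reserved states would do. */
enum { STATE_HEADER = 3, STATE_FILE_EXPECT_FILENAMELEN = 4 };

/* Placeholder tags, NOT the real OpenBSM token identifiers. */
enum { TAG_HDR_A = 0xA1, TAG_HDR_B = 0xA2, TAG_HDR_C = 0xA3,
       TAG_HDR_D = 0xA4, TAG_FILE = 0xB1 };

size_t sketch_compute_length(unsigned char *first_bytes, size_t n_first_bytes,
                             size_t available_bytes, int *state, void *sd_data)
{
  (void)n_first_bytes;
  (void)sd_data;
  assert(first_bytes != NULL && state != NULL);

  switch (*state) {
  case BLOX_INIT:
    assert(available_bytes >= 1);
    switch (first_bytes[0]) {
    case TAG_HDR_A: case TAG_HDR_B: case TAG_HDR_C: case TAG_HDR_D:
      *state = STATE_HEADER;
      return 5;                 /* tag + 4 length bytes */
    case TAG_FILE:
      *state = STATE_FILE_EXPECT_FILENAMELEN;
      return 11;                /* tag + 8 time bytes + 2 length bytes */
    default:
      *state = BLOX_NOT_ALIGNED;
      return 0;                 /* assumed: value ignored on resync */
    }
  case STATE_HEADER:
    assert(available_bytes >= 5);
    *state = BLOX_FINAL;
    /* Bytes 1..4: total length n, big-endian, header included. */
    return (size_t)(((uint32_t)first_bytes[1] << 24)
                    | ((uint32_t)first_bytes[2] << 16)
                    | ((uint32_t)first_bytes[3] << 8)
                    | (uint32_t)first_bytes[4]);
  case STATE_FILE_EXPECT_FILENAMELEN:
    assert(available_bytes >= 11);
    *state = BLOX_FINAL;
    /* Bytes 9..10: file name length m, big-endian; total is m + 11. */
    return 11 + (size_t)(((unsigned)first_bytes[9] << 8) | first_bytes[10]);
  default:
    *state = BLOX_NOT_ALIGNED;  /* unknown state: ask for a resync */
    return 0;
  }
}
```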
The blox API
This automaton is described by a function of the following type, which you must provide (in the case of mod_openbsm, this function is called openbsm_compute_length):
typedef size_t (*compute_length_fun) (unsigned char *first_bytes,
                                      size_t n_first_bytes,
                                      size_t available_bytes,
                                      int *state, /* pointer to blox state */
                                      void *sd_data);
When your function, of that type, is called, first_bytes will point to a memory zone that holds available_bytes bytes; this number is always larger than or equal to the previously requested number of bytes, n_first_bytes. (If you request 11 bytes, the mod_udp module may decide to read 1024 bytes instead, for example.) The integer pointed to by state is the current state. Your function's task is to update the state by storing the new state into *state, and to return the requested number of bytes to read (5, or 11, for example, in the mod_openbsm case). The pointer sd_data contains private data that is passed to each invocation of your function: do whatever you please with it.
Once we have reached the BLOX_FINAL state (by storing it into *state), the blox engine will call a subdissector function, which you must also provide, and which has the following type (in the mod_openbsm case, this is openbsm_subdissect()):
typedef void (*subdissect_fun) (orchids_t *ctx,
                                mod_entry_t *mod,
                                event_t *event,
                                ovm_var_t *delegate,
                                unsigned char *stream,
                                size_t stream_len,
                                void *sd_data,
                                int dissector_level);
When your subdissector is called, stream will point to stream_len bytes, holding a complete data block. You must now chop this block into pieces, enriching the event (a list of field/value pairs) event. This works just like an ordinary dissector.
The mod value points to the current module (mod_openbsm in our example), and sd_data is the same pointer to private data that we mentioned above.
The delegate value is a bit more mysterious. In the mod_openbsm example again, the data block will be part of an OrchIDS binary string str (or a virtual binary string). Some of the field/value pairs will include substrings of it. It is worthwhile to create those substrings as virtual strings. For that, the ovm_vstr_new() and ovm_vbstr_new() functions require a delegate: this is the delegate value. Most often, delegate will be the string str. However, if str is itself virtual, delegate might be its own delegate instead.
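The delegate rule can be illustrated with a toy model of virtual strings. None of this is the real OrchIDS data structure: toy_str and toy_view are hypothetical names, and the idea that a virtual string records the string that actually owns the bytes is this sketch's reading of the paragraph above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OrchIDS string value (not ovm_var_t). */
typedef struct toy_str {
  struct toy_str *delegate;     /* NULL if this string owns its bytes */
  const unsigned char *bytes;   /* first byte of the string */
  size_t len;                   /* number of bytes */
} toy_str;

/* Builds a virtual string viewing parent[off .. off+len-1].
   If parent is itself virtual, the new view points to parent's own
   delegate, so every view refers directly to the owning string. */
toy_str toy_view(toy_str *parent, size_t off, size_t len)
{
  toy_str v;

  assert(parent != NULL);
  assert(off <= parent->len && len <= parent->len - off);
  v.delegate = parent->delegate != NULL ? parent->delegate : parent;
  v.bytes = parent->bytes + off;
  v.len = len;
  return v;
}
```

Taking a view of a view therefore never builds a chain of delegates: the second view's delegate is still the original string.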
Finally, the dissector_level value holds the number of nested dissectors called until now on the current data source. You don't need to know anything about it, except that you should pass it along to calls to further subdissectors, and to post_event() and REGISTER_EVENTS() (which themselves may call subdissectors, and will do so with a value of dissector_level+1). The blox API uses it to register itself into the rtactionlist priority queue, with a priority equal to dissector_level*128. This ensures that blox dissectors consume their input faster than this input is produced.
Only two small tasks remain. We must write our dissector: this will just be a simple call to the following function, provided by the blox API:
int blox_dissect(orchids_t *ctx, mod_entry_t *mod,
                 event_t *event, void *sd_data,
                 int dissector_level);
For example, the dissector of the mod_openbsm module is:
static int openbsm_dissect (orchids_t *ctx, mod_entry_t *mod,
                            event_t *event, void *data,
                            int dissector_level)
{
  return blox_dissect (ctx, mod, event, data, dissector_level);
}
And we must also make sure that we have initialized an instance of the blox API for each possible matching source, using the following function:
blox_hook_t *init_blox_hook(orchids_t *ctx,
                            blox_config_t *bcfg,
                            char *tag,
                            size_t taglen);
This returns a pointer to a blox_hook_t structure, which holds various buffers and flags. Each pair of a source and a blox dissector should have its own blox_hook_t structure. It is therefore natural to call init_blox_hook() for each DISSECT directive. This is done by installing a pre-dissection hook in the input_module_t structure describing the module we are creating. For example, the pre-dissector of the mod_openbsm module is:
static void *openbsm_predissect(orchids_t *ctx, mod_entry_t *mod,
                                char *parent_modname,
                                char *cond_param_str,
                                int cond_param_size)
{
  blox_hook_t *hook;

  hook = init_blox_hook (ctx, mod->config, cond_param_str, cond_param_size);
  return hook;
}
The OrchIDS engine will make sure that the hook returned by the pre-dissector is passed on to the blox API, so that it knows which buffers and flags pertain to which input/dissector pair.
Finally, the bcfg argument to init_blox_hook() holds configuration information for the whole dissector module (not for each one of its instances). You obtain it by calling:
blox_config_t *init_blox_config(orchids_t *ctx,
                                mod_entry_t *mod,
                                size_t n_first_bytes,
                                compute_length_fun compute_length,
                                subdissect_fun subdissect,
                                void *sd_data);
Here, n_first_bytes is the number of bytes that should be read each time blox_dissect() is called. In the case of the mod_openbsm module, we only need to read one byte. For other formats, we may need to read 4 bytes holding a length, for example.
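For such a length-prefixed format, the length function could look like the sketch below. The format itself is hypothetical: a 4-byte big-endian prefix that counts only the payload, not the prefix. The name prefixed_compute_length is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in the text for the predefined final state. */
enum { BLOX_FINAL = 2 };

/* Assumed format: 4-byte big-endian payload length, then the payload. */
#define PREFIX_LEN 4

size_t prefixed_compute_length(unsigned char *first_bytes,
                               size_t n_first_bytes,
                               size_t available_bytes,
                               int *state, void *sd_data)
{
  uint32_t payload;

  (void)n_first_bytes;
  (void)sd_data;
  assert(first_bytes != NULL && state != NULL);
  /* We asked for PREFIX_LEN bytes up front, so they are all here. */
  assert(available_bytes >= PREFIX_LEN);
  payload = ((uint32_t)first_bytes[0] << 24)
          | ((uint32_t)first_bytes[1] << 16)
          | ((uint32_t)first_bytes[2] << 8)
          | (uint32_t)first_bytes[3];
  *state = BLOX_FINAL;          /* one step: no intermediate state */
  return PREFIX_LEN + (size_t)payload;
}
```

Such a function would presumably be registered by passing 4 as n_first_bytes to init_blox_config(), so that the prefix is already available on the first call.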
The compute_length and subdissect function arguments are those we have described above, and this is how we inform the blox engine what those functions are. Finally, sd_data is the private pointer that will be passed to both.
Again in the case of the mod_openbsm module, this initialization is done in the preconfiguration function below (the call to register_fields() is meant to register all fields known to mod_openbsm, and is not directly relevant to this post):
static void *openbsm_preconfig(orchids_t *ctx, mod_entry_t *mod)
{
  blox_config_t *bcfg;

  register_fields(ctx, mod, openbsm_fields, OPENBSM_FIELDS);
  bcfg = init_blox_config (ctx, mod, 1,
                           openbsm_compute_length,
                           openbsm_subdissect,
                           NULL);
  return bcfg;
}
Returning bcfg makes sure it will be stored into the config field of the module mod: we retrieve it as mod->config in the call we have made above to init_blox_hook().
Other uses of the blox API
We have said that the blox API could be used for more general purposes. Let us give the example of the mod_bintotext module, which converts blocks of binary data into sequences of lines terminated by the newline character \n.
In state BLOX_INIT, the bintotext_compute_length() function looks for a newline character \n inside the first_bytes array, of length available_bytes. (We reuse the same argument names as in the length computing function of the mod_openbsm module.) Note that we only request 1 byte, just as in the mod_openbsm case, but available_bytes may be much larger: typically 1024 bytes will be available for binary data coming from a binary file or a UDP socket.
If the newline character was found, then bintotext_compute_length() goes to the BLOX_FINAL state, and returns the offset of the first character past the newline character. This provides the subdissector with the first line of text, including the final newline.
If the newline character was not found, then bintotext_compute_length() goes to a new state, BLOX_NSTATES + available_bytes. This funny state number is guaranteed to be larger than or equal to BLOX_NSTATES, the number of reserved states (BLOX_INIT, BLOX_FINAL, and BLOX_NOT_ALIGNED). Adding available_bytes to BLOX_NSTATES allows us to remember the number of bytes we had available in which no newline could be found. (We could also have used the sd_data pointer for that purpose.) The bintotext_compute_length() function then requests just one more byte, i.e., returns available_bytes+1.
When control is returned to bintotext_compute_length() in some state other than BLOX_INIT, BLOX_FINAL, and BLOX_NOT_ALIGNED (say, BLOX_NSTATES + 1024), with a new value of available_bytes (say, 2048), we look for a newline in the yet unexplored part of the character array first_bytes (i.e., between offsets 1024 included and 2048 excluded, in our example), and we proceed as above.
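The newline search just described can be sketched as follows. This is a reconstruction from the prose, not the actual bintotext_compute_length(): the function name is invented, and BLOX_NSTATES is assumed to be 3, since there are three reserved states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reserved states as quoted in the text; BLOX_NSTATES = 3 is assumed. */
enum { BLOX_INIT = 0, BLOX_NOT_ALIGNED = 1, BLOX_FINAL = 2,
       BLOX_NSTATES = 3 };

size_t sketch_bintotext_compute_length(unsigned char *first_bytes,
                                       size_t n_first_bytes,
                                       size_t available_bytes,
                                       int *state, void *sd_data)
{
  size_t start;
  unsigned char *nl;

  (void)n_first_bytes;
  (void)sd_data;
  assert(first_bytes != NULL && state != NULL);

  /* A state >= BLOX_NSTATES encodes how many bytes were already
     scanned without finding a newline; otherwise start at offset 0. */
  start = *state >= BLOX_NSTATES ? (size_t)(*state - BLOX_NSTATES) : 0;
  if (start > available_bytes)
    start = available_bytes;    /* defensive clamp, should not happen */

  nl = memchr(first_bytes + start, '\n', available_bytes - start);
  if (nl != NULL) {
    *state = BLOX_FINAL;
    return (size_t)(nl - first_bytes) + 1;   /* line, newline included */
  }
  /* No newline yet: remember how far we looked, ask for one more byte. */
  *state = BLOX_NSTATES + (int)available_bytes;
  return available_bytes + 1;
}
```

Storing the scanned length in the state keeps the function free of extra memory, at the cost of assuming that available_bytes fits in an int.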
When BLOX_FINAL is reached, the blox API will then call our subdissector. This merely takes the first stream_len bytes of the stream character array and makes them a virtual string, associated with the .bintotext.line field. (We use the same variable names as in the mod_openbsm example; note that we do not subtract 1 from stream_len, so that the trailing newline is kept.)