Thread-local imperative data structures

All primitives and all constructions of the Orchids language must behave in a thread-local way (except for a few special-purpose features).  This is incompatible with the way some libraries work, such as libxml2. We explain the problem, and a standard solution: thread-local objects. We illustrate this on the mod_xml module.

The problem

The mod_xml module offers a few Orchids primitives that manipulate in-memory XML documents, notably xml_get_str (takes an XML document, an XPath, and an attribute name, and returns the value of the attribute as a string) and xml_set_str (takes an XML document, an XPath, an attribute name, and a string, and sets the attribute to the given string).

Consider the following toy example of Orchids code:

state p {
  $doc = idmef_new_alert(); /* new IDMEF XML document */
  expect (...) goto ipv4_state;
  expect (...) goto ipv6_state;
}

state ipv4_state {
  xml_set_str ($doc, "/*/idmef:Alert/idmef:Source/idmef:Node/idmef:Address",
               "category", "ipv4addr");
  expect (...) goto alert;
}

state ipv6_state {
  xml_set_str ($doc, "/*/idmef:Alert/idmef:Source/idmef:Node/idmef:Address",
               "category", "ipv6addr");
  expect (...) goto alert;
}

state alert {
  idmef_write_alert ($doc);
}

This is a typical way of writing part of a signature that, in state p, knows it will have something to report, hence creates an IDMEF alert document $doc; but does not know yet the value of some of the attributes.

Depending on some conditions (abbreviated as ...) it will set the "category" attribute to "ipv4addr" or to "ipv6addr", then waits for additional conditions to happen and go to the alert state, finally emitting the IDMEF alert document.

Note the two expect clauses in state p will create two OrchIDS threads, call them T1 and T2. Each one waits for some specific form of event before it goes to ipv4_state or to ipv6_state.

Now imagine XML documents were handled in the obvious way: idmef_new_alert creates an IDMEF conformant XML document, xml_set_str modifies it. That would be wrong, because the document would be shared between T1 and T2, and any modification by one of the threads would change the other thread’s document.

To make that clear, imagine a sequence of events numbered, say, from 1 to 100. At event 1, we are in state p and the two threads T1 and T2 are created.

At event 2, T1 finds an event that satisfies the condition of its expect clause, goes to state ipv4_address, and sets the "category" attribute to "ipv4addr". It then waits on another event that would enable it to go to state alert; let’s say this happens for the first time at event 100.

Meanwhile, T2 waits for an event that would match its own expect clause, finds it at event 3. Imagine, for the purpose of illustration, that T2 never gets the opportunity of going to state alert, so only one IDMEF alert will be reported, by thread T1, at event 100.

One would expect this alert to report an alert with the "category" attribute equal to "ipv4addr", since T1 is responsible for it. However, the IDMEF document will have been clobbered by thread T2 at event 3, and T1 will erroneously report a "category" attribute equal to "ipv6addr".

The problem is that our naive implementation of XML documents is not thread-local. All data structures handled by the OrchIDS virtual machine should be thread-local, namely: modifications done by one thread should be invisible from any other thread. (There are exceptions to this rule… starting with the mod_sharedvars module, which aims at making two threads communicate through shared values.)

Applicative data structures

The ideal cure for that kind of problem would be to implement an API of applicative XML documents: instead of modifying XML documents, one would create new XML documents from old and never actually modify any. This is what was done with the database objects that the OrchIDS language offers: instead of adding a record R to a database D, we build the union of R with D, and store the new database into a variable.

For some data structures, this would be impractical. The case of XML documents is a typical case. We handle XML documents by calling the libxml2 library, and it is not applicative.

How do we make non-applicative data structures thread-local?

Thread-local objects

A good solution is provided by the notion of thread-local objects, namely objects that OrchIDS will copy each time it creates a new thread.  This way, each thread will have its own copy of the object.

You may object that copying all the time will be costly.  This is why most thread-local objects will be copy-on-write: they won’t actually be copied, until we try to write into them.  I said “most”, not “all”.  Currently, OrchIDS only allows for 32 copy-on-write objects per thread. This should be sufficient for most envisioned applications: the only thread-local objects one usually needs are XML documents (IDMEF alerts, IODEF reports, prelude alerts), and one usually needs only one of each in a given rule.  The other thread-local objects will actually be copied, in an inefficient way—however, we have just argued that there should hardly ever be any such inefficient thread-local object.

Roughly, encapsulation works as follows. Each thread-local object appears as a handle, namely, an unsigned integer, of type uint16_t. For each handle k, the value that the object has in the current thread is to be found is stored in a fake variable (a variable that the user has no access to), of number HANDLE_FAKE_VAR(k). While user variables have low numbers 0, 1, 2, etc., those fake variables have high numbers ULONG_MAX, ULONG_MAX-1, etc.

To create a new thread-local object with value val, we call:

uint16_t create_fresh_handle (gc_t *gc_ctx, state_instance_t *si,
                              ovm_var_t *val);

If the number returned is strictly less than MAX_HANDLES (=32), then this object will be copy-on-write. Otherwise, it will be copied each time a new thread (of type state_instance_t) is created by the OrchIDS engine (in engine.c).

We can access the actual object referred to by the handle, by using one of the following two functions.

ovm_var_t *handle_get_rd (gc_t *gc_ctx, state_instance_t *si, uint16_t k);

returns the local copy of the object val referred to by handle k This is very fast (just one variable lookup). However, by calling this function, you promise not to make any modification to the structure pointed to by the returned pointer: it may be shared with other threads. If you cannot hold that promise, you must use the following function.

ovm_var_t *handle_get_wr (gc_t *gc_ctx, state_instance_t *si, uint16_t k);

returns a private copy of the non-applicative data stored in the thread-local object tl. The copy is guaranteed to be used only with thread si. In fact, handle_get_wr() works by checking whether si owns the object behind the handle. If it does, then the object is returned directly, just as if we had used handle_get_rd(). Otherwise, handle_get_wr() copies the object, stores it as the value of fake variable HANDLE_FAKE_VAR(k), and declares that it owns the new object.

This is less efficient than calling handle_get_rd(), but it is safe.

Copies are done by using the issdl_clone() function. XML documents, for instance, are (handles to) OrchIDS values of type T_EXTERN; that is, they are really pointers to objects of type ovm_extern_t. Every such object must provide function pointers that copy and that free the embedded objects, and that is how OrchIDS knows how to do copies.

Finally, the following function is used to free the thread-local object, namely the local copy held by thread si, and also the handle k, for further reuse.

int release_handle (gc_t *gc_ctx, state_instance_t *si, uint16_t k);

Currently, release_handle() is not used in OrchIDS. That would mean, for example, allowing the user to manually call a function that releases a handle, causing possible run-time errors (undefined object). Since handles are meant to be created in very low numbers, and are anyway releases when threads terminate, it is probably not worth the trouble to provide a way to manually release them.