Garbage collection

Contrarily to all versions of OrchIDS ≤1.1, OrchIDS 2.0 has garbage-collected memory.

This means that you still allocate memory, but do not deallocate it explicitly. This imposes a certain number of constraints on the way you program.

The garbage collector is implemented in file src/util/gc.c. This is a mark-during-sweep garbage collector, an algorithm described by Queinnec, Beaudoing and Queille in 1989. This little-known algorithm is an incremental garbage collector that improves upon Dijkstra’s three-color incremental garbage collector.

Let us say that, at any point in time, every garbage-collectable piece of data can be white, black, or grey.  The black objects are those that have been marked as used by the garbage collector.  The white objects are those that have not yet been marked by the garbage collector (the objects that eventually remain white after the end of the marking phase are those that will be deallocated).  Finally, the grey objects are those that are in the process of being marked.

One should remember that no black object should ever point directly to a white object.  To store a pointer to a garbage-collectable object p into another one, one must always shade p first, namely, make it grey if it was white.  This is the purpose of the gc_touch() function.

The main functions are:

  • The allocator itself:
    void *gc_alloc (gc_t *gc_ctx, size_t n, gc_class_t *class);

    Allocates a garbage-collectable, n byte memory zone.
    The first argument, gc_ctx, is the garbage collector context.  This is a struct containing all the information that the garbage collector needs to function propertly.  You are not meant to touch it in any way.
    The class argument describes all that the garbage-collector needs to know about your structure, in particular, how to mark its subfields.  It should not be NULL.  See gc_class_t below for more explanations.

    gc_alloc() returns an object of type void *. Internally, however, gc_alloc() allocates an object of size n that starts with a gc_header_t header.  In particular, n should be ≥ sizeof(gc_header_t).
    The basic strategy to define a garbage-collectable type is to define it as a struct with a first field of type gc_header_t.  For example, the Orchids native integer type ovm_int_t is defined by:

    struct ovm_int_s
    {
      gc_header_t gc;
      long val;
    };

    and you would allocate an integer by calling:

    struct ovm_int_s *s =
       gc_alloc(gc_ctx, sizeof(struct ovm_int_s), &int_class);
    s->gc.type = T_INT;
    

    The int_class gc_class we are using for that structure, and we shall explain this bit later.
    Note that you should follow the call to gc_alloc() by code that sets the type (here, T_INT).
    If you create a new gc_class, you must also enforce the following relation between the type (here, T_INT) and the gc_class: there is an array gc_classes[256] in lang.c, and its entry at index given by the type should be the corresponding gc_class; in our case, you should check that gc_classes[T_INT] indeed contains &int_class.

  • By the way, the purpose of declaring the return type of gc_alloc() to be void * instead of gc_header_t * is to allow you to write code similar to the above. Otherwise, you would need a cast, and would have to write: struct ovm_int_s *s = (struct ovm_int_s *)gc_alloc(gc_ctx, sizeof(struct ovm_int_s), NULL);.
    Any data allocated by gc_alloc() is created white.  This means that it can be deallocated at the next call to gc(), gc_full(), or gc_alloc() itself!  If you mean to use the object you’ve just created, you should first store it into some other garbage-collectable object… calling GC_TOUCH() on it first.  This is a macro that encapsulates a call to the gc_touch() function, which does all the work.
  • The shading function:
    void gc_touch (gc_t *gc_ctx, gc_header_t *p);
    #define GC_TOUCH(gc_ctx,p) gc_touch(gc_ctx,(gc_header_t *)p)

    shades p. This is required before storing p into an object. Shading means making p grey if it was white. This way, you can store p into some other object without running the risk of having a black object point to a white object.
    Hence, in particular, storing a garbage-collected object p at some location loc should always be written:

    GC_TOUCH (gc_ctx, loc=p);

    and never loc=p;, for example. To make it clear that you have not forgotten the call to GC_TOUCH(), you should always use the above form GC_TOUCH (gc_ctx, loc=p);, and not equivalent forms such as loc=p; GC_TOUCH (gc_ctx, p);.
    By the way, shading works slightly differently in one case: if the gc_class of p has a mark_subfields function pointer equal to NULL, this lets the garbage-collector know that p has not garbage-collectable subfield, and then gc_touch() will make p black directly, instead of grey. This makes the garbage-collector more efficient in those cases. It is therefore relatively important to be able to leave the mark_subfields entry NULL in gc_classes.

  • The garbage-collector itself:
    void gc (gc_t *gc_ctx);
    void gc_full (gc_t *gc_ctx);

    gc() calls the incremental garbage collector, and tries to mark and to reclaim a few objects. By default, it marks 3 objects and tries to reclaim 2.5 objects on average. You should almost never need to call gc() yourself, as gc_alloc() already calls gc() to get back a bit of memory itself.

    gc_full() does a complete round of garbage collection. You should call it twice in a row, in critical low-memory situations, to reclaim as much memory as you can. This is what gc_alloc() does anyway when memory runs low.

  • Object classes:
    typedef struct gc_class_s gc_class_t;

    A record of specific function hooks for specific kinds of garbage-collectable objects.  The most useful one is the mark_subfields hook, which is used to mark garbage-collectable subobjects.  For example, one may define a garbage-collectable type of binary trees as follows:

    typedef struct tree {
      gc_header_t h;
      int n;
      struct tree *left, *right;
    } tree;

    If you allocated a tree object using a NULL class argument, the garbage collector would have no way of realizing that it should mark the left and right fields, too, recursively. To tell the garbage collector it has to do so, call gc_alloc() with the argument &tree_class, where:

    gc_class_t tree_class = {
      GC_ID('t','r','e','e'),
      tree_mark_subfields,
      tree_finalize,
      tree_traverse,
      tree_save,
      tree_restore
    };
  • The tree_mark_subfields() function is defined as follows. It simply shades the two garbage-collectable subfields:
    static void tree_mark_subfields (gc_t *ctx, gc_header_t *p)
    {
      tree *t = (tree *)p;
    
      GC_TOUCH (ctx, t->left);
      GC_TOUCH (ctx, t->right);
    }

    It is also possible to leave the mark_subfields hook equal to NULL, in which case the garbage-collector will consider that no subfield has to be marked. This would be wrong for binary trees, of course. However, it makes sense for types such as ovm_int_t, which have no garbage-collected subfields (only a long subfield). It is even advised to leave the mark_subfields hook equal to NULL when there are no subfields, because it makes the garbage-collector more efficient on those types (see the earlier discussion on gc_touch()).
    There are four other function hooks that need to be defined. The third one, traverse (here, defined as tree_traverse()) operates a more generic way of traversing the same subfields. We keep the mark_subfields hook separate for efficiency reason, although it could have been implemented in terms of the traverse hook. Here is the necessary (boilerplate) code for trees:

    static int tree_traverse (gc_traverse_ctx_t *gtc, gc_header_t *p,
    			  void *data)
    {
      tree *t = (tree *)p;
      int err = 0;
    
      err = (*gtc->do_subfield) (gtc, (gc_header_t *)t->left, data);
      if (err)
        return err;
      err = (*gtc->do_subfield) (gtc, (gc_header_t *)t->right, data);
      return err;
    }

    The finalize hook (here, tree_finalize()) is meant to be called to free up any other memory pointed to by the object in argument, provided such memory is not directly under the control of the garbage collector. It can also be used to close file descriptors, for example, when the object is deallocated. In our example, tree_finalize() does nothing:

    static void tree_finalize (gc_t *ctx, gc_header_t *p)
    {
      return;
    }

    In that case, it is allowed to leave the finalize equal to NULL, instead of setting it to the do-nothing function tree_finalize().
    The fourth hook, save, implements one half of a save-and-restore mechanism, which allows OrchIDS to dump its internal state to a file, and later recover from the saved file anew. The save hook must run through all the subfields of the object p and call the corresponding save functions. In the case of trees, for example, we would write:

    static int tree_save (save_ctx_t *sctx, gc_header_t *p)
    {
      tree *t = (tree *)p;
      int err;
    
      err = save_int (sctx, t->n);
      if (err) return err;
      err = save_gc_struct (sctx, (gc_header_t *)t->left);
      if (err) return err;
      err = save_gc_struct (sctx, (gc_header_t *)t->right);
      return err;
    }

    Note that this works slightly differently from the mark_subfields and traverse hooks: we also call save functions, recursively, on the fields that are not under the control of the garbage-collector, such as n here.
    The save hook cannot be left to NULL.
    Conversely, you must write a restore hook as follows.

    static gc_header_t *tree_restore (restore_ctx_t *rctx)
    {
      gc_t *gc_ctx = rctx->gc_ctx;
      tree *t;
      tree *left, *right;
      int n;
      int err;
    
      t = NULL; /* pre-set result to NULL, in case an error occurs. */
      GC_START (gc_ctx, 2); /* We shall need to protect the left and right fields
                               against the GC before we can store them
                               into the newly allocated tree. */
      err = restore_int (gc_ctx, &n); /* get back the n field */
      if (err) { errno = err; goto end; } /* error reporting is through
                                             the errno variable. */
      left = (tree *)restore_gc_struct (rctx);
      if (left==NULL && errno!=0) goto end; /* propagate error if one happened
                                               in restoring left;
                                               testing left==NULL is not
                                               enough, as NULL might be
                                               a legal tree. */
      GC_UPDATE (gc_ctx, 0, left); /* protect left against GC! */
      right = (tree *)restore_gc_struct (rctx);
      if (right==NULL && errno!=0) goto end;
      GC_UPDATE (gc_ctx, 1, right); /* protect right against GC! */
      t = gc_alloc (gc_ctx, sizeof(tree), &tree_class);
      t->gc.type = T_TREE; /* assuming this tag is defined in lang.h---
                              otherwise add it, as a number between 0 and 255,
                              making sure it has not been taken already;
                              then make sure the gc_classes[] array, at
                              the end of lang.c, contains &tree_class
                              at index T_TREE. */
      t->n = n;
      GC_TOUCH (gc_ctx, t->left = left); /* do not forget GC_TOUCH()! */
      GC_TOUCH (gc_ctx, t->right = right); /* do not forget GC_TOUCH()! */
    end:
      GC_END (gc_ctx);
      return t;
    }

    The restore hook cannot be left to NULL.

  • Debugging:
    void gc_check (gc_t *gc_ctx);

    Looks for inconsistencies in memory. Works similarly to fsck on Unix systems: call it to check that the garbage collector is in a sane state, that is, that it satisfies all the invariants it is meant to satisfy. For example, this will check that no black object directly points to any white object, as mentioned above.

    To detect memory and garbage collector related bugs, sprinkle your code with calls to gc_check(), and launch OrchIDS under gdb (or lldb on Mavericks MacOSX and later). gc_check() should abort() and fall back to the debugger the first time it is called after something went wrong.