Strange behaviors with characters

Some functions in the C code of Orchids use objects of type char, while others use unsigned char. You will sometimes see strange sequences of casts in the code of Orchids, such as:

{
  unsigned char c;

  /* ... */
  switch ((int)(unsigned int)c) {  /* explicit conversions; see below for why */
    /* ... */
  }
}

To understand what is at stake, consider the following piece of code:

{
  char c;

  switch (c) {
    /* ... */
    case 'é': return 'e';
    case 'à': return 'a';
    /* ... */
    default: return c;
  }
}

This looks like code meant to strip accents from letters, but it is completely buggy.

One reason is that accented character constants such as 'é' or 'à' are system- and compiler-dependent (more precisely, everything depends on the chosen character encoding), and should not be used as such.

However, the reason I am giving this code as an example is to exhibit a very strange bug with characters in C.

It is very likely that, if you run this code with 'é' in c, it will not return 'e'.

Let me give the basic reasons for this strange behavior:

  • Character constants such as 'é' are of type unsigned char ('é' is 0xe9 on my machine);
  • The ANSI C standard does not specify whether char should be signed or unsigned; let me assume that it is signed, as on many machines today;
  • The switch construct takes an object of type int, so an implicit conversion from char to int takes place in the code above.

As a result, if you put 'é' into c, c will hold the bit pattern 0xe9, namely -0x17 as a signed char; switch will convert it to the int value -0x00000017 = 0xffffffe9... and this differs from the value 0xe9 of the character constant 'é' it is compared against.
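To see the sign extension happen, here is a small test program. This is only a sketch: it assumes a Latin-1 (ISO 8859-1) execution character set, a signed char, and a 32-bit int, as in the discussion above.

#include <stdio.h>

int main(void)
{
  char c = (char)0xe9;  /* the Latin-1 code of 'é'; stored as -0x17 if char is signed */
  int i = c;            /* the same implicit conversion that switch performs */

  printf("%d 0x%x\n", i, (unsigned int)i);  /* prints: -23 0xffffffe9 */
  return 0;
}

On a machine where char is unsigned, the same program prints 233 0xe9 instead.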

If c had been declared unsigned char instead, one might think the bug would be averted. However, the sequence of conversions from unsigned char to (signed) int is not specified, as far as I know. If the C compiler first converts your unsigned char to signed char and then to int, the same bug as above will occur.
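Spelling out the two possible conversion paths makes the difference visible (again a sketch, assuming a signed char on a two's complement machine):

{
  unsigned char c = 0xe9;

  int good = (int)(unsigned int)c;  /* value-preserving: good == 0xe9 == 233 */
  int bad  = (int)(signed char)c;   /* sign-extending:   bad  == -0x17 == -23 */
}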

I prefer not to let the C compiler choose the sequence of conversions it should apply, so I write them out in full… as (int)(unsigned int)c.
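Putting everything together, a version of the accent-removal code above that avoids both pitfalls could look like the following sketch; the constants 0xe9 and 0xe0 are the ISO 8859-1 codes of 'é' and 'à', which is merely an assumption about the input encoding:

{
  unsigned char c;

  /* ... */
  switch ((int)(unsigned int)c) {
    /* ... */
    case 0xe9: return 'e';  /* 'é' in ISO 8859-1 */
    case 0xe0: return 'a';  /* 'à' in ISO 8859-1 */
    /* ... */
    default: return c;
  }
}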