Some functions in the C code of Orchids use objects of type char, some others of type unsigned char. You will sometimes see strange sequences of casts in the code of Orchids, such as:
    {
      unsigned char c;
      /* ... */
      switch ((int)(unsigned int)c) {
        /* ... */
      }
    }
To understand what is at stake, consider the following piece of code:
    {
      char c;
      switch (c) {
        /* ... */
        case 'é': return 'e';
        case 'à': return 'a';
        /* ... */
        default: return c;
      }
    }
This looks like code meant to remove accents from letters, but it is completely buggy.
One reason is that accented character constants such as 'é' or 'à' are system- and compiler-dependent (more precisely, everything depends on the chosen character encoding), and should not be used as such.
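To make the encoding dependence concrete, here is a minimal sketch. It assumes two common encodings, ISO-8859-1 and UTF-8, and spells out the bytes of the letter with hex escapes so that the result does not depend on how the source file itself is saved. The names e_acute_latin1, e_acute_utf8 and encoded_length are hypothetical and do not come from Orchids.

```c
#include <stddef.h>
#include <string.h>

/* Bytes of the letter e-acute in two encodings, written with hex escapes
   so that nothing depends on the encoding of this source file. */
static const char e_acute_latin1[] = "\xe9";      /* ISO-8859-1: one byte */
static const char e_acute_utf8[]   = "\xc3\xa9";  /* UTF-8: two bytes     */

/* Hypothetical helper: number of bytes occupied by an encoded string. */
size_t encoded_length(const char *s)
{
    return strlen(s);
}
```

Under these assumptions, encoded_length(e_acute_latin1) is 1 and encoded_length(e_acute_utf8) is 2. In UTF-8 the letter does not even fit in a single char, so no single-char comparison can match it.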
However, my reason for giving this code as an example is to show a very strange bug with characters in C.
It is very likely that, if you start with 'é' in c, the code will not return 'e'.
Let me give the basic reasons for this strange behavior:
- Character constants such as 'é' have type int in C; on my machine, 'é' has the value 0xe9.
- The ANSI C standard does not specify whether char should be signed or unsigned; let me assume that it is signed, as on many machines today.
- The switch construct takes an object of type int, so an implicit conversion from char to int takes place in the code above.
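The three points above can be checked on a given compiler with a small sketch like the following. The function names are hypothetical; the only library facility used is CHAR_MIN from <limits.h>, which is 0 exactly when plain char is unsigned.

```c
#include <limits.h>

/* Returns 1 if plain char is signed on this implementation, 0 otherwise. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Returns 1 if a character constant has the size of an int.
   This holds in C, where character constants have type int
   (it would not hold in C++, where 'a' has type char). */
int char_constant_has_int_size(void)
{
    return sizeof('a') == sizeof(int);
}

/* Returns 1 if a char used in an expression is promoted to int,
   which is what happens to the controlling expression of a switch.
   The unary plus only triggers the integer promotions. */
int promoted_char_has_int_size(void)
{
    char c = 'a';
    return sizeof(+c) == sizeof(int);
}
```

The result of plain_char_is_signed() varies from one platform or compiler option to another, which is exactly why the rest of the discussion has to assume one case.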
As a result, if you put 'é' into c, c will contain the integer 0xe9, namely -0x17 as a signed char; switch will convert it to the int value -0x00000017 = 0xffffffe9... and this is different from the value 0xe9 of the character constant 'é' that it is compared with.
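The arithmetic can be reproduced with the following sketch. It uses signed char and unsigned char explicitly instead of plain char, so that the outcome does not depend on the compiler's default. It assumes 8-bit chars, two's complement, and a compiler that wraps the out-of-range value 0xe9 when storing it into a signed char (this conversion is implementation-defined). The function names are hypothetical.

```c
/* Store a byte value in a signed char, then convert it to int,
   as happens to a signed plain char in a switch. */
int as_signed_char(int byte)
{
    signed char c = (signed char)byte;  /* 0xe9 becomes -0x17 (assumed wrap) */
    return c;                           /* conversion to int keeps the sign  */
}

/* Same experiment with an unsigned char: the value is preserved. */
int as_unsigned_char(int byte)
{
    unsigned char c = (unsigned char)byte;
    return c;
}
```

Under these assumptions, as_signed_char(0xe9) yields -0x17, which is 0xffffffe9 when viewed as a 32-bit unsigned int, while as_unsigned_char(0xe9) yields 0xe9.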
If c had been declared unsigned char instead, we might think the bug would have been averted. Standard C does promote an unsigned char to (signed) int by preserving its value, so a conforming compiler yields 0xe9 here. However, if a C compiler first converted your unsigned char to signed char and then to int, the same bug as above would occur.
I prefer not to let the C compiler choose the sequence of conversions it should do, and I write them in full… as (int)(unsigned int)c.
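As an illustration of the idiom, here is a sketch of how the accent-removal example could be written with the explicit casts. This is not code from Orchids: the function remove_accent is hypothetical, and it assumes ISO-8859-1 input, where 'é' is the byte 0xe9 and 'à' is the byte 0xe0. Numeric constants replace the accented character constants so that the code does not depend on the encoding of the source file.

```c
/* Hypothetical accent remover, assuming ISO-8859-1 encoded input. */
int remove_accent(unsigned char c)
{
    /* Spell out the conversions: unsigned char, then unsigned int, then int. */
    switch ((int)(unsigned int)c) {
    case 0xe9: return 'e';   /* e-acute in ISO-8859-1 */
    case 0xe0: return 'a';   /* a-grave in ISO-8859-1 */
    default:   return (int)(unsigned int)c;
    }
}
```

A caller holding a plain char buffer s would convert each element first, as in remove_accent((unsigned char)s[i]), so that no sign extension can take place on the way in.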