Secure Coding in C and C++: Strings and Buffer Overflows
with Dan Plakosh, Jason Rafail, and Martin Sebor
- But evil things, in robes of sorrow, Assailed the monarch’s high estate.
- —Edgar Allan Poe,
- “The Fall of the House of Usher”
2.1. Character Strings
Strings from sources such as command-line arguments, environment variables, console input, text files, and network connections are of special concern in secure programming because they provide means for external input to influence the behavior and output of a program. Graphics- and Web-based applications, for example, make extensive use of text input fields, and because of standards like XML, data exchanged between programs is increasingly in string form as well. As a result, weaknesses in string representation, string management, and string manipulation have led to a broad range of software vulnerabilities and exploits.
Strings are a fundamental concept in software engineering, but they are not a built-in type in C or C++. The standard C library supports strings of type char and wide strings of type wchar_t.
String Data Type
A string consists of a contiguous sequence of characters terminated by and including the first null character. A pointer to a string points to its initial character. The length of a string is the number of bytes preceding the null character, and the value of a string is the sequence of the values of the contained characters, in order. Figure 2.1 shows a string representation of “hello.”
Figure 2.1. String representation of “hello”
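As a minimal sketch of the definition above, the following fragment contrasts the length of the “hello” string with the storage its array occupies. The names greeting, greeting_length(), and greeting_size() are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* The array holds 'h', 'e', 'l', 'l', 'o', '\0': six bytes in total. */
static const char greeting[] = "hello";

/* Length: the number of bytes preceding the terminating null character. */
size_t greeting_length(void) {
    return strlen(greeting);
}

/* Size: the number of bytes in the array, including the null character. */
size_t greeting_size(void) {
    return sizeof(greeting);
}
```

The length is 5 and the size is 6; the difference of one byte is the terminating null character, which is part of the string but not counted in its length.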
Strings are implemented as arrays of characters and are susceptible to the same problems as arrays.
As a result, secure coding practices for arrays should also be applied to null-terminated character strings; see the “Arrays (ARR)” chapter of The CERT C Secure Coding Standard [Seacord 2008]. When dealing with character arrays, it is useful to define some terms:
The C Standard allows for the creation of pointers that point one past the last element of the array object, although these pointers cannot be dereferenced without invoking undefined behavior. When dealing with strings, some extra terms are also useful:
Array Size
One of the problems with arrays is determining the number of elements. In the following example, the function clear() uses the idiom sizeof(array) / sizeof(array[0]) to determine the number of elements in the array. However, array is a pointer type because it is a parameter. As a result, sizeof(array) is equal to sizeof(int *). For example, on an architecture (such as x86-32) where sizeof(int) == 4 and sizeof(int *) == 4, the expression sizeof(array) / sizeof(array[0]) evaluates to 1, regardless of the length of the array passed, leaving the rest of the array unaffected.
void clear(int array[]) {
  /* Flawed: array is a pointer here, so sizeof(array) is sizeof(int *). */
  for (size_t i = 0; i < sizeof(array) / sizeof(array[0]); ++i) {
    array[i] = 0;
  }
}

void dowork(void) {
  int dis[12];

  clear(dis);
  /* ... */
}
This is because the sizeof operator yields the size of the adjusted (pointer) type when applied to a parameter declared to have array or function type. The strlen() function can be used to determine the length of a properly null-terminated character string but not the space available in an array. The CERT C Secure Coding Standard [Seacord 2008] includes “ARR01-C. Do not apply the sizeof operator to a pointer when taking the size of an array,” which warns against this problem.
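One common way to avoid the problem is to have the caller, which can still see the real array type, pass the element count explicitly. The following is a sketch under that assumption; the count parameter is an addition that does not appear in the flawed example.

```c
#include <stddef.h>

/*
 * Sketch of a corrected clear(): the caller supplies the element count,
 * because sizeof(array) inside this function would measure a pointer.
 */
void clear(int array[], size_t count) {
    for (size_t i = 0; i < count; ++i) {
        array[i] = 0;
    }
}

/* Returns the sum of the elements after clearing, which should be 0. */
int dowork(void) {
    int dis[12];
    /* dis is a true array in this scope, so the sizeof idiom is valid. */
    size_t count = sizeof(dis) / sizeof(dis[0]);
    int sum = 0;

    for (size_t i = 0; i < count; ++i) {
        dis[i] = (int)i + 1;
    }
    clear(dis, count);
    for (size_t i = 0; i < count; ++i) {
        sum += dis[i];
    }
    return sum;
}
```

The element count is computed once, in the scope where dis still has array type, and travels with the pointer from then on.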
The characters in a string belong to the character set interpreted in the execution environment, called the execution character set. These characters consist of a basic character set, defined by the C Standard, and a set of zero or more extended characters, which are not members of the basic character set. The values of the members of the execution character set are implementation defined but may, for example, be the values of the 7-bit U.S. ASCII character set.
C uses the concept of a locale, which can be changed by the setlocale() function, to keep track of various conventions such as language and punctuation supported by the implementation. The current locale determines which characters are available as extended characters.
The basic execution character set includes the 26 uppercase and 26 lowercase letters of the Latin alphabet, the 10 decimal digits, 29 graphic characters, the space character, and control characters representing horizontal tab, vertical tab, form feed, alert, backspace, carriage return, and newline. The representation of each member of the basic character set fits in a single byte. A byte with all bits set to 0, called the null character, must exist in the basic execution character set; it is used to terminate a character string.
The execution character set may contain a large number of characters and therefore require multiple bytes to represent some individual characters in the extended character set. This is called a multibyte character set. In this case, the basic characters must still be present, and each character of the basic character set is encoded as a single byte. The presence, meaning, and representation of any additional characters are locale specific. A string may sometimes be called a multibyte string to emphasize that it might hold multibyte characters. These are not the same as wide strings in which each character has the same length.
A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
UTF-8
UTF-8 is a multibyte character set that can represent every character in the Unicode character set but is also backward compatible with the 7-bit U.S. ASCII character set. Each UTF-8 character is represented by 1 to 4 bytes (see Table 2.1). If the character is encoded by just 1 byte, the high-order bit is 0 and the other bits give the code value (in the range 0 to 127). If the character is encoded by a sequence of more than 1 byte, the first byte has as many leading 1 bits as the total number of bytes in the sequence, followed by a 0 bit, and the succeeding bytes are all marked by the leading bit pattern 10. The remaining bits in the byte sequence are concatenated to form the Unicode code point value (in the range 0x80 to 0x10FFFF). Consequently, a byte with lead bit 0 is a single-byte code, a byte with multiple leading 1 bits is the first of a multibyte sequence, and a byte with the leading bit pattern 10 is a continuation byte of a multibyte sequence. The format of the bytes allows the beginning of each sequence to be detected without decoding from the beginning of the string.
Table 2.1. Well-Formed UTF-8 Byte Sequences
| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
Source: [Unicode 2012]
The first 128 characters constitute the basic execution character set; each of these characters fits in a single byte.
UTF-8 decoders are sometimes a security hole. In some circumstances, an attacker can exploit an incautious UTF-8 decoder by sending it an octet sequence that is not permitted by the UTF-8 syntax. The CERT C Secure Coding Standard [Seacord 2008] includes “MSC10-C. Character encoding—UTF-8-related issues,” which describes this problem and other UTF-8-related issues.
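To make the byte ranges in Table 2.1 concrete, the following is a minimal sketch of a checker that accepts only the well-formed sequences listed there. The function names utf8_is_well_formed() and in_range() are hypothetical; this is an illustration of the table, not a complete decoder.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if b lies in the inclusive range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

/*
 * Hypothetical helper: returns true if the len bytes at s consist only of
 * well-formed UTF-8 sequences as listed in Table 2.1.
 */
bool utf8_is_well_formed(const unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b0 = s[i];
        size_t need;              /* number of continuation bytes */
        unsigned char lo = 0x80;  /* permitted range of the second byte */
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) {
            need = 0;
        } else if (in_range(b0, 0xC2, 0xDF)) {
            need = 1;
        } else if (in_range(b0, 0xE0, 0xEF)) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;  /* excludes overlong encodings */
            if (b0 == 0xED) hi = 0x9F;  /* excludes U+D800..U+DFFF */
        } else if (in_range(b0, 0xF0, 0xF4)) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;  /* excludes overlong encodings */
            if (b0 == 0xF4) hi = 0x8F;  /* excludes values above U+10FFFF */
        } else {
            return false;  /* 80..C1 and F5..FF never start a sequence */
        }

        if (need > len - i - 1) {
            return false;  /* sequence is truncated */
        }
        for (size_t k = 1; k <= need; ++k) {
            unsigned char min = (k == 1) ? lo : 0x80;
            unsigned char max = (k == 1) ? hi : 0xBF;
            if (!in_range(s[i + k], min, max)) {
                return false;
            }
        }
        i += need + 1;
    }
    return true;
}
```

A checker of this kind rejects, for example, the two-byte sequence C0 80, an overlong encoding of the null character that an incautious decoder might accept.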
Wide Strings
To process the characters of a large character set, a program may represent each character as a wide character, which generally takes more space than an ordinary character. Most implementations choose either 16 or 32 bits to represent a wide character. The problem of sizing wide strings is covered in the section “Sizing Strings.”
A wide string is a contiguous sequence of wide characters terminated by and including the first null wide character. A pointer to a wide string points to its initial (lowest addressed) wide character. The length of a wide string is the number of wide characters preceding the null wide character, and the value of a wide string is the sequence of code values of the contained wide characters, in order.
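The following sketch shows the same definitions in code. It assumes only the standard wcslen() function; the names wide_greeting and the three accessor functions are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Six wide characters: 'h', 'e', 'l', 'l', 'o', and the null wide character. */
static const wchar_t wide_greeting[] = L"hello";

/* Length: wide characters preceding the null wide character. */
size_t wide_greeting_length(void) {
    return wcslen(wide_greeting);
}

/* Count: elements in the array, including the null wide character. */
size_t wide_greeting_count(void) {
    return sizeof(wide_greeting) / sizeof(wide_greeting[0]);
}

/* Size: bytes occupied by the array; depends on sizeof(wchar_t). */
size_t wide_greeting_size(void) {
    return sizeof(wide_greeting);
}
```

The length and count are the same on every implementation (5 and 6), whereas the size in bytes varies with the width of wchar_t.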
String Literals
A character string literal is a sequence of zero or more characters enclosed in double quotes, as in "xyz". A wide string literal is the same, except prefixed by the letter L, as in L"xyz".
In a character constant or string literal, members of the character set used during execution are represented by corresponding members of the character set in the source code or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, must exist in the basic execution character set; it is used to terminate a character string.
During compilation, the multibyte character sequences specified by any sequence of adjacent character and identically prefixed string literal tokens are concatenated into a single multibyte character sequence. If any of the tokens have an encoding prefix, the resulting multibyte character sequence is treated as having the same prefix; otherwise, it is treated as a character string literal. Whether differently prefixed wide string literal tokens can be concatenated (and, if so, the treatment of the resulting multibyte character sequence) is implementation defined. For example, each of the following sequences of adjacent string literal tokens
"a" "b" L"c"
"a" L"b" "c"
L"a" "b" L"c"
L"a" L"b" L"c"
is equivalent to the string literal
L"abc"
Next, a byte or code of value 0 is appended to each character sequence that results from a string literal or literals. (A character string literal need not be a string, because a null character may be embedded in it by a \0 escape sequence.) The character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char and are initialized with the individual bytes of the character sequence. For wide string literals, the array elements have type wchar_t and are initialized with the sequence of wide characters corresponding to the character sequence, as defined by the mbstowcs() (multibyte string to wide-character string) function with an implementation-defined current locale. The value of a string literal containing a character or escape sequence not represented in the execution character set is implementation defined.
The type of a string literal is an array of char in C, but it is an array of const char in C++. Consequently, the elements of a string literal are not const-qualified in C, and a compiler need not diagnose an attempt to modify one. However, if the program attempts to modify such an array, the behavior is undefined, and such behavior is therefore prohibited by The CERT C Secure Coding Standard [Seacord 2008], “STR30-C. Do not attempt to modify string literals.” One reason for this rule is that the C Standard does not specify that these arrays must be distinct, provided their elements have the appropriate values. For example, compilers sometimes store multiple identical string literals at the same address, so that modifying one such literal might have the effect of changing the others as well. Another reason for this rule is that string literals are frequently stored in read-only memory (ROM).
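The parenthetical remark about embedded null characters can be illustrated with a short sketch. The names embedded, embedded_size(), and embedded_length() are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Six bytes: 'a', 'b', '\0', 'c', 'd', plus the appended terminating '\0'. */
static const char embedded[] = "ab\0cd";

/* Size of the array initialized from the literal. */
size_t embedded_size(void) {
    return sizeof(embedded);
}

/* strlen() stops at the first null character, so it sees only "ab". */
size_t embedded_length(void) {
    return strlen(embedded);
}
```

The array occupies 6 bytes, but functions that expect a string treat it as the 2-character string "ab", because a string ends at the first null character.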
The C Standard allows an array variable to be declared both with a bound index and with an initialization literal. The initialization literal also implies an array size in the number of elements specified. For strings, the size specified by a string literal is the number of characters in the literal plus one for the terminating null character.
Array variables are often initialized by a string literal and declared with an explicit bound that matches the number of characters in the string literal. For example, the following declaration initializes an array of characters using a string literal that defines one more character (counting the terminating '\0') than the array can hold:
const char s[3] = "abc";
The size of the array s is 3, although the size of the string literal is 4; consequently, the trailing null byte is omitted. Any subsequent use of the array as a null-terminated byte string can result in a vulnerability, because s is not properly null-terminated.
A better approach is to not specify the bound of a string initialized with a string literal because the compiler will automatically allocate sufficient space for the entire string literal, including the terminating null character:
const char s[] = "abc";
This approach also simplifies maintenance, because the size of the array can always be derived even if the size of the string literal changes. This issue is further described by The CERT C Secure Coding Standard [Seacord 2008], “STR36-C. Do not specify the bound of a character array initialized with a string literal.”
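The difference between the two declarations can be observed directly. The sketch below is valid C only (C++ rejects the bounded declaration), and some compilers may warn about it; the helper has_terminator() and the other names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Explicit bound of 3: the terminating '\0' does not fit and is dropped. */
static const char bounded[3] = "abc";

/* No bound: the compiler sizes the array to 4, including the '\0'. */
static const char unbounded[] = "abc";

/*
 * Hypothetical helper: returns 1 if a null character occurs within the
 * first size bytes of buf, and 0 otherwise. It never reads past size.
 */
int has_terminator(const char *buf, size_t size) {
    return memchr(buf, '\0', size) != NULL;
}

size_t bounded_size(void) { return sizeof(bounded); }
size_t unbounded_size(void) { return sizeof(unbounded); }

int bounded_is_terminated(void) {
    return has_terminator(bounded, sizeof(bounded));
}

int unbounded_is_terminated(void) {
    return has_terminator(unbounded, sizeof(unbounded));
}
```

The bounded array has size 3 and contains no null character, so passing it to a function such as strlen() would read beyond the array. The unbounded array has size 4 and is properly terminated.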
Strings in C++
Multibyte strings and wide strings are both common data types in C++ programs, but many attempts have been made to also create string classes. Most C++ developers have written at least one string class, and a number of widely accepted forms exist. The standardization of C++ [ISO/IEC 1998] promotes the standard class template std::basic_string. The basic_string template represents a sequence of characters. It supports sequence operations as well as string operations such as search and concatenation and is parameterized by character type:
- string is a typedef for the template specialization basic_string<char>.
- wstring is a typedef for the template specialization basic_string<wchar_t>.
Because the C++ standard defines additional string types, C++ also defines additional terms for multibyte strings. A null-terminated byte string, or NTBS, is a character sequence whose highest addressed element with defined content has the value 0 (the terminating null character); no other element in the sequence has the value 0. A null-terminated multibyte string, or NTMBS, is an NTBS that constitutes a sequence of valid multibyte characters beginning and ending in the initial shift state.
The basic_string class template specializations are less prone to errors and security vulnerabilities than are null-terminated byte strings. Unfortunately, there is a mismatch between C++ string objects and null-terminated byte strings. Specifically, most C++ string objects are treated as atomic entities (usually passed by value or reference), whereas existing C library functions accept pointers to null-terminated character sequences. In the standard C++ string class, the internal representation does not have to be null-terminated [Stroustrup 1997], although all common implementations are null-terminated. Some other string types, such as Win32 LSA_UNICODE_STRING, do not have to be null-terminated either. As a result, there are different ways to access string contents, determine the string length, and determine whether a string is empty.
It is virtually impossible to avoid multiple string types within a C++ program. If you want to use basic_string exclusively, you must ensure that there are no
- basic_string literals. A string literal such as "abc" is a static null-terminated byte string.
- Interactions with the existing libraries that accept null-terminated byte strings (for example, many of the objects manipulated by function signatures declared in <cstring> are NTBSs).
- Interactions with the existing libraries that accept null-terminated wide-character strings (for example, many of the objects manipulated by function signatures declared in <cwchar> are wide-character sequences).
Typically, C++ programs use null-terminated byte strings and one string class, although it is often necessary to deal with multiple string classes within a legacy code base [Wilson 2003].
Character Types
The three types char, signed char, and unsigned char are collectively called the character types. Compilers have the latitude to define char to have the same range, representation, and behavior as either signed char or unsigned char. Regardless of the choice made, char is a distinct type.
Although not stated in one place, the C Standard follows a consistent philosophy for choosing character types:
signed char and unsigned char
- Suitable for small integer values
plain char
- The type of each element of a string literal
- Used for character data (where signedness has little meaning) as opposed to integer data
The following program fragment shows the standard string-handling function strlen() being called with a plain character string, a signed character string, and an unsigned character string. The strlen() function takes a single argument of type const char *.
size_t len;
char cstr[] = "char string";
signed char scstr[] = "signed char string";
unsigned char ucstr[] = "unsigned char string";

len = strlen(cstr);
len = strlen(scstr);  /* warns when char is unsigned */
len = strlen(ucstr);  /* warns when char is signed */
Compiling at high warning levels in compliance with “MSC00-C. Compile cleanly at high warning levels” causes warnings to be issued when
- Converting from unsigned char[] to const char * when char is signed
- Converting from signed char[] to const char * when char is defined to be unsigned
Casts are required to eliminate these warnings, but excessive casts can make code difficult to read and hide legitimate warning messages.
If this code were compiled using a C++ compiler, conversions from unsigned char[] to const char * and from signed char[] to const char * would be flagged as errors requiring casts. “STR04-C. Use plain char for characters in the basic character set” recommends the use of plain char for compatibility with standard narrow-string-handling functions.
int
The int type is used for data that could be either EOF (a negative value) or character data interpreted as unsigned char to prevent sign extension and then converted to int. For example, on a platform in which the int type is represented as a 32-bit value, the extended ASCII code 0xFF would be returned as 00 00 00 FF.
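- Consequently, fgetc(), getc(), and getchar() return int. Their wide-character counterparts fgetwc(), getwc(), and getwchar() follow the same pattern but return wint_t, which can also hold WEOF.
- The character classification functions declared in <ctype.h>, such as isalpha(), accept int because they might be passed the result of fgetc() or the other functions from this list.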
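A short sketch of both points follows. The helper names count_alpha() and count_alpha_in_string() are hypothetical. The first function stores the result of fgetc() in an int so that EOF stays distinguishable from the character value 0xFF. The second shows the cast to unsigned char that is needed when the argument to isalpha() comes from a plain char, which may be negative.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: counts alphabetic characters read from stream. */
size_t count_alpha(FILE *stream) {
    size_t count = 0;
    int c;  /* int, not char: must hold every unsigned char value and EOF */

    while ((c = fgetc(stream)) != EOF) {
        if (isalpha(c)) {  /* c is already an unsigned char value as int */
            ++count;
        }
    }
    return count;
}

/* Hypothetical helper: counts alphabetic characters in a string. */
size_t count_alpha_in_string(const char *s) {
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        /* Convert to unsigned char first: a negative char value other
           than EOF is not a valid argument to isalpha(). */
        if (isalpha((unsigned char)*s)) {
            ++count;
        }
    }
    return count;
}
```

Had c been declared as char, a byte with the value 0xFF could compare equal to EOF on platforms where char is signed and end the loop early, and the loop might never terminate where char is unsigned.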
In C, a character constant has type int. Its value is that of a plain char converted to int. The perhaps surprising consequence is that for all character constants c, sizeof c is equal to sizeof(int). This also means, for example, that sizeof 'a' is not equal to sizeof x when x is a variable of type char.
In C++, a character literal that contains only one character has type char and consequently, unlike in C, its size is 1. In both C and C++, a wide-character literal has type wchar_t, and a multicharacter literal has type int.
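The C behavior can be checked with a small sketch. The function names are hypothetical, and the first result holds only when the code is compiled as C rather than C++.

```c
#include <stddef.h>

/* In C, 'a' has type int, so this yields sizeof(int). */
size_t char_constant_size(void) {
    return sizeof 'a';
}

/* An object of type char always has size 1. */
size_t char_object_size(void) {
    char x = 'a';
    return sizeof x;
}
```

On a typical platform with a 4-byte int, the first function returns 4 and the second returns 1.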
unsigned char
The unsigned char type is useful when the object being manipulated might be of any type, and it is necessary to access all bits of that object, as with fwrite(). Unlike other integer types, unsigned char has the unique property that values stored in objects of type unsigned char are guaranteed to be represented using a pure binary notation. A pure binary notation is defined by the C Standard as “a positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral powers of 2, except perhaps the bit with the highest position.”
Objects of type unsigned char are guaranteed to have no padding bits and consequently no trap representation. As a result, non-bit-field objects of any type may be copied into an array of unsigned char (for example, via memcpy()) and have their representation examined 1 byte at a time.
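As an illustration, the following sketch copies the representation of a 32-bit integer into an array of unsigned char and inspects the first byte. The function names are hypothetical, and the sketch assumes the implementation provides uint32_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copies the object representation of value into
 * out, which must have room for sizeof(uint32_t) bytes.
 */
void representation_of(uint32_t value, unsigned char *out) {
    memcpy(out, &value, sizeof value);
}

/* Returns 1 if the least significant byte is stored at the lowest address. */
int is_little_endian(void) {
    unsigned char bytes[sizeof(uint32_t)];

    representation_of(UINT32_C(0x01020304), bytes);
    return bytes[0] == 0x04;
}
```

Because every bit pattern is a valid unsigned char value, examining the bytes in this way is well defined regardless of the type of the object that was copied.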
wchar_t
- Wide characters are used for natural-language character data.
“STR00-C. Represent characters using an appropriate type” recommends that the use of character types follow this same philosophy. For characters in the basic character set, it does not matter which data type is used, except for type compatibility.
Sizing Strings
Sizing strings correctly is essential in preventing buffer overflows and other runtime errors. Incorrect string sizes can lead to buffer overflows when used, for example, to allocate an inadequately sized buffer. The CERT C Secure Coding Standard [Seacord 2008], “STR31-C. Guarantee that storage for strings has sufficient space for character data and the null terminator,” addresses this issue. Several important properties of arrays and strings are critical to allocating space correctly and preventing buffer overflows:
Confusing these concepts frequently leads to critical errors in C and C++ programs. The C Standard guarantees that objects of type char consist of a single byte. Consequently, the size of an array of char is equal to the count of an array of char, which is also the bounds. The length is the number of characters before the null terminator. For a properly null-terminated string of type char, the length must be less than or equal to the size minus 1.
Wide-character strings may be improperly sized when they are mistaken for narrow strings or for multibyte character strings. The C Standard defines wchar_t to be an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Windows uses UTF-16 character encodings, so the size of wchar_t is typically 2 bytes. Linux and OS X (GCC/g++ and Xcode) use UTF-32 character encodings, so the size of wchar_t is typically 4 bytes. On most platforms, the size of wchar_t is at least 2 bytes, and consequently, the size of an array of wchar_t is no longer equal to the count of the same array. Programs that assume otherwise are likely to contain errors. For example, in the following program fragment, the strlen() function is incorrectly used to determine the size of a wide-character string:
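The relationship between length and size can be enforced before copying. The following is a minimal sketch, assuming a hypothetical helper named copy_if_fits() that refuses to copy a string whose length exceeds the destination size minus 1.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copies src into dest only if the length of src is
 * at most dest_size - 1, so that the null terminator also fits.
 * Returns 0 on success and -1 if the string would not fit.
 */
int copy_if_fits(char *dest, size_t dest_size, const char *src) {
    size_t length = strlen(src);

    if (dest_size == 0 || length > dest_size - 1) {
        return -1;
    }
    memcpy(dest, src, length + 1);  /* the + 1 copies the terminator */
    return 0;
}
```

With a 6-byte destination, "hello" (length 5) is accepted, whereas "hello!" (length 6) is rejected because no room would remain for the null terminator.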
wchar_t wide_str1[] = L"0123456789";
wchar_t *wide_str2 = (wchar_t *)malloc(strlen(wide_str1) + 1);
if (wide_str2 == NULL) {
  /* handle error */
}
/* ... */
free(wide_str2);
wide_str2 = NULL;
When this program is compiled, Microsoft Visual Studio 2012 generates an incompatible type warning and terminates translation. GCC 4.7.2 also generates an incompatible type warning but continues compilation.
The strlen() function counts the number of characters in a null-terminated byte string preceding the terminating null byte (the length). However, wide characters can contain null bytes, particularly when taken from the ASCII character set, as in this example. As a result, the strlen() function will return the number of bytes preceding the first null byte in the string.
In the following program fragment, the wcslen() function is correctly used to determine the size of a wide-character string, but the length is not multiplied by sizeof(wchar_t):
wchar_t wide_str1[] = L"0123456789";
wchar_t *wide_str3 = (wchar_t *)malloc(wcslen(wide_str1) + 1);
if (wide_str3 == NULL) {
  /* handle error */
}
/* ... */
free(wide_str3);
wide_str3 = NULL;
The following program fragment correctly calculates the number of bytes required to contain a copy of the wide string (including the termination character):
wchar_t wide_str1[] = L"0123456789";
wchar_t *wide_str2 = (wchar_t *)malloc(
  (wcslen(wide_str1) + 1) * sizeof(wchar_t)
);
if (wide_str2 == NULL) {
  /* handle error */
}
/* ... */
free(wide_str2);
wide_str2 = NULL;
The CERT C Secure Coding Standard [Seacord 2008], “STR31-C. Guarantee that storage for strings has sufficient space for character data and the null terminator,” provides additional information with respect to sizing wide strings.
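The correct calculation can be packaged as a reusable function. The following is a sketch only; wide_duplicate() is a hypothetical name, and the overflow check is an added precaution that the fragments above do not include.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns a newly allocated copy of src, or NULL on
 * failure. The caller is responsible for calling free() on the result.
 */
wchar_t *wide_duplicate(const wchar_t *src) {
    size_t count = wcslen(src) + 1;  /* elements, including terminator */
    wchar_t *copy;

    if (count > SIZE_MAX / sizeof(wchar_t)) {
        return NULL;  /* the byte count would wrap around */
    }
    copy = malloc(count * sizeof *copy);
    if (copy != NULL) {
        wmemcpy(copy, src, count);  /* copies count wide characters */
    }
    return copy;
}
```

Note that malloc() takes a size in bytes whereas wmemcpy() takes a count of wide characters, which is the same size-versus-count distinction that the flawed fragments get wrong.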