The C++ Programming Language – Part 3

This post will be mostly for personal reference as I go through Bjarne Stroustrop’s “The C++ Programming Language” 4th edition textbook. Some of the notes will appear random. This post will cover mostly the vector container. More containers and some algorithms will be featured in part 4. 

Chapter 4 – Containers and Algorithms

The C++ standard library provides standard types and takes up 2/3rds of the ISO C++ standard. It provides facilities for runtime support like allocations, the C standard library, strings, io streams, containers, algorithms, numerical computation, regex matching, concurrency, template programming, smart pointers, and more. In short, the C++ stdlib contains common fundamental data structures along with algorithms used on said data structures. The primary purpose of the C++ stdlib is to provide you with a well-designed and well-test tool set to solve problems in an efficient way. Each tool in the tool set utilizes C++ language features in near optimal/optimal ways.

Strings in C++ are mutable, as opposed to other languages like Python. You can utilize the [] operator to change a C++ String.

/* g++ -std=c++11 -o string string.cpp */

#include <stdio.h>
#include <iostream>
#include <string>

using namespace std;

string compose(const string& name, const string& domain)
{
    return name  + '@' + domain;
}

int main()
{
    auto addr = compose("yee", "badbytes.io");

    /* printf doesnt know what a C++ string is, so must pass C string */
    printf("%s\n", addr.c_str());

    /* cout ostream knows how to handle C++ string type */
    cout << addr[2] << endl;
}

Output:

user@ubuntu:~/cpp/part_1/chapter_4$ ./string
yee@badbytes.io
e

Input/output (io) of built-in types is straight forward. cout and cin from the iostream library can intake all built-in types. iostream also allows developers to define IO operations for their own user-defined types. For example:

/* g++ -std=c++11 -o entry entry.cpp */

#include <iostream>
#include <string>


using namespace std;


struct Entry {
    string name;
    int number;
};


/* user-defined type output */
ostream& operator<< (ostream& os, const Entry& e)
{
    return os << "{\"" << e.name << "\"," << e.number << "}" << endl;
}


/* user-defined type input */
istream& operator>> (istream& is, Entry& e)
{
    char c;
    char c2;
    if (is >> c && c == '{' && is >> c2 && c2 == '"') {
        string name;
        while (is.get(c) && c != '"') {
            name += c;
        }
        if (is >> c && c == ',') {
            int number = 0;
            if (is >> number >> c && c == '}') {
                e = {name,number};
                return is;
            }
        }
    }

    is.setstate(ios_base::failbit);
    return is;
}


int main()
{
    /*
    for (Entry ee; cin >> ee;) {
        cout << ee << endl;
    }
    */

    Entry ee;
    cin >> ee;
    cout << ee;
    return 0;
}

Output:

user@ubuntu:~/cpp/part_1/chapter_4$ ./entry 
{"badbytes", 1}
{"badbytes",1}

A container is a class with the main purpose of holding objects. The entry class shown earlier is a container. Vector is the most useful standard library container. It holds a sequence of elements of a given type and provides efficient ways to operate on the held objects. Vector stores elements contiguously in memory, giving it O(1) access time since you hold a starting pointer, then access elements by offset. Iterators are provided for every standard library container. They expose begin() and end() functions. Begin returns a pointer to the first element, end() points to one-past-the-end of the container. This can be a source of bugs, so be mindful when using end() while using a container iterator. Container iterators can also use unary operators like ++ and — to traverse the objects. While looping through a container with an iterator named iter, *iter is the current container element. If this pointer de-reference returns an object with members, then iter->member is valid. Using this syntax semantically the same as using (*iter)->member. Example vector initialization and usage:

/* g++ -std=c++11 -o vector_init vector_init.cpp */

#include <vector>
#include <iostream>

using namespace std;


int main()
{
    vector<int> v1 = {1, 2, 3, 4};
    vector<string> v2;
    vector<char> v3(23);
    vector<double> v4(8, 9.9);

    cout << "v1 size: " << v1.size() << endl;
    cout << "v1[2]: " << v1[2] << endl;

    cout << "v2 size: " << v2.size() << endl;

    cout << "v3 size: " << v3.size() << endl;
    cout << "v3 vector contents: " << endl;
    for (int i = 0; i != v3.size(); ++i) {
        cout << v3[i];
    }

    cout << "v4 size: " << v4.size() << endl;
    cout << "v4 vector contents: " << endl;
    for (int i = 0; i != v4.size(); ++i) {
        cout << v4[i] << " ";
    }
    cout << endl;

    return 0;
}

Output:

user@ubuntu:~/cpp/part_1/chapter_4$ ./vector_init 
v1 size: 4
v1[2]: 3
v2 size: 0
v3 size: 23
v3 vector contents: 
v4 size: 8
v4 vector contents: 
9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9

A second example using the vector container:

/* g++ -std=c++11 -ggdb -o vector_usage vector_usage.cpp */

#include <vector>
#include <string>
#include <iostream>


using namespace std;


void print_names(const vector<string>& names)
{
    for (int i = 0; i < names.size(); ++i) {
        cout << names[i] << endl;
    }
}


int main()
{
    vector<string> names = {"bad", "bytes", "yee"};
    print_names(names);
    return 0;
}

Output:

user@ubuntu:~/cpp/part_1/chapter_4$ ./vector_usage 
bad
bytes
yee

I stated earlier that stdlib vector stores elements contiguously in memory, giving it O(1) access time since you hold a starting pointer, then access elements by offset. Lets head into a debugger and make sure that the vector class is actually storing elements contiguously. If you want to replicate this, I suggest adding set print asm-demangle on in .gdbinit. This makes gdb show cleaner function names. For example,

<_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEED1Ev@plt>

becomes

<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()@plt>.

Which lets be honest, still looks confusing, but its better. Disassembling main()

pwndbg> disas main
...  
   0x0000000000400e9c <+41>:	call   0x400cb0 <std::allocator<char>::allocator()@plt>
   0x0000000000400ea1 <+46>:	lea    rdx,[rbp-0xb4]
   0x0000000000400ea8 <+53>:	lea    rax,[rbp-0x90]
   0x0000000000400eaf <+60>:	mov    esi,0x401945
   0x0000000000400eb4 <+65>:	mov    rdi,rax
   0x0000000000400eb7 <+68>:	call   0x400ca0 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)@plt>
   0x0000000000400ebc <+73>:	lea    rax,[rbp-0xb3]
   0x0000000000400ec3 <+80>:	mov    rdi,rax
   0x0000000000400ec6 <+83>:	call   0x400cb0 <std::allocator<char>::allocator()@plt>
   0x0000000000400ecb <+88>:	lea    rax,[rbp-0xb3]
   0x0000000000400ed2 <+95>:	lea    rdx,[rbp-0x90]
   0x0000000000400ed9 <+102>:	lea    rcx,[rdx+0x20]
   0x0000000000400edd <+106>:	mov    rdx,rax
   0x0000000000400ee0 <+109>:	mov    esi,0x401949
   0x0000000000400ee5 <+114>:	mov    rdi,rcx
   0x0000000000400ee8 <+117>:	call   0x400ca0 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)@plt>
   0x0000000000400eed <+122>:	lea    rax,[rbp-0xb2]
   0x0000000000400ef4 <+129>:	mov    rdi,rax
   0x0000000000400ef7 <+132>:	call   0x400cb0 <std::allocator<char>::allocator()@plt>
   0x0000000000400efc <+137>:	lea    rax,[rbp-0xb2]
   0x0000000000400f03 <+144>:	lea    rdx,[rbp-0x90]
   0x0000000000400f0a <+151>:	lea    rcx,[rdx+0x40]
   0x0000000000400f0e <+155>:	mov    rdx,rax
   0x0000000000400f11 <+158>:	mov    esi,0x40194f
   0x0000000000400f16 <+163>:	mov    rdi,rcx
   0x0000000000400f19 <+166>:	call   0x400ca0 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)@plt>
   0x0000000000400f1e <+171>:	lea    rax,[rbp-0x90]
   0x0000000000400f25 <+178>:	mov    r12,rax
   0x0000000000400f28 <+181>:	mov    r13d,0x3
   0x0000000000400f2e <+187>:	lea    rax,[rbp-0xb1]
   0x0000000000400f35 <+194>:	mov    rdi,rax
   0x0000000000400f38 <+197>:	call   0x401174 <std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::allocator()>
   0x0000000000400f3d <+202>:	lea    rdx,[rbp-0xb1]


pwndbg> x/1s 0x401945
0x401945:	"bad"
pwndbg> x/1s 0x401949
0x401949:	"bytes"
pwndbg> x/1s 0x40194f
0x40194f:	"yee"

It looks like the 3 strings that I put into the vector container are initially in .rodata based on objdump.

user@ubuntu:~/cpp/part_1/chapter_4$ objdump -h vector_usage
...
15 .rodata       00000013  0000000000401940  0000000000401940  00001940  2**2
                 CONTENTS, ALLOC, LOAD, READONLY, DATA

All 3 strings are copied from read-only data onto main’s stack frame.

pwndbg> x/1s 0x7fffffffdc00
0x7fffffffdc00:	"bad"
pwndbg> x/1s 0x7fffffffdc20
0x7fffffffdc20:	"bytes"
pwndbg> x/1s 0x7fffffffdc40
0x7fffffffdc40:	"yee"

pwndbg> info proc mapping
...
            0x604000           0x636000    0x32000        0x0 [heap]
...
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]

I expected these strings to be placed on the heap by the container, but in this example they are on the stack. Next, the values in the iterator are looped over and printed, then the container is destructed. This confirms that the elements are actually stored consecutively allowing for O(1) access based on offset, but this use case is very simple and has tiny strings.

pwndbg> x/40s 0x7fffffffdc00
0x7fffffffdc00:	"bad"
0x7fffffffdc04:	""
0x7fffffffdc05:	""
0x7fffffffdc06:	""
0x7fffffffdc07:	""
0x7fffffffdc08:	"0\334\377\377\377\177"
0x7fffffffdc0f:	""
0x7fffffffdc10:	" \334\377\377\377\177"
0x7fffffffdc17:	""
0x7fffffffdc18:	"\005"
0x7fffffffdc1a:	""
0x7fffffffdc1b:	""
0x7fffffffdc1c:	""
0x7fffffffdc1d:	""
0x7fffffffdc1e:	""
0x7fffffffdc1f:	""
0x7fffffffdc20:	"bytes"
0x7fffffffdc26:	""
0x7fffffffdc27:	""
0x7fffffffdc28:	"\377\377"
0x7fffffffdc2b:	""
0x7fffffffdc2c:	"\001"
0x7fffffffdc2e:	""
0x7fffffffdc2f:	""
0x7fffffffdc30:	"@\334\377\377\377\177"
0x7fffffffdc37:	""
0x7fffffffdc38:	"\003"
0x7fffffffdc3a:	""
0x7fffffffdc3b:	""
0x7fffffffdc3c:	""
0x7fffffffdc3d:	""
0x7fffffffdc3e:	""
0x7fffffffdc3f:	""
0x7fffffffdc40:	"yee"
0x7fffffffdc44:	""
0x7fffffffdc45:	""
0x7fffffffdc46:	""
0x7fffffffdc47:	""
0x7fffffffdc48:	"\r\031@"

Let’s see what happens with strings 100,000 chars long:

/* g++ -std=c++11 -ggdb -o vector_usage vector_usage.cpp */

#include <vector>
#include <string>
#include <iostream>


using namespace std;


int main()
{

    string a(100000, 'A');
    string b(100000, 'B');
    string c(100000, 'C');

    vector<string> names = {a, b, c};

    return 0;
}

In this case, heap space is allocated by the container.

pwndbg> 
0x0000000000400c87	12	    string a(100000, 'A');
LEGEND: STACK | HEAP | CODE | DATA | RWX | RODATA
───────────────────────────────────────────────────────────────────────────────[ REGISTERS ]───────────────────────────────────────────────────────────────────────────────
 RAX  0x615c20 ◂— 0x4141414141414141 ('AAAAAAAA')

...

pwndbg> malloc_chunk a
0x615c20 PREV_INUSE {
  prev_size = 4702111234474983745, 
  size = 4702111234474983745, 
  fd = 0x4141414141414141, 
  bk = 0x4141414141414141, 
  fd_nextsize = 0x4141414141414141, 
  bk_nextsize = 0x4141414141414141
}

The same procedure occurs for variables b and c. Now to make sure they are contiguous:

pwndbg> malloc_chunk a
0x615c20 PREV_INUSE {
  prev_size = 4702111234474983745, 
  size = 4702111234474983745, 
  fd = 0x4141414141414141, 
  bk = 0x4141414141414141, 
  fd_nextsize = 0x4141414141414141, 
  bk_nextsize = 0x4141414141414141
}
pwndbg> malloc_chunk b
0x62e2d0 IS_MMAPED {
  prev_size = 4774451407313060418, 
  size = 4774451407313060418, 
  fd = 0x4242424242424242, 
  bk = 0x4242424242424242, 
  fd_nextsize = 0x4242424242424242, 
  bk_nextsize = 0x4242424242424242
}
pwndbg> malloc_chunk c
0x646980 PREV_INUSE IS_MMAPED {
  prev_size = 4846791580151137091, 
  size = 4846791580151137091, 
  fd = 0x4343434343434343, 
  bk = 0x4343434343434343, 
  fd_nextsize = 0x4343434343434343, 
  bk_nextsize = 0x4343434343434343
}

...

user@ubuntu:~/cpp/part_1/chapter_4$ python 
>>> hex(0x615c20 + 100000)
'0x62e2c0'
>>> hex(0x62e2d0 + 100000)
'0x646970'


...

pwndbg> x/100xg 0x62e2c0
0x62e2c0:	0x0000000000000000	0x00000000000186b1
0x62e2d0:	0x4242424242424242	0x4242424242424242
0x62e2e0:	0x4242424242424242	0x4242424242424242
0x62e2f0:	0x4242424242424242	0x4242424242424242
0x62e300:	0x4242424242424242	0x4242424242424242
pwndbg> x/100xg 0x646970
0x646970:	0x0000000000000000	0x00000000000186b1
0x646980:	0x4343434343434343	0x4343434343434343
0x646990:	0x4343434343434343	0x4343434343434343
0x6469a0:	0x4343434343434343	0x4343434343434343
0x6469b0:	0x4343434343434343	0x4343434343434343
0x6469c0:	0x4343434343434343	0x4343434343434343

And there they are, nice and contiguous in memory. A good next experiment may be to create wildly different sized vector members and see how it handles that. Since the heap tries to separate allocations based on size, I wonder how vector maintains contiguous members. It probably requests a block large enough to hold all its members regardless of their comparable size, so that a member of size 100 can still be contiguous to a member of size 100000. Otherwise, the OS memory allocator will most likely place those in different heap bins. I took a look at the vector source code but its too complex for the time I have. If you want to read it, see stl_vector.h. This was a bit of an aside, stay tuned for part 4.