Data Structures Simplified

You may have heard that computers are basically pretty simple in design. The general description of the the standard Von Neumann architecture, on which virtually all modern computers are based, is that of a central processing unit (CPU), memory (whether RAM or some other kind), a data bus (wires) connecting them, and in addition some kind of input and output which allows the contents of memory to be read from or written to by some kind of user interface devices. (I/O devices are only necessary because you can’t see the contents of electrical memory devices nor write to them by hand. Once upon a time, that was not the case. Well, it’s also hard for people to read in binary. Although, once upon a time, there were some people for whom that was not the case, either.) The CPU starts reading memory at a predetermined location and follows instructions found therein. These instructions correspond to functional parts of the CPU, which perform operations on registers, which are a special storage space inside the CPU. (Registers are necessary for both speed and wiring concerns.)

But that’s the machinery. It’s basically like a big pattern of falling dominoes. However, to write any kind of useful software, you need guidelines for how to organize groups of instructions in such a way that they do useful things. The first guideline was the logical distinction between two kinds of contents that can be found in memory. I’ve told you the first one: instructions. The other is just called data. The truth is that instructions are just a special kind of data, which is why you can put them in the same memory as other data, what you might call raw data. Instructions inevitably describe operations, and raw data is generally operated upon, but this distinction is completely arbitrary, and is made in order to reduce the complexity of software to the point that it can be comprehended by mere mortals. To a computer, the distinction is irrelevant. The only time a computer notices the difference is if you feed it raw data in the guise of an instruction that it does not recognize, in which case it is usually designed to crash in some obvious way.

What I’m getting at is that software is just structure applied to what is otherwise a completely uniform physical structure: banks of memory made of very many tiny on-off switches. The structure is dependent on context, and the context is usually provided by the language of the computer. From there, what begins very simple quickly becomes fiendishly complicated. The job of programmers is to do create lists of nice, orderly instructions which do something useful, while avoiding the tendency of programs to become so complicated that no one can understand how they work.

To that end, programmers have to inflict limitations on themselves. They also have to do their best to impose explicit structure on their programs and the raw data with which those programs work. Now, I’m being quite abstract. How do they limit themselves? How do they impose structure? There are countless ways, but the first is the language itself.

There are many computer languages. There are dozens in common usage right now. I suppose there are a couple of dozen major languages. They have names like BASIC, Fortran, Perl, PHP, Python, Ruby, Java, ML, Prolog, LISP, Smalltalk, C and a family based on C, including C++, C#, Objective-C, and D. All of these are high-level languages, although some are “higher” than others. The only real languages are the ones spoken by CPUs, those instructions I mentioned earlier.

CPUs can only do a few kinds of things. Well, basically four. Only two kinds of them are real work. Read memory (copy into registers). Write memory (copy from registers). Bit manipulation (operate on a single register). Arithmetic and logical comparisons (operate on multiple registers). An example of bit manipulation is to toggle every switch (the bits), which is to say, turn off the ones that are on, and turn on the ones that are off. Shifting bits is also very common. Arithmetic is obvious: add, subtract, multiply and divide. It’s all done with wires and junctions called gates. Gates have one or more wires in, and one wire out, and all they do is read the voltage on the incoming wire(s) and either put a voltage on the outgoing wire or not. But when you combine a few million gates and wires, you get a pretty fancy CPU. But it all comes down to gates.

I’ve left out one important operation: there’s an instruction which tells the CPU where to get its next instruction. The default is usually just the next one in line, but you can tell it to jump forward or back, instead.

Likewise, software all comes down to simple operations and simple structures. By combining a number—a few, hundreds, thousands or millions—of these operations, you can make a program that either follows standard rules, and is understandable, or is completely anarchistic and incomprehensible. There is an entire science to trying to classify programs based on their comprehensibility, including their predictability. My favourite programs are not predictable, because they would be no fun. But they are not unpredictable because of bad software design, but because they are interactive (as opposed to non-interactive, or batch), and give users lots of freedom to put lots of unpredictable data into them. However, programs like that are, in fact (if designed well), usually made up of many smaller programs working together, each of which should be very predictable.

As instructions can be organized according to sensible rules in order to keep them easy to understand, so too can raw data be organized to the same effect. Or not. In a computer science course devoted to data structures, students encounter many different kinds of structure: stack, queue, dequeue, hash, tree, graph, and others. There is only one basic kind of data structure, however, and if you think about memory banks, you will realize it’s not for any kind of obscure, magic reason. It is the list. Every other data structure is just a means of combining lists and lists of lists.

CPUs read instructions one by one, in order, and such a sequence of instructions is basically a list. And just as the CPU can be instructed to deviate from simple list-processing behaviour and jump around in memory to different locations to find its next instruction to follow, so, too, can data be re-ordered by telling the CPU to jump around while looking for the next byte of data to process. All of this jumping around is done with something that is called a “pointer” in those languages that have them (many do not, at least, not explicitly), but a pointer is simply an address, or, alternately, an offset, which is a pompous computer engineering term for a number of steps forward or backward through memory. It’s kind of like in a board game when you get sent forward or backward some number of squares. Your game piece is the pointer, although in the CPU, it is remembered as an address in a dedicated register. (If the board game had every square numbered, it would be almost exactly like computer memory.)

A list is just what it sounds like: an number of items, one after another, contiguously arranged in memory. Sometimes they are called an array. There are two kinds: fixed length and arbitrary length. A fixed length list has its length stored somewhere so that the CPU can determine how long it is. An arbitrary length list has either a prefix which contains its length or, more commonly, has a special marker at the end to signify the end. Unstructured lists can’t provide much flexibility, which is why we need to be able to mix it up by creating lists of lists. That one simple addition to the basic list, made possible by pointers, is the key to every kind of different data, including the most familiar kind, the file or document. Actually, there is one other tweak to the basic list which is also necessary. Essentially it, too, is just a list of lists, but it isn’t thought of as such by programmers any more. It’s called a “record”, as in a series of data values of different kinds, as most commonly found in a database, or written on an index card.

(What makes a record a list of lists is that, unlike a pure list, which uniform and doesn’t require any special knowledge of its structure, a record needs a master record template, or specification. The template itself is just a list, but it describes the elements in the record, usually including a name and a type, and possibly a length, in case the element is a fixed-length list. The fact is that different types of data take up different amounts of space in memory, and so are themselves like little lists of bytes, but they are not only fixed-length but pre-determined length, and the CPU always knows how long they are in advance, as part of its absolute knowledge of its universe.)

If you then combine lists and records, you can make arbitrarily complicated structures of data, and thus end up with colour pictures, sounds, word processing documents, databases and on and on and on.

Comments are closed.