

This is the first in a series of articles explaining those bits of Unix you need to know in order to write code which successfully extracts data from large databases. Here, we cover the Unix kernel. Subsequent articles will explain multithreading, dynamic memory allocation, embedded SQL, pipes, sockets etc.

Introduction to a series of articles about data extraction from large databases

Introduction

 

If we need to manipulate data in databases where several hundred tables each contain several million rows, and we need to finish the task before the office opens in the morning, then we'll need to do more than write optimised SQL.

It's extremely frustrating to watch our software take several hours to grind from start to finish, while the performance meter shows that the machine is 60% idle.

It's even more frustrating to see our carefully optimised, multi-threaded program sitting stalwartly on one CPU, leaving the other eleven CPUs doing virtually nothing.

Clearly, what is needed is a way of accessing the unused capacity of the machine and harnessing it to focus on one application. The way we do this is to take advantage of the fact that Unix is a true multi-tasking operating system. As such, it has a sophisticated process control algorithm, which can be persuaded to devote as much of the machine's CPU resources as are free, at any given instant, to a given application.

The word application is chosen carefully, to convey the meaning of an operation, or set of operations, which we, the users, want the machine to complete. This distinguishes it from any single computer operation such as a process, task or job, since our application will probably be spread across several of these. A simple example of such an application might be extracting data from a few database tables, performing some operations on the data, and putting the results into a different database.

The software architecture used to achieve this objective we call a multi-dimensional architecture, to distinguish it from simple programs which execute all of their code within one process, and on one CPU. Such programs can be described as possessing a one-dimensional architecture.

What Is Multi-Dimensional Programming?

Fundamentally, multi-dimensional programming is a method of achieving parallel processing, but without the portability problems associated with parallelising compilers, or the need to use a parallel database engine. A multi-dimensional program can be compiled on a single-CPU machine, running a standard Oracle database engine, and the same executable can be transferred to a twelve-CPU machine, where it will automatically yield improved performance, because it is truly hardware-scaleable.
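As an illustration of what hardware-scaleability means in practice, here is a minimal sketch (our own, not from the application itself) in which a program asks the kernel how many CPUs are online and sizes its pool of worker processes to match; the MAX_WORKERS cap is an assumption for the example:

    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: the same executable adapts itself to the hardware by
     * asking how many CPUs are online, and sizing its pool of worker
     * processes accordingly. MAX_WORKERS is a hypothetical cap. */
    #define MAX_WORKERS 12

    int main(void)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

        if (ncpu < 1)
            ncpu = 1;               /* fall back to one worker */
        if (ncpu > MAX_WORKERS)
            ncpu = MAX_WORKERS;

        printf("Launching %ld workers, one per CPU\n", ncpu);
        /* fork() one child per CPU here - see the fork() section below */
        return 0;
    }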

A well-known commercial example of multi-dimensional architecture is that employed in the design of the Teradata database engine.

Designing an application program to the multi-dimensional paradigm does require a fundamentally different top-level approach, since we have to superimpose the framework of the architecture onto the main body of application code. However, it does not require any fundamental changes in design methodology. This means there are no major upheavals in the way a project is planned or executed.

The architectural framework, which we need to design, is code which permits our application program to interact fairly intimately with the operating system.  In order to achieve this, we will require an in-depth understanding of:

  • The inner workings of the Unix kernel, with particular attention to the process scheduling algorithm.
  • Kernel system calls associated with the creation and manipulation of processes.
  • Most of the Unix inter-process communication methods.
  • The principles of applying multi-threading techniques, using pthreads.
  • The use of hash tables and binary searches for fast access to data.

Additional code which we will need to write, in order to support the above operations, will require, at least, a nodding acquaintance with some of the following:

  • Performance-optimal methods of dynamic memory allocation and reallocation.
  • Preferred methods of writing embedded SQL, with a view to maximising the amount of CPU time given to the application by the Unix kernel.
  • The design of interrupt handlers.
  • The design of process schedulers.
  • The mapping of files to memory, for faster repeated access.

 

All of the above topics are covered in this article, with full code examples, most of which can be re-used, perhaps, as modules in a library, with little or no modification.

Although this article aims to be as database-neutral, machine-neutral and language-neutral as possible, the example code is, quite obviously, Oracle Pro*C, written to run on a multi-processor Sun Solaris machine. However, there is very little either Oracle-specific or Solaris-specific code included and, where this is unavoidable, it is mentioned as such.

Most of the major database vendors support the ANSI standard embedded SQL format, with minor syntactic variations in the way cursors are defined, and connections to the database engine are made and broken. With very little porting effort, most of the example code will work on DB2, Informix and, possibly, others.
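To make those variations concrete, here is a minimal, illustrative Pro*C fragment; the credentials, table and column names are invented for the example, and the DB2 form of CONNECT shown in the comment is the main line that would change between vendors:

    /* Oracle Pro*C; under DB2 the connection would instead read
     * EXEC SQL CONNECT TO mydb USER :uid USING :pwd;
     * The credentials and the accounts table are illustrative only. */
    EXEC SQL BEGIN DECLARE SECTION;
    char uid[32] = "scott";
    char pwd[32] = "tiger";
    long acct_id;
    EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT :uid IDENTIFIED BY :pwd;

    EXEC SQL DECLARE acct_cur CURSOR FOR
        SELECT account_id FROM accounts;

    EXEC SQL OPEN acct_cur;
    EXEC SQL FETCH acct_cur INTO :acct_id;
    EXEC SQL CLOSE acct_cur;

    EXEC SQL COMMIT WORK RELEASE;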

For devotees of Pro*C++, we offer the following advice. The Unix operating system is written in C, which is why the kernel system call interface is a C interface. If you can afford the five to nine times performance hit resulting from wrapping all the system code described here in a C++ layer, then that is your choice, but it's a bit like tuning your car for speed, then making it pull a trailer.

Who Should Not Read This Article?

It may be evident, from the above, that this is not an article for those new to either programming, or to the Unix operating system. The design of software architecture occurs at the level of abstraction just above coding, so we have to assume a fairly expert knowledge of, at least, the C programming language. Also, although we will devote a little time to the construction of embedded SQL programs, we will assume that the reader is proficient in the design of efficient SQL queries. Clearly, this implies a high level of familiarity with the workings of a relational database, such as Oracle, DB2, Sybase or Informix.

 

Many articles on programming techniques are centred around an example application, which is usually too specific, or too simplistic to be of any use in the real world, and which the reader is invited to code and compile, as part of the learning process. We have departed from this model slightly, in so far as we will define a generic, database data extraction application, and derive a series of code modules, which can be reconfigured to suit individual variations.

We will look at two basic architectures, which feature fairly heavily in the information technology field.

  • The message hub

This architecture receives data messages as files, TCP/IP streams, or any other format. It performs some kind of processing on the data, then passes it on to the next destination, which is not necessarily a database.

  • The data manipulation hub

The functionality embodied in this architecture is a super-set of that in the message hub. This process extracts data from a database, performs some kind of intermediate processing, then puts the data back into either the same, or a different, database.

This is the application on which we will concentrate our coding efforts.

Since either of the above may be implemented as a daemon or background process, which wakes up at a particular time to do its processing, we will also examine the design of a simple, efficient scheduling mechanism.

The Generic Application

It is always easier to relate to an application if it happens to coincide with one's own line of work. In view of this, we make the following suggestions as to what our generic application actually does, and invite readers to select whichever is most suitable.

  1. A bank database, comprising customer data, account details, details of financial services, transactions etc.
  2. A major Telco database, comprising customer data, account details, details of telecommunication services etc.
  3. A retail chain database, comprising customer data, account details, details of stock, stores, prices etc.
  4. Any other field, where databases of over a terabyte need to be interrogated, or supplied with data.

Part 1. The Unix Kernel


It is usually a source of wonderment to PC users that the whole of the Unix operating system is in one executable. Instead of a hodge-podge of DLLs, drivers and various occasionally-cooperating executables, everything is done by the Unix kernel.

When Unix was first introduced, it was usually described as having a shell, or user interface, which surrounded a kernel which interpreted the commands passed to it from the shell.

With the passage of time, and the advent of graphical window systems on Apollo and Sun computers in 1983, this model became a bit strained at the edges. However, it still provides a useful mental image of the system, and window systems can be thought of as a candy coat around the shell. In fact, it isn't just X Windows which has a direct path to the kernel, since TCP/IP also falls into this category.

The following is not intended to be an exhaustive treatise on the inner workings of the Unix kernel, nor is it specific to any particular brand of Unix. However, it is essential to broadly understand certain important functions of the kernel, before we can take advantage of some of its features, to improve the way in which it handles our process.

The Unix kernel possesses the following functionality, much of which is of interest to us, in pursuit of this goal:

 

  • System calls.

All of the most basic operating system commands are performed directly by the kernel. These include:

  • open()
  • close()
  • dup()
  • read()
  • write()
  • fcntl()
  • ioctl()
  • fork()
  • exec()
  • kill()

Since the above commands are actually executed by the kernel, the C compiler doesn't need to generate any actual machine code to perform the function. It merely places a hook in the executable, which instructs the kernel where to find the function.

Modern, third party compilers, however, are ported to a variety of operating systems, and will generate machine code for a dummy function, which itself contains the hook.
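To make the idea of the hook concrete, here is a small sketch; both lines end up in the same kernel routine, the first through the normal libc stub, the second trapping into the kernel explicitly by system call number, via the generic syscall() entry point where the platform provides one:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>

    /* Both calls below reach the same kernel routine: the first via
     * the libc stub, the second as an explicit trap by call number. */
    int main(void)
    {
        const char *msg = "hello via the kernel\n";

        write(1, msg, strlen(msg));               /* libc stub */
        syscall(SYS_write, 1, msg, strlen(msg));  /* explicit trap */
        return 0;
    }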

  • Process scheduling and control

The kernel determines which processes will run, when and for how long. We will examine this mechanism, in detail, later.

  • Networking

What became networking was originally designed so that processes could communicate. It is this ability to pass information quickly and efficiently from one running process to another which makes the Unix operating system uniquely capable of multi-dimensional operation. All of the most important communication commands are system calls; a minimal client sketch appears below.

  • socket()
  • connect()
  • bind()
  • listen()
  • accept()
  • send()
  • recv()
  • Device drivers for all supported hardware devices.

Unlike other operating systems, where device drivers are separate programs which are individually loaded into memory, the Unix kernel inherently contains all of the machine's drivers. Contrary to what may be supposed, the entries in the /dev directory are not drivers; they are just access points into the appropriate kernel routine. We will not concern ourselves unduly with the Unix device drivers, as they are outside the scope of this article.
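As a taste of the networking calls listed above, the following is a minimal, illustrative TCP client; the address and port are invented for the example, and on Solaris it would be linked with -lsocket -lnsl:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Minimal TCP client sketch: connect, send a request, read the
     * reply. 127.0.0.1:9000 is purely illustrative. */
    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        char buf[256];
        int n;

        if (fd == -1) {
            perror("socket");
            return 1;
        }
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9000);
        addr.sin_addr.s_addr = inet_addr("127.0.0.1");

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
            perror("connect");
            return 1;
        }
        send(fd, "ping\n", 5, 0);
        if ((n = recv(fd, buf, sizeof(buf) - 1, 0)) > 0) {
            buf[n] = '\0';
            printf("server said: %s", buf);
        }
        close(fd);
        return 0;
    }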

 

Anatomy of a process

[Diagram: process memory layout. A single-threaded process has one text segment, one data segment and a single stack. A multi-threaded process shares one text segment and one data segment, but each thread (thread 1, thread 2, thread 3) has its own stack.]

When an executable is invoked, the following events occur, although not necessarily in this order.

Process Loading

  1. The loader
  • Fetches the executable file from the disk
  • Allocates memory for all of the global variables and data structures, (the data segment) and loads the variables into that area of memory.
  • Loads the machine code of the executable itself (the text segment) into memory. With demand-paged executables, this is not strictly the case, as the amount of code actually loaded into memory is several 4k pages. The remainder is put into the swap area of the disk.
  • Searches the header portion of the executable, for any dynamically-linked libraries or modules.
  • Checks to see if these are already loaded and, if not, loads the modules into memory; otherwise, their base addresses are noted.
  • Makes available the base addresses of dynamically-linked modules/libraries to the process.
  • Allocates an area of memory for the stack. If the process is multi-threaded, a separate stack area is allocated for each thread.
  2. The kernel
  • Sets the program counter to the first byte of executable code.
  • Allocates a slot in the process table to the new process
  • Allocates a process ID to the new process
  • Allocates table space for any file descriptors
  • Allocates table space for any interrupts.
  • Sets the ready to run flag of the process.

All of the above resources, allocated to a given process, constitute the context of a process. Each time the kernel activates a new process, it performs a context switch, by replacing the resources of the previously running process with those of the current one.

At this point, the scheduling algorithm takes over.

The Process Scheduling Algorithm

While the process is running, it runs in one of two modes:

  1. Kernel mode

    All system calls are executed by the kernel, and not by machine code within the process. Kernel mode has one very desirable characteristic: system calls are atomic and, hence, cannot be interrupted. One of the most important factors in writing code for high-performance applications is to ensure that your process executes as much in kernel mode as possible. This way, you can guarantee the maximum CPU time for a given operation.

    Kernel threads, such as pthreads, run in kernel mode. It is sometimes worth using a thread, even when doing so doesn't constitute parallel operation, purely to get the advantage of running in kernel mode.

    If an interrupt occurs, while in this mode, the kernel will log the signal in the interrupt table, and examine it after the execution of the current system call. Only then will the signal be actioned.

  2. User mode

The ordinary machine code, which makes up much of the executable, runs in user mode. There are no special privileges associated with user mode, and interrupts are handled as they arrive.

It may be seen that, during the time that a process runs, it is constantly switching between kernel mode and user mode. Since the mode switch occurs within the same process context, it is not a computational burden.

A Unix process has one of the following states:

  • Sleep
  • Run
  • Ready to Run
  • Terminated

The scheduling algorithm is a finite-state machine, which moves the status of the process between states, depending on certain conditions.

Basically, what happens is this.

A process begins to execute. It runs until it needs to perform I/O, then, having initiated the I/O, puts itself to sleep. At this point, the kernel examines the next process table slot and, if the process is ready to run, it enables its execution.

If a process never performs I/O, such as processes which perform long series of floating point calculations, the kernel permits it to only run for a fixed time period, of between 20 and 50 milliseconds, before pre-empting it, and enabling the next eligible process.

When the time comes for a process to run, an additional algorithm determines the priority of one process over another. The system clock sends the kernel an interrupt once per second, and it is at this time that the kernel recalculates the priorities of each process. Leaving aside the user-level priority weighting determined by nice, the kernel determines priority based on several parameters, of which these are significant:

  • How much CPU time the process has previously used
  • Whether it is waking up from an I/O wait or not
  • Whether it is changing from kernel mode to user mode or not
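The user-level weighting is the one knob directly in our hands. A small, purely illustrative sketch: a child doing low-urgency housekeeping volunteers to lose the priority race by raising its own nice value:

    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: the child raises its nice value so the scheduler
     * favours everything else; the parent keeps normal priority. */
    int main(void)
    {
        if (fork() == 0) {
            nice(10);           /* child: be less eligible to run */
            /* ... long, unimportant background work ... */
            _exit(0);
        }
        /* parent carries on at normal priority */
        return 0;
    }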

 

Swapping and Paging

Of course, the execution of a process is never that straightforward. Only a portion of the code is loaded into memory, meaning that it can only run until another page needs to be fetched from the disk. When this occurs, the process generates a page fault, which causes the pager to fetch the appropriate page. A similar situation occurs when a branch instruction is executed, which takes the execution point to a page other than those stored in memory. The paging mechanism is fairly intelligent, and contains algorithms similar to those found in CPU machine instruction pipeline controllers. It tries to anticipate branch instructions, and pre-fetch the anticipated page, more or less successfully, depending on the code structure.

Although there are certain similarities, paging, as described above, which is a natural result of process execution, should not be confused with swapping.

If the number of processes grows, to the extent that all available memory becomes used up, the addition of another process will trigger the swapper, and cause it to take a complete process out of memory, and place it in the swap area of the disk.

This is, computationally, an extremely expensive operation, as the entire process, together with its context, has to be written to disk then, when it is permitted once again to run, another process must be swapped out, to make space for it to be reloaded into memory.

So, What Does All of This Have to Do with Performance?

It may be seen from the above that processes designed for performance-critical applications should avoid doing physical I/O until it is absolutely necessary, in order to maximise the amount of contiguous CPU time. If at all possible, all of the I/O operations should be saved up until all other processing has completed, and then performed in one operation, preferably by a separate thread.
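A sketch of this "save up the I/O" pattern, with a hypothetical results[] array and output file name: the main thread does a pure-CPU pass, then a single writer thread flushes everything in one operation:

    #include <pthread.h>
    #include <stdio.h>

    /* Sketch: results accumulate in memory during the CPU-bound
     * phase; a separate thread then writes them in one pass.
     * results[], nresults and the file name are illustrative. */
    #define MAX_RESULTS 100000

    static double results[MAX_RESULTS];
    static int nresults;

    static void *flush_results(void *arg)
    {
        FILE *fp = fopen("/tmp/results.dat", "w");
        int i;

        (void) arg;
        if (fp != NULL) {
            for (i = 0; i < nresults; i++)
                fprintf(fp, "%f\n", results[i]);
            fclose(fp);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        int i;

        for (i = 0; i < MAX_RESULTS; i++)   /* pure CPU phase: no I/O */
            results[i] = (double) i * 1.0001;
        nresults = MAX_RESULTS;

        pthread_create(&tid, NULL, flush_results, NULL);  /* one I/O pass */
        pthread_join(tid, NULL);
        return 0;
    }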

As far as threads are concerned, let us consider what happens, when we launch a number of threads, to perform some tasks in parallel.

First, the threads are each allocated a separate stack, but they are not allocated a separate process table slot. This means that, although there are several tasks executing in parallel, this only occurs during the active time of that slot. When the kernel pre-empts the process, execution stops. Multi-threading will not give your application any more system resources.

Further, if we consider a situation where we have 100 processes running on a machine, and one of them is ours, then we would expect to use 1% of the CPU time. However, if 25 of them are ours, we would be eligible to use 25% of the CPU time.

Thus, if an application can split itself into several processes running concurrently then, quite apart from the obvious advantages of parallelism, we will capture more of the machine's resources, just because each child process occupies a separate process table slot.

This also helps when the kernel assigns priorities to processes. Even though we may be penalised for using a lot of CPU time, the priority of each process is rated against that of other processes. If many of these belong to one application, then even though the kernel may decide to give one process priority over another, the application, as a whole, will still get more CPU time.

Additionally, if we are running on a multi-processor machine, then we can almost guarantee to be given a separate CPU for each child process. The kernel may juggle these processes over different CPUs, as a part of its load-balancing operations, but each child will still have its own processor.

The incorporation of the above techniques into our software architecture forms the cornerstone of multi-dimensional programming.

In Summary

  • Each child process gets a CPU different to that used by the parent.
  • The more processes contribute to the running of your application, the more CPU time it will get.
  • Multi-threading creates multiple execution paths within one process table slot. It may permit parallel execution paths, but it will not get the application more CPU time, or a new CPU. 

Therefore:

  • Find parallelism within your application. This will make your software run more efficiently, anyway.
  • Employ multi-threading where it is not possible to fork a separate process, or where you need to refer to global information, as in the parent process.
  • Having decided how the children will communicate the data back to the parent, launch a separate child process for every possible parallel function, to gain the maximum CPU time (see the sketch below).
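As a sketch of these rules (the two extract functions are hypothetical stand-ins for real application code), the following launches one child per parallel function, each in its own process table slot, then reaps them all:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Sketch: one child per parallel function; on a multi-processor
     * machine each child lands on its own CPU. The extract functions
     * are illustrative placeholders. */
    static void extract_accounts(void)     { /* ... real work ... */ }
    static void extract_transactions(void) { /* ... real work ... */ }

    typedef void (*taskfn)(void);

    int main(void)
    {
        taskfn tasks[] = { extract_accounts, extract_transactions };
        int ntasks = sizeof(tasks) / sizeof(tasks[0]);
        int i;

        for (i = 0; i < ntasks; i++) {
            switch (fork()) {
                case -1:
                    perror("fork");
                    return 1;
                case 0:
                    tasks[i]();     /* child does one unit of work */
                    _exit(0);
                default:
                    break;          /* parent launches the next child */
            }
        }
        while (wait(NULL) > 0)      /* reap all the children */
            ;
        return 0;
    }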

 

 

System calls

  1. fork()

    Under operating systems which predate Unix, starting a process from within another process was traditionally performed as a single operation. One command magically placed the executable into memory, and handed over control and ownership to the operating system, which made the new process run.

    Unix doesn't do that.

    Each process has a hierarchical relationship with its parent, which is the process which brought it to life, and with its child or children, which, in turn, are processes which it, itself, started. All such related processes are part of a process group.

    The basic mechanism which initiates the birth of a new process is fork().

    The fork() system call makes a running copy of the process which called it. All memory addresses are re-mapped, and all open file descriptors remain open. Also, file pointers maintain the same file position in the child as they do in the parent.

    Consider the following code fragment:

     

    pid_t pid;

    switch ((pid = fork())) {
        case -1:                /* fork() failed */
            printf("fork failed\n");
            break;
        case 0:                 /* the child sees zero */
            printf("Child process running\n");
            some_child_function();
            break;
        default:                /* the parent sees the child's pid */
            printf("Parent process executes this code\n");
            break;
    }

    At the time that the fork() system call is made, there is only one process in existence: that of the expectant parent. The local variable pid is on the stack, probably uninitialised.

    The system call executes and, now, there are two identical running processes, both executing the same code. The parent and the new child process both simultaneously check the variable pid on the stack. The child finds that the value is zero and knows, from this, that it is the child. It then executes some_child_function(), and continues on a separate execution path.

    The parent does not see zero, so it executes the default part of the switch() statement. It sees the process ID of the new child process, and drops through the bottom of the switch(). Note that, if we did not call a different function in the case 0: section of the switch, both parent and child would continue to execute the same code, since the child would also drop through the bottom of the switch().

    Programmers who know little about Unix will have a piece of folklore rattling around in their heads which says a fork() is expensive: you have to copy an entire process in memory, which is slow if the process is large.

    This is true, as far as it goes. There is a memory-to-memory copy of that part of the parent which is resident in memory, so you may have to wait a few milliseconds.

    However, we are not concerned with trivial processes whose total run time is affected by those few milliseconds. We are dealing exclusively with processes whose run times are measured in hours, so we consider a one-time penalty of a few milliseconds to be insignificant.

    When a parent forks a child process, on a multi-processor machine, the Unix kernel places the child process onto its own, separate CPU. If the parent forks twelve children, on a twelve CPU machine, each child will run on one of the twelve CPUs.

    In an attempt to perform load-balancing, the kernel will shuffle the processes around the CPUs but, basically, they will remain on separate processors.

    The fork() system call is one of the most useful tools for the full utilisation of a multi-processor machine's resources, and it should be used whenever one or more functions are called which can proceed with their tasks in parallel. Not only is the total run time reduced to that of the longest-running function, but each function will execute on its own CPU.

  2. vfork()

    There is a BSD variant of fork(), which was designed to reduce the memory usage overhead associated with copying a possibly huge process in memory.

    The semantics of vfork() are exactly the same as those of fork(), but the operation is slightly different. vfork() only copies the page of the calling process which is currently in memory but, due to a bug (or feature), permits the two processes to share the same stack. As a result, if the child makes any changes to variables local to the function which called vfork(), the changes will be visible to the parent.

    Knowledge of this fact has enabled experienced programmers to make use of the advantages of vfork() while avoiding the pitfalls. However, far more subtle bugs also exist, and most Unix vendors recommend that vfork() only be used if it is immediately followed by an exec(), as in the sketch below.
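    A minimal sketch of that recommendation, reusing the illustrative lpr example from elsewhere in this article: vfork(), then exec() immediately, touching nothing in between, so the shared address space can do no harm:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Sketch: vfork() followed immediately by exec(), per the
     * vendors' recommendation. */
    int main(void)
    {
        pid_t pid = vfork();

        if (pid == 0) {
            execl("/usr/ucb/lpr", "lpr", "/tmp/file", (char *) NULL);
            _exit(127);         /* only reached if execl() fails */
        } else if (pid == -1) {
            perror("vfork");
        }
        return 0;
    }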

  3. exec()

The original thinking behind fork() was that its primary use would be to create new processes, not just copies of the parent process. The exec() system call achieves this by overlaying the memory image of the calling process with that of the new executable.

There is a very good reason for separating fork() and exec(), rather than having the equivalent of VMS's spawn() function, which combines the two. The reason is that it is sometimes necessary, or convenient, to perform some operations between fork() and exec(). For example, it may be necessary to run the child process as a different user, like root, or to change directory, or both.

There is, in fact, no such call as exec(), but there are two main variants, execl() and execv().

The semantics of execl() and execv() are as follows:

execl(char *path, char *arg0, char *arg1, ..., char *argn, (char *) NULL)

execv(char *path, char **argv)

It may be seen that the principal difference between the two is that, whereas the execl() family takes a path followed by the separate arguments in a NULL-terminated, comma-separated list, the execv() family takes a path and a vector, similar to the argv[] vector passed to a main() function.

The first variation of each call adds an environment vector, envp:

execle(char *path, char *arg0, ..., char *argn, (char *) NULL, char **envp)

execve(char *path, char **argv, char **envp)

The second variation replaces the path argument with a file argument. If the latter contains a slash, it is used as a path; otherwise, the PATH environment variable of the calling process is used to find the file.

execlp(char *file, char *arg0, ..., char *argn, (char *) NULL)

execvp(char *file, char **argv)
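The following sketch contrasts the two families on the same illustrative lpr invocation; the vector form comes into its own when the argument list is only assembled at run time:

    #include <sys/types.h>
    #include <unistd.h>

    /* Sketch: the commented execl() line and the execv() call are
     * equivalent invocations of lpr. */
    int main(void)
    {
        char *args[] = { "lpr", "/tmp/file", (char *) NULL };
        pid_t pid = fork();

        if (pid == 0) {
            /* list form:  execl("/usr/ucb/lpr", "lpr", "/tmp/file",
             *                   (char *) NULL);                      */
            execv("/usr/ucb/lpr", args);   /* vector form */
            _exit(127);                    /* only reached if exec fails */
        }
        return 0;
    }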

We can now combine fork() and exec() to execute lpr from the parent process, in order to print a file:

    pid_t pid;

    switch ((pid = fork())) {
        case -1:
            printf("fork failed\n");
            break;
        case 0:
            printf("Child process running\n");
            execl("/usr/ucb/lpr", "lpr", "/tmp/file", (char *) NULL);
            _exit(127);         /* only reached if execl() fails */
        default:
            printf("Parent process has executed lpr to print a file\n");
            break;
    }

The above code has only one problem. If the parent process quits, the child process will become an orphan, adopted by the init process. When lpr has run to completion, it will become a zombie process, and waste a slot in the process table. The same happens if the child exits prematurely, due to some fault.

There are two solutions to this problem:

  1. We execute one of the wait() family of system calls.

    A waited-for child does not become a zombie, but the parent must suspend processing until the child terminates, which may or may not be a disadvantage. There are options which allow processing to continue during the wait, but the parent needs to poll waitpid(), which makes our second solution, described below, a much better option.

    If we are waiting for a specific process, the most convenient call is to waitpid(). The synopsis of this call is:

    pid_t waitpid(pid_t pid, int *status, int options);

    The call to waitpid() returns the process ID of the child for which we are waiting, which is passed in as the first argument, pid.

    The second argument, status, is the returned child process exit status, and options is the bitwise-OR of the following flags:

    WNOHANG : prevents waitpid() from causing the parent process to hang, if there is no immediate return.

    WNOWAIT : keeps the process, whose status is returned, in a waitable state, so that it may be waited for again.

    The options flags are of no use to us here, so we set them to zero. The status word, however, provides useful information on how our child terminated, and can be decoded with the macros described in the man page for wstat; a decoding sketch follows the example below.

    pid_t pid;
    int status;

    switch ((pid = fork())) {
        case -1:
            printf("fork failed\n");
            break;
        case 0:
            printf("Child process running\n");
            execl("/usr/ucb/lpr", "lpr", "/tmp/file", (char *) NULL);
            _exit(127);         /* only reached if execl() fails */
        default:
            printf("Parent process has executed lpr to print a file\n");
            if (waitpid(pid, &status, 0) == pid) {
                printf("lpr has now finished\n");
            }
            break;
    }
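    For completeness, a small sketch of decoding the status word with those macros; the reporting function is our own illustration, not part of the application:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Sketch: given a status word filled in by waitpid(), report
     * how the child ended. */
    void report_child(pid_t pid, int status)
    {
        if (WIFEXITED(status))
            printf("child %ld exited with code %d\n",
                   (long) pid, WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            printf("child %ld killed by signal %d\n",
                   (long) pid, WTERMSIG(status));
    }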

     

  2. If we don't wish to poll waitpid() repeatedly, but need to do other processing while the child process goes about its business, then we need to, effectively, disown the child process.

As soon as the child has successfully forked, we must disassociate it from the process group of the parent. This is accomplished by executing the system call setpgrp(), or setsid(), both of which have the same functionality. These calls create a new process session group, make the child process the session leader, and set the process group ID to the process ID of the child.

The complete code is as follows:

    pid_t pid;

    switch ((pid = fork())) {
        case -1:
            printf("fork failed\n");
            break;
        case 0:
            if (setpgrp() == -1) {
                printf("Can't set pgrp\n");
            }
            printf("Independent child process running\n");
            execl("/usr/ucb/lpr", "lpr", "/tmp/file", (char *) NULL);
            _exit(127);         /* only reached if execl() fails */
        default:
            printf("Parent process has executed lpr to print a file\n");
            break;
    }

 

 

  4. open(), close(), dup(), read(), write()

These system calls are primarily concerned with files but, since Unix treats almost everything as a file, most of them can be used on any byte-orientated device, including sockets and pipes.

  • int open(char *file, int how, int mode);

open() returns a file descriptor to a file, which it opens for reading, writing or both. The file argument is the file name, with or without a path, while how is the bitwise-OR of some of the following flags defined in fcntl.h:

O_RDONLY Read only

O_WRONLY Write only

O_RDWR Read/write

O_TRUNC Truncate on opening

O_CREAT Create if non-existent

The mode argument is optional, and defines the permissions on the file, using the same flags as chmod.

  • int close(int fd);

Closes the file, which was originally opened with the file descriptor fd.

  • int dup(int fd);

Returns a file descriptor which is identical to that passed in as an argument, but with a different number. This call seems fairly useless at first glance but, in fact, it permits some powerful operations, like bi-directional pipes, where we need a pipe descriptor to become a standard input or output. Also, client-server systems need to listen for incoming connections on a fixed socket descriptor, while handling existing connections on different descriptors (see the sketch below).
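A brief sketch of that classic idiom, using the closely-related dup2(), which chooses the new descriptor number explicitly: the write end of a pipe becomes the child's standard output, so anything the child writes flows up the pipe to the parent:

    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: redirect the child's stdout into a pipe via dup2(),
     * then read the child's output in the parent. */
    int main(void)
    {
        int pfd[2];
        char buf[128];
        int n;

        pipe(pfd);

        if (fork() == 0) {
            close(pfd[0]);            /* child: read end not needed */
            dup2(pfd[1], 1);          /* stdout now writes to the pipe */
            close(pfd[1]);
            write(1, "hello, parent\n", 14);
            _exit(0);
        }
        close(pfd[1]);                /* parent: write end not needed */
        if ((n = read(pfd[0], buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            printf("child said: %s", buf);
        }
        close(pfd[0]);
        return 0;
    }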

 





