Categoria: Linux


This is an excelent (!!!) article describing in general terms how the process of linking (static and dynamic) and loading elf programs on linux works. This is a very valuable article.

The original is found here: http://www.linuxjournal.com/article/6463?page=0,0
Discussing how compilers, links and loaders work and the benefits of shared libraries.
Linking is the process of combining various pieces of code and data together to form a single executable that can be loaded in memory. Linking can be done at compile time, at load time (by loaders) and also at run time (by application programs). The process of linking dates back to late 1940s, when it was done manually. Now, we have linkers that support complex features, such as dynamically linked shared libraries. This article is a succinct discussion of all aspects of linking, ranging from relocation and symbol resolution to supporting position-independent shared libraries. To keep things simple and understandable, I target all my discussions to ELF (executable and linking format) executables on the x86 architecture (Linux) and use the GNU compiler (GCC) and linker (ld). However, the basic concepts of linking remain the same, regardless of the operating system, processor architecture or object file format being used.

Compiler, Linker and Loader in Action: the Basics

Consider two program files, a.c and b.c. As we invoke the GCC on a.c b.c at the shell prompt, the following actions take place:

gcc a.c b.c
  • Run preprocessor on a.c and store the result in intermediate preprocessed file.
    cpp other-command-line options a.c /tmp/a.i
  • Run compiler proper on a.i and generate the assembler code in a.s
    cc1 other-command-line options /tmp/a.i  -o /tmp/a.s
  • Run assembler on a.s and generate the object file a.o
    as other-command-line options /tmp/a.s  -o /tmp/a.o

cpp, cc1 and as are the GNU’s preprocessor, compiler proper and assembler, respectively. They are a part of the standard GCC distribution.

Repeat the above steps for file b.c. Now we have another object file, b.o. The linker’s job is to take these input object files (a.o and b.o) and generate the final executable:

   ld other-command-line-options /tmp/a.o /tmp/b.o -o a.out

The final executable (a.out) then is ready to be loaded. To run the executable, we type its name at the shell prompt:

./a.out

The shell invokes the loader function, which copies the code and data in the executable file a.out into memory, and then transfers control to the beginning of the program. The loader is a program called execve, which loads the code and data of the executable object file into memory and then runs the program by jumping to the first instruction.

a.out was first coined as the Assembler OUTput in a.out object files. Since then, object formats have changed variedly, but the name continues to be used.

Linkers vs. Loaders

Linkers and loaders perform various related but conceptually different tasks:

  • Program Loading. This refers to copying a program image from hard disk to the main memory in order to put the program in a ready-to-run state. In some cases, program loading also might involve allocating storage space or mapping virtual addresses to disk pages.
  • Relocation. Compilers and assemblers generate the object code for each input module with a starting address of zero. Relocation is the process of assigning load addresses to different parts of the program by merging all sections of the same type into one section. The code and data section also are adjusted so they point to the correct runtime addresses.
  • Symbol Resolution. A program is made up of multiple subprograms; reference of one subprogram to another is made through symbols. A linker’s job is to resolve the reference by noting the symbol’s location and patching the caller’s object code.

So a considerable overlap exists between the functions of linkers and loaders. One way to think of them is: the loader does the program loading; the linker does the symbol resolution; and either of them can do the relocation.

Object Files

Object files comes in three forms:

  • Relocatable object file, which contains binary code and data in a form that can be combined with other relocatable object files at compile time to create an executable object file.
  • Executable object file, which contains binary code and data in a form that can be directly loaded into memory and executed.
  • Shared object file, which is a special type of relocatable object file that can be loaded into memory and linked dynamically, either at load time or at run time.

Compilers and assemblers generate relocatable object files (also shared object files). Linkers combine these object files together to generate executable object files.

Object files vary from system to system. The first UNIX system used the a.out format. Early versions of System V used the COFF (common object file format). Windows NT uses a variant of COFF called PE (portable executable) format; IBM uses its own IBM 360 format. Modern UNIX systems, such as Linux and Solaris use the UNIX ELF (executable and linking format). This article concentrates mainly on ELF.

ELF Header
.text
.rodata
.data
.bss
.symtab
.rel.text
.rel.data
.debug
.line
.strtab

The above figure shows the format of a typical ELF relocatable object file. The ELF header starts with a 4-byte magic string, \177ELF. The various sections in the ELF relocatable object file are:

  • .text, the machine code of the compiled program.
  • .rodata, read-only data, such as the format strings in printf statements.
  • .data, initialized global variables.
  • .bss, uninitialized global variables. BSS stands for block storage start, and this section actually occupies no space in the object file; it is merely a placer holder.
  • .symtab, a symbol table with information about functions and global variables defined and referenced in the program. This table does not contain any entries for local variables; those are maintained on the stack.
  • .rel.text, a list of locations in the .text section that need to be modified when the linker combines this object file with other object files.
  • .rel.data, relocation information for global variables referenced but not defined in the current module.
  • .debug, a debugging symbol table with entries for local and global variables. This section is present only if the compiler is invoked with a -g option.
  • .line, a mapping between line numbers in the original C source program and machine code instructions in the .text section. This information is required by debugger programs.
  • .strtab, a string table for the symbol tables in the .symtab and .debug sections.


Symbols and Symbol Resolution

Every relocatable object file has a symbol table and associated symbols. In the context of a linker, the following kinds of symbols are present:

  • Global symbols defined by the module and referenced by other modules. All non-static functions and global variables fall in this category.
  • Global symbols referenced by the input module but defined elsewhere. All functions and variables with extern declaration fall in this category.
  • Local symbols defined and referenced exclusively by the input module. All static functions and static variables fall here.

The linker resolves symbol references by associating each reference with exactly one symbol definition from the symbol tables of its input relocatable object files. Resolution of local symbols to a module is straightforward, as a module cannot have multiple definitions of local symbols. Resolving references to global symbols is trickier, however. At compile time, the compiler exports each global symbol as either strong or weak. Functions and initialized global variables get strong weight, while global uninitialized variables are weak. Now, the linker resolves the symbols using the following rules:

  1. Multiple strong symbols are not allowed.
  2. Given a single strong symbol and multiple weak symbols, choose the strong symbol.
  3. Given multiple weak symbols, choose any of the weak symbols.

For example, linking the following two programs produces linktime errors:

/* foo.c */               /* bar.c */
int foo () {               int foo () {
   return 0;                  return 1;
}                          }
                           int main () {
                              foo ();
                           }

The linker will generate an error message because foo (strong symbol as its global function) is defined twice.

gcc foo.c bar.c
/tmp/ccM1DKre.o: In function 'foo':
/tmp/ccM1DKre.o(.text+0x0): multiple definition of 'foo'
/tmp/ccIhvEMn.o(.text+0x0): first defined here
collect2: ld returned 1 exit status

Collect2 is a wrapper over linker ld that is called by GCC.

Linking with Static Libraries

A static library is a collection of concatenated object files of similar type. These libraries are stored on disk in an archive. An archive also contains some directory information that makes it faster to search for something. Each ELF archive starts with the magic eight character string !<arch>\n, where \n is a newline.

Static libraries are passed as arguments to compiler tools (linker), which copy only the object modules referenced by the program. On UNIX systems, libc.a contains all the C library functions, including printf and fopen, that are used by most of the programs.

gcc foo.o bar.o /usr/lib/libc.a /usr/lib/libm.a

libm.a is the standard math library on UNIX systems that contains the object modules for math functions such as like sqrt, sin, cos and so on.

During the process of symbol resolution using static libraries, linker scans the relocatable object files and archives from left to right as input on the command line. During this scan, linker maintains a set of O, relocatable object files that go into the executable; a set U, unresolved symbols; and a set of D, symbols defined in previous input modules. Initially, all three sets are empty.

  • For each input argument on the command line, linker determines if input is an object file or an archive. If input is a relocatable object file, linker adds it to set O, updates U and D and proceeds to next input file.
  • If input is an archive, it scans through the list of member modules that constitute the archive to match any unresolved symbols present in U. If some archive member defines any unresolved symbol that archive member is added to the list O, and U and D are updated per symbols found in the archive member. This process is iterated for all member object files.
  • After all the input arguments are processed through the above two steps, if U is found to be not empty, linker prints an error report and terminates. Otherwise, it merges and relocates the object files in O to build the output executable file.

This also explains why static libraries are placed at the end of the linker command. Special care must be taken in cases of cyclic dependencies between libraries. Input libraries must be ordered so each symbol is referenced by a member of an archive and at least one definition of a symbol is followed by a reference to it on the command line. Also, if an unresolved symbol is defined in more than one static library modules, the definition is picked from the first library found in the command line.


Relocation

Once the linker has resolved all the symbols, each symbol reference has exactly one definition. At this point, linker starts the process of relocation, which involves the following two steps:

  • Relocating sections and symbol definitions. Linker merges all the sections of the same type into a new single section. For example, linker merges all the .data sections of all the input relocatable object files into a single .data section for the final executable. A similar process is carried out for the .code section. The linker then assigns runtime memory addresses to new aggregate sections, to each section defined by the input module and also to each symbol. After the completion of this step, every instruction and global variable in the program has a unique loadtime address.
  • Relocating symbol reference within sections. In this step, linker modifies every symbol reference in the code and data sections so they point to the correct loadtime addresses.

Whenever assembler encounters an unresolved symbol, it generates a relocation entry for that object and places it in the .relo.text/.relo.data sections. A relocation entry contains information about how to resolve the reference. A typical ELF relocation entry contains the following members:

  • Offset, a section offset of the reference that needs to be relocated. For a relocatable file, this value is the byte offset from the beginning of the section to the storage unit affected by relocation.
  • Symbol, a symbol the modified reference should point to. It is the symbol table index with respect to which the relocation must be made.
  • Type, the relocation type, normally R_386_PC32, that signifies PC-relative addressing. R_386_32 signifies absolute addressing.

The linker iterates over all the relocation entries present in the relocatable object modules and relocates the unresolved symbols depending on the type. For R_386_PC32, the relocating address is calculated as S + A – P; for R_386_32 type, the address is calculated as S + A. In these calculations, S denotes the value of the symbol from the relocation entry, P denotes the section offset or address of the storage unit being relocated (computed using the value of offset from relocation entry) and A is the address needed to compute the value of the relocatable field.

Dynamic Linking: Shared Libraries

Static libraries above have some significant disadvantages; for example, consider standard functions such as printf and scanf. They are used almost by every application. Now, if a system is running 50-100 processes, each process has its own copy of executable code for printf and scanf. This takes up significant space in the memory. Shared libraries, on the other hand, address the disadvantages of static libraries. A shared library is an object module that can be loaded at run time at an arbitrary memory address, and it can be linked to by a program in memory. Shared libraries often are called as shared objects. On most UNIX systems they are denoted with a .so suffix; HP-UX uses a .sl suffix and Microsoft refer to them as DLLs (dynamic link libraries).

To build a shared object, the compiler driver is invoked with a special option:

gcc -shared -fPIC -o libfoo.so a.o b.o

The above command tells the compiler driver to generate a shared library, libfoo.so, comprised of the object modules a.o and b.o. The -fPIC option tells the compiler to generate position independent code (PIC).

Now, suppose the main object module is bar.o, which has dependencies on a.o and b.o. In this case, the linker is invoked with:

gcc bar.o ./libfoo.so

This command creates an executable file, a.out, in a form that can be linked to libfoo.so at load time. Here a.out does not contain the object modules a.o and b.o, which would have been included had we created a static library instead of a shared library. The executable simply contains some relocation and symbol table information that allow references to code and data in libfoo.so to be resolved at run time. Thus, a.out here is a partially executable file that still has its dependency in libfoo.so. The executable also contains a .interp section that contains the name of the dynamic linker, which itself is a shared object on Linux systems (ld-linux.so). So, when the executable is loaded into memory, the loader passes control to the dynamic linker. The dynamic linker contains some start-up code that maps the shared libraries to the program’s address space. It then does the following:

  • relocates the text and data of libfoo.so into memory segment; and
  • relocates any references in a.out to symbols defined by libfoo.so.

Finally, the dynamic linker passes control to the application. From this point on, location of shared object is fixed in the memory.

Loading Shared Libraries from Applications

Shared libraries can be loaded from applications even in the middle of their executions. An application can request a dynamic linker to load and link shared libraries, even without linking those shared libraries to the executable. Linux, Solaris and other systems provides a series of function calls that can be used to dynamically load a shared object. Linux provides system calls, such as dlopen, dlsym and dlclose, that can be used to load a shared object, to look up a symbol in that shared object and to close the shared object, respectively. On Windows, LoadLibrary and GetProcAddress functions replace dlopen and dlsym, respectively.

Tools for Manipulating Object Files

Here’s a list of Linux tools that can be used to explore object/executable files.

  • ar: creates static libraries.
  • objdump: this is the most important binary tool; it can be used to display all the information in an object binary file.
  • strings: list all the printable strings in a binary file.
  • nm: lists the symbols defined in the symbol table of an object file.
  • ldd: lists the shared libraries on which the object binary is dependent.
  • strip: deletes the symbol table information.
Suggested Reading

Linkers and Loaders by John Levine

Linkers and Libraries Guide from Sun

Sandeep Grover works for QuickLogic Software (India) Pvt. Ltd.

This text was found here: http://ssvb.github.com/2011/08/23/yet-another-oprofile-tutorial.html

Recently it came as a surprise to me that many people don’t know how to use oprofile efficiently when working on performance optimizations. I’m not going to duplicate the oprofile manual here in details, but at least will try to explain some basic usage.

A bit of theory

Oprofile does its magic by using statistical sampling. The processor gets interrupted at regular intervals (the interrupts happen after a certain amount of time has elapsed, or some hardware performance counter accumulated a certain amount of events) and oprofile driver identifies which code had control at that moment. The part of code which was ‘lucky’ to be interrupted by oprofile, gets an oprofile sample attributed to it. The parts of code which take a lot of execution time are naturally more likely to accumulate many oprofile samples. In fact, the amount of collected oprofile samples for some function tends to be directly proportional to the execution time taken by this function. This all is somewhat similar to Monte Carlo method.

The collection of samples done by oprofile for each individual function is a Poisson process. Standard deviation forPoisson distribution is the square root of the number of samples. So the more samples got collected, the lower is the relative error. The following diagram shows the confidence intervals for normal distribution (because Poisson distribution is approximately normal for the large number of samples):

Standard_deviation_diagram.svg from wikipedia, created by Petter Strandmark and licensed under CC BY 2.5

Using the 3-sigma rule, we can be fairly confident that the actual time spent in each function (measured in oprofile samples) is within 3*sqrt(N) interval for each function. Where N is the number of samples reported by oprofile for that function.

A simple profiling and code optimization workflow

Let’s suppose that we have some small command line tool, which can be run to do something useful. And we want to optimize this tool to spend less time to do the same work. First of all, it makes sense to identify the parts of the program, which are the performance bottlenecks and can be optimized. This can be easily done using oprofile:

# opcontrol --deinit 
# opcontrol --separate=kernel 
# opcontrol --init 
# opcontrol --reset 
# opcontrol --start 
# ./test-program 
# opcontrol --stop 
# opreport -l ./test-program 

Going through all of the above steps will configure and start oprofile, then execute the program to be profiled (./test-program), then finally stop oprofile and show the profiling report (and this report contains exactly the information we want, and its interpretation is explained a bit in the next section). The opcontrol tool needs to be run as root or via sudo. It is also quite important to use –separate=kernel option. This option is described in details here, but basically it ensures that all the CPU activity happening in the kernel and in the shared libraries is also attributed to the test program and shown in the log.

After having oprofile report, it is only a matter of checking what parts of code are reported to take a lot of time, improving them and finally running oprofile again to verify the results. This process can be repeated multiple times. That’s quite simple. Though there are two main cases when it may be difficult to interpret oprofile logs:

  1. Oprofile reports that just one large function (possibly even ‘main’) is taking most of the time.
  2. Oprofile reports a million of tiny functions, each taking only a small fraction of time.

In the former case it is a good idea to split the large function into a few smaller ones. If the large function is already calling some other functions which aree inlined, then naturally disabiling inlining will provide a bit more interesting profiling report. Another alternative is to use source annotation. But be sure to read about all the caveats in theoprofile manual. In the latter case, generating a callgraph may provide some insights. Some nice callgraph pictures can be generated by Gprof2Dot from the data collected by oprofile.

A real practical example

I’m going to use one of my old performance patches as an example. Oprofile report for the ‘sbcenc’ program looked like this before optimization:

samples % image name symbol name 
26083 25.0856 sbcenc sbc_pack_frame 
21548 20.7240 sbcenc sbc_calc_scalefactors_j 
19910 19.1486 sbcenc sbc_analyze_4b_8s_neon 
14377 13.8272 sbcenc sbc_calculate_bits 
9990 9.6080 sbcenc sbc_enc_process_input_8s_be 
8667 8.3356 no-vmlinux /no-vmlinux 
2263 2.1765 sbcenc sbc_encode 
696 0.6694 libc-2.10.1.so memcpy 

Because of the use of –separate=kernel option, we can see ~8% of cpu time attributed to no-vmlinux image, which is the time spent in the kernel mostly doing input/output activity (reading the input file from disk). Also less than 1% is spent in memcpy function which belongs to libc-2.10.1.so shared library. Without –separate=kernel option, this information would not be present in the log.

Now our focus is on sbc_calc_scalefactors_j function, which got 21548 oprofile samples collected, and they represent ~20.7% of time spent in ‘sbcenc’ process. Please note again, that this percentage would not be a realistic estimate without also having kernel and libc information in the picture. In the case if the CPU consumption is dominated by the library functions or by the kernel, the statistics could be severely skewed.

After performing the optimizations, we get a new profiling report:

samples % image name symbol name 
26234 29.9625 sbcenc sbc_pack_frame 
20057 22.9076 sbcenc sbc_analyze_4b_8s_neon 
14306 16.3393 sbcenc sbc_calculate_bits 
9866 11.2682 sbcenc sbc_enc_process_input_8s_be 
8506 9.7149 no-vmlinux /no-vmlinux 
5219 5.9608 sbcenc sbc_calc_scalefactors_j_neon 
2280 2.6040 sbcenc sbc_encode 
661 0.7549 libc-2.10.1.so memcpy 

Which shows that now we have sbc_calc_scalefactors_j_neon function taking 5219 samples instead of 21548 samples for sbc_calc_scalefactors_j earlier. It is approximately ~4.1x speedup for this particular function. Samples are more important than percents in the log, because the absolute number of samples represents the actual time spent in the function, and the percents are relative to the whole process (as the whole program takes less time to execute after optimization, the percents may naturally drift).

For another example we can look at the sbc_pack_frame function statistics in both logs. The number of samples remained about the same: 26083 vs. 26234 (see the 3-sigma rule from the ‘A bit of theory’ section). But the percentage of the time relative to the whole program increased from ~25% to ~30% even though this function has not changed itself. That’s a nice side affect of optimizations: after eliminating the obvious bottlenecks, the other functions are becoming more attractive optimization targets too :)

The precision of measurements can be always increased by running the test program more than one time between ‘opcontrol –start’ and ‘opcontrol –stop’ invocations, because more samples will get accumulated and the relative error will become smaller.

Still the other methods of benchmarking the code may be more suitable for very tiny performance tweaks, such as just saving maybe a few CPU cycles. Some tricks for benchmarking small sequences of instructions are described in my older Discovering instructions scheduling secrets blog post.

ARM Cortex-A8 performance monitoring unit disaster

If you tried to follow the instructions described above, but got bizarre results, then the chances are quite high that you are using some hardware with ARM Cortex-A8 processor. The problem is that ARM Cortex-A8 has a broken performance monitoring unit (this is described as erratum #628216 in ARM Cortex-A8 errata list). Earlier revisions were badly broken. The last revisions are a bit better, but still not suitable for use with oprofile.

For collecting samples, oprofile relies on the interrupts generated by the performance monitoring unit. The interrupts are supposed to happen on overflows of the 32-bit hardware performance counters. But with the older ARM Cortex-A8 revisions (for example, used in beagleboard), the PMU state may be occasionally messed up on the counter overflow. With the newer ARM Cortex-A8 revisions (for example, used in beagleboard-xm), the counter may just overflow without triggering an interrupt. The outcome is disasterious in both cases. Skipped interrupt may be difficult to notice because it takes slightly more than 4 seconds to count from 0 to 0xFFFFFFFF on a 1GHz processor. So the performance monitoring unit recovers itself, but each skipped interrupt results in approximately 4 seconds dropped out from the profiling session. Longer profiling runs have a higher chance of triggering this hardware bug eventually. And considering that it is important to collect really a lot of samples for getting good precision, Cortex-A8 performance monitoring unit using cycle counter is a really bad option.

The solution for all these troubles is simple: use the timer interrupt, Luke :) Hardware performance counters are actually more like a red herring. Timer interrupt works perfectly fine for the simple profiling tasks, so there is no point trying to use the performance monitoring unit no matter what. Admittedly, I have wasted quite a lot of time myself trying to workaround this pesky issue.

In order to override Cortex-A8 performance monitoring unit with a simple timer driver, adding “oprofile.timer=1″ to the kernel command line can be used. Or using “timer=1″ module parameter if oprofile is built as a module.

Also when using the simple timer driver, it makes sense to tweak it a bit if we don’t want to collect samples at something like a pitiful default 128 Hz rate. The following hack can be applied to the linux kernel to solve this:

diff --git a/drivers/oprofile/timer_int.c b/drivers/oprofile/timer_int.c
index 3ef4462..56fb6c3 100644
--- a/drivers/oprofile/timer_int.c
+++ b/drivers/oprofile/timer_int.c
@@ -20,13 +20,15 @@
 
 #include "oprof.h"
 
+#define OPROFILE_TIMER_TICK_NSEC 244141 /* ~4096 Hz */
+
 static DEFINE_PER_CPU(struct hrtimer, oprofile_hrtimer);
 static int ctr_running;
 
 static enum hrtimer_restart oprofile_hrtimer_notify(struct hrtimer *hrtimer)
 {
    oprofile_add_sample(get_irq_regs(), 0);
-  hrtimer_forward_now(hrtimer, ns_to_ktime(TICK_NSEC));
+  hrtimer_forward_now(hrtimer, ns_to_ktime(OPROFILE_TIMER_TICK_NSEC));
    return HRTIMER_RESTART;
 }
@@ -40,7 +42,7 @@ static void __oprofile_hrtimer_start(void *unused)
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer->function = oprofile_hrtimer_notify;
 
-  hrtimer_start(hrtimer, ns_to_ktime(TICK_NSEC),
+  hrtimer_start(hrtimer, ns_to_ktime(OPROFILE_TIMER_TICK_NSEC),
              HRTIMER_MODE_REL_PINNED);
 }

Additional verification for the Poisson based stddev estimate (added on 2011-08-28)

Let’s take the following profiling session as an example:

Profiling through timer interrupt
samples  %        image name               symbol name
52105    40.0715  djpeg                    jpeg_idct_islow
41281    31.7473  djpeg                    ycc_rgb_convert
15126    11.6327  djpeg                    decode_mcu
15001    11.5366  djpeg                    h2v1_fancy_upsample
2029      1.5604  djpeg                    decompress_onepass
1470      1.1305  libc-2.12.2.so           memset
1118      0.8598  no-vmlinux               /no-vmlinux
967       0.7437  libc-2.12.2.so           _wordcopy_fwd_dest_aligned
333       0.2561  djpeg                    jpeg_fill_bit_buffer
69        0.0531  libc-2.12.2.so           fwrite
69        0.0531  libc-2.12.2.so           write

Poisson gives us a theoretical estimate for the standard deviation as the square root of the number of samples. But just to be sure, we can verify it by running the same profiling session 10 times and calculating sample standard deviation for the number of samples attributed to each function.

function time spent in the function, measured in oprofile samples mean sample
stddev
sqrt(mean)
jpeg_idct_islow 52105 52171 51968 52243 52389 52126 52347 52217 52078 52543 52218.7 169.2 228.5
decode_mcu 15126 15119 15315 15060 15108 15397 15227 15017 15175 15138 15168.2 115.8 123.2
decompress_onepass 2029 2042 2070 2012 2057 2127 2022 2074 2048 1992 2047.3 37.98 45.25
fill_bit_buffer 333 311 333 311 334 297 309 309 336 304 317.7 14.63 17.82

By comparing the last two columns in the table, we can see that the values there are reasonably close to each other. So assuming a stable test environment with no background activity from the other processes, etc., we can run just one profiling session and already have a good estimate for the precision of the measurement for each function. Still it is a good idea to repeat profiling at least one more time and check if the results are consistent between runs in order to rule out any possible interference from the external factors or the problems in the whole setup (see the ‘ARM Cortex-A8 performance monitoring unit disaster’ section). If the results are not consistent across runs, it makes sense identifying and eliminating the source of this noise.

Also the applicability of the Poisson based standard deviation estimate is limited to the functions which take a reasonably small percentage of time (as wikipedia article says: “The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. A classic example is the nuclear decay of atoms”). And if taking a corner case as an example, if oprofile log shows that all the samples belong only to a single function (‘main’), then the precision of this measurement would be very high and only depend on the timer resolution. The number of samples would be equal to the time taken by the process multiplied by the oprofile samples collection frequency. But on a positive side, sqrt(N) still provides a reliable pessimistic estimate, with the real standard deviation being lower than that.

This text was found here: http://brokenthorn.com/Resources/OSDevX86.html

Introduction

This chapter covers IA32 machine language programming. The information provided here is for information purposes only and is not needed for the development of a basic operating system or executive software. Understanding the instruction format for the IA32 (and IA64) instructions can help debugging improperly assembled instructions, v86 monitors that are required for supporting v8086 mode, emulating instructions (which is required for emulating certain FPU instructions or when developing assemblers, emulators, virtual machines, and some other types of software), and when developing certain system software like debuggers and compilers.

This chapter is also for testing a new editor being used to write the new chapters that should help improve formatting and resolve spelling errors. If this test is successful, all of the new and earlier chapters will be updated to reflect the new format. Please send any feedback if you encounter any errors.

Machine Language Overview

Machine language, also known as machine code, native code, and byte code, is the set of raw instructions and data that can be executed by a central processing unit (CPU). It allows a CPU to interpret a certain set of byte sequences as an “instruction” to perform a task. These tasks are very small, such as copying small amounts of data or arithmetic. The act of building a byte sequence that represents a CPU instruction is known as coding. The definition of coding has evolved as programming languages evolved. Originally the term referred to the actual coding of the byte sequence for an instruction; today it applies to many forms of programming in second, third, and fourth generation programming languages. Computer programs, also known as software, is the collection of machine code and data that performs a complicated task, such as word processing or playing Halo®. Machine language is often interpreted by popular media as a “series of 1′s and 0′s”. This is an accurate description—to an extent.

Digital Logic

Digital logic is a field of electronics that utilizes logic gates that allows the electronics to make decisions. Some examples of logic gates includeAND gates, OR gates, NOR gates, NAND gates, NOT gates and XOR gates. These gates reflect their binary operations: AND gates perform a binary AND, XOR performs a binary XOR, and so on. In order for these gates to be meaningful, a standard needed to be adopted in order to make sense of what is true and false. For example, AND gates only make sense if there are two inputs and one output. The two inputs are the two items to test for equality: if either one is false, the output is false else the output is true. The standard is to define a line with low electrical current as false and a line with higher electrical current as true. This is what connects the binary number system to digital logic electronics. In the binary number system, 0 is often denoted as false and 1 is often true. Machine language is often represented in binary because of its tight connection with the CPU instruction decoding mechanism and how its stored in RAM for instruction fetching.

Program loading

Programs are loaded into memory by the operating system, executive, or firmware. The IA32 and IA64 family of CPU’s can execute programs fromRead Only Memory (ROM) and Random Access Memory (RAM). This is made possible through a common system bus and physical address space (PAS) shared by firmware and program images. Because firmware and program images can both be executed directly by the CPU cores, they utilize the same machine language byte code. Machine language is different then microcode (see chapter 7) that firmware might use, however, the actual firmware is still machine language.

Assembly language

Assembly language is a second generation programming language. It allows the programmer to write a program in a well defined language that uses mnemonics to aid the programmers in developing the software. For example, MOV is a common mnemonic common in a lot of assembly languages for different architectures. MOV represents an instruction that copies data from a source to a destination. It is also an example of a data movement instruction.

The mnemonics gave a symbolic name to the instructions and instruction forms, allowing each assembly language instruction to be translated to a single (in some cases, several possible) machine language byte sequences. The program that translates assembly language instructions into machine language is known as an assembler. Assemblers are sometimes incorrectly called compilers.

Machine Instruction overview

machine instruction is a single byte sequence that performs a specific task for the CPU. A set of machine instructions has been previously defined as a machine language. Machine instructions are defined by a CPU manufacturer in an instruction set that documents all of the CPU instructions the manufacturer implemented. Instruction sets also typically include suggestive assembly language mnemonics for assembler developers to use. Instruction sets are documented in CPU specifications.

The CPU manufacture implements the machine instructions supported by a particular CPU and how the CPU interprets each instruction. This allows the CPU to “execute” machine instructions that the manufacture intended. Bugs in the CPU hardware or firmware, however, can cause the CPU to “execute” instructions that are not valid instructions. These are undocumented instructions. Some assemblers might define mnemonics to undocumented instructions that are well known. Some undocumented instructions that have had benefits have later become documented as real instructions. Some instructions might be left undocumented for manufacturer testing use only, such as the IA32 LOADALL instruction (this bug has since been fixed.) Some instructions might also have bad effects, such as halting the system or damaging the CPU (these are known as Halt and Catch Fire (HCF) instructions.)

There is an instruction set for every CPU architecture. Due to the evolving nature of the software industry, certain trends in instruction sets have become common. Understanding these trends can help with understanding IA32 machine language.

CISC and RISC

Instruction sets typically fall in two categories: Complex Instruction Set Computing (CISC) and Reduced Instruction Set Computing (RISC). Examples of RISC include PPC and ARM architectures. An example of CISC is the IA32 architecture. RISC architectures utilize a simplistic instruction set format over CISC. RISC architectures typically uses a standard encoding format for each instruction that allows each instruction to be of the same number of bytes. CISC architectures also follow a standard encoding format, but allows variable length instructions.

Operation code

An Operating Code (OPCode) is a single byte identifier that the CPU utilizes to determine the instruction type. For example, a MOV instruction has an opcode identifier that lets the CPU know information about the instruction, such as type (MOV). Many instruction sets use opcode to distinguish one instruction from another. Some instructions may have multi-byte opcodes and extended opcodes. This gives the instruction set more flexibility.

Addressing mode

An Addressing Mode defines a method for the CPU to be able to reference addresses. The addresses may be virtual or physical depending on the architecture. Instructions, such as data movement instructions, need to be able to tell the CPU how to reference addresses to obtain data. For example, many CPU’s support a direct addressing mode that allows an instruction to tell the CPU to reference (read or write) data at a specific address. For example, in IA32 assembly language:

mov eax, dword [0xa0000]

This instruction tells the CPU to use the direct addressing mode to read from address 0xa0000 in the current address space. Another common addressing mode is indirect addressing, which allows an instruction to tell the CPU to reference data using a pointer. For example, in IA32 assembly language:

mov eax, [ebp]

This tells the CPU to read a dword from the address stored in the EBP register into the EAX register. There are many more addressing modes that architectures may utilize.

IA32 and IA64 Instruction encoding

We are now ready to begin looking at IA32 and IA64 machine instruction encoding. In order to save space, we will use IA64 to mean IA32 and IA64 instruction sets. IA32 is a subset of IA64 and thus shares a large part of the IA32 instruction set. The IA64 instruction set implements a CISC encoding. This means that each instruction follows a specific encoding structure and is variable in length. IA32 and IA64 instructions can range from 1 byte to 12 bytes in size.

Register codes

The CPU identifies internal registers by a numerical value. Many registers share the same code; the CPU decides what register to use based on the instruction being used and the current operating mode (real, protected, or long modes). The operand size override prefix is also used when determining what register to use. We will cover this prefix later on.

Register codes are used in the instruction encoding to let the CPU know what registers the instruction operates on. The registers use the following codes:

REX.r = 0

Code
0
1
2
3
4
5
6
7
No REX
REX
REG16
REG32
REG64
MM
XMM
YMM
SSEG
AL
AL
AX
EAX
RAX
MM0
XMM0
YMM0
ES
CR0
DR0
CL
CL
CX
ECX
RCX
MM1
XMM1
YMM1
CS
CR1
DR1
DL
DL
DX
EDX
RDX
MM2
XMM2
YMM2
SS
CR2
DR2
BL
BL
BX
EBX
RBX
MM3
XMM3
YMM3
DS
CR3
DR3
AH
SPL
SP
ESP
RSP
MM4
XMM4
YMM4
ES
CR4
DR4
CH
BPL
BP
EBP
RBP
MM5
XMM5
YMM5
GS
CR5
DR5
DH
SIL
SI
ESI
RSI
MM6
XMM6
YMM6

CR6
DR6
BH
DIL
DI
EDI
RDI
MM7
XMM7
YMM7

CR7
DR7

For example, in the instruction mov bx, 0×5 we would store 3 as the register code for BX. The instruction mov ss, ax would require storing 2 as the register code for SS and 0 for the register code of AX. Different instructions utilize different types of registers so there will never be a conflict between needing to choose between multiple registers of the same code. For example, the instruction mov REG16, IMM16 will always use a 16 bit general purpose register as an operand. For another example, the instruction movups xmm, xmm/m128 will always take an XMM register only.

Long mode adds additional registers to this list. In order to support the above registers, long mode has a special flag set that allows instructions to select other registers using the same register codes. This is the REX.r field in the REX prefix byte that will be explained later on. When this bit is set, the register table will look like this:

REX.r=1

Code
0
1
2
3
4
5
6
7
No REX
REX
REG16
REG32
MM
XMM
YMM
SSEG
R8B
R8W
R8D
R8
MM0
XMM8
YMM8
ES
CR8
DR8
R9B
R9W
R9D
R9
MM1
XMM9
YMM9
CS
CR9
DR9
R10B
R10W
R10D
R10
MM2
XMM10
YMM10
SS
CR10
DR10
R11B
R11W
R11D
R11
MM3
XMM11
YMM11
DS
CR11
DR11
R12B
R12W
R12D
R12
MM4
XMM12
YMM12
FS
CR12
DR12
R13B
R13W
R13D
R13
MM5
XMM13
YMM13
GS
CR13
DR13
R14B
R14W
R14D
R14
MM6
XMM14
YMM14

CR14
DR14
R15B
R15W
R15D
R15
MM7
XMM15
YMM15

CR15
DR15

Instruction Encoding

An IA64 instruction follows a well defined structure that originated from the 8085 CPU. Each instruction follows the following format:

Prefix bytes (0-4) REX prefix (1) Operation (0-3) Mod R/M (1) SIB (1) Displacement (0-4) Immediate (0-4)

For compactness, the number in parentheses is the number of bytes of the component. A number of 0 indicates that the byte is optional. For example, the prefix bytes can be from 0 to 4 bytes in an instruction. This means that an instruction can have 0, 1, 2, 3, or 4 prefix bytes. The REX prefix is only valid in IA64 and long modes. The only required field is an operation code. All other fields are optional and depend on if the instruction requires them. For example, the INT (interrupt) instruction requires an operation code and immediate byte while a MOVinstruction might utilize all of the above fields.

To provide another example, lets take a look at the INT instruction in more detail. The INT instruction has a form:

INT imm8

where imm8 is an 8 bit immediate value and INT is the mnemonic for the operation code 0xCD. Knowing the format of the instruction encoding, we can encode a INT 5 instruction like this:

0xCD 0x05

The first byte, 0xCD, is the operation code and is in brown. Because prefix bytes are optional, and INT 5 does not require them, we do not need it. Mod R/M and SIB bytes are not needed either. Displacements are only used with memory addressing modes so the only other field that we need is the immediate field. The immediate field is a 0-4 byte field. We know to use it as a 1 byte field because of the instruction form INT imm8. The purpose of this example is to demonstrate that certain fields are optional and not needed; the fields that are needed by an instruction depends on the instruction. The order of these fields never changes. For example, note above how we chose to omit the fields that are not needed but kept the order of the fields: the operation code field comes before the immediate value field. In the next few sections, we will cover each of these fields in more detail.

Prefix fields

The prefix bytes allow an instruction to give more information to the CPU. For example, they allow the instruction to have the CPU lock the bus or to utilize a different segment register in a data movement instruction. Many of these prefixes have assembly language mnemonics. The prefix bytes are identified in 4 classes. An instruction can use at most 1 prefix byte from each of the 4 classes.

Class 1 prefix
0xF0 LOCK prefix
0xF2 REPNE, REPNZ prefix
0xF3 REP, REPZ, REPE prefix

 

Class 2 prefix
0x2E CS Segment override
0×36 SS Segment override
0x3E DS Segment override
0×26 ES Segment override
0×64 FS Segment override
0×65 GS Segment override
Class 3 prefix
0×66 Operand size override
Class 4 prefix
0×67 Address size override

We assume the reader knows IA32 assembly language so will omit describing these prefixes in detail. A machine instruction can only have 1 prefix byte from any of the 4 classes. Due to their being 4 classes, an instruction can have 0 to 4 prefix bytes. If an instruction attempts to use 2 or more prefix bytes from a single class, the CPU will throw an invalid instruction exception.

LOCK prefix

If the LOCK prefix is used on an instruction that does not support LOCK the CPU will trigger an invalid instruction exception. Some assemblers allow using LOCK on invalid instructions without giving a warning to the programmer. Due to this, we will present the list of valid instructions here.

The LOCK prefix can only be used on the following instructions: ADC, ADD, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XADD, XCHG, and XOR.

Operand size override

The operand size override allows the CPU to select between 16 bit and 32 bit operands. Assemblers typically allow the programmer to select a specific operand size indirectly using directives like bits16 or use32. The IA32 and IA64 instruction sets provide two operand sizes: legacy 16 bit and a native size that is 32 bit. The native size depends on the processors current operation mode.

Operation mode CS.d REX.w Native Operand override
Real mode 16 bit 16 bit
V8086 mode 16 bit 16 bit
Protected mode 0 16 bit 32 bit
Protected mode 1 32 bit 16 bit
Long mode 0 32 bit 16 bit

For an example, lets look at the ADD AX/EAX, IMM16/IMM32 instruction. This instruction has operation code 0×05. In protected mode code, the CPU will interpret this as an ADD EAX, IMM32 instruction by default. However we can override the default behavior and copy a 16 bit immediate value by using an operand override prefix. We do this in assembly language like this:

add eax, 5 ; MOV EAX, IMM32
add ax, 5  ; MOV AX, IMM16

The first instruction will assemble to:

0x05 0x05 0x00 0x00 0x00

The second instruction will assemble to:

0x66 0x050x05 0x00

Notice the only differences between these two instructions are the following: (1) The first instruction uses a 32 bit immediate value and the second instruction uses a 16 bit immediate value (these are in red), and (2) the second instruction uses the operand override prefix (this is inblack). This tells the CPU to use the 16 bit operand form. For completeness, the values in brown are the operation codes.

Address size override

The address size override prefix byte is very similar to the operand override prefix byte. Assemblers allow programmers to be able to select between address sizes by using keywords such as byte ptr and dword ptr. Due to the function being very similar to the operand override prefix, we will omit describing its purpose because it is the same but applies to address modes.

Operation mode CS.d REX.w Native Address override
Real mode 16 bit 16 bit
V8086 mode 16 bit 16 bit
Protected mode 0 16 bit 32 bit
Protected mode 1 32 bit 16 bit
Long mode 0 64 bit 32 bit
Long mode 1 64 bit 32 bit

For example, the instruction mov eax, [0xa000] when assembled in protected mode would not need an address size override. The assembler would treat 0xa000 as a 32 bit displacement. However, if we used mov ax, word [0xa000] the assembler would add an address size override prefix to the instruction to select the 16 bit address form.

REX prefix

The REX prefix enables certain 64 bit specific features. It has the following format:

| 0 | 1 | 0 | 0 | w | r | x | b |
+---+---+---+---+---+---+---+---+
7                               0
REX.w Operand size. 0: Default, 1: 64 bit
REX.r ModRM.reg extension
REX.x SIB.index extension
REX.b ModRM.rm extension

Prefix Order

The order of the prefix bytes when used in conjunction with other prefix bytes does not matter. For example, you can use 0xF3 0x2E in the machine code to select REP and CS override. You can also use 0x2E 0xF3 to do the same thing.

Operation code field

The operation code field can be 1-3 bytes in length. All operation codes are unique; they identify the instruction to use and its operands. For example, the operation code 0 identifies the ADD REG/MEM8, REG8 instruction. The operation code 1 identifies an ADD REG/MEM16/MEM32, REG16/REG32 instruction. The IA32 and IA64 CPU manuals outline each instruction and their respective operation code.

Primary Opcode

The primary opcode is a single byte that is required in all instructions. It is the base of the operation code fields used to identify the instruction. The primary opcode can also take on one of the following formats depending on instruction.

|   |   |   |   |   |  d | s | w |
+---+---+---+---+---+----+---+---+
7                                0

|   |   |   |   |     tttn       |
+---+---+---+---+---+----+---+---+
7                                0

|   |   |   |   |   | register ID|
+---+---+---+---+---+----+---+---+
7                                0

|   |   |   |   |   |    |   mf  |
+---+---+---+---+---+----+---+---+
7                                0
PO.w Operand size
PO.s Sign extend
PO.d Direction
PO.tttn Used on some FPU instructions
PO.mf Memory format

Secondary OPCode

When the primary opcode byte is 0xf0, another byte follows called the secondary opcode byte. The secondary opcode then identifies different instructions and has the same functionality as above. These should be treated as two byte opcodes.

OPCode extension

Certain families of instructions has the same opcode but differ only by a special field called an opcode extension. This is a 3 bit extension that is stored in the Mod R/M.reg field. The Mod R/M byte will be explained in more detail in the next section.

Multi-byte OPCodes

The primary opcode field can be 1-3 bytes in length. Although most instructions only use 1 byte of the primary opcode field, there are some that can utilize all 3 bytes. All of these instructions also uses a secondary opcode byte (0xf0).

Mod R/M field

The Mod R/M (Register/Memory) field is used by instructions that require memory or register operands. The Mod R/M field has the following format.

| mod   |    reg    |    rm     |
+---+---+---+---+---+---+---+---+
7                               0

The Mod R/M field is slightly different depending on if the CPU is running in real, protected or long modes.

Real mode                                                   Protected and Long modes
ModRM.mod
00: [Memory]
01: [Memory+DISP8]
10: [Memory+DISP16]
11: Register
ModRM.mod
00: [Memory]
01: [Memory+DISP8]
10: [Memory+DISP32]
11: Register
ModRM.reg
Register code
ModRM.reg
Register code
ModRM.rm
If ModRM.mod = 11: register code
000: [BX+SI]
001: [BX+DI]
010: [BP+SI]
011: [BP+DI]
100: [SI]
101: [DI]
110: [BP] or [DISP16] when ModRM.mod=0
111: [BX]
ModRM.rm
If ModRM.mod = 11: register code
REX.b=0             REX.b=1
000: [RAX]
001: [RCX]
010: [RDX]
011: [RBX]
100: [SIB]
101: [RBP][DISP32]
110: [RSI]
111: [RDI]
000: [R8]
001: [R9]
010: [R10]
011: [R11]
100: [SIB]
101: [DISP32]
110: [R14]
111: [R15]

The ModRM.mod field is combined with Mod.rm field to determine the addressing mode. For example, the instruction mov ax, [0xa000] would use (in real mode) ModRM.mod = 0 (Memory) and ModRM.rm = 6 (DISP16). ModRM.reg would contain the register code for AX. If we are to usemov ax, [bx+0xa000] instead, ModRM would be 2 (Memory+DISP16) and ModRM.rm would be 7. Assemblers would treat 0xa000 here as a DISP16 rather then a DISP8 due to 0xa000 being a word size displacement. We can select the DISP8 form by using an address size overrideprefix.

Looking at the above tables we can deduce that there is not many registers can be used for indirect addressing. For example, [BP] is not a valid addressing mode, but assemblers can assemble this fine in instructions like mov ax, [bp]. A common trick used by these assemblers is to set ModRM.mod = 1 (Memory+DISP8) and ModRM.rm = 6 (BP). In other words, the assemblers translate this into a [BP+DISP8] addressing mode, setting the displacement to 0. So mov ax, [bp] is assembled into mov ax, [bp+0].

Protected and long modes introduce another addressing mode, [SIB] which gives more capabilities. SIB addressing modes can be combined with ModRM.rm modes. For example, in protected mode, mov eax, [ebx+edi*2+0xa000] would be translated to ModRM.mod = 2 (Memory+DISP32) and ModRM.rm = 4 (SIB byte). The SIB byte tells us how to extract the edx+edi*2 in this instruction so will be explained in the next section.

Certain instructions utilize an extended opcode field. Extended opcodes are identifiers that are used with the primary opcode when identifying instructions. This is a 3 bit field and is stored in Mod R/M.reg field for these instructions. Instructions that use an extended opcode field might still use ModRM.rm and ModRM.mod to store a register operand or memory addressing mode.

SIB field

The Scale Index Base (SIB) byte follows a Mod R/M byte only if Mod R/M.rm = 4 and the CPU is in protected or long modes. The byte provides additional addressing modes to the IA32 and IA64 architectures. SIB addressing is combined with Mod R/M addressing in order to create a wide array of additional addressing modes.

| Scale |   Index   |   Base    |
+---+---+---+---+---+---+---+---+
7                               0
SIB.Scale
00: Factor 1
01: Factor 2
10: Factor 4
11: Factor 8
SIB.Index
Uses standard register codes
If VSIB, uses VR register codes
If REX.x = 1, uses 64 bit register codes
If REX.x = 1 and VSIB, uses VR register codes
SIB.Base
Uses standard register codes
If REX.b=1, uses 64 bit register codes

Despite the names of the register fields in the SIB byte, you can technically use any register code. For example, you can put an index register in SIB.Base.

Lets combine the SIB byte with Mod R/M again to demonstrate how they work together. Using our previous example, mov eax, [ebx+edi*2+0xa000]. We assume the CPU is running in protected mode for this example. EAX is the non-memory register, so that will be in Mod R/M.reg. We also have to set Mod R/M.mod = 2 to enable [Memory+DISP32] and Mod R/M.rm = 4 [SIB byte]. ebx is our base register, edi is our index register. The register code for EBX is 3 and the register code for EDI is 7. Using this, we can set SIB.index = 7 and SIB.base = 3. The scale factor, 2 goes into SIB.scale.

Putting this together, we have a Mod R/M byte of 10 000 100 binary and an SIB byte of 10 111 011 binary. Knowing we have a 32 bit displacement of 0xa000 and the operation code being 0×89, we can translate our example instruction into:

0x89 0x84 0xbb 0x00 0xa0 0x00 0x00

This would be the correct translation of mov eax, [ebx+edi*2+0xa000]. To ease readability, the operation code is in brown, Mod R/M and SIB bytes are in red, and the displacement field is in black. Note that this follows the exact format of an instruction: first is the primary opcode, next is the Mod R/M byte, next is the SIB byte, and the displacement byte follows.

If the displacement byte in the above instruction looks odd, please consider that the IA32 and IA64 architectures are little endian.

Displacement field

The displacement field is only valid if Mod R/M.mod is mode 1 (Memory+DISP8) or mode 2 (Memory+DISP16 or Memory+DISP32). The displacement can be a byte, word or dword value and is used in conjunction with the Mod R/M and SIB bytes to add displacements to addressing modes. The displacement field always follows the Mod R/M or SIB byte.

Immediate field

The immediate field is only valid if the instruction requires it as an operand. Instructions might require an 8, 16, or 32 bit immediate value. This field must then be present as the last field in the instruction. Instructions that allow both 16 and 32 bit values depend on if an operand override size prefix is present to determine the size of this field.


Instruction tables

An instruction look-up table is utilized by certain software to help facilitate the machine language translation of instructions. The tables reflect that of a generic instruction table that provides all of the operational codes and operand types for all of the instructions. We can use a resource, such as the IA32 manuals or an online reference, to construct the tables or to help with building machine instructions. The design of these tables varies considerably; it is important to read the documentation on how to read these look-up tables.

The tables would present an instruction in a form similar to the following.

| 0x10 | ADC | R/MEM8 | R8 |
+------+-----+--------+----+

This represents an ADC instruction whose operation code is 0×10. The R/MEM8 is the first operand, and R8 is the second operand. Operands can be represented in different ways depending on the tables’ design. The table may also present additional information such as effected flags the instruction sets, what form of the opcode byte the instruction might use (such as if it stores a register ID in the opcode field), supported processors, and so on. These tables can get really large in size but they all provide the same basic information typically presented in the above form.

R/MEM8 in the above means that the first operand is a “register” or “8 bit memory location”. The “R8” means the second operand is an 8 bit register. If an instruction has a memory operand, then a Mod R/M (and possibly an SIB byte) must follow. Also, if an instruction takes two register operands, a Mod R/M byte must follow. The Mod R/M byte will store the memory addressing mode information or both register codes in Mod R/M.rm and Mod R/M.reg. The CPU will know the register code is for an 8 bit register because of the opcode. The opcode tells the CPU not only what instruction to execute, but also what operands the instruction requires. An instruction may utilize different types of operands, because of this the same instruction can occupy multiple opcodes. For example, the above instruction form uses opcode 0×10. Other forms include but are not limited to the following.

| 0x11 | ADC | R/MEM16/MEM32 | R16/REG32     |
+------+-----+---------------+---------------+
| 0x12 | ADC | R8            | R/REG16/REG32 |
+------+-----+---------------+---------------+
| 0x13 | ADC | R16/REG32     | R/MEM16/MEM32 |
+------+-----+---------------+---------------+

If an instruction uses an operand like REG16/REG32, you need to deduce the operand to use based on if an operand size override prefix is present and the current CPU operation mode (that is, if it is running in protected mode, real mode, long mode, and so on). For example, if the instruction is an ADC ax, word ptr [0] and we are running this in protected mode (or, in assembly language terminology, we used bits32 oruse32 directives), we can use opcode 0×13 for the instruction. We know this instruction takes the form ADC REG16, MEM16/MEM32. AX is the first operand, which is a 16 bit register (REG16). But what is [0]? In order to find out, we take into consideration that there is no address size override and that we are in protected mode. Due to there being no address size override we are to use the native size, which is 32 bit memory addressing. (Please see the section on the address size override prefix for more information.) Using this, we conclude that we select the ADC REG16, MEM32 form. (If an address size override did exist, we would select the ADC REG16, MEM16 form).

If an instruction only has a single register operand, verify if the operand is stored in the OPCode.reg field. (Please see the Operational code section for more information.) Some instructions do this to save space, this is what allows single byte instructions. Some instructions might utilize two registers for operands storing them in OPCode.reg and Mod R/M.reg or Mod R/M.rm.

To complete this example, we note the following: OPCode 0×13, AX register code (0), Addressing mode is [Memory] with a displacement of 0. Due to is being in protected mode, we use the 32 bit Mod R/M form. Mod R/M.reg = 0 (selecting AX), Mod R/M.rm = 5 (DISP32) and Mod R/M.mod = 0 ([MEMORY]). This creates a Mod R/M value of 00 000 101 binary. Due to us not using a [SIB] mode, we do not need to use an SIB byte. (For an example that does use the SIB, please see our previous example for disassembling mov eax, [ebx+edi*2+0xa000]). Using this information, we can build the machine code.

0x66 0x13 0x05 0x00 0x00 0x00 0x00

The OPCode is in brown and Mod R/M byte is in red. The displacement byte is a DISP32 (due to Mod R/M.rm) so must be a dword. This is identified in black. The 0×66 is an operand size override prefix which is in blue. We use an operand override size prefix in order to select the REG16 and MEM16 form rather then the REG32 and MEM32 form. If we omit the prefix, we will get the following instead.

0x13 0x05 0x00 0x00 0x00 0x00

In protected mode, this is an adc eax, dword ptr [0] instruction which was not what we wanted. Please see the operand size override prefix section for more information.

If we wanted to turn ADC ax, word ptr [0] into REP ADC ax, word ptr ES:[0] we can use an ES override prefix and REP prefix:

0xf3 0x26 0x66 0x13 0x05 0x00 0x00 0x00 0x00

The order of the prefix bytes do not matter.

Resources

The following resources are presented for supplemental reading. Please note that we do not provide support for these resources.

http://ref.x86asm.net/ IA32 and IA64 instruction table
http://www.sandpile.org/ Instruction format tables
http://wiki.osdev.org/X86-64_Instruction_Encoding IA32 and IA64 Instruction encoding

Conclusion

This chapter provided an overview of machine language programming and the instruction encoding on IA32 and IA64 architectures. A goal of this chapter is to present the material in a new way to encourage the development of debuggers and tool-chains. This chapter can also be used as a reference with an instruction table when emulating certain instructions.

Please let me know if there are any questions or comments,

~Mike ();

OS Development Series Editor

This text was found here http://www.alexonlinux.com/how-debugger-works

Introduction

In this article, I’d like to tell you how real debugger works. What happens under the hood and why it happens. We’ll even write our own small debugger and see it in action.

I will talk about Linux, although same principles apply to other operating systems. Also, we’ll talk about x86 architecture. This is because it is the most common architecture today. On the other hand, even if you’re working with other architecture, you will find this article useful because, again, same principles work everywhere.

Kernel support

Actual debugging requires operating system kernel support and here’s why. Think about it. We’re living in a world where one process reading memory belonging to another process is a serious security vulnerability. Yet, when debugging a program, we would like to access a memory that is part of debugged process’s (debuggie) memory space, from debugger process. It is a bit of a problem, isn’t it? We could, of course, try somehow to use same memory space for both debugger and debuggie, but then what if debuggie itself creates processes. This really complicates things.

Debugger support has to be part of the operating system kernel. Kernel able to read and write memory that belongs to each and every process in the system. Furthermore, as long as process is not running, kernel can see value of its registers and debugger have to be able to know values of the debuggie registers. Otherwise it won’t be able to tell you where the debuggie has stopped (when we pressed CTRL-C in gdb for instance).

As we spoke about where debugger support starts we already mentioned several of the features that we need in order to have debugging support in operating system. We don’t want just any process to be able to debug other processes. Someone has to monitor debuggers and debuggies. Hence the debugger has to tell the kernel that it is going to debug certain process and kernel has to either permit or deny this request. Therefore, we need an ability to tell the kernel that certain process is a debugger and it is about to debug other process. Also we need an ability to query and set values from debuggie’s memory space. And we need an ability to query and set values of the debuggie’s registers, when it stops.

And operating system lets us to do all this. Each operating system does it in it’s manner of course. Linux provides single system call named ptrace() (defined in sys/ptrace.h), which allows to do all these operations and much more.

ptrace()

ptrace() accepts four arguments. First is one of the values from enum __ptrace_request that defined in sys/ptrace.h. This argument specifies what operation we would like to do, whether it is reading debuggie registers or altering values in its memory. Second argument specifies pid of the debuggie process. It’s not very obvious, but single process can debug several other processes. Thus we have to tell exactly what process we’re referring. Last two arguments are optional arguments for the call.

Starting to debug

One of the first things debuggers do to start debugging certain process is attaching to it or running it. There is a ptrace() operation for each one of these cases.

First called PTRACE_TRACEME, tells the kernel that calling process wants its parent to debug itself. I.e. me calling ptrace( PTRACE_TRACEME ) means I want my dad to debug me. This comes handy when you want debugger process to spawn the debuggie. In this case you do fork() creating a new process, then ptrace( PTRACE_TRACEME ) and then you call exec() or execve().

Second operation called PTRACE_ATTACH. It tells the kernel that calling process should become debugging parent of the process being called. Debugging parent means debugger and a parent process.

Debugger-debuggie synchronization

Alright. Now we told operating system that we are going to debug certain process. Operating system made it our child process. Good. This is a great time for us to have the debuggie stopped and us doing preparations before we actually start to debug. We may want to, for instance, analyze executable that we run and place a breakpoints before we actually start debugging. So, how do we stop the debuggie and let debugger do its thing?

Operating system does that for us using signals. Actually, operating system notifies us, the debugger, about all kinds of events that occur in debuggie and it does all that with signals. This includes the “debuggie is ready to shoot” signal. In particular, if we attach to existing process it receives SIGSTOP and we receive SIGCHLD once it actually stops. If we spawn a new process and it did ptrace( PTRACE_TRACEME ) it will receive SIGTRAP signal once it attempts to exec() or execve(). We will be notified with SIGCHLD about this, of course.

A new debugger was born

Now lets see code that actually demonstrates that. Complete listing can be found here.

The debuggie does the following…

if (ptrace( PTRACE_TRACEME, 0, NULL, NULL )) {
    perror( "ptrace" );
    return;
}

execve( “/bin/ls”, argv, envp );
Note the ptrace( PTRACE_TRACEME ) followed by execve(). This is what real debuggers do to spawn the process that going to be debugged. As you know, execve() replaces current executable image and memory of the current process with the executable and memory space belonging to program that being execve()‘d. Once kernel finishes this operation, it sends SIGTRAP to calling process and SIGCHLD to the debugger. The debugger receives appropriate notifications via signals and via wait() that returns. Here is the debugger’s code.

do {
    child = wait( &status );
    printf( "Debugger exited wait()\n" );
    if (WIFSTOPPED( status )) {
        printf( "Child has stopped due to signal %d\n",
        WSTOPSIG( status ) );
    }
    if (WIFSIGNALED( status )) {
        printf( "Child %ld received signal %d\n", (long)child, WTERMSIG(status) );
    }
} while (!WIFEXITED( status ));

Compiling and running listing1.c produces following output:

In debuggie process 14095
In debugger process 14094
Process 14094 received signal 17
Debugger exited wait()
Child has stopped due to signal 5

Here we can clearly see that debugger indeed receives a signal and gets notified via wait(). If we want to place a breakpoint before we start to debug the process, this is our chance. Lets talk about how we can do something like that.

The magic behind INT 3

It is time to dig a bit into subject that is not adored by most of the programmers and that is assembler language. I am afraid we don’t have much choice because breakpoints work on assembler level.

We have to understand that each our compiled program is actually a set of instructions that tells CPU what to do. Some of our C expressions translated into single instruction, while others may be translated into hundreds and even thousands of instructions. Instruction may be bigger or smaller. From 1 byte up to 15 bytes long for modern CPUs (Intel x86_64).

Debuggers mostly operate on CPU instruction level. The matter of fact that gdb understands C/C++ code and allows you to place breakpoints at certain C/C++ line is only an enhancement over gdb‘s basic ability to place breakpoints on certain instruction.

There are several ways to place breakpoints. The most widely used is the INT 3 instruction. It is a single byte operation code instruction that once reached by CPU, tells it to call special breakpoint interrupt handler, provided by operating system during its initialization. Since INT 3 instruction operation code is so small, we can safely substitute any instruction with it. Once operating system’s interrupt handler called, it figures what process reached a breakpoint and notifies it and its debugging process via signals.

Breakpoints hands on

Lets return to our debuggie/debugger friends. As we mentioned debugger does have a chance to place a breakpoint before letting the debuggie process to run. Lets see how this can be done.

Breakpoints placed with INT 3 instruction. Before writing the actual 0xcc (INT 3 operation code), we should figure where to place the instruction. For purpose of this article we will do it manually. On the contrary, real debuggers include complex logic that calculates where and when to place the breakpoints. gdb places several breakpoints by itself, without you even knowing about it. And obviously it has functionality that places breakpoints once you ask it to do so.

In our previous example we had our debuggie process executing ls. It is not suitable for our next demonstration. We will need a sample program that would let us easily demonstrate breakpoints in action. Here it is.

#include 

int main()
    printf( "~~~~~~~~~~~~> Before breakpoint\n" );
    // The breakpoint
    printf( "~~~~~~~~~~~~> After breakpoint\n" );
    return 0;
}

And here is the disassembler output of the main() routine.

0000000000400508 :
400508: 55 push %rbp
400509: 48 89 e5 mov %rsp,%rbp
40050c: bf 18 06 40 00 mov $0x400618,%edi
400511: e8 12 ff ff ff callq 400428
400516: bf 2a 06 40 00 mov $0x40062a,%edi
40051b: e8 08 ff ff ff callq 400428
400520: b8 00 00 00 00 mov $0x0,%eax
400525: c9 leaveq
400526: c3 retq

We can see that if we will place a breakpoint at address 0×400516, we will see a printout before reaching the breakpoint and right after reaching it. For the sake of our demonstration, we will place a breakpoint at this address. Once we will reach the breakpoint, we will sleep and then let the debuggie running. We should see debuggie producing first printout, then sleeping for a few seconds and then producing second printout.

We’ll achieve our goal in several steps.

First of all, we should fork() off the debuggie. We already did something similar. Next step is to intercept the execve() call in debuggie. Been there, done that. Here’s something new. We should modify a byte at address 0×400516 from 0xbf to 0xcc, saving original value (0xbf). This is how we place the breakpoint. Next, we’re going to wait() for the process. Once it will reach the breakpoint, we’ll be notified. Once the debuggie reaches the breakpoint we want to restore the code we broke with our 0xcc to its original state.

In addition, we want to fix value of RIP register. This register tells CPU what is the location in memory of next meaningful instruction for it to execute. It’s value will be 0×400517, one byte after 0xcc that we placed. We want to set the RIP register to 0×400516 value because we don’t want the CPU to skip over that MOV instruction that we broke with our 0xcc.

Finally, we want to wait five seconds for the sake of demonstration and let the debuggie continue running.
First things first. Lets see how we do step 3.

addr = 0x400516;

data = ptrace( PTRACE_PEEKTEXT, child, (void *)addr, NULL );
orig_data = data;
data = (data & ~0xff) | 0xcc;
ptrace( PTRACE_POKETEXT, child, (void *)addr, data );

Again, we can see how ptrace() does the job for us. First we peek 8 (sizeof( long )) bytes from address 0×400516. On some architectures this could cause lots of headache because of unaligned memory access. Luckily, we’re on x86_64 and unaligned memory accesses are permitted. Next we set the lowest byte to be 0xcc – INT 3 instruction. Finally, we place 8 bytes back to their place.

We’ve seen how we can wait for certain event in debuggie. Also, we now know how to restore the original value at address 0×400516. So we can skip over steps 4-5 and jump right into step 6. This is something that we haven’t done so far.

What we have to do is to read debuggie registers, change them and write them back. Again ptrace() does all the job for us.

struct user_regs_struct regs;
ptrace( PTRACE_GETREGS, child, NULL, &regs );
regs.rip = addr;
ptrace( PTRACE_SETREGS, child, NULL, &regs );

Things are not too well documented here. For instance ptrace() documentation never mentions struct user_regs_struct, however this is what ptrace() system call expects to receive in kernel. Once we know what we should use as ptrace() arguments, it is easy. We use PTRACE_GETREGS operation to obtain values of debuggie’s registers, we modify the RIP register and write them back with PTRACE_SETREGS operation. Clear and simple.

Lets see how things actually work. You can find complete listing of debugger process here. Compiling and running listing2.c, produces following output.

In debuggie process 29843
In debugger process 29842
Process 29842 received signal 17
~~~~~~~~~~~~> Before breakpoint
Process 29842 received signal 17
RIP before resuming child is 400517
Time before debugger falling asleep: 1206346035
Time after debugger falling asleep: 1206346040. Resuming debuggie...
~~~~~~~~~~~~> After breakpoint
Process 29842 received signal 17
Debuggie exited...
Debugger exiting...

You can see that “Before breakpoint” printout appears 5 seconds before “After breakpoint” printout. The “RIP before resuming child is 400517″ clearly indicates that the debuggie has stopped on address 0×400517, as we expected.

Single steps

After seeing how easy to place a breakpoint, you can guess that stepping over one line of C/C++ code is simply a matter of placing a breakpoint on the next line of code. This is exactly what gdb does when you want it to single step over some expression.

Conclusion

Debuggers and how they work often associated with some kind of magic.

Debuggers, and gdb as an example, are exceptionally complicated piece of software. Placing breakpoints and single stepping is only a small fraction of what it is able to do. gdb in particular works on dozens of hardware architectures. It supports remote debugging. It is perhaps the most advanced and complicated executable analyzer. It knows when a program loads dynamic library and analyzes the code of that library automatically. It supports bunch of programming languages – from C/C++ to ADA. And these are just few out of its features.

On the contrary, we’ve seen how easy to start debugging certain process, place a breakpoint, etc. The basic functionality that allows debugging is in the operating system and in the CPU, waiting for us to use it.

========================================================================
			LINUX ASSEMBLER TUTORIAL

				   by

			      Robin Miyagi
			      
				   @

http://www.geocities.com/SiliconValley/Ridge/2544/

========================================================================
	   
start@: Thu Feb 03 02:14:37 UTC 2000

update: Fri Jul 30 23:52:23 UTC 2000

update: Fri Sep 15 22:39:17 UTC 2000 :

    - This tutorial  now explains  Linux assembler in  terms of  the GNU
      assembler `as'.

    - Information  about  Binutils programs  such  as  Objdump, and  ld.
      Discussion     on     Debugging     and    `gdb'     is     added.

update: Thu Jan 11 20:13:06 UTC 2001 :
      
========================================================================

* Introduction
------------------------------------------------------------------------

    When programming in  assembler for Linux (or any  other Unix variant
    for  that matter),  it  is important  to  remember that  Linux is  a
    protected mode  operating system  (on i386 machines,  Linux operates
    the  CPU in  protected mode).   This means  that ordinary  user mode
    processes are not allowed to  do certain things, such as access DMA,
    or access IO ports.  Writing  Linux kernel modules on the other hand
    (which  operate in  kernel  mode), are  allowed  to access  hardware
    directly  (Read the Assembler-HOWTO  on my  assembler page  for more
    information on this issue).  User mode processes may access hardware
    using  device files.   Device files  actually access  kernel modules
    which  access hardware directly.   This file  will be  restricted to
    user mode operation.  See my pages on kernel module programming.

    Please email me comments  and suggestions regarding this tutorial at
    penguin@dccnet.com .

* System Calls
------------------------------------------------------------------------

    In  programming  in assembler  for  DOS  you  probably made  use  of
    software interrupts,  especially the  int 0x21 functions  which were
    the DOS system calls.  In Linux, system calls are made via int 0x80.
    The sytem call number is passed via register EAX, and the parameters
    to the  system call  are passed via  the remaining  registers.  This
    discussion only  applies if there  are no more than  five parameters
    passed to  the system  call.  If there  are more than  5 parameters.
    The parameters  must be located in  memory (e.g. on  the stack), and
    EBX must contain the address of the beginning of the parameters.

    If you  would like a  list of the  system call numbers, look  at the
    contents   of   /usr/include/asm/unistd.h.    If  you   would   like
    information about a specific system  call (e.g. write ()), type `man
    2 write'  at the prompt.   Section 2 of  the linux man  pages covers
    sytem calls.

    If you  look at the contents of  /usr/include/asm/unistd.h, you will
    see the following line near the top of the file;

    #define __NR_write		4

    This indicates that  register EAX must be set to 4  in order to call
    the  write  () system  call.   Now,  if  you execute  the  following
    command;

    $ man 2 write

    you  get  the following  function  description  (under the  SYNOPSIS
    heading).

    ssize_t write(int fd, const void *buf, size_t count);

    This indicates that ebx is equal  to the file descriptor of the file
    you want  to write to, ecx  is a pointer  of the string you  want to
    write, and edx  contains the length of the string.   If there were 2
    more parameters  to this system call,  they would be  placed in esi,
    and edi respectively.

    How do I know  the file discriptor for stdout is 1.   If you look at
    your /dev directory, you will  notice that /dev/stdout is a symbolic
    link  that  points to  /proc/self/fd/1.   Therefore  stdout is  file
    descriptor 1.

    I leave looking up the _exit system call as an exercise.
    
    In linux, system calls are processed by the kernel.

* GNU Assembler
------------------------------------------------------------------------

    On  most Linux systems,  you will  usually find  the GNU  C compiler
    (gcc).  This compiler  uses an assembler called `as'  as a back-end.
    This means that the C compiler translates the C code into assembler,
    which in turn is assembled by `as' to an object file (*.o).

    `As'  uses  the AT&T  syntax.   Experienced  intel syntax  assembler
    programmers find  AT&T `really weird'.  It  is really no  more or no
    less difficult than  intel syntax.  I switched over  to `as' because
    there is  less ambiguity, works  better with the  standard GNU/Linux
    programs such as gdb  (supports the gstabs format), objdump (objdump
    dissassembles  code in  `as' syntax).   In short,  it is  a standard
    component of a GNU Linux system with programming tools installed.  I
    will explain debugging and objdump later in this tutorial.

    If  you would  like more  information about  `as' look  in  the info
    documentation under  as (e.g. type  `info as' at the  shell prompt).
    Also look  in the info  documentation on the Binutils  package (this
    package contains such programming tools as objdump, ld, etc.).

    
** GNU assembler v.s. Intel Syntax
------------------------------------------------------------------------

    Since most assembler documentation  for the i386 platform is written
    using  intel syntax,  some comparison  between the  2 formats  is in
    order.  Here is a summarized list of the differences;

      - In `as' the source comes before the the destination, opposite to
        the intel syntax.

      - The opcodes  are suffixed with  a letter indicating the  size of
        the opperands (e.g. `l' for dword, `w' for word, `b' for byte).

      - Immediate values must be prefixed with a `$', and registers must
        be prefixed with a `%'.

      - Effective      addresses     use     the      General     syntax
        DISP(BASE,INDEX,SCALE).  A concrete example would be;

	    movl mem_location(%ebx,%ecx,4), %eax

	Which is equivelent to the following in intel syntax;

	    mov eax, [eax + ecx*4 + mem_location]

    Now  for an example  illustrating the  difference (intel  version in
    comments);

        movl %eax, %ebx		# mov %ebx, %eax
	movw $0x3c4a, %ax
    
    Now for our little program;
------------------------------------------------------------------------   
	## hello-world.s

	## by Robin Miyagi
	## http://www.geocities.com/SiliconValley/Ridge/2544/

	## Compile Instructions:
	## -------------------------------------------------------------
	## as -o hello-world.o hello-world.s
	## ld -o hello-world -O0 hello-world.o

	## This  file is  a basic  demonstration of  the  GNU assembler,
	## `as'.
	
	## This program  displays a friendly string on  the screen using
	## the write () system call
########################################################################
	.section .data
hello:	
	.ascii 	"Hello, world!\n"
hello_len:
	.long 	. - hello
########################################################################
	.section .text
	.globl _start
	
_start:
	## display string using write () system call
	xorl %ebx, %ebx		# %ebx = 0
	movl $4, %eax		# write () system call
	xorl %ebx, %ebx		# %ebx = 0
	incl %ebx		# %ebx = 1, fd = stdout
	leal hello, %ecx	# %ecx ---> hello
	movl hello_len, %edx	# %edx = count
	int $0x80		# execute write () system call
	
	## terminate program via _exit () system call 
	xorl %eax, %eax		# %eax = 0
	incl %eax		# %eax = 1 system call _exit ()
	xorl %ebx, %ebx		# %ebx = 0 normal program return code
	int $0x80		# execute system call _exit ()

------------------------------------------------------------------------

    In the above program, notice the use of `#' to start comments.  `As'
    also supports the `/* C comment *' syntax.  If you use the C comment
    syntax, it works exactly the same  as for C (multiple lines, as well
    as inline commenting).  I always use the `#' comment syntax, as this
    works better with  emacs' asm-mode.  The double `##'  is allowed but
    not neccessary (this is only because of a quirk of emacs asm-mode).

    Notice the names  of the sections .text, and  .data.  these are used
    in ELF  files to tell  the linker where  the code and  data segments
    are.  There  is also the  .bss section to store  uninitialized data.
    It  is  only  these  sections  that occupy  memory  durring  program
    execution.


* Accessing Command Line Arguments and Environment Variables

    When an  ELF executable starts  running, the command  line arguments
    and environment variables are  available on the stack.  In assembler
    this means that  you may access these via the  pointer stored in ESP
    when  the program  starts execution.   See the  documentation  on my
    assembler programming page relating to the ELF binary format.

    So how  is this  data arranged on  the stack?  Quite  simple really.
    The  number of  command line  arguments (including  the name  of the
    program)  are stored as  an integer  at [esp].   Then, at  [esp+4] a
    pointer to the first command line argument (which is the name of the
    program)  is stored.   If  there were  any  additional command  line
    parameters,  their pointers  would be  stored in  [esp+8], [esp+12],
    etc.  After  all the  command line argument  pointers, comes  a NULL
    pointer.   After  the NULL  pointer  are  all  the pointers  to  the
    environment variables,  and then finally a NULL  pointer to indicate
    the end of the environment variables have been reached.

    A summary of the initial ELF stack is shown below;

    (%esp)	 argc, count of arguments (integer)
    4(%esp)	 char *argv (pointer to first command line argument)
       ...	 pointers to the rest of the command line arguments
    ?(%esp) NULL pointer
       ...	 pointers to environment variables
    ??(%esp)	 NULL pointer

    Now for our little program;
------------------------------------------------------------------------
	## stack-param.s ###############################################

	## Robin Miyagi ################################################
	## http://www.geocities.com/SiliconValley/Ridge/2544/ ##########

	## This file  shows how one  can access command  line parameters
	## via the stack at process  start up.  This behavior is defined
	## in the ELF specification.

	## Compile Instructions:
	## -------------------------------------------------------------
	## as -o stack-param.o stack-param.s
	## ld -O0 -o stack-param stack-param.o
########################################################################
	.section .data

new_line_char:
	.byte 0x0a
########################################################################
	.section .text

	.globl _start

	.align 4
_start:
	movl %esp, %ebp		# store %esp in %ebp
again:
	addl $4, %esp		# %esp ---> next parameter on stack
	movl (%esp), %eax	# move next parameter into %eax
	testl %eax, %eax	# %eax (parameter) == NULL pointer?
	jz end_again		# get out of loop if yes
	call putstring		# output parameter to stdout.
	jmp again		# repeat loop
end_again:
	xorl %eax, %eax		# %eax = 0
	incl %eax		# %eax = 1, system call _exit ()
	xorl %ebx, %ebx		# %ebx = 0, normal program exit.
	int $0x80		# execute _exit () system call

	## prints string to stdout
putstring:	.type @function
	pushl %ebp
	movl %esp, %ebp
	movl 8(%ebp), %ecx
	xorl %edx, %edx
count_chars:
	movb (%ecx,%edx,$1), %al
	testb %al, %al
	jz done_count_chars
	incl %edx
	jmp count_chars
done_count_chars:
	movl $4, %eax
	xorl %ebx, %ebx
	incl %ebx
	int $0x80
	movl $4, %eax
	leal new_line_char, %ecx
	xorl %edx, %edx
	incl %edx
	int $0x80 
	movl %ebp, %esp
	popl %ebp
	ret
		
------------------------------------------------------------------------

* The Binutils Package
------------------------------------------------------------------------

    Binutils stands  for binary utilities,  and includes a lot  of tools
    useful to programmers, especially durring debugging.

    I will now address some of these utilities.

** Objdump
------------------------------------------------------------------------

    Objdump  diplays information  about  1 or  more  object files.   For
    example, to  see information  about param-stack, type  the following
    command  at  shell  prompt   (be  sure  working  directory  contains
    param-stack);

        objdump -x param-stack | less

    Since the  information is likely to  span more than  one screen, the
    output  of objdump  is piped  to the  standard input  of  the paging
    command  `less'.   the option  `-x'  tells  objdump  to display  the
    numeric information in hexadecimal.  Here is the output of the above
    command;

        ----------------------------------------------------------------
	stack-param:     file format elf32-i386
	stack-param
	architecture: i386, flags 0x00000112:
	EXEC_P, HAS_SYMS, D_PAGED
	start address 0x08048074
	
	Program Header:
	    LOAD off    0x00000000 vaddr 0x08048000 paddr 0x08048000 align 2**12
		 filesz 0x000000be memsz 0x000000be flags r-x
	    LOAD off    0x000000c0 vaddr 0x080490c0 paddr 0x080490c0 align 2**12
		 filesz 0x00000001 memsz 0x00000004 flags rw-
	
	Sections:
	Idx Name          Size      VMA       LMA       File off  Algn
	  0 .text         0000004a  08048074  08048074  00000074  2**2
			  CONTENTS, ALLOC, LOAD, READONLY, CODE
	  1 .data         00000001  080490c0  080490c0  000000c0  2**2
			  CONTENTS, ALLOC, LOAD, DATA
	  2 .bss          00000000  080490c4  080490c4  000000c4  2**2
			  ALLOC
	SYMBOL TABLE:
	08048074 l    d  .text	00000000 
	080490c0 l    d  .data	00000000 
	080490c4 l    d  .bss	00000000 
	00000000 l    d  *ABS*	00000000 
	00000000 l    d  *ABS*	00000000 
	00000000 l    d  *ABS*	00000000 
	080490c0 l       .data	00000000 new_line_char
	08048076 l       .text	00000000 again
	08048087 l       .text	00000000 end_again
	0804808e l       .text	00000000 putstring
	08048096 l       .text	00000000 count_chars
	080480a0 l       .text	00000000 done_count_chars
	00000000       F *UND*	00000000 
	080480be g     O *ABS*	00000000 _etext
	08048074 g       .text	00000000 _start
	080490c1 g     O *ABS*	00000000 __bss_start
	080490c1 g     O *ABS*	00000000 _edata
	080490c4 g     O *ABS*	00000000 _end

	----------------------------------------------------------------

    Notice the  Information provided from the program  header (ELF files
    have  header  information  at  the  beginning  of  the  file  giving
    information to the kernel on how to load the file into memory etc.).

    ELF files also contain  information about the sections (contained in
    section tables).  Notice that  the .text section contains 0x4a bytes
    of information, is located 0x74  bytes into the file, and is aligned
    at a  4 byte  boundary (4  == 2 **  2), has  memory allocated  to it
    (ALLOC), is readoly, and contains  code (the segment selector cs for
    this  process  points to  this  section  (handled  by the  operating
    system)).

    Information  about   the  symbols   is  also  provided.    All  this
    information  is used  by debuggers  and other  programming  tools to
    examine binary files.

    Objdump can also be used to dissasemble binary executables.  Typeing
    the following command will  dissassemble the file to standard output
    (this does  nothing to the actual  file, as objdump  only reads from
    the file);

        objdump -d stack-param | less

    Here is the output of the above command;

        ----------------------------------------------------------------
	stack-param:     file format elf32-i386

	Disassembly of section .text:
	
	08048074 :
	 8048074:	89 e5                	movl   %esp,%ebp
	
	08048076 :
	 8048076:	83 c4 04             	addl   $0x4,%esp
	 8048079:	8b 04 24             	movl   (%esp,1),%eax
	 804807c:	85 c0                	testl  %eax,%eax
	 804807e:	74 07                	je     8048087 
	 8048080:	e8 09 00 00 00       	call   804808e 
	 8048085:	eb ef                	jmp    8048076 
	
	08048087 :
	 8048087:	31 c0                	xorl   %eax,%eax
	 8048089:	40                   	incl   %eax
	 804808a:	31 db                	xorl   %ebx,%ebx
	 804808c:	cd 80                	int    $0x80
	
	0804808e :
	 804808e:	55                   	pushl  %ebp
	 804808f:	89 e5                	movl   %esp,%ebp
	 8048091:	8b 4d 08             	movl   0x8(%ebp),%ecx
	 8048094:	31 d2                	xorl   %edx,%edx
	
	08048096 :
	 8048096:	8a 04 11             	movb   (%ecx,%edx,1),%al
	 8048099:	84 c0                	testb  %al,%al
	 804809b:	74 03                	je     80480a0 
	 804809d:	42                   	incl   %edx
	 804809e:	eb f6                	jmp    8048096 
	
	080480a0 :
	 80480a0:	b8 04 00 00 00       	movl   $0x4,%eax
	 80480a5:	31 db                	xorl   %ebx,%ebx
	 80480a7:	43                   	incl   %ebx
	 80480a8:	cd 80                	int    $0x80
	 80480aa:	b8 04 00 00 00       	movl   $0x4,%eax
	 80480af:	8d 0d c0 90 04 08    	leal   0x80490c0,%ecx
	 80480b5:	31 d2                	xorl   %edx,%edx
	 80480b7:	42                   	incl   %edx
	 80480b8:	cd 80                	int    $0x80
	 80480ba:	89 ec                	movl   %ebp,%esp
	 80480bc:	5d                   	popl   %ebp
	 80480bd:	c3                   	ret    
        ----------------------------------------------------------------

    The `-d' tells objdump to  disassemble sections that are expected to
    contain  code (usually the  .text section).   Using the  `-D' option
    will disassemble all  sections.  Objdump was able to  give the names
    of labels  in the code because  of the information  contained in the
    symbols table.

    The first column  displays the virtual memory address  for each line
    of code.  The second  column displays the machine code corresponding
    to its  respective assembler line of  code, and finally  the code in
    assembler is contained in the 3rd column.

    For more information look in the info documentation system.

** Getting the amount of memory used with size
------------------------------------------------------------------------

    If you do an `ls -l stack-param' you get the following

        -rwxrwxr-x    1 robin    robin         932 Sep 15 18:21 stack-param

    This tells you  that the file is 932 bytes  long.  However this file
    also contains header tables, section tables, symbol tables etc.  The
    amount of memory that this program will use durring run time will be
    less than this.  To find out actual memory use, type the following;

        size stack-param

    The above will result in the following output;

   	text	   data	    bss	    dec	    hex	filename
   	  74	      1	      0	     75	     4b	stack-param

    This tells you that .text  occupies 74 bytes, and .data occupies one
    byte, for a total of 75 bytes memory use.

** Getting rid of symbol information with strip
------------------------------------------------------------------------

    The strip command can be used  to get rid of the symbol information.
    With no options, this command  only strips symbols that are not used
    for debugging.  With the `--stip-all' option provided, it will strip
    all  symbol  information, including  those  used  for debugging.   I
    recommend not doing this, as  this makes the files harder to analyse
    with the standard  programming tools.  This command is  used only if
    file size is of paramount importance.

* debugging and gdb
------------------------------------------------------------------------

    Perhaps  the  most difficult  aspect  of  programming is  debugging.
    Quite  often  the  error   that  caused  the  program  to  terminate
    abnormally  is not  at the  line where  the program  terminated (the
    example later on will show this).

    Program that exits with SIG_SEGV
------------------------------------------------------------------------
	## stack-param-error.s #########################################

	## Robin Miyagi ################################################
	## http://www.geocities.com/SiliconValley/Ridge/2544/ ##########

	## This file  shows how one  can access command  line parameters
	## via the stack at process  start up.  This behavior is defined
	## in the ELF specification.

	## Compile Instructions:
	## -------------------------------------------------------------
	## as --gstabs -o stack-param-error.o stack-param-error.s
	## ld -O0 -o stack-param-error stack-param-error.o
########################################################################
	.section .data

new_line_char:
	.byte 0x0a
########################################################################
	.section .text

	.globl _start

	.align 4
_start:
	movl %esp, %ebp		# store %esp in %ebp
again:
	addl $4, %esp		# %esp ---> next parameter on stack
	leal (%esp), %eax	# move next parameter into %eax
	testl %eax, %eax	# %eax (parameter) == NULL pointer?
	jz end_again		# get out of loop if yes
	call putstring		# output parameter to stdout.
	jmp again		# repeat loop
end_again:
	xorl %eax, %eax		# %eax = 0
	incl %eax		# %eax = 1, system call _exit ()
	xorl %ebx, %ebx		# %ebx = 0, normal program exit.
	int $0x80		# execute _exit () system call

	## prints string to stdout
putstring:	.type @function
	pushl %ebp
	movl %esp, %ebp
	movl 8(%ebp), %ecx
	xorl %edx, %edx
count_chars:
	movb (%ecx,%edx,$1), %al
	testb %al, %al
	jz done_count_chars
	incl %edx
	jmp count_chars
done_count_chars:
	movl $4, %eax
	xorl %ebx, %ebx
	incl %ebx
	int $0x80
	movl $4, %eax
	leal new_line_char, %ecx
	xorl %edx, %edx
	incl %edx
	int $0x80 
	movl %ebp, %esp
	popl %ebp
	ret
------------------------------------------------------------------------

    Notice  that the  above  program is  assembled  with the  `--gstabs'
    option of  `as'.  This make  as put debugging information  in output
    file,  such as  the  original source  file,  debugging symbols  etc.
    Using  `objdump  -x stack-param-error  |  less'  will  show you  the
    inclusion of debugging symbols.

    Now to find out where our error occurred type the following command;

        gdb stack-param-error

    this will get you to the gdb prompt `(gdb)';

        (gdb) run eat my shorts
	/home/robin/programming/asm-tut/stack-param-error
	eat
	my
	shorts
	Program recieved SIGSEGV, segmentation fault
	count_chars () at stack-param-error.s:47

	47 	movb (%ecx,%edx,$1), %al
	Current language: auto; currently asm
	(gdb) q
	[~]$ _

	(gdb will output more than this, I just wanted to highlight what
	is important).

    This  tells us that  the segmentation  fault occured  at line  47 of
    param-stack-error.s.  However the problem was caused in line 29.  If
    you look  at line 29 of  stack-param.s, you will see  that this line
    reads `movl (%esp), %eax'.  This is due to the way intel i386 opcode
    lea handles  NULL pointers.  EAX was  never loaded with 0  on a null
    pointer (just some invalid pointer),  which caused line 47 to access
    an  area  of  memory  not  available  to  this  process  (hence  the
    segmentation fault).  The loop  in _start () never stopped normally,
    as the condition for breaking out  of the loop is eax being 0, which
    never happened.

    Debugging is an art that  comes with practice.  For more information
    about gdb, look  in the info pages (e.g. `info  gdb').  You can also
    type `help' at the (gdb) prompt.

    The only  reason gdb was  able to tell  you what line number  in the
    source  code the  error occured  is that  the debugging  symbols and
    source code was included in the output file (recall that we used the
    `--gstabs' option).

    --------------------------------------------------------------------
    Comments and suggestions 

========================================================================

You are free to make verbatim copies of this file, providing that this
notice is preserved.
This text was found here: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html

GNU gprof

This manual describes the GNU profiler, gprof, and how you can use it to determine which parts of a program are taking most of the execution time. We assume that you know how to write, compile, and execute programs. GNU gprof was written by Jay Fenlason.

This manual was edited January 1993 by Jeffrey Osier.

Copyright (C) 1988, 1992 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the same conditions as for modified versions.

Why Profile

Profiling allows you to learn where your program spent its time and which functions called which other functions while it was executing. This information can show you which pieces of your program are slower than you expected, and might be candidates for rewriting to make your program execute faster. It can also tell you which functions are being called more or less often than you expected. This may help you spot bugs that had otherwise been unnoticed.

Since the profiler uses information collected during the actual execution of your program, it can be used on programs that are too large or too complex to analyze by reading the source. However, how your program is run will affect the information that shows up in the profile data. If you don’t use some feature of your program while it is being profiled, no profile information will be generated for that feature.

Profiling has several steps:

The next three chapters explain these steps in greater detail.

The result of the analysis is a file containing two tables, the flat profile and the call graph (plus blurbs which briefly explain the contents of these tables).

The flat profile shows how much time your program spent in each function, and how many times that function was called. If you simply want to know which functions burn most of the cycles, it is stated concisely here. See section How to Understand the Flat Profile.

The call graph shows, for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where you might try to eliminate function calls that use a lot of time. See section How to Read the Call Graph.

Compiling a Program for Profiling

The first step in generating profile information for your program is to compile and link it with profiling enabled.

To compile a source file for profiling, specify the `-pg’ option when you run the compiler. (This is in addition to the options you normally use.)

To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify `-pg’ in addition to your usual options. The same option, `-pg’, alters either compilation or linking to do what is necessary for profiling. Here are examples:

cc -g -c myprog.c utils.c -pg
cc -o myprog myprog.o utils.o -pg

The `-pg’ option also works with a command that both compiles and links:

cc -o myprog myprog.c utils.c -g -pg

If you run the linker ld directly instead of through a compiler such as cc, you must specify the profiling startup file `/lib/gcrt0.o' as the first input file instead of the usual startup file `/lib/crt0.o'. In addition, you would probably want to specify the profiling C library, `/usr/lib/libc_p.a', by writing `-lc_p’instead of the usual `-lc’. This is not absolutely necessary, but doing this gives you number-of-calls information for standard library functions such as read andopen. For example:

ld -o myprog /lib/gcrt0.o myprog.o utils.o -lc_p

If you compile only some of the modules of the program with `-pg’, you can still profile the program, but you won’t get complete information about the modules that were compiled without `-pg’. The only information you get for the functions in those modules is the total time spent in them; there is no record of how many times they were called, or from where. This will not affect the flat profile (except that the calls field for the functions will be blank), but will greatly reduce the usefulness of the call graph.

Executing the Program to Generate Profile Data

Once the program is compiled for profiling, you must run it in order to generate the information that gprof needs. Simply run the program as usual, using the normal arguments, file names, etc. The program should run normally, producing the same output as usual. It will, however, run somewhat slower than normal because of the time spent collecting and the writing the profile data.

The way you run the program–the arguments and input that you give it–may have a dramatic effect on what the profile information shows. The profile data will describe the parts of the program that were activated for the particular input you use. For example, if the first command you give to your program is to quit, the profile data will show the time used in initialization and in cleanup, but not much else.

You program will write the profile data into a file called `gmon.out' just before exiting. If there is already a file called `gmon.out', its contents are overwritten. There is currently no way to tell the program to write the profile data under a different name, but you can rename the file afterward if you are concerned that it may be overwritten.

In order to write the `gmon.out' file properly, your program must exit normally: by returning from main or by calling exit. Calling the low-level function _exitdoes not write the profile data, and neither does abnormal termination due to an unhandled signal.

The `gmon.out' file is written in the program’s current working directory at the time it exits. This means that if your program calls chdir, the `gmon.out' file will be left in the last directory your program chdir‘d to. If you don’t have permission to write in this directory, the file is not written. You may get a confusing error message if this happens. (We have not yet replaced the part of Unix responsible for this; when we do, we will make the error message comprehensible.)

gprof Command Summary

After you have a profile data file `gmon.out', you can run gprof to interpret the information in it. The gprof program prints a flat profile and a call graph on standard output. Typically you would redirect the output of gprof into a file with `>’.

You run gprof like this:

gprof options [executable-file [profile-data-files...]] [> outfile]

Here square-brackets indicate optional arguments.

If you omit the executable file name, the file `a.out' is used. If you give no profile data file name, the file `gmon.out' is used. If any file is not in the proper format, or if the profile data file does not appear to belong to the executable file, an error message is printed.

You can give more than one profile data file by entering all their names after the executable file name; then the statistics in all the data files are summed together.

The following options may be used to selectively include or exclude functions in the output:

-a
The `-a’ option causes gprof to suppress the printing of statically declared (private) functions. (These are functions whose names are not listed as global, and which are not visible outside the file/function/block where they were defined.) Time spent in these functions, calls to/from them, etc, will all be attributed to the function that was loaded directly before it in the executable file. This option affects both the flat profile and the call graph.
-e function_name
The `-e function‘ option tells gprof to not print information about the function function_name (and its children…) in the call graph. The function will still be listed as a child of any functions that call it, but its index number will be shown as `[not printed]‘. More than one `-e’ option may be given; only one function_name may be indicated with each `-e’ option.
-E function_name
The -E function option works like the -e option, but time spent in the function (and children who were not called from anywhere else), will not be used to compute the percentages-of-time for the call graph. More than one `-E’ option may be given; only one function_name may be indicated with each `-E’option.
-f function_name
The `-f function‘ option causes gprof to limit the call graph to the function function_name and its children (and their children…). More than one `-f’option may be given; only one function_name may be indicated with each `-f’ option.
-F function_name
The `-F function‘ option works like the -f option, but only time spent in the function and its children (and their children…) will be used to determine total-time and percentages-of-time for the call graph. More than one `-F’ option may be given; only one function_name may be indicated with each `-F’option. The `-F’ option overrides the `-E’ option.
-k from... to...
The `-k’ option allows you to delete from the profile any arcs from routine from to routine to.
-v
The `-v’ flag causes gprof to print the current version number, and then exit.
-z
If you give the `-z’ option, gprof will mention all functions in the flat profile, even those that were never called, and that had no time spent in them. This is useful in conjunction with the `-c’ option for discovering which routines were never called.

The order of these options does not matter.

Note that only one function can be specified with each -e-E-f or -F option. To specify more than one function, use multiple options. For example, this command:

gprof -e boring -f foo -f bar myprogram > gprof.output

lists in the call graph all functions that were reached from either foo or bar and were not reachable from boring.

There are a few other useful gprof options:

-b
If the `-b’ option is given, gprof doesn’t print the verbose blurbs that try to explain the meaning of all of the fields in the tables. This is useful if you intend to print out the output, or are tired of seeing the blurbs.
-c
The `-c’ option causes the static call-graph of the program to be discovered by a heuristic which examines the text space of the object file. Static-only parents or children are indicated with call counts of `0′.
-d num
The `-d num‘ option specifies debugging options.
-s
The `-s’ option causes gprof to summarize the information in the profile data files it read in, and write out a profile data file called `gmon.sum', which contains all the information from the profile data files that gprof read in. The file `gmon.sum' may be one of the specified input files; the effect of this is to merge the data in the other input files into `gmon.sum'. See section Statistical Inaccuracy of gprof Output.Eventually you can run gprof again without `-s’ to analyze the cumulative data in the file `gmon.sum'.
-T
The `-T’ option causes gprof to print its output in “traditional” BSD style.

How to Understand the Flat Profile

The flat profile shows the total amount of time your program spent executing each function. Unless the `-z’ option is given, functions with no apparent time spent in them, and no apparent calls to them, are not mentioned. Note that if a function was not compiled for profiling, and didn’t run long enough to show up on the program counter histogram, it will be indistinguishable from a function that was never called.

This is part of a flat profile for a small program:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 33.34      0.02     0.02     7208     0.00     0.00  open
 16.67      0.03     0.01      244     0.04     0.12  offtime
 16.67      0.04     0.01        8     1.25     1.25  memccpy
 16.67      0.05     0.01        7     1.43     1.43  write
 16.67      0.06     0.01                             mcount
  0.00      0.06     0.00      236     0.00     0.00  tzset
  0.00      0.06     0.00      192     0.00     0.00  tolower
  0.00      0.06     0.00       47     0.00     0.00  strlen
  0.00      0.06     0.00       45     0.00     0.00  strchr
  0.00      0.06     0.00        1     0.00    50.00  main
  0.00      0.06     0.00        1     0.00     0.00  memcpy
  0.00      0.06     0.00        1     0.00    10.11  print
  0.00      0.06     0.00        1     0.00     0.00  profil
  0.00      0.06     0.00        1     0.00    50.00  report
...

The functions are sorted by decreasing run-time spent in them. The functions `mcount’ and `profil’ are part of the profiling aparatus and appear in every flat profile; their time gives a measure of the amount of overhead due to profiling.

The sampling period estimates the margin of error in each of the time figures. A time figure that is not much larger than this is not reliable. In this example, the `self seconds’ field for `mcount’ might well be `0′ or `0.04′ in another run. See section Statistical Inaccuracy of gprof Output, for a complete discussion.

Here is what the fields in each line mean:

% time
This is the percentage of the total execution time your program spent in this function. These should all add up to 100%.
cumulative seconds
This is the cumulative total number of seconds the computer spent executing this functions, plus the time spent in all the functions above this one in this table.
self seconds
This is the number of seconds accounted for by this function alone. The flat profile listing is sorted first by this number.
calls
This is the total number of times the function was called. If the function was never called, or the number of times it was called cannot be determined (probably because the function was not compiled with profiling enabled), the calls field is blank.
self ms/call
This represents the average number of milliseconds spent in this function per call, if this function is profiled. Otherwise, this field is blank for this function.
total ms/call
This represents the average number of milliseconds spent in this function and its descendants per call, if this function is profiled. Otherwise, this field is blank for this function.
name
This is the name of the function. The flat profile is sorted by this field alphabetically after the self seconds field is sorted.

How to Read the Call Graph

The call graph shows how much time was spent in each function and its children. From this information, you can find functions that, while they themselves may not have used much time, called other functions that did use unusual amounts of time.

Here is a sample call from a small program. This call came from the same gprof run as the flat profile example in the previous chapter.

granularity: each sample hit covers 2 byte(s) for 20.00% of 0.05 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00    0.05                 start [1]
                0.00    0.05       1/1           main [2]
                0.00    0.00       1/2           on_exit [28]
                0.00    0.00       1/1           exit [59]
-----------------------------------------------
                0.00    0.05       1/1           start [1]
[2]    100.0    0.00    0.05       1         main [2]
                0.00    0.05       1/1           report [3]
-----------------------------------------------
                0.00    0.05       1/1           main [2]
[3]    100.0    0.00    0.05       1         report [3]
                0.00    0.03       8/8           timelocal [6]
                0.00    0.01       1/1           print [9]
                0.00    0.01       9/9           fgets [12]
                0.00    0.00      12/34          strncmp <cycle 1> [40]
                0.00    0.00       8/8           lookup [20]
                0.00    0.00       1/1           fopen [21]
                0.00    0.00       8/8           chewtime [24]
                0.00    0.00       8/16          skipspace [44]
-----------------------------------------------
[4]     59.8    0.01        0.02       8+472     <cycle 2 as a whole>	[4]
                0.01        0.02     244+260         offtime <cycle 2> [7]
                0.00        0.00     236+1           tzset <cycle 2> [26]
-----------------------------------------------

The lines full of dashes divide this table into entries, one for each function. Each entry has one or more lines.

In each entry, the primary line is the one that starts with an index number in square brackets. The end of this line says which function the entry is for. The preceding lines in the entry describe the callers of this function and the following lines describe its subroutines (also called children when we speak of the call graph).

The entries are sorted by time spent in the function and its subroutines.

The internal profiling function mcount (see section How to Understand the Flat Profile) is never mentioned in the call graph.

The Primary Line

The primary line in a call graph entry is the line that describes the function which the entry is about and gives the overall statistics for this function.

For reference, we repeat the primary line from the entry for function report in our main example, together with the heading line that shows the names of the fields:

index  % time    self  children called     name
...
[3]    100.0    0.00    0.05       1         report [3]

Here is what the fields in the primary line mean:

index
Entries are numbered with consecutive integers. Each function therefore has an index number, which appears at the beginning of its primary line.Each cross-reference to a function, as a caller or subroutine of another, gives its index number as well as its name. The index number guides you if you wish to look for the entry for that function.
% time
This is the percentage of the total time that was spent in this function, including time spent in subroutines called from this function.The time spent in this function is counted again for the callers of this function. Therefore, adding up these percentages is meaningless.
self
This is the total amount of time spent in this function. This should be identical to the number printed in the seconds field for this function in the flat profile.
children
This is the total amount of time spent in the subroutine calls made by this function. This should be equal to the sum of all the self and children entries of the children listed directly below this function.
called
This is the number of times the function was called.If the function called itself recursively, there are two numbers, separated by a `+’. The first number counts non-recursive calls, and the second counts recursive calls.In the example above, the function report was called once from main.
name
This is the name of the current function. The index number is repeated after it.If the function is part of a cycle of recursion, the cycle number is printed between the function’s name and the index number (see section How Mutually Recursive Functions Are Described). For example, if function gnurr is part of cycle number one, and has index number twelve, its primary line would be end like this:

gnurr <cycle 1> [12]

Lines for a Function’s Callers

A function’s entry has a line for each function it was called by. These lines’ fields correspond to the fields of the primary line, but their meanings are different because of the difference in context.

For reference, we repeat two lines from the entry for the function report, the primary line and one caller-line preceding it, together with the heading line that shows the names of the fields:

index  % time    self  children called     name
...
                0.00    0.05       1/1           main [2]
[3]    100.0    0.00    0.05       1         report [3]

Here are the meanings of the fields in the caller-line for report called from main:

self
An estimate of the amount of time spent in report itself when it was called from main.
children
An estimate of the amount of time spent in subroutines of report when report was called from main.The sum of the self and children fields is an estimate of the amount of time spent within calls to report from main.
called
Two numbers: the number of times report was called from main, followed by the total number of nonrecursive calls to report from all its callers.
name and index number
The name of the caller of report to which this line applies, followed by the caller’s index number.Not all functions have entries in the call graph; some options to gprof request the omission of certain functions. When a caller has no entry of its own, it still has caller-lines in the entries of the functions it calls.If the caller is part of a recursion cycle, the cycle number is printed between the name and the index number.

If the identity of the callers of a function cannot be determined, a dummy caller-line is printed which has `<spontaneous>’ as the “caller’s name” and all other fields blank. This can happen for signal handlers.

Lines for a Function’s Subroutines

A function’s entry has a line for each of its subroutines–in other words, a line for each other function that it called. These lines’ fields correspond to the fields of the primary line, but their meanings are different because of the difference in context.

For reference, we repeat two lines from the entry for the function main, the primary line and a line for a subroutine, together with the heading line that shows the names of the fields:

index  % time    self  children called     name
...
[2]    100.0    0.00    0.05       1         main [2]
                0.00    0.05       1/1           report [3]

Here are the meanings of the fields in the subroutine-line for main calling report:

self
An estimate of the amount of time spent directly within report when report was called from main.
children
An estimate of the amount of time spent in subroutines of report when report was called from main.The sum of the self and children fields is an estimate of the total time spent in calls to report from main.
called
Two numbers, the number of calls to report from main followed by the total number of nonrecursive calls to report.
name
The name of the subroutine of main to which this line applies, followed by the subroutine’s index number.If the caller is part of a recursion cycle, the cycle number is printed between the name and the index number.

How Mutually Recursive Functions Are Described

The graph may be complicated by the presence of cycles of recursion in the call graph. A cycle exists if a function calls another function that (directly or indirectly) calls (or appears to call) the original function. For example: if a calls b, and b calls a, then a and b form a cycle.

Whenever there are call-paths both ways between a pair of functions, they belong to the same cycle. If a and b call each other and b and c call each other, all three make one cycle. Note that even if b only calls a if it was not called from agprof cannot determine this, so a and b are still considered a cycle.

The cycles are numbered with consecutive integers. When a function belongs to a cycle, each time the function name appears in the call graph it is followed by`<cycle number>’.

The reason cycles matter is that they make the time values in the call graph paradoxical. The “time spent in children” of a should include the time spent in its subroutineb and in b‘s subroutines–but one of b‘s subroutines is a! How much of a‘s time should be included in the children of a, when a is indirectly recursive?

The way gprof resolves this paradox is by creating a single entry for the cycle as a whole. The primary line of this entry describes the total time spent directly in the functions of the cycle. The “subroutines” of the cycle are the individual functions of the cycle, and all other functions that were called directly by them. The “callers” of the cycle are the functions, outside the cycle, that called functions in the cycle.

Here is an example portion of a call graph which shows a cycle containing functions a and b. The cycle was entered by a call to a from main; both a and b called c.

index  % time    self  children called     name
----------------------------------------
                 1.77        0    1/1        main [2]
[3]     91.71    1.77        0    1+5    <cycle 1 as a whole> [3]
                 1.02        0    3          b <cycle 1> [4]
                 0.75        0    2          a <cycle 1> [5]
----------------------------------------
                                  3          a <cycle 1> [5]
[4]     52.85    1.02        0    0      b <cycle 1> [4]
                                  2          a <cycle 1> [5]
                    0        0    3/6        c [6]
----------------------------------------
                 1.77        0    1/1        main [2]
                                  2          b <cycle 1> [4]
[5]     38.86    0.75        0    1      a <cycle 1> [5]
                                  3          b <cycle 1> [4]
                    0        0    3/6        c [6]
----------------------------------------

(The entire call graph for this program contains in addition an entry for main, which calls a, and an entry for c, with callers a and b.)

index  % time    self  children called     name
                                             <spontaneous>
[1]    100.00       0     1.93    0      start [1]
                 0.16     1.77    1/1        main [2]
----------------------------------------
                 0.16     1.77    1/1        start [1]
[2]    100.00    0.16     1.77    1      main [2]
                 1.77        0    1/1        a <cycle 1> [5]
----------------------------------------
                 1.77        0    1/1        main [2]
[3]     91.71    1.77        0    1+5    <cycle 1 as a whole> [3]
                 1.02        0    3          b <cycle 1> [4]
                 0.75        0    2          a <cycle 1> [5]
                    0        0    6/6        c [6]
----------------------------------------
                                  3          a <cycle 1> [5]
[4]     52.85    1.02        0    0      b <cycle 1> [4]
                                  2          a <cycle 1> [5]
                    0        0    3/6        c [6]
----------------------------------------
                 1.77        0    1/1        main [2]
                                  2          b <cycle 1> [4]
[5]     38.86    0.75        0    1      a <cycle 1> [5]
                                  3          b <cycle 1> [4]
                    0        0    3/6        c [6]
----------------------------------------
                    0        0    3/6        b <cycle 1> [4]
                    0        0    3/6        a <cycle 1> [5]
[6]      0.00       0        0    6      c [6]
----------------------------------------

The self field of the cycle’s primary line is the total time spent in all the functions of the cycle. It equals the sum of the self fields for the individual functions in the cycle, found in the entry in the subroutine lines for these functions.

The children fields of the cycle’s primary line and subroutine lines count only subroutines outside the cycle. Even though a calls b, the time spent in those calls tob is not counted in a‘s children time. Thus, we do not encounter the problem of what to do when the time in those calls to b includes indirect recursive calls back to a.

The children field of a caller-line in the cycle’s entry estimates the amount of time spent in the whole cycle, and its other subroutines, on the times when that caller called a function in the cycle.

The calls field in the primary line for the cycle has two numbers: first, the number of times functions in the cycle were called by functions outside the cycle; second, the number of times they were called by functions in the cycle (including times when a function in the cycle calls itself). This is a generalization of the usual split into nonrecursive and recursive calls.

The calls field of a subroutine-line for a cycle member in the cycle’s entry says how many time that function was called from functions in the cycle. The total of all these is the second number in the primary line’s calls field.

In the individual entry for a function in a cycle, the other functions in the same cycle can appear as subroutines and as callers. These lines show how many times each function in the cycle called or was called from each other function in the cycle. The self and children fields in these lines are blank because of the difficulty of defining meanings for them when recursion is going on.

Implementation of Profiling

Profiling works by changing how every function in your program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. This change is made by the compiler when your program is compiled with the `-pg’ option.

Profiling also involves watching your program as it runs, and keeping a histogram of where the program counter happens to be every now and then. Typically the program counter is looked at around 100 times per second of run time, but the exact frequency may vary from system to system.

A special startup routine allocates memory for the histogram and sets up a clock signal handler to make entries in it. Use of this special startup routine is one of the effects of using `gcc … -pg’ to link. The startup file also includes an `exit’ function which is responsible for writing the file `gmon.out'.

Number-of-calls information for library routines is collected by using a special version of the C library. The programs in it are the same as in the usual C library, but they were compiled with `-pg’. If you link your program with `gcc … -pg’, it automatically uses the profiling version of the library.

The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time. Therefore, the time measurements in gprof output say nothing about time that your program was not running. For example, a part of the program that creates so much data that it cannot all fit in physical memory at once may run very slowly due to thrashing, but gprof will say it uses little time. On the other hand, sampling by run time has the advantage that the amount of load due to other users won’t directly affect the output you get.

Statistical Inaccuracy of gprof Output

The run-time figures that gprof gives you are based on a sampling process, so they are subject to statistical inaccuracy. If a function runs only a small amount of time, so that on the average the sampling process ought to catch that function in the act only once, there is a pretty good chance it will actually find that function zero times, or twice.

By contrast, the number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic.

The sampling period that is printed at the beginning of the flat profile says how often samples are taken. The rule of thumb is that a run-time figure is accurate if it is considerably bigger than the sampling period.

The actual amount of error is usually more than one sampling period. In fact, if a value is n times the sampling period, the expected error in it is the square-root of nsampling periods. If the sampling period is 0.01 seconds and foo‘s run-time is 1 second, the expected error in foo‘s run-time is 0.1 seconds. It is likely to vary this much on the average from one profiling run to the next. (Sometimes it will vary more.)

This does not mean that a small run-time figure is devoid of information. If the program’s total run-time is large, a small run-time for one function does tell you that that function used an insignificant fraction of the whole program’s time. Usually this means it is not worth optimizing.

One way to get more accuracy is to give your program more (but similar) input data so it will take longer. Another way is to combine the data from several runs, using the `-s’ option of gprof. Here is how:

  1. Run your program once.
  2. Issue the command `mv gmon.out gmon.sum’.
  3. Run your program again, the same as before.
  4. Merge the new data in `gmon.out' into `gmon.sum' with this command:
    gprof -s executable-file gmon.out gmon.sum
  5. Repeat the last two steps as often as you wish.
  6. Analyze the cumulative data using this command:
    gprof executable-file gmon.sum > output-file

Estimating children Times Uses an Assumption

Some of the figures in the call graph are estimates–for example, the children time values and all the the time figures in caller and subroutine lines.

There is no direct information about these measurements in the profile data itself. Instead, gprof estimates them by making an assumption about your program that might or might not be true.

The assumption made is that the average time spent in each call to any function foo is not correlated with who called foo. If foo used 5 seconds in all, and 2/5 of the calls to foo came from a, then foo contributes 2 seconds to a‘s children time, by assumption.

This assumption is usually true enough, but for some programs it is far from true. Suppose that foo returns very quickly when its argument is zero; suppose that aalways passes zero as an argument, while other callers of foo pass other arguments. In this program, all the time spent in foo is in the calls from callers other than a. But gprof has no way of knowing this; it will blindly and incorrectly charge 2 seconds of time in foo to the children of a.

We hope some day to put more complete data into `gmon.out', so that this assumption is no longer needed, if we can figure out how. For the nonce, the estimated figures are usually more useful than misleading.

Incompatibilities with Unix gprof

GNU gprof and Berkeley Unix gprof use the same data file `gmon.out', and provide essentially the same information. But there are a few differences.

  • For a recursive function, Unix gprof lists the function as a parent and as a child, with a calls field that lists the number of recursive calls. GNU gprofomits these lines and puts the number of recursive calls in the primary line.
  • When a function is suppressed from the call graph with `-e’, GNU gprof still lists it as a subroutine of functions that call it.
  • The blurbs, field widths, and output formats are different. GNU gprof prints blurbs after the tables, so that you can see the tables without skipping the blurbs.

Reading some ELF [1] and linux memory managing papers [2] I noticed the use of address 0×08048000 for the start of linear address but no one told why this address was chosen. Until now I didn’t find an reasonably explanation, below are some links about what I was reading and commenting about this misteriousssss number:

http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf

http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory

http://stackoverflow.com/questions/7187981/whats-the-memory-before-0×08048000-used-for-in-32-bit-machine

http://forum.osdev.org/viewtopic.php?f=13&t=24474

Abaixo segue um pequeno tutorial sobre ferramentas de programação (“assembler”) em linux. Achei o texto simples, direto e helpful então vale a pena compartilhar.

JT.

========================================================================
LINUX ASSEMBLER TUTORIAL

by

Robin Miyagi

@

http://www.geocities.com/SiliconValley/Ridge/2544/

========================================================================

start@: Thu Feb 03 02:14:37 UTC 2000

update: Fri Jul 30 23:52:23 UTC 2000

update: Fri Sep 15 22:39:17 UTC 2000 :

– This tutorial now explains Linux assembler in terms of the GNU
assembler `as’.

– Information about Binutils programs such as Objdump, and ld.
Discussion on Debugging and `gdb’ is added.

update: Thu Jan 11 20:13:06 UTC 2001 :

========================================================================

* Introduction
————————————————————————

When programming in assembler for Linux (or any other Unix variant
for that matter), it is important to remember that Linux is a
protected mode operating system (on i386 machines, Linux operates
the CPU in protected mode). This means that ordinary user mode
processes are not allowed to do certain things, such as access DMA,
or access IO ports. Writing Linux kernel modules on the other hand
(which operate in kernel mode), are allowed to access hardware
directly (Read the Assembler-HOWTO on my assembler page for more
information on this issue). User mode processes may access hardware
using device files. Device files actually access kernel modules
which access hardware directly. This file will be restricted to
user mode operation. See my pages on kernel module programming.

Please email me comments and suggestions regarding this tutorial at
penguin@dccnet.com .

* System Calls
————————————————————————

In programming in assembler for DOS you probably made use of
software interrupts, especially the int 0×21 functions which were
the DOS system calls. In Linux, system calls are made via int 0×80.
The sytem call number is passed via register EAX, and the parameters
to the system call are passed via the remaining registers. This
discussion only applies if there are no more than five parameters
passed to the system call. If there are more than 5 parameters.
The parameters must be located in memory (e.g. on the stack), and
EBX must contain the address of the beginning of the parameters.

If you would like a list of the system call numbers, look at the
contents of /usr/include/asm/unistd.h. If you would like
information about a specific system call (e.g. write ()), type `man
2 write’ at the prompt. Section 2 of the linux man pages covers
sytem calls.

If you look at the contents of /usr/include/asm/unistd.h, you will
see the following line near the top of the file;

#define __NR_write 4

This indicates that register EAX must be set to 4 in order to call
the write () system call. Now, if you execute the following
command;

$ man 2 write

you get the following function description (under the SYNOPSIS
heading).

ssize_t write(int fd, const void *buf, size_t count);

This indicates that ebx is equal to the file descriptor of the file
you want to write to, ecx is a pointer of the string you want to
write, and edx contains the length of the string. If there were 2
more parameters to this system call, they would be placed in esi,
and edi respectively.

How do I know the file discriptor for stdout is 1. If you look at
your /dev directory, you will notice that /dev/stdout is a symbolic
link that points to /proc/self/fd/1. Therefore stdout is file
descriptor 1.

I leave looking up the _exit system call as an exercise.

In linux, system calls are processed by the kernel.

* GNU Assembler
————————————————————————

On most Linux systems, you will usually find the GNU C compiler
(gcc). This compiler uses an assembler called `as’ as a back-end.
This means that the C compiler translates the C code into assembler,
which in turn is assembled by `as’ to an object file (*.o).

`As’ uses the AT&T syntax. Experienced intel syntax assembler
programmers find AT&T `really weird’. It is really no more or no
less difficult than intel syntax. I switched over to `as’ because
there is less ambiguity, works better with the standard GNU/Linux
programs such as gdb (supports the gstabs format), objdump (objdump
dissassembles code in `as’ syntax). In short, it is a standard
component of a GNU Linux system with programming tools installed. I
will explain debugging and objdump later in this tutorial.

If you would like more information about `as’ look in the info
documentation under as (e.g. type `info as’ at the shell prompt).
Also look in the info documentation on the Binutils package (this
package contains such programming tools as objdump, ld, etc.).

** GNU assembler v.s. Intel Syntax
————————————————————————

Since most assembler documentation for the i386 platform is written
using intel syntax, some comparison between the 2 formats is in
order. Here is a summarized list of the differences;

– In `as’ the source comes before the the destination, opposite to
the intel syntax.

– The opcodes are suffixed with a letter indicating the size of
the opperands (e.g. `l’ for dword, `w’ for word, `b’ for byte).

– Immediate values must be prefixed with a `$’, and registers must
be prefixed with a `%’.

– Effective addresses use the General syntax
DISP(BASE,INDEX,SCALE). A concrete example would be;

movl mem_location(%ebx,%ecx,4), %eax

Which is equivelent to the following in intel syntax;

mov eax, [eax + ecx*4 + mem_location]

Now for an example illustrating the difference (intel version in
comments);

movl %eax, %ebx # mov %ebx, %eax
movw $0x3c4a, %ax

Now for our little program;
————————————————————————
## hello-world.s

## by Robin Miyagi
## http://www.geocities.com/SiliconValley/Ridge/2544/

## Compile Instructions:
## ————————————————————-
## as -o hello-world.o hello-world.s
## ld -o hello-world -O0 hello-world.o

## This file is a basic demonstration of the GNU assembler,
## `as’.

## This program displays a friendly string on the screen using
## the write () system call
########################################################################
.section .data
hello:
.ascii “Hello, world!\n”
hello_len:
.long . – hello
########################################################################
.section .text
.globl _start

_start:
## display string using write () system call
xorl %ebx, %ebx # %ebx = 0
movl $4, %eax # write () system call
xorl %ebx, %ebx # %ebx = 0
incl %ebx # %ebx = 1, fd = stdout
leal hello, %ecx # %ecx —> hello
movl hello_len, %edx # %edx = count
int $0×80 # execute write () system call

## terminate program via _exit () system call
xorl %eax, %eax # %eax = 0
incl %eax # %eax = 1 system call _exit ()
xorl %ebx, %ebx # %ebx = 0 normal program return code
int $0×80 # execute system call _exit ()

————————————————————————

In the above program, notice the use of `#’ to start comments. `As’
also supports the `/* C comment *’ syntax. If you use the C comment
syntax, it works exactly the same as for C (multiple lines, as well
as inline commenting). I always use the `#’ comment syntax, as this
works better with emacs’ asm-mode. The double `##’ is allowed but
not neccessary (this is only because of a quirk of emacs asm-mode).

Notice the names of the sections .text, and .data. these are used
in ELF files to tell the linker where the code and data segments
are. There is also the .bss section to store uninitialized data.
It is only these sections that occupy memory durring program
execution.

* Accessing Command Line Arguments and Environment Variables

When an ELF executable starts running, the command line arguments
and environment variables are available on the stack. In assembler
this means that you may access these via the pointer stored in ESP
when the program starts execution. See the documentation on my
assembler programming page relating to the ELF binary format.

So how is this data arranged on the stack? Quite simple really.
The number of command line arguments (including the name of the
program) are stored as an integer at [esp]. Then, at [esp+4] a
pointer to the first command line argument (which is the name of the
program) is stored. If there were any additional command line
parameters, their pointers would be stored in [esp+8], [esp+12],
etc. After all the command line argument pointers, comes a NULL
pointer. After the NULL pointer are all the pointers to the
environment variables, and then finally a NULL pointer to indicate
the end of the environment variables have been reached.

A summary of the initial ELF stack is shown below;

(%esp) argc, count of arguments (integer)
4(%esp) char *argv (pointer to first command line argument)
… pointers to the rest of the command line arguments
?(%esp) NULL pointer
… pointers to environment variables
??(%esp) NULL pointer

Now for our little program;
————————————————————————
## stack-param.s ###############################################

## Robin Miyagi ################################################
## http://www.geocities.com/SiliconValley/Ridge/2544/ ##########

## This file shows how one can access command line parameters
## via the stack at process start up. This behavior is defined
## in the ELF specification.

## Compile Instructions:
## ————————————————————-
## as -o stack-param.o stack-param.s
## ld -O0 -o stack-param stack-param.o
########################################################################
.section .data

new_line_char:
.byte 0x0a
########################################################################
.section .text

.globl _start

.align 4
_start:
movl %esp, %ebp # store %esp in %ebp
again:
addl $4, %esp # %esp —> next parameter on stack
movl (%esp), %eax # move next parameter into %eax
testl %eax, %eax # %eax (parameter) == NULL pointer?
jz end_again # get out of loop if yes
call putstring # output parameter to stdout.
jmp again # repeat loop
end_again:
xorl %eax, %eax # %eax = 0
incl %eax # %eax = 1, system call _exit ()
xorl %ebx, %ebx # %ebx = 0, normal program exit.
int $0×80 # execute _exit () system call

## prints string to stdout
putstring: .type @function
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ecx
xorl %edx, %edx
count_chars:
movb (%ecx,%edx,$1), %al
testb %al, %al
jz done_count_chars
incl %edx
jmp count_chars
done_count_chars:
movl $4, %eax
xorl %ebx, %ebx
incl %ebx
int $0×80
movl $4, %eax
leal new_line_char, %ecx
xorl %edx, %edx
incl %edx
int $0×80
movl %ebp, %esp
popl %ebp
ret

————————————————————————

* The Binutils Package
————————————————————————

Binutils stands for binary utilities, and includes a lot of tools
useful to programmers, especially durring debugging.

I will now address some of these utilities.

** Objdump
————————————————————————

Objdump diplays information about 1 or more object files. For
example, to see information about param-stack, type the following
command at shell prompt (be sure working directory contains
param-stack);

objdump -x param-stack | less

Since the information is likely to span more than one screen, the
output of objdump is piped to the standard input of the paging
command `less’. the option `-x’ tells objdump to display the
numeric information in hexadecimal. Here is the output of the above
command;

—————————————————————-
stack-param: file format elf32-i386
stack-param
architecture: i386, flags 0×00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0×08048074

Program Header:
LOAD off 0×00000000 vaddr 0×08048000 paddr 0×08048000 align 2**12
filesz 0x000000be memsz 0x000000be flags r-x
LOAD off 0x000000c0 vaddr 0x080490c0 paddr 0x080490c0 align 2**12
filesz 0×00000001 memsz 0×00000004 flags rw-

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000004a 08048074 08048074 00000074 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00000001 080490c0 080490c0 000000c0 2**2
CONTENTS, ALLOC, LOAD, DATA
2 .bss 00000000 080490c4 080490c4 000000c4 2**2
ALLOC
SYMBOL TABLE:
08048074 l d .text 00000000
080490c0 l d .data 00000000
080490c4 l d .bss 00000000
00000000 l d *ABS* 00000000
00000000 l d *ABS* 00000000
00000000 l d *ABS* 00000000
080490c0 l .data 00000000 new_line_char
08048076 l .text 00000000 again
08048087 l .text 00000000 end_again
0804808e l .text 00000000 putstring
08048096 l .text 00000000 count_chars
080480a0 l .text 00000000 done_count_chars
00000000 F *UND* 00000000
080480be g O *ABS* 00000000 _etext
08048074 g .text 00000000 _start
080490c1 g O *ABS* 00000000 __bss_start
080490c1 g O *ABS* 00000000 _edata
080490c4 g O *ABS* 00000000 _end

—————————————————————-

Notice the Information provided from the program header (ELF files
have header information at the beginning of the file giving
information to the kernel on how to load the file into memory etc.).

ELF files also contain information about the sections (contained in
section tables). Notice that the .text section contains 0x4a bytes
of information, is located 0×74 bytes into the file, and is aligned
at a 4 byte boundary (4 == 2 ** 2), has memory allocated to it
(ALLOC), is readoly, and contains code (the segment selector cs for
this process points to this section (handled by the operating
system)).

Information about the symbols is also provided. All this
information is used by debuggers and other programming tools to
examine binary files.

Objdump can also be used to dissasemble binary executables. Typeing
the following command will dissassemble the file to standard output
(this does nothing to the actual file, as objdump only reads from
the file);

objdump -d stack-param | less

Here is the output of the above command;

—————————————————————-
stack-param: file format elf32-i386

Disassembly of section .text:

08048074 :
8048074: 89 e5 movl %esp,%ebp

08048076 :
8048076: 83 c4 04 addl $0×4,%esp
8048079: 8b 04 24 movl (%esp,1),%eax
804807c: 85 c0 testl %eax,%eax
804807e: 74 07 je 8048087
8048080: e8 09 00 00 00 call 804808e
8048085: eb ef jmp 8048076

08048087 :
8048087: 31 c0 xorl %eax,%eax
8048089: 40 incl %eax
804808a: 31 db xorl %ebx,%ebx
804808c: cd 80 int $0×80

0804808e :
804808e: 55 pushl %ebp
804808f: 89 e5 movl %esp,%ebp
8048091: 8b 4d 08 movl 0×8(%ebp),%ecx
8048094: 31 d2 xorl %edx,%edx

08048096 :
8048096: 8a 04 11 movb (%ecx,%edx,1),%al
8048099: 84 c0 testb %al,%al
804809b: 74 03 je 80480a0
804809d: 42 incl %edx
804809e: eb f6 jmp 8048096

080480a0 :
80480a0: b8 04 00 00 00 movl $0×4,%eax
80480a5: 31 db xorl %ebx,%ebx
80480a7: 43 incl %ebx
80480a8: cd 80 int $0×80
80480aa: b8 04 00 00 00 movl $0×4,%eax
80480af: 8d 0d c0 90 04 08 leal 0x80490c0,%ecx
80480b5: 31 d2 xorl %edx,%edx
80480b7: 42 incl %edx
80480b8: cd 80 int $0×80
80480ba: 89 ec movl %ebp,%esp
80480bc: 5d popl %ebp
80480bd: c3 ret
—————————————————————-

The `-d’ tells objdump to disassemble sections that are expected to
contain code (usually the .text section). Using the `-D’ option
will disassemble all sections. Objdump was able to give the names
of labels in the code because of the information contained in the
symbols table.

The first column displays the virtual memory address for each line
of code. The second column displays the machine code corresponding
to its respective assembler line of code, and finally the code in
assembler is contained in the 3rd column.

For more information look in the info documentation system.

** Getting the amount of memory used with size
————————————————————————

If you do an `ls -l stack-param’ you get the following

-rwxrwxr-x 1 robin robin 932 Sep 15 18:21 stack-param

This tells you that the file is 932 bytes long. However this file
also contains header tables, section tables, symbol tables etc. The
amount of memory that this program will use durring run time will be
less than this. To find out actual memory use, type the following;

size stack-param

The above will result in the following output;

text data bss dec hex filename
74 1 0 75 4b stack-param

This tells you that .text occupies 74 bytes, and .data occupies one
byte, for a total of 75 bytes memory use.

** Getting rid of symbol information with strip
————————————————————————

The strip command can be used to get rid of the symbol information.
With no options, this command only strips symbols that are not used
for debugging. With the `–stip-all’ option provided, it will strip
all symbol information, including those used for debugging. I
recommend not doing this, as this makes the files harder to analyse
with the standard programming tools. This command is used only if
file size is of paramount importance.

* debugging and gdb
————————————————————————

Perhaps the most difficult aspect of programming is debugging.
Quite often the error that caused the program to terminate
abnormally is not at the line where the program terminated (the
example later on will show this).

Program that exits with SIG_SEGV
————————————————————————
## stack-param-error.s #########################################

## Robin Miyagi ################################################
## http://www.geocities.com/SiliconValley/Ridge/2544/ ##########

## This file shows how one can access command line parameters
## via the stack at process start up. This behavior is defined
## in the ELF specification.

## Compile Instructions:
## ————————————————————-
## as –gstabs -o stack-param-error.o stack-param-error.s
## ld -O0 -o stack-param-error stack-param-error.o
########################################################################
.section .data

new_line_char:
.byte 0x0a
########################################################################
.section .text

.globl _start

.align 4
_start:
movl %esp, %ebp # store %esp in %ebp
again:
addl $4, %esp # %esp —> next parameter on stack
leal (%esp), %eax # move next parameter into %eax
testl %eax, %eax # %eax (parameter) == NULL pointer?
jz end_again # get out of loop if yes
call putstring # output parameter to stdout.
jmp again # repeat loop
end_again:
xorl %eax, %eax # %eax = 0
incl %eax # %eax = 1, system call _exit ()
xorl %ebx, %ebx # %ebx = 0, normal program exit.
int $0×80 # execute _exit () system call

## prints string to stdout
putstring: .type @function
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ecx
xorl %edx, %edx
count_chars:
movb (%ecx,%edx,$1), %al
testb %al, %al
jz done_count_chars
incl %edx
jmp count_chars
done_count_chars:
movl $4, %eax
xorl %ebx, %ebx
incl %ebx
int $0×80
movl $4, %eax
leal new_line_char, %ecx
xorl %edx, %edx
incl %edx
int $0×80
movl %ebp, %esp
popl %ebp
ret
————————————————————————

Notice that the above program is assembled with the `–gstabs’
option of `as’. This make as put debugging information in output
file, such as the original source file, debugging symbols etc.
Using `objdump -x stack-param-error | less’ will show you the
inclusion of debugging symbols.

Now to find out where our error occurred type the following command;

gdb stack-param-error

this will get you to the gdb prompt `(gdb)’;

(gdb) run eat my shorts
/home/robin/programming/asm-tut/stack-param-error
eat
my
shorts
Program recieved SIGSEGV, segmentation fault
count_chars () at stack-param-error.s:47

47 movb (%ecx,%edx,$1), %al
Current language: auto; currently asm
(gdb) q
[~]$ _

(gdb will output more than this, I just wanted to highlight what
is important).

This tells us that the segmentation fault occured at line 47 of
param-stack-error.s. However the problem was caused in line 29. If
you look at line 29 of stack-param.s, you will see that this line
reads `movl (%esp), %eax’. This is due to the way intel i386 opcode
lea handles NULL pointers. EAX was never loaded with 0 on a null
pointer (just some invalid pointer), which caused line 47 to access
an area of memory not available to this process (hence the
segmentation fault). The loop in _start () never stopped normally,
as the condition for breaking out of the loop is eax being 0, which
never happened.

Debugging is an art that comes with practice. For more information
about gdb, look in the info pages (e.g. `info gdb’). You can also
type `help’ at the (gdb) prompt.

The only reason gdb was able to tell you what line number in the
source code the error occured is that the debugging symbols and
source code was included in the output file (recall that we used the
`–gstabs’ option).

——————————————————————–
Comments and suggestions

========================================================================

You are free to make verbatim copies of this file, providing that this
notice is preserved.

Git Tutorial

Olá pessoal!

Digamos que você está escrevendo sua monografia/dissertação/tese/artigo, etc.. claro que você não quer correr o risco de perde-la não é mesmo? O que você faz? Backup! Mas aposto que você já teve também a vontade de restaurar o seu trabalho para uma estrutura/conteúdo que você tinha há algum tempo atrás ou quem sabe recuperar aquele texto que você sobrescreveu… Que tal resolver os dois problemas de uma vez só?

Recentemente tive a idéia de utilizar um sistema para controle de versão para fazer esse trabalho dos meus “documentos”, em geral códigos e textos, não demorou muito para eu descobrir que alguns colegas do lab já usavam isso há algum tempo… O.o Eu já conhecia o Svn, más comecei a utilizar o Git porquê já existe uma infraestrutura aqui suportando o Git, a primeira vista me parece bem seguro (é distribuído) e prático, além do mais diversos projetos open source o utilizam. Vale a pena dar uma olhada. Segue abaixo um tuto que gostei.

Abs.
John.

 

Git is a distributed version control system (dvcs) written in C. A version control system allows the creation of a history for a collection of files and includes the functionality to revert the collection of files to another state. Another state might be a different collection of files or different content in the files.

You may, for example, change the collection of files to a state from 2 days ago or you may switch between states for experimental features and production issues.

The collection of files is usually called “source code”. In a distributed version control system everyone has a complete copy of the source code (including the complete history of the source code) and can perform version control operations against this local copy. The use of a dvcs does not require a central code repository.

If you make changes to the source code you mark them as relevant for the version control (add them to the index / staging) and then add them to the repository (commit).

Git maintains all versions. Therefore you can revert to any point in your source code history using Git.

Git performs commits to your local repository and you can synchronize your repository with other (remote) repositories. Git allows you to clone repositories, e.g. create an exact copy of a repository including the complete history of the source code. Owners of repositories can synchronize changes via push (transferring changes to a remote repository) or pull (getting changes from a remote repository).

Git supports branching, e.g. you can have different versions of your source code. If you want to develop a new feature, you may open a branch in your source code and make the changes in this branch without affecting the main line of your code.

Continue reading at Vogella.de

Programmers with little or no exposure to parallelism have an opportunity to learn about multicore programming at the UPCRC Illinois Summer School to be held July 25-29, 2011 at the University of Illinois at Urbana-Champaign.

The week-long, intensive workshop will provide a solid foundation in the fundamentals of multicore programming, offer hands-on experience with the use of multicore languages and libraries, and introduce emerging research topics. Upon completion, participants will be equipped to choose the best multicore programming model for current and future projects.

Prerequisites for the summer school include solid programming experience (C, C++, C# or Java languages) and a demonstrated interest in applying multicore programming to academic or professional pursuits.

For more information about the summer school, visit the website at http://www.upcrc.illinois.edu/summer/2011/index.html.

 

Cheri Helregel, Outreach Coordinator
UPCRC Illinois
University of Illinois at Urbana-Champaign
Ph: (217) 244-6097
url | www.upcrc.illinois.edu

 

Blog no WordPress.com. | Tema: Motion até volcanic.
Seguir

Obtenha todo post novo entregue na sua caixa de entrada.