John Tortugo

Life, programming, math and much else ;-)

.PLT and .GOT – the key to code sharing and dynamic libraries

This text was found here: http://www.technovelty.org/linux/pltgot.html

The shared library is an integral part of a modern system, but often the mechanisms behind the implementation are less well understood. There are, of course, many guides to this sort of thing. Hopefully this adds another perspective that resonates with someone.

Let’s start at the beginning — relocations are entries in binaries that are left to be filled in later — at link time by the toolchain linker or at runtime by the dynamic linker. A relocation in a binary is a descriptor which essentially says “determine the value of X, and put that value into the binary at offset Y” — each relocation has a specific type, defined in the ABI documentation, which describes exactly how “determine the value of” is actually determined.

Here’s the simplest example:

$ cat a.c
extern int foo;

int function(void) {
    return foo;
}

$ gcc -c a.c 
$ readelf --relocs ./a.o  

Relocation section '.rel.text' at offset 0x2dc contains 1 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000004  00000801 R_386_32          00000000   foo
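
As a hint of where the article goes next: compiling the same file as position-independent code (the way shared libraries are built) typically turns this absolute relocation into a GOT-relative one. On the same 32-bit setup:

$ gcc -fPIC -c a.c
$ readelf --relocs ./a.o

this usually reports a relocation of type R_386_GOT32 (or R_386_GOT32X with newer binutils) against foo, meaning the access now goes through the Global Offset Table.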

Read the rest of this entry »

Understanding Linux ELF RTLD internals

This text was found here: http://s.eresi-project.org/inc/articles/elf-rtld.txt

/*
Last update Sun Dec 22 06:55:39 2002 mayhem

- Version 0.1 May 2001
- Version 0.2 .::. 2002 (WIP) : 
  - Added stuff about rtld relocation .
  - Added stuff about rtld symbol resolution .
  - Various fixes and some links added .

This draft remained unreleased for one year. Most of it is based on the
glibc-2.2.3 implementation; information about the subject was disclosed
on bugtraq and phrack in early 2002:

http://online.securityfocus.com/archive/1/274283/2002-05-29/2002-06-04/2
http://www.phrack.org/phrack/59/p59-0x08.txt

However, it still contains some kewl info; I'll try to keep it updated,
hope this will help. I am also adding/clearing/correcting stuff (and
giving credits) on demand, so don't hesitate to send comments, etc.
 
/mM [mayhem at devhell dot org]
*/

		Understanding Linux ELF RTLD internals
		~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most of this paper has been written from a security perspective; your
comments are always welcome.

Actually there is a lot of ELF documentation at this time, but most of it
is virii coding or backdooring related. To be honest, I never found any
documentation on the dynamic linking sources, and that's why I wrote this
one. Sometimes it reads more like an internal ld.so reference, or a
commented review of the ELF dynamic linking implementation in ld-linux.so.

That is not useless, since dynamic linking is one of the worst documented
parts of the Linux operating system. I also decided to write a (tiny)
chapter on the ELF kernel handling code, because it is really necessary to
know some kernel level details (like the stack initialisation) to
understand the whole interpreting process.

You can find the latest glibc sources on the GNU FTP server:

ftp://ftp.gnu.org/pub/gnu/glibc/

If you don't know anything about ELF, you should read the reference first:

http://x86.ddj.com/ftp/manuals/tools/elf.pdf


Want to know more? Go on!


     0] Prologue
		    A) Kernel handling code
		    B) Introducing glibc macros
     1] Dynamic linker implementation
		    A) Sources graphics
		    B) The link_map structure explained
		    C) Relocating the interpreter
		    D) Runtime GOT relocation
		    E) Symbol resolution
     2] FAQ, thanks and references


TODO	:
		    X) Stack information gathering 
		    X) SHT_DYNAMIC information gathering
		    X) PHT interpreting 
		    X) Loading shared libraries 
		    X) Shared libraries relocation 


 Read the rest of this entry »

An Emulator Writer’s HOWTO for Static Binary Translation

This is a very interesting article that I found at: http://www.gtoal.com/sbt/. It is a practical article showing how to craft a simple static binary translator and emulator.

There is a lot of Computer Science literature on binary translation, both of the sexy dynamic variety and the slightly duller (from the CS point of view) static variety. However, most of what you’ll find when you research the subject is rather dry and somewhat academic. I don’t know of any accessible primers on static translation that someone from the emulation world can pick up and use in a practical project.
So… the aim of this HOWTO document is to give you a very practical method that you can adapt to your own area of expertise, which should pretty much guarantee that if you have already written or worked on a reasonable emulator, you will be able to write a translator which will take a specific program and generate a binary from it that runs on a different architecture, and does so faster than any emulator that you’re used to using.

And that’s why we’re doing this: you have a favorite old video game that you want to run on some system and the emulator for that system works but it just isn’t fast enough. Or perhaps you want to port an emulator which works OK on a 300MHz Pentium to a 16MHz ARM in your PDA. A straight port runs your favourite game at 2FPS and you don’t think the hardware is likely to get 25 times faster in the near future! Play your cards right and you may get that factor of 25 out of a static binary translator.

This document tries to explain things simply – perhaps too simply at times, and there are a lot of examples which differ from the previous example in small details. This makes the document rather long, but that’s deliberate; I don’t want to skip too many stages and risk having anyone lose track of the process. Apologies in advance to those of you who think I’m taking it too slowly or have made this document too long.

Read the rest of this entry »

A very kind introduction to GDB: customize it the way you want

You can find it here: http://haifux.org/lectures/211/index.html =D

Linkers and Loaders

This is an excellent (!!!) article describing in general terms how the process of linking (static and dynamic) and loading ELF programs on Linux works. It is a very valuable read.

The original is found here: http://www.linuxjournal.com/article/6463?page=0,0

Discussing how compilers, linkers and loaders work, and the benefits of shared libraries.
Linking is the process of combining various pieces of code and data together to form a single executable that can be loaded in memory. Linking can be done at compile time, at load time (by loaders) and also at run time (by application programs). The process of linking dates back to the late 1940s, when it was done manually. Now, we have linkers that support complex features, such as dynamically linked shared libraries. This article is a succinct discussion of all aspects of linking, ranging from relocation and symbol resolution to supporting position-independent shared libraries. To keep things simple and understandable, I target all my discussions to ELF (executable and linking format) executables on the x86 architecture (Linux) and use the GNU compiler (GCC) and linker (ld). However, the basic concepts of linking remain the same, regardless of the operating system, processor architecture or object file format being used.

How debugger works

This text was found here http://www.alexonlinux.com/how-debugger-works

Introduction

In this article, I’d like to tell you how a real debugger works: what happens under the hood and why it happens. We’ll even write our own small debugger and see it in action.

I will talk about Linux, although the same principles apply to other operating systems. Also, we’ll talk about the x86 architecture, because it is the most common architecture today. Even if you’re working with another architecture, you will find this article useful because, again, the same principles work everywhere.

Kernel support

Actual debugging requires operating system kernel support, and here’s why. Think about it: we’re living in a world where one process reading memory belonging to another process is a serious security vulnerability. Yet, when debugging a program, we would like to access memory that is part of the debugged process’s (debuggie’s) memory space from the debugger process. It is a bit of a problem, isn’t it? We could, of course, try to use the same memory space for both debugger and debuggie, but then what if the debuggie itself creates processes? This really complicates things.

Debugger support has to be part of the operating system kernel. The kernel is able to read and write memory that belongs to each and every process in the system. Furthermore, as long as a process is not running, the kernel can see the values of its registers, and the debugger has to be able to learn the values of the debuggie’s registers. Otherwise it won’t be able to tell you where the debuggie has stopped (when we press CTRL-C in gdb, for instance).

While talking about where debugger support starts, we have already mentioned several of the features that we need in order to have debugging support in the operating system. We don’t want just any process to be able to debug other processes; someone has to monitor debuggers and debuggies. Hence the debugger has to tell the kernel that it is going to debug a certain process, and the kernel has to either permit or deny this request. Therefore, we need the ability to tell the kernel that a certain process is a debugger and is about to debug another process. We also need the ability to query and set values in the debuggie’s memory space, and the ability to query and set the values of the debuggie’s registers when it stops.

And the operating system lets us do all this. Each operating system does it in its own manner, of course. Linux provides a single system call named ptrace() (declared in sys/ptrace.h), which allows us to do all these operations and much more.

ptrace()

ptrace() accepts four arguments. The first is one of the values from enum __ptrace_request defined in sys/ptrace.h. This argument specifies what operation we would like to do, whether it is reading the debuggie’s registers or altering values in its memory. The second argument specifies the pid of the debuggie process. It’s not obvious, but a single process can debug several other processes, so we have to say exactly which process we are referring to. The last two arguments are optional arguments for the call.
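
For reference, the prototype of the glibc wrapper looks like this:

#include <sys/ptrace.h>

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);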

Starting to debug

One of the first things a debugger does to start debugging a certain process is attaching to it or running it. There is a ptrace() operation for each of these cases.

The first, called PTRACE_TRACEME, tells the kernel that the calling process wants its parent to debug it. I.e. me calling ptrace( PTRACE_TRACEME ) means I want my dad to debug me. This comes in handy when you want the debugger process to spawn the debuggie. In this case you do fork() to create a new process, then ptrace( PTRACE_TRACEME ), and then you call exec() or execve().

The second operation is called PTRACE_ATTACH. It tells the kernel that the calling process should become the debugging parent of the process being attached to. Debugging parent means both debugger and parent process.
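
Here is a minimal sketch of the spawn-and-trace pattern just described (error handling kept to a minimum; the program to debug and its arguments are taken from the command line):

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        /* Debuggie: ask the kernel to let our parent trace us... */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        /* ...then exec; the kernel will stop us with SIGTRAP here. */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
    }

    /* Debugger: wait until the child stops. This is the point where a real
       debugger would plant breakpoints before letting the debuggie run. */
    int status;
    waitpid(child, &status, 0);

    /* For a process that is already running we would instead call
       ptrace(PTRACE_ATTACH, pid, NULL, NULL) and then wait for it to stop. */

    ptrace(PTRACE_CONT, child, NULL, NULL); /* resume the debuggie */
    waitpid(child, &status, 0);             /* and wait for it to finish */
    return 0;
}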

Debugger-debuggie synchronization

Alright. Now we have told the operating system that we are going to debug a certain process, and the operating system has made it our child process. Good. This is a great time to have the debuggie stopped while we do our preparations before we actually start to debug. We may want, for instance, to analyze the executable we run and place breakpoints before we actually start debugging. So, how do we stop the debuggie and let the debugger do its thing?

The operating system does that for us using signals. Actually, the operating system notifies us, the debugger, about all kinds of events that occur in the debuggie, and it does all that with signals. This includes the “debuggie is ready to shoot” signal. In particular, if we attach to an existing process, it receives SIGSTOP and we receive SIGCHLD once it actually stops. If we spawn a new process and it called ptrace( PTRACE_TRACEME ), it will receive a SIGTRAP signal once it attempts to exec() or execve(). We will be notified with SIGCHLD about this, of course.

Read the rest of this entry »

Where 0x08048000 ELF address came from?

Reading some ELF [1] and Linux memory management papers [2] I noticed the use of address 0x08048000 as the start of the linear address space, but no one said why this address was chosen. So far I haven’t found a reasonable explanation; below are some links to what I was reading about this mysterious number:

http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf
http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory
http://stackoverflow.com/questions/7187981/whats-the-memory-before-0x08048000-used-for-in-32-bit-machine
http://forum.osdev.org/viewtopic.php?f=13&t=24474

Writing a compiler with SableCC

The original post of this text is http://www.brainycreatures.org/compiler/sablecc.asp

To create a compiler with SableCC we follow the following steps:

  1. We create a SableCC specification file containing the lexical definitions and the grammar for the language being designed.
  2. After creating the SableCC specification file, we generate the framework by launching SableCC on the specification file.
  3. At this stage, we generate the working classes. This is the only code we have to write in Java. It is in this step that we write the semantic analyser, the code generator, and possibly the code optimizer. In the case of an interpreter, we write a single class. These working classes may be subclasses of one of the classes from the analysis subfolder generated by SableCC in the previous step.
  4. It is during this step that we create the driver class of the compiler. It is used to activate the lexer, parser and working classes.
  5. Finally, in this step, we compile the compiler with a Java compiler.

To illustrate this, we are going to implement a subset of Pascal by following the steps outlined above. We are going to implement a subset instead of the full language, because we want to give the reader an idea of how SableCC works.

We are now going to start the implementation.

SableCC specification file

Before we start implementing the SableCC specification file for the subset of Pascal, we are going to describe the syntax of SableCC using BNF. I am assuming that you have read the introduction to regular expressions and context-free grammars.

The syntax of SableCC is as follows:

[package declaration] [helper declarations]
[states declarations] [token declarations]
[ignored tokens] [productions]

As we can see from the grammar, a SableCC specification file may be an empty file. Remember the meaning of anything between [ and ]? It means that anything between [ and ] can only be the empty string or a single occurrence of whatever is between the brackets.

Of the allowed sections in a SableCC specification file, we will not take the states declarations into account, because we will not use them. If the reader is interested in knowing how they work, he should refer to [1]. We will now describe the syntax of the other productions. The package declaration is used to name the destination root, that is, the directory where all the generated files will be placed. We declare the name of the package as follows:

Package uk.co.brainycreatures.smallpascal;

In this example, the root directory will be uk\co\brainycreatures\smallpascal, meaning that all the generated subfolders will be placed under this directory. A package name may be any sequence of identifiers separated by dots, each starting with a letter followed by zero or more letters or digits. The helpers section works like constants in Pascal. As its name implies, it is used as a helper of something, in this case of the tokens section (more about this next). To see how a helper is useful, let’s examine the following regular expression:

id = ['a' .. 'z'] (['a' .. 'z'] | ['0' .. '9'])*

In this example we could simplify this regular expression by defining digit = ['0' .. '9'] and letter = ['a' .. 'z']; then our regular expression id becomes

id = letter (letter | digit)*

Letter and digit are our helpers. Here’s how we declare helpers in SableCC:

Helpers
letter = ['a' .. 'z'];
digit = ['0' .. '9'];

After the helpers section follows the tokens section, which is where we define our terminals or tokens. Let’s use the example above to show how to declare tokens in SableCC:

Tokens
id = letter (letter | digit)*;

This is very similar to the regular expressions described in the Regular Expressions tutorial. A token defined by a string of characters is declared between quotes, e.g. program = 'program', and every declaration ends with a semicolon. Following the tokens section is the ignored tokens section, in other words, the section where we declare the tokens to be ignored by the parser: for example comments, blanks, carriage returns, etc.

Here’s an example:

Helpers
any_character = [0x0 .. 0xfffff];
nl = '\n';
Tokens
comment = '//' any_character nl;
blank = 10 | 10 13 | 9;
Ignored Tokens
comment,
blank;

In this case, comment and blank will be ignored by the parser.

Finally, in the productions section, we declare the grammar productions for the language. The productions are defined in BNF or EBNF (Extended Backus-Naur Form).

Here’s an example of how to declare productions:

Tokens
identifier = letter (letter | digit)*;
number = digit+;
plus = '+';
minus = '-';
Productions
expression = identifier plus identifier minus number ;

As we can see, it is similar to the context-free grammars presented in the grammars tutorial. The only difference is that in SableCC we don’t surround a nonterminal with < and >, and we replace the → with =. In the productions section, it is sometimes required to precede a token by T. and a production by P. This happens when we have a token and a production with the same name. For example, if we have a production program and a token program, then in the production below we must precede program by T., because otherwise SableCC will not know whether it is the token program or the production program.

Tokens
program = 'program';
semicolon = ';' ;
Productions
program = T.program identifier semicolon ;

Now that we are familiar with the syntax of SableCC, let’s implement our subset.

The root of our package will be uk.co.brainycreatures.smallpascal. Hence in the package declaration we will have

Package uk.co.brainycreatures.smallpascal;

After defining the package, we are going to declare our helpers, which are shown below:

Helpers
a = 'a' | 'A' ;
b = 'b' | 'B' ;
e = 'e' | 'E' ;
g = 'g' | 'G' ;
i = 'i' | 'I' ;
m = 'm' | 'M' ;
n = 'n' | 'N' ;
o = 'o' | 'O' ;
p = 'p' | 'P' ;
r = 'r' | 'R' ;
t = 't' | 'T' ;
v = 'v' | 'V' ;
w = 'w' | 'W' ;
cr = 13 ; // carriage return
lf = 10 ; // line feed
tab = 9 ; // tab char
ascii_char = [32 .. 127] ;
blank = ' ' ;
digit = ['0' .. '9'] ;
letter = [['a' .. 'z'] + ['A' .. 'Z']] ;
l_brace = '{' ;
r_brace = '}' ;
l_paren_star = '(*' ;
r_paren_star = '*)' ;

Note the space between the letters in the definition of the keyword tokens (for example, program = p r o g r a m ;). In this case we need the spaces because each letter is a helper. So, in defining these tokens, we have a sequence of helpers separated by spaces.

Finally, we will describe the grammar for the language. Here it is:

Productions
program = program_heading global_declaration_part begin main_statement_part end dot;
program_heading = {non_empty} T.program identifier semicolon | {empty} ;
global_declaration_part = var_declaration;
var_declaration = var var_decl+;
var_decl = identifier_list colon type;
identifier_list = {single} identifier | {sequence} identifier_list comma identifier;
type = integer;
main_statement_part = statement_sequence;
statement_sequence = {single} statement | {sequence} statement_sequence semicolon statement ;
statement = {assign} identifier assignop expression | {writeln} writeln_stmt ;
writeln_stmt = {simple}writeln | {arguments} writeln l_paren expression_list r_paren ;
expression_list = {single} expression | {sequence} expression_list comma expression ;
expression = {term} term | {plus} expression plus term | {minus} expression minus term ;
term = {factor} factor | {mult} term mult factor | {div} term div factor;
factor = {identifier} identifier | {integer} integer;
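
To make the grammar concrete, here is a small program it should accept (assuming the Tokens section, which is not reproduced in this excerpt, declares the keywords, the := assignment operator, and the punctuation and operator tokens used below):

program demo;
var
  a, b : integer;
begin
  a := 2 + 3;
  b := a * 4;
  writeln(a - b)
end.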

Generating the working classes

In this section we will implement the semantic analyser and the code generator for the language. We start by implementing the semantic analyser below.

The semantic analyser

The semantic analysis phase completes the static analysis (parsing) phase of the source program by ensuring that the grammatically correct statements passed to it by the parser ‘make sense’. For instance, the syntax of assignment statements in Pascal does not require that the identifiers have been declared, that they are variables, that they are of similar type, and that the operations can be performed on those types. All these restrictions are defined by the static semantics of Pascal, and the semantic analyser performs the corresponding checks. To perform this checking, the semantic analyser completely processes the declarations and creates equivalent information known as property information. This information is stored in a data structure known as the symbol table. The symbol table is used so that, given any identifier, the associated information may be found.

For this small subset, the only checking that needs to be performed is that the identifiers are declared before use and that they are declared only once. There is no need for type checking, because all the identifiers are of the same type. The semantic analysis is performed by a depth-first traversal of the abstract syntax tree, storing information about identifiers in the symbol table after having visited the subtrees of the production containing the variable declarations. This information is then fetched as we encounter identifiers in the assignment and factor productions.

Now that we have the required information, we will describe how to implement the semantic analyser. Returning to the grammar above, let’s explain the effect of the names between { and } in the alternatives of each production. The name of each alternative is prefixed by A and concatenated with the name of the production to produce a type name for the alternative. For example, for the production below

Factor =           {identifier} identifier | {integer} integer_literal | … ;

a type named AIdentifierFactor and a type named AIntegerFactor will be produced.

Given the information above, we design our semantic analyser by defining methods for the alternatives that include identifiers, that is, the alternatives ASingleIdentifierList, ASequenceIdentifierList, AAssignStatement and AIdentifierFactor. We also need to process the alternative AVarDecl. Note that the alternative for var_decl does not have a name. This is allowed in SableCC when the production is defined by one alternative only, and the resulting type is the name of the production prefixed by A.

Here is the implementation of our semantic analyser:

package uk.co.brainycreatures.smallpascal;

import uk.co.brainycreatures.smallpascal.node.*;
import uk.co.brainycreatures.smallpascal.analysis.*;
import java.util.*; // for the Hashtable

public class SemanticAnalyser extends DepthFirstAdapter {
    // stores the identifiers being defined
    Hashtable symbol_table = new Hashtable(); // create a new table

    // check if the identifier is already in the table and report an error
    // if it is
    public void outASingleIdentifierList(ASingleIdentifierList node) {
        // identifier to be stored in the symbol table
        TIdentifier ident = node.getIdentifier();

        // name of the identifier to be stored in the table
        String key = ident.getText().toUpperCase(); // is the identifier in the table?

        if (symbol_table.containsKey(key)) { // report an error
            System.out.println("Error: [" + ident.getLine() + "," + ident.getPos() + "] Identifier already defined.");
            System.exit(0);
        }
        else {
            symbol_table.put(key, key);
        }
    }

    // same as above
    public void outASequenceIdentifierList(ASequenceIdentifierList node) {
        // identifier to be stored in the symbol table
        TIdentifier ident = node.getIdentifier();

        // name of the identifier to be stored in the table
        String key = ident.getText().toUpperCase(); // is the identifier in the table?

        if (symbol_table.containsKey(key)) { // report an error
            System.out.println("Error: [" + ident.getLine() + "," + ident.getPos() + "] Identifier already defined.");
            System.exit(0);
        }
        else {
            symbol_table.put(key, key);
        }
    }

    // checks if the identifier in the assignment statement was previously
    // declared, and reports an error if it wasn't
    public void outAAssignStatement(AAssignStatement node) {
        TIdentifier ident = node.getIdentifier();
        String key = ident.getText().toUpperCase();

        // Is the identifier in the table?
        // if not, report an error
        if (!symbol_table.containsKey(key)) {
            System.out.println("Error: [" + ident.getLine() + "," + ident.getPos() + "] Unknown identifier.");
            System.exit(0);
        }
    }

    // same as above
    public void outAIdentifierFactor(AIdentifierFactor node) {
        TIdentifier ident = node.getIdentifier();
        String key = ident.getText().toUpperCase();

        // Is the identifier in the table?
        // if not, report an error
        if (!symbol_table.containsKey(key)) {
            System.out.println("Error: [" + ident.getLine() + "," + ident.getPos() + "] Unknown identifier.");
            System.exit(0);
        }
    }
}

We define the class SemanticAnalyser as extending class DepthFirstAdapter.

This automatically provides a depth-first traversal of the AST. To check the declarations of identifiers we just need the methods outASingleIdentifierList and outASequenceIdentifierList. To check that identifiers are declared before use we just need the methods outAAssignStatement and outAIdentifierFactor. For more about how SableCC generates these methods, refer to chapters 5 and 6 of [1]. We are now done with the semantic analyser. For the code generator we do the same thing, but we generate code as well as processing the identifiers in the symbol table.

The Class Generator

The ClassGenerator is also a subclass of class DepthFirstAdapter. To implement the code generator we need to do a depth-first traversal of the tree, just as we did with the semantic analyser. But for the code generator we will use alternatives from other productions as well. For example, we will use the program_heading to define the name of the class to be generated. That is, we will use the name of the identifier that follows the program keyword, if the program heading is present in the program. If the program heading is not present, the name of the class will be a.out, just like the executables generated by a C compiler on a Unix-based system. We will now start the description of the design of our code generator. For example, from the production

program_heading =
{non_empty} T.program identifier semicolon |
{empty} ;

we will define methods outANonEmptyProgramHeading and outAEmptyProgramHeading. To generate a class file using the Jas API we need to create an instance of class ClassEnv first. Then we set the class attributes: its name, its super class, the source from which it is being compiled and its access modifiers. This is all done in the methods outlined above. We also need to create a default constructor. Although we don’t realise it, the Java compiler generates a default constructor for us. This is how it looks in Java assembly:

.method public <init>()V
aload_0
invokevirtual java/lang/Object/<init>()V
return
.end method

To define this method, we can either do it in production program_heading, or even before production program traverses its subtrees, more precisely in the definition of its inAProgram method. It is up to the programmer’s preference. In our case, we are going to define the constructor in method inAProgram. To define a method we need to create an instance of class CodeAttr and then add the instructions with its addInsn method. We also need to create a new instance of class CodeAttr for the object main_code. Here is how we implement it:

public void inAProgram(AProgram node) {
    // create a new instance of CodeAttr for the constructor code
    init = new CodeAttr();
    main_code = new CodeAttr(); // create a new object for the main program body

    try {
        // define the method
        init.addInsn(new Insn(opc_aload_0));
        init.addInsn(new Insn(opc_invokevirtual, new MethodCP("java/lang/Object", "<init>", "()V")));
        init.addInsn(new Insn(opc_return));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

After defining the method, we must add it to the class. This will be done in either method outAEmptyProgramHeading or outANonEmptyProgramHeading. Here is its implementation:

public void outANonEmptyProgramHeading(ANonEmptyProgramHeading node) {
    // name of the class to be generated
    class_name = node.getIdentifier().getText();

    try {
        // set class attributes
        main_class.setClass(new ClassCP(class_name));
        main_class.setSuper(new ClassCP("java/lang/Object"));

        // source from which it is compiled
        main_class.setSource(new SourceAttr(source_file_name));

        // make class public
        main_class.setClassAccess((short) ACC_PUBLIC);

        // add the constructor method to the class
        main_class.addMethod((short) ACC_PUBLIC, "<init>", "()V", init, null);
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

The code is self explanatory, so we won’t go into more details.

The code for method outAEmptyProgramHeading is similar to the code above, except for the name of the class. It sets class_name to “a.out”. Proceeding with the grammar, we are now going to define the code that processes identifiers; that is, the code that defines the identifiers. Pascal variables will be defined as Java fields. To define fields in the Jas API, we use method addField of class ClassEnv. The name of the generated variables will be in lowercase. We don’t need a symbol table in order to fetch the signature of a field. Rather, we use the name of the class to which it belongs. Here is the implementation:

public void outASingleIdentifierList(ASingleIdentifierList node) {
    // get the name of the variable in lower case
    String var_name = node.getIdentifier().getText().toLowerCase();

    try {
        // add the field to the class
        main_class.addField(new Var((short) (ACC_STATIC | ACC_PUBLIC), new AsciiCP(var_name), new AsciiCP("I"), new ConstAttr(new IntegerCP(0))));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

The code above is equivalent to the following Java declaration:

public static int <var_name> = 0;

where <var_name> is the same as the variable being declared.

The code for method outASequenceIdentifierList is the same as the code for outASingleIdentifierList above, so there is no need to describe it again.

Let’s now describe how we implement the statements. Before going any further, let’s analyse how we translate Pascal statements into Java assembly. Because the Java Virtual Machine is stack based, we need to think in terms of stack operations only. So, to translate an assignment statement we push all the factors on the right hand side onto the stack and then pop the result into the variable on the left hand side. For example, the assignment statement

a := b + c

is translated to the following Java assembly code:

getstatic <class name>/b I
getstatic <class name>/c I
iadd
putstatic <class name>/a I

where <class name> is the name of the class to which each of the fields a, b and c belongs. In Java we use ‘.’, but in Java assembly we use ‘/’ to access either methods or fields. The I following the name of the variable means that the variable is of type int. The instruction getstatic is used to get the value of a static field of a class, and the instruction putstatic is used to store a value into a static field of a class. The other arithmetic instructions that will be used in the code generation of the compiler for this subset are:

isub – used to subtract integer numbers. This instruction expects to find two numbers on the top of the stack; it pops them, subtracts them and pushes the result onto the top of the stack.
imul – the same as isub, but it multiplies instead.
idiv – used to divide integers.
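
As a slightly larger worked example, the statement a := b - c * 2 would come out roughly as:

getstatic <class name>/b I
getstatic <class name>/c I
iconst_2
imul
isub
putstatic <class name>/a I

The operands are pushed as the traversal reaches each factor, and each operator’s opcode is emitted only after both of its subtrees have been visited.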

In order to generate the appropriate code, we need to traverse the subtrees of the corresponding operator, and after visiting them we generate the opcode corresponding to this same operator. We need to push numbers and identifiers as we encounter them in production factor, and then generate the opcode for the arithmetic instruction in the alternatives of productions expression and term. Here are the implementations:

public void caseAIdentifierFactor(AIdentifierFactor node) {
    String var_name = node.getIdentifier().getText().toLowerCase();
    
    try {
        // getstatic <class_name>/<var_name> I
        code.addInsn(new Insn(opc_getstatic, new FieldCP(class_name, var_name, "I")));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for integer factors:

public void caseAIntegerFactor(AIntegerFactor node) {
    // get the string value of the integer literal
    String num_image = node.getInteger().getText();

    int value = Integer.parseInt(num_image);

    try {
        switch (value) {
            case -1:
                code.addInsn(new Insn(opc_iconst_m1)); break;
            case 0:
                code.addInsn(new Insn(opc_iconst_0)); break;
            case 1:
                code.addInsn(new Insn(opc_iconst_1)); break;
            case 2:
                code.addInsn(new Insn(opc_iconst_2)); break;
            case 3:
                code.addInsn(new Insn(opc_iconst_3)); break;
            case 4:
                code.addInsn(new Insn(opc_iconst_4)); break;
            case 5:
                code.addInsn(new Insn(opc_iconst_5)); break;
            default:
                if (value >= -128 && value <= 127) {
                    code.addInsn(new Insn(opc_bipush, value));
                }
                else if (value >= -32768 && value <= 32767) { // sipush takes a signed 16-bit value
                    code.addInsn(new Insn(opc_sipush, value));
                }
                else {
                    code.addInsn(new Insn(opc_ldc, value));
                }
                break;
        }
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for addition operator:

public void outAPlusExpression(APlusExpression node) {
    try {
        code.addInsn(new Insn(opc_iadd));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for subtraction operator:

public void outAMinusExpression(AMinusExpression node) {
    try {
        code.addInsn(new Insn(opc_isub));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for multiplication operator:

public void outAMultTerm(AMultTerm node) {
    try {
        code.addInsn(new Insn(opc_imul));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for division operator:

public void outADivTerm(ADivTerm node) {
    try {
        code.addInsn(new Insn(opc_idiv));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for assignment statement:

public void outAAssignStatement(AAssignStatement node) {
    String var_name = node.getIdentifier().getText().toLowerCase();
    
    try {
        code.addInsn(new Insn(opc_putstatic, new FieldCP(class_name, var_name, "I")));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

Code for writeln statement:

public void inAWritelnStatement(AWritelnStatement node) {
    try {
        // push System.out (a java.io.PrintStream) so println can be invoked on it later
        code.addInsn(new Insn(opc_getstatic, new FieldCP("java/lang/System", "out", "Ljava/io/PrintStream;")));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

public void outAWritelnStatement(AWritelnStatement node) {
    try {
        code.addInsn(new Insn(opc_invokevirtual, new MethodCP("java/io/PrintStream", "println", "(I)V")));
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

After generating all the code, we add an extra return instruction, we add the method to the class, and we dump the class. This is done in method outAProgram, that is, at the exit of a program. The code is as follows:

public void outAProgram(AProgram node) {
    try {
        // add return instruction to the end of the method
        code.addInsn(new Insn(opc_return));

        // add main method to main_class
        main_class.addMethod((short) (ACC_PUBLIC | ACC_STATIC), "main", "([Ljava/lang/String;)V", code, null);

        // generate the class file
        main_class.write(new DataOutputStream(new FileOutputStream(class_name + ".class")));

        // output status
        System.out.println("Wrote " + class_name + ".class");
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

The Main class

Now that we have implemented the working classes, it is time to implement the Main class. This class simply activates the lexer, parser, semantic analyser and the class generator. It is always the same for every compiler or interpreter implemented using SableCC. The code for the main class is shown below:

package uk.co.brainycreatures.smallpascal;

import uk.co.brainycreatures.smallpascal.semantic.SemanticAnalyser;
import uk.co.brainycreatures.smallpascal.code.ClassGenerator;
import uk.co.brainycreatures.smallpascal.parser.*;
import uk.co.brainycreatures.smallpascal.lexer.*;
import uk.co.brainycreatures.smallpascal.node.*;
import java.io.*;

public class Main {
    public static void main(String[] args) {
        long start_time, stop_time; // times compilation

        if (args.length < 1) {
            System.out.println("Usage:");
            System.out.println(" java uk.co.brainycreatures.smallpascal.Main <filename>");
            return;
        }

        try {
            start_time = System.currentTimeMillis();

            // create lexer
            Lexer lexer = new Lexer(new PushbackReader(new BufferedReader(new FileReader(args[0])), 1024));

            // parse program
            Parser parser = new Parser(lexer);
            Start ast = parser.parse();

            // check program semantics
            ast.apply(new SemanticAnalyser());

            // generate class file
            ast.apply(new ClassGenerator());
        }
        }
        catch (Exception e) {
            System.out.println(e);
        }
    }
}

Now that we have implemented everything, it is time to compile all the source code. Assuming that the current working directory is in our CLASSPATH, we compile the compiler as follows:

C:\mycompilers> javac uk\co\brainycreatures\smallpascal\Main.java

If no errors are found in the source, we will have all the sources compiled. We then just have to debug the compiler, that is, test it with small Pascal programs. The best way to test a compiler is by feeding it erroneous data, that is, invalid programs.

If you have any queries, please do not hesitate to send an email to fidel dot viegas at brainycreatures dot co dot uk.

Best Regards
Fidel.

References:

  1. Etienne Gagnon, “SableCC, an Object-Oriented Compiler Framework”, Master’s thesis, McGill University, Montreal, Quebec, March 1998.

Static, Shared Dynamic and Loadable Linux Libraries

Hello.

This link [1] points to an excellent tutorial on how to create, modify and hack with static and dynamic libraries on Linux!

For those who don’t know what a static/dynamic library is, the tutorial explains with step-by-step examples how to start from the code and finish with a dynamic shared library.

It also helps to understand how the Linux library system works! It’s a great tutorial; I hope you enjoy it as much as I did.
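
As a tiny preview of what the tutorial walks through (file and library names below are made up for illustration), building a static archive and a shared library from the same sources looks roughly like this:

$ gcc -c ctest1.c ctest2.c                      # plain object files
$ ar rcs libctest.a ctest1.o ctest2.o           # static library (archive)

$ gcc -fPIC -c ctest1.c ctest2.c                # position-independent objects
$ gcc -shared -Wl,-soname,libctest.so.1 -o libctest.so.1.0 ctest1.o ctest2.o
$ gcc prog.c -L. -lctest -o prog                # link a program against it

The tutorial covers each of these steps (and the dlopen()-style loadable libraries mentioned in the title) in much more detail.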

c u soon.

John.

[1] http://www.yolinux.com/TUTORIALS/LibraryArchives-StaticAndDynamic.html
