The Art of AssemblyLanguage Programming is a delicate topic. There are many processor Architectures, each with its own instruction set. The gcc(1) page lists a large number of them; a few examples are Intel x86, MIPS, and the Motorola m68000 series.

AssemblyLanguage is a language whose instructions correspond to MachineCode instructions, essentially one for one. Each AssemblyLanguage instruction is assembled into a single MachineCode instruction.

The most common form of AssemblyLanguage programming is done on the x86 Architecture. A sample piece of AssemblyLanguage code for Linux can be found in the HelloWorld section.

It is often said that AssemblyLanguage programmers get paid more per line of code than those who hack away in higher-level languages.

AssemblyLanguage programming has the following advantages:

  • The hacker is able to HandOptimize code as it is being written.
  • It is very difficult, if not impossible, to create code in a higher-level language which will execute faster than hand-optimized AssemblyLanguage.

Disadvantages of AssemblyLanguage programming:

  • The code is very difficult to read, especially when maintaining somebody else's code.
  • It is usually easier to start from scratch than to debug faulty code.
  • Because of the two points above, debugging is rarely done, especially on hand-optimized code.

A Compiler such as gcc(1) hides the AssemblyLanguage code it generates on the way to producing object files and executables. It is, however, possible to tell it to write out the AssemblyLanguage code by passing it the -S CommandLineOption.

Here is an example. First, the C code:

#include <stdio.h>

int main(void) {
    int i;

    i = 5;
    i = i * 3;
    printf("%d\n", i);
    i = 0xff;
    return i;
}

Now you can translate this to assembler. If I do this on an ix86 (i.e. an Intel machine), I get:

$ gcc -S x.c ; cat x.s

        .file   "x.c"
        .version        "01.01"
gcc2_compiled.:
.section        .rodata
.LC0:
        .string "%d\n"
.text
        .align 4
.globl main
        .type    main,@function
main:
        pushl %ebp
        movl %esp,%ebp
        subl $24,%esp
        movl $5,-4(%ebp)
        movl -4(%ebp),%eax
        movl %eax,%edx
        addl %edx,%edx
        leal (%eax,%edx),%ecx
        movl %ecx,-4(%ebp)
        addl $-8,%esp
        movl -4(%ebp),%eax
        pushl %eax
        pushl $.LC0
        call printf
        addl $16,%esp
        movl $255,-4(%ebp)
        movl -4(%ebp),%edx
        movl %edx,%eax
        jmp .L2
        .p2align 4,,7
.L2:
        leave
        ret
.Lfe1:
        .size    main,.Lfe1-main
        .ident  "GCC: (GNU) 2.95.3 20010315 (release)"

The %esp, %ebp and so on are registers. For example, %esp is the Stack Pointer: it points to the top of the current stack, which on x86 is the lowest address in use because the stack grows downward. The "pushl" saves the caller's %ebp, the first "movl" copies the value in %esp into %ebp, and the "subl" then subtracts 24 from %esp, so that the stack has grown by 24 bytes. The next "movl" copies the value 5 onto the stack, 4 bytes below the address held in %ebp. This address is where the variable i is stored, so all accesses to i in the C code become references to this memory location in assembler.

As you can see, explaining what assembler is doing line-by-line is tediously boring. Instead of doing i*3, it does i+(i+i). That's the "addl" and "leal" instructions. Below that, it pushes printf's arguments (the value of i and the address of the format string) onto the stack and calls printf, which gets its arguments off the stack.

This is how programmers used to write code. Early versions of Unix were written in assembler. When BellLabs got new machines, they re-wrote their operating system for the new machine code, until they re-wrote it in C in 1973.