Diff: AssemblyLanguage - Waikato Linux Users Group

Differences between current version and predecessor to the previous major change of AssemblyLanguage.

Newer page:	version 12	Last edited on Saturday, October 7, 2006 6:37:17 pm	by AristotlePagaltzis
Older page:	version 2	Last edited on Wednesday, September 11, 2002 12:37:40 pm	by JohnMcPherson

@@ -1,73 +1,80 @@

-~~The Art of~~ AssemblyLanguage ~~Programming~~ is ~~a delicate topic.~~

-~~There are many processor Architectures, with different instruction sets.~~

-~~A large list of different architectures can be found somewhere on the [gcc(~~ 1~~)] page, but here are a couple~~ : ~~Intel x86, [MIPS], and the Motorola m68000 series~~ .

+AssemblyLanguage is a 1:1 translation of MachineCode into English mnemonics.

-AssemblyLanguage is a ~~language constructed of instructions which correlate~~ to ~~MachineCode on~~ a ~~1 for 1 basis~~ . ~~Thus each~~ AssemblyLanguage ~~instruction~~ is ~~a MachineCode instruction~~ .

+The Art of AssemblyLanguage Programming is a delicate topic. By programming in AssemblyLanguage you can hand optimize code and achieve efficiency that is difficult if not impossible to duplicate in a higher level language. However, current computers are fast enough that most code can be written in less efficient higher level languages. AssemblyLanguage is still used for embedded systems (where space and CPU speed are limited), and in parts of an OperatingSystem that are run very frequently or must run fast ([InterruptHandler]s etc.). Some parts of the GNU C library are also written in assembly for the same reasons (for example, some of the maths functions).

-~~The most common form of~~ AssemblyLanguage ~~programming~~ is ~~done on the~~ x86 ~~Architecture~~ . ~~A sample piece~~ of [~~AssemblyLanguage~~ ] ~~code for Linux can be found~~ in the [~~HelloWorld~~ ] ~~section~~ .

+AssemblyLanguage code is not portable across different [CPU] architectures, of which there are many: Intel [x86], [MIPS], and the Motorola m68000 series, to name but a few. Early versions of [Unix] were written in assembler, and when BellLabs got new machines, they re-wrote their operating system for the new MachineCode, until they finally re-wrote most of it in [C] in 1973.

-~~It is a common fact that~~ AssemblyLanguage ~~programmers get paid more per line of~~ code than ~~those who hack away in higher level languages~~ .

+AssemblyLanguage code is difficult to understand and maintain. It is usually easier to start from scratch than to debug faulty code.

-AssemblyLanguage ~~programming has the following advantages:~~

-* The hacker is able to HandOptimize code as it ~~is being written~~ .

-* It is ~~very difficult, if not impossible,~~ to ~~create~~ code ~~in a higher level language which will execute faster than hand~~ -~~optimized AssemblyLanguage.~~

+A Compiler such as [GCC] will hide its generation of AssemblyLanguage code from you as it generates its object files and the executables. It is however possible to tell it to generate the AssemblyLanguage code for you by passing it the <tt>-S</tt> CommandLine option.

-~~Disadvantages of AssemblyLanguage programming:~~

-* The code is ~~very difficult to read~~ , ~~especially when having to maintain somebody elses code.~~

-* It is usually easier to start from scratch than to debug faulty code.

-* Due to the ~~above two reasons, debugging is rarely done. Especially on hand-optimized~~ code.

+Here is an example. First, the [C] code:

-~~A Compiler such as [gcc(1)] will hide it's generation of AssemblyLanguage code from you as it generates it's object files and the executables~~ . ~~It is however possible to tell it to generate the AssemblyLanguage code for you by passing it the -S CommandLineOption~~

+<verbatim>

+#include <stdio.h>

-~~Here is~~ an example. ~~First~~ , the [C] code:

- ~~int main~~ (~~void~~ ) {

- ~~int i;~~

+int main(void) {

+ int i;

+

+ i = 5;

+ i = i * 3;

+ printf("%d\n",i);

+ i = 0xff;

+ return i;

+}

+</verbatim>

+

+Now you can translate this to assembler. If I do this on an [x86] (ie [Intel] machine), I get:

+

+<pre>

+__$ gcc -S x.c && cat x.s__

+ .file "x.c"

+ .section .rodata

+.LC0:

+ .string "%d\n"

+ .text

+.globl main

+ .type main, @function

+main:

+ pushl %ebp

+ movl %esp, %ebp

+ subl $8, %esp

+ andl $-16, %esp

+ movl $0, %eax

+ addl $15, %eax

+ shrl $4, %eax

+ sall $4, %eax

+ subl %eax, %esp

+ movl $5, -4(%ebp)

+ movl -4(%ebp), %edx

+ movl %edx, %eax

+ addl %eax, %eax

+ addl %edx, %eax

+ movl %eax, -4(%ebp)

+ subl $8, %esp

+ pushl -4(%ebp)

+ pushl $.LC0

+ call printf

+ addl $16, %esp

+ movl $255, -4(%ebp)

+ movl -4(%ebp), %eax

+ leave

+ ret

+ .size main, .-main

+ .section .note.GNU-stack,"",@progbits

+ .ident "GCC: (GNU) 3.4.6"

+</pre>

+

+<tt>movl</tt>, <tt>pushl</tt>, <tt>addl</tt>, etc are mnemonics for individual [CPU] instruction OpCodes. <tt>%esp</tt>, <tt>%ebp</tt> etc are mnemonics for registers. For example, <tt>%esp</tt> is the [Stack] Pointer: it points to the top of the current process's [Stack]. The first <tt>movl</tt> copies the value in <tt>%esp</tt> into <tt>%ebp</tt>, then the <tt>subl</tt> subtracts 8 from <tt>%esp</tt>, so that the [Stack] has grown by 8 bytes. After <tt>andl $-16, %esp</tt> rounds the [Stack] Pointer down to a multiple of 16 and some further stack adjustment, <tt>movl $5, -4(%ebp)</tt> copies the value 5 into the [Stack], 4 bytes below the address held in <tt>%ebp</tt>. This address is where the variable <tt>i</tt> is being stored, so all accesses to <tt>i</tt> in the [C] code become references to this memory location in MachineCode. We can also witness an optimization here: instead of doing i*3, it does i+(i+i). That's the two <tt>addl</tt> instructions. Below that, it pushes <tt>printf</tt>'s arguments (the value of <tt>i</tt> and a pointer to the format string) on the stack and calls <tt>printf</tt>, which reads its arguments from the stack.

+

+As you can see, explaining what AssemblyLanguage code is doing line-by-line is tediously boring. This is how programmers used to write code, and it is a common fact that AssemblyLanguage programmers get paid more per line of code than those who hack away in higher level languages.

+

+We can also note that it is extremely bad for your health to rely on the [GCC] output of some [C] code when learning [x86] AssemblyLanguage. [GCC] generates extremely horrid code on occasion, especially when working with multiplication and division because [x86] multiplication and division instructions are restricted in the registers they can use.

- ~~i=5;~~

- ~~i=i*3;~~

- ~~printf~~ (~~"%d\n"~~ ,i );

- ~~i=0xff;~~

- ~~return i;~~

- }

+However, the output of [GCC] can be a tremendously useful resource when optimising [C] code. Especially when mixing different sizes of integers (char, int, long), the resulting MachineCode is sometimes flooded with unexpected typecasting instructions. While concealed at the [C] level, these extra instructions are quite obvious in the AssemblyLanguage (lots of <tt>and</tt> instructions and often additional <tt>mov</tt>).

-~~Now you can translate this to assembler. If I do this on an ix86 (ie~~ [~~Intel~~ ] ~~machine), I get:~~

- ~~$ gcc -S x.c ; cat x.s~~

- ~~.file "x.c"~~

- ~~.version "01.01"~~

- ~~gcc2_compiled.:~~

- ~~.section .rodata~~

- ~~.LC0:~~

- ~~.string "%d\n"~~

- ~~.text~~

- ~~.align 4~~

- ~~.globl main~~

- ~~.type main,@function~~

- ~~main:~~

- ~~pushl %ebp~~

- ~~movl %esp,%ebp~~

- ~~subl $24,%esp~~

- ~~movl $5,-4(%ebp)~~

- ~~movl -4(%ebp),%eax~~

- ~~movl %eax,%edx~~

- ~~addl %edx,%edx~~

- ~~leal (%eax,%edx),%ecx~~

- ~~movl %ecx,-4(%ebp)~~

- ~~addl $-8,%esp~~

- ~~movl -4(%ebp),%eax~~

- ~~pushl %eax~~

- ~~pushl $.LC0~~

- ~~call printf~~

- ~~addl $16,%esp~~

- ~~movl $255,-4(%ebp)~~

- ~~movl -4(%ebp),%edx~~

- ~~movl %edx,%eax~~

- ~~jmp .L2~~

- ~~.p2align 4,,7~~

- ~~.L2:~~

- ~~leave~~

- ~~ret~~

- ~~.Lfe1:~~

- ~~.size main,.Lfe1-main~~

- ~~.ident "GCC: (GNU) 2.95~~ .~~3 20010315 (release)"~~

+Another sample piece of AssemblyLanguage code for [Linux] can be found in the HelloWorld page.

-~~The %esp, %ebp etc are registers. For example, %esp is the Stack Pointer~~ - it points to the base(?) of the current process's memory stack. The first "movl" copies the value in %esp into %ebp, then the "subl" subtracts 24 off %esp, so that the stack has grown by 24 bytes. The next "movl" copies the value 5 into stack, 4 bytes below end of the stack. This address is where the variable i is being stored, so all accesses to i in the C code become references to this memory location in assembler. As you can see, explaining what assembler is doing line -by -line is tediously boring. Instead of doing i*3, it does i+(i+i). That's the "addl" and "leal" instructions. Below that, it puts some pointers (to printf's arguments) on the stack and calls printf, which gets it's arguments off the stack. This is how programmers used to write code. Early versions of [Unix] were written in assembler - ~~when BellLabs got new machines, they re-wrote their operating system for the new machine code, until they re-wrote it in [C] in 1973.~~

+----

+CategoryProgrammingLanguages