my very first aarch64 assembly program

For some years I have been thinking about learning how to write ARM assembly language applications. The driver for this is an old book, “Threaded Interpretive Languages: Their Design and Implementation” by R. G. Loeliger. I purchased a copy back in 1981, the year the book was first published, at a local Book Stop in Atlanta. A threaded interpretive language, or TIL, is the general classification for Forth and other languages like it. I wasn’t yet married and was still something of a night owl, so I had time on my hands to investigate how to implement a TIL. Over a period of three years I implemented the same personal TIL on the Z80 (the processor the book targets), a MOS 6502, a Motorola 68000, and an Intel 8086. By the time I put the book aside I was doing a lot more “serious” work in C and Pascal on DEC minis and IBM PCs. The only time I went back to writing any assembly was later in the 1990s, when I wrote a fair bit of code on the side for a 65c802 and an Intel 80c196. Both of these were custom-designed embedded systems. Unfortunately I didn’t write a TIL for either one, as the requirements directed me to spend my time on other functionality.

So this weekend I decided to dig a bit into the ARM processor driving the Nvidia Xavier NX. I went around the net a bit looking for examples and tutorials, and finally managed to cobble together a “Hello World” program in aarch64/ARMv8 assembly. Here’s the tiny application I wrote.

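#include "include.h"

	.global	_start

_start:
	mov	x0, 1			// file descriptor 1 is stdout
	mov	x8, __NR_write		// syscall number for write
	mov	x2, hello_len		// number of bytes to write
	adr	x1, hello_txt		// address of the string
	svc	0

	mov	x0, 0			// exit status 0
	mov	x8, __NR_exit		// syscall number for exit
	svc	0

hello_txt:	.ascii "Hello, World!\n"
hello_len = . - hello_txt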

I’m following, for the most part, the tutorial “A Guide to ARM64 / AArch64 Assembly on Linux with Shellcodes and Cryptography.” The code and the instructions for assembling it are towards the middle. My only comment about the program is that the line loading x1 with the string’s address differs from the original. I found that to load the register with the address of the string to print, I needed to code the adr mnemonic explicitly. For whatever reason the tools on the Xavier simply interpreted the original as a mov instruction, and nothing would print.
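For readers less used to raw system calls, here is a rough Python sketch of what the write call does. On aarch64 Linux the kernel takes the syscall number in x8, the file descriptor in x0, the buffer address in x1, and the byte count in x2. The function name hello and the constant STDOUT_FD are my own inventions for illustration; os.write is only a thin wrapper over the same kernel call, not a translation of the assembly.

```python
import os

STDOUT_FD = 1  # hypothetical name; the value that would go in x0


def hello(fd=STDOUT_FD):
    """Issue one write() system call and return the number of bytes written."""
    message = b"Hello, World!\n"   # plays the role of hello_txt
    # os.write(fd, buffer) maps to write(fd, buf, len):
    # fd -> x0, buffer address -> x1, len(message) -> x2
    return os.write(fd, message)


if __name__ == "__main__":
    hello()
```

The return value is the byte count, which for this string is 14, the same value the assembler computes for hello_len.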

Because of the number of steps involved in building the app, I wrote a little Python 3 to automate the process. My Python code turned out to be longer than my assembly code.

#!/usr/bin/env python3

import argparse
import os
from pathlib import Path
import subprocess
import sys

if sys.version_info < (3, 6):
    print("You are using Python version {}.{}.{}".format(*sys.version_info[:3]))
    print("Python version 3.6.0 or higher is required.")
    sys.exit(1)

parser = argparse.ArgumentParser()
parser.add_argument("source", help="Assembly source file name is required.")
args = parser.parse_args()

if not os.path.isfile(args.source):
    print("File {} can't be found.".format(args.source))
    sys.exit(1)

filestem = Path(args.source).stem

preprocess = "cpp -E {} -o {}.as".format(args.source, filestem)
p = subprocess.run(preprocess, shell=True)
if p.returncode != 0:
    sys.exit(p.returncode)

assemble = "as {}.as -o {}.o".format(filestem, filestem)
p = subprocess.run(assemble, shell=True)
if p.returncode != 0:
    sys.exit(p.returncode)

link = "ld {}.o -o {}".format(filestem, filestem)
p = subprocess.run(link, shell=True)
sys.exit(p.returncode)

There are no comments, and only a little white space to make it readable to me. I really didn’t feel like diving into either make or cmake. Hopefully I haven’t embarrassed myself too much with either program.

What I’ve discovered so far is that there are a lot of 32-bit ARM assembly tutorials that won’t work at all with the Xavier. They just won’t assemble. But I am moving along a bit after this. I have Loeliger’s inner and outer interpreter coded, and two words in a dictionary. I’ll begin to post this effort shortly. As for why, well, why not? If nothing else, this hello program is 1,104 bytes long, which beats the size of Go’s basic hello world program, typically a megabyte or two, by roughly three orders of magnitude.

There’s just something bracingly honest about writing in assembly that no other method of coding can approach.

very simple file system testing with the raspberry pi 4 8gb

Each test wrote a single 1GB file, test1.img, to the device under test. The command for the SSD was:

sudo dd if=/dev/zero of=/media/pi/SSD/test1.img bs=1G count=1 oflag=dsync

For the boot microSDXC card I changed the output path accordingly; for the SSD I simply moved the drive between a USB2 and a USB3 port. These are write tests only, and I expect reads to be faster. The results give me a baseline for further experimentation.

Device (mount point)                 Write time (s)   Write speed (MB/s)
/dev/mmcblk0p2 (/media)              50               21.4
/dev/sda1 on USB2 (/media/pi/SSD)    36.9             29.1
/dev/sda1 on USB3 (/media/pi/SSD)    7                154
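The MB/s column is just bytes written divided by elapsed time, using decimal megabytes as dd does. Here is a minimal Python sketch of the same kind of measurement, assuming a Linux system where os.O_DSYNC is available. The function names timed_sync_write and mb_per_second are hypothetical. Unlike the dd command, which writes one 1GB block, this sketch syncs after every smaller block, so its timings would not be directly comparable to the table.

```python
import os
import tempfile
import time


def timed_sync_write(path, total_bytes, block_bytes=1024 * 1024):
    """Write total_bytes of zeros to path with O_DSYNC; return elapsed seconds."""
    # O_DSYNC asks the kernel to push each write to the device before
    # returning, roughly what dd's oflag=dsync requests. Fall back to 0
    # on platforms that do not define it.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_DSYNC", 0)
    block = bytes(block_bytes)  # a buffer of zeros, like /dev/zero
    start = time.perf_counter()
    fd = os.open(path, flags, 0o644)
    try:
        remaining = total_bytes
        while remaining > 0:
            # os.write may write fewer bytes than asked, so track the count.
            written = os.write(fd, block[:min(block_bytes, remaining)])
            remaining -= written
    finally:
        os.close(fd)
    return time.perf_counter() - start


def mb_per_second(total_bytes, seconds):
    """Throughput in decimal megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 1_000_000


if __name__ == "__main__":
    # Small demonstration write to a temporary directory, not a real benchmark.
    with tempfile.TemporaryDirectory() as tmp:
        size = 4 * 1024 * 1024
        secs = timed_sync_write(os.path.join(tmp, "test1.img"), size)
        print("{:.1f} MB/s".format(mb_per_second(size, secs)))
```

As a check on the arithmetic, mb_per_second(2**30, 36.9) comes out at about 29.1, which agrees with the USB2 row.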

Another test I ran was to build Python 3.8.3 again. I reran configure to make sure this 64-bit version of Raspbian was complete. It was. I did not have to install anything. I now have Python 3.8.3 installed in an alternate location, and I have created a virtual Python environment with version 3.8.3. The build was done on the SSD with all four cores. It completed rather quickly.


  • Device /dev/mmcblk0p2 is a 64GB SanDisk Ultra Plus microSDXC UHS-I card formatted with the EXT4 file system.
  • Device /dev/sda1 is a 500GB Crucial CT500MX500SSD1 solid state drive connected with a StarTech SATA to USB 3 adapter formatted with the EXT4 filesystem.