A Brief Story About Compiling - learnclang/1-helloworld GitHub Wiki
Since we are compiling our code, it helps to know what happens behind the scenes.
- We are converting a high-level language (the source code we wrote) into a low-level language (machine language).
This process involves four stages, each handled by one of the following tools:
- Pre-processor
- Compiler
- Assembler
- Loader/Linker
Pre-processor
It replaces each #include directive with the contents of the included file and expands every macro (defined with #define) into its replacement text.
You can try this with gcc:
c:\>gcc -E helloworld.c > pre.c
This creates a file pre.c in which every #include has been replaced by the contents of the included file and every macro has been expanded.
[Code with #]--->[preprocessor]--->[code without #]
Compiler
The compiler takes the preprocessor output and generates assembly source code. It checks the syntax and semantics of the code and reports an error if it finds one.
c:\>gcc -S helloworld.c
This creates "helloworld.s", the assembly source code.
More About Compiler
The Compiler design consists of the following phases.
1. Lexical Analyzer
- The source code is converted into a stream of tokens
- White space and comments are removed
- This is done by matching the source text against patterns known to the lexical analyzer
- eg:
x = a + b * c /* source code */ --> Lexical Analyzer --> id = id + id * id
2. Syntax Analyzer(Parser)
- It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree)
- Token arrangements are checked against the grammar of the source language, i.e. the parser checks whether the expression formed by the tokens is syntactically correct.

3. Semantic Analyzer
- Semantic analysis checks whether the parse tree follows the rules of the language.
- For example, it checks that values are assigned between compatible data types and reports errors such as adding a string to an integer
- The semantic analyzer also keeps track of identifiers, their types and expressions, e.g. whether identifiers are declared before use
- The output of the semantic analyzer is a semantically verified parse tree.
4. Intermediate Code Generator
- After semantic analysis the compiler generates an intermediate representation of the source code
- It represents the program for an abstract machine and sits between the high-level language and the machine language.
- There are various kinds of intermediate code; one of the most common is three-address code
eg:
for x = a + b * c, the three-address code representation is,
t1 = b * c
t2 = t1 + a
x = t2
5. Code Optimizer
- Optimization removes unnecessary code and rearranges the sequence of statements to speed up program execution and reduce resource usage (CPU, memory).
eg:
for the three-address code of x = a + b * c, the optimized code will be,
t1 = b * c
x = a + t1
6. Target Code Generator
- The target code generator produces code that the assembler can understand
- It writes assembly code according to the assembler used on the target platform.
Summary
- The phases of a compiler are grouped into a front end and a back end.
- front-end - all analysis phases (1-3) plus intermediate code generation
- back-end - the code optimization phase and the final code generation phase.
- You don't need to rebuild every phase to design a compiler for a new platform; you can keep the front-end and change the back-end for that platform.
- That is why code has to be recompiled for each platform it runs on: each platform has its own instruction set and assembler. The same applies when porting code from a computer to a mobile platform.
- Clang - Clang is a compiler front end for the C, C++, Objective-C and Objective-C++ programming languages. It uses LLVM as its back end and has been part of the LLVM release cycle since LLVM 2.6.
- LLVM(Low Level Virtual Machine) - LLVM Project is a collection of modular and reusable compiler and toolchain technologies. More at LLVM Wiki

Assembler
It takes the assembly source code and produces an assembly listing with offsets. The assembler output is stored in an object file.
c:\>gcc -c helloworld.c
This creates "helloworld.o" (the object file), which contains the machine code.
Loader/Linker
It takes one or more object files or libraries as input and combines them to produce a single (usually executable) file. In doing so, it resolves references to external symbols, assigns final addresses to procedures/functions and variables, and revises code and data to reflect new addresses (a process called relocation).
c:\>gcc helloworld.o -lm
This links the object file and creates an executable. You can also pass multiple object files and link them together:
c:\>gcc hello1.o hello2.o hello3.o -lm
Summary
