Roadmap

In this repository I'm going to write down a roadmap of the topics, concepts, techniques, and technologies I intend to master over time. My intention is to build a strong mental model of how things relate to each other and to fill the gaps in my knowledge.

Summary

  1. Goals
  2. Computer Systems
    1. CPU
    2. Memory
    3. Disk
    4. Network
  3. Operating Systems
    1. Kernel
    2. Filesystems
    3. Processes and Threads
  4. Networks
  5. Software Design
  6. Distributed Systems
  7. Data Structure and Algorithms
  8. Books
  9. Techniques

Goals

I aim to master the following topics:

Computer Systems

CPU

The CPU (Central Processing Unit) fetches an instruction from memory, decodes it to determine its type and operands, and executes it. That cycle repeats until the program finishes. Each CPU architecture has its own instruction set, so a CPU cannot run code compiled for a different architecture. Accessing memory to fetch instructions or data takes much longer than executing an instruction. For that reason, the CPU has internal registers to hold important values. The CPU has at least two modes: kernel mode and user mode. When running in kernel mode, the CPU can execute any instruction in its instruction set.
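The fetch-decode-execute cycle can be sketched with a toy machine. This is only an illustration: the opcodes LOAD, ADD, and HALT and the run function are made up for this example and do not correspond to any real instruction set.

```python
# Toy machine: "memory" is a list of (opcode, operand) pairs.
# The opcodes LOAD, ADD and HALT are hypothetical, invented for this sketch.
def run(program):
    acc = 0  # accumulator register: holds the value being worked on
    pc = 0   # program counter register: address of the next instruction
    while True:
        opcode, operand = program[pc]  # fetch the instruction and decode it
        pc += 1
        if opcode == "LOAD":           # execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")


result = run([("LOAD", 2), ("ADD", 3), ("HALT", 0)])
print(result)  # 5
```

The two local variables play the role of registers: they live "inside the CPU" and are touched on every cycle, while the program list stands in for the slower main memory.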

CPU clock

Memory

Memory is made up of a number of locations, each of which is uniquely identifiable and able to store information. The identifier of a memory location is known as its address. The total number of identifiable memory locations is known as the address space.
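As a rough mental model, memory can be pictured as an array of bytes where the index is the address. The sketch below assumes one byte per location; the helper address_space_size is a name introduced here for illustration.

```python
# A tiny "memory" with 16 byte-sized locations, addressed 0..15.
memory = bytearray(16)
memory[0x0A] = 42            # store the value 42 at address 0x0A
address_space = len(memory)  # number of identifiable locations


def address_space_size(address_bits):
    # With n-bit addresses there are 2**n distinct addresses.
    return 2 ** address_bits


print(memory[0x0A], address_space)  # 42 16
print(address_space_size(32))       # 4294967296
```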

Disk

Network

Operating Systems

My focus is on Unix-like operating systems, more specifically Linux, as they are usually open source and are the fundamental building blocks of the server-side space.

Kernel

The kernel is the most important part of a preemptive, time-sharing operating system. It is responsible for providing an execution environment for user applications and a service layer that mediates communication between user applications and the hardware.

Kernel mode and User mode

The CPU has at least two modes: kernel mode and user mode. Kernel mode is a privileged mode in which the operating system can use any instruction of the CPU's instruction set architecture (ISA). User applications run in user mode, so when they need to perform a privileged operation, they ask the kernel to execute it on their behalf. The mechanism used to accomplish this is the system call, which is the interface between user applications and the hardware capabilities exposed by the kernel, such as reading a file from disk. Hardware instructions allow switching from one mode to the other, and areas of virtual memory can be marked as part of user space or kernel space.

Preemption

Reentrancy

System Calls

System calls are services offered by the kernel to user applications. Even though they look like simple function calls, they are actually invoked through a special CPU instruction that transfers control to the kernel. They are also called kernel entry points.

To trigger a system call, a special CPU instruction is used:

The program, typically through a library function (e.g., libc), triggers a special CPU instruction (like syscall), which causes the CPU to switch from user mode to kernel mode. The CPU then begins executing the kernel's system call handler, which is provided by the operating system. This handler interprets the request made by the program using parameters, such as the system call number and arguments, that were set up in registers or memory before the instruction was executed.
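From a high-level language the special instruction is never visible; the program just calls a wrapper. A minimal sketch, assuming a Unix-like system where Python's os.pipe, os.write, and os.read are thin wrappers over the pipe(2), write(2), and read(2) system calls (echo_through_pipe is a helper name introduced here):

```python
import os


def echo_through_pipe(payload: bytes) -> bytes:
    # Each os.* call below asks the kernel to do the work on our behalf.
    read_fd, write_fd = os.pipe()              # pipe(2): kernel creates a pipe
    try:
        os.write(write_fd, payload)            # write(2): copy bytes into the kernel
        return os.read(read_fd, len(payload))  # read(2): copy bytes back out
    finally:
        os.close(read_fd)                      # close(2)
        os.close(write_fd)


print(echo_through_pipe(b"hello"))  # b'hello'
```

On Linux, running a script like this under strace should show the corresponding system calls as they cross into the kernel.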

Filesystems

Unix-like filesystems are organized as a hierarchical tree. In Unix-like OSes, everything is a file.

File

A file is an unstructured stream of bytes. Another definition could be: a file is an information container structured as a sequence of bytes. This means that the operating system does not impose any structure on the file. Defining the file's structure is the responsibility of the application that manipulates it.
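A small sketch of what "the application defines the structure" means: the same four bytes on disk can be read back as one integer or as four separate numbers, and the operating system does not care which. The file name data.bin and the function name are arbitrary choices for this example.

```python
import os
import struct
import tempfile


def write_and_reinterpret():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as f:
            # The application decides the layout: one little-endian 32-bit integer.
            f.write(struct.pack("<I", 1))
        with open(path, "rb") as f:
            raw = f.read()  # the OS hands back plain bytes, nothing more
    as_int = struct.unpack("<I", raw)[0]  # interpretation 1: a single integer
    as_bytes = list(raw)                  # interpretation 2: four small numbers
    return raw, as_int, as_bytes


print(write_and_reinterpret())  # (b'\x01\x00\x00\x00', 1, [1, 0, 0, 0])
```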

Types of Files

Universality of IO

Virtual File System

The VFS is a kernel software layer that handles all the system calls related to a standard Unix filesystem. It provides a common interface to several kinds of filesystems.

Dentry

Inode

An i-node (index node) is a data structure that maintains some information about a file, such as:

File Descriptors

File descriptors are (generally small) non-negative integers. A file descriptor is used to refer to all types of open files. When a process is created, it inherits three file descriptors: standard input (STDIN), standard output (STDOUT), and standard error (STDERR).
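A quick way to see that descriptors are just integers is to open something and look at what comes back. This sketch assumes a Unix-like system; open_two is a helper name introduced here, and os.devnull is used only because it always exists. By convention, descriptors 0, 1, and 2 are STDIN, STDOUT, and STDERR.

```python
import os


def open_two(path=os.devnull):
    # os.open returns the raw file descriptor, not a Python file object.
    fd_a = os.open(path, os.O_RDONLY)
    fd_b = os.open(path, os.O_RDONLY)
    try:
        return fd_a, fd_b
    finally:
        # Closing releases the numbers so the kernel can reuse them.
        os.close(fd_a)
        os.close(fd_b)


fd_a, fd_b = open_two()
print(fd_a, fd_b)  # two distinct small non-negative integers
```

The exact numbers depend on what the process already has open, which is why the example does not assume a specific value.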

File Objects

Unix file permissions

Files in Unix are protected by assigning each of them a 9-bit binary protection code. The protection code consists of three 3-bit fields: one for the owner, one for other members of the owner’s group, and one for everyone else. Each field has a bit for read access, a bit for write access, and a bit for execute access. These bits are known as the rwx bits.

For example, suppose that I run the following command in the terminal and I get this result:

$ ls -li
6940089 -rwxrwxr-x 1 lucas lucas 15776 Mar  3 21:52  a.out

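The a.out file has the following protection code: rwxrwxr-x (the leading dash represents the type of file, here a regular file), therefore: the owner can read, write, and execute it; other members of the owner's group can read, write, and execute it; everyone else can read and execute it, but not write to it.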
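The same protection code can be decoded programmatically. A minimal sketch using Python's stat module; the octal value 0o100775 is the regular-file type bit plus the rwxrwxr-x permission bits from the example above, and describe is a helper name introduced here.

```python
import stat

# 0o100000 marks a regular file; 0o775 is rwx for owner, rwx for group, r-x for others.
mode = 0o100775


def describe(mode):
    # Each tuple is (read, write, execute) for one of the three 3-bit fields.
    return {
        "owner": (bool(mode & stat.S_IRUSR), bool(mode & stat.S_IWUSR), bool(mode & stat.S_IXUSR)),
        "group": (bool(mode & stat.S_IRGRP), bool(mode & stat.S_IWGRP), bool(mode & stat.S_IXGRP)),
        "others": (bool(mode & stat.S_IROTH), bool(mode & stat.S_IWOTH), bool(mode & stat.S_IXOTH)),
    }


print(stat.filemode(mode))       # -rwxrwxr-x
print(describe(mode)["others"])  # (True, False, True)
```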

The /proc and /sys filesystems

Processes and Threads

Processes

A process is an instance of a program in execution. We can also say that a process is the basic unit by which the kernel allocates resources such as CPU time and memory.
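One way to observe that each running program is its own process is to start a child and compare process IDs. This is a sketch, not the only way to create processes; run_child is a name introduced here, and the exit status 3 is an arbitrary value chosen for the example.

```python
import os
import subprocess
import sys


def run_child():
    # Start a second Python interpreter as a separate process.
    # The child prints its own PID and exits with status 3 (arbitrary).
    done = subprocess.run(
        [sys.executable, "-c", "import os, sys; print(os.getpid()); sys.exit(3)"],
        capture_output=True,
        text=True,
    )
    return int(done.stdout), done.returncode


child_pid, exit_code = run_child()
print(os.getpid(), child_pid, exit_code)  # parent PID, a different child PID, 3
```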

Memory Layout of a Process

The memory layout of a process is divided into parts called segments:

Scheduler

Process Exit Codes

Context switch

Kernel Threads

Lightweight Process

Inter-process Communication (IPC) Mechanisms

IPC mechanisms are the means by which processes can communicate with each other. Linux provides the following IPC mechanisms:

Sockets

Sockets are an IPC mechanism that allows data to be exchanged between processes, either on the same host or on different hosts connected by a network.

Socket Domains:

Domain     Communication performed   Communication between applications   Address format
AF_UNIX    within kernel             same host                            sockaddr_un
AF_INET    via IPv4                  hosts connected by IPv4              sockaddr_in
AF_INET6   via IPv6                  hosts connected by IPv6              sockaddr_in6
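A minimal same-host example, assuming a Unix-like system where socket.socketpair() creates a connected pair of AF_UNIX stream sockets by default. Both ends live in one process here only to keep the sketch short; ping_pong is a helper name introduced for illustration.

```python
import socket


def ping_pong(message: bytes) -> bytes:
    # A connected pair of sockets; data never leaves the kernel of this host.
    a, b = socket.socketpair()
    try:
        a.sendall(message)           # write on one end
        return b.recv(len(message))  # read from the other end
    finally:
        a.close()
        b.close()


print(ping_pong(b"ping"))  # b'ping'
```

For communication between different hosts, the same send and receive calls apply, but the sockets would be created in the AF_INET or AF_INET6 domain and bound to IP addresses and ports.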

Socket Types

Error Codes

Users and Groups

/etc/passwd file

Terminal Commands

Computer Networking

Network protocols are typically implemented in the kernel space for performance and security reasons.

TCP/IP


Application Layer

HTTP - Hypertext Transfer Protocol

DNS - Domain Name System

SSL - Secure Socket Layer and TLS - Transport Layer Security

SSH (Secure Shell)

Transport Layer

Internet Layer

IP (Internet Protocol)

The IP protocol implements two basic functions: addressing and fragmentation.

Private Address Space

The Internet Assigned Numbers Authority (IANA) has reserved the following three blocks of the IP address space for private internets:

Range start    Range end         CIDR prefix
10.0.0.0       10.255.255.255    10/8
172.16.0.0     172.31.255.255    172.16/12
192.168.0.0    192.168.255.255   192.168/16
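The range boundaries above follow directly from the prefixes, which can be checked with Python's ipaddress module. In this sketch, PRIVATE_BLOCKS and is_private_block are names introduced for illustration.

```python
import ipaddress

# The three private blocks, written in full CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private_block(addr: str) -> bool:
    # True if the address falls inside one of the three blocks above.
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in PRIVATE_BLOCKS)


for block in PRIVATE_BLOCKS:
    print(block[0], block[-1])  # first and last address of each block

print(is_private_block("172.31.255.255"))  # True
print(is_private_block("172.32.0.0"))      # False
```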

Network Layer

Latency components

Software Design

Software Design Principles

Software Design Qualities

Distributed Systems

Data Structure and Algorithms

Databases, Disks, and Filesystems

To learn more about databases I am going to use the Postgres documentation, which is very rich. You can go to https://www.postgresql.org/docs/ and select the version you want to read.

Relational Databases and PostgreSQL

SQL (Structured Query Language) syntax

DML (Data Manipulation Language)

DDL (Data Definition Language)

ACID (Atomicity, Consistency, Isolation, and Durability)

Write-ahead log (WAL)

Concurrency Control

To handle data consistency, Postgres leverages the MVCC model (Multiversion Concurrency Control). In that model, each SQL statement sees a snapshot of the database.

Transactions

Transactions are a fundamental concept of all database systems. A set of operations executed within the context of a transaction must be executed atomically. This means that either all operations complete successfully or they are all rolled back. The intermediate states are not visible to concurrent transactions. Postgres treats every SQL statement as running within a transaction, so if you don’t issue a BEGIN command, each statement is implicitly wrapped in a BEGIN and, if it succeeds, a COMMIT.
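Atomicity can be sketched without a running Postgres server. The example below uses SQLite through Python's built-in sqlite3 module as a stand-in, so it illustrates the all-or-nothing idea rather than Postgres-specific behavior. The accounts table, the names alice and bob, and the transfer function are all hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()
    try:
        # The credit to bob succeeds, then the debit violates the CHECK
        # constraint, so the whole transaction is rolled back.
        transfer(conn, "alice", "bob", 500)
    except sqlite3.IntegrityError:
        pass
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return balances


print(demo())  # {'alice': 100, 'bob': 0}
```

The credit is applied first on purpose: it shows that a change that already succeeded inside the transaction is undone when a later statement fails.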

Isolation Levels:

Phenomena (anomalies):

Also read: https://www.postgresql.org/docs/current/transaction-iso.html

Read about deferred, immediate, and exclusive transactions. Read about locks and the types of locks.

Cardinality

CTEs (common table expressions)

ORMs (Object Relational Mapper)

Active record vs. Data Mapper (Read: ORM Active Record vs. Data Mapper)

Replication

RAID

Sharding

Migrations

Bulk Operations

Techniques

Disks

Filesystems

Communication Protocols

Programming Languages

Common symbols across languages:

Cloud Providers

AWS

GCP

Authentication (AuthN) and Authorization (AuthZ): A Learning Path

Core Concepts

Authentication (AuthN) - the process of verifying “who” a user or system is, typically via credentials like passwords, tokens, or biometrics.

Authorization (AuthZ) - the process of determining “what” an authenticated user or system is allowed to do, based on permissions or policies.
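The difference between the two checks can be sketched in a few lines. Everything here is hypothetical and kept in memory: the USERS and PERMISSIONS tables, the user alice, the roles, and the password are made up for illustration, and a real system would use a vetted identity library rather than this sketch.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes) -> bytes:
    # Derive a hash from the password so the plain text is never stored.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical in-memory user store and role-to-permission table.
_salt = os.urandom(16)
USERS = {"alice": {"salt": _salt, "hash": hash_password("s3cret", _salt), "role": "viewer"}}
PERMISSIONS = {"viewer": {"read"}, "admin": {"read", "write"}}


def authenticate(username: str, password: str) -> bool:
    # AuthN: is this really the user they claim to be?
    user = USERS.get(username)
    if user is None:
        return False
    candidate = hash_password(password, user["salt"])
    return hmac.compare_digest(user["hash"], candidate)


def authorize(username: str, action: str) -> bool:
    # AuthZ: is this (already authenticated) user allowed to do this?
    return action in PERMISSIONS[USERS[username]["role"]]


print(authenticate("alice", "s3cret"))  # True
print(authorize("alice", "read"))       # True
print(authorize("alice", "write"))      # False
```

Note that the second check only makes sense after the first one has passed: a user can be correctly authenticated and still be denied an action.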

Identity Foundations

Authentication Protocols

OAuth2 (RFC 6749) Authorization Framework

OAuth2 defines four roles:

Authorization Mechanisms

Supporting Technologies

Version Control

High level skills

Design Patterns

Software Quality Attributes

NodeJS

Hints

Terminology

Encoding

UML (Unified Modeling Language)

Books

Papers

Must Reads

Techniques and Principles

Security

Web Security

Random Topics That Need to Be Organized

(WIP) LLMs


In Introduction to Computer Systems, it says:

To be perfectly precise, it is not really the case that the computer differentiates the absolute absence of voltage (0) from the absolute presence of voltage (1). Actually, the electronic circuits differentiate voltages close to zero from voltages far from zero.