Roadmap

In this repository I’m gonna write down a roadmap with the topics, concepts, techniques, and technologies I intend to master over time. My intention is to create a strong mental model of how things correlate and fill all the gaps in my knowledge.

Summary

Goals
Computer Systems
1. CPU
2. Memory
3. Disk
4. Network
Operating Systems
Networks
Software Design
Distributed Systems
Data Structure and Algorithms
Books
Techniques

Goals

I aim to master the following topics:

Algorithms and Data Structures
Linux and Operating Systems in general
Containerization
Computer Networking
Distributed Systems
Software Design
SQL and techniques applied by database engines
Security
JavaScript/TypeScript
NodeJS runtime

Computer Systems

CPU

The CPU (Central Processing Unit) fetches instructions from memory, decodes them to determine their type and operands, and executes them. That cycle is repeated until the program finishes. Each CPU architecture has its own instruction set, so a CPU cannot run code compiled for a different architecture. Accessing memory to fetch instructions or data is much more expensive than executing an instruction. For that reason, the CPU has registers inside it to hold frequently used values. The CPU has at least two modes: kernel mode and user mode. When running in kernel mode, the CPU can execute any instruction in its instruction set.
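To make the fetch-decode-execute cycle concrete, here is a minimal sketch of a toy CPU. The instruction names (LOAD, ADD, HALT) and the two registers are made up for illustration and do not correspond to any real architecture.

```python
def run(program):
    """Run a toy program where each instruction is a tuple (opcode, *operands)."""
    regs = {"A": 0, "B": 0}  # hypothetical registers
    pc = 0                   # program counter: address of the next instruction
    while pc < len(program):
        instruction = program[pc]        # fetch
        opcode, *operands = instruction  # decode
        pc += 1
        if opcode == "LOAD":             # execute
            reg, value = operands
            regs[reg] = value
        elif opcode == "ADD":
            dst, src = operands
            regs[dst] += regs[src]
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return regs


program = [("LOAD", "A", 2), ("LOAD", "B", 3), ("ADD", "A", "B"), ("HALT",)]
result = run(program)
print(result)
```

A program written for this toy instruction set would mean nothing to a CPU with a different one, which is the same reason real binaries are tied to an architecture.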

CPU clock

Memory

Memory is made up of a number of locations and each of them is uniquely identifiable and has the ability to store information. This identifier of each memory location is known as its address. The total number of identifiable memory locations is known as its address space.
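As a quick piece of arithmetic, the size of the address space follows from the number of bits in an address. The sketch below assumes each address is a plain unsigned integer of `address_bits` bits; the function name is mine.

```python
def address_space_size(address_bits):
    """Number of distinct addresses that fit in `address_bits` bits."""
    return 2 ** address_bits


for bits in (16, 32):
    print(bits, "bit addresses ->", address_space_size(bits), "locations")
```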

Disk

Network

Operating Systems

My focus is on Unix-like operating systems, more specifically Linux, as they are usually open source and are the fundamental building blocks in the server-side space.

Kernel

The kernel is the most important part of a preemptive, time-sharing operating system. It is responsible for providing an execution environment for user applications and a service layer that mediates communication between user applications and hardware.

Kernel mode and User mode

The CPU has at least two modes: kernel mode and user mode. Kernel mode is a privileged mode in which the operating system can use any instruction of the CPU instruction set architecture (ISA). User applications run in user mode, so when they need to execute a privileged operation, they ask the kernel to execute it on their behalf. The mechanism used to accomplish this is the system call, the interface between user applications and the hardware capabilities exposed by the kernel, such as reading a file from disk. Hardware instructions allow switching from one mode to the other, and areas of virtual memory can be marked as part of user space or kernel space.

Preemption

Reentrancy

System Calls

System calls are services offered by the kernel to user applications. Even though they look like simple function calls, they are actually triggered by a special CPU instruction that transfers control to the kernel. They are also called kernel entry points.

To trigger a system call, a special CPU instruction is used:

int on x86 (int 0x80 on 32-bit Linux)
syscall on modern x86-64
svc on arm

The program, typically through a library function (e.g., libc), executes a special CPU instruction (like syscall), which causes the CPU to switch from user mode to kernel mode. The CPU then begins executing the kernel’s system call handler, which is provided by the operating system. This handler interprets the request made by the program using parameters, such as the system call number and arguments, that were set up in registers or memory before the instruction was executed.
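The "library function" step can be observed from Python. This is a rough sketch that assumes a Unix-like system where `ctypes.CDLL(None)` exposes the C library already loaded into the process. It asks for the process ID twice, once through libc's `getpid` wrapper and once through Python's `os.getpid`, and both end up asking the kernel the same question.

```python
import ctypes
import os

# Assumption: on Linux and similar systems, CDLL(None) gives access to the
# symbols of the C library already loaded into this process.
libc = ctypes.CDLL(None)

pid_via_libc = libc.getpid()   # libc wrapper around the getpid system call
pid_via_python = os.getpid()   # Python's own wrapper for the same request

print(pid_via_libc, pid_via_python)
```

Running the script under `strace` is a handy way to see which system calls are actually issued.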

Filesystems

Unix-like filesystems are organized as a hierarchical tree. In Unix-like OSes, everything is a file.

File

A file is an unstructured stream of bytes. Another definition could be: a file is an information container structured as a sequence of bytes. This means that the operating system does not impose any structure on the file. Defining the file structure is the responsibility of the application that manipulates it.

Types of Files

Regular Files
Symbolic Links
Directories
Special Files
- Char devices
- Block devices
- Sockets
Pipes and Named Pipes (FIFOs)

Universality of IO

Universality of IO means that the same four system calls open(), read(), write(), and close() can be used for any type of file.
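A small sketch of this idea using Python's thin wrappers over the system calls (`os.open`, `os.read`, `os.write`, `os.close`). The helper `copy_bytes` and the file name are made up for the example. The same read/write loop works whether the descriptor refers to a regular file or a pipe (the pipe's descriptors come from `os.pipe` rather than `os.open`).

```python
import os
import tempfile


def copy_bytes(src_fd, dst_fd, chunk_size=4096):
    """Copy everything from src_fd to dst_fd using only read() and write()."""
    total = 0
    while True:
        chunk = os.read(src_fd, chunk_size)
        if not chunk:  # an empty read means end of file
            break
        written = 0
        while written < len(chunk):  # write() may accept fewer bytes than asked
            written += os.write(dst_fd, chunk[written:])
        total += len(chunk)
    return total


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "example.txt")  # hypothetical file name

# Regular file: open(), write(), close()
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"hello\n")
os.close(fd)

# The same helper copies from a regular file into a pipe
read_end, write_end = os.pipe()
src = os.open(path, os.O_RDONLY)
copied = copy_bytes(src, write_end)
os.close(src)
os.close(write_end)

received = os.read(read_end, 4096)
os.close(read_end)
print(copied, received)
```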

Virtual File System

The VFS is a kernel software layer that handles all the system calls related to a standard Unix filesystem. It provides a common interface to several kinds of filesystems.

Dentry

Inode

An i-node (index node) is a data structure that maintains some information about a file, such as:

File type
Owner
Group
Permissions
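The fields listed above can be inspected with `os.stat`, which returns the information the kernel keeps for a file. A minimal sketch, using a temporary file so it does not depend on any existing path:

```python
import os
import stat
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    info = os.stat(tmp.name)
    summary = {
        "inode": info.st_ino,                             # i-node number
        "type": "regular" if stat.S_ISREG(info.st_mode) else "other",
        "owner_uid": info.st_uid,                         # owner (numeric user ID)
        "group_gid": info.st_gid,                         # group (numeric group ID)
        "permissions": oct(stat.S_IMODE(info.st_mode)),   # permission bits only
    }

print(summary)
```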

File Descriptors

File descriptors are (generally small) non-negative integers. A file descriptor is used to refer to any type of open file. By convention, a process started from a shell inherits three open file descriptors: standard input (STDIN, 0), standard output (STDOUT, 1), and standard error (STDERR, 2).

File Objects

Unix file permissions

Files in Unix are protected by assigning to each of them a 9-bit binary protection code. The protection code consists of three 3-bit fields. One for the owner, one for other members of the owner’s group, and one for everyone else. Each field has a bit for read access, a bit for write access, and a bit for execute access. These bits are known as the rwx bits.

For example, suppose that I run the following command in the terminal and I get this result:

$ ls -li
6940089 -rwxrwxr-x 1 lucas lucas 15776 Mar  3 21:52  a.out

The a.out file has the following protection code: rwxrwxr-x (the leading dash is the file type, here a regular file, and is not part of the protection code), therefore:

The owner can read, write, and execute.
The group members can also read, write, and execute.
Everyone else can read, execute, but not write.
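The same reading of the rwx bits can be done programmatically. This sketch decodes the mode 0o775 (the octal form of rwxrwxr-x) with the constants from Python's `stat` module; the `describe` helper is mine.

```python
import stat


def describe(mode):
    """Return the rwx string for owner, group, and others."""
    classes = {
        "owner": (stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
        "group": (stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
        "others": (stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
    }
    result = {}
    for who, bits in classes.items():
        result[who] = "".join(
            letter if mode & bit else "-" for letter, bit in zip("rwx", bits)
        )
    return result


mode = 0o775  # three 3-bit fields: 7 = rwx, 7 = rwx, 5 = r-x
permissions = describe(mode)
listing = stat.filemode(stat.S_IFREG | mode)  # same string that ls prints

print(permissions)
print(listing)
```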

The `/proc` and `/sys` filesystems

Processes and Threads

Processes

A process is an instance of a program in execution. We also can say that a process is the basic unit by which the kernel allocates resources such as CPU time and memory.

Memory Layout of a Process

The memory layout of a process is divided into parts called segments:

Text segment - it contains the machine-language code to be run by the process. It is read-only so that a process cannot modify its own instructions, and it can be shared: if the same program is executed several times, the kernel creates a process for each execution, but they can all map the same text segment into their virtual memory instead of keeping a separate copy in memory for each.
Initialized data segment - it contains global and static variables that are explicitly initialized.
Uninitialized data segment - it contains global and static variables that are not explicitly initialized.
Stack - this is a dynamically growing and shrinking segment containing stack frames.
Heap - this is the area from which the process can dynamically allocate extra memory.

Scheduler

Process Exit Codes

Context switch

Kernel Threads

Lightweight Process

Inter-process Communication (IPC) Mechanisms

IPC mechanisms allow processes to communicate with each other. Linux provides the following IPC mechanisms:

signals
pipes and FIFOs
sockets
file locking
message queues
semaphores
shared memory

Sockets

Sockets are an IPC mechanism that allows data to be exchanged between processes, either on the same host or on different hosts connected by a network.

Socket Domains:

AF_UNIX allows communication between applications on the same host.
AF_INET allows communication between applications running on hosts connected via IPv4.
AF_INET6 allows communication between applications running on hosts connected via IPv6.

Domain	Communication performed	Communication between applications	Address format
AF_UNIX	within kernel	same host	sockaddr_un
AF_INET	IPv4	hosts connected by IPv4	sockaddr_in
AF_INET6	IPv6	hosts connected by IPv6	sockaddr_in6

Socket Types

Stream
Datagram
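A minimal sketch of an AF_UNIX stream socket, using `socket.socketpair` to get two already-connected endpoints in the same process (in a real program each end would usually live in a different process). It assumes a Unix-like system.

```python
import socket

# Two connected endpoints in the AF_UNIX domain, stream type
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

left.sendall(b"ping")
request = right.recv(1024)       # small message, so one recv is enough here

right.sendall(request.upper())
reply = left.recv(1024)

left.close()
right.close()
print(request, reply)
```

With a stream socket, a larger message may arrive split across several `recv` calls, so real code loops until it has everything it expects.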

Error Codes

EACCES
EADDRINUSE
EAGAIN
ECONNREFUSED
ECONNRESET
ENOENT
ETIMEDOUT

Users and Groups

`/etc/passwd` file

Terminal Commands

ls
cd
rm
chmod
chown
dpkg -i [package]
which
xargs - build and execute command lines from the standard input
source
ulimit
awk
strace
printenv
grep
umask - set file mode creation mask
lsof
perf

Computer Networking

Network protocols are typically implemented in the kernel space for performance and security reasons.

TCP/IP

Application Layer

HTTP - Hypertext Transfer Protocol

DNS - Domain Name System

DNS has 3 major components:
- Domain Name Space and Resource Records
- Name Servers
- Resolvers

SSL - Secure Sockets Layer and TLS - Transport Layer Security

Certificates (e.g., ACME challenges and Let’s Encrypt).
OpenSSH
ACME (Automated Certificate Management Environment)

SSH (Secure Shell)

Transport Layer

TCP (Transmission Control Protocol)
UDP (User Datagram Protocol)
PORTS (well-known ports)

Internet Layer

IP (Internet Protocol)

The IP protocol implements two basic functions: addressing and fragmentation.

Internet Header Format
CIDR (Classless Inter-domain Routing)

Private Address Space

The Internet Assigned Numbers Authority (IANA) has reserved the following three blocks of the IP address space for private internets:

range start	range end	cidr prefix
10.0.0.0	10.255.255.255	10/8 prefix
172.16.0.0	172.31.255.255	172.16/12 prefix
192.168.0.0	192.168.255.255	192.168/16 prefix
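The table can be checked with Python's `ipaddress` module. This sketch rebuilds each range from its CIDR prefix and tests whether an address falls in one of the three blocks; the helper name `in_private_block` and the sample addresses are mine.

```python
import ipaddress

PRIVATE_BLOCKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def in_private_block(address):
    """True if `address` belongs to one of the three private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)


# First and last address of each block, derived from the prefix alone
ranges = [(str(block[0]), str(block[-1])) for block in PRIVATE_BLOCKS]

print(ranges)
print(in_private_block("172.20.1.5"), in_private_block("172.32.0.1"))
```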

/etc/hosts file

Network Layer

NIC (Network Interface Card)
mTLS (Mutual Transport Layer Security)

Latency components

Propagation delay
Transmission delay
Processing delay
Queuing delay
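A back-of-the-envelope sketch of how the four components add up for a single packet on a single link. The numbers (packet size, bandwidth, distance, signal speed) are made-up assumptions, and processing and queuing delays are simply passed in since they depend on the devices and the load.

```python
def one_hop_latency(packet_bits, bandwidth_bps, distance_m, signal_speed_mps,
                    processing_s=0.0, queuing_s=0.0):
    """Total one-hop delay in seconds as the sum of the four components."""
    transmission = packet_bits / bandwidth_bps   # time to push the bits onto the link
    propagation = distance_m / signal_speed_mps  # time for the signal to travel
    return propagation + transmission + processing_s + queuing_s


# Assumed values: 1500-byte packet, 100 Mbit/s link, 1000 km, signal at 2e8 m/s
latency = one_hop_latency(
    packet_bits=1500 * 8,
    bandwidth_bps=100e6,
    distance_m=1_000_000,
    signal_speed_mps=2e8,
)
print(f"{latency * 1000:.2f} ms")
```

With these numbers the propagation delay (5 ms) dominates the transmission delay (0.12 ms), which is why distance matters so much for small messages.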

Software Design

Software Design Principles

Abstraction
Separation of concerns
Modularization
Encapsulation (information hiding)
Separation of interface and implementation
Coupling
Cohesion
Uniformity
Completeness
Verifiability

Software Design Qualities

Concurrency
Control and Handling Events
Data Persistence
Distribution of Components
Error and Exception Handling and Fault Tolerance
Interaction and Presentation
Security

Distributed Systems

Backoff
Backpressure
Throttling
Consistent hashing
CAP Theorem - Given three properties of computing systems, consistency, availability, and partition tolerance, a distributed computing system can provide any two of these features, but never all three.

Data Structure and Algorithms

https://www.ime.usp.br/~pf/algoritmos/index.html

Databases, Disks, and Filesystems

To learn more about databases I am going to use the Postgres documentation. It’s a very rich documentation. You can go to https://www.postgresql.org/docs/ and select the version you wanna read.

Relational Databases and PostgreSQL

SQL (Structured Query Language) syntax

DML (Data Manipulation Language)

DDL (Data Definition Language)

ACID (Atomicity, Consistency, Isolation, and Durability)

Write-ahead log (WAL)

Concurrency Control

To handle data consistency, Postgres leverages the MVCC model (Multiversion Concurrency Control). In that model, each SQL statement sees a snapshot of the database.

Transactions

Transactions are a fundamental concept of all database systems. The operations executed within a transaction must be executed atomically: either all of them complete successfully or all of them are rolled back. The intermediate states are not visible to concurrent transactions. Postgres treats every SQL statement as running within a transaction, so if you don’t issue a BEGIN command, each statement is implicitly wrapped in a BEGIN and, if it succeeds, a COMMIT.
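A small sketch of atomicity. To keep it runnable without a server, it uses Python's built-in `sqlite3` module with an in-memory database instead of Postgres, so the transaction handling shown is sqlite3's, not Postgres's. The `accounts` table and the `transfer` helper are made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates happen or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # fails halfway and is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The failed transfer had already subtracted 500 from alice when the check raised, yet that intermediate state never becomes visible after the rollback.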

Isolation Levels:

Serializable - defined by the standard as guaranteeing that any concurrent Serializable transactions produce the same result as running them one at a time in some order.
Repeatable Read
Read Committed
Read Uncommitted

Phenomena (anomalies):

Lost update - lost updates are forbidden by the SQL standard at all isolation levels.
Dirty Read - a transaction reads data written by an uncommitted transaction. The standard allows Dirty Read at the Read Uncommitted isolation level.
Nonrepeatable Read
Phantom Read
Serialization Anomaly

Also read: https://www.postgresql.org/docs/current/transaction-iso.html

Read about deferred, immediate, and exclusive transactions. Read about locks and the types of locks.

Cardinality

CTEs (common table expressions)

ORMs (Object Relational Mapper)

Active record vs. Data Mapper (Read: ORM Active Record vs. Data Mapper)

Replication

RAID

Sharding

Migrations

Bulk Operations

Techniques

Read/Write Patterns

Disks

Filesystems

Communication Protocols

HTTP
WebSockets
gRPC

Programming Languages

Common symbols across languages:

; - semicolon
, - comma
{} - curly braces
[] - square brackets
() - round brackets or parentheses
@ - at
. - dot
/ - slash
- - dash
Paradigms
- OOP (Object-oriented Programming)
- Procedural
- Functional
Idiomatic Code
Ecosystem
Dependency Management
Actor Model
Concurrency and Parallelism
Distributed Programming
Regular Expressions
Language Server Protocol
Runtime Data Validation
Turing Completeness

Cloud Providers

AWS

ARN (Amazon Resource Name)
IAM (Identity and Access Management)
EC2 (Elastic Compute Cloud)
VPC (Virtual Private Cloud)
- Route Table (what is main route table?)
- Security Group
- Internet Gateway (IGW)
- Route Table
- CIDR
- NAT Gateway
- Subnet
CDK

GCP

Authentication (AuthN) and Authorization (AuthZ): A Learning Path

Core Concepts

Authentication (AuthN) - the process of verifying “who” a user or system is, typically via credentials like passwords, tokens, or biometrics.

Authorization (AuthZ) - the process of determining “what” an authenticated user or system is allowed to do, based on permissions or policies.

Identity Foundations

Identity Providers (IdPs)
Single Sign-On (SSO)

Authentication Protocols

LDAP (Lightweight Directory Access Protocol)
Kerberos
SAML (Security Assertion Markup Language)
OIDC (OpenID Connect)

OAuth2 (RFC 6749) Authorization Framework

OAuth2 defines four roles:

resource owner
resource server
authorization server
client

Authorization Mechanisms

ACL (Access Control List)
RBAC (Role-Based Access Control)
ABAC (Attribute-Based Access Control)
JWT (JSON Web Tokens)
OPA (Open Policy Agent)

Supporting Technologies

PKI (Public Key Infrastructure)
MFA (Multi-Factor Authentication)

Version Control

Git
GitHub
Semantic Versioning (SemVer)

High level skills

Troubleshooting
Profiling
Debugging
Coding
Problem-solving

Design Patterns

Dependency Injection
Service Locator
Outbox pattern

Software Quality Attributes

Integrity
Security
Reusability
Efficiency
Correctness
Readability
Speed of Development
Maintainability
Usability
Reliability
Compatibility
Portability
Testability
Scalability
Flexibility
Functional Suitability
Interoperability
Performance Efficiency
- CPU Usage
- Memory Usage
- Requests per Minute and Bytes per Request
- Latency
- Uptime/Downtime
- Response Time
- Error Rates
- Garbage Collection

NodeJS

Node Version Manager (https://github.com/nvm-sh/nvm)
peerDependency (NPM)
NodeJS Dev Guide
Is Node really single threaded?

Hints

Ctrl + R - in the terminal, to search the command history.
Ctrl + Alt + - - in VS Code, to go back to the previous location.
Ctrl + D - select some text and press the shortcut to also select the next identical occurrence. You can hold Ctrl and keep pressing D to select the following ones.
Ctrl + Shift + L - works similarly to Ctrl + D, but it selects all occurrences at once instead of one by one.
Select some text, move the cursor to where you wanna put the content, and click the mouse wheel. The interesting part is that it does not mess with the Ctrl + C selection.

Terminology

Anemic Model
ACID
Atomicity
Availability
Backoff
Backpressure
Bitwise
Bottlenecks
Bounded Context
CAP theorem
Caching
Consistency
Certificates
Circuit Breaker
Clean Architecture
Cross Site Request Forgery
Dead Letter Queue
Debugging
Downtime
Durability
Deadlock
Edge Computing
Eventual Consistency
EOF (end of file)
Failover
Fault Tolerance
Flush operation
Gateway
Greenfield
Latency - the time it takes for a message to travel from its point of origin to its destination.
Logging
Microservices
Monitoring
Multitenant System
Observability
Outage
Postmortem
Preemptive
Profiling
Proxy
Reactance
Request for Comments
Resilience
Response time
Retry Pattern
Readiness
Scaling
Separation of Concerns
Service Discovery
Sharding
Sidecar
SOLID principles
Throttling
Threshold
Throughput
Traffic Routing
Tradeoff
Tracing
Troubleshooting
Upstream
Vendoring
Intermittent failures
Idempotency
Replica set
Race conditions
Offset
Overhead
Liveness

Encoding

UML (Unified Modeling Language)

Books

Introduction to Computing Systems
Code: The Hidden Language of Computer Hardware and Software
The Elements of Computing Systems : Building a Modern Computer from First Principles
Operating Systems: Three Easy Pieces
Understanding the Linux Kernel
Designing Data-Intensive Applications
Database Internals
PostgreSQL 14 Internals
High Performance Browser Networking
System Performance: Enterprise and Cloud
SWEBOK (Software Engineering Body of Knowledge) (not exactly a book)
Software Engineering Soft Parts
The Mythical Man-Month

Papers

Experiences implementing a high performance TCP in user-space
http://kegel.com/c10k.html

Must Reads

Flame Graph (Brendan Gregg)
HTTP and Networking (HPBN)
JVNS
Hacker News
Martin Fowler
Addy Osmani Blog
Fundamentals of Software Architecture - An Engineering Approach
Books: Libgen
High Scalability
Alexandre Elias
Linux Journey
OWASP Top Ten
Syndicode Blog
Technical Debt
Code Review at Google: Google
Matt Rickard Archive
Linux Kernel Labs

Techniques and Principles

Prefix sum
Difference array
Ubiquitous Language
Sensible Defaults
Single Source of Truth
Bounded Context (Martin Fowler)
Cold Storage
Monorepository (Monorepo.tools)
Separation of Concerns

Security

Web Security

CORS (Cross-Origin Resource Sharing)
CSRF (Cross-Site Request Forgery)
XSS (Cross-Site Scripting)
CSP (Content Security Policy)

Random Topics That Need to Be Organized

Queues: Dead Letter Queues and other stuff
Blob Store
Unit and Integration Tests
Feature and Bug
Projen
T-shaped
Fuzzing
Qualities of Service
Tiobe Index
James W. Kurose, Keith W. Ross - Computer Networking - A Top-Down Approach-Pearson
Certificates (e.g., ACME challenges and Let’s Encrypt).
ACME (Automated Certificate Management Environment) - rfc
Principle of Least Privilege
Fallacies of Distributed Computing (Circuit Breaker Design Pattern)
Microservices (Martin Fowler)
RFCs (Request for Comments)
Jump instruction vs inline function
Man pages
https://blog.allegro.tech/2024/03/kafka-performance-analysis.html
https://microsoft.github.io/debug-adapter-protocol/
CPU and IO Bound
https://www.designgurus.io/blog/system-design-interview-fundamentals
Open Sourcing
How to choose a license
I need to better understand how dot files work and how they are used in the Linux context. And the rc files too.
GCC vs. Clang/LLVM
Communicating Sequential Processes (CSP)
PIDs
Structuring Folders
What is a Protocol
epoll: a mechanism for obtaining notification of file I/O events
inotify: a mechanism for monitoring changes in files and directories
capabilities: a mechanism for granting a process a subset of the powers of the superuser
extended attributes
i-node flags
kqueue
Read more about set -e
bit mask
errno
TTL (time to live)
Agnostic
MTU - Maximum Transmission Unit
CRDTs
Passkey
BGP protocol
Dual Write problem
Nand2Tetris project
tail call
https://build-your-own.org/
big-endian vs. little-endian
N+1 query problems
OWASP API Security Top 10
ISO/IEC 9075-1
Graceful shutdown - where does it come from?
stateful
stateless
libc
Von Neumann Model
Requests per Second (RPS)
shebang (e.g., #!/usr/bin/env node)
cgroups and seccomp
deduplication
CVE

(WIP) LLMs

MCP Protocol
JSON-RPC

In Introduction to Computing Systems, it says:

To be perfectly precise, it is not really the case that the computer differentiates the absolute absence of voltage (0) from the absolute presence of voltage (1). Actually, the electronic circuits differentiate voltages close to zero from voltages far from zero.
