TCP/IP Explained — The 4-Layer Model and TCP vs UDP

TCP/IP means both "the protocol suite that runs the Internet" and "the four-layer reference model that organizes it". This article starts with the four-layer overview, then turns to its central topic, TCP: how it builds reliability on top of IP's best-effort delivery. The mechanisms are covered in the order they come into play: 3-way handshake / sequence numbers / receive window / congestion control / connection state. It ends by observing the actual behavior live with ss and tcpdump.

The 4-layer model at a glance

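The TCP/IP model splits communication into four layers. Unlike OSI's seven, it has no separate session (OSI L5) and presentation (OSI L6) layers; those are absorbed into the Application layer, and OSI's physical and data-link layers are merged into Link. The L1 to L4 labels in the table below number the TCP/IP layers, not the OSI ones.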

Layer	Name	Role	Representative protocols
L4	Application	Rules that give meaning to the bytes that arrived	HTTP / DNS / SSH / SMTP / TLS / gRPC
L3	Transport	Logical "app ↔ app" connection — identified by port	TCP / UDP / QUIC / SCTP
L2	Internet	Connects "host ↔ host" at planetary scale — identified by IP	IPv4 / IPv6 / ICMP / IPsec / OSPF / BGP
L1	Link	The physical "how does it reach the next hop"	Ethernet / Wi-Fi / PPP / ARP / fiber

▸ The core design — the Internet layer is best-effort

The IP layer makes no delivery guarantee and no ordering guarantee. "TCP picks up the slack when reliability is needed" and "UDP passes through thinly when it isn't" — that separation is the fundamental reason the Internet has been able to stretch into every imaginable use case.

Restating each layer's job in one sentence:

Application — the rules that give bytes meaning, like HTTP request/response or DNS query/answer
Transport — which app, which connection, optionally with reliability (and, in QUIC's case, built-in encryption)
Internet — get this packet to the host with that IP, from anywhere on Earth (best-effort)
Link — right now, on this cable or radio, hand the frame to the neighbor in front of me

The "four guarantees" TCP builds on top of IP

IP is a service that does nothing more than "write an address on an envelope and drop it in the mail" — no delivery guarantee, no ordering guarantee, no de-duplication guarantee. The reason almost every app on the Internet can still be written assuming "what I send arrives in order, without losses, without duplicates" is that TCP sits in between and hides IP's uncertainty.

What TCP provides is exactly four guarantees:

Connection — a handshake establishes "we're about to talk"
Order + de-duplication + loss recovery — sequence numbers + ACKs + retransmission
Flow control — the receiver tells the sender how much more data it can accept, via a window
Congestion control — estimate path congestion and automatically pace the sender

3-way handshake and 4-way teardown

TCP's defining property is that it is connection-oriented. Before any data flows, both ends agree "we're about to talk" / "go ahead" in three messages.

1. SYN (Client → Server)

The client picks seq=x and sends. State becomes SYN_SENT.

2. SYN+ACK (Server → Client)

The server picks its own seq=y and replies with ack=x+1. State: SYN_RECV.

3. ACK (Client → Server)

The client returns ack=y+1. Both sides become ESTABLISHED — application data can now flow.

The essence of the 3-way handshake is "make each side confirm the other's sequence number, in both directions". Each direction has its own 32-bit sequence number; SYN announces "I'll start from this number", and ACK answers "I'm waiting from +1". Doing the same exchange in both directions = three messages.
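
The sequence and ACK arithmetic of the three messages can be sketched in a few lines. This is an illustrative model only, not a real TCP stack: the function name and the dict layout are made up for this example, and real segments carry many more fields.

```python
import random

SEQ_MODULUS = 2**32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list[dict]:
    """Return the three handshake segments as simple dicts (hypothetical layout)."""
    syn = {"from": "client", "flags": "SYN", "seq": client_isn, "ack": None}
    syn_ack = {
        "from": "server",
        "flags": "SYN+ACK",
        "seq": server_isn,
        # The SYN consumes one sequence number, so the server waits for x+1.
        "ack": (client_isn + 1) % SEQ_MODULUS,
    }
    ack = {
        "from": "client",
        "flags": "ACK",
        "seq": (client_isn + 1) % SEQ_MODULUS,
        # Same rule in the other direction: the client waits for y+1.
        "ack": (server_isn + 1) % SEQ_MODULUS,
    }
    return [syn, syn_ack, ack]


if __name__ == "__main__":
    x = random.randrange(SEQ_MODULUS)  # client's initial sequence number
    y = random.randrange(SEQ_MODULUS)  # server's initial sequence number
    for segment in three_way_handshake(x, y):
        print(segment)
```

The modulo is there because the 32-bit counter wraps: an initial sequence number of 2^32 - 1 is acknowledged with 0.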

4-way teardown and TIME_WAIT

TCP is designed so that each direction can be closed independently, so termination becomes a 4-step exchange.

1. Client → FIN

"I'm done sending, but I can still receive." State: FIN_WAIT_1.

2. Server → ACK

Server becomes CLOSE_WAIT, client becomes FIN_WAIT_2.

3. Server → FIN

Sent once the server application has also called close().

4. Client → ACK + TIME_WAIT

Wait 2 × MSL (Linux uses a fixed 60 s) so that the final ACK can be resent if the server retransmits its FIN, and so that delayed old segments don't sneak into a new connection.
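
The four steps can be read as two small state machines, one for the side that closes first and one for the side that closes second. The sketch below covers only the close path; the event labels are invented for this example, and LAST_ACK and CLOSED are standard TCP state names that the steps above do not spell out.

```python
# (current state, event) -> next state, for the close sequence only.
TRANSITIONS = {
    # Active close: the side that sends the first FIN.
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",   # replies with the final ACK
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    # Passive close: the side that receives the first FIN.
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",    # only once the app calls close()
    ("LAST_ACK", "recv ACK"): "CLOSED",
}

ACTIVE_CLOSE = ["send FIN", "recv ACK", "recv FIN", "2MSL timeout"]
PASSIVE_CLOSE = ["recv FIN", "send FIN", "recv ACK"]


def walk(start: str, events: list[str]) -> list[str]:
    """Return every state visited while applying the events in order."""
    states = [start]
    for event in events:
        states.append(TRANSITIONS[(states[-1], event)])
    return states


if __name__ == "__main__":
    print("client:", " -> ".join(walk("ESTABLISHED", ACTIVE_CLOSE)))
    print("server:", " -> ".join(walk("ESTABLISHED", PASSIVE_CLOSE)))
```

Only the active closer passes through TIME_WAIT, and the passive closer stays in CLOSE_WAIT for as long as the application delays its close().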

▸ What "port exhaustion" really is

Opening huge numbers of short-lived sessions piles up TIME_WAIT entries on the side that closes first and can eventually cause "port exhaustion", a classic problem from the HTTP/1.0 era, before persistent (keep-alive) connections became the default in HTTP/1.1. CLOSE_WAIT hanging around for a long time, on the other hand, is the signature of the application forgetting to call close(); the only real fix is in the app.

Sequence numbers and ACKs — ordering, loss, duplicates

The two most important fields in the TCP header are the sequence number (32-bit) and the ACK number (32-bit).

Sequence number — the position (in bytes, on this connection) of this segment's first byte
ACK number — the next byte the receiver wants (= "I've received everything up to here")

Example: the server sends "500 bytes starting at seq=1000", and the client replies "ack=1500". That single mechanism solves three problems at once.

Three things sequence numbers + ACKs deliver

# Loss detection
No ACK arrives → RTO (Retransmission Timeout) expires → retransmit
# Reordering
Segments that arrive out of order are sorted by seq before delivery to the app
# De-duplication
Segments with a seq we've already seen are dropped
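
A toy receiver makes the reordering and de-duplication rules concrete. This is a simplified sketch, not how any kernel implements it: the class and method names are made up, segments are assumed never to overlap partially, and sequence-number wraparound is ignored.

```python
class Receiver:
    """Toy TCP receiver: reorders, drops duplicates, returns cumulative ACKs."""

    def __init__(self, initial_seq: int) -> None:
        self.next_expected = initial_seq        # value that goes into the ACK field
        self.buffered: dict[int, bytes] = {}    # out-of-order segments keyed by seq
        self.delivered = bytearray()            # bytes handed to the application

    def on_segment(self, seq: int, payload: bytes) -> int:
        """Process one segment and return the ACK number to send back."""
        if seq < self.next_expected:
            return self.next_expected           # already seen: drop the duplicate
        self.buffered.setdefault(seq, payload)  # keep it until the gap is filled
        # Deliver every segment that is now contiguous with what was delivered.
        while self.next_expected in self.buffered:
            chunk = self.buffered.pop(self.next_expected)
            self.delivered += chunk
            self.next_expected += len(chunk)
        return self.next_expected


if __name__ == "__main__":
    rx = Receiver(initial_seq=1000)
    print(rx.on_segment(1000, b"a" * 500))  # in order
    print(rx.on_segment(2000, b"c" * 500))  # arrives early, is buffered
    print(rx.on_segment(1000, b"a" * 500))  # duplicate, is dropped
    print(rx.on_segment(1500, b"b" * 500))  # fills the gap
```

The ACK number stays at the first missing byte until the gap is filled, which is how the sender learns that something in between was lost.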

▸ SACK tells the peer about the "holes"

In practice, SACK (Selective ACK, RFC 2018) is in use almost everywhere: the receiver tells the sender "I have 1000-1500 and 2500-3000 but not 1500-2500", marking the holes so only the missing pieces are retransmitted.
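
The "holes" are simply the gaps between the byte ranges the receiver already holds. A minimal sketch of that computation, assuming half-open [start, end) ranges and a hypothetical function name (the real SACK option encodes the received blocks, and the sender derives the gaps):

```python
def missing_ranges(received: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the gaps between received half-open byte ranges [start, end)."""
    holes = []
    covered_until = None  # highest byte offset covered so far
    for start, end in sorted(received):
        if covered_until is not None and start > covered_until:
            holes.append((covered_until, start))  # nothing received in between
        covered_until = end if covered_until is None else max(covered_until, end)
    return holes


if __name__ == "__main__":
    # The receiver holds 1000-1500 and 2500-3000, as in the example above.
    print(missing_ranges([(1000, 1500), (2500, 3000)]))
```

With this information the sender can retransmit just the missing 1500-2500 instead of everything after byte 1500.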

Flow control — the receive window (rwnd)

Flow control is the mechanism that prevents the sender from outrunning the receiver's processing capacity. On every ACK, the receiver advertises "I can take this many bytes from here" (= the receive window, Window Size) in the header, and the sender keeps the amount of unacknowledged data in flight within that window.

The TCP header's Window Size field is 16-bit (max 65,535), which isn't enough for modern high-speed links. The Window Scale option (RFC 7323) left-shifts the window by up to 14 bits, allowing it to reach about 1 GiB (65,535 × 2^14 bytes).
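
The scaling arithmetic is small enough to check directly. A minimal sketch; the function name is made up for illustration:

```python
MAX_RAW_WINDOW = 0xFFFF  # largest value the 16-bit Window Size field can hold
MAX_SHIFT = 14           # largest shift count the Window Scale option allows


def effective_window(raw_window: int, shift: int) -> int:
    """Return the window in bytes after applying the negotiated scale factor."""
    if not 0 <= raw_window <= MAX_RAW_WINDOW:
        raise ValueError("raw window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return raw_window << shift


if __name__ == "__main__":
    print(effective_window(MAX_RAW_WINDOW, 0))          # no scaling
    print(effective_window(MAX_RAW_WINDOW, MAX_SHIFT))  # maximum scaling
```

The shift count is exchanged once, in the SYN segments, and then applies to every window value sent for the rest of the connection.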

▸ Why "the download tops out around 80 MB/s"

Often the real cause is the receive window not being able to grow because the kernel-side buffer is too small. Tuning net.ipv4.tcp_rmem or the application's SO_RCVBUF can lift the ceiling.
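
The ceiling follows from one relation: a sender can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. The sketch below shows that arithmetic and how an application might request a larger receive buffer. The function names and the 4 MB / 50 ms figures are illustrative assumptions, not measurements.

```python
import socket


def max_throughput_bytes_per_s(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes / rtt_seconds


def window_needed_bytes(target_bytes_per_s: float, rtt_seconds: float) -> float:
    """Window required to sustain the target rate (the bandwidth-delay product)."""
    return target_bytes_per_s * rtt_seconds


def request_receive_buffer(sock: socket.socket, size_bytes: int) -> int:
    """Ask the OS for a receive buffer and return the size it actually reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # Assumed numbers: a 4 MB window over a 50 ms path.
    print(max_throughput_bytes_per_s(4_000_000, 0.050) / 1e6, "MB/s")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("granted:", request_receive_buffer(s, 8 * 1024 * 1024))
```

The granted size can differ from the request: on Linux the kernel typically doubles the value for bookkeeping and caps it at net.core.rmem_max, and setting SO_RCVBUF explicitly turns off receive-buffer autotuning for that socket, so it is worth measuring before and after.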

Congestion control — estimating path congestion

Flow control (rwnd) only sees the two endpoints' situation. Congestion along the path is the job of congestion control. The sender keeps a second window — the congestion window (cwnd) — and what it can actually send is throttled to min(rwnd, cwnd).

Algorithm	How it works (essence)	Adoption
Reno (1990)	Assumes loss = congestion → halves cwnd on loss, then resumes linear growth	Old standard
CUBIC (2008)	Cubic-curve recovery of cwnd, strong on high-BDP links	Linux default (since 2.6.19)
BBR (2016)	Models estimated bandwidth × RTT directly rather than loss	Widely used at Google (YouTube, GCP)

Slow Start doubles cwnd at every round trip, the algorithm switches to Congestion Avoidance at a threshold, and loss cuts it down — the classic "exponential → linear → braking" 3-phase cycle. BBR throws away the underlying assumption "loss = slow down" entirely. Wi-Fi paths with high retransmits, satellite links with long RTT, ultra-low-RTT data-center fabric — each shows very different numbers, so modern kernels let you switch via sysctl.
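
The three-phase cycle is easy to see in a round-by-round simulation. This is a deliberately simplified Reno-style sketch: cwnd is counted in segments, loss is injected at chosen rounds rather than detected, and the function name and parameters are made up for illustration. Real stacks also distinguish timeouts from fast retransmit.

```python
def simulate_cwnd(rounds: int, ssthresh: int, loss_rounds: set[int],
                  initial_cwnd: int = 1) -> list[int]:
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Braking: loss is read as congestion, so halve the window.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow Start: double every round trip, up to the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion Avoidance: grow by one segment per round trip.
            cwnd += 1
    return history


if __name__ == "__main__":
    for rtt, cwnd in enumerate(simulate_cwnd(10, ssthresh=16, loss_rounds={7})):
        print(f"round {rtt}: cwnd={cwnd} " + "#" * cwnd)
```

Plotting the history gives the familiar sawtooth; BBR does not follow this shape because it does not treat loss as the primary signal.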

Switching the congestion control algorithm (Linux)

$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr

Connection state — the six worth watching

The TCP state machine has roughly 11 states, but only a handful actually matter day to day.

State	Meaning	When to care
`LISTEN`	Server is waiting for incoming connections	Confirming the service is up
`ESTABLISHED`	Data is flowing	Active connections — is the count too high?
`SYN_SENT` / `SYN_RECV`	Mid-handshake	Many SYN_RECV = a sign of SYN flood
`FIN_WAIT_1` / `FIN_WAIT_2`	After your side sent FIN	Normally clears quickly
`CLOSE_WAIT`	You got the peer's FIN but haven't called close yet	Long-lived = an app bug
`TIME_WAIT`	Active-close side / 2 MSL wait	Often plentiful, usually fine

▸ Anti-pattern: tcp_tw_recycle

A large pile of TIME_WAIT is normal; trying to forcibly squash it via tcp_tw_recycle (removed in Linux 4.12) is a well-known historical anti-pattern. It was known to break clients behind NAT.

UDP — when reliability is "in the way"

Every piece of TCP's machinery is paying a cost to give you "arrived / in order / no duplicates". For real-time voice, video, or games — where "show a newer frame even if a few pixels are missing, rather than wait for stale data to catch up" is what you actually want — TCP's guarantees get in the way.

UDP (User Datagram Protocol, RFC 768) is designed at the opposite extreme:

Connectionless — no handshake, just send
Header is only 8 bytes (source port / destination port / length / checksum)
No retransmission / no ordering / no flow control / no congestion control
The app provides whatever guarantees it needs, itself
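
"No handshake, just send" is visible in the socket API itself. A minimal loopback sketch using Python's standard socket module; the function name is made up, and on a real network the datagram could be lost, duplicated, or reordered with nothing to recover it.

```python
import socket


def udp_round_trip(message: bytes) -> bytes:
    """Send one datagram to a local receiver and return what arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)         # UDP gives no delivery signal, so cap the wait
        # No connect(), no accept(), no handshake: the first packet is the data.
        sender.sendto(message, receiver.getsockname())
        data, _sender_addr = receiver.recvfrom(2048)
        return data


if __name__ == "__main__":
    print(udp_round_trip(b"hello over UDP"))
```

The timeout is the application supplying its own guarantee: without it, a lost datagram would leave recvfrom() blocked forever.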

Representative cases that pick UDP:

DNS queries — a TCP handshake is overkill when one round trip is enough
VoIP / RTP / WebRTC — re-delivering a 0.5-second-old voice packet is pointless
Games — newer inputs always beat older positions
DHCP / NTP / TFTP — bootstrap-oriented

QUIC — rebuilding TCP+TLS+HTTP on top of UDP

TCP is a protocol whose reliability and congestion control live in the OS kernel. For a long time that was an advantage, but it has become a weakness: evolution is slow (new algorithms require an OS update), and running the TCP and TLS handshakes in series (TCP 1 RTT + TLS 1.3 1 RTT = 2 RTT; older TLS versions add one more) is heavy for today's web.

QUIC (RFC 9000, 2021) rebuilds all of this in userspace, on top of UDP:

Transport and TLS handshakes merged into one — TLS 1.3 is built in; first connection is 1 RTT, repeat connection is 0 RTT
Multiple streams — independent streams multiplexed over one connection, eliminating TCP's head-of-line blocking
Connection ID — IP/port can change (Wi-Fi → 4G handover) without breaking the connection
Userspace implementation — new congestion control algorithms ship far faster
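
The handshake savings translate directly into time before the first response arrives. A back-of-the-envelope sketch, assuming TLS 1.3's one-round-trip handshake, ignoring server processing time, and using a made-up 50 ms RTT:

```python
# Round trips spent on handshakes before the request can be sent.
HANDSHAKE_RTTS = {
    "TCP + TLS 1.3": 2,            # TCP handshake, then TLS handshake
    "QUIC (first connection)": 1,  # transport and TLS handshakes combined
    "QUIC (0-RTT resumption)": 0,  # request rides along with the first packet
}


def time_to_first_response_ms(rtt_ms: float, handshake_rtts: int) -> float:
    """Handshake round trips plus one more for the request and its response."""
    return (handshake_rtts + 1) * rtt_ms


if __name__ == "__main__":
    rtt_ms = 50.0  # assumed round-trip time
    for name, rtts in HANDSHAKE_RTTS.items():
        print(f"{name}: {time_to_first_response_ms(rtt_ms, rtts):.0f} ms")
```

The gap grows linearly with RTT, which is why the difference is most noticeable on mobile and long-distance paths.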

HTTP/3 is "HTTP over QUIC", broadly adopted by Google / Cloudflare / Meta / Akamai. As of 2026 it has become a standard option for new infrastructure. The simple slogan "UDP = no reliability" became dated the day QUIC shipped.

Seeing the packets for real

Everything covered above can be observed live with tcpdump and ss.

Watching just SYN / FIN with tcpdump

$ sudo tcpdump -nn -i any 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'
12:34:56.123 IP 10.0.0.5.50234 > 93.184.216.34.443: Flags [S], seq 100, ...
12:34:56.145 IP 93.184.216.34.443 > 10.0.0.5.50234: Flags [S.], seq 200, ack 101, ...
# the handshake's final ACK (Flags [.]) carries neither SYN nor FIN, so this filter does not print it
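
The filter expression tests individual bits of the TCP flags byte. The sketch below mirrors that test in Python to show which segments it keeps; the bit values are the standard TCP flag positions, and the function name is made up for illustration.

```python
# Standard bit positions in the TCP flags byte.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def matches_syn_or_fin(flags: int) -> bool:
    """Mirror of the filter: tcp[tcpflags] & (tcp-syn|tcp-fin) != 0."""
    return (flags & (SYN | FIN)) != 0


if __name__ == "__main__":
    samples = {
        "[S]  SYN": SYN,
        "[S.] SYN+ACK": SYN | ACK,
        "[.]  ACK only": ACK,
        "[P.] PSH+ACK (data)": PSH | ACK,
        "[F.] FIN+ACK": FIN | ACK,
    }
    for label, flags in samples.items():
        print(f"{label}: {'shown' if matches_syn_or_fin(flags) else 'filtered out'}")
```

Pure ACKs and data segments carry neither bit, so the capture shows only the opening and closing of each connection.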

Per-state counts and per-connection stats with ss

# breakdown by state — the kind of number you want on a dashboard
$ ss -tan | awk 'NR>1 {s[$1]++} END {for(k in s) print k, s[k]}'
ESTAB 142
TIME-WAIT 38
LISTEN 12
CLOSE-WAIT 2
# list ESTABLISHED on a specific port
$ ss -tan state established sport = :443
# RTT / cwnd / retransmits and the rest (Linux)
$ ss -tin
cubic wscale:7,7 rto:204 rtt:3.142/1.5 ato:40 mss:1448 cwnd:10 ssthresh:7 bytes_sent:8421 retrans:0/2

▸ The output where every investigation begins

The cwnd:N / rtt:M/V / retrans:R/T shown by ss -tin reveals the congestion window (in segments), the smoothed round-trip time and its variation (in ms), and retransmits (currently outstanding / total) directly. "The page is slow," "transfer collapses between these two servers only": almost every TCP investigation starts from this output.
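
For dashboards or scripted checks, those fields can be pulled out of the text. A minimal parsing sketch run against the sample line shown earlier; the function names are made up, and the exact set of fields ss prints varies by kernel and iproute2 version, so treat the format as an assumption to verify locally.

```python
SAMPLE = ("cubic wscale:7,7 rto:204 rtt:3.142/1.5 ato:40 mss:1448 "
          "cwnd:10 ssthresh:7 bytes_sent:8421 retrans:0/2")


def parse_ss_info(line: str) -> dict[str, str]:
    """Split an `ss -tin` info line into key:value pairs, skipping bare words."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:  # tokens without a colon (e.g. the algorithm name) are skipped
            fields[key] = value
    return fields


def summarize(line: str) -> dict[str, float]:
    """Extract the values most investigations start from."""
    fields = parse_ss_info(line)
    rtt_ms, rttvar_ms = (float(v) for v in fields["rtt"].split("/"))
    _outstanding, total = (int(v) for v in fields["retrans"].split("/"))
    return {
        "cwnd_segments": int(fields["cwnd"]),
        "rtt_ms": rtt_ms,
        "rttvar_ms": rttvar_ms,
        "retrans_total": total,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A connection with no retransmissions may omit the retrans field entirely, so production code would need to handle missing keys.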

Summary

TCP/IP is the Internet's "courier service", and the heart of the four-layer model is Transport (TCP) and Internet (IP)
TCP's essence is "a chain of mechanisms that build reliability on top of best-effort IP" = handshake + sequence numbers + receive window + congestion window
UDP shows "what happens when you throw all of those mechanisms away"; QUIC shows "what you get when you redesign the whole stack in userspace". The two depart from TCP in opposite directions
Once you understand the mechanisms, reading ss -tin, deciding whether to adopt HTTP/3, and picking a congestion-control algorithm all become things you can reason about on your own terms