TCP/IP Explained — The 4-Layer Model and TCP vs UDP

⏱ approx. 18 min · 2026-05-09

TCP/IP means both "the protocol suite that runs the Internet" and "the four-layer reference model that organizes it". This article starts with the four-layer overview, then centers on TCP and the question of how it builds reliability on top of IP's best-effort delivery. The mechanisms are walked through in order: 3-way handshake / sequence numbers / receive window / congestion control / connection state. It ends by watching the actual behavior live with ss and tcpdump.

01

The 4-layer model at a glance #

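The TCP/IP model splits communication into four layers. Unlike OSI's seven-layer model, it doesn't treat the session (OSI L5) and presentation (OSI L6) layers as separate; both are absorbed into the Application layer. The L1 to L4 labels in the table below are this model's own numbering, not OSI's.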

Layer | Name | Role | Representative protocols
L4 | Application | Rules that give meaning to the bytes that arrived | HTTP / DNS / SSH / SMTP / TLS / gRPC
L3 | Transport | Logical "app ↔ app" connection — identified by port | TCP / UDP / QUIC / SCTP
L2 | Internet | Connects "host ↔ host" at planetary scale — identified by IP | IPv4 / IPv6 / ICMP / IPsec / OSPF / BGP
L1 | Link | The physical "how does it reach the next hop" | Ethernet / Wi-Fi / PPP / ARP / fiber
▸ The core design — the Internet layer is best-effort

The IP layer makes no delivery guarantee and no ordering guarantee. "TCP picks up the slack when reliability is needed" and "UDP passes through thinly when it isn't" — that separation is the fundamental reason the Internet has been able to stretch into every imaginable use case.

Restating each layer's job in one sentence:

  • Application — the rules that give bytes meaning, like HTTP request/response or DNS query/answer
  • Transport — which app, which connection, optionally with reliability and an encrypted session
  • Internet — get this packet to the host with that IP, from anywhere on Earth (best-effort)
  • Link — right now, on this cable or radio, hand the frame to the neighbor in front of me
02

The "four guarantees" TCP builds on top of IP #

IP is a service that does nothing more than "write an address on an envelope and drop it in the mail" — no delivery guarantee, no ordering guarantee, no de-duplication guarantee. The reason almost every app on the Internet can still be written assuming "what I send arrives in order, without losses, without duplicates" is that TCP sits in between and hides IP's uncertainty.

What TCP provides is exactly four guarantees:

  1. Connection — a handshake establishes "we're about to talk"
  2. Order + de-duplication + loss recovery — sequence numbers + ACKs + retransmission
  3. Flow control — the receiver advertises a window that tells the sender how many more bytes it can accept
  4. Congestion control — estimate path congestion and automatically pace the sender
03

3-way handshake and 4-way teardown #

TCP's defining property is that it is connection-oriented. Before any data flows, both ends agree "we're about to talk" / "go ahead" in three messages.

1. SYN (Client → Server)
The client picks seq=x and sends. State becomes SYN_SENT.
2. SYN+ACK (Server → Client)
The server picks its own seq=y and replies with ack=x+1. State: SYN_RECV.
3. ACK (Client → Server)
The client returns ack=y+1. Both sides become ESTABLISHED — application data can now flow.

The essence of the 3-way handshake is "make each side confirm the other's sequence number, in both directions". Each direction has its own 32-bit sequence number; SYN announces "I'll start from this number", and ACK answers "I'm waiting from +1". Doing that exchange in both directions would take four messages, but the server's ACK and its own SYN travel in the same segment, which leaves three.
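The numbers exchanged in the three messages can be traced with a small simulation. This is only a sketch, not a TCP implementation: the function name three_way_handshake and the dictionary layout are made up for illustration, and the initial sequence numbers are passed in by hand, whereas a real stack picks them unpredictably.

```python
SEQ_MOD = 2**32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the three handshake segments as plain dicts (illustrative only)."""
    syn = {"from": "client", "flags": "SYN", "seq": client_isn, "ack": None}
    # A SYN consumes one sequence number, so the server acknowledges x+1.
    syn_ack = {
        "from": "server",
        "flags": "SYN+ACK",
        "seq": server_isn,
        "ack": (client_isn + 1) % SEQ_MOD,
    }
    # The client acknowledges the server's SYN the same way: y+1.
    ack = {
        "from": "client",
        "flags": "ACK",
        "seq": (client_isn + 1) % SEQ_MOD,
        "ack": (server_isn + 1) % SEQ_MOD,
    }
    return [syn, syn_ack, ack]


segments = three_way_handshake(client_isn=100, server_isn=200)
for segment in segments:
    print(segment)
```

With x=100 and y=200 the second segment carries ack=101 and the third carries ack=201, matching the steps listed above.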

4-way teardown and TIME_WAIT #

TCP is designed so that each direction can be closed independently, so termination becomes a 4-step exchange.

1. Client → FIN
"I'm done sending, but I can still receive." State: FIN_WAIT_1.
2. Server → ACK
Server becomes CLOSE_WAIT, client becomes FIN_WAIT_2.
3. Server → FIN
Sent once the server application has also called close().
4. Client → ACK + TIME_WAIT
Wait 2 MSL (Linux uses a fixed 60 s) so that delayed old segments can't sneak into a new connection, and so that the final ACK can be resent if the server retransmits its FIN.
▸ What "port exhaustion" really is

Opening huge numbers of short-lived connections piles up TIME_WAIT entries on whichever side closes first and can eventually cause "port exhaustion". This was a classic problem of web servers in the HTTP/1.0 era, before keep-alive became the default in HTTP/1.1. CLOSE_WAIT hanging around for a long time, on the other hand, is the signature of the application forgetting to call close(); the only real fix is in the app.
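The "each direction closes independently" behavior can be tried on the loopback interface. The following is a minimal sketch under assumptions: the serve helper and the b"got " reply format are invented for the demo, and the kernel performs the actual FIN/ACK exchange behind shutdown() and close().

```python
import socket
import threading


def serve(listener):
    conn, _ = listener.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # recv() returning b"" means the peer's FIN arrived
                break
            chunks.append(data)
        # Our own direction is still open, so a reply is still possible.
        conn.sendall(b"got " + b"".join(chunks))
    # Leaving the with-block calls close(), which sends our own FIN.


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
worker = threading.Thread(target=serve, args=(listener,))
worker.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
client.shutdown(socket.SHUT_WR)  # send FIN: "done sending, can still receive"

reply = b""
while True:
    data = client.recv(1024)
    if not data:
        break
    reply += data

client.close()
worker.join()
listener.close()
print(reply)
```

The client receives the reply after it has already sent its FIN. If the server side never reached the end of its with-block, its socket would sit in CLOSE_WAIT, which is the app bug described above.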

04

Sequence numbers and ACKs — ordering, loss, duplicates #

The two most important fields in the TCP header are the sequence number (32-bit) and the ACK number (32-bit).

  • Sequence number — the position (in bytes, on this connection) of this segment's first byte
  • ACK number — the next byte the receiver wants (= "I've received everything up to here")

Example: the server sends "500 bytes starting at seq=1000", and the client replies "ack=1500". That single mechanism solves three problems at once.

Three things sequence numbers + ACKs deliver
  • Loss detection: no ACK arrives → RTO (Retransmission Timeout) expires → retransmit
  • Reordering: segments that arrive out of order are sorted by seq before delivery to the app
  • De-duplication: segments with a seq we've already seen are dropped
▸ SACK tells the peer about the "holes"

In practice, SACK (Selective ACK, RFC 2018) is in use almost everywhere: the receiver tells the sender "I have 1000-1500 and 2500-3000 but not 1500-2500", marking the holes so only the missing pieces are retransmitted.
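The receiver-side bookkeeping can be sketched in a few lines. This is a simplified model, not how any real stack is written: the receive function and its return shape are hypothetical, and segments are assumed to be aligned (a retransmission repeats exactly the same byte range, with no partial overlaps).

```python
def receive(segments, start_seq):
    """Process (seq, payload) pairs.

    Returns (bytes delivered in order, cumulative ACK number, SACK blocks).
    """
    next_seq = start_seq  # next byte the receiver wants, i.e. the ACK number
    buffered = {}         # out-of-order segments, keyed by their seq
    delivered = bytearray()

    for seq, payload in segments:
        if seq + len(payload) <= next_seq or seq in buffered:
            continue      # already seen: drop the duplicate
        buffered[seq] = payload
        # Hand over everything that has become contiguous.
        while next_seq in buffered:
            data = buffered.pop(next_seq)
            delivered += data
            next_seq += len(data)

    # Anything still buffered sits beyond a hole: report it as SACK blocks.
    sack_blocks = []
    for start in sorted(buffered):
        end = start + len(buffered[start])
        if sack_blocks and sack_blocks[-1][1] == start:
            sack_blocks[-1] = (sack_blocks[-1][0], end)  # merge adjacent ranges
        else:
            sack_blocks.append((start, end))
    return bytes(delivered), next_seq, sack_blocks


arrivals = [
    (1000, b"a" * 500),   # bytes 1000-1500
    (2500, b"c" * 500),   # bytes 2500-3000, arrives before the hole is filled
    (1000, b"a" * 500),   # duplicate of the first segment
]
data, ack, sack = receive(arrivals, start_seq=1000)
print(len(data), ack, sack)

# After the sender retransmits only the missing range 1500-2500:
data2, ack2, sack2 = receive(arrivals + [(1500, b"b" * 1000)], start_seq=1000)
print(len(data2), ack2, sack2)
```

The first call reproduces the situation from the text: ack=1500 with one SACK block for 2500-3000. Once the hole is filled, the ACK jumps to 3000 and the SACK list is empty.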

05

Flow control — the receive window (rwnd) #

Flow control is the mechanism that prevents the sender from outrunning the receiver's processing capacity. On every ACK, the receiver advertises "I can take this many bytes from here" (= the receive window, Window Size) in the header, and the sender keeps "unacknowledged in flight" below that window.

The TCP header's Window Size field is 16-bit (max 65,535), which isn't enough for modern high-speed links. The Window Scale option (RFC 7323) left-shifts the window by up to 14 bits, allowing it to reach up to 1 GB.
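The arithmetic behind those limits is a plain left shift. The sketch below only does the multiplication; the function name effective_window is made up, and the actual option is negotiated in the SYN segments.

```python
MAX_WINDOW_FIELD = 2**16 - 1  # largest value the 16-bit Window Size field can hold
MAX_SHIFT = 14                # largest shift the Window Scale option allows


def effective_window(window_field, shift):
    """Window in bytes that a header value represents under a given scale shift."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return window_field << shift


print(effective_window(65535, 0))   # 65535
print(effective_window(65535, 7))   # 8388480
print(effective_window(65535, 14))  # 1073725440
```

The last value is just under 2**30 bytes, which is where the "up to 1 GB" figure comes from.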

▸ Why "the download tops out around 80 MB/s"

Often the real cause is the receive window not being able to grow because the kernel-side buffer is too small. Tuning net.ipv4.tcp_rmem or the application's SO_RCVBUF can lift the ceiling.
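Why a small window caps throughput follows from one line of arithmetic: at most one window of data can be in flight per round trip. The sketch below applies it to made-up numbers (a 20 ms RTT and a 1 Gbit/s target); the function names are hypothetical.

```python
def max_throughput(window_bytes, rtt_seconds):
    """Upper bound in bytes/s: one window per round trip."""
    return window_bytes / rtt_seconds


def required_window(target_bytes_per_s, rtt_seconds):
    """Window in bytes needed to sustain the target rate (bandwidth-delay product)."""
    return target_bytes_per_s * rtt_seconds


unscaled = max_throughput(65_535, 0.020)       # unscaled 16-bit window, 20 ms RTT
needed = required_window(125_000_000, 0.020)   # 1 Gbit/s = 125,000,000 bytes/s

print(f"ceiling with a 65,535-byte window: {unscaled / 1e6:.2f} MB/s")
print(f"window needed for 1 Gbit/s:        {needed / 1e6:.2f} MB")
```

Under these assumptions the unscaled window tops out at roughly 3.3 MB/s, while filling the link needs about 2.5 MB of window, so the receive buffer has to be at least that large.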

06

Congestion control — estimating path congestion #

Flow control (rwnd) only sees the two endpoints' situation. Congestion along the path is the job of congestion control. The sender keeps a second window — the congestion window (cwnd) — and what it can actually send is throttled to min(rwnd, cwnd).

Algorithm | How it works (essence) | Adoption
Reno (1990) | Assumes loss = congestion → halve cwnd and restart | Old standard
CUBIC (2008) | Cubic-curve recovery of cwnd, strong on high-BDP links | Linux default (since 2.6.19)
BBR (2016) | Models estimated bandwidth × RTT directly rather than loss | Widely used at Google (YouTube, GCP)

Slow Start doubles cwnd at every round trip, the algorithm switches to Congestion Avoidance at a threshold, and loss cuts it down — the classic "exponential → linear → braking" 3-phase cycle. BBR throws away the underlying assumption "loss = slow down" entirely. Wi-Fi paths with high retransmits, satellite links with long RTT, ultra-low-RTT data-center fabric — each shows very different numbers, so modern kernels let you switch via sysctl.
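The "exponential → linear → braking" cycle can be traced with a toy model. This is a deliberately simplified Reno-style sketch and not what any kernel actually runs: cwnd is counted in whole segments, one loop iteration stands for one round trip, and the loss rounds are supplied by hand. The names simulate_cwnd and loss_rounds are hypothetical.

```python
def simulate_cwnd(rounds, ssthresh, loss_rounds):
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = 1
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            ssthresh = max(cwnd // 2, 2)    # braking: remember half the window...
            cwnd = ssthresh                 # ...and continue from there
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per round trip
        else:
            cwnd += 1                       # congestion avoidance: +1 per round trip
    return trace


trace = simulate_cwnd(rounds=12, ssthresh=16, loss_rounds={8})
print(trace)
# [1, 2, 4, 8, 16, 17, 18, 19, 20, 10, 11, 12]
```

The trace doubles up to the threshold of 16, grows by one segment per round trip, and drops to half after the loss in round 8. The usable send window would still be min(rwnd, cwnd), which this sketch leaves out.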

Switching the congestion control algorithm (Linux)
$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr
07

Connection state — the six worth watching #

The TCP state machine has roughly 11 states, but only a handful actually matter day to day.

State | Meaning | When to care
LISTEN | Server socket is waiting for incoming connections | Confirming the service is up
ESTABLISHED | Data is flowing | Active connections — is the count too high?
SYN_SENT / SYN_RECV | Mid-handshake | Many SYN_RECV = a sign of SYN flood
FIN_WAIT_1 / FIN_WAIT_2 | After your side sent FIN | Normally clears quickly
CLOSE_WAIT | You got the peer's FIN but haven't called close yet | Long-lived = an app bug
TIME_WAIT | Active-close side / 2 MSL wait | Often plentiful, usually fine
▸ Anti-pattern: tcp_tw_recycle

A large pile of TIME_WAIT is normal; trying to forcibly squash it via tcp_tw_recycle (removed in Linux 4.12) is a well-known historical anti-pattern. It was known to break clients behind NAT.
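Counting sockets per state is simple enough to script. The sketch below works on a made-up sample shaped like ss -tan output (a header line, then one row per socket with the state in the first column); the addresses and the count_states helper are invented for illustration.

```python
from collections import Counter

# Made-up sample in the shape of `ss -tan` output.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    0.0.0.0:443          0.0.0.0:*
ESTAB      0      0      10.0.0.5:443         203.0.113.7:51000
ESTAB      0      0      10.0.0.5:443         203.0.113.8:51001
CLOSE-WAIT 1      0      10.0.0.5:443         203.0.113.9:51002
TIME-WAIT  0      0      10.0.0.5:50234       93.184.216.34:443
"""


def count_states(ss_output):
    """Count rows per state, skipping the header line."""
    rows = ss_output.strip().splitlines()[1:]
    return Counter(row.split()[0] for row in rows if row.strip())


counts = count_states(SAMPLE)
for state, n in sorted(counts.items()):
    print(state, n)
```

Note that ss abbreviates and hyphenates state names (ESTAB, CLOSE-WAIT, TIME-WAIT), so a check for long-lived CLOSE_WAIT sockets would look up counts["CLOSE-WAIT"].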

08

UDP — when reliability is "in the way" #

Every piece of TCP's machinery is paying a cost to give you "arrived / in order / no duplicates". For real-time voice, video, or games — where "show a newer frame even if a few pixels are missing, rather than wait for stale data to catch up" is what you actually want — TCP's guarantees get in the way.

UDP (User Datagram Protocol, RFC 768) is designed at the opposite extreme:

  • Connectionless — no handshake, just send
  • Header is only 8 bytes (source port / destination port / length / checksum)
  • No retransmission / no ordering / no flow control / no congestion control
  • The app provides whatever guarantees it needs, itself
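The 8-byte header is small enough to build by hand. The sketch below packs the four fields with Python's struct module; build_udp_header is a hypothetical helper, and the checksum is left at 0 as a simplification (a real checksum also covers a pseudo-header with the IP addresses, which is omitted here).

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")  # four 16-bit fields, network byte order


def build_udp_header(src_port, dst_port, payload, checksum=0):
    """Pack source port, destination port, length, and checksum."""
    length = UDP_HEADER.size + len(payload)  # length covers header + payload
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)


header = build_udp_header(50234, 53, b"example")
print(len(header))                # 8
print(UDP_HEADER.unpack(header))  # (50234, 53, 15, 0)
```

There is no sequence number, ACK number, or window field to fill in, which is the whole point of the list above.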

Representative cases that pick UDP:

  • DNS queries — a TCP handshake is overkill when one round trip is enough
  • VoIP / RTP / WebRTC — re-delivering a 0.5-second-old voice packet is pointless
  • Games — newer inputs always beat older positions
  • DHCP / NTP / TFTP — bootstrap-oriented
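"No handshake, just send" looks like this at the socket level. The sketch runs on the loopback interface, where delivery is practically certain; over a real network the datagram could be lost, which is why the receiver sets a timeout instead of waiting forever.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
receiver.settimeout(2.0)         # UDP gives no delivery guarantee

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# No connect(), no listen(), no accept(): the first packet already carries data.
sender.sendto(b"ping", receiver.getsockname())

data, addr = receiver.recvfrom(1024)
print(data, "from", addr)

sender.close()
receiver.close()
```

If the timeout fires, deciding whether to resend, give up, or carry on with newer data is entirely up to the application.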
09

QUIC — rebuilding TCP+TLS+HTTP on top of UDP #

TCP is a protocol whose reliability and congestion control live in the OS kernel. For a long time that was an advantage, but it has become a weakness: evolution is slow (new algorithms require an OS update), and stacking TCP and TLS handshakes in series (TCP 1 RTT + TLS 1.3 1 RTT = 2 RTT, and one more with TLS 1.2) is also heavy for today's web.

QUIC (RFC 9000, 2021) rebuilds all of this in userspace, on top of UDP:

  • Transport and TLS handshakes merged into one — TLS 1.3 is built in; first connection is 1 RTT, repeat connection can be 0 RTT (HTTP/3 then runs on top)
  • Multiple streams — independent streams multiplexed over one connection, eliminating TCP's head-of-line blocking
  • Connection ID — IP/port can change (Wi-Fi → 4G handover) without breaking the connection
  • Userspace implementation — new congestion control algorithms ship far faster
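What the saved round trips mean in wall-clock time depends only on the path RTT. The sketch below multiplies the round-trip counts quoted in the text by an assumed 50 ms RTT; the dictionary labels and the setup_delay_ms helper are made up, and server processing time is ignored.

```python
# Round trips spent on connection setup before the first request can be sent,
# using the counts given in the text.
SETUP_ROUND_TRIPS = {
    "TCP + TLS": 2,          # TCP 1 RTT + TLS 1 RTT
    "QUIC, first visit": 1,
    "QUIC, repeat visit": 0,
}


def setup_delay_ms(rtt_ms):
    """Setup delay per stack for a given round-trip time in milliseconds."""
    return {name: trips * rtt_ms for name, trips in SETUP_ROUND_TRIPS.items()}


delays = setup_delay_ms(50)
for name, ms in delays.items():
    print(f"{name}: {ms} ms")
```

On a 50 ms path that is 100 ms, 50 ms, and 0 ms respectively, and the gap widens linearly on longer paths such as mobile or intercontinental links.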

HTTP/3 is "HTTP over QUIC", broadly adopted by Google / Cloudflare / Meta / Akamai. As of 2026 it has become a standard option for new infrastructure. The simple slogan "UDP = no reliability" became dated the day QUIC shipped.

10

Seeing the packets for real #

Everything covered above can be observed live with tcpdump and ss.

Watching just SYN / FIN with tcpdump
# the handshake's final ACK (Flags [.], ack 201) has neither SYN nor FIN set, so this filter hides it
$ sudo tcpdump -nn -i any 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'
12:34:56.123 IP 10.0.0.5.50234 > 93.184.216.34.443: Flags [S], seq 100, ...
12:34:56.145 IP 93.184.216.34.443 > 10.0.0.5.50234: Flags [S.], seq 200, ack 101, ...
Per-state counts and per-connection stats with ss
# breakdown by state: the kind of number you want on a dashboard
$ ss -tan | awk 'NR>1 {s[$1]++} END {for(k in s) print k, s[k]}'
ESTAB 142
TIME-WAIT 38
LISTEN 12
CLOSE-WAIT 2
# list ESTABLISHED on a specific port
$ ss -tan state established sport = :443
# RTT / cwnd / retransmits and the rest (Linux)
$ ss -tin
cubic wscale:7,7 rto:204 rtt:3.142/1.5 ato:40 mss:1448 cwnd:10 ssthresh:7 bytes_sent:8421 retrans:0/2
▸ The output where every investigation begins

The cwnd:N / rtt:M/V / retrans:R/T shown by ss -tin reveals the congestion window, round-trip time, and retransmits directly. "The page is slow," "transfer collapses between these two servers only" — almost every TCP investigation ends back here.
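For dashboards or quick comparisons, the detail line can be split into fields mechanically. The sketch below parses the sample line shown in this article; parse_ss_info is a hypothetical helper, it assumes the key:value token layout seen in that sample, and it treats retrans as optional because real output may leave fields out.

```python
def parse_ss_info(line):
    """Extract cwnd, rtt, and retrans from an `ss -tin` detail line."""
    fields = {}
    for token in line.split():
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = value
    rtt, rttvar = (float(x) for x in fields["rtt"].split("/"))
    # retrans is shown as R/T; default to 0/0 when the field is absent.
    retrans = tuple(int(x) for x in fields.get("retrans", "0/0").split("/"))
    return {
        "cwnd": int(fields["cwnd"]),
        "rtt_ms": rtt,        # ss reports rtt and its variation in milliseconds
        "rttvar_ms": rttvar,
        "retrans": retrans,
    }


LINE = ("cubic wscale:7,7 rto:204 rtt:3.142/1.5 ato:40 mss:1448 "
        "cwnd:10 ssthresh:7 bytes_sent:8421 retrans:0/2")
info = parse_ss_info(LINE)
print(info)
```

Comparing these three values between a healthy connection and a slow one is usually enough to tell whether the path is dropping packets, the RTT is long, or the window simply never opened up.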

Summary #

  • TCP/IP is the Internet's "courier service", and the heart of the four-layer model is Transport (TCP) and Internet (IP)
  • TCP's essence is "a chain of mechanisms that build reliability on top of best-effort IP" = handshake + sequence numbers + receive window + congestion window
  • UDP shows "what happens when you throw all of those mechanisms away"; QUIC shows "what you get when you redesign the whole stack in userspace". The two depart from TCP in opposite directions
  • Once you understand the mechanisms, reading ss -tin, deciding whether to adopt HTTP/3, and picking a congestion-control algorithm all become things you can reason about on your own terms