TCP/IP Explained: 4-Layer Model and TCP vs UDP

⏱ approx. 20 min · LOG_DATE: 2026-05-09

TCP/IP

The phrase "TCP/IP" carries two meanings. One is the whole protocol suite anchored on TCP and IP (HTTP / DNS / TLS / SSH / ICMP / IP / Ethernet / …). The other is the four-layer reference model that organizes that suite. Because virtually every byte on the Internet rides on top of TCP/IP, "TCP/IP" and "the protocols that run the Internet" are nearly the same statement.

There are separate articles for IP (addressing and routing) and the OSI Reference Model (the 7-layer reference). This article carves out a different role: §1 is the four-layer overview, and §2 — the heart of the article — is what TCP itself is doing on top of IP, with §3 covering UDP, §4 covering QUIC, and §5 dropping into actual observed behavior.

1. The four-layer model

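In the TCP/IP model, communication is organized into four layers. Compared to OSI's seven, the major difference is that OSI's Session (L5) and Presentation (L6) layers don't get their own layer; they're absorbed into the Application layer. At the bottom, OSI's Physical (L1) and Data Link (L2) likewise merge into a single Link layer.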

TCP/IP four-layer model (higher = closer to apps / lower = closer to hardware; OSI L5–L7 collapses into Application here):

  • L4 Application: protocols that user-facing applications speak directly. HTTP / HTTPS / DNS / SSH / SMTP / IMAP / POP3 / FTP / NTP / SNMP / MQTT / gRPC. "How to interpret the bytes that arrived." TLS and compression also live here.
  • L3 Transport: logical "app ↔ app" connections, identified by port numbers. TCP (reliable) / UDP (best-effort) / QUIC (next-gen on UDP) / SCTP. The center of this article, TCP, is here.
  • L2 Internet: "host ↔ host" reach across the planet, with IP addresses identifying hosts. IPv4 / IPv6 / ICMP / IPsec / routing protocols (OSPF, BGP). "Best-effort": no delivery or ordering guarantee (TCP above is what fills that in).
  • L1 Link (Network Access): getting a frame to the next adjacent node. Ethernet (802.3) / Wi-Fi (802.11) / PPP / ARP / fiber / copper / radio. Combines OSI's Physical (L1) and Data Link (L2) into a single layer.

For the OSI 7-layer mapping, see the OSI reference-model article.

The role of each layer in one sentence:

  • Application: rules for "what the bytes mean" (HTTP request/response, DNS query/answer, SSH channel multiplexing, …)
  • Transport: identifies "which app's which connection", and optionally adds reliability (TCP) or an encrypted UDP session (QUIC)
  • Internet: gets "a packet to this IP, from anywhere on the planet" — pathing is best-effort
  • Link: gets "this frame to the adjacent node, right now, over this cable / radio"

That the Internet layer is best-effort (IP makes no promise of delivery) is the central design choice of TCP/IP. TCP fills the gap only when you need it; UDP stays out of the way when you don't. That separation is why the Internet was able to grow into so many shapes.

2. TCP: Building reliability on top of IP

IP is "write the address on the envelope and drop it in the post": best-effort, with no promise of delivery, ordering, or de-duplication. Almost every Internet application can still write code as if "what I send arrives, in order, exactly once," because TCP sits in between and hides IP's uncertainty.

TCP gives you four guarantees, each backed by a specific mechanism:

  1. Connection — handshake to agree on "we're going to talk now" (§2.1)
  2. Ordering + de-dup + loss recovery — sequence numbers + ACKs + retransmission (§2.2)
  3. Flow control — the receiver tells the sender, with a window, "don't send more than this" (§2.3)
  4. Congestion control — the sender estimates path congestion and adjusts pace automatically (§2.4)

2.1 3-way handshake and 4-way teardown

TCP's defining feature is being connection-oriented. Before any data flows, both ends agree — in three messages — that "we're going to talk." Closing is symmetric: each side independently says "I'm done sending."

The whole life of a TCP connection (3-way handshake → data transfer → 4-way teardown, on one page). The client is on one side, the server (starting in LISTEN) on the other.

▼ 3-way handshake (agree to talk)

  1. Client → Server: SYN seq=x. Client → SYN_SENT / picks initial sequence x and emits.
  2. Server → Client: SYN+ACK seq=y, ack=x+1. Server → SYN_RECV / commits its own y / acknowledges x+1.
  3. Client → Server: ACK ack=y+1. Both sides ESTABLISHED; application data can flow.

▼ Data transfer (both directions, multiplexed on the same connection)

  • Client → Server: Data seq=x+1, len=1024. Client sends a 1024-byte segment.
  • Server → Client: ACK ack=x+1+1024. Server replies "next byte I'm waiting for is x+1025."
  • Server → Client: Data seq=y+1, len=2048 (response). Server response flows back on the same connection.

▼ 4-way teardown (close each direction independently)

  1. Client → Server: FIN ("I'm done sending"). Client → FIN_WAIT_1 / can still receive.
  2. Server → Client: ACK. Server → CLOSE_WAIT / Client → FIN_WAIT_2.
  3. Server → Client: FIN ("I'm also done sending"), once the server app calls close().
  4. Client → Server: ACK. Client → TIME_WAIT (waits 2 MSL ≈ 60s to absorb stragglers).

Lots of TIME_WAIT is normal. CLOSE_WAIT lingering for a long time is a sign of close() not being called.

The essence of the 3-way handshake is "each side gets the other to confirm its sequence number." Each direction has its own independent 32-bit sequence; the SYN announces "I'll start counting from here," and the peer's ACK says "got it, I'm waiting for +1." One SYN/ACK exchange only confirms one direction, so confirming both would take four messages; the server folds its ACK of the client's SYN and its own SYN into a single SYN+ACK segment, which brings the total down to three.
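The sequence-number bookkeeping of the handshake can be sketched in a few lines. This is only an illustration of the arithmetic, not a TCP implementation: the function name and the dictionary layout are made up for this example, and real stacks pick their initial sequence numbers unpredictably.

```python
MOD = 2**32  # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list[dict]:
    """Return the three handshake segments for the given initial sequence numbers."""
    syn = {"from": "client", "flags": "SYN", "seq": client_isn, "ack": None}
    # A SYN consumes one sequence number, so the server acknowledges x + 1.
    syn_ack = {
        "from": "server",
        "flags": "SYN+ACK",
        "seq": server_isn,
        "ack": (client_isn + 1) % MOD,
    }
    # The client's final ACK confirms the server's SYN in the same way.
    ack = {
        "from": "client",
        "flags": "ACK",
        "seq": (client_isn + 1) % MOD,
        "ack": (server_isn + 1) % MOD,
    }
    return [syn, syn_ack, ack]


if __name__ == "__main__":
    # A server ISN at the top of the range shows the wraparound to 0.
    for segment in three_way_handshake(client_isn=1000, server_isn=MOD - 1):
        print(segment)
```

The middle segment is where the "four messages collapsed to three" happens: it carries the server's own SYN and its ACK of the client's SYN at the same time.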

The 4-way teardown reflects that each direction in TCP can be closed independently:

  1. The client sends FIN — "I'm done sending; I can still receive."
  2. The server ACKs.
  3. When the server application calls close(), the server sends FIN.
  4. The client ACKs and sits in TIME_WAIT for 2 MSL (Maximum Segment Lifetime). RFC 9293 specifies MSL as 2 minutes, but implementations typically use 30–60s; Linux simply fixes TIME_WAIT at 60s.
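The four steps above can be written down as a small transition table, one path for the side that closes first and one for the side that closes second. This is a sketch for reading state names, not a full TCP state machine: the event labels are invented for the example, simultaneous close (the CLOSING state) is left out, and LAST_ACK, which the list does not mention, is the standard state the passive closer sits in after sending its own FIN.

```python
# (current state, event) -> next state, covering the close sequence only.
CLOSE_TRANSITIONS = {
    # Active closer: the side that sends the first FIN.
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",
    ("TIME_WAIT", "2 MSL timeout"): "CLOSED",
    # Passive closer: the side that receives the first FIN.
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",  # happens when the app calls close()
    ("LAST_ACK", "recv ACK"): "CLOSED",
}

ACTIVE_CLOSER = ["send FIN", "recv ACK", "recv FIN", "2 MSL timeout"]
PASSIVE_CLOSER = ["recv FIN", "send FIN", "recv ACK"]


def walk(events: list[str], start: str = "ESTABLISHED") -> list[str]:
    """Apply the events in order and return every state visited."""
    states = [start]
    for event in events:
        states.append(CLOSE_TRANSITIONS[(states[-1], event)])
    return states


if __name__ == "__main__":
    print(" -> ".join(walk(ACTIVE_CLOSER)))
    print(" -> ".join(walk(PASSIVE_CLOSER)))
```

Seen this way, a socket stuck in CLOSE_WAIT is simply one where the "send FIN" event never fires because the application has not called close().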

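TIME_WAIT lingers after data is done because it has to absorb stragglers from the old connection so they can't get mixed into a new one that reuses the same address/port pair. When you build something that opens many short-lived sessions (the classic example is HTTP/1.0-era web traffic, before keep-alive became the default in HTTP/1.1), TIME_WAIT piles up on whichever side closes first. If that side is also the one initiating connections, such as a proxy opening a fresh connection per request, every TIME_WAIT entry pins an ephemeral port, and that's the famous "ephemeral-port exhaustion" problem.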
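A rough back-of-the-envelope calculation shows when TIME_WAIT turns into port exhaustion. The sketch below assumes Linux's default ephemeral port range (32768–60999) and a 60-second TIME_WAIT; both are assumptions about the host rather than values taken from this article, and the ceiling applies per destination IP and port from one source address.

```python
def max_new_connections_per_second(
    port_low: int = 32768,       # assumed: Linux default ip_local_port_range lower bound
    port_high: int = 60999,      # assumed: Linux default ip_local_port_range upper bound
    time_wait_seconds: float = 60.0,
) -> float:
    """Sustained rate of new outbound connections to one destination
    if every closed connection holds its port for the full TIME_WAIT."""
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds


if __name__ == "__main__":
    rate = max_new_connections_per_second()
    print(f"about {rate:.0f} new connections per second to a single destination")
```

With those defaults the ceiling is a few hundred connections per second toward one backend, which is why connection reuse (keep-alive, pooling) is the usual fix rather than kernel tuning.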

2.2 Sequence numbers and ACKs: solving order, loss, and duplicates

The two most important fields in the TCP header are the 32-bit sequence number and the 32-bit ACK number.

  • Sequence number = "the position, in this connection, of the first byte in this segment"
  • ACK number = "the next byte position I want" (= "I've gotten everything up to here")

Example: if the server sends "500 bytes starting at seq=1000" to the client, the client replies "ack=1500." Just from that:

  • Loss detection: no ACK arriving → after a timeout (RTO, retransmission timeout), retransmit
  • Reordering: if segments arrive out of order, sort them by sequence number before handing to the app
  • De-duplication: a segment with the same sequence number is discarded
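The reordering and de-duplication rules above fit in a toy receiver. This is a minimal sketch under simplifying assumptions: sequence-number wraparound is ignored, buffered segments are assumed to start exactly where a gap ends, and the class and method names are invented for the example.

```python
class Receiver:
    """Toy in-order reassembly keyed by sequence number."""

    def __init__(self, initial_seq: int) -> None:
        self.next_expected = initial_seq          # the ACK number we would send
        self.out_of_order: dict[int, bytes] = {}  # seq -> payload held until its gap fills
        self.delivered = bytearray()              # bytes handed to the application, in order

    def on_segment(self, seq: int, payload: bytes) -> int:
        """Accept one segment and return the cumulative ACK number."""
        end = seq + len(payload)
        if end <= self.next_expected:
            return self.next_expected             # duplicate: everything already delivered
        if seq > self.next_expected:
            self.out_of_order.setdefault(seq, payload)  # arrived early: hold it back
            return self.next_expected
        # In-order (possibly overlapping) data: skip the part we already have.
        self._deliver(payload[self.next_expected - seq:])
        # Drain any held segments that have now become contiguous.
        while self.next_expected in self.out_of_order:
            self._deliver(self.out_of_order.pop(self.next_expected))
        return self.next_expected

    def _deliver(self, data: bytes) -> None:
        self.delivered += data
        self.next_expected += len(data)


if __name__ == "__main__":
    rx = Receiver(initial_seq=1000)
    print(rx.on_segment(1500, b"B" * 500))  # early segment: ACK stays at the gap
    print(rx.on_segment(1000, b"A" * 500))  # gap filled: ACK jumps past both segments
    print(rx.on_segment(1000, b"A" * 500))  # retransmitted duplicate: discarded
```

The returned ACK number never moves backwards and never skips a hole, which is exactly what lets the sender infer loss from its absence.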

In practice, SACK (Selective ACK, RFC 2018) is enabled by default on Linux and other mainstream stacks; it lets the receiver tell the sender "I have 1000–1500 and 2500–3000, but 1500–2500 is missing," so only the holes get retransmitted.
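The example in the paragraph above can be reproduced with a short helper that turns "byte ranges I hold" into a cumulative ACK, a list of SACK blocks, and the holes the sender still has to fill. It is a sketch of the idea only: the function name is hypothetical, ranges are half-open `[start, end)`, and a real SACK option has room for only three or four blocks per segment.

```python
def ack_and_sack(
    received: list[tuple[int, int]], first_byte: int
) -> tuple[int, list[tuple[int, int]], list[tuple[int, int]]]:
    """Return (cumulative_ack, sack_blocks, holes) for the byte ranges received so far."""
    # Merge overlapping or touching ranges so each block is maximal.
    merged: list[tuple[int, int]] = []
    for start, end in sorted(received):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Ranges contiguous with the start of the stream advance the cumulative ACK;
    # everything beyond the first gap is reported as a SACK block.
    ack = first_byte
    sack_blocks = []
    for start, end in merged:
        if start <= ack:
            ack = max(ack, end)
        else:
            sack_blocks.append((start, end))

    # The holes are the gaps between the ACK point and each SACK block.
    holes = []
    cursor = ack
    for start, end in sack_blocks:
        holes.append((cursor, start))
        cursor = end
    return ack, sack_blocks, holes


if __name__ == "__main__":
    print(ack_and_sack([(1000, 1500), (2500, 3000)], first_byte=1000))
```

For the ranges in the text, the result is an ACK of 1500, one SACK block covering 2500–3000, and a single hole at 1500–2500.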

2.3 Flow control: the receive window (rwnd)

Flow control keeps the sender from exceeding the receiver's processing capacity. In every ACK, the receiver advertises "how many bytes I can accept from here" (the receive window, carried in the Window Size field). The sender ensures that un-ACKed data in flight does not exceed that window.

TCP's Window Size field is 16 bits (max 65,535 bytes), which is too small for modern high-speed paths. The Window Scale option (RFC 7323), negotiated in the SYN segments, shifts the window left by up to 14 bits, allowing windows of up to about 1 GiB.
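The "about 1 GB" figure is just a bit shift, which a few lines make concrete. The helper name is made up for this example; the two constants come straight from the paragraph above.

```python
MAX_UNSCALED_WINDOW = 0xFFFF  # largest value the 16-bit Window Size field can hold
MAX_SHIFT = 14                # largest shift count the Window Scale option allows


def effective_window(advertised: int, shift: int) -> int:
    """Window in bytes that a 16-bit advertised value represents after scaling."""
    if not 0 <= advertised <= MAX_UNSCALED_WINDOW:
        raise ValueError("advertised window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return advertised << shift


if __name__ == "__main__":
    print(effective_window(MAX_UNSCALED_WINDOW, 0))          # unscaled maximum
    print(effective_window(MAX_UNSCALED_WINDOW, MAX_SHIFT))  # scaled maximum
```

The scaled maximum is 65,535 × 2^14 = 1,073,725,440 bytes, a hair under 1 GiB.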

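A symptom like "downloads cap out at 80 MB/s" is often the receiver's kernel buffer not letting the receive window grow large enough: a single connection can never move more than one window per round trip, so throughput tops out at roughly window ÷ RTT. On Linux, tuning net.ipv4.tcp_rmem, or the application's SO_RCVBUF, can cure it.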
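Since at most one window of data can be in flight per round trip, the ceiling is easy to estimate. The sketch below works that arithmetic in both directions; the function names and the RTT values in the demo are illustrative assumptions, not measurements.

```python
def max_throughput_bytes_per_s(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput when only one window can be in flight per RTT."""
    return window_bytes * 1000 / rtt_ms


def window_needed_bytes(target_bytes_per_s: float, rtt_ms: float) -> float:
    """Window (the bandwidth-delay product) required to sustain a target rate."""
    return target_bytes_per_s * rtt_ms / 1000


if __name__ == "__main__":
    # An unscaled 65,535-byte window on a hypothetical 20 ms path.
    print(max_throughput_bytes_per_s(65535, rtt_ms=20))
    # Window needed to sustain 80 MB/s on a hypothetical 50 ms path.
    print(window_needed_bytes(80_000_000, rtt_ms=50))
```

On the hypothetical 20 ms path an unscaled window allows only about 3.3 MB/s, and sustaining 80 MB/s at 50 ms needs roughly 4 MB of window, which is the size the receive buffer has to be able to reach.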

2.4 Congestion control: estimating path congestion

Flow control (§2.3) only respects what the two endpoints can handle; congestion in the middle is a separate problem. That's the territory of congestion control. The sender keeps a separate congestion window (cwnd), and the amount of un-ACKed data it allows in flight is capped at min(rwnd, cwnd).

How cwnd evolves is set by the congestion-control algorithm:

Algorithm | Idea (in brief) | Where it's used
Reno (1990) | Treats loss as congestion → halves cwnd and recovers | Older standard
CUBIC (2008) | Cubic curve recovery, strong on high-BDP paths | Linux default since 2.6.19
BBR (2016) | Models bandwidth × RTT directly instead of equating loss with congestion | Google has rolled it out widely (YouTube, GCP, …)

The classical pattern (slow start grows cwnd exponentially, congestion avoidance then grows it linearly, and loss halves it) produces the familiar exponential → linear → cut sawtooth, but BBR drops the "loss = throttle" assumption entirely. Lossy Wi-Fi paths, long-RTT satellite links, and ultra-low-RTT data-center fabrics all behave very differently under each algorithm, which is why modern OSes let you switch algorithms; on Linux the knob is the net.ipv4.tcp_congestion_control sysctl.
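The classical loss-based pattern is easy to see in a toy simulation. The sketch below is a Reno-style caricature under stated assumptions: one step per round trip, cwnd counted in whole segments, an initial window of 1 segment (modern stacks start larger), and loss handled by halving rather than by a timeout reset. The function and parameter names are invented for the example; it models neither CUBIC nor BBR.

```python
def simulate_cwnd(
    rounds: int,
    ssthresh: int,
    loss_rounds: set[int],
    rwnd: int = 10**9,
) -> list[int]:
    """Return the send window (in segments) used in each round trip."""
    cwnd = 1
    history = []
    for rtt in range(rounds):
        # What is actually sent is limited by both windows.
        history.append(min(cwnd, rwnd))
        if rtt in loss_rounds:
            # Loss is read as congestion: halve, and continue from there.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double every round trip, up to the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one extra segment per round trip.
            cwnd += 1
    return history


if __name__ == "__main__":
    print(simulate_cwnd(rounds=10, ssthresh=16, loss_rounds={6}))
```

Passing a small `rwnd` shows the other half of min(rwnd, cwnd): the history flattens at the receiver's limit no matter how far cwnd would have grown.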

2.5 Connection state: what ss actually shows you

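TCP's state machine has 11 states (counting CLOSED), but the ones worth recognizing in operations are limited. Note that ss prints abbreviated, hyphenated names (ESTAB, TIME-WAIT, CLOSE-WAIT, SYN-RECV, …), while the table below uses the classic netstat-style spelling: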

State | Meaning | When to care
LISTEN | Server is accepting connections | Confirms the service is up
ESTABLISHED | Data can flow | Active connection; watch for excess count
SYN_SENT / SYN_RECV | Mid-handshake | Normally clears fast / many SYN_RECV = a SYN-flood signature
FIN_WAIT_1 / FIN_WAIT_2 | After sending your own FIN | Normally clears fast
CLOSE_WAIT | Peer sent FIN but you haven't called close() yet | Persisting = an application bug (forgotten close)
TIME_WAIT | You did the active close / waiting 2 MSL | Even a lot of these is usually normal

# Per-state count in one line
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c

# All ESTABLISHED on a port
ss -tan state established sport = :443

A server piling up CLOSE_WAIT is almost always missing a close(), and only an application fix actually solves it. Lots of TIME_WAIT is normal. The now-deleted tcp_tw_recycle sysctl (removed in Linux 4.12) was once recommended for taming it, but it broke connections from clients behind NAT, so it is worth remembering as an anti-pattern.

3. UDP: When reliability gets in the way

Everything in §2.1–§2.4 (handshakes, sequence numbers, retransmission, windows) pays a cost to give you "reliable, ordered, deduplicated." For real-time audio, video, and gaming, that cost is the wrong trade: you'd rather show a fresh frame with a gap in it than a stale frame that was held up waiting on a retransmit.

UDP (User Datagram Protocol, RFC 768) is the opposite design:

  • Connectionless — no handshake; just send
  • Just an 8-byte header (source port / destination port / length / checksum)
  • No retransmission, no ordering, no flow control, no congestion control
  • Whatever guarantees you need, the application builds them
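How little UDP adds is visible when the 8-byte header is packed by hand. The sketch below uses Python's `struct` module with network byte order; the function names are made up for the example, and the checksum is left at 0 (permitted over IPv4, where it means "not computed") because the real one is calculated over a pseudo-header that this example does not build.

```python
import struct

# Four 16-bit big-endian fields: source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Prefix the payload with a UDP header; length covers header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the first 8 bytes of a datagram into named fields."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,
        "checksum": checksum,
    }


if __name__ == "__main__":
    datagram = build_udp_datagram(12345, 53, b"hello")
    print(len(datagram), parse_udp_header(datagram))
```

There is no sequence number, no ACK field, and no window anywhere in those 8 bytes, which is the whole point: anything beyond ports and a length is the application's job.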

Where UDP is the right pick:

  • DNS queries — one round trip is enough; a TCP handshake is overkill
  • VoIP / RTP / WebRTC — retransmitting half-second-old audio doesn't help anyone
  • Games — the latest input matters more than recovering an old position
  • DHCP / NTP / TFTP — bootstrap-time protocols

The clean "reliability → TCP, real-time → UDP" split was the rule for decades — until QUIC (next section) blurred it.

4. QUIC: Rebuilding TCP+TLS+HTTP on top of UDP

TCP keeps reliability and congestion control inside the OS kernel. That used to be a strength; it became a weakness, because evolution is slow (a new congestion-control algorithm needs a kernel update). On top of that, the TCP and TLS handshakes stack serially: 1 RTT for TCP plus 1 RTT for TLS 1.3 (2 RTTs for TLS 1.2) means at least 2 RTTs before the first request can go out, which is too heavy for the modern web.

QUIC (RFC 9000, 2021) is a rebuild in user space, on top of UDP:

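  • Transport + TLS unified, with HTTP/3 designed to run on top: TLS 1.3 is built in, so the first connection completes in 1 RTT and resumption in 0 RTT
  • Multiplexed streams: independent streams within one connection, so a lost packet stalls only the streams whose data it carried rather than everything behind it, eliminating TCP's head-of-line blocking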
  • Connection IDs — even when the client's IP / port changes (Wi-Fi → 4G handoff), the connection survives
  • Built in user space — new congestion-control algorithms can be deployed without OS upgrades
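The head-of-line blocking point in the list above is easiest to see with a toy model: the same packets, the same single loss, delivered under two ordering rules. This is a deliberately simplified sketch, not QUIC itself: packets are modelled as `(stream_id, chunk)` pairs, exactly one packet is lost, and the function name and flag are invented for the example.

```python
def deliverable(
    packets: list[tuple[str, str]],
    lost_index: int,
    per_stream_ordering: bool,
) -> list[tuple[str, str]]:
    """Chunks the application can read before the lost packet is retransmitted."""
    delivered = []
    blocked_streams = set()
    for index, (stream_id, chunk) in enumerate(packets):
        if index == lost_index:
            if not per_stream_ordering:
                # One ordered byte stream (TCP-like): nothing past the gap is readable.
                break
            # Independent streams (QUIC-like): only this stream has a gap.
            blocked_streams.add(stream_id)
            continue
        if stream_id in blocked_streams:
            continue  # later data on the stream with the gap must wait
        delivered.append((stream_id, chunk))
    return delivered


if __name__ == "__main__":
    sent = [("A", "a1"), ("B", "b1"), ("A", "a2"), ("B", "b2")]
    print(deliverable(sent, lost_index=0, per_stream_ordering=False))
    print(deliverable(sent, lost_index=0, per_stream_ordering=True))
```

With the first packet lost, the single-stream model delivers nothing until the retransmission arrives, while the per-stream model still hands all of stream B to the application.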

HTTP/3 is "HTTP over QUIC," and Google / Cloudflare / Meta / Akamai have rolled it out widely. As of 2026 it is a standard option for new infrastructure. The neat "UDP = no reliability" picture is, post-QUIC, a thing of the past.

5. Watching it live

Everything above is observable in real time with tcpdump and ss:

# Watch handshakes and FINs with the smallest possible filter
sudo tcpdump -nn -i any 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'

# All ESTABLISHED connections to remote port 443 (use dst ADDRESS:443 to pin one server)
ss -tn state established dst :443

# State distribution and connection counts (worth pinning to an ops dashboard)
ss -tan | awk 'NR>1 {s[$1]++} END {for(k in s) print k, s[k]}'

# Detail including RTT / cwnd / retransmits (Linux)
ss -tin

In ss -tin output, the values cwnd:N, rtt:M/V, and retrans:R/T correspond to what §2.4 talked about: the congestion window (in segments), the smoothed round-trip time and its variation (in milliseconds), and the retransmission counters (currently outstanding / total for the connection). When you face "this web page is slow," or "only this server-to-server path drops throughput," identifying which layer is the actual bottleneck almost always begins here.
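If these values need to go onto a dashboard, the three fields can be pulled out of a line of text with regular expressions. The sketch below relies only on the `cwnd:N`, `rtt:M/V`, and `retrans:R/T` shapes named above; the sample line is hand-written for illustration rather than captured from a real host, and the function name and result keys are assumptions of this example.

```python
import re

CWND = re.compile(r"\bcwnd:(\d+)")
RTT = re.compile(r"\brtt:([\d.]+)/([\d.]+)")
RETRANS = re.compile(r"\bretrans:(\d+)/(\d+)")


def parse_tcp_info(line: str) -> dict:
    """Extract cwnd, rtt, and retrans from one detail line; absent fields are skipped."""
    result: dict = {}
    match = CWND.search(line)
    if match:
        result["cwnd_segments"] = int(match.group(1))
    match = RTT.search(line)
    if match:
        result["rtt_ms"] = float(match.group(1))
        result["rtt_var_ms"] = float(match.group(2))
    match = RETRANS.search(line)
    if match:
        result["retrans_outstanding"] = int(match.group(1))
        result["retrans_total"] = int(match.group(2))
    return result


if __name__ == "__main__":
    # Hand-written sample in the documented field shapes, not real ss output.
    sample = "cwnd:10 rtt:12.5/3.2 retrans:0/4"
    print(parse_tcp_info(sample))
```

The `\b` anchors keep the patterns from matching inside longer field names that merely end in the same letters; fields that are missing from a line are simply left out of the result.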


TCP/IP is "the courier that runs the Internet," and TCP, at its center, is best understood as "the set of moves that build reliability on top of best-effort IP": handshakes, sequence numbers, the receive window, the congestion window. UDP is what's left when you throw all those moves away on purpose; QUIC takes the other direction and rebuilds the same moves in user space, on top of UDP, for agility. Once the mechanics are in your head, reading ss -tin, deciding whether to adopt HTTP/3, or picking a congestion-control algorithm all stop being other people's problems.