TCP/IP #
The phrase "TCP/IP" carries two meanings. One is the whole protocol suite anchored on TCP and IP (HTTP / DNS / TLS / SSH / ICMP / IP / Ethernet / …). The other is the four-layer reference model that organizes that suite. Because virtually every byte on the Internet rides on top of TCP/IP, "TCP/IP" and "the protocols that run the Internet" are nearly the same statement.
There are separate articles for IP (addressing and routing) and the OSI Reference Model (the 7-layer reference). This article carves out a different role: §1 is the four-layer overview, and §2 — the heart of the article — is what TCP itself is doing on top of IP, with §3 covering UDP, §4 covering QUIC, and §5 dropping into actual observed behavior.
1. The four-layer model #
In the TCP/IP model, communication is organized into four layers. Compared to OSI's seven, the major difference is that L5 / L6 don't get their own layer — they're absorbed into the Application layer.
The role of each layer in one sentence:
- Application: rules for "what the bytes mean" (HTTP request/response, DNS query/answer, SSH channel multiplexing, …)
- Transport: identifies "which app's which connection", and optionally adds reliability (TCP) or an encrypted UDP session (QUIC)
- Internet: gets "a packet to this IP, from anywhere on the planet" — pathing is best-effort
- Link: gets "this frame to the adjacent node, right now, over this cable / radio"
That the Internet layer is best-effort (IP makes no promise of delivery) is the central design choice of TCP/IP. TCP fills that gap only when you need it; UDP stays out of the way when you don't. That separation is why the Internet was able to grow into so many shapes.
2. TCP — Building reliability on top of IP #
IP is "write the address on the envelope and drop it in the post" — best-effort, with no promise of delivery, ordering, or de-duplication. The fact that almost every Internet application can still write code as if "what I send arrives, in order, exactly once" is because TCP sits in between and hides IP's uncertainty.
TCP gives you four guarantees, each backed by a specific mechanism:
- Connection — handshake to agree on "we're going to talk now" (§2.1)
- Ordering + de-dup + loss recovery — sequence numbers + ACKs + retransmission (§2.2)
- Flow control — the receiver tells the sender, with a window, "don't send more than this" (§2.3)
- Congestion control — the sender estimates path congestion and adjusts pace automatically (§2.4)
2.1 3-way handshake and 4-way teardown #
TCP's defining feature is being connection-oriented. Before any data flows, both ends agree — in three messages — that "we're going to talk." Closing is symmetric: each side independently says "I'm done sending."
The essence of the 3-way handshake is "each side gets the other to confirm its sequence number." Each direction has its own independent 32-bit sequence; the SYN announces "I'll start counting from here," and the peer's ACK says "got it, I'm waiting for +1." Each SYN/ACK exchange confirms only one direction, so two exchanges are needed. The server piggybacks its own SYN on the ACK it sends back (SYN+ACK), which is why the total is three messages rather than four.
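The sequence-number bookkeeping in the handshake can be sketched as a small simulation. This is illustrative only: the `Segment` class and `three_way_handshake` function are names made up for this example, and a real stack chooses its initial sequence numbers unpredictably rather than taking them as arguments.

```python
from dataclasses import dataclass
from typing import Optional

MOD = 2**32  # sequence numbers are 32-bit and wrap around


@dataclass
class Segment:
    sender: str
    flags: str
    seq: int
    ack: Optional[int] = None  # None means the ACK flag is not set


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three handshake segments for the given initial sequence numbers."""
    syn = Segment("client", "SYN", seq=client_isn)
    # The SYN consumes one sequence number, so the server asks for ISN + 1
    # and announces its own ISN in the same segment.
    syn_ack = Segment("server", "SYN+ACK", seq=server_isn, ack=(syn.seq + 1) % MOD)
    # The client confirms the server's ISN the same way; its own seq has advanced by one.
    ack = Segment("client", "ACK", seq=syn_ack.ack, ack=(syn_ack.seq + 1) % MOD)
    return [syn, syn_ack, ack]
```

Calling `three_way_handshake(1000, 5000)` yields a SYN with seq=1000, a SYN+ACK with seq=5000 and ack=1001, and a final ACK with seq=1001 and ack=5001.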
The 4-way teardown reflects that each direction in TCP can be closed independently:
- The client sends FIN — "I'm done sending; I can still receive."
- The server ACKs.
- When the server application calls close(), the server sends FIN.
- The client ACKs and sits in TIME_WAIT for 2 MSL (Maximum Segment Lifetime, typically 30–60s).
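TIME_WAIT lingers after the data is done because it has to absorb stragglers from the old connection so they can't get mixed into a new one that reuses the same address and port pair. When you build something that opens many short-lived sessions (the classic example is web traffic before HTTP/1.1 made keep-alive the default), TIME_WAIT entries pile up on whichever side closes first, and if that side is also the one opening the connections it can run out of local ports: that's the famous "ephemeral-port exhaustion" problem.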
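The independent closing of each direction can be observed from an application. The following is a minimal sketch over loopback, assuming Python 3.8 or later; `serve_once` and `half_close_demo` are names invented for this example. The client sends its FIN with `shutdown(SHUT_WR)` and still receives the server's reply afterwards.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        received = b""
        # recv() returning b"" means the peer's FIN arrived: it is done sending.
        while chunk := conn.recv(4096):
            received += chunk
        # Our own direction is still open, so we can reply after the peer's FIN.
        conn.sendall(b"got %d bytes" % len(received))
    # Leaving the with-block calls close(), which sends the server's FIN.


def half_close_demo() -> bytes:
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b"hello")
            client.shutdown(socket.SHUT_WR)  # send FIN, keep the receive side open
            reply = b""
            while chunk := client.recv(4096):
                reply += chunk
        worker.join()
        return reply
```

Because the client closes first here, it is the client's end of the connection that passes through TIME_WAIT.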
2.2 Sequence numbers and ACKs — solving order, loss, and duplicates #
The two most important fields in the TCP header are the 32-bit sequence number and the 32-bit ACK number.
- Sequence number = "the position, in this connection, of the first byte in this segment"
- ACK number = "the next byte position I want" (= "I've gotten everything up to here")
Example: if the server sends "500 bytes starting at seq=1000" to the client, the client replies "ack=1500." Just from that:
- Loss detection: no ACK arriving → after a timeout (RTO, retransmission timeout), retransmit
- Reordering: if segments arrive out of order, sort them by sequence number before handing to the app
- De-duplication: a segment with the same sequence number is discarded
In practice, SACK (Selective ACK, RFC 2018) is enabled by default; it lets the receiver tell the sender "I have 1000–1500 and 2500–3000, but 1500–2500 is missing," so only the holes get retransmitted.
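To make the cumulative ACK and the SACK blocks concrete, here is a rough sketch of the receiver's view. It is a simplification: `receiver_view` is a made-up helper, byte ranges are half-open `(start, end)` pairs, and 32-bit wraparound is ignored.

```python
def receiver_view(initial_seq: int, segments: list) -> tuple:
    """segments: (seq, length) pairs in arrival order. Returns (ack, sack_blocks)."""
    # Sorting by start handles reordering; overlapping or repeated ranges
    # (duplicates) collapse into a single merged range.
    ranges = sorted((seq, seq + length) for seq, length in segments)
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    ack = initial_seq  # next byte position the receiver wants
    sack_blocks = []
    for start, end in merged:
        if start <= ack:
            ack = max(ack, end)  # contiguous with what we already have
        else:
            sack_blocks.append((start, end))  # data received beyond a hole
    return ack, sack_blocks
```

With bytes 1000–1500 and 2500–3000 received, `receiver_view(1000, [(1000, 500), (2500, 500)])` returns `(1500, [(2500, 3000)])`: the cumulative ACK stops at the hole, and the SACK block reports what arrived beyond it.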
2.3 Flow control — the receive window (rwnd) #
Flow control keeps the sender from exceeding the receiver's processing capacity. In every ACK, the receiver advertises "how many bytes I can accept from here" (= the receive window, Window Size). The sender ensures that un-ACKed data on the wire does not exceed that window.
TCP's Window Size field is 16 bits (max 65,535), which is too small for modern high-speed paths. The Window Scale option (RFC 7323) shifts the window left by up to 14 bits, allowing windows of up to about 1 GiB.
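A symptom like "downloads cap out at 80 MB/s" is often the receiver's kernel buffer not letting the receive window grow large enough: a sender can have at most one window of data in flight per round trip, so throughput tops out at roughly window / RTT. Tuning net.ipv4.tcp_rmem, or the application's SO_RCVBUF, can cure it.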
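The window arithmetic is easy to check by hand. The sketch below uses helper names invented for this example and takes RTT in milliseconds; it treats throughput as bounded by one window per round trip, which ignores loss and congestion control.

```python
import socket

MAX_UNSCALED_WINDOW = 0xFFFF      # 16-bit Window Size field
MAX_SCALED_WINDOW = 0xFFFF << 14  # with the largest Window Scale shift (14)


def max_throughput_bytes_per_s(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 1000 / rtt_ms


def window_needed_bytes(target_bytes_per_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: the window required to sustain the target rate."""
    return target_bytes_per_s * rtt_ms / 1000


def default_rcvbuf() -> int:
    """Receive buffer the OS assigns to a fresh TCP socket (platform-dependent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Assuming a 50 ms RTT, an unscaled 65,535-byte window caps out near 1.3 MB/s, and sustaining 80 MB/s needs a window of about 4 MB. Comparing that figure with `default_rcvbuf()` gives a first hint of whether the receive buffer is the limit.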
2.4 Congestion control — estimating path congestion #
Flow control (§2.3) only respects what the endpoints can handle; congestion in the middle of the path is a separate problem, and that's the territory of congestion control. The sender keeps a separate congestion window (cwnd), and the amount of un-ACKed data it may have in flight is min(rwnd, cwnd).
How cwnd evolves is set by the congestion-control algorithm:
| Algorithm | Idea (in brief) | Where it's used |
|---|---|---|
| Reno (1990) | Treats loss as congestion → halves cwnd and recovers | Older standard |
| CUBIC (2008) | Cubic curve recovery, strong on high-BDP paths | Linux default since 2.6.19 |
| BBR (2016) | Models bandwidth × RTT directly instead of equating loss with congestion | Google has rolled it out widely (YouTube, GCP, …) |
The classical pattern (slow-start grows cwnd exponentially, then congestion-avoidance grows it linearly, and loss halves it) is the exponential → linear → reset triangle, but BBR drops the "loss = throttle" assumption entirely. Lossy Wi-Fi paths, long-RTT satellite links, and ultra-low-RTT data-center fabrics all behave very differently under each algorithm, which is why modern OSes let you switch algorithms (on Linux, via the sysctl net.ipv4.tcp_congestion_control).
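The classical exponential → linear → halve pattern can be reproduced with a toy model. This is a deliberately simplified Reno-style sketch, not any kernel's implementation: cwnd is counted in segments, time advances one round trip per step, and the rounds in which a loss is detected are supplied by hand through the made-up `loss_rounds` parameter.

```python
def simulate_cwnd(rounds: int, ssthresh: float, loss_rounds: set,
                  initial_cwnd: float = 1.0) -> list:
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        if rnd in loss_rounds:
            # Loss is read as congestion: halve the window and continue from there.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double every round trip, up to ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one extra segment per round trip.
            cwnd += 1
    return history
```

`simulate_cwnd(10, ssthresh=8, loss_rounds={6})` returns `[1.0, 2.0, 4.0, 8.0, 9.0, 10.0, 11.0, 5.5, 6.5, 7.5]`: exponential growth up to ssthresh, linear growth after it, and a halving at the loss.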
2.5 Connection state — what ss actually shows you #
TCP's state machine has roughly 11 states, but the ones worth recognizing in operations are limited:
| State | Meaning | When to care |
|---|---|---|
| LISTEN | Server is accepting connections | Confirms the service is up |
| ESTABLISHED | Data can flow | Active connection; watch for excess count |
| SYN_SENT / SYN_RECV | Mid-handshake | Normally clears fast / many SYN_RECV = a SYN-flood signature |
| FIN_WAIT_1 / FIN_WAIT_2 | After sending your own FIN | Normally clears fast |
| CLOSE_WAIT | Peer sent FIN but you haven't called close() yet | Persisting = an application bug (forgotten close) |
| TIME_WAIT | You did the active close / waiting 2 MSL | Even a lot of these is usually normal |
```shell
# Per-state count in one line
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
# All ESTABLISHED on a port
ss -tan state established sport = :443
```
A server piling up CLOSE_WAIT is almost always missing a close(), and only an application fix actually solves it. Lots of TIME_WAIT is normal, and the now-deleted tcp_tw_recycle (removed in Linux 4.12) — once recommended for taming it — is worth remembering as an anti-pattern.
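The per-state count is also easy to script when you want to alert on it. The sketch below uses hypothetical helper names and assumes that `ss -tan` prints one header line followed by one connection per line with the state in the first column. Note that ss spells the states with hyphens and abbreviates ESTABLISHED (for example `ESTAB`, `CLOSE-WAIT`, `TIME-WAIT`), whereas netstat uses the underscore forms.

```python
import subprocess
from collections import Counter


def count_states(ss_output: str) -> Counter:
    """Count connections per TCP state from the text printed by `ss -tan`."""
    lines = ss_output.splitlines()[1:]  # skip the header line
    return Counter(line.split()[0] for line in lines if line.strip())


def live_state_counts() -> Counter:
    """Run `ss -tan` on this host (Linux with iproute2 assumed) and count states."""
    result = subprocess.run(["ss", "-tan"], capture_output=True, text=True, check=True)
    return count_states(result.stdout)
```

A monitoring job could call `live_state_counts()` periodically and raise an alarm when the `CLOSE-WAIT` count keeps growing; the threshold is a local decision.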
3. UDP — When reliability gets in the way #
Everything in §2.1–§2.4 (handshakes, sequence numbers, retransmission, windows) pays a cost to give you "reliable, ordered, deduplicated." For real-time audio, video, and gaming, that cost is the wrong choice: you'd rather show a fresh frame with a gap in it than a stale frame that was held up waiting on a retransmit.
UDP (User Datagram Protocol, RFC 768) is the opposite design:
- Connectionless — no handshake; just send
- Just an 8-byte header (source port / destination port / length / checksum)
- No retransmission, no ordering, no flow control, no congestion control
- Whatever guarantees you need, the application builds them
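Both points, the tiny header and the absence of a handshake, can be seen in a few lines. This is a sketch with made-up helper names: `udp_header` only packs the four 16-bit fields (leaving the checksum at 0, which in IPv4 means "not computed"), and `loopback_datagram` sends one datagram over loopback, where loss is unlikely but still not excluded by the protocol.

```python
import socket
import struct


def udp_header(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Pack the 8-byte UDP header: source port, destination port, length, checksum."""
    length = 8 + len(payload)  # the length field covers header plus data
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)


def loopback_datagram(payload: bytes) -> bytes:
    """Send one datagram to a local socket with no connection setup at all."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # let the OS pick a free port
        receiver.settimeout(2.0)         # UDP gives no delivery guarantee
        sender.sendto(payload, receiver.getsockname())  # no connect(), no handshake
        data, _ = receiver.recvfrom(2048)
        return data
```

There is no `listen()`, `accept()`, or `connect()` anywhere: the first packet on the wire already carries application data.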
Where UDP is the right pick:
- DNS queries — one round trip is enough; a TCP handshake is overkill
- VoIP / RTP / WebRTC — retransmitting half-second-old audio doesn't help anyone
- Games — the latest input matters more than recovering an old position
- DHCP / NTP / TFTP — bootstrap-time protocols
The clean "reliability → TCP, real-time → UDP" split was the rule for decades — until QUIC (next section) blurred it.
4. QUIC — Rebuilding TCP+TLS+HTTP on top of UDP #
TCP keeps reliability and congestion control inside the OS kernel. That used to be a strength; it became a weakness, because evolution is slow (a new congestion-control algorithm needs a kernel update). On top of that, the TCP and TLS handshakes stack serially (TCP 1 RTT + TLS 1.3 1 RTT = 2 RTT, and one more with TLS 1.2), which is too heavy for the modern web.
QUIC (RFC 9000, 2021) is a rebuild in user space, on top of UDP:
- TCP + TLS + HTTP unified — TLS 1.3 is built in, and the first connection completes in 1 RTT, resumption in 0 RTT
- Multiplexed streams — independent streams within one connection, eliminating TCP's head-of-line blocking
- Connection IDs — even when the client's IP / port changes (Wi-Fi → 4G handoff), the connection survives
- Built in user space — new congestion-control algorithms can be deployed without OS upgrades
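The handshake savings are simple arithmetic. As a back-of-the-envelope sketch, the table below counts setup round trips using the figures quoted in this section, and adds one round trip for the first request and response. The names are invented for this example, and server processing time, packet loss, and 0-RTT replay restrictions are all ignored.

```python
# Round trips spent on connection setup before the first request can be sent.
SETUP_RTTS = {
    "tcp+tls1.3": 2,    # TCP handshake (1) + TLS 1.3 handshake (1)
    "quic": 1,          # transport and TLS 1.3 handshakes combined
    "quic-resumed": 0,  # 0-RTT: the request rides along with the first flight
}


def time_to_first_response_ms(stack: str, rtt_ms: float) -> float:
    """Setup round trips plus one round trip for the request and its response."""
    return (SETUP_RTTS[stack] + 1) * rtt_ms
```

On an assumed 50 ms path this gives 150 ms for TCP + TLS 1.3, 100 ms for a fresh QUIC connection, and 50 ms for a resumed one.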
HTTP/3 is "HTTP over QUIC," and Google / Cloudflare / Meta / Akamai have rolled it out widely. As of 2026 it is a standard option for new infrastructure. The neat "UDP = no reliability" picture is, post-QUIC, a thing of the past.
5. Watching it live #
Everything above is observable in real time with tcpdump and ss:
```shell
# Watch handshakes and FINs with the smallest possible filter
sudo tcpdump -nn -i any 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'
# All ESTABLISHED connections whose remote port is 443
ss -tan state established dst :443
# State distribution and connection counts (worth pinning to an ops dashboard)
ss -tan | awk 'NR>1 {s[$1]++} END {for(k in s) print k, s[k]}'
# Detail including RTT / cwnd / retransmits (Linux)
ss -tin
```
In ss -tin output, the values cwnd:N, rtt:M/V, retrans:R/T are exactly the congestion window, the round-trip time, and the retransmission count that §2.4 talked about. When you face "this web page is slow," or "only this server-to-server path drops throughput," identifying which layer is the actual bottleneck almost always begins here.
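If you want those three values in a script rather than by eye, a small parser is enough. This is a sketch with assumed details: `parse_tcp_info` and the dictionary keys are names chosen here, the function expects one info line as printed by `ss -tin`, and it simply skips any field that is absent from the line.

```python
import re


def parse_tcp_info(info_line: str) -> dict:
    """Extract cwnd, rtt and retrans from one info line of `ss -tin` output."""
    fields = {}
    # \b keeps "rtt:" from matching inside other keys such as "minrtt:".
    m = re.search(r"\bcwnd:(\d+)", info_line)
    if m:
        fields["cwnd_segments"] = int(m.group(1))
    m = re.search(r"\brtt:([\d.]+)/([\d.]+)", info_line)
    if m:
        fields["rtt_ms"] = float(m.group(1))     # smoothed round-trip time
        fields["rttvar_ms"] = float(m.group(2))  # its variation
    m = re.search(r"\bretrans:(\d+)/(\d+)", info_line)
    if m:
        fields["retrans_current"] = int(m.group(1))
        fields["retrans_total"] = int(m.group(2))
    return fields
```

Feeding it each info line and logging `rtt_ms` and `cwnd_segments` over time is often enough to tell a window-limited connection from a lossy one.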
TCP/IP is "the courier that runs the Internet," and TCP, at its center, is best understood as "the set of moves that build reliability on top of best-effort IP" — handshakes, sequence numbers, the receive window, the congestion window. UDP is what's left when you throw all those moves away on purpose; QUIC is the same moves rebuilt in user space for agility, in a different direction. Once the mechanics are in your head, reading ss -tin, deciding whether to adopt HTTP/3, or picking a congestion-control algorithm all stop being other people's problems.