TCP/IP means both "the protocol suite that runs the Internet" and "the four-layer reference model that organizes it". This article starts with the four-layer overview, then turns to its central question: how TCP builds reliability on top of IP's best-effort delivery. The mechanisms are covered in the order they come into play: 3-way handshake, sequence numbers, receive window, congestion control, connection state. It ends by watching the actual behavior live with ss and tcpdump.
The 4-layer model at a glance #
The TCP/IP model splits communication into four layers. Unlike OSI's seven, it doesn't cut L5/L6 out as separate layers — those are absorbed into the Application layer.
| Layer | Name | Role | Representative protocols |
|---|---|---|---|
| L4 | Application | Rules that give meaning to the bytes that arrived | HTTP / DNS / SSH / SMTP / TLS / gRPC |
| L3 | Transport | Logical "app ↔ app" connection — identified by port | TCP / UDP / QUIC / SCTP |
| L2 | Internet | Connects "host ↔ host" at planetary scale — identified by IP | IPv4 / IPv6 / ICMP / IPsec / OSPF / BGP |
| L1 | Link | The physical "how does it reach the next hop" | Ethernet / Wi-Fi / PPP / ARP / fiber |
The IP layer makes no delivery guarantee and no ordering guarantee. "TCP picks up the slack when reliability is needed" and "UDP passes through thinly when it isn't" — that separation is the fundamental reason the Internet has been able to stretch into every imaginable use case.
Restating each layer's job in one sentence:
- Application — the rules that give bytes meaning, like HTTP request/response or DNS query/answer
- Transport — which app, which connection, optionally with reliability and an encrypted session
- Internet — get this packet to the host with that IP, from anywhere on Earth (best-effort)
- Link — right now, on this cable or radio, hand the frame to the neighbor in front of me
The "four guarantees" TCP builds on top of IP #
IP is a service that does nothing more than "write an address on an envelope and drop it in the mail" — no delivery guarantee, no ordering guarantee, no de-duplication guarantee. The reason almost every app on the Internet can still be written assuming "what I send arrives in order, without losses, without duplicates" is that TCP sits in between and hides IP's uncertainty.
What TCP provides is exactly four guarantees:
- Connection — a handshake establishes "we're about to talk"
- Order + de-duplication + loss recovery — sequence numbers + ACKs + retransmission
- Flow control — the receiver tells the sender "stop sending more" via a window
- Congestion control — estimate path congestion and automatically pace the sender
3-way handshake and 4-way teardown #
TCP's defining property is connection-oriented. Before any data flows, both ends agree "we're about to talk" / "go ahead" in three messages.
- SYN: the client picks an initial sequence number seq=x and sends. State becomes SYN_SENT.
- SYN+ACK: the server picks its own seq=y and replies with ack=x+1. State: SYN_RECV.
- ACK: the client replies with ack=y+1. Both sides become ESTABLISHED; application data can now flow.
The essence of the 3-way handshake is "make each side confirm the other's sequence number, in both directions". Each direction has its own 32-bit sequence number; SYN announces "I'll start from this number", and ACK answers "I'm waiting from +1". Doing the same exchange in both directions would take four messages, but the server's ACK and its own SYN travel in a single segment, which leaves three.
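The sequence-number bookkeeping of the handshake can be sketched as a tiny model. This is an illustration only, not a network implementation: the function name `three_way_handshake` and the tuple layout are assumptions made for this example, and each state shown is the sender's state right after it emits the segment.

```python
MOD = 2**32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the three handshake segments as
    (sender, flags, seq, ack, sender_state_after_sending)."""
    return [
        # 1. client announces "I'll start from client_isn"
        ("client", "SYN", client_isn, None, "SYN_SENT"),
        # 2. server announces its own start and confirms the client's (+1)
        ("server", "SYN+ACK", server_isn, (client_isn + 1) % MOD, "SYN_RECV"),
        # 3. client confirms the server's start (+1)
        ("client", "ACK", (client_isn + 1) % MOD, (server_isn + 1) % MOD, "ESTABLISHED"),
    ]


for step in three_way_handshake(client_isn=100, server_isn=200):
    print(step)
```

With `client_isn=100` and `server_isn=200`, the second segment carries ack 101 and the third carries ack 201, matching the x+1 / y+1 pattern above.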
4-way teardown and TIME_WAIT #
TCP is designed so that each direction can be closed independently, so termination becomes a 4-step exchange.
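- FIN: the client sends FIN and becomes FIN_WAIT_1.
- ACK: the server acknowledges it and becomes CLOSE_WAIT, client becomes FIN_WAIT_2.
- FIN: the server sends its own FIN (state LAST_ACK) once its application calls close().
- ACK: the client acknowledges it and waits in TIME_WAIT for 2 MSL before the connection is fully gone.
Opening huge numbers of short-lived sessions piles up TIME_WAIT entries on the side that closes first and eventually causes "port exhaustion", a classic problem of web servers from the keep-alive-less HTTP/1.0 era (HTTP/1.1 made persistent connections the default). CLOSE_WAIT hanging around for a long time, on the other hand, is the signature of the application forgetting to call close(); the only real fix is in the app.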
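That each direction closes independently can be observed from userspace with a half-close. The following is a minimal loopback sketch, assuming a local machine that allows binding to 127.0.0.1; the helper names `serve_once` and `half_close_demo` are made up for this example.

```python
import socket
import threading


def serve_once(listener, result):
    conn, _ = listener.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # empty read: the peer's FIN arrived (we are in CLOSE_WAIT)
                break
            chunks.append(data)
        result["request"] = b"".join(chunks)
        # Our own direction is still open, so we can keep sending.
        conn.sendall(b"reply after your FIN")
    # Leaving the with-block calls close(), which sends our FIN.


def half_close_demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    result = {}
    worker = threading.Thread(target=serve_once, args=(listener, result))
    worker.start()
    with socket.create_connection(listener.getsockname()) as client:
        client.sendall(b"request")
        client.shutdown(socket.SHUT_WR)  # send FIN: client -> server is now closed
        reply = b""
        while True:
            data = client.recv(4096)  # server -> client still works
            if not data:
                break
            reply += data
    worker.join()
    listener.close()
    return result["request"], reply


print(half_close_demo())
```

The client still receives the reply after it has sent its FIN, which is exactly why teardown needs a FIN and an ACK per direction.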
Sequence numbers and ACKs — ordering, loss, duplicates #
The two most important fields in the TCP header are the sequence number (32-bit) and the ACK number (32-bit).
- Sequence number — the position (in bytes, on this connection) of this segment's first byte
- ACK number — the next byte the receiver wants (= "I've received everything up to here")
Example: the server sends "500 bytes starting at seq=1000", and the client replies "ack=1500". That single mechanism solves three problems at once.
# Loss detection
No ACK arrives → RTO (Retransmission Timeout) expires → retransmit
# Reordering
Segments that arrive out of order are sorted by seq before delivery to the app
# De-duplication
Segments with a seq we've already seen are dropped
In practice, SACK (Selective ACK, RFC 2018) is in use almost everywhere: the receiver tells the sender "I have 1000-1500 and 2500-3000 but not 1500-2500", marking the holes so only the missing pieces are retransmitted.
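The receiver-side logic described above (reorder, drop duplicates, cumulative ACK, SACK blocks) can be sketched as a small model. It is a simplification under stated assumptions: the `Receiver` class is hypothetical, segments never partially overlap, and sequence-number wraparound is ignored.

```python
class Receiver:
    def __init__(self, initial_seq):
        self.next_expected = initial_seq  # this is the ACK number we would send
        self.buffered = {}                # seq -> payload, for out-of-order segments
        self.delivered = bytearray()      # bytes handed to the application, in order

    def receive(self, seq, payload):
        """Process one segment and return (ack, sack_blocks)."""
        end = seq + len(payload)
        if end <= self.next_expected or seq in self.buffered:
            pass  # duplicate: already delivered or already buffered, drop it
        elif seq == self.next_expected:
            self.delivered += payload
            self.next_expected = end
            # A filled hole may make buffered segments contiguous: drain them.
            while self.next_expected in self.buffered:
                chunk = self.buffered.pop(self.next_expected)
                self.delivered += chunk
                self.next_expected += len(chunk)
        else:
            self.buffered[seq] = payload  # arrived early: there is a hole before it
        return self.next_expected, self.sack_blocks()

    def sack_blocks(self):
        """Merge buffered segments into [start, end) ranges beyond the ACK point."""
        blocks = []
        for seq in sorted(self.buffered):
            end = seq + len(self.buffered[seq])
            if blocks and blocks[-1][1] == seq:
                blocks[-1] = (blocks[-1][0], end)
            else:
                blocks.append((seq, end))
        return blocks


r = Receiver(initial_seq=1000)
print(r.receive(1000, b"a" * 500))   # in order
print(r.receive(2500, b"c" * 500))   # early: hole at 1500-2500
print(r.receive(1500, b"b" * 1000))  # hole filled
```

In the middle step the model answers with ack 1500 plus a SACK block for 2500-3000, which is the "I have these, but not 1500-2500" message from the text.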
Flow control — the receive window (rwnd) #
The mechanism that prevents the sender from outrunning the receiver's processing capacity. On every ACK, the receiver advertises "I can take this many bytes from here" (= the receive window, Window Size) in the header, and the sender keeps "unacknowledged in flight" below that window.
The TCP header's Window Size field is 16-bit (max 65,535), which isn't enough for modern high-speed links. The Window Scale option (RFC 7323) left-shifts the window by up to 14 bits, allowing it to reach up to 1 GB.
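The arithmetic behind those numbers is short enough to check directly. The sketch below uses made-up helper names (`max_window`, `throughput_ceiling_bps`) and the simplifying assumption that a sender can have at most one window of data in flight per round trip.

```python
MAX_UNSCALED_WINDOW = 2**16 - 1  # 16-bit Window Size field: 65,535 bytes
MAX_WINDOW_SHIFT = 14            # largest shift the Window Scale option allows


def max_window(shift):
    """Largest advertisable window in bytes for a given scale shift."""
    if not 0 <= shift <= MAX_WINDOW_SHIFT:
        raise ValueError("window scale shift must be between 0 and 14")
    return MAX_UNSCALED_WINDOW << shift


def throughput_ceiling_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


print(max_window(0))    # unscaled maximum
print(max_window(14))   # fully scaled maximum, just under 1 GiB
print(throughput_ceiling_bps(max_window(0), rtt_seconds=0.1))
```

With an unscaled window and a 100 ms round trip, the ceiling works out to roughly 5 Mbit/s no matter how fast the link is, which is why the scale option matters.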
When throughput on a fast, long-RTT path stalls well below the link rate, the real cause is often a receive window that cannot grow because the kernel-side buffer is too small. Tuning net.ipv4.tcp_rmem or the application's SO_RCVBUF can lift the ceiling.
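A rough way to size the buffer is the bandwidth-delay product (bandwidth × RTT), and an application can request a buffer with SO_RCVBUF. The sketch below is illustrative: `bdp_bytes` and `request_receive_buffer` are names invented here, and the value the kernel grants may differ from what was asked (Linux, for instance, doubles the request for bookkeeping and caps it at net.core.rmem_max, and setting SO_RCVBUF explicitly turns off receive-buffer autotuning for that socket).

```python
import socket


def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: the data needed in flight to fill the path."""
    return bandwidth_bps * rtt_ms // 8000  # bits/s * ms -> bytes


def request_receive_buffer(sock, size_bytes):
    """Ask for a receive buffer size and return what the kernel actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


needed = bdp_bytes(bandwidth_bps=1_000_000_000, rtt_ms=100)
print("BDP for 1 Gbit/s at 100 ms:", needed, "bytes")

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print("granted:", request_receive_buffer(s, 262144), "bytes")
```

For 1 Gbit/s at 100 ms the product is 12.5 MB, far above a 64 KB window, so comparing the granted value with the BDP is a quick first check.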
Congestion control — estimating path congestion #
Flow control (rwnd) only sees the two endpoints' situation. Congestion along the path is the job of congestion control. The sender keeps a second window — the congestion window (cwnd) — and what it can actually send is throttled to min(rwnd, cwnd).
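How the two windows interact can be shown with a toy model of a loss-based sender. This is a deliberately simplified, Reno-like sketch and not any kernel's algorithm: cwnd is counted in whole segments, it doubles per round trip below a threshold, grows by one above it, and is halved on a loss. The function name and parameters are assumptions for illustration.

```python
def simulate_send_window(rounds, rwnd, ssthresh, loss_rounds, initial_cwnd=1):
    """Return [(cwnd, usable_window)] at the start of each round trip.

    usable_window is min(rwnd, cwnd): the receiver's limit and the
    path's estimated limit both apply, and the smaller one wins.
    """
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append((cwnd, min(rwnd, cwnd)))
        if rtt in loss_rounds:
            ssthresh = max(cwnd // 2, 2)    # remember half of where it hurt
            cwnd = ssthresh                 # loss: cut the window in half
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # exponential growth phase
        else:
            cwnd += 1                       # linear growth phase
    return history


for cwnd, usable in simulate_send_window(10, rwnd=17, ssthresh=16, loss_rounds={7}):
    print(f"cwnd={cwnd:2d} usable={usable:2d}")
```

In this run cwnd climbs 1, 2, 4, 8, 16, then creeps up linearly, while the usable window is capped at 17 by rwnd until the loss in round 7 halves cwnd.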
| Algorithm | How it works (essence) | Adoption |
|---|---|---|
| Reno (1990) | Assumes loss = congestion → halve cwnd and restart | Old standard |
| CUBIC (2008) | Cubic-curve recovery of cwnd, strong on high-BDP links | Linux default (since 2.6.19) |
| BBR (2016) | Models estimated bandwidth × RTT directly rather than loss | Widely used at Google (YouTube, GCP) |
Slow Start doubles cwnd at every round trip, the algorithm switches to Congestion Avoidance at a threshold, and loss cuts it down — the classic "exponential → linear → braking" 3-phase cycle. BBR throws away the underlying assumption "loss = slow down" entirely. Wi-Fi paths with high retransmits, satellite links with long RTT, ultra-low-RTT data-center fabric — each shows very different numbers, so modern kernels let you switch via sysctl.
$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr
Connection state: the six worth watching #
The TCP state machine has 11 states (counting CLOSED), but only a handful actually matter day to day.
| State | Meaning | When to care |
|---|---|---|
| LISTEN | Server is waiting for accept | Confirming the service is up |
| ESTABLISHED | Data is flowing | Active connections: is the count too high? |
| SYN_SENT / SYN_RECV | Mid-handshake | Many SYN_RECV = a sign of SYN flood |
| FIN_WAIT_1 / FIN_WAIT_2 | After your side sent FIN | Normally clears quickly |
| CLOSE_WAIT | You got the peer's FIN but haven't called close yet | Long-lived = an app bug |
| TIME_WAIT | Active-close side / 2 MSL wait | Often plentiful, usually fine |
A large pile of TIME_WAIT is normal; trying to forcibly squash it via tcp_tw_recycle (removed in Linux 4.12) is a well-known historical anti-pattern. It was known to break clients behind NAT.
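On Linux the same per-state counts can be gathered without external tools by reading /proc/net/tcp, where the fourth column is the state as a hex code. The sketch below assumes that file format and takes the text as an argument so it can be tried on a sample; `count_states` is a name chosen for this example, and IPv6 sockets live in /proc/net/tcp6.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1 1
   1: 0500000A:C43A 22D8B85D:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 2 1
   2: 0500000A:C43C 22D8B85D:01BB 06 00000000:00000000 03:00000A00 00000000     0        0 0 3
"""
print(count_states(SAMPLE))
```

On a real host the input would come from `open("/proc/net/tcp").read()`; a CLOSE_WAIT count that keeps growing is the number to alert on.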
UDP — when reliability is "in the way" #
Every piece of TCP's machinery is paying a cost to give you "arrived / in order / no duplicates". For real-time voice, video, or games — where "show a newer frame even if a few pixels are missing, rather than wait for stale data to catch up" is what you actually want — TCP's guarantees get in the way.
UDP (User Datagram Protocol, RFC 768) is designed at the opposite extreme:
- Connectionless — no handshake, just send
- Header is only 8 bytes (source port / destination port / length / checksum)
- No retransmission / no ordering / no flow control / no congestion control
- The app provides whatever guarantees it needs, itself
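How little there is to UDP shows in code. The sketch below packs the 8-byte header by hand and then sends one datagram over loopback with no connection setup; the helper names are invented for this example, the header function leaves the checksum at 0 (which in UDP over IPv4 means "not computed"), and the loopback round trip assumes 127.0.0.1 is usable.

```python
import socket
import struct


def udp_header(src_port, dst_port, payload, checksum=0):
    """Build the 8-byte UDP header: four 16-bit big-endian fields."""
    length = 8 + len(payload)  # the length field covers header + payload
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)


def udp_roundtrip(message):
    """Send one datagram over loopback: no handshake, no connection state."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)         # nothing retransmits for us, so don't wait forever
        sender.sendto(message, receiver.getsockname())
        data, _addr = receiver.recvfrom(2048)
        return data


print(udp_header(5353, 53, b"hello").hex())
print(udp_roundtrip(b"ping"))
```

The timeout on the receiving socket is the application supplying its own guarantee: if the datagram is lost, deciding what to do next is entirely up to the app.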
Representative cases that pick UDP:
- DNS queries — a TCP handshake is overkill when one round trip is enough
- VoIP / RTP / WebRTC — re-delivering a 0.5-second-old voice packet is pointless
- Games — newer inputs always beat older positions
- DHCP / NTP / TFTP — bootstrap-oriented
QUIC — rebuilding TCP+TLS+HTTP on top of UDP #
TCP is a protocol whose reliability and congestion control live in the OS kernel. For a long time that was an advantage, but it has become a weakness: evolution is slow (new algorithms require an OS update), and stacking TCP and TLS handshakes in series (TCP 1 RTT + TLS 1.3 1 RTT = 2 RTT before the first request, 3 RTT with TLS 1.2) is also heavy for today's web.
QUIC (RFC 9000, 2021) rebuilds all of this in userspace, on top of UDP:
- TCP+TLS+HTTP merged into one — TLS 1.3 is built in; first connection is 1 RTT, repeat connection is 0 RTT
- Multiple streams — independent streams multiplexed over one connection, eliminating TCP's head-of-line blocking
- Connection ID — IP/port can change (Wi-Fi → 4G handover) without breaking the connection
- Userspace implementation — new congestion control algorithms ship far faster
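The head-of-line blocking point can be made concrete with a toy model. It is only a sketch of the ordering rule, not of either protocol: `deliverable` is a hypothetical function that, given packets in send order and the set of lost ones, lists what the application can read before any retransmission arrives.

```python
def deliverable(packets, lost, per_stream):
    """packets: list of (stream_id, payload) in send order.
    lost: set of indices of packets that did not arrive.
    per_stream=False models one ordered byte stream shared by all streams;
    per_stream=True models independent ordering per stream.
    """
    out = []
    blocked_streams = set()
    blocked_all = False
    for i, (stream, payload) in enumerate(packets):
        if i in lost:
            if per_stream:
                blocked_streams.add(stream)  # only this stream has a hole
            else:
                blocked_all = True           # the single byte stream has a hole
            continue
        if blocked_all or stream in blocked_streams:
            continue  # arrived, but must wait behind the missing data
        out.append((stream, payload))
    return out


packets = [("A", "a1"), ("B", "b1"), ("A", "a2"), ("B", "b2")]
print(deliverable(packets, lost={0}, per_stream=False))
print(deliverable(packets, lost={0}, per_stream=True))
```

Losing the first packet stalls everything in the shared-stream model, while in the per-stream model stream B keeps flowing and only stream A waits.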
HTTP/3 is "HTTP over QUIC", broadly adopted by Google / Cloudflare / Meta / Akamai. As of 2026 it has become a standard option for new infrastructure. The simple slogan "UDP = no reliability" became dated the day QUIC shipped.
Seeing the packets for real #
Everything covered above can be observed live with tcpdump and ss.
# 3-way handshake on a new connection (-S prints absolute sequence numbers)
$ sudo tcpdump -nn -S -i any 'tcp port 443'
12:34:56.123 IP 10.0.0.5.50234 > 93.184.216.34.443: Flags [S], seq 100, ...
12:34:56.145 IP 93.184.216.34.443 > 10.0.0.5.50234: Flags [S.], seq 200, ack 101, ...
12:34:56.146 IP 10.0.0.5.50234 > 93.184.216.34.443: Flags [.], ack 201, ...
# breakdown by state: the kind of number you want on a dashboard
$ ss -tan | awk 'NR>1 {s[$1]++} END {for(k in s) print k, s[k]}'
ESTAB 142
TIME-WAIT 38
LISTEN 12
CLOSE-WAIT 2
# list ESTABLISHED on a specific port
$ ss -tan state established sport = :443
# RTT / cwnd / retransmits and the rest (Linux)
$ ss -tin
cubic wscale:7,7 rto:204 rtt:3.142/1.5 ato:40 mss:1448 cwnd:10 ssthresh:7 bytes_sent:8421 retrans:0/2
The cwnd:N / rtt:M/V / retrans:R/T shown by ss -tin reveals the congestion window, round-trip time, and retransmits directly. "The page is slow," "transfer collapses between these two servers only": almost every TCP investigation ends back here.
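To put these fields on a dashboard, the info line can be parsed with a few regular expressions. The sketch below is written against the sample line above only; `parse_ss_info` and its key names are assumptions, and real ss output varies by version and may omit fields, so every field is treated as optional.

```python
import re


def parse_ss_info(line):
    """Extract cwnd, rtt/rttvar (ms) and retransmit counters from an `ss -tin` info line."""
    info = {}
    m = re.search(r"\bcwnd:(\d+)", line)
    if m:
        info["cwnd"] = int(m.group(1))
    m = re.search(r"\brtt:([\d.]+)/([\d.]+)", line)  # \b keeps "minrtt:" from matching
    if m:
        info["rtt_ms"] = float(m.group(1))
        info["rttvar_ms"] = float(m.group(2))
    m = re.search(r"\bretrans:(\d+)/(\d+)", line)    # \b keeps "bytes_retrans:" from matching
    if m:
        info["retrans_current"] = int(m.group(1))
        info["retrans_total"] = int(m.group(2))
    return info


sample = ("cubic wscale:7,7 rto:204 rtt:3.142/1.5 ato:40 mss:1448 "
          "cwnd:10 ssthresh:7 bytes_sent:8421 retrans:0/2")
print(parse_ss_info(sample))
```

A small cwnd together with a rising total retransmit count points at the path; a healthy cwnd with low throughput points back at the receive window or the application.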
Summary #
- TCP/IP is the Internet's "courier service", and the heart of the four-layer model is Transport (TCP) and Internet (IP)
- TCP's essence is "a chain of mechanisms that build reliability on top of best-effort IP" = handshake + sequence numbers + receive window + congestion window
- UDP shows "what happens when you throw all of those mechanisms away"; QUIC shows "what you get when you redesign the whole stack in userspace"; the two depart from TCP in opposite directions
- Once you have the mechanisms, reading ss -tin, choosing whether to adopt HTTP/3, and picking a congestion-control algorithm all become things you can reason about on your own terms