Open Source Intelligence (OSINT) #
OSINT (Open Source Intelligence) is the umbrella term for the practice and culture of using publicly available information alone to derive useful knowledge about people, organizations, infrastructure, or events. "Publicly available" means anything that anyone can lawfully access — the Web, social networks, gazettes and registries, news media, satellite imagery, DNS records, code repositories, data brokers, leaked datasets — and it is deliberately separated from hacking or unauthorized data collection.
OSINT is not a single tool or job title. It powers attacker reconnaissance (the preparation phase of an intrusion), defensive threat intelligence, investigative journalism, digital forensics, pre-employment background checks, disaster damage assessment, and missing-person searches. The premise is "more than half of the answers are already public", and the question becomes how to find, verify, and reassemble them quickly, accurately, and ethically.
1. The history of OSINT #
1.1 World War II and the Cold War — Foreign Broadcast Information Service (1941) #
The practice of open source intelligence traces back to World War II. In 1941 the United States set up the Foreign Broadcast Monitoring Service (later renamed the Foreign Broadcast Information Service, FBIS) to translate and analyze adversary and neutral-country radio, newspapers, and publications. During the Cold War, public information reportedly accounted for as much as 80% of all intelligence, and Soviet rail timetables, agricultural statistics, and provincial obituaries were used to infer military movements and economic conditions.
1.2 1980s–1990s — Arrival of the Internet #
The spread of the Web in the 1990s made it possible to search public information from your own home, anywhere in the world. OSINT in this era still amounted to little more than "search for it"; no specialized methodology had been formalized yet.
1.3 2000s — Google hacking and the rise of social media #
In 2002 Johnny Long began cataloguing "googleDorks", queries that combine inurl:, filetype:, intitle:, and other operators to dig out sensitive public files and configuration leaks; the collection grew into the "Google Hacking Database". The same era saw the launches of Friendster (2002), MySpace (2003), LinkedIn (2003), Facebook (2004), and Twitter (2006), the point at which people started voluntarily publishing their own personal information at scale.
1.4 2010s — Bellingcat and citizen intelligence #
In 2014, Eliot Higgins founded Bellingcat. Through investigations of MH17, the Syrian chemical-weapons attacks, and Russian assassination attempts, it showed the world the power of citizen OSINT that combines public videos, satellite imagery, social-media posts, flight-radar data, and street-view photography. Around the same time, dedicated tools such as Maltego (Paterva, 2007–) and theHarvester (Christian Martorella, 2011–) matured, and OSINT emerged as an independent discipline.
1.5 2020s — Wartime OSINT and AI #
After Russia's 2022 invasion of Ukraine, the front line became near-real-time observable through citizen TikTok / Twitter / Telegram posts, satellite imagery (Maxar, Planet), and IoT geolocation, ushering in an era of open battle-space visualization. AI-powered image analysis, geolocation, machine translation, and face recognition exploded into widespread availability, raising both the scale and the precision of OSINT a notch.
2. The OSINT cycle #
The intelligence cycle that the military and intelligence communities have used for decades applies directly to OSINT. The point is "loop, don't go straight" — every answer you obtain produces a new question that feeds back into collection.
A quick walk through each step:
- Planning — Put your question in words and form a hypothesis. "Which cloud vendor does company X use?", "Where does person Y operate from?", "When did event Z take place?". A scope that's too broad burns infinite time, so keep yourself to 1–3 questions at a time.
- Collection — Pull public information per your plan. Web, social, DNS, satellite imagery, breach databases, government data — run the sources covered later in parallel.
- Processing — Turn what you found into machine-readable form. Transcribe video to text, OCR images, translate foreign-language sources, align timestamps to UTC, extract metadata.
- Analysis — Stitch the fragments together, spot contradictions and biases, and pull out evidence that supports or refutes your hypothesis. The rule: triangulate from at least three independent sources.
- Dissemination — Get the conclusion to the right audience as a report, timeline, or visualization. Separate your judgments from your citations so the work is verifiable and trustworthy after the fact.
Whatever answer you get usually creates a new question, and you go back to Planning. In the field, a single case loops several times.
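The "align timestamps to UTC" part of the Processing step can be sketched in Python. This is a minimal illustration, assuming every collected item already carries an ISO 8601 timestamp with an explicit UTC offset; the function names, source labels, and sample timestamps are hypothetical.

```python
from datetime import datetime, timezone

def to_utc(iso_timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # A naive timestamp is ambiguous; refuse to guess its time zone.
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp!r}")
    return parsed.astimezone(timezone.utc)

def build_timeline(items):
    """Return (utc_time, source) pairs sorted chronologically."""
    return sorted((to_utc(timestamp), source) for source, timestamp in items)

# Hypothetical observations reported in three different local time zones.
observations = [
    ("telegram_post", "2022-03-01T14:05:00+02:00"),
    ("tweet", "2022-03-01T11:58:00+00:00"),
    ("news_article", "2022-03-01T21:10:00+09:00"),
]

for when, source in build_timeline(observations):
    print(when.isoformat(), source)
```

Once everything is on one clock, contradictions between sources (an event "reported" before it could have happened, for example) become much easier to spot during Analysis.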
3. Categories of data sources #
OSINT data sources broadly map to the following categories.
| Category | Examples | What it tells you |
|---|---|---|
| Web pages and blogs | Official sites, news, blogs, job postings, IR documents | Org info, contacts, technology stack |
| Social media | Twitter/X, Facebook, Instagram, LinkedIn, TikTok, Telegram, Discord | Connections, movement, preferences, timeline |
| Image and video | YouTube, TikTok, Flickr, satellite imagery (Maxar, Planet, Sentinel) | Place, time, identity correlation |
| Maps and geography | Google Maps Street View, OpenStreetMap, Mapillary, KartaView | Street features, building layout |
| Public records | Corporate registries, real-estate registries, court records, government statistics, FOIA disclosures | Officers, shareholders, owners, lawsuits |
| DNS / IP / certificates | DNS records, BGP, WHOIS, crt.sh, Shodan, Censys | Infrastructure layout, exposed services |
| Code and infra leaks | GitHub, GitLab, Pastebin, public S3 buckets | Credentials, internal designs |
| Breach / leak data | Have I Been Pwned, Dehashed, leak forums | Passwords, emails, phone numbers |
| Data brokers | Spokeo, BeenVerified, etc. | Addresses, phones, family (especially in the US) |
| Device metadata | EXIF, IPTC, ID3 tags, browser fingerprints | Capture device, geotags, edit history |
Caveat: what counts as "public" varies enormously by country and region. In Japan, corporate-registry copies are obtainable for a fee through the official portal; US real-estate data is essentially fully public at the county level; in the EU, GDPR sharply restricts secondary use of personal data. Confirm legality in your own jurisdiction first.
4. Canonical techniques #
4.1 Search operators (Google dorking) #
The Google / Bing operator vocabulary lets you surface public files and configuration data that ordinary searches miss.
| Operator | Example | Purpose |
|---|---|---|
| site: | site:example.com | Restrict to a domain |
| inurl: | inurl:admin | Substring in the URL |
| intitle: | intitle:"index of" | Substring in the page title |
| filetype: | filetype:pdf "internal" | Restrict to a file extension |
| intext: | intext:"password" | Substring in the body |
| cache: | cache:example.com | Google's cached copy (retired by Google in 2024) |
| - | -marketing | Exclude a term |
| "" | "social security number" | Exact phrase |
Practical combinations:
# Open-directory leakage on a target org
site:example.com intitle:"index of" -html
# .env files left publicly readable
filetype:env "DB_PASSWORD"
# Pastebin hits for an internal-only string
site:pastebin.com "internal-only" example.com
# Old, vulnerable phpMyAdmin instances
inurl:phpmyadmin/index.php intitle:"phpMyAdmin 2."
The Google Hacking Database (GHDB) (exploit-db.com/google-hacking-database) curates thousands of these queries.
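When the same combinations are run against many targets, it can help to compose the query strings programmatically. The following is a small sketch, not part of any official tooling: `build_dork` and `search_url` are hypothetical helper names, and the only external assumption is that the search engine accepts the query in a `q` URL parameter.

```python
from urllib.parse import urlencode

def build_dork(site=None, filetype=None, intitle=None, inurl=None,
               phrases=(), exclude=()):
    """Compose a search query string from common operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    if inurl:
        parts.append(f"inurl:{inurl}")
    parts.extend(f'"{phrase}"' for phrase in phrases)   # exact phrases
    parts.extend(f"-{word}" for word in exclude)        # excluded terms
    return " ".join(parts)

def search_url(query: str) -> str:
    """URL-encode the query into a search URL (query passed as the `q` parameter)."""
    return "https://www.google.com/search?" + urlencode({"q": query})

open_dirs = build_dork(site="example.com", intitle="index of", exclude=["html"])
env_files = build_dork(filetype="env", phrases=["DB_PASSWORD"])

print(open_dirs)
print(search_url(env_files))
```

Building queries this way keeps them reviewable and loggable; run them only against domains you are authorized to investigate.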
4.2 Reverse image search #
Starting from a single image, find identical or visually similar copies elsewhere on the Web.
- Google Images (images.google.com) — strong on common subjects, celebrities, products
- TinEye (tineye.com) — strong on identifying the original posting date and the sites where it appears
- Yandex Images (yandex.com/images) — unusually strong on face matching and place identification (a class apart in OSINT circles)
- Bing Visual Search — product recognition
- PimEyes — face only; ethically contested
Real-world uses:
- Instantly spot fake social-media accounts whose profile photos are reused stock photos
- Locate the actual capture site of a disaster / incident photo, often with a single Yandex search
- Run a scam-site product photo through TinEye and see "the same image is reused on 50 sites"
4.3 Geolocation #
Geolocation is the technique of identifying where a photo or video was captured. It is the core of the GeoGuessr-style OSINT that Bellingcat popularized.
Clues to hunt for:
- Road signs and street markings — language, font, color, shape
- Building style and roof color — characteristic by country
- Vegetation — palms vs. conifers, seasonal state
- Sun position and shadow length — infer time of day and latitude (SunCalc.org)
- Vehicles and license plates
- Power poles, wiring, mailboxes
- Background mountains and coastlines, cross-referenced with Google Earth
Tools:
- SunCalc — narrow time and date from sun azimuth and altitude
- Mapillary / KartaView — street imagery
- Wikimapia — user-contributed geographic tags
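The sun-and-shadow clue reduces to basic trigonometry: on flat ground, the sun's elevation is the arctangent of object height over shadow length. The sketch below works under stated assumptions: the object is vertical, the ground is level, and solar declination is taken from a coarse cosine approximation. A tool such as SunCalc is the better choice for real work. All function names are hypothetical.

```python
import math

def sun_elevation_from_shadow(object_height: float, shadow_length: float) -> float:
    """Sun elevation in degrees, from a vertical object's height and its shadow length."""
    if object_height <= 0 or shadow_length <= 0:
        raise ValueError("height and shadow length must be positive")
    return math.degrees(math.atan2(object_height, shadow_length))

def solar_declination(day_of_year: int) -> float:
    """Coarse approximation of solar declination in degrees for a given day of the year."""
    return -23.44 * math.cos(math.radians(360 / 365 * (day_of_year + 10)))

def noon_elevation(latitude: float, day_of_year: int) -> float:
    """Approximate highest sun elevation of the day (local solar noon) at a latitude."""
    return 90 - abs(latitude - solar_declination(day_of_year))

# A 1.8 m person casting a 1.8 m shadow implies the sun is about 45 degrees up.
measured = sun_elevation_from_shadow(1.8, 1.8)

# Candidate check: could the sun ever be that high at 60 N around 21 December (day 355)?
candidate_is_possible = measured <= noon_elevation(60.0, 355)

print(round(measured, 1), candidate_is_possible)
```

A measured elevation higher than the noon maximum rules out a candidate latitude and date outright, which is often a faster filter than matching landmarks.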
4.4 Social media intelligence (SOCMINT) #
SOCMINT means mining social media systematically as a primary source on people and events.
Typical investigation items:
- Account creation date and first post
- Following / follower network
- Posting-time histogram → infer the user's time zone
- Photo geotags (still present in EXIF whenever the platform or the uploader fails to strip them)
- Likes / comments graph → close relationships
- Username / phone / email matches across platforms to consolidate identity
Tools:
- Sherlock — bulk-check a username across hundreds of sites
- Maigret — Sherlock fork with broader coverage
- WhatsMyName — same idea, browser-based
- OSINT Industries — commercial; given an email or phone, returns full social presence
- snscrape — scraper for Twitter, Facebook, etc.
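The "posting-time histogram → time zone" item above can be sketched as follows. This is a heuristic illustration only: it assumes the account's least-active window corresponds to sleep and that this window begins around local midnight, which shift workers, schedulers, and bots will all violate. The function names and the sample account are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_histogram(timestamps):
    """Count posts per UTC hour (list index 0..23)."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

def quietest_window_start(histogram, width=6):
    """UTC hour at which the least-active contiguous window of `width` hours begins."""
    def activity(start):
        return sum(histogram[(start + i) % 24] for i in range(width))
    return min(range(24), key=activity)

def estimate_utc_offset(histogram, width=6, assumed_local_start=0):
    """Guess the UTC offset, ASSUMING the quiet window begins at `assumed_local_start` local time."""
    start_utc = quietest_window_start(histogram, width)
    offset = (assumed_local_start - start_utc) % 24
    return offset - 24 if offset > 12 else offset

# Hypothetical account at UTC+9 posting once per hour from 07:00 to 23:00 local time.
posts = [
    datetime(2024, 1, 1, (local_hour - 9) % 24, tzinfo=timezone.utc)
    for local_hour in range(7, 24)
]
histogram = hourly_histogram(posts)
print(estimate_utc_offset(histogram))
```

Treat the result as one weak signal to triangulate with language, geotags, and network data, never as a conclusion on its own.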
4.5 Domain / IP / certificate OSINT #
Map an organization's infrastructure from the outside — the starting point for attacker recon, pentesting, and threat intelligence alike.
# WHOIS — domain registration data
whois example.com
# DNS — query record types individually (many servers refuse or minimize ANY queries, see RFC 8482)
for t in A AAAA NS SOA; do dig +noall +answer example.com "$t"; done
dig +short MX example.com
dig +short TXT example.com # SPF and verification tokens
dig +short TXT _dmarc.example.com # DMARC policy (DKIM keys live under <selector>._domainkey)
# Subdomain enumeration
subfinder -d example.com -all -silent
amass enum -d example.com
# Certificate Transparency logs — publicly trusted certs logged for the domain
curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq -r '.[].name_value' | sort -u
# Shodan — public ports / banners / vulns
shodan host 93.184.216.34
shodan search '"Server: Apache" port:80 country:JP'
# Censys — Shodan competitor; especially strong on TLS-cert indexing
censys search 'services.tls.certificates.leaf_data.subject.common_name: "example.com"'
# Past Web — chase information that's been deleted
curl -s "https://web.archive.org/cdx/search/cdx?url=example.com/*&output=json&limit=20"
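The crt.sh JSON needs a little cleanup before it is useful: one `name_value` field can hold several newline-separated names, wildcard entries appear as `*.`, and unrelated names can slip in. A minimal post-processing sketch follows, assuming the response has already been downloaded and parsed into a list of dicts. The function name and the sample records are hypothetical.

```python
def hostnames_from_ct(records, domain):
    """Extract unique hostnames under `domain` from parsed crt.sh JSON records."""
    domain = domain.lower()
    suffix = "." + domain
    names = set()
    for record in records:
        # One certificate may list several names, separated by newlines.
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]          # treat a wildcard as its parent name
            if name == domain or name.endswith(suffix):
                names.add(name)
    return sorted(names)

# Hypothetical records shaped like crt.sh output (only the field used here).
sample_records = [
    {"name_value": "www.example.com\nexample.com"},
    {"name_value": "*.staging.example.com"},
    {"name_value": "vpn-staging.example.com"},
    {"name_value": "www.example.com"},
    {"name_value": "example.com.evil.test"},   # look-alike, must be rejected
]

for host in hostnames_from_ct(sample_records, "example.com"):
    print(host)
```

Matching on the dotted suffix rather than a plain substring is what keeps look-alike names out of the result.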
4.6 Metadata and EXIF analysis #
Photos, PDFs, Word docs, and videos all carry abundant metadata.
# Image EXIF
exiftool photo.jpg
# → camera model, capture date and time, GPS coords (Lat/Lon), white balance, lens, serial number
# PDF / Office metadata
exiftool report.pdf
# → author name, last editor, software version, comments, revision history
# Video
ffprobe -v error -show_format -show_streams video.mp4
# Pull metadata from a whole site at once
metagoofil -d example.com -t pdf,doc,xls -l 100 -n 50 -o results
Metadata is easy to forget to strip. In journalism and investigations there are many cases where the author name embedded in a leaked PDF revealed an internal organizational connection.
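exiftool prints GPS coordinates in degrees, minutes, and seconds by default (for example `35 deg 39' 31.00" N`), while most mapping tools expect signed decimal degrees. The conversion can be sketched as below; `dms_to_decimal` is a hypothetical helper, and the regular expression assumes exiftool's default human-readable output format.

```python
import re

# Matches strings such as: 35 deg 39' 31.00" N
_DMS = re.compile(r"""(\d+)\s*deg\s*(\d+)'\s*([\d.]+)"\s*([NSEW])""")

def dms_to_decimal(text: str) -> float:
    """Convert a degrees/minutes/seconds coordinate string to signed decimal degrees."""
    match = _DMS.search(text)
    if match is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    degrees, minutes, seconds, ref = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative in decimal notation.
    return -value if ref in "SW" else value

latitude = dms_to_decimal('35 deg 39\' 31.00" N')
longitude = dms_to_decimal('139 deg 44\' 43.00" E')
print(f"{latitude:.6f}, {longitude:.6f}")
```

The resulting pair can be pasted directly into a map search to check the claimed location against the geotag.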
5. Major tools #
5.1 All-in-one OSINT frameworks #
| Tool | Use | License |
|---|---|---|
| Maltego (maltego.com) | Graph-visualization OSINT IDE; "Transforms" link external data sources into nodes | Commercial (limited Community edition) |
| SpiderFoot (spiderfoot.net) | 200+ modules of automated OSINT collection; HX is the cloud edition | Open / commercial (HX) |
| Recon-ng (github.com/lanmaster53/recon-ng) | Metasploit-style interactive CLI, modular | Open |
| Datasploit | Python-based, broad sweeps over domain / email / person | Open (archived) |
5.2 Target-specific tools #
| Tool | Input | Output |
|---|---|---|
| theHarvester | Domain | Emails / subdomains / employee names |
| Sherlock / Maigret | Username | Existence checks across hundreds of social sites |
| holehe | Email address | List of services where the email is registered |
| GHunt | Gmail address | Public info on the corresponding Google account |
| EmailRep | Email address | Trust score / breach history |
| OSINT Framework (osintframework.com) | — | Curated link tree of tools, organized by purpose |
5.3 Specialized search engines #
| Service | Search target |
|---|---|
| Shodan (shodan.io) | Internet-exposed hosts, services, banners |
| Censys (censys.io) | TLS certificate index, hosts, subdomains |
| ZoomEye (zoomeye.org) | A Chinese counterpart to Shodan |
| Wayback Machine (web.archive.org) | Historical Web (1996–) |
| Internet Archive | Video, books, software |
| GreyNoise (greynoise.io) | Classifies "Internet noise" — scanner IPs and the like |
5.4 Breach data services #
| Service | Use |
|---|---|
| Have I Been Pwned (haveibeenpwned.com) | Has my email / password leaked in a known breach? |
| Dehashed | Commercial; cross-search of breach databases (ethically heavy — only for lawful investigations) |
| IntelX | Search engine over public and leaked documents |
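For the password half of that question, Have I Been Pwned's Pwned Passwords service exposes a k-anonymity "range" lookup: the client sends only the first five hex characters of the password's SHA-1 hash to `https://api.pwnedpasswords.com/range/<prefix>` and compares the returned `SUFFIX:COUNT` lines locally, so the password never leaves the machine. The sketch below covers only the offline parts (hash splitting and response parsing). The helper names and the sample response counts are made up for illustration.

```python
import hashlib

def split_sha1(password: str):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest; only the prefix is sent."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def count_in_range_response(suffix: str, response_body: str) -> int:
    """Find our suffix among 'SUFFIX:COUNT' lines; 0 means not present in this range."""
    for line in response_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0

prefix, suffix = split_sha1("password")

# Hypothetical response body for the prefix above; the counts are invented.
sample_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:3\n{suffix}:42\n"

print(prefix, count_in_range_response(suffix, sample_body))
```

In a real check the response body would come from an HTTPS GET to the range endpoint; a non-zero count means the password has appeared in known breaches and should be retired.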
6. Worked example workflows #
Two concrete OSINT flows.
6.1 From domain to a complete picture of an organization's infrastructure #
Starting point: "We want to pentest company X (already authorized)."
- `whois example.com` for registration info and contacts
- `dig` / `subfinder` to enumerate subdomains
- crt.sh for every certificate logged for the domain → discover hidden subdomains (e.g. `vpn-staging.example.com`)
- Shodan sweeps each IP for `port:443`, `port:22`, `port:3389` → spot exposed RDP, old OpenSSH builds
- theHarvester harvests email addresses from LinkedIn / Google → a list approximating the employee directory
- Run those addresses through Have I Been Pwned → the breaches each address appears in are leading indicators of credential-reuse risk in current accounts
- GitHub search for `org:example-corp` → tokens, internal hostnames, accidentally published API keys
Aggregating each step in a Maltego or SpiderFoot graph yields a single-page map of "infrastructure and people, rooted in the domain."
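The aggregation idea can be illustrated without either tool: every finding becomes a (source, relation, target) edge, and walking the graph from the root domain yields the map. This is only a toy sketch of the data structure, not how Maltego or SpiderFoot work internally. The class name, relation labels, and sample findings (using a documentation-reserved IP address) are all hypothetical.

```python
from collections import defaultdict, deque

class FindingGraph:
    """Directed graph of entities (domains, hosts, IPs, people) linked by named relations."""

    def __init__(self):
        self._edges = defaultdict(set)

    def link(self, source, relation, target):
        """Record one finding as an edge."""
        self._edges[source].add((relation, target))

    def walk(self, root):
        """Breadth-first list of (source, relation, target) triples reachable from root."""
        seen = {root}
        queue = deque([root])
        triples = []
        while queue:
            node = queue.popleft()
            for relation, target in sorted(self._edges.get(node, ())):
                triples.append((node, relation, target))
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return triples

graph = FindingGraph()
graph.link("example.com", "has_subdomain", "vpn-staging.example.com")
graph.link("example.com", "has_email", "j.doe@example.com")
graph.link("vpn-staging.example.com", "resolves_to", "203.0.113.10")
graph.link("203.0.113.10", "exposes", "3389/tcp")

for source, relation, target in graph.walk("example.com"):
    print(f"{source} --{relation}--> {target}")
```

Keeping the relation name on every edge preserves how each fact was derived, which matters when the findings are later written up as evidence.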
6.2 From one photo to capture location and time #
Goal: verify a photo posted to social media as evidence of an incident or unlawful act.
- Check whether EXIF still has GPS via `exiftool photo.jpg`; if not, move on
- Reverse-search via TinEye / Google Images / Yandex Images to see "where else this image appears"
- Language and typography of background text (signs, markings) → narrow country / region
- Architecture, vegetation, power-pole shape → narrow further within the country
- Google Earth / Mapillary to walk candidate locations and find one that matches the photo's perspective
- Read shadow length and azimuth from the photo, then use SunCalc to compute "at this latitude, that shadow corresponds to roughly this hour" → narrow capture time
- Wayback Machine — compare past photos of the same location; construction or remodeling can pin "must have been taken within the last N years"
For MH17, Bellingcat combined these techniques to reconstruct the route that the Russian Buk launcher took through eastern Ukraine.
7. Ethics and law #
Saying "OSINT is just public information" does not settle what is allowed and what isn't. That boundary is drawn by both national law and professional ethics.
7.1 Areas that are often illegal or borderline #
- Unauthorized access — viewing public pages is fine, but guessing credentials to log in is unauthorized access (Japan: Unauthorized Computer Access Law; US: CFAA)
- Secondary use / sale of personal data — GDPR (EU), the amended Act on the Protection of Personal Information (Japan), and CCPA (California) all restrict how personal data may be reused or sold
- Stalking and harassment — depending on use, the same data collection becomes a crime
- Child sexual abuse material (CSAM) — possession itself is criminalized in most jurisdictions, with no exception for private investigators
- Use of leaked data — checking "is my email in a breach?" is fine; redistributing someone else's leaked data is almost universally not
- Face-recognition OSINT — data-protection authorities in several EU countries have ruled Clearview AI's data collection unlawful under GDPR; PimEyes is contested; in Japan, portrait rights and privacy concerns apply
7.2 Ethics codes #
Guidelines widely supported in investigative journalism and the security industry:
- Legitimacy of purpose — public interest, contractual engagement, or self-defense
- Proportionality — depth of investigation must not exceed what the purpose requires
- Minimization — collect only what's needed and discard everyone else's irrelevant data
- Triangulation — never conclude from a single source; require at least three independent ones
- Reject misidentification — keep the risk of name collisions / wrong inferences in mind throughout
- Notify those affected — before publication, consider whether subjects or relevant authorities should be notified or given a right of reply
7.3 Notes for working as a professional #
- Get a written engagement / NDA
- Keep logs — preserve queries, source URLs, and timestamps for full reproducibility
- Data hygiene — encrypt at rest, securely delete after the engagement ends
- Cross-border laws — comply with the law of the target's region as well as your own; some jurisdictions restrict investigations conducted from abroad
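The "keep logs" item can be made concrete with an append-only JSON Lines file in which each entry records the query, the source URL, a UTC timestamp, and a SHA-256 hash of the retrieved content. This is one possible layout, not an industry standard; the field names and helper functions are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_entry(query: str, source_url: str, content: bytes, note: str = "") -> dict:
    """Build one evidence-log record; the hash lets a reviewer verify the saved copy later."""
    return {
        "retrieved_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "query": query,
        "source_url": source_url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "note": note,
    }

def append_to_log(path: str, entry: dict) -> None:
    """Append the record as one JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")

entry = log_entry(
    query='site:example.com intitle:"index of"',
    source_url="https://example.com/files/",
    content=b"hello",
    note="open directory listing",
)
print(json.dumps(entry, indent=2))
```

Calling `append_to_log("engagement.jsonl", entry)` after each retrieval gives a chronological, tamper-evident trail, and the log file itself falls under the same encrypt-and-delete rules as the rest of the engagement data.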
8. Defense — Counter-OSINT #
If attackers can assemble a picture of you with OSINT, defense is making yourself harder to assemble.
8.1 At the personal level #
- Privacy settings on social media — friends-only at minimum; treat your real name, employer, and alma mater as careful disclosures
- Strip EXIF before posting photos (Twitter / Instagram do it automatically; Discord and Telegram sometimes don't)
- Disable geotagging — turn it off in your phone's camera settings
- Use different usernames — same handle across all platforms means Sherlock-class tools consolidate you in one shot
- Compartmentalize email / phone — main, secondary, throwaway — three buckets to keep things separate
- Google Alerts — get notified on hits for your name / email / phone
- Have I Been Pwned notifications — change passwords the moment your email shows up in a new breach
8.2 At the organizational level #
- WHOIS privacy — enable the registrar's proxy
- Certificate-issuance policy — don't accidentally surface `staging.`, `dev.`, `internal.` to crt.sh (use an internal CA)
- GitHub / GitLab secret scanning — keep it on continuously to catch accidental commits early
- Watch ex-employees' personal SNS / signatures — they leak job titles, org charts, and customer names
- Self-monitoring on Shodan — periodically scan your own IP space for unexpected exposed services
- Email-list / roster cache scrubbing — check Pastebin and Internet Archive for unintended copies
- OSINT-style red-team drills — periodically perform OSINT on yourselves from the outside
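As a rough illustration of what secret scanning looks for, the sketch below greps text for a few well-known patterns. It is deliberately simplistic and no substitute for the platforms' built-in scanners: the pattern set is a small hypothetical sample, and the key shown is the placeholder value from AWS documentation, not a real credential.

```python
import re

# A tiny, illustrative pattern set; real scanners ship hundreds of rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\b\w*password\w*\s*[:=]\s*\S+"),
}

def scan_text(text: str):
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings

# Hypothetical file contents of the kind that ends up in public repositories.
sample = """\
DB_HOST=db.internal.example.com
DB_PASSWORD=hunter2
aws_key = "AKIAIOSFODNN7EXAMPLE"
"""

for line_number, name in scan_text(sample):
    print(f"line {line_number}: {name}")
```

Running a check like this in a pre-commit hook catches the obvious cases before they ever reach a public repository, which is cheaper than rotating credentials afterwards.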
8.3 Counter-intelligence and deliberate noise #
When organizational defense gets sophisticated, deliberately seeding misleading information to raise the attacker's cost is on the table — fake employee names, deliberately vulnerable-looking exposed endpoints (honeypots), seeded fake leaked accounts. This is ethically and legally delicate, so do it under expert advice.
9. Conclusion #
OSINT is not "a magic toolbox"; it's the combination of how you frame the question, how efficiently you gather public information, how rigorously you verify, and how you make ethical judgments. As Bellingcat and wartime OSINT have shown, even citizen-level practice can get to the bottom of historic events. But because it touches the privacy, safety, and reputation of its subjects, you have to keep asking yourself "can do" vs. "should do" at every step.
For security practitioners, OSINT is the phase that attackers will absolutely run first, so defenders are equally obligated to know how their organization (and they themselves) look from outside. Trying these techniques on yourself is the best way to enter OSINT — and the best defense you can mount.