OSINT — Methods, Tools, and Real-World Examples of Open-Source Investigation

OSINT (Open Source Intelligence) is the umbrella term for the techniques and culture of using publicly available information alone to investigate people, organisations, infrastructure, or events. Its applications are wide: attacker reconnaissance, defensive threat intelligence, investigative journalism, digital forensics. Starting from the premise that "more than half the answer is already out in the open", the essence of OSINT is how fast, how accurately, and how ethically you can gather and re-assemble it.

A short history of OSINT #

The practice of OSINT dates back to World War II, although the term itself came later. In 1941 the United States established the Foreign Broadcast Monitoring Service (later FBIS) to translate and analyse enemy and neutral-country radio, newspapers, and publications. Throughout the Cold War, public sources were often said to account for around 80% of intelligence: Soviet rail timetables, agricultural statistics, and local-paper obituaries were used to infer military movements.

1941 — FBMS (later FBIS) established

Systematic monitoring, translation, and analysis of foreign broadcasts. The institutional origin of OSINT.

1990s — The Web arrives

Public-source information from around the world becomes searchable from home. Methodology, though, is still at the level of "search on Google".

2002 — Google Hacking

Johnny Long begins collecting the search queries that later become the Google Hacking Database (GHDB, organised in 2004). The discipline of combining inurl:, filetype:, and intitle: to unearth sensitive material is codified. The major social networks appear in the following years (LinkedIn 2003, Facebook 2004, Twitter 2006).

2014 — Bellingcat founded

Eliot Higgins founds Bellingcat, which goes on to reconstruct the downing of MH17, Syrian chemical-weapons attacks, and Russian assassination attempts by combining public videos, satellite imagery, social-media posts, and flight-tracking data such as Flightradar24. The power of citizen OSINT is demonstrated to the world.

2022 onward — Wartime OSINT and AI

Since Russia's invasion of Ukraine, TikTok / Twitter / Telegram civilian posts have made the front line trackable almost in real time. Maxar and Planet satellite imagery, AI-driven face matching, and AI geolocation have pushed both scale and accuracy a tier higher.

The OSINT cycle #

The intelligence cycle long used by militaries and intel agencies applies to OSINT unchanged. The point is that it loops — each round of findings produces new questions, which feed back into more collection.

1. Planning

Put what you want to know into words and formulate hypotheses. If the scope is too wide, you'll burn unlimited time, so the iron rule is to narrow it down to 1–3 questions.

2. Collection

Run Web / social media / DNS / satellite imagery / leak data / government data sources in parallel.

3. Processing

Transcribe videos, OCR images, translate foreign-language text, align timestamps to UTC, extract metadata. Get everything into a machine-readable form.

4. Analysis

Connect fragments, detect contradictions and biases, and pull out evidence that supports or refutes the hypothesis. Triangulation — confirmation by three or more independent sources — is the rule.

5. Dissemination

Turn conclusions into reports, timelines, and visualisations. Keeping your inferences clearly separated from the sources you cite makes the work verifiable later and earns trust.

The answers almost always raise new questions and you go back to Planning. In real engagements you loop several times per case.
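
The "align timestamps to UTC" part of the Processing step can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the three observations, their labels, and their time zones are made-up sample data, and it assumes each source's local time zone is already known.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_time: str, tz_name: str) -> datetime:
    """Interpret a naive 'YYYY-MM-DD HH:MM:SS' string in tz_name and convert it to UTC."""
    naive = datetime.strptime(local_time, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


# Hypothetical observations of one event, each recorded in a different local time.
observations = [
    ("video upload", "2024-03-01 14:05:00", "Europe/Berlin"),
    ("news report", "2024-03-01 21:10:00", "Asia/Tokyo"),
    ("eyewitness post", "2024-03-01 07:02:00", "America/New_York"),
]

# Sorting on the UTC value gives one comparable timeline across all sources.
timeline = sorted((to_utc(ts, tz), label) for label, ts, tz in observations)
for when, label in timeline:
    print(when.isoformat(), label)
```

Once everything is in UTC, an ordering that looked contradictory in local time (the "07:02" post is in fact the earliest) becomes obvious.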

Data source categories #

Category	Examples	What it reveals
Web / blogs	Official sites, news, IR materials, job postings	Org info, contact details, tech stack
Social media	Twitter/X, Facebook, Instagram, LinkedIn, TikTok, Telegram	Connections, movement, preferences, timeline
Images / video	YouTube, TikTok, Flickr, satellite imagery (Maxar, Planet, Sentinel)	Location, time, identity
Maps / geography	Google Maps Street View, OpenStreetMap, Mapillary	Street features, building layout
Public records	Corporate registry, real-estate records, court records, FOIA disclosures	Officers, shareholders, disputes
DNS / IP / certificates	WHOIS, crt.sh, Shodan, Censys	Infrastructure layout, vulnerabilities
Code leaks	GitHub, GitLab, Pastebin, public S3	Credentials, internal design
Breached data	Have I Been Pwned, Dehashed	Passwords, email addresses
Device fingerprints	EXIF, IPTC, ID3, FP	Camera, geotags

▸ "Public" is defined differently country by country

In Japan, corporate registry records can be obtained by anyone who pays the fee through the Registry Information Service; US real-estate records are generally public at the county level; the EU under GDPR strictly limits secondary use of personally identifiable information. Verifying legality under your own jurisdiction is the first priority.

Google dorking — search operators #

Special operators in Google / Bing surface public files and configuration that ordinary search wouldn't find.

Operator	Example	Purpose
`site:`	`site:example.com`	Restrict to a domain
`inurl:`	`inurl:admin`	String in URL
`intitle:`	`intitle:"index of"`	String in title
`filetype:`	`filetype:pdf "internal"`	Narrow by file extension
`intext:`	`intext:"password"`	String in body text
`cache:`	`cache:example.com`	Google's cached copy (Google retired this operator in 2024; use the Wayback Machine instead)
`-`	`-marketing`	Exclude
`""`	`"social security number"`	Exact match

Practical dork examples

# Open-directory leakage from an organisation
site:example.com intitle:"index of" -html
# Mistakenly-public .env files
filetype:env "DB_PASSWORD"
# Specific keywords on Pastebin
site:pastebin.com "internal-only" example.com
# Old, vulnerable phpMyAdmin
inurl:phpmyadmin/index.php intitle:"phpMyAdmin 2."

The Google Hacking Database (GHDB) (exploit-db.com/google-hacking-database) catalogues thousands of pre-built queries.
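
When many dorks have to be run against the same target, it can help to assemble them programmatically. The sketch below only builds query strings and search URLs; it does not send any requests. The helper names `build_dork` and `search_url` are illustrative and are not part of any library.

```python
from urllib.parse import urlencode


def build_dork(terms=(), **operators):
    """Join operator:value pairs and free-text terms into one query string."""
    parts = [f"{name}:{value}" for name, value in operators.items()]
    parts.extend(terms)
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query so it can be opened in a browser."""
    return f"{base}?{urlencode({'q': query})}"


query = build_dork(terms=['"DB_PASSWORD"', "-html"], site="example.com", filetype="env")
print(query)
print(search_url(query))
```

Opening the resulting URLs by hand is the safer route; automated querying of search engines usually conflicts with their terms of service.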

Images and geolocation #

Reverse image search #

Google Images — strong for everyday subjects, celebrities, and products
TinEye — strong at finding original posting date and the sites where an image appeared
Yandex Images — uncannily strong at face matching and location ID (in OSINT circles, a category of its own)
Bing Visual Search — product recognition
PimEyes — faces only; ethically debated

Geolocation #

Geolocation is the technique of figuring out where a photo or video was taken. It is the core of the GeoGuessr-style OSINT that Bellingcat popularised.

Road signs and street markings — language and typography (font, colour, shape)
Building architecture and roof colours (have national characteristics)
Vegetation (palms vs conifers, seasonal state)
Sun position and shadow length for inferring time and latitude (SunCalc.org)
Vehicles and license plates
Power poles, wiring, postboxes
Mountains and coastlines in the background, cross-referenced with Google Earth
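
The shadow-length clue from the list above comes down to basic trigonometry. The sketch below assumes a vertical object on level ground, and the latitude estimate additionally assumes the photo was taken at local solar noon on a date whose solar declination is known. Both function names are illustrative.

```python
import math


def sun_elevation_deg(object_height_m, shadow_length_m):
    """Sun elevation above the horizon, from an object's height and its shadow length."""
    return math.degrees(math.atan2(object_height_m, shadow_length_m))


def latitude_candidates(noon_elevation_deg, solar_declination_deg):
    """At local solar noon, elevation = 90 - |latitude - declination|.

    The absolute value leaves two candidate latitudes, one on each side
    of the point where the sun is directly overhead.
    """
    offset = 90.0 - noon_elevation_deg
    return (solar_declination_deg + offset, solar_declination_deg - offset)


# A 2 m post casting a 2 m shadow means the sun is 45 degrees above the horizon.
elevation = sun_elevation_deg(2.0, 2.0)
# On an equinox (declination roughly 0) that points to about 45 N or 45 S.
print(elevation, latitude_candidates(elevation, 0.0))
```

In practice the estimate is coarse; a tool such as SunCalc is still needed to test a candidate location against the observed shadow direction.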

▸ Why Yandex is so strong on faces and places

Yandex dominates over Google in Russia, so its index of Russian-language street images and portraits is widely assumed to be very large, which is the usual explanation for its strong results. If you're looking for a face or location match, Yandex is worth trying first. Bellingcat reportedly leaned on it for the MH17 and Syria chemical-weapons investigations.

Social media and person reconciliation #

SOCMINT is the systematic investigation of social media, which is often the primary source for both people and events.

Typical items to investigate:

Account creation date and the first post
The follow / follower network
Histogram of posting times → likely time zone of residence
Photo geotags (sometimes still in EXIF even when stripped from display)
"Likes" / comment targets → close relationships
Cross-matching usernames, phone numbers, and emails across networks for identity reconciliation
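
The posting-time histogram from the list above can be sketched as follows. This is a rough heuristic under loud assumptions: the account's quietest six-hour stretch is taken to be sleep, and sleep is assumed to start around local midnight. The sample timestamps are synthetic, and the function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_histogram(timestamps):
    """Count posts per UTC hour (timestamps must be timezone-aware)."""
    return Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)


def quietest_window(hist, width=6):
    """Start hour (UTC) of the width-hour window containing the fewest posts."""
    def load(start):
        return sum(hist.get((start + i) % 24, 0) for i in range(width))
    return min(range(24), key=load)


def estimate_utc_offset(hist, width=6, assumed_sleep_start=0):
    """Guess a UTC offset, assuming the quiet window begins at assumed_sleep_start local time."""
    start = quietest_window(hist, width)
    offset = (assumed_sleep_start - start) % 24
    return offset - 24 if offset > 12 else offset


# Synthetic account: one post in every UTC hour except 15:00-20:59.
posts = [
    datetime(2024, 1, 1, hour, 30, tzinfo=timezone.utc)
    for hour in range(24)
    if hour not in range(15, 21)
]
hist = hour_histogram(posts)
print(quietest_window(hist), estimate_utc_offset(hist))
```

An offset is only a hint: shift workers, scheduled posts, and shared accounts all break the assumption, so treat the result as one input to reconciliation rather than a conclusion.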

Bulk-scan social networks by username or email

# Sherlock — search for a username across hundreds of sites at once
$ sherlock johndoe
[+] GitHub: https://github.com/johndoe
[+] Reddit: https://reddit.com/user/johndoe
[+] Instagram: https://instagram.com/johndoe
# Maigret — Sherlock fork, even more sites
$ maigret johndoe --top-sites 500
# holehe — infer services an email is registered to
$ holehe target@example.com
# GHunt — Google account public info from a Gmail address
$ ghunt email target@gmail.com

Domain, IP, and certificate OSINT #

This is the work of mapping an organisation's infrastructure from the outside, and it is the starting point for attacker recon, penetration testing, and threat intelligence alike.

WHOIS / DNS / subdomain enumeration

# WHOIS — domain registration info
$ whois example.com
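# DNS — individual record types (ANY is often refused or minimised, see RFC 8482)
$ dig +noall +answer example.com A
$ dig +noall +answer example.com NS
$ dig +short MX example.com
$ dig +short TXT example.com          # SPF and domain-verification tokens
$ dig +short TXT _dmarc.example.com   # DMARC policy (DKIM keys sit at <selector>._domainkey)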
# Subdomain enumeration
$ subfinder -d example.com -all -silent
$ amass enum -d example.com
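
TXT records are worth reading closely, because an SPF record lists the third-party mail services and IP ranges an organisation uses. The parser below is a simplified sketch, not a full SPF implementation: it only groups `include`, `ip4`, and `ip6` mechanisms, and the sample record is invented.

```python
def parse_spf(txt_record):
    """Split an SPF record into its mechanisms, grouped by type."""
    tokens = txt_record.strip('"').split()
    if not tokens or tokens[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    result = {"include": [], "ip4": [], "ip6": [], "other": []}
    for token in tokens[1:]:
        # Drop the optional qualifier (+, -, ~, ?) before reading the mechanism name.
        name, sep, value = token.lstrip("+-~?").partition(":")
        if sep and name in ("include", "ip4", "ip6"):
            result[name].append(value)
        else:
            result["other"].append(token)
    return result


# Hypothetical record as returned by `dig +short TXT example.com`.
record = '"v=spf1 include:_spf.google.com include:sendgrid.net ip4:192.0.2.0/24 -all"'
print(parse_spf(record))
```

Each `include` here hints at a vendor relationship (mail hosting, marketing email), which is useful context for both phishing simulations and defensive reviews.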

Certificate Transparency logs and Shodan

# crt.sh — certificates recorded in public Certificate Transparency logs (hidden subdomains pop out)
$ curl -s "https://crt.sh/?q=%25.example.com&output=json" \
    | jq -r '.[].name_value' | sort -u
# Shodan — ports, banners, and vulnerabilities of publicly-exposed hosts
$ shodan host 93.184.216.34
$ shodan search "Server: Apache" port:80 country:JP
# Censys — competitor with very strong TLS-certificate indexing
$ censys search 'services.tls.certificates.leaf_data.subject.common_name: "example.com"'
# Wayback Machine — deleted historical pages
$ curl -s "https://web.archive.org/cdx/search/cdx?url=example.com/*&output=json"
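
The crt.sh JSON needs some cleaning before it is usable as a subdomain list: a single `name_value` field can hold several newline-separated names, wildcard entries appear as `*.`, and unrelated look-alike names can slip in. A minimal sketch of that clean-up, run here on hand-written sample data rather than a live response:

```python
import json


def subdomains_from_crtsh(raw_json, domain):
    """Return the sorted, de-duplicated hostnames under `domain` from crt.sh JSON output."""
    names = set()
    for entry in json.loads(raw_json):
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # treat a wildcard as its parent name
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


# Hand-written sample in the same shape as the crt.sh response.
sample = json.dumps([
    {"name_value": "www.example.com\nexample.com"},
    {"name_value": "*.dev.example.com"},
    {"name_value": "VPN-Staging.example.com"},
    {"name_value": "example.com.evil.test"},
])
print(subdomains_from_crtsh(sample, "example.com"))
```

The suffix check matters: without it, a name such as `example.com.evil.test` would be counted as one of the target's hosts.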

Metadata and EXIF analysis #

Photos, PDFs, Office documents, and video all carry a great deal of metadata.

EXIF / Office metadata extraction

# Image EXIF — camera model, timestamp, GPS coordinates, serial number
$ exiftool photo.jpg
# PDF / Office — author, last editor, software, revision history
$ exiftool report.pdf
# Video
$ ffprobe -v error -show_format -show_streams video.mp4
# Bulk-collect metadata across a whole site
$ metagoofil -d example.com -t pdf,doc,xls -l 100 -o results

Metadata is very commonly forgotten. In a number of press and investigation cases, the author field of an internal PDF has identified someone inside the organisation.
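
EXIF stores GPS coordinates as degrees, minutes, and seconds plus a hemisphere reference, so they need converting before they can be pasted into a map. A small sketch of that arithmetic follows; the coordinates are arbitrary sample values, not taken from a real photo.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus an N/S/E/W reference to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    return -value if ref in ("S", "W") else value


# Sample values in the shape EXIF uses (GPSLatitude / GPSLatitudeRef and so on).
latitude = dms_to_decimal(35, 39, 31.0, "N")
longitude = dms_to_decimal(139, 44, 43.5, "E")
print(f"{latitude:.6f}, {longitude:.6f}")
```

The printed pair can be dropped straight into a map search box; forgetting the sign for S and W references is the usual way this goes wrong.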

Major tools and search engines #

All-in-one frameworks #

Tool	Use case	Licence
Maltego	Graph-visualisation OSINT IDE; Transforms unify various sources as nodes	Commercial (Community edition available)
SpiderFoot	Automated OSINT collection across 200+ modules; HX edition is cloud-hosted	Open / commercial (HX)
Recon-ng	Metasploit-like interactive CLI, modular	Open

Target-specific tools #

Tool	Input	Output
theHarvester	Domain	Emails / subdomains / employee names
Sherlock / Maigret	Username	Presence check across hundreds of social networks
holehe	Email address	List of registered services
GHunt	Gmail address	Public Google-account information
OSINT Framework (osintframework.com)	—	Categorised link directory of tools by purpose

Specialised search engines #

Service	Searches
Shodan	Internet-exposed hosts / services / banners
Censys	TLS certificate index, hosts, subdomains
ZoomEye	Chinese-built equivalent of Shodan
Wayback Machine	Historical Web (1996–present)
GreyNoise	Classification of "Internet background noise" (scanner IPs)
Have I Been Pwned	Check whether an email / password appears in breaches

Worked example workflows #

(1) From a domain to a full picture of an organisation's infrastructure #

Starting point: "We want to pentest Company X's site (with authorisation)."

1. WHOIS for registration info

whois example.com for contact info and registration date.

2. Subdomain enumeration

Combine subfinder / amass with crt.sh to find hidden assets like vpn-staging.example.com.

3. Shodan for exposure

Check port:443/22/3389 against each IP — discover exposed RDP, old OpenSSH.

4. Build the employee list

Use theHarvester to collect email addresses and employee names from search engines and other public sources (the available sources vary by version) → an employee roster.

5. Breach history

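Check employee emails against Have I Been Pwned to estimate the risk of password reuse. Its domain search requires proving control of the domain, and per-address API lookups need an API key, which fits an authorised engagement.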

6. GitHub secrets

Search org:example-corp for committed tokens, internal hostnames, and leaked API keys.

Aggregating each step into a Maltego or SpiderFoot graph gives you a single picture of "the relationships between infrastructure and people, anchored at the domain".
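
Step 6 (GitHub secrets) usually means running pattern matching over cloned repositories. The sketch below shows the idea with three illustrative patterns; dedicated secret scanners ship far larger rule sets, and the sample text is fabricated (the AWS key is the placeholder value used in AWS documentation).

```python
import re

# Illustrative patterns only; dedicated scanners cover many more formats.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_text(text):
    """Return (line_number, pattern_label) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


sample = "\n".join([
    "DB_HOST=db.internal.example.com",
    "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE",
    "-----BEGIN RSA PRIVATE KEY-----",
])
print(scan_text(sample))
```

Reporting only the line number and pattern label, not the matched value, keeps the findings list itself from becoming a second copy of the leaked secret.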

(2) Pinpointing time and place from a single photograph #

This workflow verifies photos posted to social media as evidence of an incident or illegal act. Even without EXIF, you can narrow the location down by combining background text, building style, vegetation, power poles, and the sun's position. For the MH17 case, Bellingcat used this kind of technique to reconstruct the route of the Russian Buk launcher through eastern Ukraine on the day of the shootdown.

Ethics, law, and Counter-OSINT #

OSINT may be "public information only", but the line between what you may and may not do is blurry, drawn by both national law and professional ethics.

▸ Areas that are illegal or borderline in many countries

Unauthorised access — browsing public pages is fine, but guessing IDs / passwords to log in falls under the Act on Prohibition of Unauthorized Computer Access (Japan) / CFAA (US)
Secondary use or sale of personal information — GDPR / Japan's amended APPI / CCPA restrict this, and GDPR in particular requires a lawful basis for the processing
Stalking and harassment — depending on how collected information is used, it can become a crime
CSAM-related content — possession itself is criminal in most jurisdictions, with no investigative-purpose exception for private researchers
Redistributing leaked data — checking your own email is fine; redistributing someone else's leak data is not
Face-recognition OSINT — several EU data-protection authorities (including those of France, Italy, and Greece) have fined Clearview AI for GDPR violations; PimEyes is under continuing debate

Ethics code #

Guidelines broadly endorsed by investigative journalism and the security industry:

Legitimate purpose — does it fall under public interest / contractual work / self-defence?
Proportionality — is the depth of the investigation excessive relative to the goal?
Minimisation — collect only what's necessary; discard unrelated third-party data
Triangulation — never conclude from a single source; require at least three independent ones
Eliminate false identification — stay constantly alert to the risk of misidentification (people with the same name, mistaken inferences)
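
The triangulation rule can be made mechanical when evidence is logged in a structured way. The sketch below assumes each piece of evidence is recorded as a (claim, source, origin) tuple, where reposts and quotes share the origin of the original. That data model and the sample entries are illustrative assumptions.

```python
from collections import defaultdict


def triangulated_claims(evidence, minimum=3):
    """Map each claim to True if it is backed by at least `minimum` distinct origins."""
    origins = defaultdict(set)
    for claim, _source, origin in evidence:
        origins[claim].add(origin)  # sources sharing an origin count once
    return {claim: len(found) >= minimum for claim, found in origins.items()}


# Fabricated evidence log for illustration.
evidence = [
    ("convoy seen in town A", "video 1", "uploader X"),
    ("convoy seen in town A", "repost of video 1", "uploader X"),
    ("convoy seen in town A", "satellite image", "imagery provider"),
    ("convoy seen in town A", "local news photo", "news outlet"),
    ("bridge destroyed", "post", "account Y"),
    ("bridge destroyed", "quote of the post", "account Y"),
]
print(triangulated_claims(evidence))
```

Counting origins rather than sources is the point of the design: two accounts repeating the same upload are one source, however many times it is shared.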

Counter-OSINT — at the personal level #

If attackers can assemble a profile with OSINT, defenders should shrink what is available for them to assemble.

Tighten social-media privacy settings to at least "friends only"; be cautious about real name, employer, school
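Strip EXIF before posting (do not count on the platform doing it: behaviour varies by service and upload mode, and images sent as uncompressed files, for example via Telegram's file option, generally keep their metadata)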
Disable geotagging in OS settings
Use different usernames per social network — using the same handle everywhere makes Sherlock-style reconciliation a one-shot
Compartmentalise email and phone — primary / secondary / burner is a reasonable split
Enable Have I Been Pwned notifications

Counter-OSINT — at the corporate level #

WHOIS privacy to mask registration details
Certificate-issuance policy — keep staging. and dev. out of crt.sh (use an internal CA)
Always-on GitHub / GitLab secrets scanning
Shodan self-monitoring — periodically scan your own IPs for unexpected services
OSINT red-team exercises — OSINT your own organisation from the outside
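
The self-monitoring item above amounts to diffing what the outside world sees against what you intend to expose. A minimal sketch follows, assuming the observed ports have already been collected (from Shodan, Censys, or your own scanner) into a dictionary. The IPs are documentation addresses and the allowlist is made up.

```python
def unexpected_services(observed, allowed):
    """Return {ip: [ports]} for ports seen from outside that are not on the allowlist."""
    findings = {}
    for ip, ports in observed.items():
        extra = sorted(set(ports) - set(allowed.get(ip, ())))
        if extra:
            findings[ip] = extra
    return findings


# Hypothetical inventory: what should be exposed vs. what an external scan reported.
allowed = {"192.0.2.10": [443], "192.0.2.20": [25, 443]}
observed = {
    "192.0.2.10": [443, 3389],   # RDP should not be reachable
    "192.0.2.20": [25, 443],
    "192.0.2.30": [22],          # host missing from the inventory entirely
}
print(unexpected_services(observed, allowed))
```

A host that is absent from the allowlist is reported in full, which also surfaces forgotten or shadow-IT machines.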

Summary #

OSINT is not a "magic tool" — it is the combination of how you frame questions + how efficiently you collect public information + how rigorously you triangulate + how you make ethical calls. As Bellingcat and wartime OSINT have shown, even citizens can get close to the truth of historical events, but because it directly touches the privacy, safety, and reputation of the subject, you have to constantly ask "can I" vs "should I".

For security practitioners, OSINT is the phase attackers will run first — which is precisely why defenders are obligated to understand how their own organisation looks from the outside. Applying the techniques in this article to yourself is the best introduction to OSINT, and the best defence.