
Ghidra — How NSA's Open-Source Reverse Engineering Suite Works

⏱ approx. 23 min LOG_DATE:2026-05-11

Ghidra is the reverse-engineering suite the NSA used internally before releasing it as OSS under Apache License 2.0 at RSA Conference 2019. By putting a disassembler and a "free decompiler" within reach of the general public, it opened a window into the RE market that IDA Pro had effectively owned, and dramatically widened the audience for malware analysis, CTF, and firmware research. This article covers the SLEIGH and P-Code machinery behind its multi-architecture support, the typical workflow, comparisons with other tools, automation, and the limits.

01

The problem Ghidra is solving: what is reverse engineering

Reverse Engineering (RE) is the umbrella term for working backwards from a compiled executable, a firmware image, or an object file, to recover the original design intent and behaviour. You use it when source isn't available and you still need to answer: "what does this malware do?", "how does this proprietary protocol work?", "what did this patch actually fix?"

Machine code (say x86-64's 48 83 EC 28 ...) isn't readable by humans. An RE tool re-translates it into something readable in several layers:

1. Raw binary
A meaningless-looking stream of machine bytes.
2. Disassembly (assembly)
Broken down into instructions like sub rsp, 0x28; mov rax, [rbp-0x10]; ....
3. Decompilation (pseudo-C)
High-level structure recovered as something like int main() { int x = ...; if (x > 0) {...} }.
4. Annotated graph
Function calls / control flow / cross-references made visually navigable.

Ghidra provides all of (1) → (4) in a single GUI, and has project management that retains the work of incrementally attaching meaning (renaming functions, annotating types, adding comments). The process of "keeping the analyst's manual insight as part of the project" is what makes long-term and shared analysis possible.

02

History: from internal NSA tool to OSS in 2019

Ghidra's origins go back to internal NSA development around 1999. To analyse foreign and domestic encrypted-communications software and embedded systems for SIGINT work, it grew into a cross-platform RE suite written in Java. It was a classified internal tool, but its existence became public in March 2017 through WikiLeaks' Vault 7 release of CIA documents, two years before any official announcement.

The turning point was March 2019. Rob Joyce, then senior advisor for cybersecurity to the NSA director, announced its public release at RSA Conference. Binaries were available immediately, and the source went up on GitHub under Apache License 2.0 the following month.

Several reasons have been offered:

  • Contribution to academic and research communities (consistent with the NSA's "Cybersecurity for the Nation" line)
  • National investment in skills development — RE talent is undersupplied, opening the tool widens the funnel
  • The tool's existence had already leaked via the Vault 7 material, so full disclosure was the more transparent move
  • Talent competition against IDA Pro and friends — "I know Ghidra" becomes a viable resume line for entry-level candidates
▸ Settling the "NSA-made therefore dangerous" question

Some viewed the 9.0 release with suspicion of backdoors, but full source under Apache 2.0 plus a huge OSS community auditing it means it is treated as safe in practice. As of 2026 the latest is the 11.x series, with releases continuing several times a year.

03

Architecture: SLEIGH and P-Code make multi-arch work

Ghidra's most uniquely engineered part is the combination of a custom processor-specification language called SLEIGH and an intermediate representation (IR) called P-Code. The reason the same analysis engine works on x86, ARM, MIPS, PowerPC, and RISC-V is these two.

① Disassembly — machine code → assembly
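A SLEIGH specification (x86.sla / ARM.sla / ..., compiled from .slaspec sources) interprets byte patterns. 48 83 EC 28 → sub rsp, 0x28 (x86-64); E5 2D E0 04 → str lr, [sp,#-4]! (ARM, instruction word E52DE004 written most-significant byte first).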
② Lifting to P-Code
Architecture-specific details are removed in the conversion to IR. sub rsp, 0x28 becomes a short sequence of P-Code ops: an INT_SUB that writes RSP = RSP - 0x28, plus the ops that compute the affected flags. Both x86 and ARM end up normalised into the same operator set.
③ Data-flow / control-flow analysis
Function boundaries / Basic Blocks / use-def chains / SSA form / register liveness / constant propagation. At the P-Code level it estimates "what's on the stack", "what is the loop condition", etc.
④ Decompiler — reconstruct pseudo-C
Local variables / if / for / while / switch / function calls are structured back into a C-like display comparable to Hex-Rays output. The decompiler used to be the biggest practical gap between paid and free tools, and making it free is Ghidra's biggest single contribution.

Why SLEIGH matters:

  • Adding a new ISA is "add a spec file", not "rewrite the program"
  • Unknown / proprietary processors (embedded gear, old ASICs, some IoT) can be analysed in Ghidra by writing a SLEIGH spec for them
  • The community has contributed SLEIGH definitions for 6502 / 8086 / SH4 / various retro machines
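
The "add a spec file, not rewrite the program" idea can be illustrated with a toy table-driven decoder. This is a minimal sketch, not SLEIGH syntax and not how Ghidra is implemented: the SPEC table and the decode function are hypothetical names, and only the two encodings quoted above are covered.

```python
# Toy illustration only: real SLEIGH specs are far richer than this table.
# Each entry: (architecture, fixed byte prefix, total length, formatter).

def _x86_sub_rsp_imm8(raw):
    # 48 83 EC ib -> sub rsp, imm8 (REX.W + 83 /5 ib with ModRM = EC).
    # Real x86 sign-extends imm8; values below 0x80 are unaffected.
    return "sub rsp, 0x%x" % raw[3]

def _arm_str_lr_predec(raw):
    # Instruction word E52DE004, given most-significant byte first.
    return "str lr, [sp,#-4]!"

SPEC = [
    ("x86-64", bytes([0x48, 0x83, 0xEC]), 4, _x86_sub_rsp_imm8),
    ("ARM", bytes([0xE5, 0x2D, 0xE0, 0x04]), 4, _arm_str_lr_predec),
]

def decode(arch, raw):
    """Return (instruction text, length) for the first matching pattern."""
    for spec_arch, prefix, length, fmt in SPEC:
        if spec_arch == arch and len(raw) >= length and raw.startswith(prefix):
            return fmt(raw[:length]), length
    raise ValueError("no pattern matches")
```

Supporting another instruction or architecture in this sketch means appending a row to SPEC; the decode loop itself does not change, which is the property the list above attributes to SLEIGH.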

Why P-Code matters:

  • An analysis plug-in written on top of x86 also works on ARM — architecture dependence is erased at the upper layer
  • Data-flow analysis / symbolic execution / abstract interpretation need to be written once to work on every arch
  • Symbolic-execution frameworks like angr and Triton can integrate via P-Code
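
The "write once, works on every arch" claim can be sketched with a toy IR. The tuples below are only loosely modelled on P-Code (flag-computing ops are omitted), and X86_PROLOGUE, ARM_PROLOGUE, and stack_delta are hypothetical names used for illustration, not Ghidra's data model.

```python
# Toy IR: (opcode, output, input_a, input_b).
# Register names are strings, constants are ints.

X86_PROLOGUE = [                      # sub rsp, 0x28
    ("INT_SUB", "RSP", "RSP", 0x28),
]
ARM_PROLOGUE = [                      # str lr, [sp,#-4]!
    ("INT_SUB", "sp", "sp", 4),
    ("STORE", None, "sp", "lr"),
]

def stack_delta(ops, sp):
    """Sum the constant adjustments made to the stack pointer register sp."""
    delta = 0
    for opcode, out, a, b in ops:
        if out == sp and a == sp and isinstance(b, int):
            if opcode == "INT_SUB":
                delta -= b
            elif opcode == "INT_ADD":
                delta += b
    return delta
```

The same stack_delta function handles both prologues; the only architecture-specific input is the name of the stack pointer register.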
04

Major features: what analysts use day to day

Ghidra is hard to summarise in one phrase because it is an integrated environment, but the features analysts reach for most often are these.

Feature | What it does
Code Browser | The central UI for the whole binary. Disassembly / Decompiler / symbols / references on one screen
Decompiler | Reconstruct pseudo-C from the assembly (Ghidra's biggest draw)
Function Graph | Visualise a function's control flow as a directed graph of Basic Blocks
String Search | Extract string constants from the binary → find malware URLs, process names, API names
Symbol Tree | Organise functions / globals / namespaces into a tree
Cross References (xrefs) | Trace callers / users of a function or variable in both directions
Data Type Manager | Define structs / unions / enums and apply types to memory regions
Function ID / FidDb | Auto-name known library functions (libc / OpenSSL / MSVC runtime ...) via signature matching
Bookmark | Mark a location as "important" and return to it later
Version Tracking | Align functions between two binaries (pre/post patch, two variants)
Headless Analyzer | Create projects / analyse / run scripts from the CLI, no GUI required (CI integration)
Script Manager | Extend with Python (Jython) / Java
Collaborative Server | Set up a Ghidra Server so multiple analysts share an analysis in real time

The consistent design philosophy is "support the process of an analyst incrementally attaching meaning to a featureless binary." Even an analysis that doesn't finish in a single session has all its state saved to the project file (.gpr), so you continue where you left off the next day.
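
As a rough idea of what a string search does under the hood, here is a minimal standalone sketch that scans a byte blob for runs of printable ASCII. The function name ascii_strings and the default minimum length of 4 are assumptions made for this example; Ghidra's own string analysis also handles other encodings and is not reproduced here.

```python
import re

def ascii_strings(blob, min_len=4):
    """Yield (offset, text) for runs of printable ASCII at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(blob):
        yield match.start(), match.group().decode("ascii")
```

Feeding it the raw bytes of a sample and filtering the results for URLs or API names gives a quick first pass before opening the binary in the Code Browser.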

05

Typical workflow: from Import to Decompile

What a real analysis session looks like:

1. Create a project
File → New Project, pick Non-Shared or Shared.
2. Import the binary
PE / ELF / Mach-O are auto-detected, and for those the "Format", "Language", and "Compiler" fields can usually be left at the detected values. For a Raw binary you pick the Language (processor) yourself.
3. Auto-Analyze
All defaults enabled; seconds to minutes. Function ID / Stack / Decompiler Parameter ID / DWARF run.
4. Start at main / entry point
Symbol Tree → main to jump to the starting point. Read Disassembly and Decompiler side-by-side in Code Browser.
5. Rename, type, comment
L renames variables / functions, Ctrl-L attaches a type, ; adds a comment. Attach meaning incrementally.
6. Unfold via xrefs
Ctrl-Shift-F follows callers / references. Anchor your search on strings or syscalls (printf / WriteFile / connect).
7. Automate via scripts → save
Codify recurring work in Python / Java via the Script Manager. Save and share in a Shared Project + Ghidra Server.
▸ Decompiler quality is "80 % right, 20 % wrong"

Typical failure modes: lost track of the stack pointer / mis-identified register liveness / wrong function boundaries. When something looks suspicious, go back to the assembly and verify. Right-click → "Override Function Signature" or "Edit Function" to correct types — the Decompiler output often improves dramatically right after.

Essential shortcuts that change your productivity
    # Key shortcuts in the Code Browser
    L             Rename a symbol (function / variable)
    ;             Add a comment
    Ctrl-Shift-E  Edit the function signature
    Ctrl-Shift-F  List Cross References
    G             Jump to a given address
    N             Graph the next function
    Ctrl-L        Apply a data type
06

Comparison with other RE tools

Ghidra is not the only RE tool. The major modern ones, used by preference and purpose:

Tool | Price | Decompiler | Strengths | Weaknesses | Fit
Ghidra (NSA / OSS) | Free | ○ Built-in, multi-arch (SLEIGH) | Free IDA-equivalent / rare ISAs / Headless automation / Shared Project | Java startup cost / quirky UI / fewer plugins | Beginners / individuals / bulk samples / CI
IDA Pro (Hex-Rays, 1991-) | $$$ (thousands+) | ◎ Hex-Rays, top-tier quality (sold separately) | Best-in-class decompiler / industry de facto / rich plugin ecosystem | Expensive / rare ISAs sold separately | Commercial / large SOCs / projects needing top quality
Binary Ninja (Vector 35, 2016-) | $ ($299+) | ○ HLIL, multi-tier IRs | Refined UI / UX / clean API design / fast to launch and operate | Free version limited / moderate ISA coverage | Pro individuals / API-heavy users
radare2 / Cutter (OSS, 2006-) | Free | △ pdc / boosted by r2ghidra | CLI-complete / Unix philosophy / many ISAs / pipes | Steep learning curve / weak decompiler | CLI users / automation / CTF
▸ How to choose

"Start with Ghidra, buy IDA if you need to" is the typical modern learning path. The biggest impact Ghidra had on the industry is removing the "I have to buy IDA first" hurdle. objdump / nm / readelf / strings remain useful as supporting tools, but they are not full analysis environments.

07

How it's used: practical contexts

Six common practical contexts for Ghidra.

(1) Malware analysis Sample received → Import → Auto-analyze → extract URLs / IPs / API names with Strings → Decompile suspicious functions to read the behaviour. WannaCry (2017) predates Ghidra's public release, but it has since become a common subject of public Ghidra walkthroughs. Sunburst (SolarWinds, 2020) is a .NET assembly, so it falls under the bytecode caveat in the limits section rather than being a typical Ghidra target. The typical setup is initial triage in Ghidra, then hand off to dynamic analysis (Cuckoo Sandbox / x64dbg).

(2) Vulnerability research / patch diffing After a CVE drops, diff Microsoft / Adobe patches Before / After to pin down what was changed. Ghidra's Version Tracking helps with function-level alignment and difference display. BinDiff / Diaphora are also commonly used.

(3) CTF Ghidra is the de facto standard for CTF Reverse challenges. Extracting flags from stripped ELF, unpacking the structure of Rust / Go binaries, decompiling custom VMs — Ghidra's flexibility shines.

(4) Firmware / IoT For router / IP camera / embedded firmware: expand with binwalk → extract ELF / raw bin → analyse in Ghidra. Coverage for MIPS / ARM / RISC-V plus the ability to extend to rare embedded CPUs via SLEIGH both pay off here.

(5) Protocol analysis For proprietary network protocols (games / SCADA / old vendor-specific protocols), reverse the packet format from the implementing binary. The standard move is to trace from the receive path via cross-references.

(6) License-check defeat (legally grey) The classic use case for reversing shareware "serial number checks."
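
The function-level alignment described under (2) can be sketched in a few lines. This is a deliberately naive illustration: it assumes both binaries have already been split into named functions (the {name: bytes} maps and the diff_functions helper are hypothetical), and it matches by exact hash, whereas real tools such as Version Tracking, BinDiff, and Diaphora use fuzzier correlators because addresses and register allocation shift between builds.

```python
import hashlib

def diff_functions(old, new):
    """Compare two {name: bytes} maps and classify each function name."""
    def digest(body):
        return hashlib.sha256(body).hexdigest()

    result = {"unchanged": [], "changed": [], "added": [], "removed": []}
    for name in sorted(set(old) | set(new)):
        if name not in new:
            result["removed"].append(name)
        elif name not in old:
            result["added"].append(name)
        elif digest(old[name]) == digest(new[name]):
            result["unchanged"].append(name)
        else:
            result["changed"].append(name)
    return result
```

In a patch-diffing session the "changed" bucket is the shortlist worth opening in the Decompiler first.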

▸ Before crossing legal lines — distributing cracks is copyright infringement

Legitimate research is fine, but distributing a crack of commercial software is copyright infringement. Keep it to research on software you own. Even for malware analysis, get samples from legitimate sources such as VirusTotal / Malshare / MalwareBazaar.

08

Headless Analyzer and scripts: about automation

Ghidra can be driven from the CLI without the GUI (analyzeHeadless). Used for bulk-sample auto-analysis / CI integration / batch processing.

analyzeHeadless for CI / batch analysis
    # Create project + Import + Auto-analyze
    $ analyzeHeadless /path/to/project ProjectName \
        -import sample.exe \
        -postScript MyAnalysisScript.py

    # Run only a script against an existing project (-noanalysis skips re-analysis)
    $ analyzeHeadless /path/to/project ProjectName \
        -process sample.exe \
        -noanalysis \
        -scriptPath ./scripts \
        -postScript ExtractStrings.py
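
For bulk samples, the import command above can be wrapped in a small driver. The following is a sketch under stated assumptions: the helper names headless_import_cmd and run_batch are made up for this example, and the launcher is assumed to be reachable as analyzeHeadless on PATH (in a stock install it lives under the support directory). Only the -import and -postScript flags shown above are used.

```python
import subprocess

def headless_import_cmd(project_dir, project_name, sample, post_script,
                        launcher="analyzeHeadless"):
    """Build the argument list for importing and analysing one sample."""
    return [launcher, project_dir, project_name,
            "-import", sample,
            "-postScript", post_script]

def run_batch(project_dir, project_name, samples, post_script):
    """Run the import command once per sample; stop on the first failure."""
    for sample in samples:
        cmd = headless_import_cmd(project_dir, project_name, sample, post_script)
        # check=True raises CalledProcessError if analyzeHeadless exits non-zero
        subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the command easy to log and to unit-test without a Ghidra installation.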
Python (Jython) script — list every function and its callers
    # Write in Script Manager → New Python (Jython, Python 2.7 syntax)
    fm = currentProgram.getFunctionManager()
    for func in fm.getFunctions(True):
        print("Function: %s" % func.getName())
        # getReferencesTo() is the flat-API helper available to every script
        for ref in getReferencesTo(func.getEntryPoint()):
            print("  called from: %s" % ref.getFromAddress())
Pull URL-like / IP-like things out of the strings
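    import re

    # Matches http(s) URLs or dotted-quad-looking runs; expect false positives
    IOC_PATTERN = re.compile(r"https?://|\b\d+\.\d+\.\d+\.\d+\b")

    listing = currentProgram.getListing()
    for data in listing.getDefinedData(True):
        if data.hasStringValue():
            s = data.getValue()
            if IOC_PATTERN.search(str(s)):
                print("%s %s" % (data.getAddress(), s))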
▸ Community script collections

Ghidra-Scripts (several repos on GitHub): Find Crypt / String Decryption / Anti-VM detection. Ghidra-CTF: generic scripts for CTF. Ghidra Bridge: bridges Ghidra's Jython to external CPython (with the entire PyPI library ecosystem) → lets you integrate angr / capa / yara.

09

Limits and tricks: "the Decompiler is not perfect"

The limits of Ghidra (and of every RE tool):

  • Decompiler output is approximate — stack mis-identification / register type inference failures / kernel code / aggressive optimisation (LTO/PGO) produce errors. Going back to the assembly to verify is the rule
  • Packing / obfuscation: UPX-class packers can usually be undone statically before import (for example with upx -d), but commercial packers (VMProtect, Themida) and hand-rolled packers must be dynamically unpacked (i.e., process-dumped) before being passed in
  • JIT / JVM / .NET / Python — these are intermediate bytecode, not machine code — dnSpy (C#) / jadx (Java) / Decompyle3 (Python) and similar specialist tools fit better
  • Stripped binaries — without symbols / debug info, all function and variable names have to be reapplied by you
  • Huge binaries — at 100 MB scale, Auto-analyze takes tens of minutes to hours, and the GUI gets slow. Process in chunks with Headless
  • JVM startup cost — just ghidraRun takes 10–20 seconds to come up
  • Idiosyncratic UI — newcomers from IDA / Binary Ninja take a while to adjust
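
One cheap way to spot the packing case from the list above before spending time in the Decompiler is a byte-entropy check. This is a minimal sketch, not a Ghidra feature being reproduced: shannon_entropy is a hypothetical helper, and high entropy is only a hint, since compressed resources and embedded crypto tables also score high.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Computing this per section rather than for the whole file is usually more telling: a code section that scores close to 8 bits per byte is a candidate for unpacking first.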
▸ Field-tested tips
  • Start small — for unknown functions, "trace back from callers (xrefs)" is the most efficient. Top-down from main tends to dead-end
  • Use the signature DB — Function ID auto-naming for library functions instantly removes cognitive load
  • Apply structures — once you suspect "this region looks like an XYZ struct" and attach a type, the Decompiler output gets dramatically better
  • Don't skimp on comments and renames — assume future-you, six months later, is the reader
  • Scripts for routines — anything you do twice should become a script
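
The "apply structures" tip is easier to appreciate outside the GUI. The sketch below overlays a made-up 12-byte header layout on raw bytes; the Header fields, the HEADER_FORMAT string, and parse_header are all hypothetical and chosen only to show how anonymous offsets turn into named fields once a type is attached.

```python
import struct
from collections import namedtuple

# Hypothetical little-endian layout: uint32 magic, uint16 version,
# uint16 flags, uint32 length (12 bytes in total).
Header = namedtuple("Header", "magic version flags length")
HEADER_FORMAT = "<IHHI"

def parse_header(raw):
    """Interpret the leading bytes of raw as a Header."""
    size = struct.calcsize(HEADER_FORMAT)
    return Header(*struct.unpack(HEADER_FORMAT, raw[:size]))
```

Reading header.length instead of "the dword at offset 8" is the same improvement the Decompiler output gets when a struct is applied in the Data Type Manager.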
10

Related tools and the wider ecosystem

Modern analysis is Ghidra plus its surrounding tools, not Ghidra alone.

Role | Tools
Dynamic analysis | x64dbg / OllyDbg / GDB / WinDbg / Frida
Sandbox | Cuckoo Sandbox / Joe Sandbox / Hybrid Analysis / Any.Run
Firmware extraction | binwalk / firmware-mod-kit / unblob
Diff / Version Tracking | BinDiff (Google) / Diaphora
Symbolic execution | angr / Triton / KLEE
Signatures / YARA | yara / yarGen / capa
Unpackers | unipacker / scyllahide / Volatility (memory dumps)
Malware sample platforms | VirusTotal / Malshare / MalwareBazaar
Analysis notebooks | Obsidian / Notion / Markdown + Git

The advanced-user style is to call the Ghidra API from external CPython through Ghidra Bridge and combine it with Python libraries such as angr (symbolic execution) and capa (capability detection).

▸ Summary

Ghidra combines declarative processor specification via SLEIGH + the P-Code IR + a shared analysis engine to analyse x86 / ARM / MIPS / PowerPC / RISC-V — even unusual embedded CPUs — in a single tool. Its biggest historical contribution is "a free and open-source decompiler in the hands of the general public". The effect of moving the start line of RE learning from "price" to "interest and time" cannot be overstated. If you're starting to learn RE, start with Ghidra, and bring in IDA Pro / Binary Ninja / radare2 as you need them — that is the modern standard route.
