Ghidra — How NSA's Open-Source Reverse Engineering Suite Works

Ghidra is the reverse-engineering suite the NSA used internally before releasing it as open source under the Apache License 2.0 at RSA Conference 2019. By putting a disassembler and a free decompiler within reach of the general public, it opened up an RE market that IDA Pro had effectively owned and dramatically widened the audience for malware analysis, CTF, and firmware research. This article covers the SLEIGH and P-Code machinery behind its multi-architecture support, the typical workflow, comparisons with other tools, automation, and the tool's limits.

The problem Ghidra is solving: what is reverse engineering

Reverse Engineering (RE) is the umbrella term for working backwards from a compiled executable, a firmware image, or an object file, to recover the original design intent and behaviour. You use it when source isn't available and you still need to answer: "what does this malware do?", "how does this proprietary protocol work?", "what did this patch actually fix?"

Machine code (say x86-64's 48 83 EC 28 ...) isn't readable by humans. An RE tool re-translates it into something readable in several layers:

1. Raw binary

A meaningless-looking stream of machine bytes.

2. Disassembly (assembly)

Broken down into instructions like sub rsp, 0x28; mov rax, [rbp-0x10]; ....

3. Decompilation (pseudo-C)

High-level structure recovered as something like int main() { int x = ...; if (x > 0) {...} }.

4. Annotated graph

Function calls / control flow / cross-references made visually navigable.

Ghidra provides all of (1) → (4) in a single GUI, and its project management retains the work of incrementally attaching meaning (renaming functions, annotating types, adding comments). Keeping the analyst's manual insight as part of the project is what makes long-term and shared analysis possible.

History: from internal NSA tool to OSS in 2019

Ghidra's origins go back to internal NSA development around 1999. Built to analyse foreign and domestic encrypted-communications software and embedded systems for SIGINT work, it grew into a cross-platform RE suite written largely in Java (the decompiler core is C++). It was an internal tool, but its existence became public in March 2017, when the name "Ghidra" appeared in the CIA documents that WikiLeaks published as Vault 7.

The turning point was March 2019. Rob Joyce, then senior cybersecurity adviser at the NSA, announced its public release at RSA Conference. Binaries were published under the Apache License 2.0, with the full source following on GitHub shortly afterwards.

Several reasons have been offered:

Contribution to academic and research communities (consistent with the NSA's "Cybersecurity for the Nation" line)
National investment in skills development: RE talent is undersupplied, and opening the tool widens the funnel
The name and capabilities had already leaked via Vault 7, so full disclosure was the more transparent move
Talent competition against IDA Pro and friends: "I know Ghidra" becomes a viable resume line for entry-level candidates

▸ Settling the "NSA-made therefore dangerous" question

Some greeted the 9.0 release with suspicion of backdoors, but the full source is available under Apache 2.0 and a large OSS community audits it, so in practice it is treated as safe. As of 2026 the latest is the 11.x series, with releases continuing several times a year.

Architecture: SLEIGH and P-Code make multi-arch work

Ghidra's most distinctive piece of engineering is the combination of a custom processor-specification language called SLEIGH and an intermediate representation (IR) called P-Code. These two are the reason the same analysis engine works on x86, ARM, MIPS, PowerPC, and RISC-V.

① Disassembly — machine code → assembly

A SLEIGH specification (.slaspec source compiled to a .sla file, one per processor variant) interprets byte patterns. 48 83 EC 28 → sub rsp, 0x28 (x86-64); the 32-bit word 0xE52DE004 → str lr, [sp,#-4]! (ARM).

② Lifting to P-Code

Architecture-specific details are removed in the conversion to IR. sub rsp, 0x28 becomes P-Code along the lines of RSP = INT_SUB RSP, 0x28, plus further ops that update the flags. Both x86 and ARM end up normalised into the same small operator set.

③ Data-flow / control-flow analysis

Ghidra recovers function boundaries, Basic Blocks, use-def chains, SSA form, and register liveness, and propagates constants. Working at the P-Code level, it infers things such as what is on the stack and what the loop condition is.

④ Decompiler — reconstruct pseudo-C

Local variables / if / for / while / switch / function calls are structured back into a C-like display in the style of Hex-Rays. The decompiler was the biggest practical gap between paid and free tools, and making it free is Ghidra's biggest single contribution.
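Stage ③ can be illustrated with a small constant-propagation pass. This is a minimal sketch, assuming a simplified three-address form where each op is (output, opcode, inputs). The opcode names follow P-Code, but the tuple layout and the function propagate_constants are invented here and are not Ghidra's internal data structures. It handles straight-line code only.

```python
# Each op is (output varnode, opcode, inputs); inputs are names or int constants.
OPS = [
    ("t0", "COPY", (8,)),
    ("t1", "INT_ADD", ("t0", 4)),
    ("t2", "INT_SUB", ("rsp", "t1")),
]

# How to evaluate an opcode once all of its inputs are known constants.
FOLD = {
    "COPY": lambda a: a,
    "INT_ADD": lambda a, b: a + b,
    "INT_SUB": lambda a, b: a - b,
}


def propagate_constants(ops):
    """Replace known-constant inputs and fold ops whose inputs are all constant."""
    known = {}  # varnode name -> constant value
    result = []
    for dst, opcode, inputs in ops:
        resolved = tuple(known.get(i, i) for i in inputs)
        if all(isinstance(i, int) for i in resolved):
            known[dst] = FOLD[opcode](*resolved)
            result.append((dst, "COPY", (known[dst],)))
        else:
            known.pop(dst, None)  # dst no longer holds a known constant
            result.append((dst, opcode, resolved))
    return result


for op in propagate_constants(OPS):
    print(op)
```

After the pass, the last op reads as "rsp minus 12" rather than "rsp minus t1", which is the kind of simplification that lets a decompiler present a stack offset as a named local variable.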

Why SLEIGH matters:

Adding a new ISA is "add a spec file", not "rewrite the program"
Unknown / proprietary processors (embedded gear, old ASICs, some IoT) can be analysed in Ghidra by writing a SLEIGH spec for them
The community has contributed SLEIGH definitions for 6502 / 8086 / SH4 / various retro machines
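The "add a spec file, not rewrite the program" idea can be sketched with a table-driven decoder. The 8-bit instruction set below is entirely hypothetical, and the (mask, match, template) rows only imitate the spirit of a SLEIGH specification; real SLEIGH is a much richer language that also describes instruction semantics.

```python
# Hypothetical 8-bit ISA described as data: (mask, match, template).
# Bits outside the mask are operand fields: bits 3..2 = register, bits 1..0 = immediate.
TOY_SPEC = [
    (0xF0, 0x10, "load r{reg}, #{imm}"),
    (0xF0, 0x20, "add r{reg}, #{imm}"),
    (0xFF, 0xFF, "halt"),
]


def disassemble(code, spec):
    """Generic engine: knows nothing about the ISA beyond what the spec rows say."""
    lines = []
    for byte in code:
        for mask, match, template in spec:
            if (byte & mask) == match:
                lines.append(template.format(reg=(byte >> 2) & 3, imm=byte & 3))
                break
        else:
            lines.append("db 0x%02x" % byte)  # no row matched: emit raw data
    return lines


print(disassemble(bytes([0x16, 0x2B, 0xFF, 0x80]), TOY_SPEC))

# Supporting a new instruction means adding a row, not touching the engine.
EXTENDED_SPEC = TOY_SPEC + [(0xF0, 0x30, "sub r{reg}, #{imm}")]
print(disassemble(bytes([0x35]), EXTENDED_SPEC))
```

The engine function stays fixed while the table grows, which mirrors the division of labour between Ghidra's analysis core and its per-processor specifications.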

Why P-Code matters:

An analysis plug-in written against P-Code while studying x86 also works on ARM, because architecture dependence is erased at the upper layer
Data-flow analysis / symbolic execution / abstract interpretation need to be written only once to work on every arch
External frameworks can consume P-Code as well; angr, for example, has an optional P-Code-based engine
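The "write the analysis once" point can be sketched as two tiny lifters feeding one shared analysis. Everything here is illustrative: lift_x86, lift_arm, and frame_size are hypothetical names, the lifters parse assembly text rather than bytes, and they omit the flag-update ops that real P-Code would also contain.

```python
def lift_x86(text):
    """Lift 'sub <reg>, <imm>' into one (output, opcode, inputs) op."""
    mnemonic, rest = text.split(None, 1)
    if mnemonic != "sub":
        raise ValueError("toy lifter only handles sub")
    dst, imm = [part.strip() for part in rest.split(",")]
    return [(dst, "INT_SUB", (dst, int(imm, 0)))]


def lift_arm(text):
    """Lift 'sub <rd>, <rn>, #<imm>' into the same op shape."""
    mnemonic, rest = text.split(None, 1)
    if mnemonic != "sub":
        raise ValueError("toy lifter only handles sub")
    rd, rn, imm = [part.strip() for part in rest.split(",")]
    return [(rd, "INT_SUB", (rn, int(imm.lstrip("#"), 0)))]


def frame_size(ops, sp):
    """Sum constant subtractions from the stack pointer; ISA-independent."""
    total = 0
    for dst, opcode, (src, value) in ops:
        if opcode == "INT_SUB" and dst == sp and src == sp:
            total += value
    return total


print(frame_size(lift_x86("sub rsp, 0x28"), "rsp"))
print(frame_size(lift_arm("sub sp, sp, #0x28"), "sp"))
```

Only the lifters know about instruction syntax; frame_size never sees anything except the normalised ops, so it needs no changes when another architecture is added.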

Major features: what analysts use day to day

Ghidra is hard to summarise in one phrase because it is an integrated environment, but the features analysts reach for most often are these.

Feature	What it does
Code Browser	The central UI for the whole binary. Disassembly / Decompiler / symbols / references on one screen
Decompiler	Reconstruct pseudo-C from the assembly (Ghidra's biggest draw)
Function Graph	Visualise a function's control flow as a directed graph of Basic Blocks
String Search	Extract string constants from the binary → find malware URLs, process names, API names
Symbol Tree	Organise functions / globals / namespaces into a tree
Cross References (xrefs)	Trace bidirectional callers / users of a function or variable
Data Type Manager	Define structs / unions / enums and apply types to memory regions
Function ID / FidDb	Auto-name known library functions (libc / OpenSSL / .NET ...) via signature matching
Bookmark	Mark a location as "important," return to it later
Version Tracking	Align functions between two binaries (pre/post patch, two variants)
Headless Analyzer	Create projects / analyse / run scripts from the CLI, no GUI required (CI integration)
Script Manager	Extend with Python (Jython) / Java
Collaborative Server	Set up a Ghidra Server so multiple analysts share an analysis in real time

The consistent design philosophy is "support the process of an analyst incrementally attaching meaning to a featureless binary." Even an analysis that doesn't finish in a single session has all its state saved in the project (the .gpr marker file plus its accompanying .rep directory), so you continue where you left off the next day.
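The idea behind the Function ID row in the table (naming known library functions by signature) can be sketched as a hash lookup. This is a toy under stated assumptions: it hashes only mnemonics so that addresses and immediates do not affect the match, whereas Function ID's actual hashing scheme is more involved. The names signature, learn, identify, and the KNOWN database are hypothetical.

```python
import hashlib

KNOWN = {}  # signature -> library function name (hypothetical database)


def signature(instructions):
    """Hash the mnemonic sequence, ignoring operands such as addresses."""
    mnemonics = [ins.split()[0] for ins in instructions]
    return hashlib.sha256(" ".join(mnemonics).encode("ascii")).hexdigest()


def learn(name, instructions):
    """Record a function from a library whose symbols are known."""
    KNOWN[signature(instructions)] = name


def identify(instructions):
    """Name a function from a stripped binary if its signature is known."""
    return KNOWN.get(signature(instructions), "FUN_unknown")


# Learned from a library build that still has symbols.
learn("strlen", ["xor eax, eax", "cmp byte [rdi+rax], 0", "je 0x10", "inc rax", "jmp 0x2"])

# Same code at different addresses in a stripped binary.
print(identify(["xor eax, eax", "cmp byte [rdi+rax], 0", "je 0x401010", "inc rax", "jmp 0x401002"]))
print(identify(["push rbp", "mov rbp, rsp", "ret"]))
```

Exact hashing like this is brittle against compiler and optimisation differences, which is one reason signature databases are usually built per library version and per compiler.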

Typical workflow: from Import to Decompile

What a real analysis session looks like:

1. Create a project

File → New Project, pick Non-Shared or Shared.

2. Import the binary

PE / ELF / Mach-O are auto-detected; raw images need the language chosen by hand. For recognised formats you can usually accept the auto-detected "Format", "Language", and "Compiler" values.

3. Auto-Analyze

Accept the default analyzers; the run takes seconds to minutes. Function ID, Stack, Decompiler Parameter ID, and DWARF are among the analyzers that run.

4. Start at main / entry point

Symbol Tree → main to jump to the starting point. Read Disassembly and Decompiler side-by-side in Code Browser.

5. Rename, type, comment

L renames variables / functions, Ctrl-L attaches a type, ; adds a comment. Attach meaning incrementally.

6. Unfold via xrefs

Ctrl-Shift-F follows callers / references. Anchor your search on strings or syscalls (printf / WriteFile / connect).

7. Automate via scripts → save

Codify recurring work in Python / Java via the Script Manager. Save and share in a Shared Project + Ghidra Server.
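Step 6, unfolding via xrefs from an anchor such as connect, amounts to walking an inverted call graph. The sketch below shows that walk on a made-up call graph in plain Python; the CALLS data and the function names are hypothetical, and inside Ghidra the edges would come from the program's reference database instead.

```python
from collections import deque

# Hypothetical call graph: caller -> list of callees.
CALLS = {
    "main": ["parse_args", "run"],
    "run": ["beacon", "log"],
    "beacon": ["connect", "send"],
    "log": ["WriteFile"],
}


def invert(calls):
    """Turn caller -> callees into callee -> callers (the xref direction)."""
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    return callers


def callers_of(target, calls):
    """Breadth-first walk from target up through everything that reaches it."""
    callers = invert(calls)
    seen = {target}
    order = []
    queue = deque([target])
    while queue:
        node = queue.popleft()
        order.append(node)
        for parent in callers.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


print(callers_of("connect", CALLS))
```

Starting from an interesting API and walking upward keeps the set of functions to read small, compared with reading downward from main.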

▸ Decompiler quality is "80 % right, 20 % wrong"

Typical failure modes: lost track of the stack pointer / mis-identified register liveness / wrong function boundaries. When something looks suspicious, go back to the assembly and verify. Right-click in the Decompiler → "Edit Function Signature" or "Override Signature" to correct types; the Decompiler output often improves dramatically right after.

Essential shortcuts that change your productivity

# Key shortcuts in the Code Browser
L             Rename a symbol (function / variable)
;             Add a comment
F             Edit the function (name, signature, calling convention)
Ctrl-Shift-F  List Cross References
G             Jump to a given address
Ctrl-Down     Go to the next function
Ctrl-L        Apply a data type

Comparison with other RE tools

Ghidra is not the only RE tool. The major modern ones are listed below; analysts pick among them by preference and purpose:

Tool	Price	Decompiler	Strengths	Weaknesses	Fit
Ghidra (NSA / OSS)	Free	Good: built-in, multi-arch (SLEIGH)	Free IDA-class tooling / rare ISAs / Headless automation / Shared Project	Java startup cost / quirky UI / fewer plugins	Beginners / individuals / bulk samples / CI
IDA Pro (Hex-Rays, 1991-)	$$$ (thousands+)	Excellent: Hex-Rays, top-tier quality (sold separately)	Best-in-class decompiler / industry de facto / rich plugin ecosystem	Expensive / rare ISAs sold separately	Commercial / large SOCs / projects needing top quality
Binary Ninja (Vector 35, 2016-)	$ ($299+)	Good: HLIL, multi-tier IRs	Refined UI / UX / clean API design / fast to launch and operate	Free version limited / moderate ISA coverage	Pro individuals / API-heavy users
radare2 / Cutter (OSS, 2006-)	Free	Fair: pdc / boosted by r2ghidra	CLI-complete / Unix philosophy / many ISAs / pipes	Steep learning curve / weak decompiler	CLI users / automation / CTF

▸ How to choose

"Start with Ghidra, buy IDA if you need to" is the typical modern learning path. The biggest impact Ghidra had on the industry is removing the "I have to buy IDA first" hurdle. objdump / nm / readelf / strings remain useful as supporting tools, but they are not full analysis environments.

How it's used: practical contexts

Six common practical contexts for Ghidra.

(1) Malware analysis: sample received → Import → Auto-analyze → extract URLs / IPs / API names with Strings → Decompile suspicious functions to read the behaviour. High-profile samples such as WannaCry (2017) have become staples of public Ghidra write-ups since the 2019 release; Sunburst (SolarWinds, 2020), being a .NET assembly, was analysed mainly with .NET decompilers. The typical setup is initial triage in Ghidra, followed by a hand-off to dynamic analysis (Cuckoo Sandbox / x64dbg).

(2) Vulnerability research / patch diffing: after a CVE drops, diff the binaries from before and after a Microsoft or Adobe patch to pin down what was changed. Ghidra's Version Tracking helps with function-level alignment and difference display. BinDiff / Diaphora are also commonly used.

(3) CTF: Ghidra is a de facto standard for CTF reversing challenges. Extracting flags from stripped ELF, unpicking the structure of Rust / Go binaries, decompiling custom VMs: Ghidra's flexibility shines in all of these.

(4) Firmware / IoT: for router / IP camera / embedded firmware, unpack with binwalk → extract ELF / raw bin → analyse in Ghidra. Coverage for MIPS / ARM / RISC-V plus the ability to extend to rare embedded CPUs via SLEIGH both pay off here.

(5) Protocol analysis: for proprietary network protocols (games / SCADA / old vendor-specific protocols), reverse the packet format from the implementing binary. The standard move is to trace from the receive path via cross-references.

(6) License-check defeat (legally grey): the classic use case of reversing shareware "serial number checks."

▸ Before crossing legal lines — distributing cracks is copyright infringement

Legitimate research is fine, but distributing a crack of commercial software is copyright infringement. Keep it to research on software you own. Even for malware analysis, get samples from legitimate sources such as VirusTotal / Malshare / MalwareBazaar.
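For the patch-diffing context in (2), the core idea of comparing a function before and after a patch can be sketched with the standard library's difflib. This is not how Version Tracking, BinDiff, or Diaphora work internally; it assumes functions have already been matched by name and are available as lists of instruction strings, and changed_functions is a hypothetical helper.

```python
import difflib


def changed_functions(before, after):
    """Return {name: similarity} for functions whose instructions differ.

    before / after map a function name to its list of instruction strings.
    """
    report = {}
    for name in sorted(set(before) & set(after)):
        matcher = difflib.SequenceMatcher(None, before[name], after[name])
        ratio = matcher.ratio()
        if ratio < 1.0:
            report[name] = round(ratio, 2)
    return report


# Hypothetical pre-patch and post-patch listings.
BEFORE = {
    "parse": ["mov eax, [rdi]", "call memcpy", "ret"],
    "init": ["xor eax, eax", "ret"],
}
AFTER = {
    "parse": ["mov eax, [rdi]", "cmp eax, 0x100", "ja fail", "call memcpy", "ret"],
    "init": ["xor eax, eax", "ret"],
}

print(changed_functions(BEFORE, AFTER))
```

In this made-up example the only changed function gained a bounds check in front of memcpy, which is the typical shape of a fix worth reading closely in the decompiler.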

Headless Analyzer and scripts: about automation

Ghidra can be driven from the CLI without the GUI (analyzeHeadless). This mode is used for automated analysis of bulk samples, CI integration, and batch processing.

analyzeHeadless for CI / batch analysis

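# Create project + Import + Auto-analyze, then run a script
# (analyzeHeadless lives in the support/ directory of the Ghidra install)
$ analyzeHeadless /path/to/project ProjectName \
    -import sample.exe \
    -postScript MyAnalysisScript.py
# Run only a script against a program already in the project
# (-noanalysis skips re-running auto-analysis)
$ analyzeHeadless /path/to/project ProjectName \
    -process sample.exe \
    -noanalysis \
    -scriptPath ./scripts \
    -postScript ExtractStrings.py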
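For bulk samples, the import command above is usually generated rather than typed. The following is a minimal sketch that builds one argument list per file, using only the flags shown above. The helper names headless_command and batch are hypothetical, and the sketch assumes analyzeHeadless is on PATH.

```python
import pathlib


def headless_command(project_dir, project_name, sample, script, script_path=None):
    """Build the argument list for one import-and-analyse run."""
    cmd = ["analyzeHeadless", str(project_dir), project_name, "-import", str(sample)]
    if script_path:
        cmd += ["-scriptPath", str(script_path)]
    cmd += ["-postScript", script]
    return cmd


def batch(samples_dir, project_dir, project_name, script, script_path=None):
    """One command per regular file in samples_dir, in sorted order."""
    files = sorted(p for p in pathlib.Path(samples_dir).iterdir() if p.is_file())
    return [headless_command(project_dir, project_name, p, script, script_path) for p in files]


print(headless_command("/path/to/project", "ProjectName", "sample.exe",
                       "ExtractStrings.py", "./scripts"))
```

Each returned list can be handed to subprocess.run(cmd, check=True). Building argument lists instead of shell strings avoids quoting problems with unusual sample file names.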

Python (Jython) script: list every function and its callers

# Write in Script Manager → New Python
fm = currentProgram.getFunctionManager()
for func in fm.getFunctions(True):  # True = iterate in address order
    print("Function: %s" % func.getName())
    # getReferencesTo() is the flat-API helper available to every script
    for ref in getReferencesTo(func.getEntryPoint()):
        if ref.getReferenceType().isCall():
            print("  called from: %s" % ref.getFromAddress())

Pull URL-like / IP-like things out of the strings

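import re
# Loose patterns: a URL scheme or a dotted quad; expect false positives (e.g. version strings)
PATTERN = re.compile(r"https?://|\b(?:\d{1,3}\.){3}\d{1,3}\b")
listing = currentProgram.getListing()
for data in listing.getDefinedData(True):  # True = forward iteration
    if data.hasStringValue():
        text = u"%s" % data.getValue()  # normalise Java / Jython strings to text
        if PATTERN.search(text):
            print("%s  %s" % (data.getAddress(), text))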

▸ Community script collections

Ghidra-Scripts (several repos on GitHub): Find Crypt / String Decryption / Anti-VM detection. Ghidra-CTF: generic scripts for CTF. Ghidra Bridge: bridges Ghidra's Jython to external CPython (with the entire PyPI library ecosystem) → lets you integrate angr / capa / yara.

Limits and tricks: "the Decompiler is not perfect"

The limits of Ghidra (and of every RE tool):

Decompiler output is approximate: lost stack tracking, failed type inference, kernel code, and aggressive optimisation (LTO/PGO) all produce errors. Going back to the assembly to verify is the rule
Packing / obfuscation: Ghidra does not unpack samples itself. UPX-class packers can be undone with the packer's own tool (upx -d), but commercial protectors (VMProtect, Themida) and hand-rolled packers must be unpacked dynamically (i.e., dumped from the running process) before being passed in
JIT / JVM / .NET / Python: these targets are intermediate bytecode, not machine code, so specialist tools such as dnSpy (C#), jadx (Java), and Decompyle3 (Python) fit better
Stripped binaries: without symbols / debug info, every function and variable name has to be reapplied by you
Huge binaries: at 100 MB scale, Auto-analyze takes tens of minutes to hours, and the GUI gets slow. Process in chunks with Headless
JVM startup cost: ghidraRun alone takes 10 to 20 seconds to come up
Idiosyncratic UI: newcomers from IDA / Binary Ninja take a while to adjust
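Before loading a sample, a common triage heuristic for the packing problem is to measure byte entropy: compressed or encrypted data sits close to 8 bits per byte, while ordinary code and data are noticeably lower. The sketch below computes Shannon entropy in plain Python. The idea that values above roughly 7 suggest packing is a rule of thumb and an assumption here, not a Ghidra feature or a fixed threshold.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(shannon_entropy(b"\x00" * 1024))         # constant padding
print(shannon_entropy(bytes(range(256)) * 4))  # every byte value equally often
print(shannon_entropy(b"push rbp; mov rbp, rsp; sub rsp, 0x28; ret" * 20))
```

Running this per section rather than over the whole file gives a clearer signal, since a packed payload is usually confined to one or two sections next to a small unpacking stub.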

▸ Field-tested tips

Start small — for unknown functions, "trace back from callers (xrefs)" is the most efficient. Top-down from main tends to dead-end
Use the signature DB — Function ID auto-naming for library functions instantly removes cognitive load
Apply structures — once you suspect "this region looks like an XYZ struct" and attach a type, the Decompiler output gets dramatically better
Don't skimp on comments and renames — assume future-you, six months later, is the reader
Scripts for routines — anything you do twice should become a script
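The "apply structures" tip is easiest to appreciate from the other direction: once a layout is written down, raw offsets turn into named fields. The sketch below does this outside Ghidra with Python's struct module. The Header layout (u32 magic, u16 version, u16 flags, u32 length, little-endian) is a hypothetical guess of the kind an analyst might form mid-analysis.

```python
import struct
from collections import namedtuple

# Hypothetical layout: u32 magic, u16 version, u16 flags, u32 length (little-endian).
Header = namedtuple("Header", "magic version flags length")
HEADER_FORMAT = "<IHHI"


def parse_header(blob: bytes) -> Header:
    """Interpret the first 12 bytes of blob as a Header."""
    return Header._make(struct.unpack_from(HEADER_FORMAT, blob, 0))


raw = struct.pack(HEADER_FORMAT, 0x4D5A4B31, 2, 0x8001, 64) + b"payload"
hdr = parse_header(raw)

# Untyped view: "the 4 bytes at offset 8". Typed view: hdr.length.
print(struct.unpack_from("<I", raw, 8)[0], hdr.length)
```

Defining the equivalent struct in the Data Type Manager and applying it to the pointer has the same effect on the Decompiler view: expressions built from pointer arithmetic and casts become field accesses.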

Related tools and the wider ecosystem

Modern analysis is Ghidra plus its surrounding tools, not Ghidra alone.

Role	Tools
Dynamic analysis	x64dbg / OllyDbg / GDB / WinDbg / Frida
Sandbox	Cuckoo Sandbox / Joe Sandbox / Hybrid Analysis / Any.Run
Firmware extraction	binwalk / firmware-mod-kit / unblob
Diff / Version Tracking	BinDiff (Google) / Diaphora
Symbolic execution	angr / Triton / KLEE
Signatures / YARA	yara / yarGen / capa
Unpackers	unipacker / scyllahide / Volatility (memory dumps)
Malware sample platforms	VirusTotal / Malshare / MalwareBazaar
Analysis notebooks	Obsidian / Notion / Markdown + Git

Advanced users typically call the Ghidra API from external CPython through Ghidra Bridge, then compose it with Python libraries such as angr (symbolic execution) and capa (capability detection).

▸ Summary

Ghidra combines declarative processor specification via SLEIGH + the P-Code IR + a shared analysis engine to analyse x86 / ARM / MIPS / PowerPC / RISC-V — even unusual embedded CPUs — in a single tool. Its biggest historical contribution is "a free and open-source decompiler in the hands of the general public". The effect of moving the start line of RE learning from "price" to "interest and time" cannot be overstated. If you're starting to learn RE, start with Ghidra, and bring in IDA Pro / Binary Ninja / radare2 as you need them — that is the modern standard route.