OnlyFormat

How to Identify a File's Real Type with Magic Bytes

OnlyFormat Editorial Team · 7 min read

Here is an uncomfortable fact about every file on your computer: its name tells you nothing reliable about what it actually is. The .jpg at the end of photo.jpg is a label, not a fact. Rename it to photo.png and the bytes on disk don't change at all — it's still exactly the same JPEG, now wearing a misleading name. This gap between what a file claims to be and what it is causes a surprising number of everyday headaches, and the way to close it is something called magic bytes.

1. The extension is just a label

A file extension is the part of the file name after the last dot. Operating systems use it as a quick hint to decide which application should open the file and which icon to show. That's the entire role it plays. The extension is not stored inside the file, it is not verified against the contents, and changing it does not convert anything. It is purely a naming convention.

This is why "rename it to .jpg" almost never fixes a broken image, and why double-clicking a downloaded file sometimes opens the wrong program or fails entirely. The label and the contents have drifted apart. To know what you're really dealing with, you have to look inside the file — at its first few bytes.
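A quick way to convince yourself is to hash a file before and after renaming it. The sketch below is illustrative only: the helper names and the temporary photo.jpg / photo.png paths are made up for the demonstration, and the "JPEG" is just a few placeholder bytes.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def rename_keeps_bytes(data: bytes) -> bool:
    """Write data as photo.jpg, rename it to photo.png, compare digests."""
    with tempfile.TemporaryDirectory() as tmp:
        jpg_path = os.path.join(tmp, "photo.jpg")
        png_path = os.path.join(tmp, "photo.png")
        with open(jpg_path, "wb") as f:
            f.write(data)
        before = sha256_of(jpg_path)
        os.rename(jpg_path, png_path)  # only the name changes
        after = sha256_of(png_path)
        return before == after


if __name__ == "__main__":
    # Placeholder bytes that merely start like a JPEG.
    print(rename_keeps_bytes(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
```

The digests match because a rename touches the directory entry, not the file's contents.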

2. What magic bytes are

Almost every binary file format begins with a fixed, distinctive sequence of bytes called a magic number or file signature. It usually sits at offset 0, the very start of the file (a few formats put the identifying bytes slightly later), and acts as a self-declaration: "I am a PNG," "I am a PDF," "I am a ZIP." Because the signature is part of the actual data, it travels with the file no matter how many times someone renames it. Read the signature and you know the truth.
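Reading a signature takes only a few lines. Here is a minimal sketch, assuming you just want the first eight bytes printed as hex; the function names are ours, not part of any library.

```python
def read_signature(path: str, length: int = 8) -> bytes:
    """Return up to `length` bytes from the very start of the file."""
    with open(path, "rb") as f:  # binary mode: no newline or encoding tricks
        return f.read(length)


def to_hex(data: bytes) -> str:
    """Format bytes the way signatures are usually written, e.g. '89 50 4E 47'."""
    return data.hex(" ").upper()
```

For a real PNG, to_hex(read_signature("picture.png")) should give 89 50 4E 47 0D 0A 1A 0A, whatever the file happens to be called.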

The term "magic number" comes from early Unix, where the kernel inspected the first bytes of a file to decide how to execute it. The idea stuck because it's simple and robust: put an unmistakable fingerprint at the front of the format and everything downstream can identify it cheaply.

3. Common signatures you'll meet

Signatures are usually written in hexadecimal. Here are the ones you'll run into most often:

Format | First bytes (hex)          | Notes
PNG    | 89 50 4E 47 0D 0A 1A 0A    | Spells ".PNG" with guard bytes
JPEG   | FF D8 FF                   | Start of Image marker
GIF    | 47 49 46 38                | ASCII "GIF8" (87a / 89a)
PDF    | 25 50 44 46                | ASCII "%PDF"
WebP   | 52 49 46 46 … 57 45 42 50  | RIFF container, "WEBP" at byte 8
ZIP    | 50 4B 03 04                | ASCII "PK"; also DOCX/XLSX/EPUB
GZIP   | 1F 8B                      | .gz / .tgz streams
BMP    | 42 4D                      | ASCII "BM"
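The table translates almost directly into a lookup. The following is a sketch rather than a complete detector: SIGNATURES and detect_by_prefix are illustrative names, and only the formats listed above are covered.

```python
from typing import Optional

# (signature bytes, format name), taken from the table above.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
    (b"GIF8", "GIF"),
    (b"%PDF", "PDF"),
    (bytes.fromhex("504B0304"), "ZIP"),
    (bytes.fromhex("1F8B"), "GZIP"),
    (b"BM", "BMP"),
]


def detect_by_prefix(head: bytes) -> Optional[str]:
    """Guess a format from the first bytes of a file, or return None."""
    # WebP needs two checks: "RIFF" at offset 0 and "WEBP" at offset 8.
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "WebP"
    for signature, name in SIGNATURES:
        if head.startswith(signature):
            return name
    return None
```

Short signatures such as the two-byte BMP and GZIP prefixes can match unrelated data by chance, so a production detector would normally check further header fields before committing to an answer.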

Two patterns are worth calling out. RIFF is a generic container: WebP, WAV audio, and AVI video all start with the ASCII bytes RIFF, and you have to read the four-character type at byte 8 to tell them apart. ISO-BMFF formats (AVIF, HEIC, MP4, MOV) open with an ftyp box: a four-byte box size, then the ASCII tag ftyp at byte 4, then a four-character "brand" at byte 8 (avif, heic, isom, qt) that distinguishes them.
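Both container checks come down to slicing fixed offsets. A minimal sketch, with function names of our own choosing; it reads only the major brand and ignores the list of compatible brands that follows it in a real ftyp box.

```python
from typing import Optional


def riff_form_type(head: bytes) -> Optional[str]:
    """Return the four-character RIFF type at byte 8 (e.g. 'WEBP', 'WAVE', 'AVI ')."""
    if len(head) >= 12 and head[:4] == b"RIFF":
        return head[8:12].decode("ascii", errors="replace")
    return None


def bmff_major_brand(head: bytes) -> Optional[str]:
    """Return the four-character brand at byte 8 if 'ftyp' sits at byte 4."""
    if len(head) >= 12 and head[4:8] == b"ftyp":
        return head[8:12].decode("ascii", errors="replace")
    return None
```

Brands and RIFF types are padded with spaces to four characters, so AVI comes back as "AVI " and QuickTime as "qt  ".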

4. Text formats have no magic number

Plain-text formats — SVG, HTML, XML, JSON, CSV, Markdown — generally don't carry a binary signature, because they're just human-readable characters. They're identified instead by their structure: an SVG begins with <svg or an XML declaration, an HTML document with <!doctype html>, a JSON file with { or [. One exception is a byte-order mark (BOM) — the bytes EF BB BF at the start of some UTF-8 text files — which a detector has to skip before reading the content.
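Structural sniffing for text can be sketched like this. It is a heuristic under simple assumptions (UTF-8 input, the markers named above, and a made-up function name), not a parser; a file that merely starts with { is only probably JSON.

```python
import codecs
from typing import Optional


def sniff_text(data: bytes) -> Optional[str]:
    """Guess a text format from its opening characters, or return None."""
    # Skip the UTF-8 byte-order mark (EF BB BF) if present.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Undecodable bytes become U+FFFD, so binary input simply fails to match.
    start = data.decode("utf-8", errors="replace").lstrip().lower()
    if start.startswith("<!doctype html") or start.startswith("<html"):
        return "HTML"
    if start.startswith("<svg"):
        return "SVG"
    if start.startswith("<?xml"):
        # An XML declaration may introduce SVG or any other XML vocabulary.
        return "SVG" if "<svg" in start else "XML"
    if start.startswith(("{", "[")):
        return "JSON"
    return None
```

Leading whitespace is stripped before matching because text files often begin with a blank line or indentation.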

5. The ZIP-family trap

The single most confusing signature is ZIP's 50 4B 03 04. The reason: a huge number of modern formats are secretly ZIP files. A Word .docx, an Excel .xlsx, a PowerPoint .pptx, an EPUB e-book, a Java .jar, an Android .apk — all of them are ZIP archives containing XML and assets. So when a byte-level detector tells you a .docx is "a ZIP archive," it's completely correct: the office format is a layer of convention on top of the ZIP container. To see it for yourself, copy a .docx, rename the copy to .zip, and open it — you'll find the XML inside.
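To get past "it's a ZIP," a detector has to look at what the archive contains. The sketch below uses Python's standard zipfile module and a handful of conventional member names ([Content_Types].xml plus a word/, xl/ or ppt/ folder for Office files, a mimetype entry for EPUB, AndroidManifest.xml for APK, META-INF/MANIFEST.MF for JAR). Treat these markers as common conventions we are assuming, not an exhaustive rule set; classify_zip and build_demo_zip are hypothetical helper names.

```python
import io
import zipfile


def classify_zip(data: bytes) -> str:
    """Name the ZIP-based format by peeking at the archive's member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        if "mimetype" in names and zf.read("mimetype").strip() == b"application/epub+zip":
            return "EPUB"
        if "[Content_Types].xml" in names:
            if any(n.startswith("word/") for n in names):
                return "DOCX"
            if any(n.startswith("xl/") for n in names):
                return "XLSX"
            if any(n.startswith("ppt/") for n in names):
                return "PPTX"
        # Check APK before JAR: an APK may also carry META-INF/MANIFEST.MF.
        if "AndroidManifest.xml" in names:
            return "APK"
        if "META-INF/MANIFEST.MF" in names:
            return "JAR"
    return "ZIP"


def build_demo_zip(files: dict) -> bytes:
    """Demo helper: build a small in-memory ZIP from {name: content}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


if __name__ == "__main__":
    fake_docx = build_demo_zip({"[Content_Types].xml": "<Types/>",
                                "word/document.xml": "<w/>"})
    print(fake_docx[:4], classify_zip(fake_docx))
```

The demo archive still begins with 50 4B 03 04; only the member names reveal that it follows the DOCX layout.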

6. Why this matters in practice

  • A converter rejects your file. You uploaded image.jpg, but it's really a WebP from a chat app, and the converter keys off the extension. Detect the real type, then convert from that.
  • An image won't open in an old editor. Same cause — a modern format wearing a legacy extension.
  • A "document" behaves strangely. A file claiming to be a PDF that's actually HTML is a common phishing pattern. Checking the signature before opening is a cheap safeguard.
  • You're debugging an upload pipeline. Server-side validation should check magic bytes, not the client-supplied extension or MIME type, both of which are trivially spoofed.
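For the upload case, a server-side check might look roughly like this. It is a sketch under stated assumptions: the allow-list, the extension map, and the names sniff_type and check_upload are all illustrative, and a real pipeline would usually go on to decode the file fully rather than stop at the signature.

```python
import os
from typing import Optional, Tuple


def sniff_type(head: bytes) -> Optional[str]:
    """Identify a few common formats from their leading bytes."""
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if head.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if head.startswith(b"GIF8"):
        return "GIF"
    if head.startswith(b"%PDF"):
        return "PDF"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "WebP"
    return None


# What each extension claims the file to be (illustrative subset).
EXTENSION_TYPES = {".png": "PNG", ".jpg": "JPEG", ".jpeg": "JPEG",
                   ".gif": "GIF", ".pdf": "PDF", ".webp": "WebP"}


def check_upload(filename: str, head: bytes,
                 allowed=frozenset({"PNG", "JPEG", "GIF", "WebP"})) -> Tuple[bool, str]:
    """Decide from the bytes, then report whether the name agrees."""
    actual = sniff_type(head)
    if actual is None or actual not in allowed:
        return False, f"rejected: content is {actual or 'unrecognized'}"
    ext = os.path.splitext(filename)[1].lower()
    if EXTENSION_TYPES.get(ext) != actual:
        return True, f"accepted as {actual}, but extension {ext or '(none)'} does not match"
    return True, f"accepted as {actual}"
```

The decision rests on the bytes alone; the extension is consulted only to produce a warning, so a WebP named image.jpg is accepted as WebP and an HTML page named invoice.pdf is rejected.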

7. How to check a file yourself

The fastest way is our Format Detective tool: drop in any file and it reads the first few kilobytes in your browser, reports the real format, and warns you if the extension doesn't match. Nothing is uploaded. If you prefer the command line, the Unix file command does the same thing (file mystery.bin), and a hex viewer like xxd or hexdump -C lets you read the signature directly. For programmatic use, validate uploads on the server by inspecting the leading bytes rather than trusting the extension.
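If no hex viewer is at hand, a few lines of Python give a similar view. This sketch only loosely imitates the layout of hexdump -C (offset, hex bytes, printable ASCII); hex_dump is our own helper name, not a standard function.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as 'offset  hex bytes  |ascii|' lines."""
    pad = width * 3 - 1  # width of a full row of 'xx ' groups
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(pad)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  |{ascii_part}|")
    return "\n".join(lines)


if __name__ == "__main__":
    print(hex_dump(b"%PDF-1.7\n"))
```

Reading the first row of such a dump against the signature table is usually enough to name the format.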

FAQ

What are magic bytes?

Magic bytes (also called a magic number or file signature) are a short, fixed sequence of bytes at the very start of a file that identifies its format. For example, every PNG file starts with the eight bytes 89 50 4E 47 0D 0A 1A 0A. Because the signature lives inside the file, it identifies the true format regardless of the file's name or extension.

Why can't I trust the file extension?

The extension is just part of the file name — metadata anyone can rename instantly. It isn't stored inside the file and doesn't change the actual data. A PNG renamed to photo.jpg is still a PNG on disk; only the label changed. Apps that decide how to open a file by its extension will then misbehave.

Why does a DOCX show up as a ZIP file?

Because it is one. DOCX, XLSX, PPTX, EPUB, JAR and APK are all ZIP archives internally — a package of XML and resource files. Their magic bytes are the ZIP signature (50 4B 03 04). The office format is a convention layered on top of ZIP, so byte-level detection correctly reports ZIP.

Is checking magic bytes a security measure?

It's a useful one. Extension/content mismatch is a classic phishing and malware trick — an attachment named invoice.pdf that is really an HTML page, for instance. Verifying the true type before opening or forwarding a file catches many of these. It is not a complete security solution, but it's a cheap, sensible habit.
