Write a PDF from scratch (01): Hello, World - Building a Minimal Usable PDF

Doclingo Team, January 30, 2026

Series Goal: Understand PDF as a readable file format - starting with a minimal example that "works," and gradually expanding to graphics, multi-page documents, compression, and resource reuse.

Series Directory

  • Part 01 (This Article): Hand-write a minimal PDF (1 page + 1 line of text), then use a tool to fill in the missing structure so that it becomes a standard PDF that can be opened.
  • Part 02: Draw lines/rectangles in the content stream (understanding paths, strokes, fills).
  • Part 03: Multi-page PDF (how the Pages tree is constructed).
  • Part 04: Closer to the real world (compressed streams, resource reuse, optional structures, etc.).

Why Understand the Underlying Structure of PDF?

PDF (Portable Document Format) is one of the most widely used formats for describing pages today. Unlike HTML/CSS, which emphasize separation of content and presentation and a reflowable design, PDF emphasizes fixed layout and WYSIWYG - the layout stays the same regardless of the device used to open it.

Understanding the underlying structure of PDF has several practical benefits:

  • Debugging PDF Generation Issues: When you encounter errors while generating PDFs with a code library, understanding the underlying structure allows you to quickly locate the problem.
  • Automated Processing: Operations like batch text extraction, document merging, and watermarking require an understanding of the structure for precise execution.
  • Security Auditing: Knowing what content can be embedded in a PDF (JavaScript, attachments, forms, etc.) aids in security analysis.
  • Learning File Format Design: The "object graph + random access" design of PDF is a classic example worth studying.

Preparation

This article will first create a "structurally incomplete but logically correct" hello-broken.pdf, and then use pdftk to automatically complete the key structures and output hello.pdf.

  • Required Tool: pdftk (free command-line tool, supports Windows/macOS/Linux)
  • Output Files: hello-broken.pdf (handwritten), hello.pdf (openable after repair)

Core Concept: The Three Layers of PDF Structure

The most important aspect of understanding PDF is to establish a three-layer mental model:

[Figure: PDF Three-Layer Mental Model]

1. Object Layer (Document Content)

A PDF document consists of many objects, which are interconnected through indirect references (e.g., 2 0 R). Common object types include:

Type               | Example            | Description
Name               | /Page              | Names starting with /
Integer/Real       | 50, 36.0           | Numeric values
String             | (Hello, World!)    | Enclosed in parentheses
Array              | [0 0 612 792]      | Ordered collection
Dictionary         | << /Type /Page >>  | Key-value pair collection
Indirect Reference | 2 0 R              | Reference to object 2 (generation number 0)
Stream             | stream...endstream | Binary data (e.g., drawing instructions, images)
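
As a rough illustration of the syntax in the table, the following Python sketch serializes ordinary Python values into PDF object notation. The Name and Ref classes and the to_pdf function are hypothetical helpers invented for this example; they are not part of any PDF library, and they skip details such as hex strings and name escaping.

```python
class Name(str):
    """A PDF name such as /Page (stored without the leading slash)."""


class Ref:
    """An indirect reference such as 2 0 R."""

    def __init__(self, num, gen=0):
        self.num, self.gen = num, gen


def to_pdf(value):
    """Serialize a Python value into PDF object syntax (simplified sketch)."""
    if isinstance(value, Name):          # must be tested before plain str
        return "/" + value
    if isinstance(value, Ref):
        return f"{value.num} {value.gen} R"
    if isinstance(value, bool):          # must be tested before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)                # assumes no exponent notation is produced
    if isinstance(value, str):
        # Backslashes and parentheses are escaped inside literal strings.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, list):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        inner = " ".join(f"/{key} {to_pdf(val)}" for key, val in value.items())
        return "<< " + inner + " >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


page = {"Type": Name("Page"), "MediaBox": [0, 0, 612, 792], "Parent": Ref(1)}
print(to_pdf(page))
```

Dictionary keys are plain Python strings here because PDF dictionary keys are always names; only the values need the Name wrapper.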

2. Content Layer (Page Content)

The sequence of instructions that actually "draws text/graphics on the page," typically written between stream ... endstream. The instructions use postfix notation: operands come first, and the operator that consumes them comes last.

/F0 36 Tf          ← Operand: /F0, 36  Operator: Tf (set font)
(Hello, World!) Tj ← Operand: string   Operator: Tj (draw text)
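
To make the "operands first, operator last" rule concrete, here is a minimal Python sketch that groups the tokens of a content stream into (operator, operands) pairs. The function name parse_content_stream is made up for this example, and the tokenizer is deliberately simplified: it understands numbers, names, and literal strings only, and ignores arrays, dictionaries, and hex strings.

```python
import re

# A literal string in parentheses, a /Name, or any other bare word or number.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/[^\s/()\[\]<>]*|[^\s/()\[\]<>]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")


def parse_content_stream(text):
    """Group tokens into (operator, operands) pairs, in stream order."""
    instructions, operands = [], []
    for token in TOKEN.findall(text):
        if token[0] in "(/" or NUMBER.match(token):
            operands.append(token)                    # keep collecting operands
        else:
            instructions.append((token, operands))    # an operator closes the group
            operands = []
    return instructions


stream = """1. 0. 0. 1. 50. 700. cm
BT
 /F0 36. Tf
 (Hello, World!) Tj
ET"""

for operator, operands in parse_content_stream(stream):
    print(operator, operands)
```

Operators such as BT and ET simply end up with an empty operand list.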

3. File Structure Layer

This layer lets the reader jump directly to any object without having to read the file from start to finish:

Element   | Function
%PDF-1.x  | File header, identifies PDF version
xref      | Cross-reference table: object number → byte offset
trailer   | Trailer dictionary: points to root object /Root
startxref | Indicates the starting position of the xref table
%%EOF     | End of file marker

What Objects Are Needed for a Minimal PDF?

A "minimal but displayable" PDF has the following object reference relationships:

[Figure: PDF Minimal Object Relationship Diagram]

Minimal Object List:

Object    | Function                    | Key Fields
Catalog   | Root object, document entry | /Type /Catalog, /Pages
Pages     | Page tree                   | /Type /Pages, /Kids, /Count
Page      | Single page                 | /Type /Page, /MediaBox, /Resources, /Contents, /Parent
Resources | Resource container          | /Font (font dictionary)
Font      | Font definition             | /Type /Font, /BaseFont, /Subtype
Contents  | Content stream              | Stream of drawing instructions

Practical Exercise: Manually Create hello-broken.pdf

Create a new file hello-broken.pdf and paste the following content completely:

%PDF-1.0
1 0 obj
<< /Type /Pages
   /Count 1
   /Kids [2 0 R]
>>
endobj

2 0 obj
<< /Type /Page
   /MediaBox [0 0 612 792]
   /Resources 3 0 R
   /Parent 1 0 R
   /Contents [4 0 R]
>>
endobj

3 0 obj
<< /Font
     << /F0
          << /Type /Font
             /BaseFont /Times-Italic
             /Subtype /Type1 >>
     >>
>>
endobj

4 0 obj
<< >>
stream
1. 0. 0. 1. 50. 700. cm
BT
 /F0 36. Tf
 (Hello, World!) Tj
ET
endstream
endobj

5 0 obj
<< /Type /Catalog
   /Pages 1 0 R
>>
endobj

xref
0 6
trailer
<< /Size 6
   /Root 5 0 R
>>
startxref
0
%%EOF

Why is this file "broken"?

We intentionally omitted or incorrectly filled in the following content:

Missing/Error Item | Description
xref offset        | The actual byte offsets for each object were not filled in.
startxref          | Filled in 0, which is not the actual location of the xref.
/Length            | The content stream does not declare its length.
Binary marker      | Missing the binary identification line at the header.

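The first three are pieces of information the reader relies on to locate and read objects; without them the file may not open at all, or may open only in readers that attempt their own repair. The binary marker is a recommended convention that helps other tools treat the file as binary rather than text.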
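
The four gaps listed above can be detected mechanically. The following Python sketch is a hypothetical checker (the name find_problems is invented here) that looks only for these four items using simple pattern matching; it is not a general PDF validator.

```python
import re


def find_problems(data: bytes):
    """Report the four structural gaps discussed above (simplified checks)."""
    problems = []

    # xref: each in-use or free object needs an entry "nnnnnnnnnn ggggg n|f".
    if not re.findall(rb"^\d{10} \d{5} [nf]", data, re.MULTILINE):
        problems.append("xref has no offset entries")

    # startxref: the number must be the byte offset of the "xref" keyword.
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    if match is None:
        problems.append("startxref / %%EOF not found")
    elif not data.startswith(b"xref", int(match.group(1))):
        problems.append("startxref does not point at xref")

    # /Length: every object that contains a stream must declare it.
    for obj in re.finditer(rb"(\d+) \d+ obj(.*?)endobj", data, re.DOTALL):
        head, keyword, _ = obj.group(2).partition(b"stream")
        if keyword and b"/Length" not in head:
            problems.append(f"object {int(obj.group(1))}: stream without /Length")

    # Binary marker: second line should be a comment with bytes >= 128.
    lines = data.split(b"\n", 2)
    second = lines[1] if len(lines) > 1 else b""
    if not (second.startswith(b"%") and any(byte >= 128 for byte in second)):
        problems.append("binary marker line missing")

    return problems
```

Usage, assuming hello-broken.pdf sits in the current directory: `print(find_problems(open("hello-broken.pdf", "rb").read()))`.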


Key Content Stream Instructions Explained

The content stream sits between stream ... endstream in object 4 (4 0 obj). Line by line:

1. 0. 0. 1. 50. 700. cm   ← Set transformation matrix (note that 1. represents the floating-point number 1.0)
BT                         ← Begin text object
 /F0 36. Tf                ← Select font F0, size 36pt
 (Hello, World!) Tj        ← Draw string
ET                         ← End text object

Transformation Matrix cm Operator

The six operands of 1 0 0 1 50 700 cm are the elements [a b c d e f] of a transformation matrix, corresponding to:

| a  b  0 |     | 1   0    0 |
| c  d  0 |  =  | 0   1    0 |
| e  f  1 |     | 50  700  1 |

When a=1, b=0, c=0, d=1, this is a pure translation matrix, moving the origin of the coordinate system (the point where subsequent drawing operations start) to (50, 700). If not moved, the default origin is at the bottom left corner of the page.
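
As a small worked example, the matrix can be applied to a point with the row-vector convention [x y 1] × M, which gives x' = a·x + c·y + e and y' = b·x + d·y + f. The helper name apply_cm below is hypothetical and exists only for this illustration.

```python
def apply_cm(matrix, point):
    """Map a point through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


translate = (1, 0, 0, 1, 50, 700)

print(apply_cm(translate, (0, 0)))    # the new origin, in page coordinates
print(apply_cm(translate, (10, 20)))  # a point 10 right and 20 up from it
```

With a=d=1 and b=c=0 the formulas reduce to x' = x + e and y' = y + f, which is why this matrix is a pure translation.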

Text Operators

Operator | Meaning                       | Example
BT       | Begin Text, start text object | BT
ET       | End Text, end text object     | ET
Tf       | Set font and size             | /F0 36 Tf
Tj       | Draw string                   | (Hello!) Tj

Use pdftk to Repair into an Openable PDF

Execute the following command in the directory where hello-broken.pdf is located:

pdftk hello-broken.pdf output hello.pdf

Open hello.pdf with any PDF reader, and you should see "Hello, World!" displayed on the page (in Times-Italic font, 36pt, located at the top left of the page).

What Did pdftk Complete for You?

Completed Item     | Description
Binary marker line | Added a comment line containing non-ASCII bytes after %PDF-1.0 so that tools treat the file as binary rather than text.
/Length            | Calculated and added the byte length for the content stream.
xref table         | Calculated the byte offsets for each object and filled them in.
startxref          | Filled in the actual starting position of the xref table.
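
The same bookkeeping can be done by hand in a few lines of code. The Python sketch below is not how pdftk works internally; it is a minimal illustration, under the assumption that every object has generation 0 and that objects are numbered 1..n in order. The names make_stream and build_pdf are invented for this example, and the four marker bytes are an arbitrary choice of values above 127.

```python
def make_stream(content: bytes) -> bytes:
    """Wrap content in a stream object body with a correct /Length."""
    # /Length counts only the bytes between "stream\n" and the final newline.
    return b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content)


def build_pdf(bodies, root_num):
    """Assemble object bodies into a PDF with xref, trailer, and startxref."""
    out = bytearray(b"%PDF-1.0\n%\xe2\xe3\xcf\xd3\n")   # header + binary marker
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))                  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)                           # value for startxref
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"               # entry 0 is always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset       # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (len(bodies) + 1, root_num)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


bodies = [
    b"<< /Type /Pages /Count 1 /Kids [2 0 R] >>",
    b"<< /Type /Page /MediaBox [0 0 612 792] /Resources 3 0 R"
    b" /Parent 1 0 R /Contents [4 0 R] >>",
    b"<< /Font << /F0 << /Type /Font /BaseFont /Times-Italic"
    b" /Subtype /Type1 >> >> >>",
    make_stream(b"1. 0. 0. 1. 50. 700. cm\nBT\n /F0 36. Tf\n"
                b" (Hello, World!) Tj\nET"),
    b"<< /Type /Catalog /Pages 1 0 R >>",
]
pdf = build_pdf(bodies, root_num=5)
```

Writing the result with `open("hello-python.pdf", "wb").write(pdf)` (the file name is arbitrary) should give a file comparable to the repaired hello.pdf, though the exact bytes pdftk emits will differ.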

Why Do We Need xref / trailer / startxref?

Core Purpose: Random Access

Imagine a 500-page PDF; without xref, the reader would have to parse from the beginning to display page 450 - which is too slow.

With xref, the reader can:

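  1. Read startxref at the end of the file → find the xref location.
  2. Read trailer → find the root object /Root.
  3. Follow the references from the root object through the Pages tree to the page 450 object.
  4. For each reference along the way, use xref to look up the byte offset of that object → seek directly to it and read it.

Locating any single object drops from an O(n) scan of the file to an O(1) table lookup.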
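
The lookup steps above can be sketched in Python as follows. This is a simplified illustration that assumes a classic, uncompressed xref table with a single subsection starting at object 0; the function names read_xref, root_number, and read_object are made up for this example.

```python
import re


def find_xref_pos(data: bytes) -> int:
    """Step 1: startxref lives near the end of the file."""
    match = re.search(rb"startxref\s+(\d+)", data[-1024:])
    if match is None:
        raise ValueError("startxref not found")
    return int(match.group(1))


def read_xref(data: bytes):
    """Return {object_number: byte_offset} for all in-use objects."""
    lines = data[find_xref_pos(data):].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")
    first, count = map(int, lines[1].split())
    offsets = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":                      # "n" = in use, "f" = free
            offsets[first + i] = int(offset)
    return offsets


def root_number(data: bytes) -> int:
    """Step 2: the trailer names the root object."""
    match = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", data[find_xref_pos(data):])
    if match is None:
        raise ValueError("/Root not found in trailer")
    return int(match.group(1))


def read_object(data: bytes, offsets, num: int) -> bytes:
    """Steps 3-4: jump straight to an object by its byte offset."""
    start = offsets[num]
    end = data.index(b"endobj", start) + len(b"endobj")
    return data[start:end]
```

Usage, assuming hello.pdf is the repaired file: load it with `data = open("hello.pdf", "rb").read()`, then `read_object(data, read_xref(data), root_number(data))` returns the Catalog object without scanning the objects before it.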


Exercises for This Article

It is recommended that you actually modify hello-broken.pdf, then repair it again with pdftk and observe the effects:

Exercise | Modification                                              | Observation
A        | Change (Hello, World!) to another English phrase          | Text change
B        | Change 36. (in /F0 36. Tf) to 12 or 72                    | Font size change
C        | Change 50. 700. (in the cm line) to 50 100                | Position moves down (PDF coordinate system origin is at the bottom left)
D        | Change /Times-Italic to /Helvetica or /Courier            | Font change
E        | Change /MediaBox [0 0 612 792] to [0 0 595 842]           | Paper size changes from US Letter to A4

Tip: The PDF coordinate system origin is at the bottom left of the page, with the Y-axis pointing up. (50, 700) means 50pt from the left and 700pt from the bottom.


Frequently Asked Questions

Q: Why use built-in Type1 fonts instead of TrueType?

A: The 14 standard Type1 fonts (the Times, Helvetica, and Courier families, plus Symbol and ZapfDingbats) are traditionally supplied by PDF readers themselves, so no font file needs to be embedded, which keeps the example simple. In real scenarios, fonts usually need to be embedded to ensure cross-platform consistency.

Q: What are the numbers in /MediaBox [0 0 612 792]?

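A: The unit is the point (1 point = 1/72 inch). The four numbers are the lower-left and upper-right corners of the page, so 612 × 792 points = 8.5 × 11 inches = US Letter paper. A4 (210 × 297 mm) is approximately 595 × 842 points.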
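
The arithmetic can be checked with a few lines of Python. The helper names below are hypothetical and only restate the definition 1 point = 1/72 inch (with 1 inch = 25.4 mm).

```python
def points_to_inches(points):
    """1 point is 1/72 of an inch."""
    return points / 72


def mm_to_points(mm):
    """Convert millimetres to points via inches (1 inch = 25.4 mm)."""
    return mm / 25.4 * 72


# US Letter: the /MediaBox values converted back to inches.
print(points_to_inches(612), points_to_inches(792))

# A4: 210 x 297 mm, rounded to whole points for use in /MediaBox.
print(round(mm_to_points(210)), round(mm_to_points(297)))
```

The A4 values are rounded because 210 mm and 297 mm do not correspond to a whole number of points.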

Q: What is the generation number (like 0 in 2 0 R)?

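A: It tracks reuse of object numbers. When an object is deleted and its number is later reused for a new object (for example, across incremental updates), the new object gets a higher generation number. Simply modifying an object in an incremental update does not change its generation number. In a newly created PDF, all object generation numbers are usually 0.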


Next Article Preview

In Part 02, we will continue using the "handwritten content stream" approach, adding the most basic graphic path operations:

  • m (moveto), l (lineto): define paths
  • S (stroke): stroke the path
  • re (rectangle), f (fill): draw rectangles and fill them

This will allow you to draw: title text + a horizontal separator line + a rectangle on the same page, progressing from "writing text" to "drawing graphics."

Copyright © 2026 Doclingo. All Rights Reserved.