Text Extraction Methods
There are two fundamentally different approaches to getting text out of a PDF, depending on the type of PDF you have:
Direct Extraction (Native PDFs)
Native PDFs — those created from Word, web browsers, or other software — contain embedded text data. The extraction tool reads the text directly from the PDF's internal structure. This is fast, accurate, and preserves the original text exactly as written.
OCR Extraction (Scanned PDFs)
Scanned PDFs contain images of pages, not actual text. Extracting text requires OCR (Optical Character Recognition) to analyze the images and identify characters. OCR is slower, and accuracy depends on scan quality, resolution, and font clarity.
Quick test: Open your PDF and try to select text with your mouse. If individual words highlight, it is a native PDF (direct extraction). If the entire page selects as one image, it is a scanned PDF (needs OCR).
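This quick test can also be automated: run direct extraction first, and fall back to OCR only when it returns little or no text. A minimal sketch (the function name and character threshold are illustrative, not from any particular library):

```python
def is_probably_scanned(extracted_text: str, min_chars: int = 25) -> bool:
    """Heuristic: if direct extraction yields almost no text,
    the page is likely a scanned image that needs OCR."""
    return len(extracted_text.strip()) < min_chars

# A native PDF page yields real text; a scanned page yields nothing.
print(is_probably_scanned("Quarterly Report\nRevenue grew 12% year over year."))  # False
print(is_probably_scanned(""))                                                    # True
```

Pages with only a few stray characters (page numbers, stamps) also trip this check, which is usually the right call: such pages benefit from OCR anyway.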
What Is Preserved (and What Is Lost)
Plain text (.txt) is the simplest document format — just characters and line breaks. When converting PDF to text, you gain universal compatibility but lose visual formatting:
| Preserved | Lost or Degraded |
|---|---|
| Text content (words, numbers) | Fonts and font sizes |
| Paragraph breaks | Bold, italic, underline styling |
| Basic line structure | Colors and highlighting |
| Page order | Images, charts, and graphics |
| Special characters (UTF-8) | Tables (structure lost, content kept) |
| Numbering (as text) | Headers and footers (mixed inline) |
Handling Multi-Column Layouts
Multi-column documents (academic papers, newspapers, newsletters) present a challenge for text extraction. The extractor must determine the reading order — should it read across both columns or down one column then the next?
Most extractors read content in the correct column order (left column first, then right column). However, elements that span both columns (titles, headers, footnotes) may appear in unexpected positions in the text output.
Tips for column handling:
- Review the output for scrambled reading order, especially at column boundaries.
- Headers spanning multiple columns usually extract correctly at the top of the text.
- Footnotes may appear mid-text rather than at the bottom, since they sit at the bottom of a column.
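The column-ordering decision above can be sketched with positioned text blocks. This assumes the extractor exposes each block's x/y coordinates, that y increases down the page (raw PDF coordinates usually increase upward), and a hypothetical fixed x threshold separating the two columns:

```python
# Each block: (x, y, text), with y increasing downward (an assumption;
# raw PDF coordinates usually increase from the bottom of the page).
blocks = [
    (300, 10, "right col, para 1"),
    (50, 10, "left col, para 1"),
    (50, 200, "left col, para 2"),
    (300, 200, "right col, para 2"),
]

def reading_order(blocks, column_split=200):
    """Assign each block to a column by its x position, then read
    each column top to bottom, left column first."""
    left = sorted((b for b in blocks if b[0] < column_split), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= column_split), key=lambda b: b[1])
    return [b[2] for b in left + right]

print(reading_order(blocks))
# ['left col, para 1', 'left col, para 2', 'right col, para 1', 'right col, para 2']
```

Real extractors infer the column split from the layout rather than taking a fixed value, which is exactly where full-width titles and footnotes can end up assigned to the wrong column.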
Tables in Plain Text
Tables lose their visual structure when converted to plain text. Cell content is preserved, but the grid layout disappears. Typical approaches include:
- Space-aligned columns: Cell content is padded with spaces to maintain visual column alignment. Works for simple tables with short cell content.
- Tab-separated: Cells are separated by tab characters, which can be imported into spreadsheet software.
- Sequential text: Cell content is output sequentially, row by row, with minimal structure markers.
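The first two approaches can be sketched from a table already extracted as rows of cells (the sample rows are illustrative):

```python
rows = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Gadget", "10", "14.50"],
]

# Tab-separated: one line per row, cells joined by tab characters.
tsv = "\n".join("\t".join(row) for row in rows)

# Space-aligned: pad each cell to its column's widest entry.
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
aligned = "\n".join(
    "  ".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows
)
print(aligned)
```

Space alignment breaks down as soon as a cell wraps or the reader uses a proportional font, which is why tab-separated output is the safer choice when the text will be re-imported elsewhere.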
For structured table data, consider converting to CSV or XLSX format instead of plain text, as these formats preserve the tabular structure.
Character Encoding
Character encoding determines how text characters are stored as bytes in the output file. The most important encoding options:
- UTF-8: The universal standard. Supports virtually every language and symbol, including Chinese, Arabic, Cyrillic, emoji, and mathematical symbols. This is the recommended encoding for almost all use cases.
- ASCII: Limited to 128 characters (basic English letters, numbers, punctuation). Non-ASCII characters are lost or replaced with question marks. Only use for legacy systems that cannot handle UTF-8.
- Latin-1 (ISO 8859-1): Supports Western European languages. Limited compared to UTF-8 but compatible with some older systems.
Recommendation: Always use UTF-8 encoding unless you have a specific reason not to. It handles every language and is the default for modern text processing tools, programming languages, and databases.
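The practical difference between these encodings is easy to demonstrate (Python shown; the sample string is illustrative):

```python
text = "Café, π ≈ 3.14159, résumé"

# UTF-8 is lossless: every character survives a round trip.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes.decode("utf-8") == text

# ASCII is lossy: non-ASCII characters are replaced with "?".
ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes.decode("ascii"))  # Caf?, ? ? 3.14159, r?sum?
```

Latin-1 falls in between: it would keep the accented characters here but still drop the Greek letter and the approximation sign.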
Common Use Cases for Text Extraction
Search indexing: Extract text from PDF archives to make them searchable. Full-text search engines (Elasticsearch, Solr, Lucene) can index the extracted text for fast document retrieval.
Data mining: Extract structured data from reports, invoices, and forms for analysis. Combine with regex patterns or NLP to identify specific data fields (dates, amounts, names).
NLP processing: Feed extracted text into natural language processing pipelines for sentiment analysis, topic modeling, entity extraction, or text classification.
Accessibility: Convert visual PDFs to plain text for screen readers and assistive technologies, making documents accessible to visually impaired users.
Content migration: Extract text from legacy PDF archives when migrating content to new systems, CMS platforms, or databases.
Plagiarism detection: Extract text from submitted documents for comparison against databases and other submissions.
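As a sketch of the data-mining case, simple regex patterns can pull fields out of extracted text. The patterns and sample invoice below are illustrative; real documents vary widely and need more robust patterns or an NLP pipeline:

```python
import re

invoice_text = """Invoice #2041
Date: 2024-03-15
Total due: $1,249.50"""

# Hypothetical patterns: an ISO-format date and a dollar amount.
date = re.search(r"\d{4}-\d{2}-\d{2}", invoice_text)
amount = re.search(r"\$[\d,]+\.\d{2}", invoice_text)

print(date.group())    # 2024-03-15
print(amount.group())  # $1,249.50
```

Because OCR output can misread characters (e.g. "1" as "l" or "0" as "O"), field extraction from scanned documents usually pairs patterns like these with validation of the captured values.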