Tables vs Plain Text: Why It Matters
Before choosing a method, check what kind of data your PDF contains. The approach depends entirely on the PDF structure:
| PDF Type | What It Contains | Best Method |
|---|---|---|
| Native tables | Text-based PDF with visible table borders and grid lines | Any method — Convertio is fastest |
| Borderless tables | Columns aligned by spacing, no visible grid | Python (pdfplumber) for precision |
| Scanned PDF | Image of a printed page (no selectable text) | Convertio with OCR enabled |
| Mixed content | Tables + paragraphs + headers on the same page | Python for selective extraction |
Quick test: open your PDF and try selecting text with your mouse. If you can highlight individual words, it's a native (text-based) PDF. If the entire page selects as one block, it's a scanned image — you'll need OCR.
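The same quick test can be approximated in code. The sketch below assumes pdfplumber (introduced under Method 2) is installed; the helper names `looks_scanned` and `pdf_looks_scanned` and the `min_chars` threshold are illustrative choices, not part of any library. It treats a PDF as scanned when its pages yield almost no selectable text.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when the pages yield almost no selectable text.

    page_texts: one entry per page, either a string or None.
    min_chars: hypothetical threshold; tune it for your documents.
    """
    total = sum(len((text or "").strip()) for text in page_texts)
    return total < min_chars


def pdf_looks_scanned(path, min_chars=20):
    """Open a PDF with pdfplumber and apply the check above."""
    import pdfplumber  # imported lazily so looks_scanned() has no dependencies

    with pdfplumber.open(path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    return looks_scanned(texts, min_chars)


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py file1.pdf file2.pdf
    for pdf_path in sys.argv[1:]:
        scanned = pdf_looks_scanned(pdf_path)
        kind = "scanned (needs OCR)" if scanned else "native (text-based)"
        print(f"{pdf_path}: {kind}")
```

A PDF that mixes scanned and native pages will pass this check as native, so it may be worth inspecting individual pages when results look incomplete.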
Method 1: Convert Online with Convertio
The fastest option for most users. Convertio handles native PDFs, borderless tables, and even scanned documents with OCR. No installation, no account required.
- Go to convertio.com/pdf-to-csv
- Upload your PDF — drag and drop, or click "Choose PDF File". Max 100 MB.
- For scanned PDFs: select your OCR language from the dropdown before converting.
- Click "Convert to CSV" — conversion takes a few seconds for most files.
- Download the CSV — open it in Excel, Google Sheets, or import into your database.
Convertio processes all pages of your PDF and combines extracted data into a single CSV file. Files are encrypted during transfer and auto-deleted within 2 hours.
Method 2: Python with pdfplumber
pdfplumber is a strong choice among Python libraries for extracting tables from PDFs. It handles both bordered and borderless tables, exposes the coordinates of every character, and lets you fine-tune extraction parameters.
Install pdfplumber
    pip install pdfplumber
Basic table extraction
This script extracts all tables from every page of a PDF and writes them to a CSV file:
    import csv

    import pdfplumber

    all_rows = []
    with pdfplumber.open("invoice.pdf") as pdf:
        for page in pdf.pages:
            # extract_tables() returns every table found on the page;
            # extract_table() would return only the largest one.
            for table in page.extract_tables():
                all_rows.extend(table)

    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(all_rows)

    print(f"Extracted {len(all_rows)} rows to output.csv")
Handling borderless tables
When tables don't have visible borders, pdfplumber can still detect columns using character positions. Use extract_table() with custom settings:
    import pdfplumber

    # For PDFs with no visible table borders: infer rows and columns
    # from the positions of the text itself.
    table_settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_y_tolerance": 5,
        "intersection_x_tolerance": 15,
    }

    with pdfplumber.open("report.pdf") as pdf:
        page = pdf.pages[0]
        # extract_table() returns None when no table is detected
        table = page.extract_table(table_settings) or []
        for row in table:
            print(row)
Batch convert multiple PDFs
    import csv
    from pathlib import Path

    import pdfplumber

    # Write one CSV next to each PDF in the ./invoices folder
    for pdf_file in Path("./invoices").glob("*.pdf"):
        csv_path = pdf_file.with_suffix(".csv")
        rows = []
        with pdfplumber.open(pdf_file) as pdf:
            for page in pdf.pages:
                # extract_table() returns the largest table on the page, or None
                table = page.extract_table()
                if table:
                    rows.extend(table)
        with open(csv_path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        print(f"{pdf_file.name} -> {csv_path.name} ({len(rows)} rows)")
Method 3: Microsoft Excel (Get Data)
Microsoft 365 (Excel for 365) can import PDF files directly using the Power Query / Get Data feature. This option is not available in standalone Excel 2016 or 2019 — it requires an active Microsoft 365 subscription. It works well for simple, well-structured tables.
- Open Excel and create a new blank workbook.
- Go to Data → Get Data → From File → From PDF.
- Select your PDF from the file browser.
- Choose the table(s) you want to import from the Navigator panel. Excel will show a preview of each detected table.
- Click "Load" to import the data into your worksheet.
- Save as CSV: File → Save As → choose "CSV (Comma delimited) (*.csv)" as the format.
Limitation: Excel's PDF import works best with simple, bordered tables. It struggles with multi-column layouts, merged cells, and borderless tables. For complex PDFs, use Convertio or Python instead.
Method 4: Google Sheets
Google Sheets doesn't import PDFs directly, but you can use Google Drive's built-in OCR to extract the text first, then copy it into Sheets.
- Upload the PDF to Google Drive.
- Right-click the PDF → Open with → Google Docs. Google will OCR the file and convert it to an editable document.
- Select the table data in the Google Doc and copy it (Ctrl+C / Cmd+C).
- Open a new Google Sheet and paste (Ctrl+V / Cmd+V). The data will fill into cells.
- Clean up the data — adjust column widths, remove extra rows, fix any OCR errors.
- Download as CSV: File → Download → Comma Separated Values (.csv).
Tip: Google's OCR works surprisingly well for scanned PDFs. But the table structure may not survive the copy-paste step intact. For better results with tabular data, use Convertio's direct PDF to CSV converter.
Method Comparison
| Feature | Convertio | Python | Excel | Google Sheets |
|---|---|---|---|---|
| Difficulty | Easy | Advanced | Medium | Easy |
| Installation | None (browser) | Python + pip | Microsoft 365 | None (browser) |
| Bordered tables | Excellent | Excellent | Good | Fair |
| Borderless tables | Good | Excellent | Poor | Poor |
| Scanned PDFs (OCR) | Built-in | With pytesseract | Not supported | Via Google Drive |
| Batch processing | One file at a time | Unlimited | One file at a time | One file at a time |
| Best for | Quick one-off conversions | Automation & complex PDFs | Excel users with simple tables | Quick extraction with OCR |
Tips for Clean CSV Output
- Check the header row. Some PDFs have multi-line headers that get split into separate CSV rows. After conversion, verify that your column headers are on a single row.
- Watch for merged cells. PDF tables often merge cells for group headings. These usually become empty cells in CSV. Fill them manually or with a script after extraction.
- Handle special characters. Commas, quotes, and line breaks inside cell values can break CSV parsing. Good converters (Convertio, pdfplumber) handle escaping automatically. If yours doesn't, wrap values in double quotes.
- Encoding matters. Use UTF-8 encoding when saving CSV to preserve accented characters, currency symbols, and non-Latin text. In Python: open("out.csv", "w", encoding="utf-8-sig") (the -sig adds a BOM that helps Excel detect UTF-8).
- Multi-page tables. When a table spans multiple PDF pages, some tools extract each page as a separate table. In Python, skip the header row on subsequent pages to avoid duplicates.
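The merged-cell, special-character, encoding, and multi-page tips above can be sketched as a small clean-up step that runs on extracted rows before they are written. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes merged group headings sit in a single column (the first by default) and that repeated headers are exact copies of the first row.

```python
import csv


def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later identical rows."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


def fill_merged_cells(rows, column=0):
    """Copy the last seen value down into empty cells of one column."""
    filled, last = [], ""
    for row in rows:
        row = list(row)  # copy so the caller's rows are not modified
        if column < len(row):
            if row[column] in (None, ""):
                row[column] = last
            else:
                last = row[column]
        filled.append(row)
    return filled


def write_csv(rows, path):
    """Write rows as UTF-8 with a BOM; csv.writer quotes commas and quotes."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)


def clean_rows(rows, column=0):
    """Drop repeated headers first, then fill merged cells in the data rows."""
    deduped = drop_repeated_headers(rows)
    if not deduped:
        return []
    header, *data = deduped
    return [header] + fill_merged_cells(data, column)
```

Dropping repeated headers before filling matters: otherwise a repeated header value could be copied down into the data rows that follow it.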
Common Issues and Fixes
| Problem | Cause | Solution |
|---|---|---|
| Empty CSV output | Scanned PDF (image-based) | Enable OCR in Convertio or use pytesseract |
| All data in one column | Excel opened CSV with wrong delimiter | Use Data → Text to Columns → Delimited → Comma |
| Misaligned columns | Borderless table with uneven spacing | Use pdfplumber with vertical_strategy: "text" |
| Garbled characters | Wrong encoding (usually Latin-1 vs UTF-8) | Open in text editor, save as UTF-8 |
| Duplicate headers | Multi-page table with repeated headers | In Python, skip row 0 on pages after the first |
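For the garbled-characters row, the "open in a text editor, save as UTF-8" fix can also be scripted. The sketch below is one possible approach, not a general encoding detector: it tries UTF-8 first and otherwise assumes the fallback encoding (Latin-1 by default). The function name `reencode_to_utf8` is hypothetical. Latin-1 decoding never fails, so a file in some other legacy encoding would be converted without an error but could still contain wrong characters.

```python
def reencode_to_utf8(src_path, dst_path, fallback="latin-1"):
    """Rewrite a CSV as UTF-8 with a BOM and return the encoding that was read.

    Tries UTF-8 first; if the bytes are not valid UTF-8, decodes them
    with the fallback encoding instead (an assumption about the source file).
    """
    with open(src_path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8-sig")
        used = "utf-8"
    except UnicodeDecodeError:
        text = raw.decode(fallback)
        used = fallback
    # newline="" keeps the original line endings untouched
    with open(dst_path, "w", newline="", encoding="utf-8-sig") as f:
        f.write(text)
    return used
```

Writing to a separate destination path keeps the original file intact in case the fallback guess turns out to be wrong.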