Parsel: Parse PDFs, Office Documents, and Images in PHP

By Yannick Lyn Fatt

Parsel parses PDFs, Word documents, spreadsheets, presentations, and images through a single fluent PHP API. Built by Pushpak Chhajed, it wraps the liteparse lit CLI and performs all work locally, so files are never sent to an external service. It requires PHP 8.4, and the package includes a command to install the lit binary.

A single API across file types

The same fluent API supports all formats. Call text() for plain text or parse() for a structured Document object:

use Shipfastlabs\Parsel;

$text = Parsel::file('contract.pdf')->text();
$document = Parsel::file('amendment.docx')->parse();

The returned Document exposes the full text, document metadata, a pageCount(), and a pages collection. Each page carries its number, width, height, text, and a list of positioned items.

Positioned text with coordinates

Every text item includes its location and font information, which is useful for reading tables, finding signature blocks, or mapping clauses back to their position on the page. Each item has text, x, y, width, height, fontName, fontSize, and a confidence score:

$document = Parsel::file('contract.pdf')->parse();
foreach ($document->pages as $page) {
    foreach ($page->items as $item) {
        echo "{$item->text} @ ({$item->x}, {$item->y})\n";
    }
}
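
To illustrate why the coordinates help when reading tables, here is a minimal sketch that groups positioned items into visual rows by their y value. It uses plain arrays standing in for Parsel items so it runs without the library. The groupIntoRows() helper, the sample data, and the 2-unit tolerance are assumptions for illustration and are not part of Parsel.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Parsel): group positioned items into rows.
 * An item joins the current row when its y value is within $tolerance of the
 * first item placed in that row; otherwise it starts a new row.
 *
 * @param list<array{text: string, x: float, y: float}> $items
 * @return list<list<string>>
 */
function groupIntoRows(array $items, float $tolerance = 2.0): array
{
    // Order items by vertical position so neighbours on a row are adjacent.
    usort($items, fn (array $a, array $b): int => $a['y'] <=> $b['y']);

    $rows = [];
    $rowY = null;
    foreach ($items as $item) {
        if ($rowY === null || abs($item['y'] - $rowY) > $tolerance) {
            $rows[] = [];
            $rowY = $item['y'];
        }
        $rows[array_key_last($rows)][] = $item;
    }

    // Within each row, order by horizontal position and keep only the text.
    return array_map(function (array $row): array {
        usort($row, fn (array $a, array $b): int => $a['x'] <=> $b['x']);

        return array_column($row, 'text');
    }, $rows);
}

// Sample data standing in for $page->items (assumed shape).
$items = [
    ['text' => 'Total', 'x' => 40.0, 'y' => 120.5],
    ['text' => 'Item', 'x' => 40.0, 'y' => 100.0],
    ['text' => '$90', 'x' => 300.0, 'y' => 120.0],
    ['text' => 'Price', 'x' => 300.0, 'y' => 100.4],
];

$rows = groupIntoRows($items);
foreach ($rows as $row) {
    echo implode(' | ', $row), "\n";
}
```

With real Parsel output you would likely map each item's text, x, and y properties into this array shape first, and tune the tolerance to the document's font size.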

Page selection and streaming

You can limit work to specific pages with page(), pages(), pageRange(), or maxPages(). page() takes a single page number, pages() accepts a list of numbers or range strings, and pageRange() takes a start and end:

// Just the cover page
$summary = Parsel::file('contract.pdf')->page(1)->text();
// A mix of individual pages and a range
$selected = Parsel::file('contract.pdf')->pages('1-3', 12)->parse();
// A continuous range of clauses
$body = Parsel::file('contract.pdf')->pageRange(2, 6)->parse();

These calls are additive, so you can combine them before parsing, and maxPages() caps the total number of pages processed:

$document = Parsel::file('contract.pdf')
    ->pageRange(1, 5)
    ->page(12)
    ->maxPages(20)
    ->parse();
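
As a rough mental model of additive selection, the sketch below expands a mix of page numbers and range strings into one sorted, de-duplicated list and then applies a cap. The expandPages() helper is hypothetical and is not how Parsel is implemented. In particular, it assumes that the cap keeps the lowest-numbered pages, which the package may handle differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not Parsel's implementation): expand page numbers and
 * "start-end" range strings into a sorted list of unique pages, optionally
 * capped at $maxPages entries.
 *
 * @param list<int|string> $selections
 * @return list<int>
 */
function expandPages(array $selections, ?int $maxPages = null): array
{
    $pages = [];
    foreach ($selections as $selection) {
        if (is_int($selection)) {
            $pages[] = $selection;
            continue;
        }
        if (preg_match('/^(\d+)-(\d+)$/', $selection, $m) !== 1) {
            throw new InvalidArgumentException("Invalid page selection: {$selection}");
        }
        // range() includes both endpoints.
        array_push($pages, ...range((int) $m[1], (int) $m[2]));
    }

    // Overlapping selections collapse into a single entry per page.
    $pages = array_unique($pages);
    sort($pages);

    // Assumption: the cap keeps the first $maxPages pages in ascending order.
    return $maxPages === null ? $pages : array_slice($pages, 0, $maxPages);
}

$pages = expandPages(['1-5', 12, '4-6'], maxPages: 20);
echo implode(',', $pages), "\n";
```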

For large files, lazyPages() processes one page at a time to keep memory use flat:

foreach (Parsel::file('contract.pdf')->lazyPages() as $page) {
// handle one page at a time
}
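
The usual way to get this kind of flat memory profile in PHP is a generator, which hands back one value per iteration instead of building the whole list up front. The sketch below shows that pattern with a hypothetical lazyPagesSketch() function and fake page data. Whether Parsel uses a generator internally is an assumption here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a lazy page source (not Parsel's implementation):
 * a generator that produces one page per iteration.
 *
 * @return Generator<int, array{number: int, text: string}>
 */
function lazyPagesSketch(int $pageCount): Generator
{
    for ($number = 1; $number <= $pageCount; $number++) {
        // A real parser would do the work for a single page here.
        yield ['number' => $number, 'text' => "Text of page {$number}"];
    }
}

$seen = [];
foreach (lazyPagesSketch(3) as $page) {
    // Only the current page array is produced per iteration.
    $seen[] = $page['number'];
}
echo implode(',', $seen), "\n";
```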

OCR for scanned documents

OCR is off by default. Turn it on with withOcr() and pass named arguments for the language, tessdata path, an OCR server URL, and worker count:

$text = Parsel::file('signed-contract.png')
    ->withOcr(
        language: 'eng',
        tessdataPath: '/usr/share/tessdata',
        serverUrl: 'http://localhost:8828/ocr',
        workers: 8,
    )
    ->text();

Rendering page previews

Turn pages into image files with screenshots(), passing an output directory:

$screenshots = Parsel::file('contract.pdf')->screenshots(storage_path('previews'));

Passwords and rendering options

Encrypted files open with withPassword(), and a few chainable methods adjust how lit renders a document before parsing. withDpi() raises the render resolution, which sharpens both OCR and screenshots; preserveSmallText() keeps fine print such as footnotes from being dropped; and withTimeout() sets a per-file time limit so a large document cannot stall a request:

$document = Parsel::file('contract.pdf')
    ->withPassword('hunter2')
    ->withDpi(300)
    ->preserveSmallText()
    ->withTimeout(120)
    ->parse();

Testing without the binary

Parsel ships a fake runner so tests don't have to shell out to the real lit binary. You map command fragments to canned output and assert on the commands that were recorded:

$fake = Parsel::fake([
    '--format json' => file_get_contents(__DIR__.'/fixtures/contract.json'),
]);
$document = Parsel::file('contract.pdf')->parse();
expect($fake->recordedCommands()[0])->toContain('--format', 'json');

Installation

Install the package with Composer:

composer require shipfastlabs/parsel

Then install the lit binary:

vendor/bin/parsel-install-lit

For Office documents and images, you can pull in the additional system dependencies:

vendor/bin/parsel-install-lit --with-system-dependencies

If you'd rather manage lit yourself, for example, in a CI image that already has it, the binary is also available through common package managers (npm, pnpm, bun, pip, cargo) and can be installed independently.

You can view the source and full documentation on GitHub.

Yannick Lyn Fatt

Contributing writer at Laravel News and full-stack web developer.
