Parsel parses PDFs, Word documents, spreadsheets, presentations, and images through a single fluent PHP API. Built by Pushpak Chhajed, it wraps the liteparse lit CLI and performs all work locally, so files are never sent to an external service. It requires PHP 8.4, and Parsel includes a command to install the lit binary.
A single API across file types
The same fluent API supports all formats. Call text() for plain text or parse() for a structured Document object:
use Shipfastlabs\Parsel;

$text = Parsel::file('contract.pdf')->text();

$document = Parsel::file('amendment.docx')->parse();
The returned Document exposes the full text, document metadata, a pageCount(), and a pages collection. Each page carries its number, width, height, text, and a list of positioned items.
Positioned text with coordinates
Every text item includes its location and font information, which is useful for reading tables, finding signature blocks, or mapping clauses back to their position on the page. Each item has text, x, y, width, height, fontName, fontSize, and a confidence score:
$document = Parsel::file('contract.pdf')->parse();

foreach ($document->pages as $page) {
    foreach ($page->items as $item) {
        echo "{$item->text} @ ({$item->x}, {$item->y})\n";
    }
}
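To illustrate how the coordinates help with tables, here is a minimal sketch that groups items into visual rows by their y position. The groupItemsIntoRows() helper and its tolerance default are hypothetical and not part of Parsel. It only assumes each item exposes public text, x, and y properties as described above. Whether ascending y means top-to-bottom depends on the coordinate origin that lit uses, so treat the row order as an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Group positioned items into rows of cell text.
 *
 * Hypothetical helper, not part of Parsel. Items whose y values lie within
 * $tolerance of the first item in a row are treated as the same row.
 *
 * @param  iterable<object>  $items  objects with public text, x, and y
 * @return list<list<string>>        rows by ascending y, cells by ascending x
 */
function groupItemsIntoRows(iterable $items, float $tolerance = 2.0): array
{
    $sorted = is_array($items) ? array_values($items) : iterator_to_array($items, false);

    // Order by vertical position so items on the same row become neighbours.
    usort($sorted, fn (object $a, object $b): int => $a->y <=> $b->y);

    $rows = [];
    $rowY = null;

    foreach ($sorted as $item) {
        // Start a new row when the item is too far from the current row's anchor.
        if ($rowY === null || abs($item->y - $rowY) > $tolerance) {
            $rows[] = [];
            $rowY = $item->y;
        }

        $rows[array_key_last($rows)][] = $item;
    }

    // Within each row, order cells left to right and keep only their text.
    return array_map(function (array $row): array {
        usort($row, fn (object $a, object $b): int => $a->x <=> $b->x);

        return array_map(fn (object $item): string => $item->text, $row);
    }, $rows);
}
```

With a parsed document, one would pass a page's items to the helper, for example groupItemsIntoRows($page->items). The right tolerance depends on the units and font sizes in the file, so it is worth tuning per document type.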
Page selection and streaming
You can limit work to specific pages with page(), pages(), pageRange(), or maxPages(). page() takes a single page number, pages() accepts a list of numbers or range strings, and pageRange() takes a start and end:
// Just the cover page
$summary = Parsel::file('contract.pdf')->page(1)->text();

// A mix of individual pages and a range
$selected = Parsel::file('contract.pdf')->pages('1-3', 12)->parse();

// A continuous range of clauses
$body = Parsel::file('contract.pdf')->pageRange(2, 6)->parse();
These calls are additive, so you can combine them before parsing, and maxPages() caps the total number of pages processed:
$document = Parsel::file('contract.pdf')
    ->pageRange(1, 5)
    ->page(12)
    ->maxPages(20)
    ->parse();
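As a mental model for "additive" selection, the sketch below expands a mix of page numbers and range strings into one sorted, de-duplicated list. The expandPageSelectors() function is a hypothetical illustration and does not show how Parsel or lit resolve selections internally. The accepted range syntax ('start-end') is inferred from the '1-3' example above.

```php
<?php

declare(strict_types=1);

/**
 * Expand selectors such as 12 or '1-3' into a sorted list of unique pages.
 *
 * Hypothetical illustration only; Parsel's own handling may differ.
 *
 * @return list<int>
 */
function expandPageSelectors(int|string ...$selectors): array
{
    $pages = [];

    foreach ($selectors as $selector) {
        // A bare page number, given as an int or a numeric string.
        if (is_int($selector) || ctype_digit($selector)) {
            $pages[] = (int) $selector;
            continue;
        }

        // Otherwise expect an ascending 'start-end' range.
        if (preg_match('/^(\d+)-(\d+)$/', $selector, $m) !== 1 || (int) $m[1] > (int) $m[2]) {
            throw new InvalidArgumentException("Invalid page selector: {$selector}");
        }

        array_push($pages, ...range((int) $m[1], (int) $m[2]));
    }

    // Overlapping selections collapse into one entry per page.
    $pages = array_values(array_unique($pages));
    sort($pages);

    return $pages;
}

print_r(expandPageSelectors('1-5', 12)); // pages 1, 2, 3, 4, 5, 12
```

Under this model, combining pageRange(1, 5) with page(12) selects six pages, which is well below the cap of 20 set by maxPages() in the example above.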
For large files, lazyPages() processes one page at a time to keep memory use flat:
foreach (Parsel::file('contract.pdf')->lazyPages() as $page) {
    // handle one page at a time
}
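One practical benefit of lazy iteration is that a loop can stop early. The sketch below returns the first page that mentions a keyword. The firstPageContaining() helper is hypothetical and works on any iterable of objects with public number and text properties. It assumes, without confirmation from the package docs, that lazyPages() yields pages on demand, in which case pages after the match would never be processed.

```php
<?php

declare(strict_types=1);

/**
 * Return the number of the first page whose text contains $needle.
 *
 * Hypothetical helper, not part of Parsel. The search is case-insensitive.
 *
 * @param  iterable<object>  $pages  objects with public number and text
 */
function firstPageContaining(iterable $pages, string $needle): ?int
{
    foreach ($pages as $page) {
        if (stripos($page->text, $needle) !== false) {
            // Returning here ends iteration; a lazy source produces nothing further.
            return $page->number;
        }
    }

    return null;
}
```

A call might look like firstPageContaining(Parsel::file('contract.pdf')->lazyPages(), 'signature'), which would return null when no page matches.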
OCR for scanned documents
OCR is off by default. Turn it on with withOcr() and pass named arguments for the language, tessdata path, an OCR server URL, and worker count:
$text = Parsel::file('signed-contract.png')
    ->withOcr(
        language: 'eng',
        tessdataPath: '/usr/share/tessdata',
        serverUrl: 'http://localhost:8828/ocr',
        workers: 8,
    )
    ->text();
Rendering page previews
Turn pages into image files with screenshots(), passing an output directory:
$screenshots = Parsel::file('contract.pdf')->screenshots(storage_path('previews'));
Passwords and rendering options
Encrypted files open with withPassword(), and a few chainable methods adjust how lit renders a document before parsing. withDpi() raises the render resolution, which sharpens both OCR and screenshots; preserveSmallText() keeps fine print such as footnotes from being dropped; and withTimeout() sets a per-file time limit so a large document cannot stall a request:
$document = Parsel::file('contract.pdf')
    ->withPassword('hunter2')
    ->withDpi(300)
    ->preserveSmallText()
    ->withTimeout(120)
    ->parse();
Testing without the binary
Parsel ships a fake runner so tests don't have to shell out to the real lit binary. You map command fragments to canned output and assert on the commands that were recorded:
$fake = Parsel::fake([
    '--format json' => file_get_contents(__DIR__.'/fixtures/contract.json'),
]);

$document = Parsel::file('contract.pdf')->parse();

expect($fake->recordedCommands()[0])->toContain('--format', 'json');
Installation
Install the package with Composer:
composer require shipfastlabs/parsel
Then install the lit binary:
vendor/bin/parsel-install-lit
For Office documents and images, you can pull in the additional system dependencies:
vendor/bin/parsel-install-lit --with-system-dependencies
If you'd rather manage lit yourself, for example, in a CI image that already has it, the binary is also available through common package managers (npm, pnpm, bun, pip, cargo) and can be installed independently.
You can view the source and full documentation on GitHub.







