| » Metadata |
» Status |
- Category: File Formats
- Proposer: John Stokes
- License: BSD Style
|
|
| » Description |
PDF Reader extracts raw text from a PDF file and returns it as an array of strings.
I've seen many solutions that output PDF files, but few that read PDF files as a data source. To my knowledge, there are no PHP-native solutions that read PDF files newer than version 1.4.
PDF Reader supports PDF versions up to 1.7, including AcroForms (aka FDF), and is written as a PHP 5 class tree. It returns raw text as an array of strings or form fields as an associative array of key/value pairs.
I have no plans to extract images or layout metadata at this time, nor do I plan to support signed or encrypted PDFs unless there's demand. |
| » Dependencies |
» Links |
- Linux (May work on Windows, but untested)
- PHP >= 5.1
|
|
| » Timeline |
» Changelog |
- First Draft: 2010-07-22
- Proposal: 2010-07-29
|
John Stokes [2010-08-03 23:55 UTC] v0.1.1
- Fixed a bug in which a string without an ET operator would have zero length
- Added support for hexadecimal strings embedded in normal strings
- Fixed a bug in which some line breaks were ignored
- Standardized regular expressions for primitive data types as constants
- Shortened lines and adjusted switch/case indents for PEAR compliance
- Moved extractText routines from PDFobject class to PDFpage class in order to assemble multiple content streams
- Refactored to use a single PDFdecoder instance
v0.1.0
- Initial proposal
- Included basic support for text and form field extraction
- Some known bugs with character mapping for non-standard fonts
John Stokes [2010-08-05 21:51 UTC] v0.1.2
- Fixed known bug with character mapping for non-standard fonts
- Added limited support for text matrices
John Stokes [2010-08-27 19:27 UTC]
- Fixed bug in which Marked Content operators would sometimes appear in form fields
- Removed "exit" and "die" statements for PEAR compliance
- Added error trap for absence of `gzip`
- Implemented package-specific Exceptions
- Name changed to File_PDFreader for PEAR compliance
John Stokes [2010-12-29 21:21 UTC] v0.1.4
- Added support for cascaded filters
- Fixed a bug in which stream object dictionaries may be parsed incorrectly
- Restructured directories for PEAR compliance
- Added package.xml for PEAR installer
John Stokes [2011-01-05 23:00 UTC] v0.1.5
- Added support for StandardEncoding
- Added support for PDFDocEncoding
- Added support for MacExpertEncoding
- Added support for ASCIIHexDecode filter
- Added support for ASCII85Decode filter
John Stokes [2011-02-03 01:47 UTC] v0.1.6
- Fixed a bug in which parent fields without field types threw an exception
- Fixed a bug in which PDF arrays in form labels were not parsed correctly
- Fixed a bug in parsing binary Field Flags
- Brought into E_STRICT compliance
- All output is now UTF-8 encoded (Thanks to Christoph Runkel for this tip)
John Stokes [2011-04-21 19:59 UTC] v0.1.7
- Added a feature to handle nested Page arrays (Thanks again to Christoph Runkel)
- Added a feature to decode hex-encoded form field keys
- Added two new methods to the API: readTextByPage($pageNum), which reads one page at a time so that large PDFs need not be extracted all at once, and getPages(), which returns the total number of pages in the document.
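
As a rough illustration of how the two methods added in v0.1.7 might be combined, the sketch below walks a document one page at a time. Only the names getPages() and readTextByPage($pageNum) come from this changelog. Everything else is an assumption made for the example: that pages are numbered from 1, that readTextByPage() returns an array of strings, and the helper forEachPage(), the callback printPage() and the StubReader class, which are all hypothetical. StubReader stands in for a real reader object so that the snippet runs on its own.

```php
<?php
/**
 * Hypothetical helper: call $callback once per page, so that only one
 * page of text is held in memory at a time.
 *
 * $reader is any object exposing getPages() and readTextByPage($pageNum).
 * $firstPage is the number of the first page (assumed to be 1 here).
 */
function forEachPage($reader, $callback, $firstPage = 1)
{
    $total = $reader->getPages();
    for ($i = 0; $i < $total; $i++) {
        $pageNum = $firstPage + $i;
        // Assumption: readTextByPage() returns an array of strings.
        call_user_func($callback, $pageNum, $reader->readTextByPage($pageNum));
    }
    return $total;
}

/**
 * Hypothetical stand-in for a real reader, used only to make the
 * example runnable without a PDF file.
 */
class StubReader
{
    private $pages = array(
        1 => array('Hello', 'World'),
        2 => array('Page two'),
    );

    public function getPages()
    {
        return count($this->pages);
    }

    public function readTextByPage($pageNum)
    {
        return $this->pages[$pageNum];
    }
}

// Example callback: print each page's lines joined by spaces.
function printPage($pageNum, $lines)
{
    echo "Page $pageNum: ", implode(' ', $lines), "\n";
}

forEachPage(new StubReader(), 'printPage');
```

Passing the reader in as a parameter keeps the sketch independent of how the real class is constructed, which this changelog does not specify.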
|