PDF parsers are very useful tools for data researchers, scientists, and journalists. A lot of data is available online today but locked inside PDF files. PDF Parser is a simple PHP library that parses PDF files and extracts elements such as text. It would be great if it could also parse tables; since the library is under active development, we hope this will be added to its to-do list, along with support for secured PDF documents. Extracting data from PDF tables is a very hard task, and at present only one open source tool can do it correctly, and even then not in an automated way.

Some features of PDF Parser:

  • Load and parse objects and headers
  • Extract metadata (author, description, keywords, …)
  • Extract text from ordered pages
  • Support for both compressed and uncompressed PDFs
  • Support for charset encodings (WinAnsi, MacRoman)
  • Handling of hexadecimal and octal content encoding
  • PSR-0 compliant (autoloader)
  • Compatible with Composer
  • PSR-1 compliant 
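
The first three features above (parsing a file, reading its metadata, extracting text page by page) can be sketched in a few lines. This is a minimal sketch, not official documentation: it assumes the library is the one published on Packagist as smalot/pdfparser and installed through Composer, that a file named document.pdf sits next to the script, and readPdf is a hypothetical helper name chosen for this example.

```php
<?php
// Assumption: the library was installed with `composer require smalot/pdfparser`,
// so Composer's autoloader is available in ./vendor.
require __DIR__ . '/vendor/autoload.php';

/**
 * Hypothetical helper: parse one PDF file and return its metadata
 * together with the text of each page, keyed by page number.
 */
function readPdf(string $path): array
{
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($path);      // load and parse objects and headers

    $pages = [];
    $number = 1;
    foreach ($pdf->getPages() as $page) {  // pages come back in document order
        $pages[$number++] = $page->getText();
    }

    return [
        'details' => $pdf->getDetails(),   // metadata such as author or keywords
        'pages'   => $pages,
    ];
}

// 'document.pdf' is a placeholder path for this example.
$result = readPdf('document.pdf');

foreach ($result['details'] as $property => $value) {
    // Some metadata values are arrays, so flatten them for display.
    echo $property, ': ', is_array($value) ? implode(', ', $value) : $value, PHP_EOL;
}

foreach ($result['pages'] as $number => $text) {
    echo "--- Page $number ---", PHP_EOL, $text, PHP_EOL;
}
```

If only the full text is needed, calling getText() on the parsed document instead of on each page should return the text of all pages at once.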

Documentation is available online, and the library is released under the LGPL-v3 license. For more information, see https://www.pdfparser.org/
