Data is often locked up in pdf reports, and if we’re lucky it may be in a structured (HTML) or semi-structured form (e.g. email). Extracting data from pdfs requires translating the data into plain text while preserving the structure as much as possible.
One open source tool for extracting text from pdfs is the Apache Project tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we’ll parse the Brady Campaign’s Mass Shootings in the United States Since 2005 document. Looking at the document we can see that it has a structure of location, date and description. Buried in the description is the source in parentheses…….
See on sproke.blogspot.com.br