I have previously reviewed two tools for extracting text from popular formats: Text Mining Tool and OCR Terminal. These tools let you extract text from various image formats, PDF, HTML, and so on. If you are looking for a broader utility that can extract text from more formats, then teXtracta will come in handy.
It is a tool that works on the principle of IFilter, a COM interface developed by Microsoft for its indexing service so that it can index files of various formats. These indexed files are then used by Windows 7/Vista Search, Windows Desktop Search, and so on. You must have the appropriate IFilters installed on your computer before teXtracta can extract text from the corresponding formats. To install the appropriate IFilters, go here.
In this article I will use a PDF document as an example of how to extract text. First download the appropriate IFilter from the link given above, then grab teXtracta from the link given at the end of this article. Now load up the tool and select the single file that you want to process. You can also select a folder, in which case all files inside that folder will be processed. Next, check the desired options, such as Show Text, Save Text, and Include Subdirectories.
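To make the effect of the Include Subdirectories option concrete, here is a minimal Python sketch of how a file or folder selection could expand into a list of files to process. This is not teXtracta's own code: the function name collect_files and its parameter are hypothetical, and the sketch only assumes that the option switches between looking at the top level of the folder and walking the whole tree.

```python
from pathlib import Path


def collect_files(selection, include_subdirectories=False):
    """Return the files a selection would cover (hypothetical helper).

    A single file yields just that file. A folder yields its direct
    children, or every file beneath it when include_subdirectories is True.
    """
    path = Path(selection)
    if path.is_file():
        return [path]
    # rglob("*") walks the whole tree; glob("*") stays at the top level.
    candidates = path.rglob("*") if include_subdirectories else path.glob("*")
    return sorted(p for p in candidates if p.is_file())
```

With the option off, a PDF sitting in a subfolder of the selected folder would be skipped; with it on, that file would be picked up as well.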
Finally, choose the filters; I have chosen the PDF IFilter, as shown in the screenshot below.
Once you select a file or folder, options such as Start Processing, Pause Processing, and Stop Processing are enabled automatically.
Now hit the Start Processing button to begin the text extraction process. If you do not have the proper IFilter installed, the tool will notify you immediately; otherwise the process will go smoothly. Note that the time taken depends largely on the file that you are converting.
If the Save Text option is enabled, the output will be saved in txt format in the same directory as the source file or folder.
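The save location can be sketched in a few lines of Python. The tool is only described here as writing a txt file next to the source; the exact file name it uses is not stated, so swapping the extension for .txt is an assumption, and the function name output_path_for is hypothetical.

```python
from pathlib import PureWindowsPath


def output_path_for(source):
    """Guess where the extracted text is saved (hypothetical helper).

    Assumption: same directory as the source file, with the extension
    replaced by .txt. The real naming scheme may differ.
    """
    # PureWindowsPath parses Windows-style paths on any operating system.
    return PureWindowsPath(source).with_suffix(".txt")
```

Under that assumption, a source file at C:\Docs\report.pdf would produce its text at C:\Docs\report.txt.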
teXtracta works on Windows 2000, Windows XP, Windows Vista, and Windows 7. Enjoy!