Apache Tika is a library for detecting document types and extracting text and metadata from a wide range of file formats. Tika provides a single generic API that acts as a universal type detector and content extractor. For more information about Tika, please check the Tika website. The KNIME Text Processing extension now integrates Tika, which makes importing documents of almost any type much easier.
As an example, let’s try to read the book “Pride and Prejudice” by Jane Austen. The book can be downloaded for free from the Project Gutenberg web site in HTML, pdf, or epub format. We chose epub, but thanks to the new Tika nodes we could just as easily have read the pdf or HTML version.
Indeed, the list of document formats supported by the Tika library is very long, as you can see from the size of the scroll bar cursor in the Exclude frame of the configuration dialog of the Tika Parser node, which we used to read the downloaded epub document of the book “Pride and Prejudice”. Supported formats include: epub, pdf, docx, pptx, fit, html, ibooks, jar, midi, mime, mp3, mp4, odc, p7*, png, tiff, and many more.
Figure 1. Configuration Window of the Tika Parser node. Notice the long list of supported formats in the Exclude frame. We chose to read the epub format.
The resulting word cloud is displayed in figure 2, where you can see Elizabeth at the center, as you would expect for the book’s main character.
The workflow used to import the book document and to build the word cloud is displayed in figure 3.
Figure 2. Word cloud built on the book “Pride and Prejudice” by Jane Austen. Notice the central position of the main character.
Figure 3. KNIME workflow to create the word cloud from the book “Pride and Prejudice”. Notice the Tika Parser node at the beginning to read the epub document.
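For readers who want to see the idea behind the word cloud outside KNIME, here is a minimal Python sketch. It is not the workflow from figure 3. It assumes that word size in the cloud is driven by how often each term occurs in the text, and that the text has already been extracted from the document. The function name term_frequencies, the tiny stop-word list, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real text-processing pipeline would use a much fuller one.
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "was", "her", "she", "it", "that", "is", "not"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word occurs, ignoring case and stop words."""
    # Lowercase the text and keep runs of letters (and apostrophes) as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


# Made-up sample text standing in for the extracted book content.
sample = "Elizabeth smiled. Mr. Darcy looked at Elizabeth, and Elizabeth looked away."
counts = term_frequencies(sample)
print(counts.most_common(2))  # [('elizabeth', 3), ('looked', 2)]
```

The most frequent terms would then be drawn largest, which fits with a frequently named main character such as Elizabeth dominating the cloud.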