Converting Word documents to clean HTML with Apache POI, stripping out styles and superfluous tags

I am currently working on converting Word documents to clean HTML. I have been using Apache POI, but it seems to create messy output similar to MS Word's own HTML saving method. What I really need is a solution like the one offered by . For instance, when converting a table, I prefer not to include any width properties or unnecessary elements, just simple <td> and <tr> tags with maybe some <b> formatting.

Can anyone suggest a better approach to achieve this? I am open to exploring alternative Java APIs for Word to HTML conversion, as I am not bound to using Apache POI.

Answer №1

Promoting a comment to an answer: a good tool for this task is Apache Tika. It uses Apache POI under the hood to read Word files and is designed to generate clean, semantically meaningful XHTML output.

To parse a document to XHTML with Tika, you can follow the examples in the Apache Tika documentation. Here is one such snippet:

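import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ContentHandlerExample {
  // Parses test.doc from the classpath and returns the XHTML that Tika generates for it.
  public String parseToHTML() throws IOException, SAXException, TikaException {
    ContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
      parser.parse(stream, handler, metadata);
      return handler.toString();
    }
  }
}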

If you just want to try it out, you can use the Tika App command line tool: run it with the --xml option (short form -x, which outputs XHTML) and your file, and you should get back simple, clean XHTML.
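
If the XHTML you get back still carries attributes you would rather not keep (for example class or width), one possible post-processing step is a SAX filter that drops every attribute before the markup is serialized again. The following is only a sketch built on plain JDK classes: the names AttributeStripper and strip are made up for illustration and are not part of Tika or POI, and it assumes the input is well-formed XHTML. Note that it also removes attributes you may want to keep, such as href on links, so you would likely need to whitelist those.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper (not a Tika or POI class): removes all attributes from every element.
class AttributeStripper extends XMLFilterImpl {

  AttributeStripper(XMLReader parent) {
    super(parent);
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException {
    // Forward the element unchanged, but with an empty attribute list.
    super.startElement(uri, localName, qName, new AttributesImpl());
  }

  // Parses well-formed XHTML and returns it re-serialized without attributes.
  static String strip(String xhtml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader reader = factory.newSAXParser().getXMLReader();

    // An identity transform copies the filtered SAX events to the output unchanged.
    Transformer identity = TransformerFactory.newInstance().newTransformer();
    identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

    StringWriter out = new StringWriter();
    identity.transform(
        new SAXSource(new AttributeStripper(reader), new InputSource(new StringReader(xhtml))),
        new StreamResult(out));
    return out.toString();
  }
}
```

You could then pass the string returned by parseToHTML() through AttributeStripper.strip(...) before using it.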
