Converting Word documents to clean HTML with Apache POI, stripping out styles and superfluous tags

I am currently working on converting Word documents to clean HTML. I have been using Apache POI, but it seems to create messy output similar to MS Word's own HTML saving method. What I really need is a solution like the one offered by . For instance, when converting a table, I prefer not to include any width properties or unnecessary elements, just simple <td> and <tr> tags with maybe some <b> formatting.

Can anyone suggest a better approach to achieve this? I am open to exploring alternative Java APIs for Word to HTML conversion, as I am not bound to using Apache POI.

Answer №1

Promoting a comment to an answer: a good tool for this task is Apache Tika, which uses Apache POI under the hood for Word documents and is designed to generate clean, semantically meaningful XHTML output.

To parse a document to XHTML with Apache Tika, see the examples in the Apache Tika documentation. Here is an example code snippet:

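// Classes used here come from java.io, org.xml.sax, and org.apache.tika (exception, metadata, parser, sax packages).
public String parseToHTML() throws IOException, SAXException, TikaException {
  // ToXMLContentHandler serializes the SAX events Tika emits as XHTML text.
  ContentHandler handler = new ToXMLContentHandler();

  // AutoDetectParser detects the file type and delegates to the matching parser.
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  // try-with-resources closes the stream once parsing finishes.
  try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
    parser.parse(stream, handler, metadata);
    return handler.toString();
  }
}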

If you want to try it out quickly, you can run the Tika App command-line tool with the --xhtml option and your file; it prints simple, clean XHTML.
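If the XHTML you end up with still carries attributes you do not want (the question mentions width properties on table cells), one option is a small post-processing pass. The following is only a minimal sketch under stated assumptions: the input is well-formed XHTML, no attribute value contains `<` or `>`, and you are happy to drop every attribute (including `href` and `xmlns`), so it is best applied to a body fragment. The class and method names are hypothetical and are not part of Tika or POI; a real HTML parser would be more robust than a regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Tika or Apache POI.
final class AttributeStripper {
    // Matches an opening tag that has at least one attribute:
    // group 1 is the tag name, group 2 is the optional self-closing slash.
    private static final Pattern OPEN_TAG_WITH_ATTRS =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\s+[^<>]*?(/?)>");

    // Returns the markup with all attributes removed from opening tags.
    static String stripAttributes(String xhtml) {
        return OPEN_TAG_WITH_ATTRS.matcher(xhtml).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        String in = "<table width=\"100%\"><tr><td style=\"width:50px\">"
                  + "<b>Hi</b></td></tr></table>";
        // prints <table><tr><td><b>Hi</b></td></tr></table>
        System.out.println(stripAttributes(in));
    }
}
```

Tags without attributes and closing tags are left untouched, so the plain `<tr>`, `<td>`, and `<b>` structure survives as-is.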
