Internship 2010: Improve PDF Import


Abstract

The PDF Import Extension allows you to import and modify PDF documents. The best results, with 100% layout accuracy, are achieved with the "PDF/ODF hybrid file" format, which this extension also enables. A hybrid PDF/ODF file is a PDF file that contains an embedded ODF source file. OpenOffice.org opens hybrid PDF/ODF files as ODF files without any layout changes. Users without this extension can open the PDF part of the hybrid file with their PDF viewer.

The PDF Import Extension also allows you to import and modify PDF documents that are not hybrid PDF/ODF files. These documents are imported into Draw to preserve the layout and to allow basic editing. This works well for changing dates, numbers or small portions of text in simply formatted documents, with minimal loss of formatting information.

Goals for a PDF import

The document created by importing a PDF file should resemble the original as closely as possible. PDF itself, however, does not lend itself easily to that end: most PDF files contain no information about layout or document structure at all. A PDF file can therefore never be imported on a 1:1 basis. We have to set goals that define which level of similarity must be achieved, based on what is feasible.

These goals should be treated as paramount:

  • All text that is visible in the original PDF document should be imported.
  • Text attributes (font family, font size, weight (bold or not bold), style (italic or not italic)) should be imported together with the respective text.
  • All drawing elements (images, vector graphics) should be imported.
  • If the implementation has to choose between layout fidelity and editability, it should prefer layout fidelity.

Additionally, there are some goals that would greatly enhance the import result. By their nature, all of these features can only be implemented with heuristic methods, since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected (in descending order of importance):

  • Paragraphs
  • Enumerations
  • Titles
  • Underlined text
  • Subscript/superscript
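
As an illustration of what such a heuristic could look like, here is a minimal Python sketch of paragraph detection. It is not the extension's actual code: the TextLine structure, the group_paragraphs function and the threshold values are all hypothetical and chosen only for this example. It assumes that the text has already been extracted as lines with a position and a font size, and it starts a new paragraph whenever the font size changes, the vertical gap is large, or the line is indented relative to the previous one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """One line of text recovered from a PDF page (hypothetical structure)."""
    text: str
    top: float        # distance from the top of the page, in points
    left: float       # x position of the first glyph, in points
    font_size: float  # dominant font size of the line, in points


def group_paragraphs(lines: List[TextLine],
                     max_gap_factor: float = 1.5,
                     indent_tolerance: float = 2.0) -> List[List[TextLine]]:
    """Group lines into paragraphs using spacing and indentation heuristics.

    The thresholds are illustrative guesses, not values from the extension.
    """
    paragraphs: List[List[TextLine]] = []
    # Process lines from the top of the page to the bottom.
    for line in sorted(lines, key=lambda l: l.top):
        if paragraphs:
            prev = paragraphs[-1][-1]
            gap = line.top - prev.top
            same_font = abs(line.font_size - prev.font_size) < 0.5
            close = gap <= max_gap_factor * prev.font_size
            # A line that starts further right than the previous one is
            # treated as the indented first line of a new paragraph.
            indented = line.left - prev.left > indent_tolerance
            if same_font and close and not indented:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])
    return paragraphs
```

Calling group_paragraphs with the lines of one page returns a list of paragraphs, each of which is a list of TextLine objects in reading order. A heuristic like this handles only single-column pages; multi-column layouts, enumerations and titles would need additional rules.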

Backlog

This section lists the tasks that are planned for the internship but have not been started yet.

  1. Pop-up window that allows fonts to be replaced
  2. Native PDF forms
  3. Processing layout of LaTeX PDF
  4. Import of complex vector graphics elements
  5. Conversion of tables
  6. Import of EPS graphics
  7. RTL (right-to-left) text/font support
  8. Change ContentSink class
  9. Fix disappearing bookmarks
  10. Fix Ghostscript PDF import

Current tasks

This section lists the tasks that are currently in progress.

  1. Misplaced paragraphs

What has been done so far

  1. Introduction
  2. Issue 109708
  3. Issue 105133
  4. Issue 92919
  5. Proper paragraphs
  6. Improving rotated text
  7. Proper paragraphs required code changes
  8. Testing proper paragraphs import
  9. Improving char spaces
  10. Allow import of only selected pages

Problematic Tasks

  1. Issue 90633

Project status

  • The project has been accepted for the OpenOffice summer internship program 2010.