B Oracle Text Supported Document Formats

This appendix lists the document formats supported by the automatic (AUTO_FILTER) filtering technology. It also describes the supported platforms and the known filtering limitations for each group of formats.

About Document Filtering Technology

Oracle Text's automatic filtering technology, licensed from Verity, Inc., enables you to index most document formats. This technology also enables you to convert documents to HTML for document presentation with the CTX_DOC package.

To use automatic filtering for indexing and DML processing, you must specify the AUTO_FILTER object in your filter preference.
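
As a minimal sketch of this setup, the following statements create a filter preference based on the AUTO_FILTER object and then reference it when creating a CONTEXT index. The preference name my_auto_filter, the table docs, the column doc, and the index name docs_idx are hypothetical names used only for illustration. The sketch also assumes the user has the privileges required to execute CTX_DDL.

```sql
-- Hypothetical names: my_auto_filter, docs, doc, docs_idx.
BEGIN
  -- Create a filter preference that uses the AUTO_FILTER object.
  ctx_ddl.create_preference('my_auto_filter', 'AUTO_FILTER');
END;
/

-- Reference the filter preference in the index parameters.
CREATE INDEX docs_idx ON docs (doc)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('FILTER my_auto_filter');
```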

To use automatic filtering technology for converting documents to HTML with the CTX_DOC package, you need not use the AUTO_FILTER indexing preference, but you must still set up your environment to use this filtering technology, as described in this appendix.
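
The following sketch shows one way such a conversion might look, using CTX_DOC.FILTER to return an HTML rendition of a single document in a temporary CLOB. It assumes that a CONTEXT index named docs_idx already exists and that '1' is a valid textkey for it. Both values are hypothetical.

```sql
-- Hypothetical names: docs_idx (an existing CONTEXT index), textkey '1'.
SET SERVEROUTPUT ON

DECLARE
  html_out CLOB;
BEGIN
  -- With plaintext => FALSE, the filtered document is returned as HTML.
  ctx_doc.filter(
    index_name => 'docs_idx',
    textkey    => '1',
    restab     => html_out,
    plaintext  => FALSE);

  -- Print the first 200 characters of the HTML output.
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(html_out, 200, 1));

  -- Release the temporary CLOB allocated by the call.
  DBMS_LOB.FREETEMPORARY(html_out);
END;
/
```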

Latest Updates for Patch Releases

The supported platforms and formats listed in this appendix apply to this release. The list of supported formats is updated for patch releases. To view the latest formats, refer to the Oracle Technology Network:

http://www.oracle.com/technology/products/text

Restrictions on Format Support

Password-protected documents and documents with password-protected content are not supported by the AUTO_FILTER filter.

For other limitations, refer to the sections in this appendix that cover specific document types.

Supported Platforms

Several platforms can take advantage of AUTO_FILTER filter technology.

Supported Platforms

AUTO_FILTER filter technology is supported on the following platforms:

  • Microsoft Windows

    • Server 2003 (x86 and IA-64)

    • XP (Service Packs 1 and 2)

    • 2000 x86 (Service Pack 2)

    • NT 4.0 x86 (Intel) (Service Pack 6a)

  • Sun Solaris 8.0 and 9.0

  • HP-UX 11.0 and 11i, PA-RISC

  • HP-UX 11i v11.23, IA-64

  • IBM AIX 5.1 and 5.2L

  • Red Hat Linux 7.3 and 8.0

  • Red Hat Enterprise Linux AS 2.1 and 3.0 (x86)

  • Red Hat Enterprise Linux AS 3.0 (IA-64)

  • SuSE Linux Standard Server 8 (x86)

Environment Variables

No environment variables need to be set by the user.

Supported Document Formats

The tables in this section list the document formats that Oracle Text supports for filtering. Oracle Text licenses its filtering technology from Verity, Inc.

Document filtering is used for indexing, for DML processing, and for converting documents to HTML with the CTX_DOC package.


Note:

These lists do not represent the complete list of formats that Oracle Text is able to process. The external filter framework enables Oracle Text to process any document format, provided an external filter exists that can filter the format to text.

Text and Markup

Plain-text, HTML, XHTML, XML, and SGML formats pass through the filter without any conversion.

Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional?
ANSI (TXT) | all versions | Y | Y | n/a
ASCII (TXT) | all versions | Y | Y | n/a
HTML | 2.0, 3.2, 4.0 | Y | Y | n/a
IBM DCA/RFT (Revisable Form Text) (DC) | SC23-0758-1 | character sets 500 and 1026 only | N | N
Rich Text Format (RTF) | 1 through 1.7 | Y | Y | Y
Unicode Text | 3, 4 | Y | Y | n/a
XHTML | 1.0 | Y | Y | n/a
Generic XML | 1.0 | Y | Y | n/a

Word Processing Formats

Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional?
Adobe Maker Interchange Format (MIF) | 5, 5.5, 6, 7 | character set 1252 only | N | N
Applix Words (AW) | 3.11, 4.2, 4.3, 4.4, 4, 41, 4.2 | character set 1252 only | N | N
DisplayWrite (IP) | 4 | character sets 500 and 1026 only | N | N
Folio Flat File (FFF) | 3.1 | character set 1252 only | N | N
Fujitsu Oasys (OA2) | 7 | Y | Japanese only | N
JustSystems Ichitaro (JTD) | 8, 9, 10, 12 | Y | Japanese only | N
Lotus AMI Pro (SAM) | 2, 3 | Y | Simplified Chinese, Traditional Chinese, Japanese, and Thai only | Y
Lotus Word Pro (LWP) | 96, 97, Millennium Edition R9, 9.8 (supported on Windows 32-bit platform only) | Y | Y | Y
Lotus Master (MWP) | 96, 97, Millennium Edition R9, 9.8 (supported on Windows 32-bit platform only) | Y | Y | Y
Lotus Master (MWP) | 96, 97 (supported on Windows 32-bit platform only) | Y | Y | N
Microsoft Word for PC (DOC) | 4, 5, 5.5, 6 | character set 1252 only | N | N
Microsoft Word for Windows (DOC) | 1 through 2003 | Y | N: versions 1-2; Y: versions 6,7,8,95,97,2000,XP,2002,2003 | N: versions 1-2; Hebrew only: versions 6,7,8,95; Y: versions 97,2000,XP,2002,2003
Microsoft Word for Windows XML format | 2003 (No formatting extracted) | Y | Y | Y
Microsoft Word for Macintosh (DOC) | 4, 5, 6, 98 | Y (version 98) | N (version 98) | Y (version 98)
Microsoft Works (WPS) | 1 through 2000 | Y | Japanese only | N
Microsoft Windows Write (WRI) | 1, 2, 3 | Y | Japanese only | N
OpenOffice (SXW) | 1, 1.1 (No formatting extracted) | Y | Y | Y
StarOffice (SXW) | 6, 7 (No formatting extracted) | Y | Y | Y
WordPad | through 2003 | Y | Y | Y
WordPerfect for Windows (WO) | 5, 5.1 | Y | N | Y
WordPerfect for Windows (WPD) | 6, 7, 8, 10, 2000, 2002, 11 | Y | N | N
WordPerfect for Macintosh | 1.02, 2, 2.1, 2.2, 3, 3.1 | Y | N | N
WordPerfect for Linux | 6 | Y | N | N
XyWrite (XY4) | 4.12 | character set 1252 only | N | N

Word Processing Filtering Limitations

The following limitations apply to filtering of word processing documents:

  • Mixed-page orientation (landscape and portrait) within the same word processing document is not supported.

  • When text color in a Microsoft Word document is set to Automatic on a dark background, the resulting text is rendered as black. If the text color is explicitly set, the resulting text is rendered correctly in the same color as the original document.

  • If a graphic or table appears in a word processing text box, the filter cannot position it correctly in the HTML output.

  • Nested tables (a table inside another table) in word processing documents are not supported.

  • Comments in Microsoft Word documents are not filtered.

Spreadsheet Formats

Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional?
Applix Spreadsheets (AS) | 4.2, 4.3, 4.4 | character set 1252 only | N | N
Corel Quattro Pro (QPW, WB3) | 6, 7, 8, 10, 2000, 2002, 11 | Y | N | N
Lotus 1-2-3 (123) | 96, 97, Millennium Edition R9, 9.8 | Y | Y | Y
Lotus 1-2-3 (WK4) | 2, 3, 4, 5 | Y | Y | N
Lotus 1-2-3 Charts (123) | 2, 3, 4, 5 | Y | Y | N
Microsoft Excel for Windows (XLS) | 2.2 through 2003 | Y | Y | Y
Microsoft Excel for Windows XML format | 2003 (No formatting extracted) | Y | Y | Y
Microsoft Excel for Macintosh (XLS) | 98 | Y | N | N
Microsoft Excel Charts (XLS) | 2, 3, 4, 5, 6, 7 | Y | Y | N
Microsoft Works Spreadsheet (S30,S40) | 1, 2, 3, 4 | Y | N | N
OpenOffice (SXC) | 1, 1.1 (No formatting extracted) | Y | Y | Y
StarOffice (SXC) | 6, 7 (No formatting extracted) | Y | Y | Y

Spreadsheet Format Limitations

The following limitations apply to the filtering of spreadsheets:

  • Cell outline borders in Microsoft Excel spreadsheets are not filtered.

  • Microsoft Excel "Donut," "Radar," "Surface," and custom charts are not supported.

  • Comments in Microsoft Excel spreadsheets are not filtered.

Presentation Formats

Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional?
Applix Presents (AG) | 4.0, 4.2, 4.3, 4.4 | character set 1252 only | N | N
Corel Presentations (SHW) | 6, 7, 8, 10, 2000, 2002, 11 | character set 1252 only | N | N
Lotus Freelance Graphics (PRE) | 2, 96, 97, 98, Millennium Edition R9, 9.8 | character set 850 only (V96 and higher) | N (V96 and higher) | N (V96 and higher)
Lotus Freelance Graphics 2 (PRE) | 2 | Y | Japanese, Simplified Chinese, Traditional Chinese, and Thai only | N
Microsoft PowerPoint for Windows (PPT) | 95 through 2003 | Y | Japanese, Simplified Chinese, Traditional Chinese, and Korean only | Hebrew only
Microsoft PowerPoint for PC (PPT) | 4 | character set 1252 only | Traditional Chinese only | N
Microsoft PowerPoint for Macintosh (PPT) | 98 | Y | N | Y
Microsoft Project (MPP) | 98, 2000, 2002 (XP) | character set 1252 only | N | N
Microsoft Visio (VSD) | 6 | Y | Y | N
Microsoft Visio XML format | 2003 (No formatting extracted) | Y | Y | Y
OpenOffice (SXI, SXP) | 1, 1.1 (No formatting extracted) | Y | Y | Y
StarOffice (SXI, SXP) | 6, 7 (No formatting extracted) | Y | Y | Y

Presentation Format Limitations

Hyperlinks in presentation documents are not supported and are not preserved in the filtered output.

Display Formats

Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional?
Adobe Portable Document Format (PDF) | 1.1 (Acrobat 2.0) to 1.5 (Acrobat 6.0) | Y | Japanese, Simplified and Traditional Chinese, and Korean | N

Filtering of PDF Format Documents

Multi-byte PDFs are supported, provided the PDF document is created using Character ID-keyed (CID) fonts, predefined CJK CMap files, or ToUnicode font encodings, and the document does not contain embedded fonts. See the Adobe website and the Adobe Acrobat documentation for more information.

To determine the type of font encodings that are used in a PDF, open the PDF document in Adobe Acrobat, and select File->Document Info->Fonts. If the Encodings column lists Custom or Embedded encodings, then you may encounter problems filtering the PDF document.

PDF Filtering Limitations

The following limitations apply to PDF documents:

  • All PDF security attributes are supported except for user and master passwords.

  • Embedded fonts in a PDF document are not filtered correctly.

  • If an unsupported font is encountered during conversion of a PDF document, the default font, Times New Roman, is substituted. If the original font is wider than the substituted font, extra whitespace will appear in the output HTML.

  • The following color spaces are supported:

    • DeviceRGB

    • DeviceGray

    • DeviceCMYK

    • CalGray

    • CalRGB

    Index color spaces are supported as long as they are used with a supported basic color space.

  • Hyperlinks in PDF documents are not supported.

  • All pre-defined CMaps in PDF 1.3 specification are supported. CMaps added in PDF 1.4 and PDF 1.5 specifications are not supported.

  • Annotations, such as notes, sound, or movie, are not supported.

  • The following features of PDF 1.5 for Acrobat 6.0 are not supported:

    • Tagged PDFs

    • Images compressed in JPEG2000

    • Crypt Filter encryption

    • Hidden content in a PDF document, such as Optional Content and OCG-State Actions

    • Interactive forms

    • Embedded multimedia presentations

    • Digital signatures and signature fields

    • Interactive presentations, that is, navigation between pages and transition actions

  • Vector images are not supported. Since background colors are defined in PDF as vector images, background colors are also not supported. Raster images are supported.

Graphic Formats

Table B-1 lists the graphic formats that the AUTO_FILTER filter recognizes. Indexing a text column that contains any of these formats produces no error, so it is safe for the column to contain them.

Formats are categorized as either embedded graphics or standalone graphics. Embedded graphics are inserted or referenced within a document.


Note:

This filter cannot extract textual information from graphics.

Table B-1 Supported Graphics Formats for AUTO_FILTER Filter

Graphics Format | Version | Bidirectional?
AutoCAD Drawing format (DWG) | R13, R14, and R2000 (standalone only) |
AutoCAD Drawing format (DXF) | R13, R14, and R2000 (standalone only) |
Encapsulated PostScript (EPS) (raster only) | TIFF header only |
Enhanced Metafile (EMF) | no specific version | N
Graphics Interchange Format (GIF) | 87, 89 |
JPEG File Interchange Format | no specific version |
Lotus AMIDraw Graphics (SDW) | no specific version |
Lotus Pic (PIC) | no specific version |
Macintosh Raster (PICT/PCT) | 2 |
MacPaint (PNTG) | no specific version |
Microsoft Windows Bitmap (BMP) | no specific version |
PC Paintbrush (PCX) | 3 |
Portable Network Graphics (PNG) | no specific version |
SGI RGB Image (RGB) | no specific version |
Sun Raster Image (RS) | no specific version |
Tagged Image File (TIFF) | 5 | N
Truevision TARGA (TGA) | 2 |
Windows Animated Cursor (ANI) | no specific version |
Windows Metafile (WMF) | 3 | N
WordPerfect Graphics (WPG) | 1 | N
WordPerfect Graphics 2 (WPG) | 2, 7 | N