Discussion:
Extracting Office documents from compound files
Heath Kelly
2007-06-13 15:52:00 UTC
Hi guys,

My objective is to extract embedded documents from within a Word document.
These documents can be of any type, but my test case is a Word document that
contains 16 PDF, Word and Excel documents. I'm not really familiar with the
OLE file format or the concepts behind compound document structures, but I have
learnt that a storage = directory and a stream = file. I'm programming in C#.

My method is to use the StgOpenStorage API call from ole32.dll to get an
IStorage object. The storage object contains another storage called
“ObjectPool”, which contains 16 more storages where I assume my attached
documents reside. I iterate over the objects within these storages and find the
following streams:

For Word documents:
Data
1Table
CompObj
ObjInfo
WordDocument
SummaryInformation
DocumentSummaryInformation

For PDFs:
Ole
CompObj
ObjInfo
CONTENTS

For Excel:
Ole
CompObj
ObjInfo
Workbook
SummaryInformation
DocumentSummaryInformation

Here is a sample of the code I am using to save the stream to the file system:

IStream stream = null;
// Open the stream named by the current STATSTG entry.
storage.OpenStream(storeStruct[0].pwcsName, IntPtr.Zero,
    (uint)(STGM.READWRITE | STGM.SHARE_EXCLUSIVE), 0, out stream);
if (storeStruct[0].pwcsName == "WordDocument")
{
    unsafe
    {
        uint p;                        // receives the number of bytes actually read
        byte[] bits = new byte[1000];
        // "using" closes the output file even if a read fails.
        using (BinaryWriter output = new BinaryWriter(
            File.Open(@"D:\Data\OLE\test.doc", FileMode.Create)))
        {
            do
            {
                stream.Read(bits, bits.Length, new IntPtr(&p));
                output.Write(bits, 0, (int)p);
            } while (p > 0);
        }
    }
}

This works perfectly for PDFs (saving the “CONTENTS” stream). However, when
I write out the “WordDocument” stream, the resulting document is not correctly
formatted and won’t open in Word. Excel gives similar results: after Excel
repairs the document it will display, but not in its exact original format.
I assume there are some extra or missing bytes in the streams that are
compromising the file format. I’ve tried combining the other streams in the
Word storage but can’t generate a usable result.

Can anyone offer me any advice?

Regards,
Heath.
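
One way to test the "extra or missing bytes" assumption is to look at the first eight bytes of the saved file. A file that Word or Excel opens directly is itself a compound file, which begins with the signature D0 CF 11 E0 A1 B1 1A E1; a single stream written out on its own carries none of the header, allocation table or directory that surround it in the container. The following is a minimal sketch of such a check. The class and method names are hypothetical and the file paths are whatever the caller passes in.

```csharp
using System;
using System.IO;

static class CompoundFileCheck
{
    // Signature that opens every OLE2 compound file header.
    static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // True when the buffer begins with the compound file signature.
    public static bool StartsWithSignature(byte[] data)
    {
        if (data == null || data.Length < Signature.Length) return false;
        for (int i = 0; i < Signature.Length; i++)
            if (data[i] != Signature[i]) return false;
        return true;
    }

    // Reads only the first eight bytes of the file at the given path.
    public static bool IsCompoundFile(string path)
    {
        byte[] head = new byte[Signature.Length];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(head, 0, head.Length);
            return read == head.Length && StartsWithSignature(head);
        }
    }

    static void Main(string[] args)
    {
        // Each argument is a file to inspect, for example the saved test.doc.
        foreach (string path in args)
            Console.WriteLine("{0}: {1}", path,
                IsCompoundFile(path) ? "compound file" : "not a compound file");
    }
}
```

If the saved file fails this check, nothing is missing from the stream itself; the file simply lacks the container that Word and Excel expect around it.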
Heath Kelly
2007-06-13 16:56:02 UTC
This makes no sense to me. storeStruct[0].type returns 2, indicating that
the object "WordDocument" is indeed a stream. Regardless, I will give your
suggestion a go and report on the outcome.
Post by Alexander Nickolov
A Word document is itself a storage, not a stream. Same for an
Excel document. You can try to save the entire substorage as
a root storage in another compound document. See
IStorage::CopyTo. Not sure if that will work though...
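
For reference, a type value of 2 is STGTY_STREAM, so "WordDocument" is indeed a stream. The embedded document as a whole corresponds to the storage that contains that stream, which enumerates with type 1 (STGTY_STORAGE). A small sketch of that mapping follows; the enum values are those of the Windows SDK, while the ElementKind class and its Describe helper are hypothetical names used only for illustration.

```csharp
using System;

// Element types reported in STATSTG.type.
enum STGTY
{
    STGTY_STORAGE = 1,    // nested storage (directory-like)
    STGTY_STREAM = 2,     // stream (file-like)
    STGTY_LOCKBYTES = 3,
    STGTY_PROPERTY = 4
}

static class ElementKind
{
    // Describes one enumerated element by name and raw type value.
    public static string Describe(string name, int type)
    {
        switch ((STGTY)type)
        {
            case STGTY.STGTY_STORAGE: return name + ": storage";
            case STGTY.STGTY_STREAM: return name + ": stream";
            default: return name + ": other (" + type + ")";
        }
    }

    static void Main()
    {
        Console.WriteLine(Describe("ObjectPool", 1));
        Console.WriteLine(Describe("WordDocument", 2));
    }
}
```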
Alexander Nickolov
2007-06-13 16:44:40 UTC
A Word document is itself a storage, not a stream. Same for an
Excel document. You can try to save the entire substorage as
a root storage in another compound document. See
IStorage::CopyTo. Not sure if that will work though...
--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: ***@mvps.org
MVP VC FAQ: http://vcfaq.mvps.org
=====================================
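
The CopyTo suggestion above can be sketched in C# roughly as follows. This is an untested outline under several assumptions. The IStorage and IEnumSTATSTG declarations are hand-written interop definitions, not taken from the thread. The EmbeddedExtractor class, its ExtractAll method and the choice of file extension from the "WordDocument" and "Workbook" stream names are illustrative. Whether Word or Excel accepts the copied storage is not guaranteed.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

// Hand-written interop declarations (an assumption of this sketch).
// Methods are listed in vtable order; unused trailing methods are omitted.
[ComImport, Guid("0000000D-0000-0000-C000-000000000046"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IEnumSTATSTG
{
    // Returns S_OK (0) while elements remain and S_FALSE (1) at the end.
    [PreserveSig]
    int Next(uint celt, [Out, MarshalAs(UnmanagedType.LPArray)] STATSTG[] rgelt, out uint fetched);
}

[ComImport, Guid("0000000B-0000-0000-C000-000000000046"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IStorage
{
    void CreateStream([MarshalAs(UnmanagedType.LPWStr)] string name, uint mode, uint r1, uint r2, out IStream stm);
    [PreserveSig]
    int OpenStream([MarshalAs(UnmanagedType.LPWStr)] string name, IntPtr r1, uint mode, uint r2, out IStream stm);
    void CreateStorage([MarshalAs(UnmanagedType.LPWStr)] string name, uint mode, uint r1, uint r2, out IStorage stg);
    void OpenStorage([MarshalAs(UnmanagedType.LPWStr)] string name, IStorage priority, uint mode, IntPtr exclude, uint reserved, out IStorage stg);
    void CopyTo(uint ciidExclude, IntPtr rgiidExclude, IntPtr snbExclude, IStorage dest);
    void MoveElementTo([MarshalAs(UnmanagedType.LPWStr)] string name, IStorage dest, [MarshalAs(UnmanagedType.LPWStr)] string newName, uint flags);
    void Commit(uint flags);
    void Revert();
    void EnumElements(uint r1, IntPtr r2, uint r3, out IEnumSTATSTG items);
}

static class EmbeddedExtractor
{
    internal const uint STGM_READ = 0x0, STGM_READWRITE = 0x2, STGM_SHARE_EXCLUSIVE = 0x10,
                        STGM_SHARE_DENY_WRITE = 0x20, STGM_CREATE = 0x1000;
    const int STGTY_STORAGE = 1;

    [DllImport("ole32.dll", CharSet = CharSet.Unicode)]
    internal static extern int StgOpenStorage(string name, IStorage priority, uint mode,
        IntPtr exclude, uint reserved, out IStorage stg);

    [DllImport("ole32.dll", CharSet = CharSet.Unicode)]
    internal static extern int StgCreateDocfile(string name, uint mode, uint reserved, out IStorage stg);

    // True when the storage directly contains a stream with this name.
    static bool HasStream(IStorage stg, string name)
    {
        IStream stm;
        if (stg.OpenStream(name, IntPtr.Zero, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, out stm) != 0)
            return false;
        Marshal.ReleaseComObject(stm);
        return true;
    }

    // Copies each Word or Excel substorage of ObjectPool to its own file
    // and returns the number of files written.
    internal static int ExtractAll(string source, string outputDir)
    {
        IStorage root, pool;
        Marshal.ThrowExceptionForHR(StgOpenStorage(source, null,
            STGM_READ | STGM_SHARE_DENY_WRITE, IntPtr.Zero, 0, out root));
        root.OpenStorage("ObjectPool", null, STGM_READ | STGM_SHARE_EXCLUSIVE, IntPtr.Zero, 0, out pool);

        IEnumSTATSTG items;
        pool.EnumElements(0, IntPtr.Zero, 0, out items);
        STATSTG[] stat = new STATSTG[1];
        uint fetched;
        int written = 0;
        while (items.Next(1, stat, out fetched) == 0 && fetched == 1)
        {
            if (stat[0].type != STGTY_STORAGE) continue;
            IStorage child;
            pool.OpenStorage(stat[0].pwcsName, null, STGM_READ | STGM_SHARE_EXCLUSIVE, IntPtr.Zero, 0, out child);

            // Choose the extension from the marker streams listed in the question.
            string ext = HasStream(child, "WordDocument") ? ".doc"
                       : HasStream(child, "Workbook") ? ".xls" : null;
            if (ext != null)
            {
                IStorage dest;
                Marshal.ThrowExceptionForHR(StgCreateDocfile(Path.Combine(outputDir, stat[0].pwcsName + ext),
                    STGM_CREATE | STGM_READWRITE | STGM_SHARE_EXCLUSIVE, 0, out dest));
                child.CopyTo(0, IntPtr.Zero, IntPtr.Zero, dest); // exclude nothing
                dest.Commit(0);                                  // STGC_DEFAULT
                Marshal.ReleaseComObject(dest);
                written++;
            }
            Marshal.ReleaseComObject(child);
        }
        Marshal.ReleaseComObject(items);
        Marshal.ReleaseComObject(pool);
        Marshal.ReleaseComObject(root);
        return written;
    }

    static void Main(string[] args)
    {
        // args[0]: container document; args[1]: existing output folder.
        Console.WriteLine("Extracted {0} document(s).", ExtractAll(args[0], args[1]));
    }
}
```

Storages without a "WordDocument" or "Workbook" stream are skipped here. That includes the PDF objects, for which the original post reports that saving the "CONTENTS" stream by itself already works.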
Michael Phillips, Jr.
2007-06-13 19:26:16 UTC
Have you tried using the IDataObject interface to extract the documents?

IDataObject can render any format that has a corresponding registered
clipboard format.

If you are able to use OleCreateFromData or OleCreateFromFile, then you may
request an IDataObject, which can be used to render those formats.
ZeljkoS
2007-06-14 10:55:00 UTC
Hi Heath,

We had the same problem and solved it (at least for embedded XLS files).
First you need to export the embedded XLS storage as a separate OLE2 Compound
Document file (including all the XLS streams you mentioned) with an .xls
extension. You will get a valid XLS file, but the entire contents of the file
will be hidden. One way to fix that is to open the file in MS Excel and use
View > Unhide.

As we (GemBox Software) develop native .NET components for OLE2 Compound
Document and XLS handling, we used another approach. We use
GemBox.CompoundFile to extract the XLS storage and then GemBox.Spreadsheet
to rewrite the file and erase the hidden option. That way our code avoids
automation and COM Interop:

// Extract XLS substorage as a separate file using GemBox.CompoundFile.
Ole2CompoundFile sourceCf = new Ole2CompoundFile();
sourceCf.Load("../../TestEmbedded.doc", true);
Ole2Storage sourceStorage = (Ole2Storage)sourceCf.Root["ObjectPool"];
sourceStorage = (Ole2Storage)sourceStorage["_1243322160"];
Ole2CompoundFile destinationCf = new Ole2CompoundFile();
destinationCf.Root.ImportTree(sourceStorage, true);
destinationCf.Save("ExtractedPlain.xls");
// At this point "ExtractedPlain.xls" is a valid XLS file, but its entire
// contents are hidden.
// One way to fix that is to open the file in MS Excel and use View > Unhide.
// Another approach is to use GemBox.Spreadsheet, which rewrites the file and
// erases the hidden option.
ExcelFile ef = new ExcelFile();
ef.LoadXls("ExtractedPlain.xls");
ef.SaveXls("ExtractedModified.xls");

Regards,
Zeljko Svedic
GemBox Software
http://www.gemboxsoftware.com
http://www.gemboxsoftware.com/GBSpreadsheet.htm
http://www.gemboxsoftware.com/CompoundFile.htm