Heath Kelly
2007-06-13 15:52:00 UTC
Hi guys,
My objective is to extract embedded documents from within a Word document.
These documents can be of any type, but my test case is a Word document that
contains 16 PDF, Word and Excel documents. I'm not really familiar with the
OLE file format or the concepts behind compound document structures, but I have
learnt that a storage = directory and a stream = file. I'm programming in C#.
My method is to use the StgOpenStorage API call from ole32.dll to get an
IStorage object. The storage object contains another storage called
“ObjectPool”, which contains 16 more storages where I assume my attached
documents reside. I iterate over the objects within these storages and find the
following streams:
For Word documents:
Data
1Table
CompObj
ObjInfo
WordDocument
SummaryInformation
DocumentSummaryInformation
For PDFs:
Ole
CompObj
ObjInfo
CONTENTS
For Excel:
Ole
CompObj
ObjInfo
Workbook
SummaryInformation
DocumentSummaryInformation
Here is a sample of the code I am using to save the stream to the file system:
IStream stream = null;
// Open the named stream for reading and writing.
storage.OpenStream(storeStruct[0].pwcsName, IntPtr.Zero,
    (uint)(STGM.READWRITE | STGM.SHARE_EXCLUSIVE), 0, out stream);
if (storeStructB[0].pwcsName == "WordDocument")
{
    unsafe
    {
        uint p; // number of bytes returned by the last Read call
        byte[] bits = new byte[1000];
        BinaryWriter output = new BinaryWriter(
            File.Open(@"D:\Data\OLE\test.doc", FileMode.Create));
        do
        {
            // Read up to 1000 bytes, then write out only the bytes actually read.
            stream.Read(bits, 1000, new IntPtr(&p));
            output.Write(bits, 0, (int)p);
        } while (p > 0);
        output.Close();
    }
}
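The copy loop above can also be written without an unsafe block. This is only a minimal sketch, assuming the stream is exposed as System.Runtime.InteropServices.ComTypes.IStream; the class name StreamDump and the method name CopyToFile are hypothetical and not part of the code above.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

static class StreamDump
{
    // Hypothetical helper: copies everything that can be read from a COM
    // IStream into a file and returns the number of bytes written.
    public static long CopyToFile(IStream source, string path)
    {
        byte[] buffer = new byte[4096];
        // IStream.Read reports the byte count through a pointer, so a small
        // unmanaged cell is allocated instead of taking &p in unsafe code.
        IntPtr pcbRead = Marshal.AllocCoTaskMem(sizeof(int));
        long total = 0;
        try
        {
            using (FileStream output = File.Create(path))
            {
                while (true)
                {
                    Marshal.WriteInt32(pcbRead, 0);
                    source.Read(buffer, buffer.Length, pcbRead);
                    int read = Marshal.ReadInt32(pcbRead);
                    if (read <= 0)
                    {
                        break; // end of stream
                    }
                    output.Write(buffer, 0, read);
                    total += read;
                }
            }
        }
        finally
        {
            Marshal.FreeCoTaskMem(pcbRead);
        }
        return total;
    }
}
```

The using block closes the output file even if a read throws, which the explicit output.Close() call does not guarantee.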
This works perfectly for PDFs (saving the “CONTENTS” stream). However, when
I write out the “WordDocument” stream, the resulting document is not correctly
formatted and won’t open in Word. Excel has similar results; after Excel
repairs the document it will display, but not in its exact original format.
I assume there are some extra or missing bytes in the streams that are
compromising the file format. I’ve tried combining the other streams in the
Word storage but can’t generate a working result.
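One untested idea, sketched here under assumptions rather than offered as a confirmed fix: since a .doc or .xls file is itself a compound file, the whole sub-storage under “ObjectPool” (not only its “WordDocument” or “Workbook” stream) may need to be copied into a new compound file. The IStorage declaration below is hand-written, and the names StorageDump and SaveAsCompoundFile and the embedded parameter are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

// Hand-written declaration of the COM IStorage interface.
// The methods must stay in this (vtable) order.
[ComImport, Guid("0000000B-0000-0000-C000-000000000046"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IStorage
{
    void CreateStream([MarshalAs(UnmanagedType.LPWStr)] string name, uint mode, uint reserved1, uint reserved2, out IStream stream);
    void OpenStream([MarshalAs(UnmanagedType.LPWStr)] string name, IntPtr reserved1, uint mode, uint reserved2, out IStream stream);
    void CreateStorage([MarshalAs(UnmanagedType.LPWStr)] string name, uint mode, uint reserved1, uint reserved2, out IStorage storage);
    void OpenStorage([MarshalAs(UnmanagedType.LPWStr)] string name, IStorage priority, uint mode, IntPtr exclude, uint reserved, out IStorage storage);
    void CopyTo(uint ciidExclude, IntPtr rgiidExclude, IntPtr snbExclude, IStorage dest);
    void MoveElementTo([MarshalAs(UnmanagedType.LPWStr)] string name, IStorage dest, [MarshalAs(UnmanagedType.LPWStr)] string newName, uint flags);
    void Commit(uint flags);
    void Revert();
    void EnumElements(uint reserved1, IntPtr reserved2, uint reserved3, [MarshalAs(UnmanagedType.Interface)] out object enumerator);
    void DestroyElement([MarshalAs(UnmanagedType.LPWStr)] string name);
    void RenameElement([MarshalAs(UnmanagedType.LPWStr)] string oldName, [MarshalAs(UnmanagedType.LPWStr)] string newName);
    void SetElementTimes([MarshalAs(UnmanagedType.LPWStr)] string name, IntPtr ctime, IntPtr atime, IntPtr mtime);
    void SetClass(ref Guid clsid);
    void SetStateBits(uint bits, uint mask);
    void Stat(out System.Runtime.InteropServices.ComTypes.STATSTG stat, uint flag);
}

static class StorageDump
{
    const uint STGM_READWRITE = 0x00000002;
    const uint STGM_SHARE_EXCLUSIVE = 0x00000010;
    const uint STGM_CREATE = 0x00001000;

    // PreserveSig = false turns a failing HRESULT into an exception.
    [DllImport("ole32.dll", CharSet = CharSet.Unicode, PreserveSig = false)]
    static extern void StgCreateDocfile(string path, uint mode, uint reserved, out IStorage storage);

    // Hypothetical helper: copies every stream and sub-storage of an
    // embedded storage into a new compound file at the given path.
    public static void SaveAsCompoundFile(IStorage embedded, string path)
    {
        IStorage dest;
        StgCreateDocfile(path, STGM_CREATE | STGM_READWRITE | STGM_SHARE_EXCLUSIVE, 0, out dest);
        try
        {
            // No interfaces or element names are excluded from the copy.
            embedded.CopyTo(0, IntPtr.Zero, IntPtr.Zero, dest);
            dest.Commit(0); // 0 = STGC_DEFAULT
        }
        finally
        {
            Marshal.ReleaseComObject(dest);
        }
    }
}
```

Here embedded would be one of the 16 storages opened from “ObjectPool”. The copy would also carry over the “Ole”, “CompObj” and “ObjInfo” streams, and whether Word or Excel accepts the result has not been verified.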
Can anyone offer me any advice?
Regards,
Heath.