Project Description
This project is a proof of concept and reference for people interested in writing a SharePoint document parser. Originally, I was working on a utility that could intercept and interpret properties of Office 2007 format documents, updating them in line with the Microsoft-provided parser. I found very little in the way of SDK examples, documentation or other work on this topic, so I decided to publish this on CodePlex for anyone else who may need the reference, and of course for any embellishments others may wish to provide.


As it turns out, Office 2007 documents are actually zip files, and their properties are stored in XML documents inside the zip file. This project implements a generic “zip” parser: it can poke around inside a zip file and look for XML documents. If it finds any, it pulls properties from them and hands them back to SharePoint. It can also “chain” another parser (e.g. the built-in Office parser), so that if your parser does not handle the job, you can pass it off to another parser. All of this is set up in a tidy little configuration file (sorry, no configuration UI).
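To make the "documents are actually zip files" point concrete, here is a minimal sketch (not code from this project) that checks whether a buffer starts with the zip local file header signature, the bytes `P`, `K`, 0x03, 0x04. The helper name `LooksLikeZip` is hypothetical, and the check only looks at the first four bytes; it does not validate the rest of the archive.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the project: returns true when the buffer
// begins with the zip local file header signature "PK\x03\x04".
// This is only a quick sniff test; it does not prove the archive is valid.
bool LooksLikeZip(const std::vector<std::uint8_t>& bytes) {
    static const std::uint8_t kSignature[4] = {0x50, 0x4B, 0x03, 0x04};
    if (bytes.size() < sizeof(kSignature)) {
        return false;  // too short to hold the signature
    }
    for (std::size_t i = 0; i < sizeof(kSignature); ++i) {
        if (bytes[i] != kSignature[i]) {
            return false;
        }
    }
    return true;
}
```

Reading the first few bytes of a .docx file and passing them to this function should return true, while an older binary .doc file (which starts with the bytes D0 CF 11 E0) should return false.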

First, some credits and explanations. I did not write the zip routines; I found them on Code Project (credits are in the header). I modified them to support the ILockBytes interface that SharePoint favors and fixed a few small bugs. I also wish to give credit to Jonas Nilsson for the post at http://isharepoint.blogspot.com/2007/06/custom-document-parsers-part-2.html, which was the only technically accurate post on the topic that I found.

I originally wanted to build this as a C# assembly but, try as I might, I could not get it to be fast and stable in C#. I got it to work, but it kept consuming memory as if the .NET GC was never being called (if I do get back to it, I will post the addition). So I did it as a C++ ATL component (not managed code). To those of you who are C++ gurus, I apologize in advance: I use lots of wrapper classes just to make sure strings and COM pointers don’t leak, so it’s not always pretty, but at least everything cleans up after itself.

If you poke around a bit you will also see lots of Unicode/MultiByte transitions. On MSDN, the ISPDocumentParser interfaces are documented as MultiByte only (actually CHAR*, which could point to anything). I don’t know for certain that this is true, but from the debugger it appears to be. Since the XML DOM needs BSTRs (which are Unicode), there is a bit of back and forth.

While the classes are marked and generally tested as free-threaded objects, I have not done much load testing other than sending in large zip files to make sure it does not crash. I have tested up to the default SharePoint limit of 50MB without a problem.

The code base includes two projects. There is, of course, the actual parser, which is the Version3.SharePoint.Zip parser project. It is a Visual Studio 2008 C++ project; it should compile (with appropriate configuration) all the way back to VC6, as I don’t think there is anything odd in it. The interfaces for all the SharePoint stuff are in the IDL file for that project. Of course, it’s a COM object, so it has to be registered to work. There is also a small C# console application that can install and uninstall the parser. Sitting off by itself is a file called FakePropertyBag.h, which is a test harness for simulating SharePoint’s property bag.

How it works: when the parser is first created by SharePoint, it locates and reads its configuration file (it expects to find it in the same directory it is registered in; a sample is included). The configuration file defines “Parsers” instances, each made up of an XML file name, property names and XPath queries. During a parse, the parser iterates over the parser definitions. If it finds the XML file named in a parser definition, it unzips that file into an XML DOM and executes all of the defined XPath queries. Any XPath query that returns a result is promoted to a SharePoint property. If it does not find anything, and the configuration includes information for another parser capable of handling zip files, it hands the file off to that parser.
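The flow described above can be sketched roughly as follows. This is an illustration only, not the project's code: every name in it (`Archive`, `PropertyBag`, `ParserDefinition`, `PropertyQuery`, `QueryElement`, `Parse`) is hypothetical, the zip file is modelled as an in-memory map of entry names to XML text, and a simple element-name lookup stands in for real XPath evaluation against an XML DOM. It also assumes that "does not find anything" means no property was promoted.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// All names are hypothetical; they only mirror the flow described in the text.
using Archive = std::map<std::string, std::string>;      // entry name -> XML text
using PropertyBag = std::map<std::string, std::string>;  // property -> value

struct PropertyQuery {
    std::string propertyName;  // SharePoint property to promote
    std::string element;       // stand-in for an XPath query: an element name
};

struct ParserDefinition {
    std::string xmlFile;                 // entry to look for inside the zip
    std::vector<PropertyQuery> queries;  // queries to run against that entry
};

// Stand-in for XPath: fetches the text of the first <element>...</element> pair.
bool QueryElement(const std::string& xml, const std::string& element,
                  std::string& value) {
    const std::string open = "<" + element + ">";
    const std::string close = "</" + element + ">";
    const std::size_t start = xml.find(open);
    if (start == std::string::npos) return false;
    const std::size_t begin = start + open.size();
    const std::size_t end = xml.find(close, begin);
    if (end == std::string::npos) return false;
    value = xml.substr(begin, end - begin);
    return true;
}

// Runs every definition against the archive; if nothing was promoted and a
// chained parser is configured, hands the archive off to it.
bool Parse(const Archive& archive,
           const std::vector<ParserDefinition>& definitions,
           const std::function<bool(const Archive&, PropertyBag&)>& chained,
           PropertyBag& bag) {
    bool promoted = false;
    for (const ParserDefinition& def : definitions) {
        const auto entry = archive.find(def.xmlFile);
        if (entry == archive.end()) continue;  // definition does not apply
        for (const PropertyQuery& query : def.queries) {
            std::string value;
            if (QueryElement(entry->second, query.element, value)) {
                bag[query.propertyName] = value;  // promote to a property
                promoted = true;
            }
        }
    }
    if (!promoted && chained) {
        return chained(archive, bag);  // assumption: chain only when nothing matched
    }
    return promoted;
}
```

With a definition that names `docProps/core.xml` and maps the `dc:title` element to a `Title` property, an archive containing that entry fills in `Title` and the chained parser is never called. An archive without the entry falls through to the chained parser.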

I have not implemented property demotion, except to hand that function off to the other parser, and (in the name of shameless self-promotion) the only thumbnail it will generate is my company’s logo. But you get the picture :)

Installation is easy: register the compiled DLL with regsvr32 (I have not tested it as 64-bit), add the configuration file to the same directory as the DLL, then run the C# installer console. It defaults to handling zip file extensions, but you can of course change that, as the parser really does not care or even know what the file name is.

If you are debugging it and are having trouble connecting to the object, find the ZipParser.h file and look in the method called FinalConstruct. Uncomment the message box routine and recompile. When the object is first instantiated by SharePoint, you will get a message box on the server’s console telling you which process to attach to. Don’t forget it’s native code, so your debugger must be set up for native debugging for you to step through it.



I plan on updating the demotion and thumbnail routines over the next few weeks and will post updates as they become available. Feel free to chip in with ideas and help if you like.



Last edited Jun 4, 2008 at 9:52 PM by robginsburg, version 3