Learning Python Part II |
Received the construction drawings today |
Learning Python Part III
It is July 30, 2019, and I am starting on Part 3, Learning the Python programming language. This part is lengthy, but if you want to follow me creating a parser of a Microsoft Word document converting it to HTML code with the help of Python using Visual Studio Code and the Python-DocX library, then this is the blog post you want to read.
Yesterday when I uploaded my previous part of this series, I started by running my newly created program. That went very well. Up until now, I had been copying and pasting images out of the word document into MS Paint, save the image, and then upload each image. The new program saved a lot of time.
I have my own blog engine, and it has a web page for editing the blog. I had to upload the images one by one and replace the filenames in the text with a real image.
In theory, this was a good plan, but when I had done 24 images, the browser crashed somehow. I had not saved any of my work. This was very frustrating.
This leads me to wonder if it could not be possible to improve this workflow. I would like to prepare the images for upload, and I want to prepare the blog text for publishing.
The images are stored in two versions on the webserver: One regular size image and one scaled-down version of the image. This makes no sense for low-resolution images, but this is how it is right now. Each image is displaying the scaled-down version, and besides that, there is a hyperlink to the high-resolution image. This markup needs to be generated, alongside bold, italic, bullets, and all sorts of the regular layout.
It will be necessary to scale images, and I will parse the layout of the text. This is the assignment for this exercise of learning Python.
I found this extensive text on various image manipulation libraries for Python:
https://opensource.com/article/19/3/python-image-manipulation-tools
It looks like python-opencv is a powerful library to use. I have no idea, so let us try that.
pip install opencv-python Successfully installed numpy-1.17.0 opencv-python-4.1.0.25
This is the version 4.1.1 of the documentation of OpenCV: https://docs.opencv.org/4.1.1/
While reading this, I wondered if this was perhaps a very capable library that would be very difficult to understand. Since I could not find what I was looking for, I tried to search for and an example of how to resize an image with OpenCV.
https://medium.com/@manivannan_data/resize-image-using-opencv-python-d2cdbbc480f0
That was great information. I could easily take that snippet and integrate it into my own export program:
August 2, 2019
It is a couple of programming sessions later. It is great fun. The Visual Studio Code editor is actually really pleasant to work with. I have seen that people are working with a black background, but somehow, I had not tested that yet. It is nice.
Visual Studio Code works very well. It has renaming of symbols. I use that a lot to improve the names of variables as the program progresses. The debugger is working well. The Python language source is pleasant to look at.
Rather than talking about individual steps what I learned, I will talk about the general flow of how my assignment is progressing. At the time of writing, now, I am not finished with the program.
I found that OpenCV is a very advanced library, and I will probably look into it again, in the future. What I wanted to do with resizing an image was easy. I create a display image that is always the width of 400 pixels. That is the image you see first. This image has a link to the original image whatever size that might be. The display image has a tilde 1 at the end of the filename.
After doing the resizing, I started on the bigger task, parsing the word document, and produce HTML. The runs have font settings, bold, italic, etc., I started with that challenge. Before looping over the text, I store the first run. Then I created a function CompareRunsProduceMarkup that given two consecutive runs compare them and produce the markup between the runs. So if the second run has bold, then the markup for bold is produced. If the bold ends the markup for ending bold is produced. Easy.
The paragraphs have bullet settings. That is way more complex. My biggest issue with this was that I could not get to the numbering.xml table via the Python-Docx library.
3 August 2019
Today I have time for writing more about the Docx parser program. It is challenging to write about the creation of the program while being clueless about if things are going as they should. Things were not going as they were supposed. Getting to the numbering.xml table was impossible. For me. Since I had already loaded the ZipFile of the word document, I resorted to loading the numbering.xml myself.
When I had the XML data loaded it was straightforward to parse the data. Huh? Straightforward! No, there were quirks here and there.
The XML of a word document comes with namespaces. Lot’s of namespaces. To be able to select a node from the XML you get to deal with the namespaces. Python XML has the function find and findall. They take an argument for the namespace. When that is added, it is possible to select nodes with XPATH.
XPath? No, not at all, a little subset of XPath. You cannot select attributes with this subset. For that, you need to use the attrib list of an element, and here there is no workaround for avoiding elaborate namespaces. So this is what I did.
Numbering is a little convoluted. There is a number starting at 1 for each number style that then refers to an abstract numbering definition starting at 0, and the abstract holds the format. I take this knowledge and put them in three dictionaries so that when parsing the word document I can look up what a number is, and then with that the abstract number, and from there the format, and from there the HTML code for the numbering.
Word has various styles for each level of the numbering, but I don’t emulate that. I just use the format of level zero of the abstract number definition.
Then followed the parsing of the numbering. I used two variables to hold the current state of the numbering.
-
strInList, this is the name of the format of the List Paragraph, as it is found of level 0 of the abstract number list, as I mentioned before. If we are not currently in a list, it is an empty string. This is initialized with an empty string.
-
iInListLevel, this is the level counter as given in the List Paragraph. This is initialized with minus one.
When we detect that we are in a List Paragraph, then we go from no list to a list. What list? That is given by the list number. The list number is resolved through the dictionaries prepared on beforehand. Also, if we go from no list to level 0, that means we the list level difference is 1 compared with our state. So we loop one time over creating a new list.
When we go from level 0 to level one, we create another list.
When we go from a higher level to a lower level, we end lists the number of times the difference is with the list level state.
When we go from being in a list to being outside a list, then we loop again over the list ending code.
When we are done with a paragraph being in a list item, we always end that item.
It was a little Python special with the loops producing the level transitions. I initialize the iInListLevel state to minus one. This means that when starting the first list, there is a transition from minus one to zero. That is done with the range command. Then going from zero to one and so on is the same. Looping, in a positive direction, works fine. Looping from 0 to minus one that did not work, until I found out that the range function has an implicit argument step that is set to one. Going in a negative direction you need explicitly to specify minus one.
Here is the unfinished program so far:
import docx import os import xml import zipfile import sys import cv2 import zlib import xml.etree.ElementTree as ET if len(sys.argv) != 2: print("Please only call me with one parameter") sys.exit() strWordFile = sys.argv[1] doc_DocX = docx.Document(strWordFile) doc_ZipFile = zipfile.ZipFile(strWordFile, 'r') fp = doc_ZipFile.open("word/numbering.xml") numbering_Xml = ET.parse(fp) ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'} num_Lst = numbering_Xml.findall("/w:num", ns) dictNumIDToAbstractNum = {} dictNumAbstractIDToStrFormat = {} dictListFormatToHtml = { 'bullet': 'ul', 'decimal': 'ol'} for num_Element in num_Lst: iNumId = int(num_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numId'], 10) AbstractNum_Element = num_Element.find("w:abstractNumId", ns) iAbstractNumId = int(AbstractNum_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'], 10) dictNumIDToAbstractNum[iNumId] = iAbstractNumId numAbstract_Lst = numbering_Xml.findall("/w:abstractNum", ns) for numAbstract_Element in numAbstract_Lst: iAbstractNumId = int(numAbstract_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}abstractNumId'], 10) AbstractNumLevel0Format_Element = numAbstract_Element.find("w:lvl[@w:ilvl='0']/w:numFmt", ns) strFormat = AbstractNumLevel0Format_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'] dictNumAbstractIDToStrFormat[iAbstractNumId] = strFormat strBaseDir = os.path.dirname(strWordFile) os.chdir(strBaseDir) strBaseName = os.path.basename(os.path.splitext(strWordFile)[0]) def CompareRunsProduceMarkup(run1,run2): """Compare two runs, produce markup""" strMarkup = "" if(run1.bold != run2.bold): if(run1.bold): strMarkup += "≺/em≻" else: strMarkup += "≺em≻" if(run1.italic != run2.italic): if(run1.italic): strMarkup += "≺/i≻" else: strMarkup += "≺i≻" return strMarkup prev_Run = doc_DocX.paragraphs[0].runs[0] strInList = '' # '' = No List, 'bullet' = Bullet, 'decimal' = Decimal iInListLevel = -1 for paragraph_Element in doc_DocX.paragraphs: if (paragraph_Element.style.name == "List Paragraph"): iNumId = paragraph_Element._element.pPr.numPr.numId.val iCurrentListLevel = paragraph_Element._element.pPr.numPr.ilvl.val iAbstractId = dictNumIDToAbstractNum[iNumId] strListFormat = dictNumAbstractIDToStrFormat[iAbstractId] if (strInList != ''): if (iCurrentListLevel ≻ iInListLevel): #Higher list level for i in range(iInListLevel, iCurrentListLevel): print("≺" + dictListFormatToHtml[strInList] + "≻") elif (iCurrentListLevel ≺ iInListLevel): #Lower list level for i in range(iInListLevel, iCurrentListLevel, - 1): print("≺/" + dictListFormatToHtml[strInList] + "≻") print("≺li≻") else: strInList = strListFormat # from -1 to 0 for i in range(iInListLevel, iCurrentListLevel): print("≺" + dictListFormatToHtml[strInList] + "≻") print("≺li≻") iInListLevel = iCurrentListLevel else: if (strInList != ''): for i in range(iInListLevel, -1, -1): print("≺/" + dictListFormatToHtml[strInList] + "≻") strInList = '' iInListLevel = -1 print("≺p≻") for current_Run in paragraph_Element.runs: if len(current_Run.element.drawing_lst) ≻ 0: for drawing_Element in current_Run.element.drawing_lst: inline_NodeList = list(drawing_Element) print(inline_NodeList[0].graphic.graphicData.pic.blipFill.blip.embed) strImageId = inline_NodeList[0].graphic.graphicData.pic.blipFill.blip.embed image_Rel = doc_DocX.part.rels[strImageId] strImageTgt = image_Rel._baseURI[1:] + "/" + image_Rel.target_ref try: ZipInfo = doc_ZipFile.NameToInfo[strImageTgt] strFileName = strBaseName + "_" + os.path.basename(ZipInfo.filename) strFileNameResized = os.path.splitext(strFileName)[0] + "~1" + os.path.splitext(strFileName)[1] ZipInfo.filename = strFileName doc_ZipFile.extract(ZipInfo, strBaseName) iW = 400. imgOrig = cv2.imread(strBaseDir + "" + strBaseName + "" + strFileName) iHeight, iWidth, iDepth = imgOrig.shape fImgScale = iW/iWidth iNewX,iNewY = imgOrig.shape[1]*fImgScale, imgOrig.shape[0]*fImgScale imgResized = cv2.resize(imgOrig,(int(iNewX),int(iNewY)),cv2.INTER_AREA) cv2.imwrite(strBaseDir + "" + strBaseName + "" + strFileNameResized, imgResized) paragraph_Element.add_run(ZipInfo.filename) except KeyError: print("ERROR: Did not find {} in zip file".format(strImageTgt)) else: print("Extracted {}".format(strImageTgt)) strMarkup = CompareRunsProduceMarkup(prev_Run, current_Run) if (strMarkup != ""): print(strMarkup) prev_Run = current_Run print(current_Run.text) print("≺/p≻") if (paragraph_Element.style.name == "List Paragraph"): print("≺/li≻") doc_DocX.save(strWordFile) del doc_DocX
With the new scheme, it will not be necessary to modify the word document and save it because I am generating the entire HTML source code, but that part is inherited from the previous program.
4 August 2019
I have four wishes from my parser before I am pleased with it:
-
I would like to be able to support floating images as well so that I can specify that images should float left or float right etc.
-
It would be nice to support tables.
-
It would be useful that if I specify a specific style named Code then this style would be recognized and that code would be marked up as pre.
-
When a hyperlink is found in the word document, it is marked up as a hyperlink.
Oh my dear, will I blog about this for half a year or is this done in a couple of weeks?
So now I start with point 4. First of all, I need a test case for this. I add an image and change it to float and align left and another to align right, and then I need to analyze how this is done in word, and then I can parse this and produce the markup.
The word markup is actually really fun for me. It has three meanings:
-
the amount added to the cost price of goods to cover overheads and profit.
-
the process or result of correcting text in preparation for printing.
-
a set of tags assigned to elements of a text to indicate their relation to the rest of the text or dictate how they should be displayed.
I was genuinely thinking that meaning three was the only meaning, until seven years ago. There were words for markup that I used before I figured out all the meanings of markup, but those were in Swedish. Now it is time for floating images.
There are a couple of interesting things to note on how word handles floating images. Do you see the anchor symbol? That is the anchor point for the image in the text. It should be there in the text flow of the XML, we will come to that in a moment.
Next, there are numerous ways to specify that the image is floating on the left side. Namely, that the text will float on the right side of the image. Duh!
Right-click the image and select size and position. Go to the tab Text Wrapping. In there, we find several wrapping styles. The style that the previous program implemented is “In line with [the] text.” Haha, Word is missing the article. To that, I will add “Square.” All the other wrappings styles will be ignored by me.
The “Wrap text” selection is indicating how the text will be wrapped around the image. So if we want the image to float left, we need to wrap the text on the right side and vice versa.
Is that all? No. There is a position tab, as well. This combination of telling the text to wrap on a specific side, as we just did, in combination with that, we can specify the position independently is total high-level rocket science for most people. So please have a look at the position tab so that we can talk about rocket science.
In the position tab, we can set the horizontal alignment to the Right or the Left relative to the Column. That is, regardless of the setting of how the text is flowing around the image.
There are two ways to go here.
-
Either we hook in on how the text will flow and make the image float in the opposite direction. This gives the possibility that things are messed up, namely that the image is set to the left side, but the text is set to flow only on the left side, and there is no room left for the text to flow.
-
The other way to go is to ignore the wrap text altogether. We leave it on “both sides,” and if there is room over on a side, then that is fine. I decided I will do this option.
I will ignore all other settings.
When trying to figure out how this works, I found that if an actual image location is placed at say left side relative to the column, but the setting is not yet set to left side relative to column, then it will not store the chosen value because the result of making the change had no effect in moving the image, it looked the same. It is as if Word is reasoning, “so why store the change of the settings if it did not move anything”? When changing the position of the image by dragging the image slightly and the going back to the setting dialog again and doing it again then indeed the setting will be stored as well. Is it a bug in Microsoft Word?
So now I found a spelling mistake and a bug in word, this is going well. The anchor point is the location at the paragraph of which the floating image is embedded.
Here is an example of a drawing with wp:position is set to being relative to column left with wrap square on both sides.
And here is the same for the right side:
Here is an example of the situation with an inline image that we already worked on in the previous program:
The route we need to go to get to the rIds are different inline versus floating, but the rIds is there so that with rIds is the same mechanism.
The drawing starting with inline is what we implemented in the previous program. Now we need to detect if drawing is holding a wp:anchor child.
5 August 2019
I started searching for the child's node name.
Currently, my program looks like this:
Here the variable drawing_Element holds the drawing element. Then I assign the child list to a variable named inline_Nodelist, and that is not a correct name. I cannot know that the child is an inline element before I tested that.
I do need to look into the child to test the tag.
When I established that indeed there is an anchor element, then I assign it to a variable anchor_Element.
Inside the anchor element, I would like to get to the graphic element. For that, I would like to use the Xpath method called find. Now, as you recall this only works if I added the namespace to the namespace dictionary.
I know that the graphic element of the anchor is the eights child, but can I trust that it is always going to be the eights child?
That is why it is better to use the find method. For the rest, the program is missing all sorts of checks and balances but that something I am fine with.
Then I surprised myself when I came up with this source code:
When looking carefully at the code, I realized that I don’t need to know if the child is an anchor or an inline to get to the graphic element. After refactoring such as that the graphic element is found regardless of if the first child of the drawing element was an anchor or an inline, and also extracting the alignment in case the image was floating.
If the image was inline, then the strAlign will become an empty string.
Now we need to form the markup for the images.
if (strAlign != ''): strFloat = " float:" + strAlign + ";" strImageMarkup = "≺a href="/media/{0}" target="_blank"≻ ≺img height="{1}" src="/media/{2}" style = "margin: 5px;{3}" width="400" /≻≺/a≻".format(strFileNameOriginalSize, round(iNewY), strFileNameResized, strFloat) print(strImageMarkup)
That was easy!
Next on my TODO list is the support for tables. I start on that tomorrow.
6 August 2019
It is a new day, and today, I will start implementing support for simple tables. I made a test case:
I will not support merged cells.
The XML for this table looks like this:
≺w:p w14:paraId="3E69C6E3" w14:textId="11235E63" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Here is a table:≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺w:tbl≻ ≺w:tblPr≻ ≺w:tblStyle w:val="TableGrid"/≻ ≺w:tblW w:w="0" w:type="auto"/≻ ≺w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/≻ ≺/w:tblPr≻ ≺w:tblGrid≻ ≺w:gridCol w:w="1413"/≻ ≺w:gridCol w:w="1276"/≻ ≺w:gridCol w:w="1275"/≻ ≺w:gridCol w:w="1276"/≻ ≺/w:tblGrid≻ ≺w:tr w:rsidR="002F5D70" w14:paraId="477F86D3" w14:textId="77777777" w:rsidTr="002F5D70"≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1413" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="1D04BCAA" w14:textId="14A6278D" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col1 Row1≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1276" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="59C695B4" w14:textId="1C2308C9" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col2 Row1≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1275" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="326A1353" w14:textId="13C52FB2" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col3 Row1≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1276" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="2A743770" w14:textId="0829171F" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col4 Row1≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺/w:tr≻ ≺w:tr w:rsidR="002F5D70" w14:paraId="715B36A1" w14:textId="77777777" w:rsidTr="002F5D70"≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1413" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="02641CE8" w14:textId="65D67470" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col1 Row2≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1276" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="55B591FF" w14:textId="30B971EC" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col2≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t xml:space="preserve"≻ Row2≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1275" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="31AED9CD" w14:textId="55543895" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col3≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t xml:space="preserve"≻ Row2≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1276" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="1BE6691D" w14:textId="1B0BA1C5" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col4≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t xml:space="preserve"≻ Row2≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺/w:tr≻ ≺w:tr w:rsidR="002F5D70" w14:paraId="131272F9" w14:textId="77777777" w:rsidTr="002F5D70"≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1413" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="534BA810" w14:textId="45B716D8" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t xml:space="preserve"≻Col1 ≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t≻Row≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t≻3≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1276" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="187B451B" w14:textId="28C7A35A" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col2≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t xml:space="preserve"≻ Row3≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1275" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="4C16E7BD" w14:textId="0DD27A35" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col3≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t xml:space="preserve"≻ Row3≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺w:tc≻ ≺w:tcPr≻ ≺w:tcW w:w="1276" w:type="dxa"/≻ ≺/w:tcPr≻ ≺w:p w14:paraId="25C509B7" w14:textId="5EA6FF4F" w:rsidR="002F5D70" w:rsidRDefault="002F5D70"≻ ≺w:r≻ ≺w:t≻Col4≺/w:t≻ ≺/w:r≻ ≺w:r≻ ≺w:t xml:space="preserve"≻ Row3≺/w:t≻ ≺/w:r≻ ≺/w:p≻ ≺/w:tc≻ ≺/w:tr≻ ≺/w:tbl≻
We can see that the table lives next to the paragraph. The Python-Docx library paragraphs collection is walking over paragraphs, so how I that going to work? This is the first thing we need to find out, in the Visual Studio Code Debugger.
Oh, we have a problem. The table is ignored. From an object-oriented perspective, this is what we should expect. A list of paragraphs cannot suddenly hold a table, that would be illogical because that is an object of another class.
So what should we do? Now equipped with newly gained knowledge about how Python-Docx is built on top of XML. We need to go back and refactor the loopings over the paragraphs, then we will be fine. But how?
I will need to loop over the children of the body element of the document and then for each child I will need to find out if it is a paragraph then handle it like that, and when it is a table, then I need to handle that.
The solution was to walk over all children of the body element. If the child was found to be a paragraph, then I initialized an instance of paragraph class myself. It is the first time I did this. It worked very well. Then I left the rest of the handling of the paragraph as it was already created. I had to indent the entire if clause but other than that it worked as before.
The table?
This was uncharted territory for my part.
I could not initialize the table object the same way as the paragraph. Why is that?
There was no Table in the Docx class!
The documentation said there should be a Table class:
Look, it should work. No?
Perhaps it is not included by default?
Added the explicit import of the table.
And now?
It works! The child is a table, and we initialize the table object. Awesome, now parse this thing and produce the markup.
Ooh. In the cells, there are paragraphs. Would it not be nice to send these paragraphs to my “routine” for handling paragraphs. There is just one issue, I have the if-statement of handling a paragraph, but it is not a separate routine. Perhaps I need to make it into a function so that I can call it from the handling of a cell?
So with that, the table parsing is put on hold. Instead, I start refactoring my code. I need to move the handling of the paragraph into a separate function. Then I get this:
That is cool. I moved the code into a function, and now I get to play with the rules for local vs. global variables. All state variables will need to be declared global inside the function before I can use them.
Here is the refactored entire program. Here is the handling of the paragraph changed into a function with the table handling in the making marked as a comment:
import docx import os import xml import zipfile import sys import cv2 import zlib import docx.table import xml.etree.ElementTree as ET if len(sys.argv) != 2: print("Please only call me with one parameter") sys.exit() strWordFile = sys.argv[1] doc_DocX = docx.Document(strWordFile) doc_ZipFile = zipfile.ZipFile(strWordFile, 'r') fp = doc_ZipFile.open("word/numbering.xml") numbering_Xml = ET.parse(fp) ns = { 'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main', 'a': 'http://schemas.openxmlformats.org/drawingml/2006/main', 'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' } num_Lst = numbering_Xml.findall("/w:num", ns) dictNumIDToAbstractNum = {} dictNumAbstractIDToStrFormat = {} dictListFormatToHtml = { 'bullet': 'ul', 'decimal': 'ol'} for num_Element in num_Lst: iNumId = int(num_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numId'], 10) AbstractNum_Element = num_Element.find("w:abstractNumId", ns) iAbstractNumId = int(AbstractNum_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'], 10) dictNumIDToAbstractNum[iNumId] = iAbstractNumId numAbstract_Lst = numbering_Xml.findall("/w:abstractNum", ns) for numAbstract_Element in numAbstract_Lst: iAbstractNumId = int(numAbstract_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}abstractNumId'], 10) AbstractNumLevel0Format_Element = numAbstract_Element.find("w:lvl[@w:ilvl='0']/w:numFmt", ns) strFormat = AbstractNumLevel0Format_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'] dictNumAbstractIDToStrFormat[iAbstractNumId] = strFormat strBaseDir = os.path.dirname(strWordFile) os.chdir(strBaseDir) strBaseName = os.path.basename(os.path.splitext(strWordFile)[0]) def CompareRunsProduceMarkup(run1,run2): """Compare two runs, produce markup""" strMarkup = "" if(run1.bold != run2.bold): if(run1.bold): strMarkup += "≺/em≻" else: strMarkup += "≺em≻" if(run1.italic != run2.italic): if(run1.italic): strMarkup += "≺/i≻" else: strMarkup += "≺i≻" return strMarkup def ProduceMarkupOfParagraph(paragraph_Element): """Parse a paragraph and produce the markup for it""" strMarkup = "" global strInList global prev_Run global iInListLevel if (paragraph_Element.style.name == "List Paragraph"): iNumId = paragraph_Element._element.pPr.numPr.numId.val iCurrentListLevel = paragraph_Element._element.pPr.numPr.ilvl.val iAbstractId = dictNumIDToAbstractNum[iNumId] strListFormat = dictNumAbstractIDToStrFormat[iAbstractId] if (strInList != ''): if (iCurrentListLevel ≻ iInListLevel): #Higher list level for i in range(iInListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInList] + "≻" elif (iCurrentListLevel ≺ iInListLevel): #Lower list level for i in range(iInListLevel, iCurrentListLevel, - 1): strMarkup += "≺/" + dictListFormatToHtml[strInList] + "≻" strMarkup += "≺li≻" else: strInList = strListFormat # from -1 to 0 for i in range(iInListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInList] + "≻" strMarkup += "≺li≻" iInListLevel = iCurrentListLevel else: if (strInList != ''): for i in range(iInListLevel, -1, -1): strMarkup += "≺/" + dictListFormatToHtml[strInList] + "≻" strInList = '' iInListLevel = -1 strMarkup += "≺p≻" for current_Run in paragraph_Element.runs: if len(current_Run.element.drawing_lst) ≻ 0: for drawing_Element in current_Run.element.drawing_lst: anchor_Element = list(drawing_Element)[0] graphic_Element = anchor_Element.find('a:graphic', ns) strAlign = "" if (list(drawing_Element)[0].tag == "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}anchor"): anchor_Element = list(drawing_Element)[0] positionH_Element = anchor_Element.find("wp:positionH", ns) strRelativeFrom = positionH_Element.attrib["relativeFrom"] if (strRelativeFrom == "column"): align_Element = positionH_Element.find("wp:align", ns) strAlign = align_Element.text strImageId = graphic_Element.graphicData.pic.blipFill.blip.embed image_Rel = doc_DocX.part.rels[strImageId] strImageTgt = image_Rel._baseURI[1:] + "/" + image_Rel.target_ref try: ZipInfo = doc_ZipFile.NameToInfo[strImageTgt] strFileNameOriginalSize = strBaseName + "_" + os.path.basename(ZipInfo.filename) strFileNameResized = os.path.splitext(strFileNameOriginalSize)[0] + "~1" + os.path.splitext(strFileNameOriginalSize)[1] ZipInfo.filename = strFileNameOriginalSize doc_ZipFile.extract(ZipInfo, strBaseName) iW = 400. imgOrig = cv2.imread(strBaseDir + "" + strBaseName + "" + strFileNameOriginalSize) iHeight, iWidth, iDepth = imgOrig.shape fImgScale = iW/iWidth iNewX,iNewY = imgOrig.shape[1]*fImgScale, imgOrig.shape[0]*fImgScale imgResized = cv2.resize(imgOrig,(int(iNewX),int(iNewY)),cv2.INTER_AREA) cv2.imwrite(strBaseDir + "" + strBaseName + "" + strFileNameResized, imgResized) paragraph_Element.add_run(ZipInfo.filename) strFloat = '' if (strAlign != ''): strFloat = " float:" + strAlign + ";" strImageMarkup = "≺a href="/media/{0}" target="_blank"≻ ≺img height="{1}" src="/media/{2}" style = "margin: 5px;{3}" width="400" /≻≺/a≻".format(strFileNameOriginalSize, round(iNewY), strFileNameResized, strFloat) strMarkup += strImageMarkup except KeyError: print("ERROR: Did not find {} in zip file".format(strImageTgt)) else: print("Extracted {}".format(strImageTgt)) strRunMarkup = CompareRunsProduceMarkup(prev_Run, current_Run) if (strRunMarkup != ""): strMarkup += strRunMarkup prev_Run = current_Run strMarkup += current_Run.text strMarkup += "≺/p≻" if (paragraph_Element.style.name == "List Paragraph"): strMarkup += "≺/li≻" return strMarkup prev_Run = doc_DocX.paragraphs[0].runs[0] strInList = '' # '' = No List, 'bullet' = Bullet, 'decimal' = Decimal iInListLevel = -1 for child_Element in doc_DocX.element.body: if (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl'): table_Element = docx.table.Table(child_Element, doc_DocX) strTableMarkup = "≺table≻" # for tableRow in list(table_Element.rows): # strTableMarkup += "≺tr≻" # for tableCell in tableRow.cells: # strTableMarkup += "≺/tr≻" print(strTableMarkup) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'): paragraph_Element = docx.text.paragraph.Paragraph(child_Element, doc_DocX) print(ProduceMarkupOfParagraph(paragraph_Element)) else: raise Exception("Unknown child element {}".format(child_Element.tag)) doc_DocX.save(strWordFile) del doc_DocX
With that, I can parse the table again.
Just from curiosity, how is this done? Is it so that if I got a table in a table, then it is stored next to the paragraph in the tc element just like tables are stored next to paragraphs in the body?
Yes, it does!
That means two things:
-
The handling of tables needs to be a function as well because it might need to be called from within the table handling routine.
-
The same handling of children as I created for the body needs to happen inside the routine, walking the children of the TC.
That was dead easy! (Perhaps not)
No of-course not. Look at the ProduceMarkupOfTable function. I had to add detection of the tcPr element as well, but it is not doing anything. There I could make use of the pass statement.
import docx import os import xml import zipfile import sys import cv2 import zlib import docx.table import xml.etree.ElementTree as ET if len(sys.argv) != 2: print("Please only call me with one parameter") sys.exit() strWordFile = sys.argv[1] doc_DocX = docx.Document(strWordFile) doc_ZipFile = zipfile.ZipFile(strWordFile, 'r') fp = doc_ZipFile.open("word/numbering.xml") numbering_Xml = ET.parse(fp) ns = { 'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main', 'a': 'http://schemas.openxmlformats.org/drawingml/2006/main', 'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' } num_Lst = numbering_Xml.findall("/w:num", ns) dictNumIDToAbstractNum = {} dictNumAbstractIDToStrFormat = {} dictListFormatToHtml = { 'bullet': 'ul', 'decimal': 'ol'} for num_Element in num_Lst: iNumId = int(num_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numId'], 10) AbstractNum_Element = num_Element.find("w:abstractNumId", ns) iAbstractNumId = int(AbstractNum_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'], 10) dictNumIDToAbstractNum[iNumId] = iAbstractNumId numAbstract_Lst = numbering_Xml.findall("/w:abstractNum", ns) for numAbstract_Element in numAbstract_Lst: iAbstractNumId = int(numAbstract_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}abstractNumId'], 10) AbstractNumLevel0Format_Element = numAbstract_Element.find("w:lvl[@w:ilvl='0']/w:numFmt", ns) strFormat = AbstractNumLevel0Format_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'] dictNumAbstractIDToStrFormat[iAbstractNumId] = strFormat strBaseDir = os.path.dirname(strWordFile) os.chdir(strBaseDir) strBaseName = os.path.basename(os.path.splitext(strWordFile)[0]) def CompareRunsProduceMarkup(run1,run2): """Compare two runs, produce markup""" strMarkup = "" if(run1.bold != run2.bold): if(run1.bold): strMarkup += "≺/em≻" else: strMarkup += "≺em≻" if(run1.italic != run2.italic): if(run1.italic): strMarkup += "≺/i≻" else: strMarkup += "≺i≻" return strMarkup def ProduceMarkupOfParagraph(paragraph_Element): """Parse a paragraph and produce the markup for it""" strMarkup = "" global strInList global prev_Run global iInListLevel if (paragraph_Element.style.name == "List Paragraph"): iNumId = paragraph_Element._element.pPr.numPr.numId.val iCurrentListLevel = paragraph_Element._element.pPr.numPr.ilvl.val iAbstractId = dictNumIDToAbstractNum[iNumId] strListFormat = dictNumAbstractIDToStrFormat[iAbstractId] if (strInList != ''): if (iCurrentListLevel ≻ iInListLevel): #Higher list level for i in range(iInListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInList] + "≻" elif (iCurrentListLevel ≺ iInListLevel): #Lower list level for i in range(iInListLevel, iCurrentListLevel, - 1): strMarkup += "≺/" + dictListFormatToHtml[strInList] + "≻" strMarkup += "≺li≻" else: strInList = strListFormat # from -1 to 0 for i in range(iInListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInList] + "≻" strMarkup += "≺li≻" iInListLevel = iCurrentListLevel else: if (strInList != ''): for i in range(iInListLevel, -1, -1): strMarkup += "≺/" + dictListFormatToHtml[strInList] + "≻" strInList = '' iInListLevel = -1 strMarkup += "≺p≻" for current_Run in paragraph_Element.runs: if len(current_Run.element.drawing_lst) ≻ 0: for drawing_Element in current_Run.element.drawing_lst: anchor_Element = list(drawing_Element)[0] graphic_Element = anchor_Element.find('a:graphic', ns) strAlign = "" if (list(drawing_Element)[0].tag == "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}anchor"): anchor_Element = list(drawing_Element)[0] positionH_Element = anchor_Element.find("wp:positionH", ns) strRelativeFrom = positionH_Element.attrib["relativeFrom"] if (strRelativeFrom == "column"): align_Element = positionH_Element.find("wp:align", ns) strAlign = align_Element.text strImageId = graphic_Element.graphicData.pic.blipFill.blip.embed image_Rel = doc_DocX.part.rels[strImageId] strImageTgt = image_Rel._baseURI[1:] + "/" + image_Rel.target_ref try: ZipInfo = doc_ZipFile.NameToInfo[strImageTgt] strFileNameOriginalSize = strBaseName + "_" + os.path.basename(ZipInfo.filename) strFileNameResized = os.path.splitext(strFileNameOriginalSize)[0] + "~1" + os.path.splitext(strFileNameOriginalSize)[1] ZipInfo.filename = strFileNameOriginalSize doc_ZipFile.extract(ZipInfo, strBaseName) iW = 400. imgOrig = cv2.imread(strBaseDir + "" + strBaseName + "" + strFileNameOriginalSize) iHeight, iWidth, iDepth = imgOrig.shape fImgScale = iW/iWidth iNewX,iNewY = imgOrig.shape[1]*fImgScale, imgOrig.shape[0]*fImgScale imgResized = cv2.resize(imgOrig,(int(iNewX),int(iNewY)),cv2.INTER_AREA) cv2.imwrite(strBaseDir + "" + strBaseName + "" + strFileNameResized, imgResized) paragraph_Element.add_run(ZipInfo.filename) strFloat = '' if (strAlign != ''): strFloat = " float:" + strAlign + ";" strImageMarkup = "≺a href="/media/{0}" target="_blank"≻ ≺img height="{1}" src="/media/{2}" style = "margin: 5px;{3}" width="400" /≻≺/a≻".format(strFileNameOriginalSize, round(iNewY), strFileNameResized, strFloat) strMarkup += strImageMarkup except KeyError: print("ERROR: Did not find {} in zip file".format(strImageTgt)) else: print("Extracted {}".format(strImageTgt)) strRunMarkup = CompareRunsProduceMarkup(prev_Run, current_Run) if (strRunMarkup != ""): strMarkup += strRunMarkup prev_Run = current_Run strMarkup += current_Run.text strMarkup += "≺/p≻" if (paragraph_Element.style.name == "List Paragraph"): strMarkup += "≺/li≻" return strMarkup def ProduceMarkupOfTable(table_Element): """Parse a table and produce the markup for it""" strMarkup = "≺table≻" for tableRow in list(table_Element.rows): strMarkup += "≺tr≻" for tableCell in tableRow.cells: strMarkup += "≺td≻" for child_Element in list(tableCell._element): if (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl'): tableChild_Element = docx.table.Table(child_Element, doc_DocX) strMarkup += ProduceMarkupOfTable(tableChild_Element) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'): paragraph_Element = docx.text.paragraph.Paragraph(child_Element, doc_DocX) strMarkup += ProduceMarkupOfParagraph(paragraph_Element) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tcPr'): pass # Ignore else: raise Exception("Unknown child element {}".format(child_Element.tag)) strMarkup += "≺/td≻" strMarkup += "≺/tr≻" strMarkup += "≺/table≻" return strMarkup prev_Run = doc_DocX.paragraphs[0].runs[0] strInList = '' # '' = No List, 'bullet' = Bullet, 'decimal' = Decimal iInListLevel = -1 for child_Element in doc_DocX.element.body: if (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl'): table_Element = docx.table.Table(child_Element, doc_DocX) print(ProduceMarkupOfTable(table_Element)) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'): paragraph_Element = docx.text.paragraph.Paragraph(child_Element, doc_DocX) print(ProduceMarkupOfParagraph(paragraph_Element)) else: raise Exception("Unknown child element {}".format(child_Element.tag)) doc_DocX.save(strWordFile) del doc_DocX
With this, I finished point 2 of my TODO list. Next up is handling the text of style Code. But I will do that tomorrow.
August 7, 2019
Today I will parse styles. The idea is that I will define a style, with a specific name, and then mark the text with that style to indicate that it is source code that should be displayed with the pre tag.
By now, you know what I do first. I open my test document in Microsoft Word and insert a test case. I define a new style like this. (This screenshot says modify when I first created it, it said Create Style)
Then I checked the “New documents based on this template” because that way I have it every time I open a new document. I also changed the style to have no spell-checking. Please note that the “Style for [the] following paragraph” is displaying both a paragraph and a character. Haha, Microsoft is missing a correct article. This is the second time I find this.
Then I selected texts and assigned it to my new style.
Here is my test case where the grey texts are marked as Code.
Then I make a backup of the document, and then I change the extension of the backup to ZIP and unpack the content of the archive. Here is the style of the text Python-Docx in the source:
And here is the first lines of the source code marked as Code:
The run can have a style definition, and there it is called CodeChar, and the paragraph can have a style definition, and there it is called Code.
There are three modes for styles:
-
Paragraph style can only be applied to full paragraphs.
-
Character style can only be applied to sub-ranges of paragraphs.
-
Paragraph AND Character style, this can be both.
It is the most convenient to specify that a style is of mode 3. A rhetorical question on this, how do you think that Microsoft solves this?
With a table of course, and not any table, the styles table. This is stored in the zip archive of the word document.
There is much information in the style table file. It has style definitions types for paragraph, character, and tables. A style definition can link to another style.
August 8, 2019
On the one hand, it is tempting to hard-code the name of the styles and detect these. On the other hand, it is perhaps not so difficult to do it correctly?
Suppose the style information is loaded into a dictionary where the key is the name of the style. Then when I find a run with a style definition CodeChar then I can look it up in the style table. I will get the character entry of the style. From there can I look up again the linked style, and arrive at the paragraph style Code.
It is tempting to store all the information about each style, but that is not necessary at this stage of the program. If I need more information at a later stage, then I can add that. It is not good to make a program complex for possible future use.
It will suffice with a dictionary with the key of each style, and the value is the id of the style linked to.
How do I get to the styles table? Is that available in Python-DocX?
In Python-Docx there are styles. It has somehow a relation to the document, but it is not easy from a first glance to get to the information that I need. The goal of the library is to produce a new word document from scratch and what I am doing is parsing an existing document. Am I better off with another library?
It is not good to give up. What is the documentation saying?
Look! There is a place for styles I had not found. In the document class:
Looks empty to me, what am I doing wrong?
Ooh, wait a moment. I need to look into MY document. That is the instantiation of the document I have not the class library itself.
August 9, 2019
It got stuck in finding the style table in Python-DocX. Today I will resume my search.
This is confusing. Did I not search here already? So now I have all the information I could possibly need? No, not at all. The link is missing.
August 11, 2019
Already a couple of days since I worked on my Python project. I managed to do other things, though. Now I am back, and I made up my mind on the missing link. I will get it myself.
I even looked into the Python-DocX library if I could find the spot where the link should be and perhaps add it myself (and then submit an update to the Python-DocX people), but I am just a beginner in this game.
With that, I am left with loading that table myself. This is similar to the way I opened and parsed the numbers document.
As a consequence, I get back into using the XPath from ElementTree, but it has only a subset of XPath. The XPath language is complex as it is. It is frustrating having that it is not working and then have to figure out if it is due to the syntax of the query I have given or if it is due to the ElementTree implementation of XPath. What is what?
I went searching for better implementations for XPath, and they exist.
I will try to get LXML and try that.
I already have it installed?
There is documentation for lxml at https://lxml.de/parsing.html
Unable to open ‘etree.pyx’: Unable to read the file (Error: File not found)
I searched for this problem for hours. Then I came to this page:
https://stackoverflow.com/questions/37642258/cant-parse-tag-from-xml-using-python
It is in itself slightly unrelated. I, for example, use import lxml instead of the form they use. I switched the order such that lxml is imported before the regular XML library. No idea if that helped.
However, this answer was interesting:
I provided the namespace as suggested here above, and it worked.
Here I open the styles part of the zip as a file pointer and then let lxml parse it directly from the zip in memory, and then I extract the style nodes with type as a paragraph with the help of XPath. The style nodes are returned as a list.
And here is the snippet of Python, creating a dictionary by reading the styles.xml and finding how the custom styles are linked to each other.
fpStyles = doc_ZipFile.open("word/styles.xml") styles_Xml = lxml.etree.parse(fpStyles) style_Lst = styles_Xml.xpath("/w:styles/w:style[(@w:type='paragraph' or @w:type='character') and @w:customStyle='1']", namespaces=ns) dictStyleIDLinkToOtherStyle = {} for style_Element in style_Lst: strStyleId = style_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}styleId'] strLinkToOtherStyle = style_Element.xpath("w:link/@w:val", namespaces=ns)[0] dictStyleIDLinkToOtherStyle[strStyleId] = strLinkToOtherStyle
It is worth noting that this kind of XPath expression is considerably more complex than the subset of XPath that the standard XML can handle. Here it is also possible to get the value of an attribute via the XPath expression, which is not possible with the subset form of XPath.
So now I can get back at where I was August 7. The run properties can have a style, and also the paragraph can have a style. Let’s focus on the run properties for a while, and then we come back to the paragraph in a moment. Do you remember the first thing I did with where I parsed the run properties? It was August 2 when I talked about this last time. I created a function called CompareRunsProduceMarkup. This is the function that will need to handle the character style changes.
It is funny that CompareRunsProduceMarkup is the first function ever that I made with Python. Feels like ages ago. This snippet is showing the new function:
def CompareRunsProduceMarkup(run1,run2): """Compare two runs, produce markup""" strMarkup = "" if(run1.bold != run2.bold): if(run1.bold): strMarkup += "≺/em≻" else: strMarkup += "≺em≻" if(run1.italic != run2.italic): if(run1.italic): strMarkup += "≺/i≻" else: strMarkup += "≺i≻" if (run1.style.style_id != run2.style.style_id): strStyleRun1 = "n/a" strStyleRun2 = "n/a" if (run1.style.style_id == 'Code' or (dictStyleIDLinkToOtherStyle.__contains__(run1.style.style_id) and dictStyleIDLinkToOtherStyle[run1.style.style_id] == 'Code')): strStyleRun1 = "Code" if (run2.style.style_id == 'Code' or (dictStyleIDLinkToOtherStyle.__contains__(run2.style.style_id) and dictStyleIDLinkToOtherStyle[run2.style.style_id] == 'Code')): strStyleRun2 = "Code" if (strStyleRun1 == "Code" and strStyleRun2 != "Code"): strMarkup += "≺/pre≻" if (strStyleRun1 != "Code" and strStyleRun2 == "Code"): strMarkup += "≺pre≻" return strMarkup
And the result looks like this:
Now it is time to parse the paragraph style, but that is for another day.
August 12, 2019
Today I will look into parsing the style of the paragraph. Here is the original function of ProduceMarkupOfParagraph. It has become a big function!
def ProduceMarkupOfParagraph(paragraph_Element): """Parse a paragraph and produce the markup for it""" strMarkup = "" global strInList global prev_Run global iInListLevel if (paragraph_Element.style.name == "List Paragraph"): iNumId = paragraph_Element._element.pPr.numPr.numId.val iCurrentListLevel = paragraph_Element._element.pPr.numPr.ilvl.val iAbstractId = dictNumIDToAbstractNum[iNumId] strListFormat = dictNumAbstractIDToStrFormat[iAbstractId] if (strInList != ''): if (iCurrentListLevel ≻ iInListLevel): #Higher list level for i in range(iInListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInList] + "≻" elif (iCurrentListLevel ≺ iInListLevel): #Lower list level for i in range(iInListLevel, iCurrentListLevel, - 1): strMarkup += "≺/" + dictListFormatToHtml[strInList] + "≻" strMarkup += "≺li≻" else: strInList = strListFormat # from -1 to 0 for i in range(iInListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInList] + "≻" strMarkup += "≺li≻" iInListLevel = iCurrentListLevel else: if (strInList != ''): for i in range(iInListLevel, -1, -1): strMarkup += "≺/" + dictListFormatToHtml[strInList] + "≻" strInList = '' iInListLevel = -1 strMarkup += "≺p≻" for current_Run in paragraph_Element.runs: if len(current_Run.element.drawing_lst) ≻ 0: for drawing_Element in current_Run.element.drawing_lst: anchor_Element = list(drawing_Element)[0] graphic_Element = anchor_Element.find('a:graphic', ns) strAlign = "" if (list(drawing_Element)[0].tag == "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}anchor"): anchor_Element = list(drawing_Element)[0] positionH_Element = anchor_Element.find("wp:positionH", ns) strRelativeFrom = positionH_Element.attrib["relativeFrom"] if (strRelativeFrom == "column"): align_Element = positionH_Element.find("wp:align", ns) strAlign = align_Element.text strImageId = graphic_Element.graphicData.pic.blipFill.blip.embed image_Rel = doc_DocX.part.rels[strImageId] strImageTgt = image_Rel._baseURI[1:] + "/" + image_Rel.target_ref try: ZipInfo = doc_ZipFile.NameToInfo[strImageTgt] strFileNameOriginalSize = strBaseName + "_" + os.path.basename(ZipInfo.filename) strFileNameResized = os.path.splitext(strFileNameOriginalSize)[0] + "~1" + os.path.splitext(strFileNameOriginalSize)[1] ZipInfo.filename = strFileNameOriginalSize doc_ZipFile.extract(ZipInfo, strBaseName) iW = 400. imgOrig = cv2.imread(strBaseDir + "" + strBaseName + "" + strFileNameOriginalSize) iHeight, iWidth, iDepth = imgOrig.shape fImgScale = iW/iWidth iNewX,iNewY = imgOrig.shape[1]*fImgScale, imgOrig.shape[0]*fImgScale imgResized = cv2.resize(imgOrig,(int(iNewX),int(iNewY)),cv2.INTER_AREA) cv2.imwrite(strBaseDir + "" + strBaseName + "" + strFileNameResized, imgResized) paragraph_Element.add_run(ZipInfo.filename) strFloat = '' if (strAlign != ''): strFloat = " float:" + strAlign + ";" strImageMarkup = "≺a href="/media/{0}" target="_blank"≻ ≺img height="{1}" src="/media/{2}" style = "margin: 5px;{3}" width="400" /≻≺/a≻".format(strFileNameOriginalSize, round(iNewY), strFileNameResized, strFloat) strMarkup += strImageMarkup except KeyError: print("ERROR: Did not find {} in zip file".format(strImageTgt)) else: print("Extracted {}".format(strImageTgt)) strRunMarkup = CompareRunsProduceMarkup(prev_Run, current_Run) if (strRunMarkup != ""): strMarkup += strRunMarkup prev_Run = current_Run strMarkup += current_Run.text strMarkup += "≺/p≻" if (paragraph_Element.style.name == "List Paragraph"): strMarkup += "≺/li≻" return strMarkup
The general flow of this function is to maintain a state and compare with that state. I think it will work if I introduce a state indicating that we are in a style of name Code.
When I look at the current source code, I notice that the numbering is also based on the style. How is that working when there is both numbering and a special style. Let us figure out!
Ooh, this reveals a problem with the number parsing:
My assumption was that a paragraph has the style “List Paragraph” to be a number item, but that is not true. It has to have a number properties element though.
That is much better.
So now we can get back to parsing the style. Added this to the beginning of the paragraph parsing function:
And this was the result:
That looked fine enough. Later in the same document, I had other samples of full paragraphs:
This is not good. The paragraph with “Day 3” starts with an emphasis, and that is good (Comment from future me, I meant to say strong). Then before that paragraph ends the en of the emphasis should come, but it is not there.
How is that working in Word? Is it so that if a paragraph starts a bold, then it ends that bold before leaving the paragraph?
To find out I select bold right across the end of the paragraph into the next:
Please note that the text already has HTML markup. I will not do any special handling of that.
Yes. Microsoft Word defines bold for the first part, then it does it again for the beginning of the next paragraph.
With that, it is correct to end the run markup at the end of the paragraph. So let us do that first to see how far we are getting. It is only possible to end ongoing run markup if the state is stored. So I will need to remember that.
The run comparison is useless, doing it this way. The idea was that I should compare two runs to get the markup changes, but if I will keep the status of the runs, then I just need the current run and compare with the state. I surrender, I will do that.
Changing the name of CompareRunsProduceMarkup into AnalyzeRunProduceMarkup. That is so easy in Visual Studio Code, just hit F2
Here is the result after a bit of changing here and there:
Much better! Here is the example of the start of bold in one para and ending it in another:
Observe that the text HAD HTML markup, so that is just fine. (When the program was almost finished, I found out that I mixed up ‘em’ and ‘strong’. Until the end, I will be talking about em while meaning strong.)
In the code, I could now introduce an else clause to the for loop handling the runs so that when all runs had finished, I could call the function to close all open runs. Fantastic!
My ever first test case of the style still works, that is great.
Oops, there something went wrong. The closing of the em should not be there.
Oooh, I am comparing the state with the current value of bold or italic, but if they are not initialized, the state is set to None.
The solution to that is to use the value None as False, and when it is True, it is indeed true. That way, when the values are not defined, they compare fine.
Much better!
With that, I am done with the style Code! Do you still recall my TODO list from 4 August 12, 2019? In that list, I have one item left: Parsing hyperlinks and generate the markup for the hyperlink.
But that is for another day.
August 13, 2019
Today I will look at parsing the hyperlink. The way I do this is that I introduce a test case in a test document, and then I look into how a feature is implemented and then I adjust the parser to handle the feature as I would like to.
The hyperlink lives next to the runs inside a paragraph. If the code looks like this (and it does):
Then it is not far fetched to think that a hyperlink element will not be found when walking the runs. I tried this, and this assumption is correct. I will not see the hyperlink if I walk the runs.
We had this challenge already earlier when we worked on parsing tables. If I look at the paragraph_Element._element as a list then I get to see the children including the hyperlink.
My function for parsing the paragraph has a loop over runs changed to be a loop over paragraph child elements.
When doing that each child has to be tested for what it is, and for that, I create a function ParseParagraphChildElement.
When a run is found it is sent to the original handling of the run refactored into a function ProduceMarkupOfRun.
When a hyperlink is found it is sent to a new function handling those:
It also makes use of the function parsing children of paragraph elements.
It works, look!
With that, I finished the goals of my TODO list. Now I just need to wrap up and deploy this in a functioning program. The original program, from the previous part of this series, modified the word document and then we saved it. Up until now, I have printed the output to the console. It would be better to put the output in a file.
Tried my program and all this time I confused em with italic. It should be strong!
It turns out that writing the text to a file is not that complicated. I import codecs and then opening a file with the filename I want and then instead print to console I did write to the file. Then close the file and done!
August 14, 2019
I tried my program on this very blog post!!!
Halfway through the first code section the italic style took over the text and then it never released it. I figured out that when including html in the text and it contains an html code, like my program listing is doing, then that is interpreted as real html.
The solution is to escape the html codes in code. For this I introduced a new state for paragraphs in style code.
I included an awful lot of images in this blog post. GREAT!
Python has been a great experience so far. The Python-DocX library has not been entirely helpful to me but it is a good project. The makers of it should keep up the good work! My program is missing a lot of possible checks an balances, I am fine with that. I will fix them as they arrive.
With that I finish this blog post with the final program. BYE!
A few moments later I finished the upload and found out that my blog engine handled the pre lines horribly. It could not handle having a pre statement for every line. To cope with that I had to introduce another state bPrevParaInStyleCode that keeps the possible state of the previous line in case it was of style Code and if so only begin and end the pre statement at the beginning of an entire block or end of block.
I also had to remove para markup and replace that with newlines when in Code. So there you have it. This is so much fun! I can blog a hundred fold faster with this program. Haha…
import docx import os import lxml import xml from io import StringIO,BytesIO import zipfile import sys import cv2 import zlib import docx.table import codecs import html import xml.etree.ElementTree as ET if len(sys.argv) != 2: print("Please only call me with one parameter") sys.exit() strWordFile = sys.argv[1] doc_DocX = docx.Document(strWordFile) doc_ZipFile = zipfile.ZipFile(strWordFile, 'r') fpNumbering = doc_ZipFile.open("word/numbering.xml") numbering_Xml = ET.parse(fpNumbering) ns = { 'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main', 'a': 'http://schemas.openxmlformats.org/drawingml/2006/main', 'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' } num_Lst = numbering_Xml.findall("./w:num", ns) dictNumIDToAbstractNum = {} dictNumAbstractIDToStrFormat = {} dictListFormatToHtml = { 'bullet': 'ul', 'decimal': 'ol'} for num_Element in num_Lst: iNumId = int(num_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numId'], 10) AbstractNum_Element = num_Element.find("w:abstractNumId", ns) iAbstractNumId = int(AbstractNum_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'], 10) dictNumIDToAbstractNum[iNumId] = iAbstractNumId numAbstract_Lst = numbering_Xml.findall("./w:abstractNum", ns) for numAbstract_Element in numAbstract_Lst: iAbstractNumId = int(numAbstract_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}abstractNumId'], 10) AbstractNumLevel0Format_Element = numAbstract_Element.find("w:lvl[@w:ilvl='0']/w:numFmt", ns) strFormat = AbstractNumLevel0Format_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'] dictNumAbstractIDToStrFormat[iAbstractNumId] = strFormat fpStyles = doc_ZipFile.open("word/styles.xml") styles_Xml = lxml.etree.parse(fpStyles) style_Lst = styles_Xml.xpath("/w:styles/w:style[(@w:type='paragraph' or @w:type='character') and @w:customStyle='1']", namespaces=ns) dictStyleIDLinkToOtherStyle = {} for style_Element in style_Lst: strStyleId = style_Element.attrib['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}styleId'] strLinkToOtherStyle = style_Element.xpath("w:link/@w:val", namespaces=ns)[0] dictStyleIDLinkToOtherStyle[strStyleId] = strLinkToOtherStyle strBaseDir = os.path.dirname(strWordFile) os.chdir(strBaseDir) strBaseName = os.path.basename(os.path.splitext(strWordFile)[0]) def AnalyzeStateChangesInRunProduceMarkup(run): """Compare a run with the state, produce markup""" global bInRunBold global bInRunItalic global bInRunStyleCode strMarkup = "" if(run.bold != bInRunBold): if(run.bold): strMarkup += "≺strong≻" bInRunBold = True else: strMarkup += "≺/strong≻" bInRunBold = None if(run.italic != bInRunItalic): if(run.italic): strMarkup += "≺em≻" bInRunItalic = True else: strMarkup += "≺/em≻" bInRunItalic = None bRunInStyleCode = False if ('Code' == run.style.style_id or (dictStyleIDLinkToOtherStyle.__contains__(run.style.style_id) and dictStyleIDLinkToOtherStyle[run.style.style_id] == 'Code')): bRunInStyleCode = True if (bRunInStyleCode != bInRunStyleCode): if (bRunInStyleCode): strMarkup += "≺i≻" else: strMarkup += "≺/i≻" bInRunStyleCode = bRunInStyleCode return strMarkup def EndOngoingRunMarkup(): """End ongoing runs, produce markup""" global bInRunBold global bInRunItalic global bInRunStyleCode strMarkup = "" if (bInRunStyleCode): strMarkup += "≺/i≻" bInRunStyleCode = None if(bInRunItalic): strMarkup += "≺/em≻" bInRunItalic = None if(bInRunBold): strMarkup += "≺/strong≻" bInRunBold = None return strMarkup def ProduceMarkupOfRun(current_Run): """This function produces the markup of a run""" global bInParaStyleCode global bInRunStyleCode strMarkup = "" if len(current_Run.element.drawing_lst) ≻ 0: for drawing_Element in current_Run.element.drawing_lst: anchor_Element = list(drawing_Element)[0] graphic_Element = anchor_Element.find('a:graphic', ns) strAlign = "" if (list(drawing_Element)[0].tag == "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}anchor"): anchor_Element = list(drawing_Element)[0] positionH_Element = anchor_Element.find("wp:positionH", ns) strRelativeFrom = positionH_Element.attrib["relativeFrom"] if (strRelativeFrom == "column"): align_Element = positionH_Element.find("wp:align", ns) strAlign = align_Element.text strImageId = graphic_Element.graphicData.pic.blipFill.blip.embed image_Rel = doc_DocX.part.rels[strImageId] strImageTgt = image_Rel._baseURI[1:] + "/" + image_Rel.target_ref ZipInfo = doc_ZipFile.NameToInfo[strImageTgt] strFileNameOriginalSize = strBaseName + "_" + os.path.basename(ZipInfo.filename) strFileNameResized = os.path.splitext(strFileNameOriginalSize)[0] + "~1" + os.path.splitext(strFileNameOriginalSize)[1] ZipInfo.filename = strFileNameOriginalSize doc_ZipFile.extract(ZipInfo, strBaseName) iW = 400. imgOrig = cv2.imread(strBaseDir + "" + strBaseName + "" + strFileNameOriginalSize) iHeight, iWidth, iDepth = imgOrig.shape fImgScale = iW/iWidth iNewX,iNewY = imgOrig.shape[1]*fImgScale, imgOrig.shape[0]*fImgScale imgResized = cv2.resize(imgOrig,(int(iNewX),int(iNewY)),cv2.INTER_AREA) cv2.imwrite(strBaseDir + "" + strBaseName + "" + strFileNameResized, imgResized) paragraph_Element.add_run(ZipInfo.filename) strFloat = '' if (strAlign != ''): strFloat = " float:" + strAlign + ";" strMarkup += "≺a href="/media/{0}" target="_blank"≻ ≺img height="{1}" src="/media/{2}" style = "margin: 5px;{3}" width="400" /≻≺/a≻".format(strFileNameOriginalSize, round(iNewY), strFileNameResized, strFloat) strRunMarkup = AnalyzeStateChangesInRunProduceMarkup(current_Run) if (strRunMarkup != ""): strMarkup += strRunMarkup if (bInParaStyleCode or bInRunStyleCode): strMarkup += html.escape(current_Run.text) else: strMarkup += current_Run.text return strMarkup def ParseParagraphChildElement(paragraphChild_Element): """This function parse the childs of a paragraph like runs etc""" strMarkup = "" if (paragraphChild_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink'): strMarkup += ProduceMarkupOfHyperlink(paragraphChild_Element) elif (paragraphChild_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}r'): run = docx.text.run.Run(paragraphChild_Element, doc_DocX) strMarkup += ProduceMarkupOfRun(run) elif (paragraphChild_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}bookmarkStart'): pass elif (paragraphChild_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}bookmarkEnd'): pass elif (paragraphChild_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr'): pass elif (paragraphChild_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}proofErr'): pass else: raise Exception("Unknown child element of paragraph") return strMarkup def ProduceMarkupOfHyperlink(hyperlink_Element): """This function produce the markup of a hyperlink element""" link_id = hyperlink_Element.attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'] image_Rel = doc_DocX.part.rels[link_id] image_Rel.target_ref strMarkup = "≺a href = "{}"≻".format(image_Rel.target_ref) for paragraphChild_Element in list(hyperlink_Element): strMarkup += ParseParagraphChildElement(paragraphChild_Element) strMarkup += "≺/a≻" return strMarkup def ProduceMarkupOfParagraph(paragraph_Element): """Parse a paragraph and produce the markup for it""" strMarkup = "" global strInParaList global iInParaListLevel global bInRunStyleCode global bInParaStyleCode global bPrevParaInStyleCode if (paragraph_Element.style.style_id == "Code"): if (not(bPrevParaInStyleCode)): strMarkup += "≺pre≻" bInParaStyleCode = True else: if (bPrevParaInStyleCode): strMarkup += "≺/pre≻" if (paragraph_Element._p.pPr != None and paragraph_Element._p.pPr.numPr != None): iNumId = paragraph_Element._element.pPr.numPr.numId.val iCurrentListLevel = paragraph_Element._element.pPr.numPr.ilvl.val iAbstractId = dictNumIDToAbstractNum[iNumId] strListFormat = dictNumAbstractIDToStrFormat[iAbstractId] if (strInParaList != ''): if (iCurrentListLevel ≻ iInParaListLevel): #Higher list level for i in range(iInParaListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInParaList] + "≻" elif (iCurrentListLevel ≺ iInParaListLevel): #Lower list level for i in range(iInParaListLevel, iCurrentListLevel, - 1): strMarkup += "≺/" + dictListFormatToHtml[strInParaList] + "≻" strMarkup += "≺li≻" else: strInParaList = strListFormat # from -1 to 0 for i in range(iInParaListLevel, iCurrentListLevel): strMarkup += "≺" + dictListFormatToHtml[strInParaList] + "≻" strMarkup += "≺li≻" iInParaListLevel = iCurrentListLevel else: if (strInParaList != ''): for i in range(iInParaListLevel, -1, -1): strMarkup += "≺/" + dictListFormatToHtml[strInParaList] + "≻" strInParaList = '' iInParaListLevel = -1 if (not(bInParaStyleCode)): strMarkup += "≺p≻" for paragraphChild_Element in list(paragraph_Element._element): strMarkup += ParseParagraphChildElement(paragraphChild_Element) else: strMarkup += EndOngoingRunMarkup() if (bInParaStyleCode): strMarkup += " " else: strMarkup += "≺/p≻" if (paragraph_Element._p.pPr != None and paragraph_Element._p.pPr.numPr != None): strMarkup += "≺/li≻" if (paragraph_Element.style.style_id == "Code"): bInParaStyleCode = False bPrevParaInStyleCode = True else: bPrevParaInStyleCode = False return strMarkup def ProduceMarkupOfTable(table_Element): """Parse a table and produce the markup for it""" strMarkup = "≺table≻" for tableRow in list(table_Element.rows): strMarkup += "≺tr≻" for tableCell in tableRow.cells: strMarkup += "≺td≻" for child_Element in list(tableCell._element): if (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl'): tableChild_Element = docx.table.Table(child_Element, doc_DocX) strMarkup += ProduceMarkupOfTable(tableChild_Element) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'): paragraph_Element = docx.text.paragraph.Paragraph(child_Element, doc_DocX) strMarkup += ProduceMarkupOfParagraph(paragraph_Element) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tcPr'): pass # Ignore else: raise Exception("Unknown child element {}".format(child_Element.tag)) strMarkup += "≺/td≻" strMarkup += "≺/tr≻" strMarkup += "≺/table≻" return strMarkup strInParaList = '' # '' = No List, 'bullet' = Bullet, 'decimal' = Decimal iInParaListLevel = -1 bInParaStyleCode = False bPrevParaInStyleCode = False bInRunStyleCode = False bInRunBold = False bInRunItalic = False if not os.path.exists(strBaseDir + "" + strBaseName): os.mkdir(strBaseDir + "" + strBaseName) file = codecs.open(strBaseDir + "" + strBaseName + "" + strBaseName + ".htm", "w", "utf-8") file.write(u'ufeff') for child_Element in doc_DocX.element.body: if (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl'): table_Element = docx.table.Table(child_Element, doc_DocX) file.write(ProduceMarkupOfTable(table_Element)) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'): paragraph_Element = docx.text.paragraph.Paragraph(child_Element, doc_DocX) file.write(ProduceMarkupOfParagraph(paragraph_Element)) elif (child_Element.tag == '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sectPr'): pass else: raise Exception("Unknown child element {}".format(child_Element.tag)) file.close() del doc_DocX
I moved from Sweden to The Netherlands in 1995.
Here on this site, you find my creations because that is what I do. I create.