.docx to HTML

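Recently, a client asked me for a feature that displays the contents of Word documents in a web browser. As you may have guessed, my first thought was to look for a PHP library that would do it for me. After several days of trying options that didn't solve the problem, I finally decided to switch to Python.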

python-docx 0.8.7 documentation
python-docx.readthedocs.io

Automate the Boring Stuff with Python
automatetheboringstuff.com

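My search led me to the two resources above, and at first I could have sworn Christmas had come early. python-docx appears to be popular and widely used, but halfway through the task I realized the documentation isn't that comprehensive. Most of it focuses on creating documents rather than reading them. Reading images proved especially difficult, and non-inline (floating) images were even more confusing. You can read the images, but not in the order in which they occur, and it isn't clear how to read text and images together in document order.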

Here is how I managed to get the job done.

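In case you didn't know, a .docx file is really a ZIP archive containing XML and a bunch of resource files. You can check this by taking a .docx document and adding a .zip extension to its name: if the file is named sample.docx, rename it to sample.docx.zip. Unzip it and you will see all the different files that make up the document. The python-docx package was built with this in mind, so in principle you are just parsing XML.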
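
To make the "it's just a ZIP of XML" point concrete, here is a minimal sketch that uses only the standard library, without python-docx. The helper names `list_docx_parts` and `read_paragraph_texts` are made up for this illustration. The sketch assumes the main document part lives at `word/document.xml`, which is where Word normally puts it.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_docx_parts(source):
    """Return the names of all files stored inside a .docx archive.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper, not part of python-docx.)
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_paragraph_texts(source):
    """Return the plain text of every <w:p> element in word/document.xml.

    Paragraphs nested inside tables are included, because iter() walks
    all descendants. (Hypothetical helper, not part of python-docx.)
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for paragraph in root.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over one or more <w:t> elements.
        pieces = [t.text or "" for t in paragraph.iter("{%s}t" % W_NS)]
        texts.append("".join(pieces))
    return texts
```

Calling `list_docx_parts("sample.docx")` on a real file will typically show entries such as `[Content_Types].xml` and `word/document.xml`, plus any embedded images under `word/media/`.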

Install the python-docx package with pip install python-docx.

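    from __future__ import (
        absolute_import, division, print_function, unicode_literals
    )

    import re
    import time
    import xml.etree.ElementTree as ET

    from docx import Document
    from docx.document import Document as _Document
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import _Cell, Table
    from docx.text.paragraph import Paragraph


    def index():
        path = "/path/to/sample.docx"
        # dir_path and book_id are not defined in this simplified listing;
        # the two values below are placeholders.
        dir_path = "/path/to/images"  # directory where images are written
        book_id = 1                   # only used to build image file names
        document = Document(path)
        body = ""
        list_items = []
        for block in iter_block_items(document):
            if isinstance(block, Paragraph):
                tmp_heading_type = get_heading_type(block)
                if re.match(r"List\sParagraph", tmp_heading_type):
                    # Buffer consecutive list paragraphs so they can be
                    # rendered together. The <li> tag is an assumption.
                    list_items.append("<li>" + block.text + "</li>")
                else:
                    images = render_image(document, block, dir_path, book_id)
                    if len(list_items) > 0:
                        body += render_list_items(list_items)
                        list_items = []
                    if len(images) > 0:
                        body = body + images
                    else:
                        body = body + render_runs(block.runs)
            elif isinstance(block, Table):
                # Flush a pending list so it stays ahead of the table.
                if len(list_items) > 0:
                    body += render_list_items(list_items)
                    list_items = []
                body += render_table(block)
        # Flush a list that ends the document.
        if len(list_items) > 0:
            body += render_list_items(list_items)
        return body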

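
    def iter_block_items(parent):
        """
        Generate a reference to each paragraph and table child within
        *parent*, in document order. Each returned value is an instance of
        either Table or Paragraph. *parent* would most commonly be a
        reference to a main Document object, but also works for a _Cell
        object, which itself can contain paragraphs and tables.
        """
        if isinstance(parent, _Document):
            parent_elm = parent.element.body
        elif isinstance(parent, _Cell):
            parent_elm = parent._tc
        else:
            raise ValueError("something's not right")
        for child in parent_elm.iterchildren():
            if isinstance(child, CT_P):
                yield Paragraph(child, parent)
            elif isinstance(child, CT_Tbl):
                yield Table(child, parent)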
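

    def table_print(block):
        # Debug helper: print a table as plain text, one row per line.
        table = block
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text, '  ', end='')
                    # y.write(paragraph.text)
                    # y.write('  ')
            print("\n")
            # y.write("\n")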
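

    def render_table(block):
        # The table/tr/td tags and their nesting are assumptions.
        table = block
        html = "<table>"
        for row in table.rows:
            html += "<tr>"
            for cell in row.cells:
                html += "<td>"
                for paragraph in cell.paragraphs:
                    html += paragraph.text + ""
                html += "</td>"
            html += "</tr>"
        html += "</table>"
        return html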


    def render_runs(runs):
        # Join the text of all runs into one paragraph (<p> is assumed).
        html = "<p>"
        for run in runs:
            html = html + run.text
        html += "</p>"
        return html


    def render_list_items(items):
        # Wrap already-rendered list items in a list (<ul> is assumed).
        html = "<ul>"
        for item in items:
            html += item
        html += "</ul>"
        return html


    def get_heading_type(block):
        # The style name, e.g. "List Paragraph".
        return block.style.name


    def render_image(document, par, dir_path, book_id):
        """Save all images in a paragraph and return their <img> tags.

        :param par: a paragraph object from docx
        :return: an HTML string with one <img> per r:embed found
        """
        ids = []
        root = ET.fromstring(par._p.xml)
        namespace = {
            'a': "http://schemas.openxmlformats.org/drawingml/2006/main",
            'r': "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
            'wp': "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
        }
        embed_attr = '{{{0}}}embed'.format(namespace['r'])
        # Inline images first, then anchored (floating) images.
        for tag in ('.//wp:inline', './/wp:anchor'):
            for drawing in root.findall(tag, namespace):
                for img in drawing.findall('.//a:blip', namespace):
                    ids.append(img.attrib[embed_attr])
        response = ""
        for rel_id in ids:
            # Resolve the relationship id to the embedded image part.
            image_part = document.part.related_parts[rel_id]
            millis = int(round(time.time() * 1000))
            file_name = str(rel_id) + "-" + str(book_id) + "-" + str(millis) + ".png"
            with open(dir_path + "/" + file_name, "wb") as fr:
                fr.write(image_part.blob)
            response += "<img src='" + file_name + "' class='img-responsive' />"
        return response
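
One thing the simplified render_runs does not handle is text containing characters such as `<` or `&`, which would end up as raw markup in the page. The following is a minimal sketch of one way to deal with that, and to keep bold and italic at the same time. The name `render_runs_escaped` is made up for this example, and the choice of `<strong>` and `<em>` is an assumption. It relies only on each run having `text`, `bold` and `italic` attributes, which python-docx run objects provide.

```python
import html


def render_runs_escaped(runs):
    """Render runs as one <p> element, escaping text and keeping bold/italic.

    Hypothetical variant of render_runs. `bold` and `italic` may be None
    when the formatting is inherited from a style; this sketch treats
    None as "not set" and ignores inherited formatting.
    """
    parts = []
    for run in runs:
        # html.escape turns <, >, & and quotes into HTML entities.
        text = html.escape(run.text)
        if not text:
            continue
        if run.italic:
            text = "<em>" + text + "</em>"
        if run.bold:
            text = "<strong>" + text + "</strong>"
        parts.append(text)
    return "<p>" + "".join(parts) + "</p>"
```

The same `html.escape` call could be applied wherever `block.text` or `paragraph.text` is concatenated into the output.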

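You may want to spend some time reading the resources mentioned earlier. They will help you understand what paragraphs, runs, heading types and so on mean in python-docx land.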
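
As a small illustration of how heading types can drive the output, here is a sketch that maps a paragraph's style name (the value get_heading_type returns) to an HTML tag. The function name `tag_for_style` and the mapping itself are assumptions based on Word's default English style names such as "Heading 1", "Title", "List Paragraph" and "Normal". Documents with localized or custom styles would need their own entries.

```python
import re


def tag_for_style(style_name):
    """Map a paragraph style name to an HTML tag name.

    Hypothetical helper; the mapping is illustrative, not exhaustive.
    """
    match = re.fullmatch(r"Heading (\d)", style_name)
    # HTML only defines h1 to h6, so deeper heading levels fall through.
    if match and 1 <= int(match.group(1)) <= 6:
        return "h" + match.group(1)
    if style_name == "Title":
        return "h1"
    if style_name.startswith("List"):
        return "li"
    return "p"
```

For example, `tag_for_style("Heading 2")` gives `"h2"`, while an unrecognized style falls back to `"p"`.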

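The code I've provided is a simplified version of the actual project. Unfortunately, I can't share all of it at the moment, but I hope this gives you a good starting point.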