Python - Processing Word Documents

  • Overview

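    To read a Word document, we use the python-docx library, which is imported under the module name docx. We first install it as shown below, then write a program that uses the docx module to read the whole file paragraph by paragraph.
    The following command installs the library into our environment. Note that the package name on PyPI is python-docx, not docx.
    
     pip install python-docx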
    
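    In the example below, we read the contents of a Word document by appending the text of each paragraph to a list, then joining the list with newlines and printing the result.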
    
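    import docx

    def readtxt(filename):
        # Open the .docx file and collect the text of each paragraph
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join the paragraphs into a single newline-separated string
        return '\n'.join(fullText)

    # Raw string so the backslash in the Windows path is kept literally
    print(readtxt(r'path\Tutorialspoint.docx'))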
    
    When we run the above program, we get the following output:
    
    CAINIAOYA originated from the idea that there exists a class of readers who respond
    better to online content and prefer to learn new skills at their own pace from the comforts 
    of their drawing rooms. 
    The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, 
    we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
    a wealth of tutorials and allied articles on topics ranging from programming languages 
    to web designing to academics and much more.
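
    Word files often contain empty paragraphs used purely for spacing, and these come back as blank entries when a document is read paragraph by paragraph. The following is a minimal sketch, not part of the original program, of a filter that drops them. The helper name non_empty_paragraphs and the stand-in sample object are assumptions made for illustration; the helper relies only on the .paragraphs and .text attributes already used above.

```python
from types import SimpleNamespace


def non_empty_paragraphs(doc):
    """Return the text of every paragraph in doc that is not blank."""
    # doc can be any object with a .paragraphs list whose items have a
    # .text string, which is the shape of what docx.Document(path) returns.
    return [para.text for para in doc.paragraphs if para.text.strip()]


# Hypothetical stand-in document so the sketch runs without a .docx file.
sample = SimpleNamespace(paragraphs=[
    SimpleNamespace(text="First paragraph."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="   "),
    SimpleNamespace(text="Second paragraph."),
])

print(non_empty_paragraphs(sample))
```

    With python-docx installed, the same helper should accept a real document, for example non_empty_paragraphs(docx.Document(filename)).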
    
  • Reading Individual Paragraphs

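    We can use the paragraphs attribute to read a specific paragraph from a Word document. In the example below, we read a single paragraph by its index. Indexing is zero-based, so doc.paragraphs[2] is the third entry in the list, and empty paragraphs count as entries too.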
    
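    import docx

    # Raw string so the backslash in the Windows path is kept literally
    doc = docx.Document(r'path\Tutorialspoint.docx')
    # Number of paragraphs in the document
    print(len(doc.paragraphs))
    # Index 2 is the third entry, because indexing starts at 0
    print(doc.paragraphs[2].text)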
    
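    When we run the above program, we get the following output (the paragraph count printed by the first print call is not shown):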
    
    The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
    it generated, we worked our way to adding fresh tutorials to our repository 
    which now proudly flaunts a wealth of tutorials and allied articles on topics 
    ranging from programming languages to web designing to academics and much more.
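
    Indexing doc.paragraphs directly raises an IndexError when the document has fewer paragraphs than expected. As a hedged sketch of one way to guard against that, the helper below checks the index first and returns None when it is out of range. The name paragraph_at and the stand-in sample object are hypothetical and used only for illustration.

```python
from types import SimpleNamespace


def paragraph_at(doc, index):
    """Return the text of doc.paragraphs[index], or None if out of range."""
    paragraphs = doc.paragraphs
    # Accept only 0 <= index < len(paragraphs); indexing is zero-based.
    if 0 <= index < len(paragraphs):
        return paragraphs[index].text
    return None


# Hypothetical stand-in document so the sketch runs without a .docx file.
sample = SimpleNamespace(paragraphs=[
    SimpleNamespace(text="Title"),
    SimpleNamespace(text="First body paragraph."),
    SimpleNamespace(text="Second body paragraph."),
])

print(paragraph_at(sample, 2))   # third entry in the list
print(paragraph_at(sample, 5))   # out of range, so None
```

    Returning None is one design choice among several; raising a more descriptive exception would suit callers that treat a missing paragraph as an error.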