<p>The problem is that you have a Word .doc file, not a text file.</p>
<p>A Word file is essentially a sequence of runs of formatted text. (It's actually more complicated than that, it's a tree of all kinds of things, some of which are runs of text, but let's keep it simple for now.)</p>
<p>In "classic" Word format, each of these runs is a string of bytes. In modern (DOCX, aka Office 2007, aka Office Open XML, aka WordML) Word format, each of them is an XML node in a tree. But either way, if you have one run that ends with <code>:</code> and another that starts with <code>1</code>, you won't find <code>:1</code> in the file; you'll find <code>:</code> followed by a bunch of cruft (the end of one binary object and the start of the next, or the end of one XML node and the start of the next, possibly with other objects/nodes in between), followed by the <code>1</code>.</p>
<p>There's no good way to deal with this without actually parsing the Word format.</p>
<p>So, the big question is, which format do you have?</p>
<hr>
<p>If it's DOCX, that's basically just an XML file, or a ZIP file with XML files inside of it, which you can parse natively in Python, or, better, use a module like <a href="https://pypi.python.org/pypi/docx/0.2.1" rel="nofollow"><code>docx</code></a> that does all the hard work for you.</p>
<p>If it's classic DOC, the only way to parse it is to read the reverse-engineered documentation that people have written up over the years and write some nasty code to deal with it. Or, of course, you can use some code that someone's already written. In this case, I don't know of any Python modules that will help, but you can control the <a href="http://www.winfield.demon.nl" rel="nofollow"><code>antiword</code></a> program pretty easily via <code>subprocess</code>.</p>
<hr>
<p>Or, alternatively, if you have a program on your machine that can read the files, like Word or WordPad/Write on Windows, or iWork Pages on Mac, or OpenOffice.org/LibreOffice on any platform, you can script that. Python has nice wrappers to talk to COM interfaces on Windows and AppleScript interfaces on Mac, and OO.o/Libre is built to be scriptable.</p>
<p><a href="http://www.galalaly.me/index.php/2011/09/use-python-to-parse-microsoft-word-documents-using-pywin32-library/" rel="nofollow">This blog post</a> is a nice example of using Word on Windows via <code>pywin32</code> to do things to doc files. You can use this as a starting point for your own code to extract the text from each file, or to make Word do the searching for you, or to just save a plain text copy of each file, which you can then do whatever you want with. There are hundreds of other such examples all over the web, as well as similar examples for using <code>appscript</code> or <code>ScriptingBridge</code> to do the equivalent on the Mac, or using VBA instead of Python to script from inside Word, etc. To find out what functions are available when scripting Word, see the <a href="http://msdn.microsoft.com/en-us/library/office/ee861527.aspx" rel="nofollow">Word 2013 developer reference</a>, or the similar docs for earlier versions if you don't have 2013, or just "Open Dictionary" in AppleScript Editor and look at Word's dictionary if you have a Mac.</p>
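<hr>
<p>To make the DOCX and <code>antiword</code> routes concrete, here is a minimal sketch using only the standard library. It assumes the main text lives in <code>word/document.xml</code> inside the ZIP (the usual location), that <code>antiword</code> is installed and on your <code>PATH</code>, and that anything that isn't a ZIP is a classic DOC. The function names are made up for illustration. It ignores tabs, line breaks, headers, footnotes and nested text boxes, so treat it as a starting point rather than a full parser.</p>

```python
import subprocess
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a DOCX file.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph, with its runs joined together.

    `source` is a path or a binary file-like object holding a DOCX file.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # Each w:t element holds the text of one run (or part of one).
        # Concatenating them drops the markup that sits between runs,
        # so ":" in one run and "1" in the next become ":1".
        paragraphs.append("".join(t.text or "" for t in para.iter(W + "t")))
    return paragraphs


def doc_text_via_antiword(path):
    """Return the text of a classic .doc file (assumes antiword is on PATH)."""
    result = subprocess.run(
        ["antiword", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def word_text(path):
    """Pick an extractor: ZIP archives are treated as DOCX, the rest as DOC."""
    if zipfile.is_zipfile(path):
        return "\n".join(docx_paragraphs(path))
    return doc_text_via_antiword(path)
```

<p>With that in place, a search such as <code>":1" in word_text(path)</code> looks at the joined text instead of the raw bytes, which is the whole point of parsing the format first.</p>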