Queequeg is a tiny English grammar checker for non-native speakers who are not used to verb conjugation and number agreement. We especially focus on people who are writing academic papers or business documents, where thorough checking is required. We aim to reduce this laborious work with automated checking. Queequeg is named after a character in Herman Melville's masterpiece.
Paraphrases plays an important role in the variety and complexity of natural language documents. However, they add to the difficulty of natural language processing. Here we describe a procedure for obtaining paraphrases from news articles. Articles derived from different newspapers can contain paraphrases if it indeed report the same event on the same day. We exploit these two feature by using Named Entity recognition. Our approach is based on the assumption that named entities are preserved across paraphrases. We applied our method to articles of two domains and obtained notable example.
Queequeg (command name:
$ qq -Wall sample.txt
sample.txt:0: (S:Paraphrases) (V:plays) an important ...
(number disagreement between "paraphrases" and "plays")
sample.txt:0: ... variety and (complexity) of natural ...
sample.txt:2: ... difficulty of (natural language) processing .
sample.txt:4: ... paraphrases if (S:it) indeed (V:report) the same ...
(number disagreement between "it" and "report")
sample.txt:5: We exploit (Det:these two) (N:feature) by using ...
("feature" should be in plural form)
sample.txt:5: ... by using (Named Entity recognition) .
sample.txt:8: ... and obtained (notable example) .
(an article needed, or should be plural)
Different types of errors are shown in different colors. The number displayed at the beginning of each line is the line number in the file.
Currently Queequeg recognizes the following document formats: plain text, LaTeX and HTML.
Download the archive file from the following page (about 60 KB).
$ make dict WORDNET=/src/wordnet/dict
where the environment variable WORDNET should be the pathname of the dict/ directory in the WordNet package.
dict.cdb is generated. Otherwise
Just feed Queequeg a file you want to check (command name:
Queequeg issues warnings based on the following types of grammatical errors:
-Wall option to enable this feature.)
||Verbose mode. It displays the names of errors.|
||Quiet mode. It doesn't display file names.|
||Force it to recognize all files as plain text format. Each paragraph is separated with an empty line in plain text format.|
||Force it to recognize all files as HTML format.|
||Force it to recognize all files as LaTeX format.|
||Specify the pathname of a system dictionary (dict.txt or dict.cdb). By default, it tries to find a dictionary file located in the same directory.|
The following options are for debugging purposes:
f||Specify the debug level as an integer.|
||Specify the stage to which the process is performed.
The default is |
||Specify which type of errors should be checked.
Acceptable values are
The current version of Queequeg reports many false positives.
For example, the phrase "my paper clip" looks like a single noun phrase, but an error is reported because it can also be read as "my paper clip[s]", where "clip" is taken as a verb whose final "s" is missing. Similarly, the noun phrase "three additional links" generates a number disagreement warning, because the system dictionary file contains a singular noun "links".
Determiner checking tends to generate more false positives, because Queequeg doesn't know whether a target noun is a mass noun or not. Normally, material names such as "meat" or "water", or abstract nouns such as "information", need not take any article. However, WordNet doesn't have this kind of information. (Some dictionaries like COMLEX do have this, but I didn't use them because they cannot be freely distributed.)
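To make the mass noun problem concrete, here is a minimal Python sketch of a determiner check. It is not Queequeg's actual code: the function name `missing_article`, the `ARTICLE_LIKE` tag set, and the `mass_nouns` parameter are all assumptions made for illustration. With no mass noun list (which corresponds to the situation described above), "useful information" is flagged just like "notable example".

```python
# Hypothetical sketch, not taken from Queequeg's source.
# Tags that are assumed to satisfy the need for a determiner.
ARTICLE_LIKE = {"DT", "DTS", "PRP$", "CD"}


def missing_article(group, mass_nouns=frozenset()):
    """Return True if a singular noun group seems to lack an article.

    group is a list of (word, tag) pairs whose last item is the head noun.
    mass_nouns is an optional set of nouns that need no article.
    """
    head_word, head_tag = group[-1]
    if head_tag != "NN":
        return False  # only singular common nouns are checked in this sketch
    if any(tag in ARTICLE_LIKE for _, tag in group[:-1]):
        return False  # a determiner-like word is already present
    return head_word.lower() not in mass_nouns


example = [("notable", "JJ"), ("example", "NN")]
info = [("useful", "JJ"), ("information", "NN")]

print(missing_article(example))                              # genuine error
print(missing_article(info))                                 # false positive
print(missing_article(info, mass_nouns={"information"}))     # suppressed
```

The third call shows what a dictionary with mass noun information would buy: the warning for "information" disappears while the warning for "example" stays.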
Queequeg identifies grammatical errors with pattern
recognition based on simple finite automata (i.e. regexps) and
unification of features assigned to each portion of an expression.
To gain speed and coverage, it doesn't parse a sentence. The core
part of checking is done in
POS tagging is performed in two phases. First, it looks up
dictionaries and obtains multiple candidates for each word
(sentence.py, dictionary.py), then tries to fix
several tags using regexp-based pattern matching
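The two phases can be sketched roughly as follows. This is an illustrative toy, not the code in sentence.py or dictionary.py: the dictionary contents, the rule format, and the names `TOY_DICT`, `FIX_RULES`, `lookup`, `fix_tags` and `tag` are all assumptions.

```python
import re

# Phase 1 data: a hypothetical toy dictionary listing every candidate tag.
TOY_DICT = {
    "the": ["DT"],
    "my": ["PRP$"],
    "to": ["TO"],
    "paper": ["NN", "VB"],
    "clip": ["NN", "VB"],
}

# Phase 2 data: each rule is a regexp over "<previous tag> <candidates>"
# and the tag to keep when the regexp matches (assumed rule format).
FIX_RULES = [
    (re.compile(r"^(DT|PRP\$) .*\bNN\b"), "NN"),  # determiner + noun-capable word
    (re.compile(r"^TO .*\bVB\b"), "VB"),          # "to" + verb-capable word
]


def lookup(words):
    """Phase 1: collect all candidate tags; unknown words default to NN."""
    return [(w, TOY_DICT.get(w.lower(), ["NN"])) for w in words]


def fix_tags(candidates):
    """Phase 2: narrow ambiguous words using the previous word's tag."""
    fixed = []
    prev = "BOS"  # beginning-of-sentence marker
    for word, tags in candidates:
        if len(tags) > 1:
            key = prev + " " + "|".join(tags)
            for pattern, chosen in FIX_RULES:
                if pattern.search(key):
                    tags = [chosen]
                    break
        fixed.append((word, tags))
        prev = tags[0] if len(tags) == 1 else "AMBIG"
    return fixed


def tag(words):
    return fix_tags(lookup(words))


print(tag("the clip".split()))
print(tag("to clip".split()))
print(tag("my paper clip".split()))
```

In the last call, "clip" stays ambiguous between noun and verb because no rule applies after a noun, which is the kind of ambiguity that can lead to a spurious warning.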
We use a modified version of the Penn Treebank tagset. It is extended with tags for the plural forms of pronouns (PRPS) and determiners (DTS), so that Queequeg can identify the number of a noun group by looking at the POS tag assigned to each noun.
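As a rough illustration of how a regexp over POS tags and unification of number features can work together, consider the following sketch. It is not Queequeg's implementation. The noun group pattern, the `NUMBER` table, and the choice of the last noun as the head are assumptions; only the DTS tag and the standard Penn Treebank tags come from the description above.

```python
import re

# Number feature per tag; tags not listed are unspecified (None).
NUMBER = {"DTS": "pl", "NN": "sg", "NNS": "pl"}

# A noun group as a regexp over a space-terminated tag string:
# optional determiners, numerals and adjectives, then one or more nouns.
NOUN_GROUP = re.compile(r"(?<![^ ])(?:(?:DTS?|CD|JJ) )*(?:NNS? )+")


def unify(a, b):
    """Unify two number features; None unifies with anything."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    raise ValueError("number clash: %s vs %s" % (a, b))


def check_noun_groups(tagged):
    """Return the noun groups whose determiners clash with the head noun."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    warnings = []
    for m in NOUN_GROUP.finditer(tag_string):
        start = tag_string[:m.start()].count(" ")  # token index of the match
        size = m.group().count(" ")                # number of tokens matched
        group = tagged[start:start + size]
        number = NUMBER.get(group[-1][1])          # head noun is assumed last
        for word, tag in group[:-1]:
            if tag.startswith("NN"):
                continue  # noun modifiers do not constrain the head
            try:
                number = unify(number, NUMBER.get(tag))
            except ValueError:
                warnings.append(" ".join(w for w, _ in group))
                break
    return warnings


sentence = [("We", "PRP"), ("exploit", "VBP"), ("these", "DTS"),
            ("two", "CD"), ("feature", "NN"), ("by", "IN"), ("using", "VBG")]
print(check_noun_groups(sentence))
```

Because the check only needs the tag string and a small feature table, no sentence parsing is involved, which matches the speed and coverage trade-off described earlier.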
Unlike other natural language systems, Queequeg cannot assume that a given sentence is grammatical. This decreases the accuracy of POS tagging.
Queequeg comes with ABSOLUTELY NO WARRANTY. This software is distributed under the GNU General Public License.
We need more testers! Feel free to send us any comments or bug reports.