Queequeg is a tiny English grammar checker for non-native speakers who are not used to verb conjugation and number agreement. We especially focus on people who're writing academic papers or business documents where thorough checking is required. We aim to reduce this laborious work with automated checking. Queequeg is named after a character in Herman Melville's masterpiece.
Paraphrases plays an important role in the variety and complexity of natural language documents. However, they add to the difficulty of natural language processing. Here we describe a procedure for obtaining paraphrases from news articles. Articles derived from different newspapers can contain paraphrases if it indeed report the same event on the same day. We exploit these two feature by using Named Entity recognition. Our approach is based on the assumption that named entities are preserved across paraphrases. We applied our method to articles of two domains and obtained notable example.
Queequeg (command name: qq) prints the following results for the above document:
$ qq -Wall sample.txt
-- sample.txt
sample.txt:0: (S:Paraphrases) (V:plays) an important ...
(number disagreement between "paraphrases" and "plays")
sample.txt:0: ... variety and (complexity) of natural ...
sample.txt:2: ... difficulty of (natural language) processing .
sample.txt:4: ... paraphrases if (S:it) indeed (V:report) the same ...
(number disagreement between "it" and "report")
sample.txt:5: We exploit (Det:these two) (N:feature) by using ...
("feature" should be in plural form)
sample.txt:5: ... by using (Named Entity recognition) .
sample.txt:8: ... and obtained (notable example) .
(an article needed, or should be plural)
Different types of errors are shown in different colors. A number displayed at the beginning of each line is the line number in a file.
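If you want to post-process the report, the location lines can be picked apart with a regular expression. The following is a minimal sketch, not part of Queequeg: the function name parse_location is hypothetical, and the line shape "filename:number: excerpt" is assumed from the sample output above (file names containing a colon are not handled).

```python
import re

# Assumed shape of a location line, taken from the sample output:
#   <filename>:<number>: <excerpt>
LOCATION_RE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): (?P<excerpt>.*)$")

def parse_location(line):
    """Return (filename, number, excerpt), or None for any other kind of line."""
    m = LOCATION_RE.match(line)
    if m is None:
        return None
    return m.group("file"), int(m.group("line")), m.group("excerpt")
```

Explanation lines such as (number disagreement between "it" and "report") and the "-- sample.txt" header do not match and yield None, so they can be handled separately.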
Currently Queequeg recognizes the following document formats: plain text, LaTeX and HTML.
Download the archive file from the following page (about 60 KB).
Extract the archive into a directory of your choice (e.g. /usr/local/queequeg-0.9).
$ make dict WORDNET=/src/wordnet/dict
where the environment variable WORDNET should be the pathname of the dict/ directory in the WordNet package.
/usr/share/wordnet.)
dict.cdb is generated. Otherwise dict.txt is generated.
qq. Have your shell look into this path. You may create a symbolic link to qq in some directory like /usr/local/bin.
(It tries to find a dictionary file located in the same directory.)
Just feed Queequeg a file you want to check (command name: qq). It recognizes the document format automatically based on the file extension (.tex, .html or .htm).
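The extension-based detection can be pictured as a simple lookup. This is only an illustrative sketch: guess_format and the format labels are hypothetical names, and both the plain-text fallback for other extensions and the case-insensitive comparison are assumptions rather than documented behavior.

```python
import os

# Extensions named in the text above; the labels on the right are hypothetical.
FORMAT_BY_EXTENSION = {".tex": "latex", ".html": "html", ".htm": "html"}

def guess_format(pathname):
    """Guess a document format from the file extension."""
    ext = os.path.splitext(pathname)[1].lower()  # lowercasing is an assumption
    # Assumption: any other extension (or none) is treated as plain text.
    return FORMAT_BY_EXTENSION.get(ext, "plain")
```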
Queequeg issues warnings based on the following types of grammatical errors:
-Wall option to enable this feature.)
qq also accepts the following command line options:
Option | Feature |
-v | Verbose mode. It displays the names of errors. |
-q | Quiet mode. It doesn't display file names. |
-p | Force it to recognize all files as plain text format. In plain text format, paragraphs are separated by an empty line. |
-l | Force it to recognize all files as HTML format. |
-t | Force it to recognize all files as LaTeX format. |
-s pathname | Specify the pathname of a system dictionary (dict.txt or dict.cdb). By default, it tries to find a dictionary file located in the same directory. |
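To illustrate the plain text convention mentioned for -p (paragraphs separated by an empty line), here is a small sketch of such a splitter. It is not Queequeg's own code: split_paragraphs is a hypothetical name, and treating whitespace-only lines as empty is an assumption.

```python
def split_paragraphs(text):
    """Split plain text into paragraphs at empty lines."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # An empty (or whitespace-only) line closes the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```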
The following options are for debugging purposes:
Option | Feature |
-D debuglevel | Specify the debug level as an integer. |
-S stage | Specify the stage up to which processing is performed. The default is grammar (check grammatical errors). Acceptable values are token (tokenize input files), sentence (split sentences), pos0 (POS tagging phase 1), or pos1 (POS tagging phase 2). |
-W type1,type2,... | Specify which types of errors should be checked. Acceptable values are sv1 (a subject and a verb placed across a prepositional phrase), sv2 (a subject and a verb placed adjacently), sv3 (a subject and a verb in "there-be" type syntax), det (determiner requirement), and plural (number of nouns). Values should be separated with commas. The default is sv1,sv2,sv3,plural. The value all is also accepted to select every type of error. |
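The value accepted by -W can be read as a comma-separated list with a default and an "all" shortcut. The sketch below shows one way to interpret such a value. The function name parse_warning_types is hypothetical, and rejecting unknown names with an error is an assumption; the type names and the default set are the ones listed in the table.

```python
ALL_TYPES = ("sv1", "sv2", "sv3", "det", "plural")
DEFAULT_TYPES = ("sv1", "sv2", "sv3", "plural")  # det is off by default

def parse_warning_types(value=None):
    """Turn a -W style value such as "sv1,det" into a list of error types."""
    if value is None:
        return list(DEFAULT_TYPES)
    selected = []
    for name in value.split(","):
        name = name.strip()
        if name == "all":
            return list(ALL_TYPES)
        if name not in ALL_TYPES:
            # Assumption: unknown type names are treated as an error.
            raise ValueError("unknown error type: %r" % name)
        if name not in selected:
            selected.append(name)
    return selected
```

Under this reading, the -Wall used in the earlier example would correspond to -W with the value all, which also turns on the determiner check.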
The current version of Queequeg reports many false positives, that is, warnings about text that is actually correct.
For example, the phrase "my paper clip" looks like a single noun phrase. However, an error is reported because it can also be read as "my paper clip[s]", i.e. a subject followed by a verb whose final "s" is missing. Similarly, the noun phrase "three additional links" generates a number disagreement warning, because the system dictionary file contains a singular noun "links".
Determiner checking tends to generate more false positives, because Queequeg doesn't know whether a target noun is a mass noun or not. Normally, material names such as "meat" or "water", or abstract nouns such as "information", need not take any article. However, WordNet doesn't have this kind of information. (Some dictionaries like COMLEX do have this, but I didn't use them because they cannot be freely distributed.)
setup.py.
Queequeg identifies grammatical errors with pattern recognition based on simple finite automata (i.e. regexps) and unification of features assigned to each portion of an expression. It doesn't fully parse a sentence, in order to gain speed and coverage. The core part of checking is done in constraint.py and unification.py.
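As a rough illustration of what feature unification means here, the sketch below merges two small feature sets and fails when they disagree. It is a textbook-style simplification, not the code in unification.py: the function name unify and the feature names "number" and "person" are hypothetical.

```python
def unify(a, b):
    """Merge two feature dicts; return None if any shared feature conflicts."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            return None  # e.g. plural subject vs. singular verb
        result[key] = value
    return result

# Hypothetical features for the first error in the sample document.
subject = {"number": "plural"}               # "Paraphrases"
verb = {"number": "singular", "person": 3}   # "plays"
agreement = unify(subject, verb)             # None: number disagreement
```

A failed unification between the portions matched as subject and verb is the kind of condition that would be reported as a number disagreement.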
POS tagging is performed in two phases. First it looks up dictionaries and obtains multiple candidate tags for each word (sentence.py, dictionary.py), then it tries to fix several tags using regexp-based pattern matching (postagfix.py).
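The two phases can be sketched on a toy scale as follows. Everything in this example is an assumption made for illustration: the dictionary entries, the single fix-up rule, the "word/TAG1|TAG2" encoding, the NN fallback for unknown words, and the function names are hypothetical and are not taken from dictionary.py or postagfix.py.

```python
import re

# Toy dictionary (hypothetical): each word maps to every tag it may take.
TOY_DICT = {
    "the": ["DT"],
    "report": ["NN", "VB"],
    "links": ["NNS", "VBZ"],
    "pages": ["NNS"],
}

def lookup(words):
    """Phase 1: collect all candidate tags per word (unknown words assumed NN)."""
    return [(w, list(TOY_DICT.get(w.lower(), ["NN"]))) for w in words]

# Phase 2 rules work on a string like "the/DT report/NN|VB".
FIX_RULES = [
    # Right after a determiner, drop the verb reading of an NN|VB word.
    (re.compile(r"(\S+/DT) (\S+)/NN\|VB\b"), r"\1 \2/NN"),
]

def fix_tags(tagged):
    """Phase 2: narrow candidate tags with regexp rules."""
    if not tagged:
        return []
    encoded = " ".join("%s/%s" % (w, "|".join(tags)) for w, tags in tagged)
    for pattern, replacement in FIX_RULES:
        encoded = pattern.sub(replacement, encoded)
    result = []
    for token in encoded.split(" "):
        word, _, tags = token.rpartition("/")
        result.append((word, tags.split("|")))
    return result
```

For the words "the report links pages", phase 1 leaves "report" ambiguous between NN and VB, and the single rule in phase 2 narrows it to NN because it follows a determiner; "links" stays ambiguous since no rule covers it.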
We use a modified version of the Penn Treebank tagset. It is extended with plural forms of pronouns (PRPS) and determiners (DTS), so that Queequeg can identify the number of a noun group by looking at the POS tag assigned to each noun.
Unlike other natural language systems, Queequeg cannot assume that a given sentence is grammatical. This decreases the accuracy of POS tagging.
Queequeg comes with ABSOLUTELY NO WARRANTY. This software is distributed under the GNU General Public License.
We need more testers! Feel free to send us any comments or bug reports.