Tools for Linguistic Word Processing

Word processing in linguistics poses certain problems because of the field's special needs. The information on this page may help you overcome some of them. I assume that most of you use MS Word. I don't; I use Corel's WordPerfect, which has some features useful for linguistics that, as far as I know, MS Word doesn't have. Unfortunately, WordPerfect is not marketed here in Israel. Also, this page is geared to users of MS Windows. I don't know anything about Linux, and Linux users probably know a lot more about computers than I do.

The major difficulties in linguistic word processing are the following:

Phonetic characters are not part of the normal character set.
Accent marks may need to be placed on characters; only certain combinations are part of standard character sets.
Logical and mathematical symbols and Greek letters are not part of the normal character set.
Trees, with tree lines, are difficult to draw.
Feature matrices, attribute-value matrices (as used in LFG and HPSG), and certain other linguistic formalisms involve non-linear structures, and word processors are designed for linear typing.

Note: Because there are specialized symbols on this page, not all characters will necessarily display on your computer. It depends on what fonts you have on your system, what browser you are using, and how your browser is configured.

Phonetic characters are not part of the normal character set.

Free phonetic fonts are available on the Internet from the Summer Institute of Linguistics.

Note, by the way, that some phonetic characters are part of the standard "ANSI" (English, not Hebrew) character set:

æ (ae) is character 0230. (The uppercase version is 0198)
œ (oe) is character 0156. (The uppercase version is 0140)
ð (eth: voiced dental fricative) is character 0240. (The uppercase version is 0208)
þ (thorn: not usually used as a phonetic symbol) is character 0254. (The uppercase version is 0222)
ø (slashed o) is 0248. (The uppercase version is 0216)

To get these characters, hold down the Alt key and, with the Alt key held down, type the number on the keypad on the right side of the keyboard (not the numbers on top of the keyboard). Then release the Alt key and the character should appear. (If a Hebrew character appears, change the font to an English/Western character set font.)
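The numbers above are simply the positions of these characters in the Western Windows code page. A minimal sketch in Python, assuming that code page is the one Python calls cp1252 (the helper name alt_char is made up for illustration), looks each number up the same way:

```python
# Assumption: Alt+0nnn enters the character stored at position nnn of the
# active Windows "ANSI" code page; cp1252 is the Western (English) one.
ALT_CODES = [230, 198, 156, 140, 240, 208, 254, 222, 248, 216]


def alt_char(code, codepage="cp1252"):
    """Return the character at position `code` in the given code page."""
    return bytes([code]).decode(codepage)


for code in ALT_CODES:
    ch = alt_char(code)
    # Show the Alt code, the character, and its Unicode code point.
    print(f"Alt+0{code}: {ch}  U+{ord(ch):04X}")
```

Passing codepage="cp1255" (the Hebrew code page) to the same helper shows why a Hebrew letter can appear instead when a Hebrew font is active.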

For phonetic symbols with diacritics (accent marks), see the accent mark section.

Logical and mathematical symbols and Greek letters are not part of the normal character set.

People often substitute the closest regular character for these, like using an apostrophe instead of a prime (or bar for X-bar). But these characters (at least most of them) are included in the "Symbol" font, which is a standard part of Windows. The following list gives the character to type in the Symbol font to get the desired character. To insert one of these characters, change the font to "Symbol". If a keyboard key is listed, type it. If a number is listed, hold the Alt key down, type the number on the keypad to the right of the main keyboard (not the numbers on the top of the keyboard), and release the Alt key. (Note: if you have a Unicode-compliant version of Word, and fonts with the appropriate characters, you can insert these as Unicode characters instead.)

∀ universal quantifier symbol: " (make sure "smart quotes" are turned off)
∃ existential quantifier symbol: $
∴ therefore: \
¬ not: 0216 (or 0172 in a normal font, like Times New Roman)
∧ and: 0217
∨ or: 0218
∞ infinity: 0165

⟨ left angle bracket: 0225 (note: this is not the same as the "less than" character <)
⟩ right angle bracket: 0241 (note: this is not the same as the "greater than" character >)
∅ empty set: 0198
∩ intersection: 0199
∪ union: 0200
⊃ proper superset: 0201
⊇ superset: 0202
⊄ not subset: 0203
⊂ proper subset: 0204
⊆ subset: 0205
∈ element: 0206
∉ not element: 0207
∍ such that (reverse "element"): ' (make sure that "smart quotes" are turned off)
⊥ bottom: ^
′ prime: 0162 (this is NOT the same as the apostrophe character ')
″ double prime: 0178 (this is NOT the same as the double-quote character ")
° degree: 0176
∏ product (big pi): 0213
∑ summation (big sigma): 0229

∗ mid-line (not raised) asterisk (as in metrical grids): *
+ plus sign: + (the "Symbol" plus sign is positioned differently, and often works better for features)
⊕ circled plus: 0197
− minus sign: -
± plus-or-minus: 0177 (also 0177 in normal fonts like Times New Roman)
× times: 0180 (Or 0215 in normal fonts like Times New Roman)
⊗ circled times: 0196
÷ divided by: 0184 (or 0247 in normal fonts like Times New Roman)
≠ unequal: 0185
≡ equivalent: 0186
≈ approximately equal: 0187
≅ congruent: @
∼ similar (also morphological alternation): ~
≤ less than or equal: 0163
≥ greater than or equal: 0179

↔ left-right arrow: 0171
← left arrow: 0172
↑ up arrow: 0173
→ right arrow: 0174
↓ down arrow: 0175
⇔ left-right double arrow: 0219
⇐ left double arrow: 0220
⇑ up double arrow: 0221
⇒ right double arrow: 0222
⇓ down double arrow: 0223

Alpha Α uppercase: A α lowercase: a

Beta Β uppercase: B β lowercase: b

Gamma Γ uppercase: G γ lowercase: g

Delta Δ uppercase: D δ lowercase: d

Epsilon Ε uppercase: E ε lowercase: e

Zeta Ζ uppercase: Z ζ lowercase: z

Eta Η uppercase: H η lowercase: h

Theta Θ uppercase: Q θ lowercase: q ϑ alternate lowercase: J

Iota Ι uppercase: I ι lowercase: i

Kappa Κ uppercase: K κ lowercase: k

Lambda Λ uppercase: L λ lowercase: l

Mu Μ uppercase: M μ lowercase: m

Nu Ν uppercase: N ν lowercase: n

Xi Ξ uppercase: X ξ lowercase: x

Omicron Ο uppercase: O ο lowercase: o

Pi Π uppercase: P π lowercase: p ϖ alternate lowercase: v

Rho Ρ uppercase: R ρ lowercase: r

Sigma Σ uppercase: S σ lowercase: s ς final lowercase: V

Tau Τ uppercase: T τ lowercase: t

Upsilon Υ uppercase: U υ lowercase: u ϒ alternate uppercase: 0161

Phi Φ uppercase: F ϕ closed lowercase: f φ open lowercase: j

Chi Χ uppercase: C χ lowercase: c

Psi Ψ uppercase: Y ψ lowercase: y

Omega Ω uppercase: W ω lowercase: w
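If you go the Unicode route mentioned above instead of the Symbol font, the list amounts to a lookup table from "what you type in Symbol" to a Unicode character. Here is a minimal sketch in Python covering only a handful of entries copied by hand from the list; the names SYMBOL_TO_UNICODE and symbol_char are hypothetical, not part of any library:

```python
# Hypothetical helper: a hand-copied subset of the Symbol-font list above.
# Keys are character positions in the Symbol font (keyboard keys are given
# via ord(); Alt codes are given as plain numbers).
SYMBOL_TO_UNICODE = {
    ord('"'): "\u2200",   # universal quantifier
    ord("$"): "\u2203",   # existential quantifier
    ord("\\"): "\u2234",  # therefore
    216: "\u00AC",        # not
    217: "\u2227",        # and
    218: "\u2228",        # or
    206: "\u2208",        # element
    174: "\u2192",        # right arrow
    ord("a"): "\u03B1",   # alpha
    ord("q"): "\u03B8",   # theta
    ord("l"): "\u03BB",   # lambda
}


def symbol_char(key):
    """Map a keyboard key (one-character string) or an Alt code (int)
    to the Unicode character the Symbol font would display for it."""
    code = ord(key) if isinstance(key, str) else key
    return SYMBOL_TO_UNICODE[code]


for key in ['"', "$", 216, 174, "l"]:
    ch = symbol_char(key)
    print(f"{key!r} -> {ch}  U+{ord(ch):04X}")
```

A key that is not in the table raises KeyError rather than passing through unchanged, since in the Symbol font an ordinary letter never displays as itself.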

Accent marks may need to be placed on characters; only certain combinations are part of standard character sets.

Those of us who use WordPerfect don't need to worry about this, because WordPerfect has a feature called "Overstrike", which allows any two characters to be superimposed. It also includes every accent mark imaginable in its character set (well, almost: there is no double-grave accent). Unfortunately, as far as I know, MS Word has never implemented an overstrike-like feature. Some accented characters are part of the standard "ANSI" (English, not Hebrew) character set, including š (if you can't see that, it is s with a "hachek" or wedge on it: the phonetic symbol for the sh sound) and other combinations that are used in the spelling of the standard European languages. If you have Windows installed with support for Eastern European, Turkish, and Baltic languages, you have more such characters available. The easiest way to see what you have is to run Windows' "WordPad" program and see what variants of "Times New Roman" are listed in the font list. (This may work in MS Word, too.) See below for Windows-supported letter-accent combinations.

When all of the above fails (and it very well might), download the free phonetic fonts from the Summer Institute of Linguistics. These fonts include accent marks which automatically combine with the character before them. Using these fonts, you can put any accent mark on anything. (It is usually easier, and looks better, to use built-in characters listed below when available.)
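Unicode has its own accent marks that combine with the character before them, much like the ones in these fonts. The following is a minimal sketch using Python's standard unicodedata module, not a description of how the SIL fonts work internally; the table COMBINING and the helper accent are names made up for this example:

```python
import unicodedata

# Unicode combining marks: each one attaches to the character before it.
COMBINING = {
    "hachek": "\u030C",        # combining caron
    "macron": "\u0304",        # combining macron
    "acute": "\u0301",         # combining acute accent
    "double grave": "\u030F",  # combining double grave accent
}


def accent(base, mark):
    """Put `mark` on `base`. NFC normalization folds the pair into a single
    built-in character when Unicode has one, and leaves it as two otherwise."""
    return unicodedata.normalize("NFC", base + COMBINING[mark])


s_hachek = accent("s", "hachek")  # Unicode has a built-in s with hachek
q_macron = accent("q", "macron")  # no built-in q with macron exists
print(s_hachek, len(s_hachek))
print(q_macron, len(q_macron))
```

The length printed next to each result shows whether a built-in character was found (1) or the letter and its accent remain separate characters (2).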

The following are available combinations in Windows. To insert them in a file, hold down the Alt key, type the number on the keypad on the right side of your keyboard (not with the numbers on the top row of the main keyboard), and then release the Alt key.

A
à with grave accent (top left/bottom right): uppercase 0192; lowercase 0224
á with acute accent (top right/bottom left): uppercase 0193; lowercase 0225
â with circumflex (^): uppercase 0194; lowercase 0226
ã with tilde (~): uppercase 0195; lowercase 0227
ä with umlaut (two dots over it): uppercase 0196; lowercase 0228
å with ring over it: uppercase 0197; lowercase 0229
ā with macron (straight line over it): Baltic character set: uppercase 0194; lowercase 0226
ă with breve (short sign; like a little "u"): East European (CE) character set: uppercase 0195; lowercase 0227
ą with "Polish hook" (little hook under the right side of the letter): East European (CE) character set: uppercase 0165; lowercase 0185

C
ç with cedilla (as in French): uppercase 0199; lowercase 0231
ć with acute accent (top right/bottom left): East European (CE) character set: uppercase 0198; lowercase 0230
č with hachek (wedge; little v) [phonetic "ch"] East European (CE) character set: uppercase 0200; lowercase 0232

D
Ďď with hachek (wedge; little v) or apostrophe: East European (CE) character set: uppercase 0207; lowercase 0239
đ with crossbar (not the same as eth): East European (CE) character set: uppercase 0208; lowercase 0240

E
è with grave accent (top left/bottom right): uppercase 0200; lowercase 0232
é with acute accent (top right/bottom left): uppercase 0201; lowercase 0233
ê with circumflex (^): uppercase 0202; lowercase 0234
ë with umlaut (two dots over it): uppercase 0203; lowercase 0235
ē with macron (straight line over it): Baltic character set: uppercase 0199; lowercase 0231
ė with one dot over it: Baltic character set: uppercase 0203; lowercase 0235
ę with "Polish hook" (little hook under the right side of the letter): East European (CE) character set: uppercase 0202; lowercase 0234

G
ğ with breve (short sign; like a little "u"): Turkish character set: uppercase 0208; lowercase 0240
Ģģ with cedilla (like under French c; lower case has an apostrophe over the letter): Baltic character set: uppercase 0204; lowercase 0236

I
ı an undotted i: Turkish character set: 0253
İ an uppercase dotted I: Turkish character set: 0221
ì with grave accent (top left/bottom right): uppercase 0204; lowercase 0236
í with acute accent (top right/bottom left): uppercase 0205; lowercase 0237
î with circumflex (^): uppercase 0206; lowercase 0238
ï with umlaut (two dots over it): uppercase 0207; lowercase 0239
ī with macron (straight line over it): Baltic character set: uppercase 0206; lowercase 0238
į with "Polish hook" (little hook under the right side of the letter): Baltic character set: uppercase 0193; lowercase 0225

K
ķ with cedilla (like under French c): Baltic character set: uppercase 0205; lowercase 0237

L
ĺ with acute accent (top right/bottom left): East European (CE) character set: uppercase 0197; lowercase 0229
ļ with cedilla (like under French c): Baltic character set: uppercase 0207; lowercase 0239
ł with diagonal stroke: East European (CE) character set: uppercase 0163; lowercase 0179

N
ñ with tilde(~) (Spanish ny): uppercase 0209; lowercase 0241
ń with acute accent (top right/bottom left): East European (CE) character set: uppercase 0209; lowercase 0241
ņ with cedilla (like under French c): Baltic character set: uppercase 0210; lowercase 0242
ň with hachek (wedge; little v) East European (CE) character set: uppercase 0210; lowercase 0242

O
ò with grave accent (top left/bottom right): uppercase 0210; lowercase 0242
ó with acute accent (top right/bottom left): uppercase 0211; lowercase 0243
ô with circumflex (^): uppercase 0212; lowercase 0244
õ with tilde (~): uppercase 0213; lowercase 0245
ö with umlaut (two dots over it): uppercase 0214; lowercase 0246
ō with macron (straight line over it): Baltic character set: uppercase 0212; lowercase 0244
ő with double acute accent: East European (CE) character set: uppercase 0213; lowercase 0245

R
ŕ with acute accent (top right/bottom left): East European (CE) character set: uppercase 0192; lowercase 0224
ŗ with cedilla (like under French c): Baltic character set: uppercase 0170; lowercase 0186
ř with hachek (wedge; little v) East European (CE) character set: uppercase 0216; lowercase 0248

S
ś with acute accent (top right/bottom left): East European (CE) character set: uppercase 0140; lowercase 0156
ş with cedilla (like under French c): East European (CE) character set: uppercase 0170; lowercase 0186
š with hachek (wedge; little v) [phonetic "sh"] uppercase 0138; lowercase 0154

T
ţ with cedilla (like under French c): East European (CE) character set: uppercase 0222; lowercase 0254
Ťť with hachek (wedge; little v) or apostrophe East European (CE) character set: uppercase 0141; lowercase 0157

U
ù with grave accent (top left/bottom right): uppercase 0217; lowercase 0249
ú with acute accent (top right/bottom left): uppercase 0218; lowercase 0250
û with circumflex (^): uppercase 0219; lowercase 0251
ü with umlaut (two dots over it): uppercase 0220; lowercase 0252
ů with ring over it: East European (CE) character set: uppercase 0217; lowercase 0249
ū with macron (straight line over it): Baltic character set: uppercase 0219; lowercase 0251
ű with double acute accent: East European (CE) character set: uppercase 0219; lowercase 0251
ų with "Polish hook" (little hook under the right side of the letter): Baltic character set: uppercase 0216; lowercase 0248

Y
ý with acute accent (top right/bottom left): uppercase 0221; lowercase 0253
ÿ with umlaut (two dots over it): uppercase 0159; lowercase 0255

Z
ź with acute accent (top right/bottom left): East European (CE) character set: uppercase 0143; lowercase 0159
ż with dot over it: East European (CE) character set: uppercase 0175; lowercase 0191
ž with hachek (wedge; little v) [phonetic "zh"] East European (CE) character set: uppercase 0142; lowercase 0158
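The same number appears several times in the list above because each character set assigns its own letter to it. A minimal sketch in Python, assuming the four character sets correspond to the Windows code pages cp1252, cp1250, cp1254 and cp1257 (the names CODEPAGES and alt_char are made up for illustration):

```python
# Assumed correspondence between the character sets named in the list
# and Python's names for the Windows code pages.
CODEPAGES = {
    "Western": "cp1252",
    "East European (CE)": "cp1250",
    "Turkish": "cp1254",
    "Baltic": "cp1257",
}


def alt_char(code, charset="Western"):
    """Return the character that Alt+0<code> gives in the named character set."""
    return bytes([code]).decode(CODEPAGES[charset])


# One Alt code, four different letters depending on the character set.
for name in CODEPAGES:
    print(f"Alt+0240 in {name}: {alt_char(240, name)}")
```

This is why the font (and so the character set) has to be chosen before typing the number.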

Trees, with tree lines, are difficult to draw.

Yes, they are. If you don't mind paying, there is a font called "Arboreal" (believe it or not) which has all sorts of tree-line characters. Since I do mind paying, I have no experience with Arboreal. What I do is use my word processor's drawing capabilities, and just draw in the lines. If your word processor has a drawing feature, it is worth taking the extra few minutes to draw the lines that way rather than drawing them in by hand on the hard copy. But there is no solution to this one that is both free and easy. Sorry!

Feature matrices, attribute-value matrices (as used in LFG and HPSG), and certain other linguistic formalisms involve non-linear structures, and word processors are designed for linear typing.

What I have in mind is stuff like feature matrices, attribute-value matrices (LFG f-structures), and phrase structure rules.

These are non-linear, and thus not "normal" word processor material. However, many word processors today have a tool bundled with them that is capable of producing these structures, although that is not its intended purpose. The intended use is to produce mathematical formulae and equations, which are also not linear, so the tool in question is often called an "equation editor" or a "mathematical equation editor" or something like that. If you don't see such an option as part of your word processor, check the installation CD, as it may not have been installed.

The problem with the equation editors with which I am familiar is that, since they are designed to produce mathematical equations, there are certain default settings which need to be circumvented. To the best of my knowledge, MS Word is always bundled with a version of the MathType equation editor by Design Science, so the tips below are geared to that product. I do not know if the interface is the same as the one I'm familiar with, though (and different versions of Word may come with different versions of the equation editor), so I will try to be as general as possible. A fuller version of MathType can be purchased from Design Science; it has some additional features which are useful, but it costs over $100. If anyone out there uses WordPerfect and wants tips on using the "WordPerfect 5-7" equation editor, with which I also have experience, contact me. I used the MathType equation editor as bundled with WordPerfect in laying out my book Lexical-Functional Grammar.

I won't give general instructions here, as the MathType equation editor is relatively self-explanatory. But do have a look around to see what formatting options there are. One thing to keep in mind when you create a matrix is to keep the alignment correct for what you are doing. Phonological features and attributes and values in attribute-value matrices, for example, should line up on the left, not be centered.

The biggest problem with equation editors is that they assume that text they don't recognize is a mathematical variable. Mathematical variables are normally typeset as italics, so if you try to type in, for example, "-back" you will find that the word "back" is italicized. What you need to do is tell the equation editor to use function "style" instead of math "style". (In my copy this is in the Style menu.) Also, be aware of whether you are typing a hyphen or a minus; for a hyphen you need "text" style, for a minus "math" style. There are also subtle differences between the "text" plus sign and the "math" plus sign; for features the "math" version is better. (For typing extended text, it is easier to use "text" style than "function" style because "text" style allows you to insert spaces.)

For LFG attribute-value matrices, there is the additional problem that LFG convention uses small capitals for attribute names (and for some values, too). While the equation editor does not "have" small capitals, they can easily be simulated by typing capitals at a reduced point size. The equation editor has several preset point sizes. What I did was take the "Symbol" size (which is actually supposed to be larger than ordinary text for mathematical purposes) and define it to be 80% of the regular point size. So whenever I need small capitals, I specify the "Symbol" size and type in capitals. (In the full version of MathType, there are two "user-defined" sizes, so one of these can be used instead of "Symbol".)
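For readers who happen to have LaTeX available (an assumption on my part; everything else on this page concerns word processors), the same conventions can be sketched there: left-aligned columns, upright rather than italic feature names, a math-style minus and plus, and small capitals for attribute names. The feature values and the f-structure contents below are invented purely for illustration.

```latex
\documentclass{article}
\begin{document}

% Feature matrix: one left-aligned column, math minus/plus,
% upright (non-italic) feature names.
\[
\left[\begin{array}{l}
  -\mathrm{back} \\
  +\mathrm{high}
\end{array}\right]
\]

% Attribute-value matrix: attributes in small capitals, both columns
% left-aligned, with one f-structure nested inside another.
\[
\left[\begin{array}{ll}
  \textsc{pred}  & \mbox{`see'} \\
  \textsc{tense} & \textsc{past} \\
  \textsc{subj}  & \left[\begin{array}{ll}
                     \textsc{pred} & \mbox{`pro'} \\
                     \textsc{num}  & \textsc{sg}
                   \end{array}\right]
\end{array}\right]
\]

\end{document}
```

The "l" column specifiers do the job of the left-alignment setting in the equation editor, and \textsc gives true small capitals rather than the reduced-size capitals described above.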

One thing which cannot be done in the MathType equation editor as bundled with word processors is the boxed numbers used as tags in HPSG attribute-value matrices. They can be produced in the full (expensive) version of MathType. (If you download the trial version of MathType, you can save a boxed tag to disk, and then read it back in even after the trial period is over and MathType turns into MathType Lite.) They can also be finagled in the WordPerfect 5-7 equation editor.
