This manual is for liblouisutdml (version 2.7.0, 20 September 2017), an xml to Braille Translation Library.
This file may contain code borrowed from the Linux screenreader BRLTTY, Copyright © 1999-2009 by the BRLTTY Team.
Copyright © 2004-2009 ViewPlus Technologies, Inc. www.viewplus.com and Copyright © 2006,2009 Abilitiessoft, Inc. www.abilitiessoft.org.
This file is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser (or library) General Public License (LGPL) as published by the Free Software Foundation; either version 3, or (at your option) any later version.
This file is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser (or Library) General Public License LGPL for more details.
You should have received a copy of the GNU Lesser (or Library) General Public License (LGPL) along with this program; see the file COPYING. If not, write to the Free Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
liblouisutdml is a software component which can be incorporated into
software packages to provide the capability of translating any file in
the computer lingua franca xml format or plain text into properly
transcribed
braille. This includes translation into grade two, if desired,
mathematical codes, etc. It also includes formatting according to a
style sheet which can be modified by the user. The first
program into which liblouisutdml has been incorporated is
file2brl. This program will translate an xml or text file
into an embosser-ready braille file. It is not necessary to know xml,
because MSWord and other word processors can export files in this
format. If the word processor has been used correctly
file2brl will produce an excellent braille file.
Users who want to generate Braille using file2brl will be
interested in Transcribing XML files with file2brl. Those who
wish to change the output generated by liblouisutdml should read
Customization: Configuring liblouisutdml. If you encounter a type
of xml file with which liblouisutdml is not familiar you can learn how
to tell it how to process that file by reading Connecting with the xml Document. If you wish to implement a new braille mathematics
code read Implementing Braille Mathematics Codes. Finally,
computer programmers who wish to use liblouisutdml in their software can
find the information they need in Programming with liblouisutdml.
You will also find it advantageous to be acquainted with the companion library liblouis, which is a braille translator and back-translator (see Overview in Liblouis User’s and Programmer’s Manual).
At the moment, actual transcription with liblouisutdml is done with the
command-line (or console) program file2brl. The line to type
is:
file2brl [OPTIONS] [-f config-file] [infile] [outfile]
The brackets indicate that something is optional. You will see that
nothing is required except the program name itself, file2brl.
The various optional parts control how the program will behave, as
follows:
This option causes file2brl to print a help message
describing usage and exit.
This option causes file2brl to display the version
information and exit.
This specifies the configuration file which tells file2brl
how to do the transcription. (It may be a list of file names separated
by commas.) This file specifies such things as the number of cells per
line, the number of lines per page, the translation tables to be used,
how paragraphs and headings are to be formatted, etc. If this part of
the command line is omitted, file2brl assumes that the
configuration file is named preferences.cfg. If the
configuration file name contains a pathname file2brl will
consider this as a path on which to look for files that it needs
(see Files and Paths). If no pathname is given the standard paths
are searched and finally the current directory. To make
file2brl search the current directory first, precede the
file name with ./.
Back-translate. The input file must be a braille file, such as
.brf. The output file is a back-translation of this file. It
may be in either plain-text or xhtml (html), according to the setting
of backFormat in the outputFormat section of the
configuration file. Html files will contain page numbers and emphasis.
To get good html, the liblouis table must have the entry ‘space
\e 1b’ so that it will pass through escape characters. The
html.sem file must also contain the line ‘pagenum
pagenum’. Text output files simply have a blank line between
paragraphs. Encoding of text files is controlled by the
outputEncoding setting. Html files are always in UTF-8.
Reformat. The input file must be a braille file, such as .brf. The output is a braille file formatted according to the configuration file. It is advisable to set backFormat to html, since this will preserve print page numbers and emphasis. This option can be useful for changing the line length and page length of a braille file, for example, from 40 to 32 cells. It is also an excellent way to check the accuracy of liblouis tables. The original page numbers at the tops and bottoms of pages are discarded, and new ones are generated.
Consider the document to be a text file, even if it is xml or html.
The document is an h(t)ml file, not xhtml. This option is useful with
files downloaded from the Web in source form. Without it, the program
will first try to parse the file as an xml document, producing lots of
error messages. It will then try the html parser. With this option, it
goes directly to the html parser. See also the formatFor
configuration file setting (see formatFor setting), which enables
you to format the braille output for viewing in a browser.
Poorly formatted input translation. Infile is any text file such as may
have been obtained by extracting the text in a pdf file. The input
file may also be an xml or html file which is so poorly formatted that
better braille can be obtained by ignoring the formatting.
file2brl tries to guess paragraph breaks. The output is
generally reasonably formatted, that is, with reasonable paragraph
breaks.
Treat each block of text ending in a newline as a paragraph. If there are two newline characters a blank line will be inserted before the next paragraph.
This option enables you to specify configuration settings on the
command line instead of changing the configuration file. You can use
as many -C options as you wish. Any settings can be specified
except those having to do with styles, which must be specified in
configuration files. See Configuration Settings Index for a list of
available settings. The settings may be in any order. They override
any settings in liblouisutdml.ini or in the configuration file used
by file2brl.
This option enables you to specify where the log file and other temporary files will be written.
This option will cause file2brl and liblouisutdml to print
error messages to file2brl.log instead of stderr. The file will
be in the current directory. This option is particularly useful if
file2brl is called by a GUI script or Web application.
This is the name of the input file containing the material to be transcribed. The file may be either an xml file or a text file. The -b, -r and -p options discussed above provide for other types of files and processing. Typical xml files are those provided by www.bookshare.org or those derived from a word processor by saving in xml format. If a text file is used paragraphs and headings should be separated by blank lines. In such a file there is no way to distinguish between paragraphs and headings, so they will all be formatted as paragraphs, as specified by the configuration file. However, if you want a blank line in the braille transcription use two consecutive blank lines in the text file.
This is the name of the output file. It will be transcribed as specified by the configuration file and the -C configuration settings. The following paragraphs provide more information on both the input and output files.
file2brl is set up so that it can be used in a "pipe". To do
this, omit both infile and outfile. Input is then taken from the
standard input unit and output is written to the standard output unit.
The first file name encountered (a word not preceded by a minus sign) is taken to be the input file and the second to be the output file. If you wish input to be taken from stdin and still want to specify an output file, use one minus sign (‘-’) for the input file.
If only the program name is typed file2brl assumes that the
configuration file is preferences.cfg, input is from the
standard input unit, and output is to the standard output unit.
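The rules above for picking out the configuration file, infile and outfile can be sketched in a few lines of Python. This is only an illustration of the rules as described in this manual, not the code file2brl uses. The function name split_files is hypothetical, None stands for the standard input or output unit, and the sketch assumes that -f is the only option that consumes the following word.

```python
def split_files(args):
    """Return (config, infile, outfile) for a file2brl-style argument list.

    None means "use the standard input/output unit".
    Assumption: only -f takes a value; every other option is a plain flag.
    """
    config = "preferences.cfg"          # default named in the manual
    files = []
    i = 0
    while i < len(args):
        word = args[i]
        if word == "-f" and i + 1 < len(args):
            config = args[i + 1]        # -f consumes the next word
            i += 2
            continue
        if word == "-" or not word.startswith("-"):
            files.append(word)          # a lone '-' is a file name meaning stdin
        i += 1
    infile = files[0] if files and files[0] != "-" else None
    outfile = files[1] if len(files) > 1 else None
    return config, infile, outfile


print(split_files(["-f", "my.cfg", "-", "out.brf"]))
```

With no arguments at all the sketch yields the defaults described above: preferences.cfg, standard input and standard output.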
See the previous section on using file2brl. This program
recognizes text files automatically and transcribes them according to
the information in the configuration files. Paragraphs must be
separated with a blank line. If you want a blank line in the output use
two blank lines.
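The blank-line convention for text files can be illustrated with a small Python sketch. It is not the library's own parser. The function name read_paragraphs is hypothetical, and an empty string in the result marks a blank line requested by two consecutive blank lines in the input.

```python
def read_paragraphs(text):
    """Split plain text into paragraphs; "" marks a requested blank line."""
    out, current, blanks = [], [], 0
    for line in text.splitlines():
        if line.strip():
            if blanks >= 2 and out:
                out.append("")              # two blank lines -> one blank output line
            blanks = 0
            current.append(line.strip())
        else:
            if current:                     # a blank line closes the open paragraph
                out.append(" ".join(current))
                current = []
            blanks += 1
    if current:
        out.append(" ".join(current))
    return out


sample = "One\nstill one\n\nTwo\n\n\nThree"
print(read_paragraphs(sample))
```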
file2brl -p infile outfile
Some text documents, such as those derived from pdf files, and even
some xml and html documents, are so poorly formatted that you can get
better braille by ignoring whatever markup they contain. The
-p option of file2brl does this. It ignores xml or
html markup and uses heuristics to find the beginning of paragraphs.
Its choices are usually good. Note that it does not work with rtf
files.
file2brl -t infile outfile
The -t option tells file2brl to go directly to the html parser.
Without it, file2brl first tries to transcribe infile as an xml
document, which produces a lot of error messages, and only then
tries the html parser. Note that xhtml documents are actually xml.
The operation of liblouisutdml is controlled by two types of files: semantic-action files and configuration files. The former are discussed in the section Connecting with the xml Document - Semantic-Action Files. The latter are discussed in this section. A third type of file, braille translation tables, is discussed in the liblouis documentation (see Overview in Liblouis User’s and Programmer’s Manual). Another section of the present document which may be of interest is Implementing Braille Mathematics Codes.
Besides files, liblouisutdml can also be controlled by configuration
strings, which are character strings in memory containing configuration
settings separated by end-of-line characters. Such strings can be
generated by the -C option on the file2brl command
line, by the configstring and configtweak semantic
actions, or by passing a string to the lbu_initialize function.
The information below applies to file2brl as much as to
liblouisutdml.
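As an illustration of what a configuration string looks like, the Python sketch below joins settings with end-of-line characters. The helper name make_config_string is hypothetical, and the sketch assumes each setting is written in the same "name value" form used in configuration files. How the string is then handed to the library is not shown.

```python
def make_config_string(settings):
    """Join 'name value' settings, one per line, into a single string."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


text = make_config_string({"cellsPerLine": 32, "linesPerPage": 25})
print(repr(text))
```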
Before discussing configuration files in detail it is worth noting
that the application program has access to the information in the
configuration files by calling the liblouisutdml function
lbu_initialize. This function returns a pointer to a data
structure containing the configuration information. The calling program
must include the header file louisutdml.h. You do not need to call
lbu_initialize unless you need the facilities which it provides.
A configuration file specification may contain more than one file
name, separated by commas. liblouisutdml will process these files in
sequence, merging the information they contain. The first file name
may also contain a path. liblouisutdml will search for the files it
needs first on this path. To make it search first the current
directory precede the first file name with ./. After the path,
if any, has been evaluated, but before reading any of the files,
liblouisutdml reads in a file called liblouisutdml.ini. This
file can contain any configuration settings, but it usually contains
only the minimum ones for liblouisutdml to operate properly. You may
alter the values in the distribution liblouisutdml.ini, but you
should not delete any settings. Do not specify
liblouisutdml.ini as your configuration file. This will lead to
error messages and program termination. If a configuration file read
in later contains a particular setting name, the value specified
simply replaces the one specified in liblouisutdml.ini or any
previously read configuration file.
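The merging order described above can be pictured with the following Python sketch. It is a simplified model, not the library's implementation. The names split_spec and merge_settings are hypothetical, settings are represented as plain dictionaries, and the sample values are made up for illustration.

```python
import posixpath


def split_spec(spec):
    """Split a comma-separated file specification into (path, file names).

    Only the first name may carry a path.
    """
    names = spec.split(",")
    path, first = posixpath.split(names[0])
    return path, [first] + names[1:]


def merge_settings(ini, *config_files):
    """Merge settings in reading order; a later value replaces an earlier one."""
    merged = dict(ini)                  # liblouisutdml.ini is read first
    for settings in config_files:
        merged.update(settings)
    return merged


ini = {"cellsperline": "40", "linesperpage": "25"}
mine = {"cellsperline": "32"}
print(split_spec("/home/me/cfg/a.cfg,b.cfg"))
print(merge_settings(ini, mine))
```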
Originally, configuration files contained four main sections,
outputFormat, translation, xml and style.
The section names, except for style, are now optional. In
addition, a configuration file can contain an include entry. This
causes the file named on that line to be read in at the point where
the line occurs. The sections need not follow each other in any
particular order, nor is the order of settings within each section
important. In
this document and in the liblouisutdml.ini file, where section
and setting names consist of more than one word, the first letter of
each word following the initial one is capitalized. This is merely for
readability. The case of the letters in these names is ignored by the
program. Section and setting names may not contain spaces.
In addition to liblouisutdml.ini the distribution also contains
a number of configuration files. The most important of these is
preferences.cfg, which contains all possible settings and a
"default" value for each. You should use this file as a reference.
It is the file read by the file2brl command-line interface
program if no configuration file is given.
Here, then, is an explanation of each section and setting in the preferences.cfg file. When you look at this file you will see that the section names start at the left margin, while the settings are indented one tab stop. This is done for readability. It has no effect on the meaning of the lines. You will also see lines beginning with a number sign (‘#’), which are comments. Blank lines can also be used anywhere in a configuration file. In general, a section name is a single word or combination of unspaced words. However, each style has a section of its own, so the word ‘style’ is followed by a space then by the name of the style. Setting lines begin with the name of the setting, followed by at least one space or tab, followed by the value of the setting. A few settings have two values.
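The line format just described can be made concrete with a short Python sketch of a reader for it. This is not the parser used by liblouisutdml. The name parse_config is hypothetical, and the sketch recognizes a section line by its first word being one of the four section names given earlier. Settings that take two values are kept as one unsplit string.

```python
SECTIONS = {"outputformat", "translation", "xml", "style"}


def parse_config(text):
    """Return {section: {setting: value}}; section and setting names lower-cased."""
    result, section = {}, ""            # "" holds settings given before any section
    for raw in text.splitlines():
        line = raw.strip()              # indentation carries no meaning
        if not line or line.startswith("#"):
            continue                    # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()          # case of names is ignored
        rest = parts[1] if len(parts) > 1 else ""
        if key in SECTIONS:
            # a style section is the word 'style' followed by the style name
            section = ("style " + rest) if key == "style" else key
            result.setdefault(section, {})
        else:
            result.setdefault(section, {})[key] = rest
    return result


sample = """\
# a comment
outputFormat
\tcellsPerLine 40
\tlinesPerPage 25

style para
\tfirstLineIndent 2
"""
print(parse_config(sample))
```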
This section specifies the format of the output file (or string).
cellsPerLine 40
The number of cells in a braille line.
linesPerPage 25
The number of lines on a braille page.
interpoint no
Whether or not the output will be used to produce interpoint braille. This affects the placement of page numbers and may affect other things in the future. The only two values recognized are ‘yes’ and ‘no’.
lineEnd \r\n
This specifies the control characters to be placed at the end of each output line. These characters vary from one intended use of the output to another. Most embossers require the carriage-return and line-feed combination specified above. However, a braille display may work best with just one or the other. Any valid control characters can be specified.
pageEnd \f
The control character to be given at the end of a page. Here it is a form-feed character, but it can be something else if needed.
fileEnd ^z
The control character to be placed at the end of the file, here a control-z.
printPages yes
Whether or not to show print page numbers if they are given in the xml input. The two valid values are ‘yes’ and ‘no’.
braillePages yes
Whether or not to format the output into pages. Here the value is
‘yes’, for use with an embosser. However the user of a braille
display may wish to specify ‘no’, so as not to be bothered with
page numbers and form-feed characters. If ‘no’ is specified the lines
will still be of the length given in cellsPerLine, but the
value of linesPerPage will be ignored.
paragraphs yes
Whether or not to format the output into paragraphs, using appropriate styles. If ‘no’ is specified, what would be a paragraph is output simply as one long line. Applications that wish to do their own formatting may specify ‘no’.
beginningPageNumber 1
This is the number to be placed on the first Braille page if
braillePages is yes. This is useful when producing multiple
Braille volumes.
printPageNumberAt top
If print page numbers are given in the xml input file they will be
placed at the top of each braille page in the right-hand corner. If
pageSeparator is set to ‘yes’, a page separator line will
also be produced on the Braille page where the print page break
actually occurs. You may also specify ‘bottom’ for this setting.
braillePageNumberAt bottom
The braille page number will be placed in the bottom right-hand corner
of each page. If interpoint yes has been specified only odd
pages will receive page numbers. You may also specify ‘top’ for
this setting. If print page numbers and Braille page numbers are both
placed at the top or bottom, they are rendered next to each other with
a space in between.
continuePages yes
Print page numbers can be prefixed with a letter (a, b, c, etc.) on continued pages. The two valid values are ‘yes’ and ‘no’.
pageSeparator yes
A page separator line (or page break indicator), a line of unspaced Braille dots 36, will be placed wherever a print page break occurs. No page separator lines are placed on the first or last line of a Braille page, and no page separator lines are shown when the new print page coincides with a new Braille page.
pageSeparatorNumber yes
Show a page number at the far right margin of a page separator line. No space is left between the separator line and the first symbol of the page number.
ignoreEmptyPages yes
An empty page occurs when a pagenum tag is immediately followed
by another pagenum tag. By default, empty pages are completely
ignored. If you specify ‘no’ for this setting, a sequence of
pagenum tags will lead to a combined print page number:
the number of the first empty page is combined with that of the page
on which text reappears, e.g. 5-7. If lettered continuation pages are
required (see continuePages), they carry only the number of the
page on which text reappears.
printPageNumberRange no
By default, only the page number of the first print page on a
Braille page is shown at the top or bottom. However, if
printPageNumberRange is set to ‘yes’, the range of
print pages contained in the current Braille page is displayed. If the
first page in this range is a continued print page, it is prefixed
with a letter as usual (see continuePages).
mergeUnnumberedPages yes
Page breaks without a page number can simply be ignored. This means that unnumbered print pages will be treated as if they were a part of the preceding page. You can also specify ‘no’ for this setting.
pageNumberTopSeparateLine yes
Whether or not to provide a separate line for page numbers when they
are placed at the top of a Braille page. The two valid values are
‘yes’ and ‘no’. A print page number range (see
printPageNumberRange) at the top of a page is always displayed
on a separate line.
pageNumberBottomSeparateLine yes
Whether or not to provide a separate line for page numbers when they are placed at the bottom of a Braille page.
hyphenate no
If ‘yes’ is specified words will be hyphenated at the ends of
lines if a hyphenation table is available. In contracted English
Braille hyphenation is not generally used, but it can save
considerable space. The hyphenation table is specified as part of the
table list in the literaryTextTable setting of the translation
section.
outputEncoding ascii8
This specifies that the output is to be in the form of 8-bit ASCII characters. This is generally used if the output is intended directly for a braille embosser or display. The other values of encoding are ‘UTF8’, ‘UTF16’ and ‘UTF32’. These are useful if the application will process the output further, such as for generating displays of braille dots on a screen.
inputTextEncoding ascii8
This setting is used to specify the encoding of an input text file. The valid values are ‘UTF8’ and ‘ascii8’.
formatFor textDevice
This setting specifies the type of device the output is intended for. ‘textDevice’ is any device that accepts plain text, including embossers. You can also specify ‘browser’. In this case the output will be formatted for viewing in a browser. If the input file contains links, they will be preserved and can be used in the normal way. The text will be translated into braille with the correct line length. Math and computer material will be translated appropriately. These files work well in lynx and Internet Explorer, not so well in elinks and Firefox (before Jaws 10).
backFormat plain
This setting specifies the format of back-translated files. ‘plain’ specifies plain-text, while ‘html’ specifies xhtml. The latter is always encoded in UTF-8. Plain-text files can be encoded in ascii8, UTF-8 or UTF-16. Html is strongly recommended, since it will preserve print page numbering and emphasis.
backLineLength 70
This setting specifies the length of lines in back-translated files, whether in plain-text or html. This is mainly for human readability. Lines may sometimes be somewhat longer.
lineFill '
This setting defines the fill character that will be used before the page numbers, for example in the table of contents. The default fill character is an apostrophe (dot 3).
This section specifies the liblouis translation tables to be used for various purposes.
literaryTextTable en-us-g2.ctb
The table used for producing literary braille. This may be either contracted or uncontracted.
uncontractedTable en-us-g1.ctb
The table used for producing uncontracted or Grade One braille. This setting appears to be superfluous and may be eliminated in the future.
compbrailleTable en-us-compbrl.ctb
The table used for producing large amounts of output in computer braille, such as computer programs. The computer braille table is usually combined with one of the two tables above.
mathtextTable en-us-mathtext.ctb
This table specifies how the non-mathematical parts of math books are to be translated. In many cases it will be the same as literaryTextTable or uncontractedTable. For books translated with the Nemeth Code it is different, because this code requires modification of standard Grade Two.
MathexpTable nemeth.ctb
This is the table used to translate mathematical expressions.
editTable nemeth_edit.ctb
When the output includes both mathematics and text there may be errors where one type of translation directly follows another. The editTable removes these errors.
This section provides various information for the processing of xml files.
semanticFiles *,nemeth.sem
This setting gives a list of semantic-action files. These files are read in the sequence given in the list. Here the first member of the list is an asterisk (‘*’). This means that the corresponding file is to be named by taking the root element of the document and appending ‘.sem’. This asterisk member may occur anywhere in the list.
xmlheader <?xml version='1.0' encoding='UTF8' standalone='yes'?>
This line gives the xml header to be added to strings produced by
programs like Mathtype that lack one.
entity nbsp ^1
This line defines an entity or substitution in an xml file. It is one of those that has two values. The first is the thing to be replaced, and the second is the replacement. As many entity lines as necessary can be used. The information they contain is added to the information provided by xmlHeader. In liblouisutdml.ini this line is commented out, because specifying it at this point would prevent the user from specifying his own xmlheader.
internetAccess yes
The computer has an internet connection and liblouisutdml may obtain information necessary for the processing of this file from the Internet. If this setting is ‘no’ liblouisutdml will not try to use the internet. The necessary information may, however, be provided on the local machine in the form of a "dtd" file.
newEntries yes
liblouisutdml may create a new semantic-action file (beginning with new_) for a document with an unknown root element or a file (beginning with appended_) containing new entries for an existing semantic-action file. Both kinds of files are placed in the current directory. If this setting is ‘no’ liblouisutdml will not create a file of new entries and if it encounters a document with an unknown root element it will issue an error message. Setting newEntries to ‘no’ may be useful if users should not be bothered with the minutiae of semantic-action files.
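The way the asterisk in the semanticFiles list is expanded can be shown with a short Python sketch. It only mirrors the rule stated for that setting (root element name plus ‘.sem’). The function name resolve_semantic_files is hypothetical and is not part of liblouisutdml.

```python
def resolve_semantic_files(setting, root_element):
    """Expand a semanticFiles value into file names, in reading order."""
    names = []
    for item in setting.split(","):
        item = item.strip()
        # '*' stands for the file named after the document's root element
        names.append(root_element + ".sem" if item == "*" else item)
    return names


print(resolve_semantic_files("*,nemeth.sem", "dtbook"))
```

Because the asterisk may occur anywhere in the list, the sketch keeps each member in its original position.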
The following sections all deal with styles. Each style has its own
section. Style section names are unlike other section names in that they
consist of the word style, followed by a space, followed by a style
name. With some exceptions, styles are not hard-coded. The user may
define any style desired, with any name except document,
para, heading1, heading2, heading3,
heading4, contentsheader, contents1,
contents2, contents3 and contents4. The first two
are needed for basic formatting. The others are needed for the table of
contents tool. The user must define settings for these styles as for any
others. This is done in liblouisutdml.ini, which also contains
definitions and settings for many other styles. The user can add styles
at any time in her/his own configuration files.
Styles can be nested. That is, a document may contain a section of one style, and inside this may be a section of another style. For example, you might have styles named frontMatter, titlePage, dedication, contents, and so on. Your document might contain a section of style frontMatter. Inside this section might be subsections of styles titlePage, dedication, contents, and so on. Inside the titlePage section there might be other sections with styles heading1, para, centered, etc.
Your frontMatter style might also define the "persistent" style setting
braillePageNumberFormat roman. This setting will apply to all the
styles nested within frontMatter, unless they have a setting other than
‘normal’, which is the default and means ordinary braille page
numbers. However, the titlePage style might have the setting
braillePageNumberFormat blank. This will apply to all styles
nested within it. When the titlePage section ends, the frontMatter
setting ‘roman’ will be restored. The
‘braillePageNumberFormat’ setting is an example of a "persistent"
style setting. Most settings apply only to the style for which they are
declared.
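The behaviour of a persistent setting in nested styles can be modelled with the Python sketch below. It is a simplified illustration of the frontMatter and titlePage example, not the library's code. The function name page_number_format is hypothetical, and the list of open styles is written from outermost to innermost.

```python
def page_number_format(open_styles, styles):
    """Return the effective braillePageNumberFormat for nested styles.

    The innermost open style with a value other than 'normal' wins.
    """
    for name in reversed(open_styles):
        value = styles.get(name, {}).get("braillePageNumberFormat", "normal")
        if value != "normal":
            return value
    return "normal"


styles = {
    "frontMatter": {"braillePageNumberFormat": "roman"},
    "titlePage": {"braillePageNumberFormat": "blank"},
    "dedication": {},
}

print(page_number_format(["frontMatter", "titlePage", "heading1"], styles))
print(page_number_format(["frontMatter", "dedication"], styles))
```

When titlePage is no longer in the list of open styles, the lookup falls back to the frontMatter value, which corresponds to ‘roman’ being restored.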
Below are the settings for the predefined style names. The ‘document’ style contains all possible settings. The others contain only settings that are different from the defaults.
This is a predefined style name. All settings have their default values. The user must specify any other values. If a "persistent" style setting is specified, it will apply to the whole document.
linesBefore 0
This setting gives the number of blank lines which should be left before the text to which this style applies. It is set to a non-zero value for some header styles.
linesAfter 0
The number of blank lines which should be left after the text to which this style applies.
leftMargin 0
The number of cells by which the left margin of all lines in the text should be indented. Used for hanging indents, among other things. This is a "persistent" setting, so by default all nested styles will inherit the setting.
rightMargin 0
The equivalent of ‘leftMargin’ for the right side of the page. This is also a persistent setting.
firstLineIndent 0
The number of cells by which the first line is to be indented relative to leftMargin. firstLineIndent may be negative. If the result is less than 0 it will be set to 0. This setting is persistent.
translate contracted
This setting is currently inactive. It may be used in the future. This setting tells how text in this style should be translated. Possible values are ‘contracted’, ‘uncontracted’, ‘compbrl’, ‘mathtext’ and ‘mathexpr’.
skipNumberLines no
If this setting is ‘yes’ the top and bottom lines on the page will be skipped if they contain braille or print page numbers. This is useful in some of the mathematical and graphical styles.
format leftJustified
The format setting controls how the text in the style will be formatted. Valid values are ‘leftJustified’, ‘rightJustified’, ‘centered’, ‘computerCoded’, ‘alignColumnsLeft’, ‘alignColumnsRight’, and ‘contents’. The first three are self-explanatory. ‘computerCoded’ is used for computer programs and similar material. The next two are used for tabular material. ‘alignColumnsLeft’ causes the left ends of columns to be aligned. ‘alignColumnsRight’ causes the right ends of columns to be aligned. ‘contents’ is used only in styles specifically intended for tables of contents. In the case of ‘leftJustified’, ‘rightJustified’ and ‘centered’, nested styles inherit this setting by default.
newPageBefore no
If this setting is ‘yes’, the text will begin on a new page. This is useful for certain mathematical and graphical styles. Page numbers are handled properly.
newPageAfter no
If this setting is ‘yes’ any remaining space on the page after the material covered by this style is handled is left blank, except for page numbers.
rightHandPage no
If this setting is ‘yes’ and interpoint is ‘yes’ the material covered by this style will start on a right-hand page. This may cause a left-hand page to be left blank except for page numbers. If interpoint is ‘no’ this setting is equivalent to newPageBefore.
braillePageNumberFormat normal
This setting specifies the format of braille page numbers. ‘normal’ means ordinary Arabic numbers. ‘roman’ means Roman numbers. ‘p’ means to precede Arabic numbers with the letter "p" (for preliminary). Finally, ‘blank’ causes the page number to be blank (no page numbers). This is a "persistent" style setting.
dontSplit no
If this setting is ‘yes’, the element is protected from being split across pages. This means that if a block of text doesn’t fit on the current page, it will be placed at the beginning of the next one. This setting applies to the whole element, including children, so if nested styles specify other values for ‘dontSplit’, these values will be ignored.
keepWithNext no
If this setting is ‘yes’, the element covered by this style is protected from being split across pages, and in addition it is kept together with the first line of text of the next sibling.
orphanControl 0
With this setting you can control the minimum number of lines of an element that must be printed at the bottom of a braille page. The default value is ‘0’. To have an effect, the setting must have a value of ‘2’ or more.
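To make the four values of braillePageNumberFormat concrete, here is a small Python sketch that renders a page number as print text for each of them. It is illustrative only. The function names are hypothetical, the use of lower-case Roman numerals is an assumption, and the sketch says nothing about how the result is finally written in braille.

```python
def to_roman(n):
    """Convert a positive integer to lower-case Roman numerals."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, letters in pairs:
        while n >= value:
            out += letters
            n -= value
    return out


def format_page_number(n, fmt="normal"):
    """Render page number n according to a braillePageNumberFormat value."""
    if fmt == "roman":
        return to_roman(n)
    if fmt == "p":
        return "p" + str(n)             # 'p' for preliminary pages
    if fmt == "blank":
        return ""                       # no page number at all
    return str(n)                       # 'normal': ordinary Arabic numbers


for fmt in ("normal", "roman", "p", "blank"):
    print(fmt, repr(format_page_number(14, fmt)))
```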
This style is used to specify where the table of contents should be placed and its title. The xml tag assigned to it in the semantic action file should be placed in the document where you want the table of contents, and it should contain the title of that table between its starting and ending markers.
linesBefore 1
linesAfter 1
format centered

This style and the other contents styles are used for the table of contents and correspond to the ten heading levels (‘contents5’, ‘contents6’, ‘contents7’, ‘contents8’, ‘contents9’ and ‘contents10’ are not shown here).
firstLineIndent -2
leftMargin 2
format contents

firstLineIndent -2
leftMargin 4
format contents

firstLineIndent -2
leftMargin 6
format contents

firstLineIndent -2
leftMargin 8
format contents

This style is used for main headings, such as chapter titles.
linesBefore 1
center yes
linesAfter 1

The first level of subheadings after the main heading.
linesBefore 1
firstLineIndent 4

The third level of headings.
firstLineIndent 4

The fourth level of headings. There are six more levels: ‘heading5’, ‘heading6’, ‘heading7’, ‘heading8’, ‘heading9’ and ‘heading10’.
firstLineIndent 4

Paragraph. This is ordinary body text.
firstLineIndent 2

Typically used to form the top and bottom lines of "boxed" material. The character must be chosen to produce the desired dot pattern on the embosser or display in use.
topBoxline .
This should be set to the character you want used for the boxline which appears before the content.
bottomBoxline .
This should be set to the character you want used for the boxline which appears after the content.
When liblouisutdml (or file2brl) processes an xml document, it
needs to be told how to use the information in that document to
produce a properly translated and formatted braille document. These
instructions are provided by a semantic-action file, so called because
it explains the meaning, or semantics, of the various specifications
in the xml document. To understand how this works, it is necessary to
have a basic knowledge of the organization of an xml document.
An xml document is organized like a book, but with much finer detail.
First there is the title of the whole book. Then there are various
sections, such as author, copyright, table of contents, dedication,
acknowledgments, preface, various chapters, bibliography, index, and so
on. Each chapter may be divided into sections, and these in turn can be
divided into subsections, subsubsections, etc. In a book the parts have
names or titles distinguished by capitalization, type fonts, spacing,
and so forth. In an xml document the names of the parts are enclosed in
angle brackets (‘<>’). For example, if liblouisutdml encounters
<html> at the beginning of a document, it knows it is dealing
with a document that conforms to the standards of the extensible hypertext markup
language (xhtml) - at least we hope it does. When you see a book, you
know it’s a book. The computer can know only by being told. Something
enclosed in angle brackets is called an "element" (more properly, a
"tag") in xml parlance. (There may be more between the angle brackets
than just the name of the element. More of this later). The first
"element" in a document thus tells liblouisutdml what kind of document it
is dealing with. This element is called the "root element" because the
document is visualized as branching out from it like a tree. Some
examples of root elements are <html>, <math>,
<book>, <dtbook> and <wordDocument>. Whenever
liblouisutdml encounters a root element that it doesn’t know about it
creates a new file called a semantic-action file. The name of this file
is formed by stripping the angle brackets from the root element, putting
‘new_’ in front of it and adding a period plus the letters
‘sem’. For example, ‘new_myformat.sem’. If you look in a
directory containing semantic-action files you will see names like
html.sem, dtbook.sem, math.sem, and so on. The
"new" semantic-action files must be edited by a person and the prefix
‘new_’ removed to get an ordinary semantic-action file name.
Sometimes it is advantageous to preempt the creation of a
semantic-action file for a new root element. For example, an article
written according to the docbook specification may have the root element
<article>. However, the specification itself has the root element
<book>. In this case you can specify the book.sem file in
the configuration file by writing, in the xml section:
semanticFiles book.sem
You will note that this setting uses the plural of "file". This is because you can actually specify a list of file names separated by commas. You might want to do this to specify the semantic-action file for the particular braille mathematical code to be used. For example:
semanticFiles book.sem,ukmaths.sem
You can use an asterisk * to specify the semantic-action file
corresponding to the root element of the document anywhere in the list.
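The way such a list might be expanded can be sketched in Python. This is only an illustration of the rules described above, not liblouisutdml's own code: the function name `resolve_semantic_files` is hypothetical, and the sketch assumes that the file for a root element is named after that element with `.sem` appended, as in the examples earlier in this section.

```python
def resolve_semantic_files(setting, root_element):
    """Expand a semanticFiles value into an ordered list of file names.

    setting      -- the value of the setting, e.g. "book.sem,ukmaths.sem"
    root_element -- the root element name without angle brackets
    """
    files = []
    for entry in setting.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty items such as a trailing comma
        if entry == "*":
            # Assumption: '*' stands for the file named after the root element.
            files.append(root_element + ".sem")
        else:
            files.append(entry)
    return files


explicit = resolve_semantic_files("book.sem,ukmaths.sem", "article")
with_star = resolve_semantic_files("*,ukmaths.sem", "dtbook")
```

Here `explicit` keeps the two file names in the order given, while `with_star` has the asterisk replaced by the file for the document's root element.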
As you will see in the next section, different braille style conventions and different braille mathematical codes may require different semantic-action files.
liblouisutdml records the names of all elements found in the document in the semantic-action file. The document has a multitude of elements, which can be thought of as describing the headings of various parts of the document. One element is used to denote a chapter heading. Another is used to denote a paragraph, still another to denote text in bold type, and so on. In other words, the elements take the place of the capitalization, changes in type font, spacing, etc. in a book. However, the computer still does not know what to do when it encounters an element. The semantic-action file tells it that.
Consider html.sem. A copy is included as part of this documentation with the name example_html.sem (see Example files). It may differ from the file that liblouisutdml is currently using. You will see that it begins with some lines about copyrights. Each line begins with a number sign (‘#’). This indicates that it is a "comment", intended for the human reader and the computer should ignore it. Then there is a blank line. Finally, there are two other comments explaining that the file must be edited to get proper output. This is because a human being must tell the computer what to do with each element. The semantic files for common types of documents have already been edited, so you generally don’t have to worry about this. But if you encounter a new type of document or wish to specify special handling for styles or mathematics you may have to edit the semantic-action file or send it to the maintainer for editing. In any case the rest of this section is essential for understanding how liblouisutdml handles documents and for making changes if the way it does so is not correct.
After another blank line you will see a table consisting of two, and sometimes three, columns. The first column contains a word which tells the computer to do something. For example, the first entry in the table is: ‘include nemeth.sem’. This tells liblouisutdml to include the information in the nemeth.sem file when it is deciphering an html (actually xhtml) document (it may be preferable to use the semanticFiles setting in the configuration file rather than an include).
The second row of the table is:
no hr
‘hr’ is an element with the angle brackets removed. It means nothing in itself. However, the first column contains the word ‘no’. This tells liblouisutdml "no do", that is, do nothing. This is not strictly true, since liblouisutdml will sometimes insert a blank space so that words in text do not run together.
After a few more lines with ‘no’ in the first column, we see one that says:
softreturn br
This means that when the element <br> is encountered,
liblouisutdml is to do a soft return, that is, start a new line without
starting a new paragraph.
The next line says:
heading1 h1
This tells liblouisutdml that when it encounters the element <h1>
it is to format the text which follows as a first-level braille
heading, that is, the text will be centered and preceded and followed
by blank lines. (You can change this by changing the definition of the
heading1 style).
The next line says:
italicx em
This tells liblouisutdml that when it encounters the element <em>
it is to enclose the text which follows in braille italic indicators.
The ‘x’ at the end of the semantic action name is there to
prevent conflicts with names elsewhere in the software. Just where the
italic indicators will be placed is controlled by the liblouis
translation table in use.
The next line says:
skip style
This tells liblouisutdml to simply skip ahead until it encounters the
element </style>. Nothing in between will have any effect on
the braille output. Note the slash (‘/’) before the ‘style’.
This means the end of whatever the <style> element was
referring to. Actually, it was referring to specifications of how
things should be printed. If liblouisutdml had not been told to skip
these specifications, the braille output would have contained a lot of
gobbledygook.
The next line says:
italicx strong
This tells liblouisutdml to also use the italic braille indicators for the
text between the <strong> and </strong> elements.
After a few more lines with ‘no’ in the first column we come to the line:
document html
This tells liblouisutdml that everything between <html> and
</html> is an entire document. <html> was the root
element of this document, so this is logical.
After another ‘no’ line we come to:
para p
liblouisutdml will consider everything between <p> and
</p> to be a normal body text paragraph.
The next line is:
heading1 title
This causes the title of the document to also be treated as a braille level 1 heading.
Next we have the line:
list li
The xhtml <li> and </li> pair of elements is used to
enclose an item in a list. liblouisutdml will format this with its own
list style. That is, the first line will begin at the left margin and
subsequent lines will be indented two cells.
Next we have:
table table
You will note that the names of actions and elements are often identical. This is because they are both mnemonic. In any case, this line tells liblouisutdml to format the table contained in the xhtml document according to the table formatting rules it has been given for braille output.
Next we have the line:
heading2 h2
This means that the text between <h2> and </h2> is to be
formatted according to the liblouisutdml style heading2. A blank line
will be left before the heading and the first line will be indented
four spaces.
After a few more lines we come to:
no table,cellpadding
Note the comma in the second column. This divides the column into two
subcolumns. The first is the table element name. The second is called
an "attribute" in xml. It gives further instructions about the
material enclosed between the starting and ending "tags" of the
element (<table> and </table>). Full information requires
three subcolumns. The third is called the value and gives the actual
information. The attribute is merely the name of the information.
Much further down we find:
no table,border,0
Here the element is table, the attribute is border and the value is 0. If liblouisutdml were to interpret this, it would mean that the table was to have a border of 0 width. It is not told to do so because tables in braille do not have borders.
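The layout described so far (comment lines, blank lines, two or three columns, and comma-separated subcolumns in the second column) can be modelled with a short Python sketch. It is an illustration only and does not reproduce liblouisutdml's actual parser; the function name `parse_sem_lines` is hypothetical, and the sketch assumes that columns are separated by whitespace and that rows with fewer than two columns can be ignored.

```python
def parse_sem_lines(text):
    """Parse semantic-action text into (action, specifier, arguments) tuples.

    specifier is a tuple of one to three parts: element, attribute, value.
    arguments is the optional third column, or None if it is absent.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no instructions
        columns = line.split(None, 2)  # split on whitespace, at most 3 columns
        if len(columns) < 2:
            continue  # assumption: incomplete rows are ignored
        action = columns[0]
        specifier = tuple(columns[1].split(","))
        arguments = columns[2] if len(columns) == 3 else None
        entries.append((action, specifier, arguments))
    return entries


sample = """\
# comment lines start with a number sign

no hr
softreturn br
no table,cellpadding
no table,border,0
"""
entries = parse_sem_lines(sample)
```

With this sample, the row for `hr` yields a one-part specifier, while the last row yields the three parts `table`, `border` and `0`. The sketch gives `include` lines no special treatment.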
Now let’s look at the file which is included at the beginning of the html.sem file. This is nemeth.sem. As with html.sem, a copy is included in the appendix (see Example files), but it is not necessarily the one that liblouisutdml is currently using. It illustrates several more things about how liblouisutdml uses semantic-action files.
The first thing you will notice is that for quite a few lines the
first and second columns are identical. This is because the MathML
element and attribute names are part of a standard, and it was
simplest to use the element names for the semantic actions as well. Most
of these actions do not do anything and could be replaced with the
generic semantic action. They are retained for backward
compatibility.
The first line of real interest is:
math math
Every mathematical expression begins with the element <math>
(which may have attributes and values), and ends with </math>.
This is therefore the root element of a mathematical expression.
However, mathematical expressions are usually part of a document, so
it is not given the semantic action document. The math semantic action
causes liblouisutdml to carry out special interpretation actions. These
will become clearer as we continue to look at the nemeth.sem
file. You will note that this line has three columns. The meaning of
the third column is discussed below.
After another uninteresting line we come to two that illustrate several more facts about semantic-action files:
mfrac mfrac ^?,/,^#
mfrac mfrac,linethickness,0 ^(,^;%,^)
Like the math entry above, the first line has three columns. While the
first two columns must always be present, the third column is
optional. Here, it is also divided into subcolumns by commas. The
element <mfrac> indicates a fraction. A fraction has two parts,
a numerator and a denominator. In xml, we call these parts children of
<mfrac>. They may be represented in various ways, which need
not concern us here. What is of real importance is that the third
column tells liblouisutdml to put the characters ‘^?’ before the
numerator, ‘/’ between the numerator and denominator, and
‘^#’ after the denominator. Later on, liblouis will translate
these characters into the proper representation of a fraction in the
Nemeth Code of Braille Mathematics. (For other mathematical codes,
see Implementing Braille Mathematics Codes).
The second line is of even greater interest. The first column is again ‘mfrac’, but this line is for a binomial coefficient. The second column contains three subcolumns, an element name, an attribute name and an attribute value. The attribute linethickness specifies the thickness of the line separating the numerator and denominator. Here it is 0, so there is no line. This is how the binomial coefficient is represented in print. The third column tells how to represent it in braille. liblouisutdml will supply ‘^(’, upper number, ‘^;%’, lower number, ‘^)’ to liblouis, which will then produce the proper braille representation for the binomial coefficient.
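The effect of the third column on the two mfrac lines can be illustrated with a small Python sketch. It is a simplified model rather than liblouisutdml's implementation: the function name `apply_third_column` is hypothetical, the children are represented as plain strings, and escape sequences beginning with a backslash are not handled.

```python
def apply_third_column(children, third_column):
    """Interleave the comma-separated strings with an element's children.

    For n children the sketch expects n + 1 strings: one before the first
    child, one between each pair of children, and one after the last.
    """
    parts = third_column.split(",")
    if len(parts) != len(children) + 1:
        raise ValueError("expected one more string than there are children")
    pieces = [parts[0]]
    for child, following in zip(children, parts[1:]):
        pieces.append(child)
        pieces.append(following)
    return "".join(pieces)


# The fraction 1/2, using the third column of the first mfrac line.
fraction = apply_third_column(["1", "2"], "^?,/,^#")
# A binomial coefficient, using the third column of the second mfrac line.
binomial = apply_third_column(["n", "k"], "^(,^;%,^)")
```

The resulting strings are what would be handed to liblouis for translation; the sketch does not attempt the braille translation itself.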
Returning to the line for the math element, we see that the third column begins with a backslash followed by an asterisk. The backslash is an escape character which gives a special meaning to the character which follows it. Here the asterisk means that what follows is to be placed at the very end of the mathematical expression, no matter how complex it is.
For further discussion of how the third column is used see Implementing Braille Mathematics Codes. The third column is not limited to mathematics. It can be used to add characters to anything enclosed by an xml tag.
Here is a complete list of the semantic actions which liblouisutdml recognizes. Some of them are also the names of styles. These are listed in the first table. For a discussion of these, see Customization Configuring liblouisutdml.
Generally the format of a semantic action is:
semanticAction elementSpecifier optionalArguments
elementSpecifier is the second-column value, which may be an
element name, an element-attribute pair or an element-attribute-value
triplet, separated by commas. This specifies where a semantic action
is to be applied. If it is solely an element then the action is
applied if this element is encountered. If it is an element-attribute
pair then the action is applied if the given element also has the
specified attribute. In the last case, with an element-attribute-value
triplet, the action is applied only if the element has the specified
attribute and the value of this attribute is equal to the specified
value.
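The three cases can be expressed as a short Python sketch. It only models the matching rule as stated above; the function name `specifier_matches` and the use of a dictionary for an element's attributes are assumptions made for illustration and are not part of liblouisutdml.

```python
def specifier_matches(specifier, element, attributes):
    """Return True if an element specifier applies to an element.

    specifier  -- string such as "table", "table,border" or "table,border,0"
    element    -- the element name, without angle brackets
    attributes -- dict mapping the element's attribute names to their values
    """
    parts = specifier.split(",", 2)
    if parts[0] != element:
        return False
    if len(parts) == 1:
        return True  # element only
    if parts[1] not in attributes:
        return False
    if len(parts) == 2:
        return True  # element-attribute pair
    # element-attribute-value triplet: the value must be equal as well
    return attributes[parts[1]] == parts[2]


# A <table border="1"> element, checked against three specifiers.
table_attributes = {"border": "1"}
matches_element = specifier_matches("table", "table", table_attributes)
matches_pair = specifier_matches("table,border", "table", table_attributes)
matches_triplet = specifier_matches("table,border,0", "table", table_attributes)
```

In this example the first two specifiers apply, and the triplet does not because the value of border is not 0.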
contents1 elementSpecifier
Note that the contents1, etc. semantic actions are never assigned an actual elementSpecifier. They are used internally by the table of contents generator. They should be assigned style settings, however.
contents2 elementSpecifier
contents3 elementSpecifier
contents4 elementSpecifier
contentsheader elementSpecifier
This semantic action must be assigned an element specifier if used. See the discussion of it in style.