

SWISH-CONFIG - Configuration File Directives


Table of Contents:

[ TOC ]

Swish-e CONFIGURATION FILE

A configuration file controls which files Swish-e indexes, how they are indexed, and where the index is written.

The configuration file is a text file composed of comments, blank lines, and configuration directives. The order of the directives is not important. Some directives may be used more than once in the configuration file, while others can only be used once (i.e. additional directives will overwrite preceding directives). The case of the directive name is not important -- you may use upper, lower, or mixed case.

A comment is any line that begins with a "#".

 
    # This is a comment

As of 2.4.3 lines may be continued by placing a backslash as the last character on the line:

 
    IgnoreWords \
        am \
        the \
        foo
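
The comment and continuation rules above can be sketched in a few lines of Python. This is only an illustration, not Swish-e's own parser: the function name logical_lines is hypothetical, and the sketch assumes that leading white space before a "#" is allowed and that a continued line is joined to the next one with the backslash removed.

```python
def logical_lines(text):
    """Yield logical config lines: skip blank and '#' lines, join backslash continuations."""
    pending = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        # Blank lines and comments are only skipped outside a continuation.
        if not pending and (not line.strip() or line.lstrip().startswith("#")):
            continue
        if line.endswith("\\"):
            pending += line[:-1] + " "   # drop the backslash, keep reading
            continue
        yield (pending + line).strip()
        pending = ""
    if pending:                          # file ended in the middle of a continuation
        yield pending.strip()


sample = "# word list\nIgnoreWords \\\n    am \\\n    the \\\n    foo\n"
for entry in logical_lines(sample):
    print(entry.split())
```

Here the four physical lines of the IgnoreWords example are read as one directive with three parameters.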

Directives may take more than one parameter. Enclose single parameters that include whitespace in quotes (single or double). Inside of quotes the backslash escapes the next character.

 
    ReplaceRules append "foo bar"   <- define "foo bar" as a single parameter

If you need to include a quote character in the value either use a backslash to escape it, or enclose it in quotes of the other type.

For example, under Unix you can use quotes to include white space in a single parameter. Here, single quotes protect path names (%p) that might contain embedded white space; they also protect against shell expansion of metacharacters:

 
    FileFilter .foo foofilter "'%p'"   <- parameter passed through the shell in single quotes
    FileFilter .foo foofilter '"%p"'   <- Windows uses double quotes
    FileFilter .foo foofilter '\'%p\'' <- silly example
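
The quoting rules above can be sketched as a small tokenizer. This is a minimal Python illustration, not Swish-e's actual parser: the function name split_params is hypothetical, and the sketch assumes that a backslash outside of quotes is kept as a literal character.

```python
def split_params(line):
    """Split one directive line into parameters using the quoting rules described above."""
    params, current, quote, started = [], [], None, False
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])  # inside quotes a backslash escapes the next character
                i += 1
            elif ch == quote:
                quote = None                 # closing quote
            else:
                current.append(ch)           # includes quotes of the other type
        elif ch in "'\"":
            quote, started = ch, True        # opening quote
        elif ch.isspace():
            if started:                      # white space ends the current parameter
                params.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:
        params.append("".join(current))
    return params


print(split_params('ReplaceRules append "foo bar"'))
print(split_params("FileFilter .foo foofilter \"'%p'\""))
```

In this sketch the first and third FileFilter lines above both produce a last parameter of %p wrapped in single quotes, and the second produces %p wrapped in double quotes.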

Backslashes also have special meaning in regular expressions.

 
    FileFilterMatch pdftotext "'%p' -" /\.pdf$/

This says that the dot is a real dot (instead of matching any character). If you place the regular expression in quotes then you must use double-backslashes.

 
    FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"

Swish-e will convert the double backslash into a single backslash before passing the parameter to the regular expression compiler.
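
The difference the escaped dot makes can be checked in any regular expression engine. The sketch below uses Python's re module purely for illustration; Swish-e's regular expression engine may differ from Python's in other respects, and the file names are made up.

```python
import re

escaped = re.compile(r"\.pdf$")    # a literal dot followed by "pdf" at the end
unescaped = re.compile(r".pdf$")   # any character followed by "pdf" at the end

for name in ("manual.pdf", "notes_pdf", "report.txt"):
    # Print the name and whether each pattern matches it.
    print(name, bool(escaped.search(name)), bool(unescaped.search(name)))
```

Only the escaped pattern rejects a name such as notes_pdf, which is why the backslash must survive into the pattern that reaches the regular expression compiler.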

Commented example configuration files are included in the conf directory of the Swish-e distribution.

Some command line arguments can override directives specified in the configuration file. See also the SWISH-RUN page for instructions on running Swish-e, and the SWISH-SEARCH page for information and examples on how to search your index.

The configuration file is specified to Swish-e by the -c switch. For example,

 
    swish-e -c myconfig.conf

You may also split your directives up into different configuration files. This allows you to have a master configuration file used for many different indexes, and smaller configuration files for each separate index. You can specify the different configuration files when running from the command line with the -c switch (see SWISH-RUN), or you may include other configuration files with the IncludeConfigFile directive described below.

Typically, in a configuration file the directives are grouped together in some logical order -- that is, directives that control the source of the documents would be grouped together first, and directives that control how each document is filtered or how its words are indexed would be in another group. (The directives listed below are grouped in this order.)

The configuration file directives are listed below in these groups:

[ TOC ]


Alphabetical Listing of Directives

[ TOC ]


Directives that Control Swish

These configuration directives control the general behavior of Swish-e.

IncludeConfigFile *path to config file*

This directive can be used to include configuration directives located in another file.

 
    IncludeConfigFile /usr/local/swish/conf/site_config.config

IndexReport [0|1|2|3]

Sets how detailed the reporting is while indexing. You can specify a number from 0 to 3: 0 is totally silent and 3 is the most verbose. The default is 1.

This may be overridden from the command line via the -v switch (see SWISH-RUN).

ParserWarnLevel [0|1|2|3]

Sets the error level when using the libxml2 parser for XML and HTML. libxml2 will point out structural errors in your documents.

 
    0 = no report
    1 = fatal errors
    2 = errors
    3 = warnings

The exception is that UTF-8 to Latin-1 conversion errors are reported at level 1. This is because words may be indexed incorrectly in these cases.

Note that unlike other errors generated by Swish-e, these errors are sent to stderr.

IndexFile *path*

IndexFile specifies the location of the generated index file. If not specified, Swish-e will create the file index.swish-e in the current directory.

 
    IndexFile /usr/local/swish/site.index

obeyRobotsNoIndex [yes|NO]

When enabled, Swish-e will not index any HTML file that contains:

 
    <meta name="robots" content="noindex">

The default is to ignore these meta tags and index the document. This tag is described at http://www.robotstxt.org/wc/exclusion.html.

Note: This feature is only available with the libxml2 HTML parser.

Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following comments in your documents to prevent indexing:

 
       <!-- SwishCommand noindex -->
       <!-- SwishCommand index -->

These shorter forms may also be used:

 
       <!-- noindex -->
       <!-- index -->

For example, these are very helpful to prevent indexing of common headers, footers, and menus.

NOTE: The following items are currently not available. These items require Swish-e to parse the configuration file while searching.

EnableAltSearchSyntax [yes|NO]

NOTE: The following item is currently not available.

Enable alternate search syntax. Allows the use of a basic search syntax like that of "Altavista(c)", "Lycos(c)", etc. This means a search query can contain "+" and "-" as syntax parameters.
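
Although this directive is not currently available, the intended meaning of the prefixes can be illustrated with a short Python sketch. It is not Swish-e code: the function name classify_terms and the group names are hypothetical. It assumes that "+" marks a word that must appear in every result, "-" marks a word that must not appear, and an unprefixed word is simply searched for.

```python
def classify_terms(query):
    """Group the words of a query by their '+' or '-' prefix."""
    groups = {"required": [], "excluded": [], "optional": []}
    for word in query.split():
        if word.startswith("+") and len(word) > 1:
            groups["required"].append(word[1:])
        elif word.startswith("-") and len(word) > 1:
            groups["excluded"].append(word[1:])
        else:
            groups["optional"].append(word)
    return groups


print(classify_terms("+word1 +word2 -word3  word4 word5"))
```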

Example:

 
    swish-e -w "+word1 +word2 -word3  word4 word5"
    "+"  = following word has to be in all found documents
    "-"  = following word may not be in any document found
    " "  = following word will be searched in documents

SwishSearchOperators <and-word> <or-word> <not-word>

NOTE: The following item is currently not available.

Using this config directive you can change the Boolean search operators of Swish-e, e.g. to adapt them to your language. The default is: AND OR NOT

Example (German):

 
    SwishSearchOperators   UND  ODER  NICHT

SwishSearchDefaultRule [<AND-WORD>|<or-word>]

NOTE: The following item is currently not available.

SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.

The word you specify must match one of the available SwishSearchOperators.

Example:

 
    SwishSearchOperators   UND  ODER  NICHT
    # Make it act like a web search engine
    SwishSearchDefaultRule ODER
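
How renamed operators and a default rule could combine is sketched below in Python. This is an illustration only, since the directives are not currently available: the function name normalize_query is hypothetical, operator words are assumed to be matched without regard to case, and the default rule is assumed to be inserted only between two adjacent search words.

```python
def normalize_query(words, operators=("AND", "OR", "NOT"), default="AND"):
    """Map configured operator words to AND/OR/NOT and insert the default rule between bare words."""
    and_word, or_word, not_word = (w.lower() for w in operators)
    mapping = {and_word: "AND", or_word: "OR", not_word: "NOT"}
    if default.lower() not in (and_word, or_word):
        raise ValueError("default rule must be the configured and-word or or-word")
    default_op = mapping[default.lower()]
    out, prev_is_term = [], False
    for word in words:
        op = mapping.get(word.lower())
        if op is None:                   # an ordinary search word
            if prev_is_term:
                out.append(default_op)   # no operator was given between two words
            out.append(word)
            prev_is_term = True
        else:                            # a configured operator word
            out.append(op)
            prev_is_term = False
    return out


print(normalize_query(["linux", "kernel", "NICHT", "windows"],
                      operators=("UND", "ODER", "NICHT"), default="ODER"))
```

With the German operators and ODER as the default rule, the two adjacent words are joined by OR while the explicit NICHT is read as NOT.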

ResultExtFormatName name -x format string

NOTE: The following item is currently not available.

The output of Swish-e can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.

Examples:

 
    ResultExtFormatName  moreinfo   "%c|%r|%t|%p|<author>|<publishyear>\n"

Then when searching you can specify the format string's name:

 
    swish-e   ...  -x moreinfo  ...

See the -x switch in SWISH-RUN for more information about output formats.



Administrative Headers Directives

Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:

IndexName *text*

IndexDescription *text*

IndexPointer *text*

IndexAdmin *text*

These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.

Examples:

 
    IndexName "Linux Documentation"
    IndexDescription "This is an index of /usr/doc on our Linux machine." 
    IndexPointer http://localhost/swish/linux/index.html
    IndexAdmin webmaster



Document Source Directives

These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

IndexDir [directories or files|URL|external program]

IndexDir defines the source of the documents for Swish-e. Swish-e currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.

The -S command line argument is used to select the file access method.

 
    swish-e -c swish.config -S fs    - file system
    swish-e -c swish.config -S http  - internal http spider
    swish-e -c swish.config -S prog  - external program of any type

For the fs method of access IndexDir is a space-separated list of files and directories to index. Use a forward slash as the path separator in MS Windows.

For the http method the IndexDir setting is a list of space-separated URLs.

For the prog method the IndexDir setting is a list of space-separated programs to run (which generate documents for swish to index).

You may specify more than one IndexDir directive.

Any sub-directories of any listed directory will also be indexed.

Note: While processing directories, Swish-e will ignore any files or directories that begin with a dot ("."). You may index files or directories that begin with a dot by specifying their name with IndexDir or -i.

Examples:

 
    # Index this directory and any subdirectories
    IndexDir /usr/local/home/http

 
    # Index the docs directory in current directory
    IndexDir ./docs

 
    # Index these files in the current directory
    IndexDir ./index.html ./page1.html ./page2.html
    # and index this directory, too
    IndexDir ../public_html
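
The rule about names that begin with a dot can be sketched as a directory walk. The Python below is an illustration, not Swish-e's implementation: the function name files_to_index is hypothetical, and it assumes that a file or directory named explicitly is always taken, while dot entries discovered during the walk are skipped.

```python
import os


def files_to_index(paths):
    """Return the files under the given paths, skipping dot entries found while walking."""
    found = []
    for path in paths:
        if os.path.isfile(path):
            found.append(path)           # named explicitly, so kept even if it starts with "."
            continue
        for root, dirs, files in os.walk(path):
            # Pruning dirs in place stops os.walk from descending into dot directories.
            dirs[:] = sorted(d for d in dirs if not d.startswith("."))
            found.extend(os.path.join(root, name)
                         for name in sorted(files) if not name.startswith("."))
    return found


if __name__ == "__main__":
    for name in files_to_index(["./docs"]):
        print(name)
```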

For the HTTP method of access specify the URLs from which you want the spidering to begin.

Example:

 
    IndexDir http://www.my-site.com/index.html
    IndexDir http://localhost/index.html

Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.

For the prog method of access IndexDir specifies the path to the program(s) to execute. The external program must correctly format the documents being passed back to Swish-e. Examples of external programs are provided in the prog-bin directory.

 
    IndexDir ./myprogram.pl

See prog for details.

Note: Not all directives work with all methods.

NoContents *list of file suffixes*

Files with these suffixes will not have their contents indexed, but will have their path name (file name) indexed instead.

If the file's type is HTML or HTML2 (as set by IndexContents or DefaultContents) then the file will be parsed for an HTML title and that title will be indexed. Note that you must set the file's type with IndexContents or DefaultContents: .html and .htm are NOT type HTML by default. For example:

 
   IndexContents HTML* .htm .html

If a title is found, it will still be checked for FileRules title, and the file will be skipped if a match is found. See FileRules.

If the file's type is not HTML, or it is HTML and no title is found, then the file's path will be indexed.

For example, this will allow searching by image file name.

 
    NoContents .gif .xbm .au
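
The decision described above for a file whose suffix is listed in NoContents can be summarized in a short Python sketch. It is not Swish-e code: the function name indexed_text and its parameters are hypothetical, suffix matching is assumed to be an exact comparison, and the FileRules title check is left out.

```python
import os


def indexed_text(path, no_contents_suffixes, doc_type=None, title=None):
    """Return what a NoContents file adds to the index, or None if its contents are indexed normally."""
    suffix = os.path.splitext(path)[1]
    if suffix not in no_contents_suffixes:
        return None                      # not covered by NoContents
    if doc_type in ("HTML", "HTML2") and title:
        return title                     # HTML file with a title: the title is indexed
    return path                          # otherwise the path name is indexed


suffixes = {".gif", ".xbm", ".au", ".html"}
print(indexed_text("images/logo.gif", suffixes))
print(indexed_text("about.html", suffixes, doc_type="HTML", title="About Us"))
```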

  • ResultExtFormatName name -x format string

  • SpiderDirectory *path*

  • StoreDescription [XML <tag>|HTML <meta>|TXT size]

  • SwishProgParameters *list of parameters*

  • SwishSearchDefaultRule [<AND-WORD>|<or-word>]

  • SwishSearchOperators <and-word> <or-word> <not-word>

  • TmpDir *path*

  • TranslateCharacters [*string1 string2*|:ascii7:]

  • TruncateDocSize *number of characters*

  • UndefinedMetaTags [error|ignore|INDEX|auto]

  • UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]

  • UseStemming [yes|NO]

  • UseSoundex [yes|NO]

  • UseWords [*list of words*|File: path]

  • WordCharacters *string of characters*

  • XMLClassAttributes *list of XML attribute names*

    [ TOC ]


    Directives that Control Swish

    These configuration directives control the general behavior of Swish-e.

    IncludeConfigFile *path to config file*

    This directive can be used to include configuration directives located in another file.

     
        IncludeConfigFile /usr/local/swish/conf/site_config.config

    IndexReport [0|1|2|3]

    This is how detailed you want reporting while indexing. You can specify numbers 0 to 3. 0 is totally silent, 3 is the most verbose. The default is 1.

    This may be overridden from the command line via the -v switch (see SWISH-RUN).

    ParserWarnLevel [0|1|2|3]

    Sets the error level when using the libxml2 parser for XML and HTML. libxml2 will point out structural errors in your documents.

     
        0 = no report
        1 = fatal errors
        2 = errors
        3 = warnings

    The exception to this is UTF-8 to Latin-1 conversion errors are reported at level 1. This is because words may be indexed incorrectly in these cases.

    Note that unlike other errors generated by Swish-e, these errors are sent to stderr.

    IndexFile *path*

    Index file specifies the location of the generated index file. If not specified, Swish-e will create the file index.swish-e in the current directory.

     
        IndexFile /usr/local/swish/site.index

    obeyRobotsNoIndex [yes|NO]

    When enabled, Swish-e will not index any HTML file that contains:

     
        <meta name="robots" content="noindex">

    The default is to ignore these meta tags and index the document. This tag is described at http://www.robotstxt.org/wc/exclusion.html.

    Note: This feature is only available with the libxml2 HTML parser.

    Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following comments in your documents to prevent indexing:

     
           <!-- SwishCommand noindex -->
           <!-- SwishCommand index -->

    and/or these may be used also:

     
           <!-- noindex -->
           <!-- index -->

    For example, these are very helpful to prevent indexing of common headers, footers, and menus.

    NOTE: This following items are currently not available. These items require Swish-e to parse the configuration file while searching.

    EnableAltSearchSyntax [yes|NO]

    NOTE: This following item is currently not available.

    Enable alternate search syntax. Allows the usage of a basic "Altavista(c)", "Lycos(c)", etc. like search syntax. This means a search query can contain "+" and "-" as syntax parameter.

    Example:

     
        swish-e -w "+word1 +word2 -word3  word4 word5"
        "+"  = following word has to be in all found documents
        "-"  = following word may not be in any document found
        " "  = following word will be searched in documents

    SwishSearchOperators <and-word> <or-word> <not-word>

    NOTE: This following item is currently not available.

    Using this config directive you can change the boolean search operators of Swish-e, e.g. to adapt these to your language. The default is: AND OR NOT

    Example (german):

     
        SwishSearchOperators   UND  ODER  NICHT

    SwishSearchDefaultRule [<AND-WORD>|<or-word>]

    NOTE: This following item is currently not available.

    SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.

    The word you specify must match one of the available SwishSearchOperators.

    Example:

     
        SwishSearchOperators   UND  ODER  NICHT
        # Make it act like a web search engine
        SwishSearchDefaultRule ODER

    ResultExtFormatName name -x format string

    NOTE: This following item is currently not available.

    The output of Swish-e can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.

    Examples:

     
        ResultExtFormatName  moreinfo   "%c|%r|%t|%p|<author>|<publishyear>\n"

    Then when searching you can specify the format string's name

     
        swish-e   ...  -x moreinfo  ...

    See the -x switch in SWISH-RUN for more information about output formats.

    [ TOC ]


    Administrative Headers Directives

    Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:

    IndexName *text*

    IndexDescription *text*

    IndexPointer *text*

    IndexAdmin *text*

    These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a more full description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.

    Examples:

     
        IndexName "Linux Documentation"
        IndexDescription "This is an index of /usr/doc on our Linux machine." 
        IndexPointer http://localhost/swish/linux/index.html
        IndexAdmin webmaster

    [ TOC ]


    Document Source Directives

    These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

    IndexDir [directories or files|URL|external program]

    IndexDir defines the source of the documents for Swish-e. Swish-e currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.

    The -S command line argument is used to select the file access method.

     
        swish-e -c swish.config -S fs    - file system
        swish-e -c swish.config -S http  - internal http spider
        swish-e -c swish.config -S prog  - external program of any type

    For the fs method of access IndexDir is a space-separated list of files and directories to index. Use a forward slash as the path separator in MS Windows.

    For the http method the IndexDir setting is a list of space-separated URLs.

    For the prog method the IndexDir setting is a list of space-separated programs to run (which generate documents for swish to index).

    You may specify more than one IndexDir directive.

    Any sub-directories of any listed directory will also be indexed.

    Note: While processing directories, Swish-e will ignore any files or directories that begin with a dot ("."). You may index files or directories that begin with a dot by specifying their name with IndexDir or -i.

    Examples:

     
        # Index this directory an any subdirectories
        IndexDir /usr/local/home/http

     
        # Index the docs directory in current directory
        IndexDir ./docs

     
        # Index these files in the current directory
        IndexDir ./index.html ./page1.html ./page2.html
        # and index this directory, too
        IndexDir ../public_html

    For the HTTP method of access specify the URL's from which you want the spidering to begin.

    Example:

     
        IndexDir http://www.my-site.com/index.html
        IndexDir http://localhost/index.html

    Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.

    For the prog method of access IndexDir specifies the path to the program(s) to execute. The external program must correctly format the documents being passed back to Swish-e. Examples of external programs are provided in the prog-bin directory.

     
        IndexDir ./myprogram.pl

    See prog for details.

    Note: Not all directives work with all methods.

    NoContents *list of file suffixes*

    Files with these suffixes will not have their contents indexed, but will have their path name (file name) indexed instead.

    If the file's type is HTML or HTML2 (as set by IndexContents or DefaultContents) then the file will be parsed for a HTML title and that title will be indexed. Note that you must set the file's type with IndexContents or DefaultContents: .html and .htm are NOT type HTML by default. For example:

     
       IndexContents HTML* .htm .html

    If a title is found, it will still be checked for FileRules title, and the file will be skipped if a match is found. See FileRules.

    If the file's type is not HTML, or it is HTML and no title is found, then the file's path will be indexed.

    For example, this will allow searching by image file name.

     
        NoContents .gif .xbm .au .place|remove|prepend|append|regex]
    
    

  • ResultExtFormatName name -x format string

  • SpiderDirectory *path*

  • StoreDescription [XML <tag>|HTML <meta>|TXT size]

  • "SwishProgParameters *list of parameters*

  • SwishSearchDefaultRule [<AND-WORD>|<or-word>]

  • SwishSearchOperators <and-word> <or-word> <not-word>

  • TmpDir *path*

  • TranslateCharacters [*string1 string2*|:ascii7:]

  • TruncateDocSize *number of characters*

  • UndefinedMetaTags [error|ignore|INDEX|auto]

  • UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]

  • UseStemming [yes|NO]

  • UseSoundex [yes|NO]

  • UseWords [*list of words*|File: path]

  • WordCharacters *string of characters*

  • XMLClassAttributes *list of XML attribute names*

    [ TOC ]


    Directives that Control Swish

    These configuration directives control the general behavior of Swish-e.

    IncludeConfigFile *path to config file*

    This directive can be used to include configuration directives located in another file.

     
        IncludeConfigFile /usr/local/swish/conf/site_config.config

    IndexReport [0|1|2|3]

    This is how detailed you want reporting while indexing. You can specify numbers 0 to 3. 0 is totally silent, 3 is the most verbose. The default is 1.

    This may be overridden from the command line via the -v switch (see SWISH-RUN).

    ParserWarnLevel [0|1|2|3]

    Sets the error level when using the libxml2 parser for XML and HTML. libxml2 will point out structural errors in your documents.

     
        0 = no report
        1 = fatal errors
        2 = errors
        3 = warnings

    The exception to this is UTF-8 to Latin-1 conversion errors are reported at level 1. This is because words may be indexed incorrectly in these cases.

    Note that unlike other errors generated by Swish-e, these errors are sent to stderr.

    IndexFile *path*

    Index file specifies the location of the generated index file. If not specified, Swish-e will create the file index.swish-e in the current directory.

     
        IndexFile /usr/local/swish/site.index

    obeyRobotsNoIndex [yes|NO]

    When enabled, Swish-e will not index any HTML file that contains:

     
        <meta name="robots" content="noindex">

    The default is to ignore these meta tags and index the document. This tag is described at http://www.robotstxt.org/wc/exclusion.html.

    Note: This feature is only available with the libxml2 HTML parser.

    Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following comments in your documents to prevent indexing:

     
           <!-- SwishCommand noindex -->
           <!-- SwishCommand index -->

    and/or these may be used also:

     
           <!-- noindex -->
           <!-- index -->

    For example, these are very helpful to prevent indexing of common headers, footers, and menus.

    NOTE: This following items are currently not available. These items require Swish-e to parse the configuration file while searching.

    EnableAltSearchSyntax [yes|NO]

    NOTE: This following item is currently not available.

    Enable alternate search syntax. Allows the usage of a basic "Altavista(c)", "Lycos(c)", etc. like search syntax. This means a search query can contain "+" and "-" as syntax parameter.

    Example:

     
        swish-e -w "+word1 +word2 -word3  word4 word5"
        "+"  = following word has to be in all found documents
        "-"  = following word may not be in any document found
        " "  = following word will be searched in documents

    SwishSearchOperators <and-word> <or-word> <not-word>

    NOTE: This following item is currently not available.

    Using this config directive you can change the boolean search operators of Swish-e, e.g. to adapt these to your language. The default is: AND OR NOT

    Example (german):

     
        SwishSearchOperators   UND  ODER  NICHT

    SwishSearchDefaultRule [<AND-WORD>|<or-word>]

    NOTE: This following item is currently not available.

    SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.

    The word you specify must match one of the available SwishSearchOperators.

    Example:

     
        SwishSearchOperators   UND  ODER  NICHT
        # Make it act like a web search engine
        SwishSearchDefaultRule ODER

    ResultExtFormatName name -x format string

    NOTE: This following item is currently not available.

    The output of Swish-e can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.

    Examples:

     
        ResultExtFormatName  moreinfo   "%c|%r|%t|%p|<author>|<publishyear>\n"

    Then when searching you can specify the format string's name

     
        swish-e   ...  -x moreinfo  ...

    See the -x switch in SWISH-RUN for more information about output formats.

    [ TOC ]


    Administrative Headers Directives

    Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:

    IndexName *text*

    IndexDescription *text*

    IndexPointer *text*

    IndexAdmin *text*

    These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a more full description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.

    Examples:

     
        IndexName "Linux Documentation"
        IndexDescription "This is an index of /usr/doc on our Linux machine." 
        IndexPointer http://localhost/swish/linux/index.html
        IndexAdmin webmaster

    [ TOC ]


    Document Source Directives

    These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

    IndexDir [directories or files|URL|external program]

    IndexDir defines the source of the documents for Swish-e. Swish-e currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.

    The -S command line argument is used to select the file access method.

     
        swish-e -c swish.config -S fs    - file system
        swish-e -c swish.config -S http  - internal http spider
        swish-e -c swish.config -S prog  - external program of any type

    For the fs method of access IndexDir is a space-separated list of files and directories to index. Use a forward slash as the path separator in MS Windows.

    For the http method the IndexDir setting is a list of space-separated URLs.

    For the prog method the IndexDir setting is a list of space-separated programs to run (which generate documents for swish to index).

    You may specify more than one IndexDir directive.

    Any sub-directories of any listed directory will also be indexed.

    Note: While processing directories, Swish-e will ignore any files or directories that begin with a dot ("."). You may index files or directories that begin with a dot by specifying their name with IndexDir or -i.

    Examples:

     
        # Index this directory an any subdirectories
        IndexDir /usr/local/home/http

     
        # Index the docs directory in current directory
        IndexDir ./docs

     
        # Index these files in the current directory
        IndexDir ./index.html ./page1.html ./page2.html
        # and index this directory, too
        IndexDir ../public_html

    For the HTTP method of access specify the URL's from which you want the spidering to begin.

    Example:

     
        IndexDir http://www.my-site.com/index.html
        IndexDir http://localhost/index.html

    Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.

    For the prog method of access IndexDir specifies the path to the program(s) to execute. The external program must correctly format the documents being passed back to Swish-e. Examples of external programs are provided in the prog-bin directory.

     
        IndexDir ./myprogram.pl

    See prog for details.

    Note: Not all directives work with all methods.

    NoContents *list of file suffixes*

    Files with these suffixes will not have their contents indexed, but will have their path name (file name) indexed instead.

    If the file's type is HTML or HTML2 (as set by IndexContents or DefaultContents) then the file will be parsed for a HTML title and that title will be indexed. Note that you must set the file's type with IndexContents or DefaultContents: .html and .htm are NOT type HTML by default. For example:

     
       IndexContents HTML* .htm .html

    If a title is found, it will still be checked for FileRules title, and the file will be skipped if a match is found. See FileRules.

    If the file's type is not HTML, or it is HTML and no title is found, then the file's path will be indexed.

    For example, this will allow searching by image file name.

     
        NoContents .gif .xbm .au
    
    

  • ResultExtFormatName name -x format string

  • SpiderDirectory *path*

  • StoreDescription [XML <tag>|HTML <meta>|TXT size]

  • SwishProgParameters *list of parameters*

  • SwishSearchDefaultRule [<AND-WORD>|<or-word>]

  • SwishSearchOperators <and-word> <or-word> <not-word>

  • TmpDir *path*

  • TranslateCharacters [*string1 string2*|:ascii7:]

  • TruncateDocSize *number of characters*

  • UndefinedMetaTags [error|ignore|INDEX|auto]

  • UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]

  • UseStemming [yes|NO]

  • UseSoundex [yes|NO]

  • UseWords [*list of words*|File: path]

  • WordCharacters *string of characters*

  • XMLClassAttributes *list of XML attribute names*

    [ TOC ]


    Directives that Control Swish

    These configuration directives control the general behavior of Swish-e.

    IncludeConfigFile *path to config file*

    This directive can be used to include configuration directives located in another file.

     
        IncludeConfigFile /usr/local/swish/conf/site_config.config

    IndexReport [0|1|2|3]

    Sets how detailed the reporting is while indexing. You can specify a number from 0 to 3: 0 is totally silent and 3 is the most verbose. The default is 1.

    This may be overridden from the command line via the -v switch (see SWISH-RUN).
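
    For example, a configuration that requests more detail than the default could contain the line below. The level shown is only an illustration, and a -v switch given on the command line would still take precedence:

     
        IndexReport 2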

    ParserWarnLevel [0|1|2|3]

    Sets the error level when using the libxml2 parser for XML and HTML. libxml2 will point out structural errors in your documents.

     
        0 = no report
        1 = fatal errors
        2 = errors
        3 = warnings

    The exception is that UTF-8 to Latin-1 conversion errors are reported at level 1, because words may be indexed incorrectly in these cases.

    Note that unlike other errors generated by Swish-e, these errors are sent to stderr.
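
    A minimal sketch of a configuration that selects the libxml2 HTML parser and reports errors as well as fatal errors. It assumes that each level also includes the levels below it, which the table above implies but does not state; the file suffixes are illustrative:

     
        IndexContents HTML2 .htm .html
        ParserWarnLevel 2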

    IndexFile *path*

    IndexFile specifies the location of the generated index file. If not specified, Swish-e will create the file index.swish-e in the current directory.

     
        IndexFile /usr/local/swish/site.index

    obeyRobotsNoIndex [yes|NO]

    When enabled, Swish-e will not index any HTML file that contains:

     
        <meta name="robots" content="noindex">

    The default is to ignore these meta tags and index the document. This tag is described at http://www.robotstxt.org/wc/exclusion.html.

    Note: This feature is only available with the libxml2 HTML parser.
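
    Because the feature depends on the libxml2 HTML parser, a configuration that enables it would also need to assign the HTML2 type to the files concerned. A sketch, with illustrative file suffixes:

     
        IndexContents HTML2 .htm .html
        obeyRobotsNoIndex yes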

    Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following comments in your documents to prevent indexing:

     
           <!-- SwishCommand noindex -->
           <!-- SwishCommand index -->

    and/or these may be used also:

     
           <!-- noindex -->
           <!-- index -->

    For example, these are very helpful to prevent indexing of common headers, footers, and menus.

    NOTE: The following items are currently not available. These items require Swish-e to parse the configuration file while searching.

    EnableAltSearchSyntax [yes|NO]

    NOTE: The following item is currently not available.

    Enables an alternate search syntax, similar to the basic syntax of search engines such as "Altavista(c)" and "Lycos(c)". With this syntax a search query can contain "+" and "-" as prefixes on words.

    Example:

     
        swish-e -w "+word1 +word2 -word3  word4 word5"
        "+"  = following word has to be in all found documents
        "-"  = following word may not be in any document found
        " "  = following word will be searched in documents

    SwishSearchOperators <and-word> <or-word> <not-word>

    NOTE: The following item is currently not available.

    This directive changes the Boolean search operators of Swish-e, e.g. to adapt them to your language. The default is: AND OR NOT

    Example (German):

     
        SwishSearchOperators   UND  ODER  NICHT

    SwishSearchDefaultRule [<AND-WORD>|<or-word>]

    NOTE: The following item is currently not available.

    SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.

    The word you specify must match one of the available SwishSearchOperators.

    Example:

     
        SwishSearchOperators   UND  ODER  NICHT
        # Make it act like a web search engine
        SwishSearchDefaultRule ODER

    ResultExtFormatName name -x format string

    NOTE: The following item is currently not available.

    The output of Swish-e can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.

    Examples:

     
        ResultExtFormatName  moreinfo   "%c|%r|%t|%p|<author>|<publishyear>\n"

    Then, when searching, you can specify the format string's name:

     
        swish-e   ...  -x moreinfo  ...

    See the -x switch in SWISH-RUN for more information about output formats.

    [ TOC ]


    Administrative Headers Directives

    Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:

    IndexName *text*

    IndexDescription *text*

    IndexPointer *text*

    IndexAdmin *text*

    These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters, and values that contain whitespace should be enclosed in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.

    Examples:

     
        IndexName "Linux Documentation"
        IndexDescription "This is an index of /usr/doc on our Linux machine." 
        IndexPointer http://localhost/swish/linux/index.html
        IndexAdmin webmaster

    [ TOC ]


    Document Source Directives

    These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

    IndexDir [directories or files|URL|external program]

    IndexDir defines the source of the documents for Swish-e. Swish-e currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.

    The -S command line argument is used to select the file access method.

     
        swish-e -c swish.config -S fs    - file system
        swish-e -c swish.config -S http  - internal http spider
        swish-e -c swish.config -S prog  - external program of any type

    For the fs method of access IndexDir is a space-separated list of files and directories to index. Use a forward slash as the path separator in MS Windows.
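
    For instance, on MS Windows a directory could be written with forward slashes as sketched below. The drive letter and directory names are hypothetical:

     
        IndexDir C:/web/docs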

    For the http method the IndexDir setting is a list of space-separated URLs.

    For the prog method the IndexDir setting is a list of space-separated programs to run (which generate documents for Swish-e to index).

    You may specify more than one IndexDir directive.

    Any sub-directories of any listed directory will also be indexed.

    Note: While processing directories, Swish-e will ignore any files or directories that begin with a dot ("."). You may index files or directories that begin with a dot by specifying their name with IndexDir or -i.

    Examples:

     
        # Index this directory an any subdirectories
        IndexDir /usr/local/home/http

     
        # Index the docs directory in current directory
        IndexDir ./docs

     
        # Index these files in the current directory
        IndexDir ./index.html ./page1.html ./page2.html
        # and index this directory, too
        IndexDir ../public_html

    For the HTTP method of access specify the URL's from which you want the spidering to begin.

    Example:

     
        IndexDir http://www.my-site.com/index.html
        IndexDir http://localhost/index.html

    Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.

    For the prog method of access IndexDir specifies the path to the program(s) to execute. The external program must correctly format the documents being passed back to Swish-e. Examples of external programs are provided in the prog-bin directory.

     
        IndexDir ./myprogram.pl

    See prog for details.

    Note: Not all directives work with all methods.

    NoContents *list of file suffixes*

    Files with these suffixes will not have their contents indexed, but will have their path name (file name) indexed instead.

    If the file's type is HTML or HTML2 (as set by IndexContents or DefaultContents) then the file will be parsed for a HTML title and that title will be indexed. Note that you must set the file's type with IndexContents or DefaultContents: .html and .htm are NOT type HTML by default. For example:

     
       IndexContents HTML* .htm .html

    If a title is found, it will still be checked for FileRules title, and the file will be skipped if a match is found. See FileRules.

    If the file's type is not HTML, or it is HTML and no title is found, then the file's path will be indexed.

    For example, this will allow searching by image file name.

     
        NoContents .gif .xbm .au .place|remove|prepend|append|regex]
    
    

  • ResultExtFormatName name -x format string

  • SpiderDirectory *path*

  • StoreDescription [XML <tag>|HTML <meta>|TXT size]

  • "SwishProgParameters *list of parameters*

  • SwishSearchDefaultRule [<AND-WORD>|<or-word>]

  • SwishSearchOperators <and-word> <or-word> <not-word>

  • TmpDir *path*

  • TranslateCharacters [*string1 string2*|:ascii7:]

  • TruncateDocSize *number of characters*

  • UndefinedMetaTags [error|ignore|INDEX|auto]

  • UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]

  • UseStemming [yes|NO]

  • UseSoundex [yes|NO]

  • UseWords [*list of words*|File: path]

  • WordCharacters *string of characters*

  • XMLClassAttributes *list of XML attribute names*

    [ TOC ]


    Directives that Control Swish

    These configuration directives control the general behavior of Swish-e.

    IncludeConfigFile *path to config file*

    This directive can be used to include configuration directives located in another file.

     
        IncludeConfigFile /usr/local/swish/conf/site_config.config

    IndexReport [0|1|2|3]

    This is how detailed you want reporting while indexing. You can specify numbers 0 to 3. 0 is totally silent, 3 is the most verbose. The default is 1.

    This may be overridden from the command line via the -v switch (see SWISH-RUN).

    ParserWarnLevel [0|1|2|3]

    Sets the error level when using the libxml2 parser for XML and HTML. libxml2 will point out structural errors in your documents.

     
        0 = no report
        1 = fatal errors
        2 = errors
        3 = warnings

    The exception to this is UTF-8 to Latin-1 conversion errors are reported at level 1. This is because words may be indexed incorrectly in these cases.

    Note that unlike other errors generated by Swish-e, these errors are sent to stderr.

    IndexFile *path*

    IndexFile specifies the location of the generated index file. If not specified, Swish-e will create the file index.swish-e in the current directory.

     
        IndexFile /usr/local/swish/site.index

    obeyRobotsNoIndex [yes|NO]

    When enabled, Swish-e will not index any HTML file that contains:

     
        <meta name="robots" content="noindex">

    The default is to ignore these meta tags and index the document. This tag is described at http://www.robotstxt.org/wc/exclusion.html.

    Note: This feature is only available with the libxml2 HTML parser.

    Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following comments in your documents to prevent indexing:

     
           <!-- SwishCommand noindex -->
           <!-- SwishCommand index -->

    These shorter forms may also be used:

     
           <!-- noindex -->
           <!-- index -->

    These comments are helpful, for example, for keeping common headers, footers, and menus out of the index.
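
    Putting this together, a hedged sketch of a configuration that honors the robots meta tag might look like the following. It assumes .htm and .html files should be handled by the libxml2 HTML2 parser; adjust the suffixes to your site:

     
        # Use the libxml2 HTML parser, which obeyRobotsNoIndex requires
        IndexContents HTML2 .htm .html
        # Skip any HTML file carrying <meta name="robots" content="noindex">
        obeyRobotsNoIndex yes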

    NOTE: The following items are currently not available. They require Swish-e to parse the configuration file while searching.

    EnableAltSearchSyntax [yes|NO]

    NOTE: The following item is currently not available.

    Enables an alternate search syntax, allowing a basic search syntax like that of "Altavista(c)", "Lycos(c)", etc. This means a search query can contain "+" and "-" as syntax parameters.

    Example:

     
        swish-e -w "+word1 +word2 -word3  word4 word5"
        "+"  = following word has to be in all found documents
        "-"  = following word may not be in any document found
        " "  = following word will be searched in documents

    SwishSearchOperators <and-word> <or-word> <not-word>

    NOTE: The following item is currently not available.

    Using this config directive you can change the boolean search operators of Swish-e, e.g. to adapt these to your language. The default is: AND OR NOT

    Example (German):

     
        SwishSearchOperators   UND  ODER  NICHT

    SwishSearchDefaultRule [<AND-WORD>|<or-word>]

    NOTE: The following item is currently not available.

    SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.

    The word you specify must match one of the available SwishSearchOperators.

    Example:

     
        SwishSearchOperators   UND  ODER  NICHT
        # Make it act like a web search engine
        SwishSearchDefaultRule ODER

    ResultExtFormatName name -x format string

    NOTE: The following item is currently not available.

    The output of Swish-e can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.

    Examples:

     
        ResultExtFormatName  moreinfo   "%c|%r|%t|%p|<author>|<publishyear>\n"

    Then when searching you can specify the format string's name:

     
        swish-e   ...  -x moreinfo  ...

    See the -x switch in SWISH-RUN for more information about output formats.

    [ TOC ]


    Administrative Headers Directives

    Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:

    IndexName *text*

    IndexDescription *text*

    IndexPointer *text*

    IndexAdmin *text*

    These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.

    Examples:

     
        IndexName "Linux Documentation"
        IndexDescription "This is an index of /usr/doc on our Linux machine." 
        IndexPointer http://localhost/swish/linux/index.html
        IndexAdmin webmaster

    [ TOC ]


    Document Source Directives

    These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

    IndexDir [directories or files|URL|external program]

    IndexDir defines the source of the documents for Swish-e. Swish-e currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.

    The -S command line argument is used to select the file access method.

     
        swish-e -c swish.config -S fs    - file system
        swish-e -c swish.config -S http  - internal http spider
        swish-e -c swish.config -S prog  - external program of any type

    For the fs method of access IndexDir is a space-separated list of files and directories to index. Use a forward slash as the path separator in MS Windows.

    For the http method the IndexDir setting is a list of space-separated URLs.

    For the prog method the IndexDir setting is a list of space-separated programs to run (which generate documents for swish to index).

    You may specify more than one IndexDir directive.

    Any sub-directories of any listed directory will also be indexed.

    Note: While processing directories, Swish-e will ignore any files or directories that begin with a dot ("."). You may index files or directories that begin with a dot by specifying their name with IndexDir or -i.

    Examples:

     
        # Index this directory and any subdirectories
        IndexDir /usr/local/home/http

     
        # Index the docs directory in current directory
        IndexDir ./docs

     
        # Index these files in the current directory
        IndexDir ./index.html ./page1.html ./page2.html
        # and index this directory, too
        IndexDir ../public_html
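
    As noted above, files and directories that begin with a dot are skipped unless they are named explicitly. A small sketch, using a made-up ./docs/.drafts directory that is not part of any real layout:

     
        # Hypothetical layout: also index a dot directory that would otherwise be skipped
        IndexDir ./docs ./docs/.drafts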

    For the HTTP method of access, specify the URLs from which you want the spidering to begin.

    Example:

     
        IndexDir http://www.my-site.com/index.html
        IndexDir http://localhost/index.html

    Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.

    For the prog method of access IndexDir specifies the path to the program(s) to execute. The external program must correctly format the documents being passed back to Swish-e. Examples of external programs are provided in the prog-bin directory.

     
        IndexDir ./myprogram.pl

    See prog for details.

    Note: Not all directives work with all methods.

    NoContents *list of file suffixes*

    Files with these suffixes will not have their contents indexed, but will have their path name (file name) indexed instead.

    If the file's type is HTML or HTML2 (as set by IndexContents or DefaultContents) then the file will be parsed for an HTML title and that title will be indexed. Note that you must set the file's type with IndexContents or DefaultContents: .html and .htm are NOT type HTML by default. For example:

     
       IndexContents HTML* .htm .html

    If a title is found, it will still be checked for FileRules title, and the file will be skipped if a match is found. See FileRules.

    If the file's type is not HTML, or it is HTML and no title is found, then the file's path will be indexed.

    For example, this will allow searching by image file name.

     
        NoContents .gif .xbm .au
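
    As a further sketch, the two behaviors can be combined. The suffix lists here are only illustrative: the HTML files are assigned the HTML type so that their titles are indexed, while the image and audio files are searchable by path name only.

     
        # Give .htm and .html files the HTML type (they are not HTML by default)
        IndexContents HTML* .htm .html
        # Index titles of the HTML files and path names of the rest; never their contents
        NoContents .htm .html .gif .xbm .au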
    
    
