The Swish-e FAQ - Answers to Common Questions

Swish-e version 2.4.7

OVERVIEW

A list of commonly asked questions and their answers. Please review this document before asking questions on the Swish-e discussion list.

General Questions

What is Swish-e?

Swish-e is Simple Web Indexing System for Humans - Enhanced. With it, you can quickly and easily index directories of files or remote web sites and search the generated indexes for words and phrases.

So, is Swish-e a search engine?

Well, yes. Probably the most common use of Swish-e is to provide a search engine for web sites. The Swish-e distribution includes CGI scripts that can be used with it to add a search engine for your web site. The CGI scripts can be found in the example directory of the distribution package. See the README file for information about the scripts.

But Swish-e can also be used to index all sorts of data, such as email messages, data stored in a relational database management system, XML documents, or documents such as Word and PDF documents -- or any combination of those sources at the same time. Searches can be limited to fields or MetaNames within a document, or limited to areas within an HTML document (e.g. body, title). Programs other than CGI applications can use Swish-e, as well.

Should I upgrade if I'm already running a previous version of Swish-e?

A large number of bug fixes, feature additions, and logic corrections were made in version 2.2. In addition, indexing speed has been drastically improved (there are reports of indexing times dropping from four hours to five minutes), and major parts of the indexing and search parsers have been rewritten. There are better debugging options, enhanced output formats, more document metadata (e.g. last modified date, document summary), options for indexing from external data sources, and faster spidering, to name just a few changes. (See the CHANGES file for more information.)

Since so much effort has gone into version 2.2, support for previous versions will probably be limited.

Are there binary distributions available for Swish-e on platform foo?

Foo? Well, yes there are some binary distributions available. Please see the Swish-e web site for a list at http://swish-e.org/.

In general, it is recommended that you build Swish-e from source, if possible.

Do I need to reindex my site each time I upgrade to a new Swish-e version?

At times it might not strictly be necessary, but since you don't really know if anything in the index has changed, it is a good rule to reindex.

What's the advantage of using the libxml2 library for parsing HTML?

Swish-e may be linked with libxml2, a library for working with HTML and XML documents, and can then use it to parse both kinds of document.

The libxml2 parser is a better parser than Swish-e's built-in HTML parser. It offers more features, and it does a much better job of extracting the text from a web page. In addition, you can use the ParserWarningLevel configuration setting to find structural errors in your documents that could (and would with Swish-e's HTML parser) cause documents to be indexed incorrectly.

Libxml2 is not required, but is strongly recommended for parsing HTML documents. It's also recommended for parsing XML, as it offers many more features than the internal Expat xml.c parser.

The internal HTML parser will have limited support, and does have a number of bugs. For example, HTML entities may not always be correctly converted and properties do not have entities converted. The internal parser tends to get confused when invalid HTML is parsed where the libxml2 parser doesn't get confused as often. The structure is better detected with the libxml2 parser.

If you are using the Perl module (the Perl interface to the Swish-e C library) you may wish to build two versions of Swish-e, one with the libxml2 library linked into the binary and one without, and build the Perl module against the library without the libxml2 code. This saves space in the library. Hopefully, the library will someday soon be split into indexing and searching code (volunteers welcome).

Does Swish-e include a CGI interface?

Yes. Kind of.

There are two example CGI scripts included, swish.cgi and search.cgi. Both are installed at $prefix/lib/swish-e.

Both require a bit of work to set up and use. Swish.cgi is probably what most people will want to use, as it contains more features. Search.cgi is for those who want to start with a small script and customize it to fit their needs.

An example of using swish.cgi is given in the INSTALL man page and in the swish.cgi documentation. As is often the case, it will be easier to use if you first read the documentation.

Please use caution about CGI scripts found on the Internet for use with Swish-e. Some are not secure.

The included example CGI scripts were designed with security in mind. Regardless, you are encouraged to have your local Perl expert review them (and all other CGI scripts you use) before placing them into production. This is just a good policy to follow.

How secure is Swish-e?

We know of no security issues with using Swish-e. Careful attention has been paid to common security problems, such as buffer overruns, when programming Swish-e.

The most likely security issue with Swish-e is when it is run via a poorly written CGI interface. This is not limited to CGI scripts written in Perl, as it's just as easy to write an insecure CGI script in C, Java, PHP, or Python. A good source of information is included with the Perl distribution. Type perldoc perlsec at your local prompt for more information. Another must-read document is located at http://www.w3.org/Security/faq/wwwsf4.html.

Note that there are many free yet insecure and poorly written CGI scripts available -- even some designed for use with Swish-e. Please carefully review any CGI script you use. Free is not such a good price when you get your server hacked...

Should I run Swish-e as the superuser (root)?

No. Never.

What files does Swish-e write?

Swish writes the index file, of course. This is specified with the IndexFile configuration directive or by the -f command line switch.

The index file is actually a collection of files, but all start with the file name specified with the IndexFile directive or the -f command line switch.

For example, the file ending in .prop contains the document properties.

When creating the index files Swish-e appends the extension .temp to the index file names. When indexing is complete Swish-e renames the .temp files to the index files specified by IndexFile or -f. This is done so that existing indexes remain untouched until it completes indexing.

Swish-e also writes temporary files in some cases during indexing (e.g. -S http, -S prog with filters), when merging, and when using -e. Temporary files are created with the mkstemp(3) function (with 0600 permission on unix-like operating systems).

The temporary files are created in the directory specified by the environment variables TMPDIR and TMP, in that order. If neither is set, Swish-e uses the TmpDir configuration setting. Otherwise, the temporary files are located in the current directory.
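The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the order, not Swish-e's own code: the function name resolve_tmp_dir and its parameters are hypothetical, and treating an empty environment variable as unset is an assumption.

```python
import os


def resolve_tmp_dir(environ, tmpdir_setting=None):
    """Return the directory for temporary files, in the order described above."""
    # 1. Environment variables: TMPDIR first, then TMP.
    #    Assumption: an empty value counts as "not set".
    for name in ("TMPDIR", "TMP"):
        value = environ.get(name)
        if value:
            return value
    # 2. The TmpDir configuration setting, if one was given.
    if tmpdir_setting:
        return tmpdir_setting
    # 3. Otherwise fall back to the current directory.
    return os.curdir


if __name__ == "__main__":
    # "/var/tmp/swish" is a made-up TmpDir value for the example.
    print(resolve_tmp_dir(os.environ, tmpdir_setting="/var/tmp/swish"))
```

Passing the environment in as a parameter keeps the function easy to check against each of the three cases.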

Can I index PDF and MS-Word documents?

Yes, you can use a Filter to convert documents while indexing, or you can use a program that "feeds" documents to Swish-e that have already been converted. See Indexing below.

Can I index documents on a web server?

Yes, Swish-e provides two ways to index (spider) documents on a web server. See Spidering below.

Swish-e can retrieve documents from a file system or from a remote web server. It can also execute a program that returns documents back to it. This program can retrieve documents from a database, filter compressed document files, convert PDF files, extract data from mail archives, or spider remote web sites.

Can I implement keywords in my documents?

Yes, Swish-e can associate words with MetaNames while indexing, and you can limit your searches to these MetaNames while searching.

In your HTML files you can put keywords in HTML META tags or in XML blocks.

META tags can have two formats in your source documents:

    <META NAME="DC.subject" CONTENT="digital libraries">

And in XML format (can also be used in HTML documents when using libxml2):

    <meta2>
        Some Content
    </meta2>

Then, to inform Swish-e about the existence of the meta name in your documents, edit the line in your configuration file:

    MetaNames DC.subject meta1 meta2

When searching you can now limit some or all search terms to that MetaName. For example, the query apple and DC.subject=(fruit or cooking) looks for documents that contain the word apple and also have either fruit or cooking in the DC.subject meta tag.

What are document properties?

A document property is typically data that describes the document. For example, properties might include a document's path name, its last modified date, its title, or its size. Swish-e stores a document's properties in the index file, and they can be reported back in search results.

Swish-e also uses properties for sorting. You may sort your results by one or more properties, in ascending or descending order.

Properties can also be defined within your documents. HTML and XML files can specify tags (see previous question) as properties. The contents of these tags can then be returned with search results. These user-defined properties can also be used for sorting search results.

For example, if you had the following in your documents

   <meta name="creator" content="accounting department">

and creator is defined as a property (see PropertyNames in SWISH-CONFIG) Swish-e can return accounting department with the result for that document.

    swish-e -w foo -p creator

Or for sorting:

    swish-e -w foo -s creator

What's the difference between MetaNames and PropertyNames?

MetaNames allow keyword searches in your documents. That is, you can use MetaNames to restrict searches to just parts of your documents.

PropertyNames, on the other hand, define text that can be returned with results, and can be used for sorting.

Both use meta tags found in your documents (as shown in the above two questions) to define the text you wish to use as a property or meta name.

You may define a tag as both a property and a meta name. For example:

   <meta name="creator" content="accounting department">

placed in your documents and then using configuration settings of:

    PropertyNames creator
    MetaNames creator

will allow you to limit your searches to documents created by accounting:

    swish-e -w 'foo and creator=(accounting)'

That will find all documents with the word foo that also have a creator meta tag that contains the word accounting. This is using MetaNames.

And you can also say:

    swish-e -w foo -p creator

which will return all documents with the word foo, but the results will also include the contents of the creator meta tag along with results. This is using properties.

You can use properties and meta names at the same time, too:

    swish-e -w 'creator=(accounting or marketing)' -p creator -s creator

That searches only in the creator meta name for either of the words accounting or marketing, prints out the contents of the creator property, and sorts the results by the creator property.

(See also the -x output format switch in SWISH-RUN.)

Can Swish-e index multi-byte characters?

No. This will require much work to change. But Swish-e works with eight-bit characters, so many character sets can be used. Note that it does call the ANSI C tolower() function, which depends on the current locale setting. See locale(7) for more information.

Indexing

How do I pass Swish-e a list of files to index?

Currently, there is not a configuration directive to include a file that contains a list of files to index. But, there is a directive to include another configuration file.

    IncludeConfigFile /path/to/other/config

And in /path/to/other/config you can say:

    IndexDir file1 file2 file3 file4 file5 ...
    IndexDir file20 file21 file22

You may also specify more than one configuration file on the command line:

    ./swish-e -c config_one config_two config_three

Another option is to create a directory with symbolic links of the files to index, and index just that directory.
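If the list of files lives in a separate text file, one way to use the IncludeConfigFile approach above is to generate the included configuration file from that list. The following is a minimal sketch, assuming one path per line and no whitespace inside paths; the function name index_dir_lines and the per_line parameter are hypothetical.

```python
def index_dir_lines(paths, per_line=5):
    """Turn a list of paths into IndexDir lines, per_line paths on each."""
    # Drop blank entries and surrounding whitespace (e.g. trailing newlines).
    cleaned = [p.strip() for p in paths if p.strip()]
    for path in cleaned:
        # Assumption: paths contain no whitespace, because IndexDir
        # separates its arguments with spaces. Quoting is not handled here.
        if any(ch.isspace() for ch in path):
            raise ValueError("path contains whitespace: %r" % path)
    return [
        "IndexDir " + " ".join(cleaned[i:i + per_line])
        for i in range(0, len(cleaned), per_line)
    ]


if __name__ == "__main__":
    listing = ["file1", "file2", "file3", "file4", "file5", "file20"]
    # Write the printed lines to the file named by IncludeConfigFile.
    print("\n".join(index_dir_lines(listing)))
```

Splitting the list across several IndexDir lines mirrors the example above and keeps each line of the generated file short.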

How does Swish-e know which parser to use?

Swish can parse HTML, XML, and text documents. The parser is set by associating a file extension with a parser by the IndexContents directive. You may set the default parser with the DefaultContents directive. If a document is not assigned a parser it will default to the HTML parser (HTML2 if built with libxml2).

You may use Filters or an external program to convert documents to HTML, XML, or text.
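The selection order described above (an extension association first, then the default parser, then HTML) can be sketched as follows. This is an illustration only: choose_parser and its parameters are hypothetical names, the dictionary stands in for IndexContents settings, and exact, case-sensitive extension matching is an assumption.

```python
import os


def choose_parser(path, index_contents, default_contents=None, has_libxml2=False):
    """Pick a parser name for path from extension rules, then defaults."""
    extension = os.path.splitext(path)[1]
    # 1. An explicit extension-to-parser association wins.
    if extension in index_contents:
        return index_contents[extension]
    # 2. Otherwise use the configured default parser, if any.
    if default_contents:
        return default_contents
    # 3. With nothing configured, fall back to the HTML parser
    #    (HTML2 when built with libxml2).
    return "HTML2" if has_libxml2 else "HTML"


if __name__ == "__main__":
    # Illustrative associations; a real setup would mirror its own config file.
    rules = {".xml": "XML", ".txt": "TXT"}
    for name in ("feed.xml", "notes.txt", "page.html"):
        print(name, choose_parser(name, rules, has_libxml2=True))
```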

Can I reindex and search at the same time?

Yes. Starting with version 2.2 Swish-e indexes to temporary files, and then renames the files when indexing is complete. On most systems renames are atomic. But, since Swish-e also generates more than one file during indexing there will be a very short period of time between renaming the various files when the index is out of sync.

Settings in src/config.h control some options related to temporary files, and their use during indexing.
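The write-then-rename pattern described above, and the short window between renames, can be sketched in Python. This is not Swish-e's implementation: write_then_rename is a hypothetical helper, and the file contents are placeholders.

```python
import os
import tempfile


def write_then_rename(final_paths, contents):
    """Write each file under a .temp name, then rename the whole set."""
    temp_paths = [path + ".temp" for path in final_paths]
    # Build every new file first; the existing index files are not touched yet.
    for temp_path, data in zip(temp_paths, contents):
        with open(temp_path, "wb") as handle:
            handle.write(data)
    # Rename one file at a time. Each rename replaces its target in one step,
    # but a reader arriving between two renames sees a mixed set of files.
    for temp_path, final_path in zip(temp_paths, final_paths):
        os.replace(temp_path, final_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        index = os.path.join(workdir, "index.swish-e")
        paths = [index, index + ".prop"]
        write_then_rename(paths, [b"new index", b"new properties"])
        print(sorted(os.listdir(workdir)))
```

Doing all the writes before the first rename keeps the out-of-sync window as short as the renames themselves.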

Can I index phrases?

Phrases are indexed automatically. To search for a phrase simply place double quotes around the phrase.

For example:

    swish-e -w 'free and "fast search engine"'

How can I prevent phrases from matching across sentences?

Use the BumpPositionCounterCharacters configuration directive.

Swish-e isn't indexing a certain word or phrase.

There are a number of configuration parameters that control what Swish-e considers a "word" and it has a debugging feature to help pinpoint any indexing problems.

Configuration file directives (SWISH-CONFIG) WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar, and IgnoreLastChar are the main settings that Swish-e uses to define a "word". See SWISH-CONFIG and SWISH-RUN for details.

Swish-e also uses compile-time defaults for many settings. These are located in the src/config.h file.

The command line arguments -k, -v and -T are useful when debugging these problems. Using -T INDEXED_WORDS while indexing will display each word as it is indexed. You should specify one file when using this feature since it can generate a lot of output.

     ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

You may also wish to index a single file that contains words that are or are not indexing as you expect and use -T to output debugging information about the index. A useful command might be:

    ./swish-e -f index.swish-e -T INDEX_FULL

Once you see how Swish-e is parsing and indexing your words, you can adjust the configuration settings mentioned above to control what words are indexed.

Another useful command might be:

     ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS

This will show white-spaced words parsed from the document (PARSED_WORDS), and how those words are split up into separate words for indexing (INDEXED_WORDS).

How do I keep Swish-e from indexing numbers?

Swish-e indexes words as defined by the WordCharacters setting, as described above. So to avoid indexing numbers you simply remove digits from the WordCharacters setting.
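To see why removing digits from WordCharacters keeps numbers out of the index, here is a rough Python model of word splitting. It is a simplification under stated assumptions, not Swish-e's algorithm: split_words is a hypothetical function, text is lowercased, any character outside the allowed set acts as a separator, and BeginCharacters, EndCharacters, IgnoreFirstChar, IgnoreLastChar, and the compile-time settings are ignored.

```python
import string


def split_words(text, word_characters):
    """Split text on whitespace, then on any character not in word_characters."""
    allowed = set(word_characters)
    words = []
    for chunk in text.lower().split():
        current = []
        for ch in chunk:
            if ch in allowed:
                current.append(ch)
            elif current:
                # A character outside the set ends the current word.
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))
    return words


if __name__ == "__main__":
    with_digits = string.ascii_lowercase + string.digits
    without_digits = string.ascii_lowercase
    sample = "Room 101 holds abc123def"
    print(split_words(sample, with_digits))
    print(split_words(sample, without_digits))
```

In this model, dropping the digits removes 101 entirely and also breaks abc123def into abc and def, so it is worth checking real output with -T INDEXED_WORDS before changing the setting.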

There are also some settings in src/config.h that control what "words" are indexed. You can configure swish to never index words that are all digits, vowels, or consonants, or that contain more than some consecutive number of digits, vowels, or consonants. In general, you won't need to change these settings.

Also, there's an experimental feature called IgnoreNumberChars which allows you to define a set of characters that describe a number. If a word is made up of only those characters it will not be indexed.

Swish-e crashes and burns on a certain file. What can I do?

This shouldn't happen. If it does please post to the Swish-e discussion list the details so it can be reproduced by the developers.

In the meantime, you can use a FileRules directive to exclude the particular file by its file name, pathname, or title. If there are serious problems in indexing certain types of files, they may not have valid text in them (they may be binary files, for instance). You can use NoContents to exclude that type of file.

Swish-e will issue a warning if an embedded nu