CHANGES - List of revisions
Swish-e version 2.4.7
Table of Contents
- OVERVIEW
- Version 2.4.7 - 4 April 2009
- Version 2.4.6 - 10 March 2008
- Version 2.4.5 - 22 Jan 2007
- Version 2.4.4 - 11 Oct 2006
- Version 2.4.3 December 9, 2004
- Version 2.4.3-pr1 - Wed Dec 1 09:52:50 PST 2004
- Version 2.4.2 - March 09, 2004
- Version 2.4.1 - December 17, 2003
- Version 2.4.0 - October 27, 2003
- Version 2.4.0 (Release Candidate 4) September 26, 2003
- Version 2.4.0 (Release Candidate 3) September 11, 2003
- Version 2.4.0 (Release Candidate 2) September 10, 2003
- Version 2.4.0 (Release Candidate 1) May 21, 2003
- Version 2.2.3 - December 11, 2002
- Version 2.2.2 - November 14, 2002
- Version 2.2.1 - September 26, 2002
- Version 2.2 - September 18, 2002
- Version 2.2rc1 - August 29, 2002
Version 2.4.7 - 4 April 2009
- Added ReturnRawRank for raw rank score
Setting ReturnRawRank to a true value will return the rank score unscaled. Can be set with the -a command line option (mnemonic: "a"bsolute rank score).
- Yanked setenv feature introduced in 2.4.6
The ranking debugging feature using setenv introduced in 2.4.6 was yanked. Some platforms (notably HP-UX and Windows) lack the setenv feature, and the convenience of setting the env var was not worth the limitations.
Version 2.4.6 - 10 March 2008
- MinWordLength respected in query parser
Clark Vent reported that the query parser was not respecting MinWordLength settings. See http://dev.swish-e.org/changeset/2145
- Patch to file.c.
The file.c patch was in response to http://swish-e.org/archive/2007-03/11321.html although that user never responded about that patch.
- SWISH_DEBUG_RANK env var now enables rank debugging
Set SWISH_DEBUG_RANK to a true value to enable lots of rank debugging on stderr.
- Perl Makefile.PL patched to fix MakeMaker issue
Recent versions of ExtUtils::MakeMaker revealed a bug in Makefile.PL. Patch from mschwern via RT, report by mpeters.
- LARGEFILE support detected automatically in configure
jrobinson852@yahoo.com suggested that LARGEFILE support be auto-detected since it is needed so often on Linux systems.
- New Snowball stemmers
Trygve Falch contributed patches to update the Snowball stemmers, including new Hungarian and Romanian stemmers.
- Patched leaks
Antony Dovgal patched two leaks. First, when a file failed to open, the file name was not freed.
Second, SwishSetSearchLimit() was nulling the search limits when an error was found in the parameters, but not freeing the existing limits.
- Leak in SwishResetSearchLimit
Fixed a leak if a limit was set and then reset but not prepared. Patch provided by Antony Dovgal.
- New API functions added
Added SwishGetStructure() and SwishGetPhraseDelimiter() functions which return relevant properties of the search object. Patch provided by Antony Dovgal.
Version 2.4.5 - 22 Jan 2007
- Fixed 'deflate' handling in spider.pl
spider.pl was using the wrong method to uncompress HTTP responses that were 'deflate' encoded. Also decodes content based on the document's charset and encodes back to that charset before outputting.
- re-indexing required
The magic numbers in src/swish.h were changed to require re-indexing from version 2.4.4 indexes. This should have been done in 2.4.4 as well, and anytime the index format changes. -- karman
- fixed stemmer bug introduced in 2.4.4
stemmer.c had a mix up in the deprecated stemmer assignments for "Stemmer_en" and "Stem". Also fixed stemmer.h so that 2.4.3 indexes can be read correctly. -- karman
- Now fork/exec to run filters
FileFilter* was using popen to run the filter, which could pass user data through the shell. Now uses fork/exec if fork is available, which should be everywhere except Windows. On Windows popen is used but all parameters are double-quoted. -- moseley
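The difference between the two approaches can be sketched in Python (Swish-e itself is C; this is only an illustration, not the project's code). Passing an argument list to `subprocess.run` starts the program directly, much as fork/exec does, so no shell ever interprets the file name. The `run_filter` helper and the hostile file name are hypothetical.

```python
import subprocess
import sys

def run_filter(program, args):
    """Run a filter program with an argument vector; no shell is involved."""
    # A list of arguments (with the default shell=False) reaches the program
    # verbatim, so shell metacharacters in a file name stay inert.
    result = subprocess.run(
        [program, *args], capture_output=True, text=True, check=True
    )
    return result.stdout

# Hypothetical file name containing shell metacharacters.
hostile_name = "report; echo INJECTED"

# Stand-in "filter": a Python one-liner that echoes its first argument.
output = run_filter(
    sys.executable, ["-c", "import sys; print(sys.argv[1])", hostile_name]
)
print(output.strip())  # the name comes back unchanged; nothing was executed
```

Had the same name been interpolated into a single command string and handed to a shell, the `; echo INJECTED` part would have run as a second command.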
- fixed signed/unsigned warnings from gcc 4.x
Cleaned up search.c to catch mismatched signedness warnings from newer GCC versions. This issue pre-existed 2.4.4 but the new wildcard features in search.c made for a lot more warnings. -- karman
- Makefile.mingw included in distrib
Modified root Makefile to include the perl/Makefile.mingw file. -- karman
Version 2.4.4 - 11 Oct 2006
- Version 2.4.4 RC1
Release Candidate 1 for 2.4.4, 2 Oct 2006.
- quote fix for FileFilter config param
Ludovic Drolez contributed a patch to fix a quoting issue with filenames. This affects non-Windows builds only.
- SWISH::Filter now on CPAN
SWISH::Filter is now available on http://cpan.org/. The version in the distribution is not kept in sync with the CPAN version. Install the CPAN version if you want the latest and greatest version.
- SWISH::API updated to 0.04
Added several fixes, including:
- added proximity feature and single character wildcard with '?' instead of '*'
Herman Knoops contributed these patches. See http://swish-e.org/archive/2006-05/10543.html
Error messages were also changed to better reflect correct use of wildcards.
- fixed bug when using DoubleMetaphone
Fixed problem reported by Andreas Völter where a query that generated a two-word query with DoubleMetaphone fuzzy mode was not working.
- fix sparc64 property issue
Sorithy Seng (pourlassi@gmail.com) submitted a patch against docprop.c to fix an issue on sparc64 platforms. It is unknown whether this bug affected other 64-bit architectures.
- fixed bug when StopWords resulted in no unique words
Added check in db_native.c to check that some words exist before writing index.
- updates to SWISH-RUN.1
Added doc for -u and -r options.
- filename only in SWISH::Filters
Added a fix to SWISH::Filters::pp2html and SWISH::Filters::XLtoHTML to save only the file name as the title, without the full path.
- Removed Stem and Stemmer_en
The legacy Porter stemmer was removed. This had been deprecated some time ago. A warning is issued if the old stemmer is specified in the config file, and Stemmer_en1 will be used instead.
- GPL'd all the source files with the new Swish-e License
After a source code review, the developers decided to put Swish-e under the GPL with a special exception for linking against libswish-e. See http://swish-e.org/license.html for the details.
- Fixed Segfault with updating incremental index
Dobrica Pavlinusic reported a segfault after updating an index multiple times. José provided updated worddata.c. - April 27, 2005
- Fixed NOT check with incremental indexes
Swish was returning results for deleted files when the NOT operator was used.
- Fixed bug when using old parsers with zero length input
Thomas Angst reported swish consuming memory when using -S prog to process large number of empty documents.
When -S prog generated a zero length file the old parsers (e.g. TXT) would attempt to read in *all* content from the -S prog program into a buffer. The old parser incorrectly assumed it was reading from a filter and tried to read to eof().
- Changes to ParserWarnLevel
The default value for ParserWarnLevel was changed from zero to two.
The ParserWarnLevel controls the error handling of the libxml2 parser. The higher the setting, the more verbose the output. The change to the default is to report when libxml2 has problems parsing a document (which often times results in processing only part of a document).
To get the old behavior, either set ParserWarnLevel to zero in your config file, or use the new -W command line option to set the ParserWarnLevel at run time. If ParserWarnLevel is set in the config file, it will override the -W option.
Also, to see UTF-8 to 8859-1 conversion errors set ParserWarnLevel to 3 or more. Previously, these warnings were issued at a ParserWarnLevel of one.
- Documentation changes
Removed all the target documentation (html, pdf, ps) from cvs. There's now a separate cvs module "swish_website" that is used to generate both the website and the html docs. If building swish-e from cvs please see the README.cvs file for instructions.
- Fixed bug in pre-sorted indexes with USE_BTREE
Gunnar Mätzler reported a problem with reading the pre-sorted property index tables when running with USE_BTREE (--enable-incremental). Not all entries were being written to disk. There was/is a question if the "array" code used for pre-sorted indexes with USE_BTREE would be slower. So, added a separate define USE_PRESORT_ARRAY to enable that code when USE_BTREE is set. This allows using the old integer arrays with USE_BTREE. Gunnar reported that this is working, but more testing is needed. Need to compare speed of the array code vs. the non-array code, and to verify the workings of USE_PRESORT_ARRAY code.
- Add strcoll() usage for sorting properties
Andreas Seltenreich provided a patch to use strcoll when sorting properties. strcoll is locale dependent.
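What "locale dependent" means in practice can be shown with Python's wrapper around the same C library call, `locale.strcoll`. This is a sketch for illustration only; the `sort_by_locale` helper and the sample values are assumptions, not part of Swish-e.

```python
import functools
import locale

def sort_by_locale(values):
    """Sort strings using the collation rules of the current locale."""
    return sorted(values, key=functools.cmp_to_key(locale.strcoll))

# In the "C" locale strcoll compares raw byte values, so upper case sorts
# before lower case. Other locales may interleave the two differently.
locale.setlocale(locale.LC_COLLATE, "C")
print(sort_by_locale(["banana", "Cherry", "apple"]))
# -> ['Cherry', 'apple', 'banana']
```

The same list can come back in a different order under another LC_COLLATE setting, which is why sorted property output may vary between systems.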
- Fix incremental indexing when adding back a file
Jose fixed a problem with incremental indexing where a file could not be added back to the index once removed.
Patch initially provided by Dobrica Pavlinusic:
http://swish-e.org/Discussion/archive/2004-12/8694.html
- Documentation correction
A change in the default way the index is compressed was not documented in 2.4.3. The change resulted in larger indexes. See CompressPositions below and in SWISH-CONFIG.
- libxml2 UTF-8 conversion failures
Fixed issue where a UTF-8 to Latin1 encoding failure would skip more input than just the failed character. Libxml2 passes swish text that is not null terminated, but the libxml2 functions to skip UTF-8 chars expected a null-terminated string. Replaced the libxml2 call with a fixed version.
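The idea behind such a fix can be sketched as follows: measure the UTF-8 sequence against an explicit end index instead of relying on a terminator. This is not the Swish-e or libxml2 code; `utf8_char_length` and its parameters are hypothetical names used only for this example.

```python
def utf8_char_length(buf, pos, end):
    """Byte length of the UTF-8 sequence at buf[pos], never reading at or past `end`."""
    lead = buf[pos]
    if lead < 0x80:
        expected = 1
    elif lead >> 5 == 0b110:
        expected = 2
    elif lead >> 4 == 0b1110:
        expected = 3
    elif lead >> 3 == 0b11110:
        expected = 4
    else:
        return 1  # stray continuation or invalid lead byte: skip a single byte
    # Count continuation bytes, but stop at the end of the supplied span.
    length = 1
    while (length < expected and pos + length < end
           and buf[pos + length] >> 6 == 0b10):
        length += 1
    return length

buf = "a\u20acb".encode("utf-8")         # b'a', three bytes for the euro sign, b'b'
print(utf8_char_length(buf, 1, len(buf)))  # -> 3 (whole character available)
print(utf8_char_length(buf, 1, 3))         # -> 2 (span ends mid-character)
```

Because the scan is clamped to `end`, a failed character at the edge of the span cannot cause bytes beyond it to be skipped.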
Version 2.4.3 December 9, 2004
- New config directive: CompressPositions
This option enables zlib compression for word data in the index. Previously word data was always compressed but resulted in slower wildcard searches. The default now is to not compress the word data, but results in larger index files. Set to "YES" to get pre-2.4.3 index sizes.
[This CHANGES entry was added after 2.4.3 was released]
- Improved error messages when using incremental indexing
There was a bit of confusion on how to use incremental indexing (still experimental) so added better logic for error messages.
Also fixed a logic error when setting the incremental update mode. Caught by Paul Loner.
Version 2.4.3-pr1 - Wed Dec 1 09:52:50 PST 2004
- "Fixed" libxml2's change in UTF8Toisolat1() return value
Bernhard Weisshuhn supplied a patch to parser.c for checking the return value of UTF8Toisolat1(). Seems that libxml2 now returns the number of characters converted instead of zero for success.
http://bugzilla.gnome.org/show_bug.cgi?id=153937
- Added swish-config and pkg-config
Swish now provides a swish-config script and config file for the pkg-config utility. These tools help when building programs that link with the swish-e library.
The SWISH::API Makefile.PL program uses swish-config to locate the installation directory of swish-e. This should make building SWISH::API easier when swish-e is installed in a non-standard location.
- Fixed rank bias in merge
Peter van Dijk noticed that MetaNamesRank settings were not being copied to the output index when merging.
- Added SwishFuzzy function
SwishFuzzy function (SWISH::API::Fuzzy) lets you stem a word without first searching. This might be helpful for playing with queries prior to the search.
- Fixed translate character table
Michael Levy found an error in the table used to translate 8859-1 to ascii7. Luckily, it was an upper case translation and the table is only used on lower case characters.
- MetaNamesRank documentation
Changed the 'not yet implemented' caveat to 'implemented but experimental'.
- Added Continuation option to config processing
You can now use continuation lines in the config file:
    IgnoreWords \
        the \
        am \
        is \
        are \
        was

There may not be any characters following the backslash.
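One way a config reader could join continuation lines is sketched below. It is an illustration under assumptions, not Swish-e's parser; `join_continuation_lines` is a hypothetical helper. Note that a line counts as continued only when the backslash is the very last character.

```python
def join_continuation_lines(lines):
    """Join each line ending in a backslash with the line(s) that follow it."""
    logical = []
    pending = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            # Drop the backslash and keep accumulating.
            pending += line[:-1]
        else:
            logical.append(pending + line)
            pending = ""
    if pending:
        logical.append(pending)  # file ended on a continued line
    return logical

config = [
    "IgnoreWords \\\n",
    "    the \\\n",
    "    am \\\n",
    "    is\n",
    "MinWordLength 2\n",
]
for entry in join_continuation_lines(config):
    print(entry.split())
# -> ['IgnoreWords', 'the', 'am', 'is']
# -> ['MinWordLength', '2']
```

A backslash followed by a trailing space is not treated as a continuation in this sketch, which mirrors the restriction stated above.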
- Fixed Buzzwords (and other word lists entered in the config)
Words entered in config were not converted to lower case before storing in the index.
- Fixed metaname mapping problem in Merge
Peter Karman found an error when merging indexes where the source indexes had the same metanames, but listed in a different order in their config files. Words would then be indexed under the wrong metaID number in the output index.
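The kind of remapping this requires can be sketched as a lookup by name rather than by number. This is only an illustration; `build_id_map` and the sample metanames are hypothetical and do not reflect the actual merge code.

```python
def build_id_map(source_metanames, output_metanames):
    """Map each source metaID to the output metaID that carries the same name."""
    output_ids = {name: meta_id for meta_id, name in output_metanames.items()}
    return {src_id: output_ids[name] for src_id, name in source_metanames.items()}

# Two source indexes with the same metanames listed in a different order.
index_a = {1: "title", 2: "author"}
index_b = {1: "author", 2: "title"}
output = dict(index_a)  # assume the output index adopts index_a's numbering

map_b = build_id_map(index_b, output)
print(map_b)  # -> {1: 2, 2: 1}
```

Copying index_b's IDs unchanged would file its "author" words under "title"; translating through the name avoids that.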
- SWISH::Filters and spider.pl updates
The web spider spider.pl was updated to work better with SWISH::Filter by default and also make it easier to use the spider default along with a spider config file. See spider.pl for details.
SWISH::Filter was updated. The way filters are created has changed. If you created your own filters you will need to update them. Take a look at SWISH::Filter and the filters included in the distribution.
- Updates to Documentation
Richard Morin submitted formatting and punctuation updates to the README and INSTALL docs.
- Added -R option to support IDF word weighting in ranking. (karman)
Added Inverse Document Frequency calculation to the getrank() routine. This will allow the relative frequency of a word in relationship to other words in the query to impact the ranking of documents.
Example: if 'foo' is present twice as often as 'bar' in the collection as a whole, a search for 'foo bar' will weight documents with 'bar' more heavily (i.e., higher rank) than those with 'foo'.
The impact is greatest when OR'ing words in a query rather than AND'ing them (which is the default).
Also added Rank discussion to the FAQ.
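The 'foo'/'bar' example can be worked through with the textbook IDF formula, log(N / df). This is an assumption for illustration: the exact calculation in getrank() is not given here, and the collection size and document counts below are made up.

```python
import math

def idf(total_docs, docs_with_word):
    """Textbook inverse document frequency: rarer words score higher."""
    return math.log(total_docs / docs_with_word)

total_docs = 1000                     # hypothetical collection size
doc_freq = {"foo": 200, "bar": 100}   # 'foo' is twice as common as 'bar'

weights = {word: idf(total_docs, df) for word, df in doc_freq.items()}
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
# -> foo: 1.609
# -> bar: 2.303
```

With these numbers a match on 'bar' contributes more than a match on 'foo', which is the behaviour the example describes.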
- Updates to the example scripts
Updated PhraseHighlight.pm as suggested by Bill Schell for an optimization when all words in a document are highlighted.
Updated search.cgi and PhraseHighlight.pm to use the internal stemmers via the SWISH::API module as suggested by Jonas Wolf.
- Leak when using C library
David Windmueller found a memory leak when calling multiple searches on a swish handle. The problem was swish loading the pre-sorted property index on every search, even after the table had been loaded into memory.
- Swish.cgi now kills swish-e on time out
The example script swish.cgi uses an alarm (on platforms that support alarm) to abort processing after some number of seconds, but it was not killing the child process, swish-e. Bill Schell submitted a patch to kill the child when the alarm triggers.
- The template search.tt was renamed to swish.tt
The template was renamed because it's used by swish.cgi, not by search.cgi, which was confusing.
- Updates to the search.cgi
The example script search.cgi was updated to work better with mod_perl and to use external template files and style sheets.
- New MS Word Filter
James Job provided the SWISH::Filter::Doc2html filter that uses the wvWare (http://wvware.sourceforge.net/) program for filtering MS Word documents. If both catdoc and wvWare are installed then wvWare will be used.
wvWare is reported to do a good job at converting MS Word docs to HTML. In a few tests it did work well, but in other cases it failed to generate correct output. It was also much, much slower than catdoc. I tested with wvWare 0.7.3 on Debian Linux. Testing with both is recommended.
- Change in way symbolic links are followed
John-Marc Chandonia pointed out that if a symlink is skipped by FileRules, then the actual file/directory is marked as "already seen" and cannot be indexed by other links or directly.
Now, files and directories are not marked "already seen" until after passing FileRules (i.e., after a file is actually indexed or a directory is processed).
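The ordering of the two checks can be sketched as below. This is a simplified model, not the Swish-e source: `index_paths`, the `identity_of` table (standing in for whatever identifies the real file, such as a device/inode pair), and the sample paths are all hypothetical.

```python
def index_paths(paths, identity_of, passes_rules):
    """Index paths, recording a file as seen only after it passes the rules."""
    seen = set()
    indexed = []
    for path in paths:
        if not passes_rules(path):
            continue  # skipped entries are NOT recorded as seen
        identity = identity_of[path]
        if identity in seen:
            continue  # same underlying file already indexed via another path
        seen.add(identity)
        indexed.append(path)
    return indexed

# Three paths that all resolve to the same underlying file (identity 42).
identity_of = {"links/skipme": 42, "docs/manual.html": 42, "links/alias": 42}
paths = ["links/skipme", "docs/manual.html", "links/alias"]

result = index_paths(paths, identity_of, lambda p: not p.endswith("skipme"))
print(result)  # -> ['docs/manual.html']
```

If the identity were recorded before the rules check, the skipped link would have blocked docs/manual.html from being indexed at all.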
- Could not set SwishSetSort() more than once
David Windmueller found a problem when trying to set the sort order more than once on an existing search object. Memory was not correctly reset after clearing the previous sort values.
- Access MetaNames and PropertyNames from API
Patch provided by Jamie Herre to access the MetaNames and PropertyNames via the C API and to test via the testlib program. SWISH::API also updated to access this data.
- SwishResultPropertyULong() bug fixed
David Windmueller reported that SwishResultPropertyULong() was returning ULONG_MAX on all calls. This was fixed.
- Null written to wrong location in file.c
Bill Schell with the help of valgrind found a null written past the end of a buffer in file.c in the code that supports the old parsers.
Also fixed a logic error when setting the incremental update mode. Caught by Paul Loner.
Version 2.4.3-pr1 - Wed Dec 1 09:52:50 PST 2004
- "Fixed" libxml2's change in UTF8Toisolat1() return value
Bernhard Weisshuhn supplied a patch to parser.c for checking the return value of UTF8Toisolat1(). Seems that libxml2 now returns the number of characters converted instead of zero for success.
http://bugzilla.gnome.org/show_bug.cgi?id=153937
- Added swish-config and pkg-config
Swish now provides a swish-config script and config file for the pkg-config utility. These tools help when building programs that link with the swish-e library.
The SWISH::API Makefile.PL program uses swish-config to locate the installation directory of swish-e. This should make building SWISH::API easier when swish-e is installed in a non-standard location.
- Fixed rank bias in merge
Peter van Dijk noticed that MetaNamesRank settings were not being copied to the output index when merging.
- Added SwishFuzzy function
SwishFuzzy function (SWISH::API::Fuzzy) lets you stem a word without first searching. This might be helpful for playing with queries prior to the search.
- Fixed translate character table
Michael Levy found an error in the table used to translate 8859-1 to ascii7. Luckily, it was an upper case translation and the table is only used on lower case characters.
- MetaNamesRank documentation
Changed the 'not yet implemented' caveat to 'implemented but experimental'.
- Added Continuation option to config processing
You can now use continuation lines in the config file:
IgnoreWords \ the \ am \ is \ are \ wasThere may not be any characters following the backslash.
- Fixed Buzzwords (and other word lists entered in the config)
Words entered in config were not converted to lower case before storing in the index.
- Fixed metaname mapping problem in Merge
Peter Karman found an error when merging indexes where the source indexes had the same metanames, but listed in a different order in their config files. Words would then be indexed under the wrong metaID number in the output index.
- SWISH::Filters and spider.pl updates
The web spider spider.pl was updated to work better with SWISH::Filter by default and also make it easier to use the spider default along with a spider config file. See spider.pl for details.
SWISH::Filter was updated. The way filters are created has changed. If you created your own filters you will need to update them. Take a look at SWISH::Filter and the filters included in the distribution.
- Updates to Documentation
Richard Morin submitted formatting and punctuation dates to the README and INSTALL docs.
- Added -R option to support IDF word weighting in ranking. (karman)
Added Inverse Document Frequency calculation to the getrank() routine. This will allow the relative frequency of a word in relationship to other words in the query to impact the ranking of documents.
Example: if 'foo' is present twice as often as 'bar' in the collection as a whole, a search for 'foo bar' will weight documents with 'bar' more heavily (i.e., higher rank) than those with 'foo'.
The impact is greatest when OR'ing words in a query rather than AND'ing them (which is the default).
Also added Rank discussion to the FAQ.
- Updates to the example scripts
Updated PhraseHighlight.pm as suggested by Bill Schell for an optimization when all words in a document are highlighted.
Updated search.cgi and PhraseHighlight.pm to use the internal stemmers via the SWISH::API module as suggested by Jonas Wolf.
- Leak when using C library
David Windmueller found a memory leak when calling multiple searches on a swish handle. The problem was swish loading the pre-sorted property index on every search, even after the table had been loaded into memory.
- Swish.cgi now kills swish-e on time out
The example script swish.cgi uses an alarm (on platforms that support alarm) to abort processing after some number of seconds, but it was not killing the child process, swish-e. Bill Schell submitted a patch to kill the child when the alarm triggers.
- The template search.tt was renamed to swish.tt
The template was renamed because it's used by swish.cgi, not by search.cgi, which was confusing.
- Updates to the search.cgi
The example script search.cgi was updated to work better with mod_perl and to use external template files and style sheets.
- New MS Word Filter
James Job provided the SWISH::Filter::Doc2html filter that uses the wvWare (http://wvware.sourceforge.net/) program for filtering MS Word documents. If both catdoc and wvWare are installed then wvWare will be used.
wvWare is reported to do a good job at converting MS Word docs to HTML. In a few tests it did work well, but other cases it failed to generate correct output. It was also much, much slower than catdoc. I tested with wvWare 0.7.3 on Debian Linux. Testing with both is recommended.
- Change in way symbolic links are followed
John-Marc Chandonia pointed out that if a symlink is skipped by FileRules, then the actual file/directory is marked as "already seen" and cannot be indexed by other links or directly.
Now, files and directories are not marked "already seen" until after passing FileRules (i.e after a file is actually indexed or a directory is processed).
- Could not set SwishSetSort() more than once
David Windmueller found a problem when trying to set the sort order more than once on an existing search object. Memory was not correctly reset after clearing the previous sort values.
- Access MetaNames and PropertyNames from API
Patch provided by Jamie Herre to access the MetaNames and PropertyNames via the C API and to test via the testlib program. Swish::API also updated to access this data.
- SwishResultPropertyULong() bug fixed
David Windmueller reported that SwishResultPropertyULong() was returning ULONG_MAX on all calls. This was fixed.
- Null written to wrong location in file.c
Bill Schell with the help of valgrind found a null written past the end of a buffer in file.c in the code that supports the old parsersas a bit of confusion on how to use incremental indexing (still experimental) so added better logic for error messages.
Also fixed a logic error when setting the incremental update mode. Caught by Paul Loner.
Version 2.4.3-pr1 - Wed Dec 1 09:52:50 PST 2004
- "Fixed" libxml2's change in UTF8Toisolat1() return value
Bernhard Weisshuhn supplied a patch to parser.c for checking the return value of UTF8Toisolat1(). Seems that libxml2 now returns the number of characters converted instead of zero for success.
http://bugzilla.gnome.org/show_bug.cgi?id=153937
- Added swish-config and pkg-config
Swish now provides a swish-config script and config file for the pkg-config utility. These tools help when building programs that link with the swish-e library.
The SWISH::API Makefile.PL program uses swish-config to locate the installation directory of swish-e. This should make building SWISH::API easier when swish-e is installed in a non-standard location.
- Fixed rank bias in merge
Peter van Dijk noticed that MetaNamesRank settings were not being copied to the output index when merging.
- Added SwishFuzzy function
SwishFuzzy function (SWISH::API::Fuzzy) lets you stem a word without first searching. This might be helpful for playing with queries prior to the search.
- Fixed translate character table
Michael Levy found an error in the table used to translate 8859-1 to ascii7. Luckily, it was an upper case translation and the table is only used on lower case characters.
- MetaNamesRank documentation
Changed the 'not yet implemented' caveat to 'implemented but experimental'.
- Added Continuation option to config processing
You can now use continuation lines in the config file:
IgnoreWords \ the \ am \ is \ are \ wasThere may not be any characters following the backslash.
- Fixed Buzzwords (and other word lists entered in the config)
Words entered in config were not converted to lower case before storing in the index.
- Fixed metaname mapping problem in Merge
Peter Karman found an error when merging indexes where the source indexes had the same metanames, but listed in a different order in their config files. Words would then be indexed under the wrong metaID number in the output index.
- SWISH::Filters and spider.pl updates
The web spider spider.pl was updated to work better with SWISH::Filter by default and also make it easier to use the spider default along with a spider config file. See spider.pl for details.
SWISH::Filter was updated. The way filters are created has changed. If you created your own filters you will need to update them. Take a look at SWISH::Filter and the filters included in the distribution.
- Updates to Documentation
Richard Morin submitted formatting and punctuation dates to the README and INSTALL docs.
- Added -R option to support IDF word weighting in ranking. (karman)
Added Inverse Document Frequency calculation to the getrank() routine. This will allow the relative frequency of a word in relationship to other words in the query to impact the ranking of documents.
Example: if 'foo' is present twice as often as 'bar' in the collection as a whole, a search for 'foo bar' will weight documents with 'bar' more heavily (i.e., higher rank) than those with 'foo'.
The impact is greatest when OR'ing words in a query rather than AND'ing them (which is the default).
Also added Rank discussion to the FAQ.
- Updates to the example scripts
Updated PhraseHighlight.pm as suggested by Bill Schell for an optimization when all words in a document are highlighted.
Updated search.cgi and PhraseHighlight.pm to use the internal stemmers via the SWISH::API module as suggested by Jonas Wolf.
- Leak when using C library
David Windmueller found a memory leak when calling multiple searches on a swish handle. The problem was swish loading the pre-sorted property index on every search, even after the table had been loaded into memory.
- Swish.cgi now kills swish-e on time out
The example script swish.cgi uses an alarm (on platforms that support alarm) to abort processing after some number of seconds, but it was not killing the child process, swish-e. Bill Schell submitted a patch to kill the child when the alarm triggers.
- The template search.tt was renamed to swish.tt
The template was renamed because it's used by swish.cgi, not by search.cgi, which was confusing.
- Updates to the search.cgi
The example script search.cgi was updated to work better with mod_perl and to use external template files and style sheets.
- New MS Word Filter
James Job provided the SWISH::Filter::Doc2html filter that uses the wvWare (http://wvware.sourceforge.net/) program for filtering MS Word documents. If both catdoc and wvWare are installed then wvWare will be used.
wvWare is reported to do a good job at converting MS Word docs to HTML. In a few tests it did work well, but other cases it failed to generate correct output. It was also much, much slower than catdoc. I tested with wvWare 0.7.3 on Debian Linux. Testing with both is recommended.
- Change in way symbolic links are followed
John-Marc Chandonia pointed out that if a symlink is skipped by FileRules, then the actual file/directory is marked as "already seen" and cannot be indexed by other links or directly.
Now, files and directories are not marked "already seen" until after passing FileRules (i.e after a file is actually indexed or a directory is processed).
- Could not set SwishSetSort() more than once
David Windmueller found a problem when trying to set the sort order more than once on an existing search object. Memory was not correctly reset after clearing the previous sort values.
- Access MetaNames and PropertyNames from API
Patch provided by Jamie Herre to access the MetaNames and PropertyNames via the C API and to test via the testlib program. Swish::API also updated to access this data.
- SwishResultPropertyULong() bug fixed
David Windmueller reported that SwishResultPropertyULong() was returning ULONG_MAX on all calls. This was fixed.
- Null written to wrong location in file.c
Bill Schell with the help of valgrind found a null written past the end of a buffer in file.c in the code that supports the old parsersas a bit of confusion on how to use incremental indexing (still experimental) so added better logic for error messages.
Also fixed a logic error when setting the incremental update mode. Caught by Paul Loner.
Version 2.4.3-pr1 - Wed Dec 1 09:52:50 PST 2004
- "Fixed" libxml2's change in UTF8Toisolat1() return value
Bernhard Weisshuhn supplied a patch to parser.c for checking the return value of UTF8Toisolat1(). Seems that libxml2 now returns the number of characters converted instead of zero for success.
http://bugzilla.gnome.org/show_bug.cgi?id=153937
- Added swish-config and pkg-config
Swish now provides a swish-config script and config file for the pkg-config utility. These tools help when building programs that link with the swish-e library.
The SWISH::API Makefile.PL program uses swish-config to locate the installation directory of swish-e. This should make building SWISH::API easier when swish-e is installed in a non-standard location.
- Fixed rank bias in merge
Peter van Dijk noticed that MetaNamesRank settings were not being copied to the output index when merging.
- Added SwishFuzzy function
SwishFuzzy function (SWISH::API::Fuzzy) lets you stem a word without first searching. This might be helpful for playing with queries prior to the search.
- Fixed translate character table
Michael Levy found an error in the table used to translate 8859-1 to ascii7. Luckily, it was an upper case translation and the table is only used on lower case characters.
- MetaNamesRank documentation
Changed the 'not yet implemented' caveat to 'implemented but experimental'.
- Added Continuation option to config processing
You can now use continuation lines in the config file:
IgnoreWords \ the \ am \ is \ are \ wasThere may not be any characters following the backslash.
- Fixed Buzzwords (and other word lists entered in the config)
Words entered in config were not converted to lower case before storing in the index.
- Fixed metaname mapping problem in Merge
Peter Karman found an error when merging indexes where the source indexes had the same metanames, but listed in a different order in their config files. Words would then be indexed under the wrong metaID number in the output index.
- SWISH::Filters and spider.pl updates
The web spider spider.pl was updated to work better with SWISH::Filter by default and also make it easier to use the spider default along with a spider config file. See spider.pl for details.
SWISH::Filter was updated. The way filters are created has changed. If you created your own filters you will need to update them. Take a look at SWISH::Filter and the filters included in the distribution.
- Updates to Documentation
Richard Morin submitted formatting and punctuation dates to the README and INSTALL docs.
- Added -R option to support IDF word weighting in ranking. (karman)
Added Inverse Document Frequency calculation to the getrank() routine. This will allow the relative frequency of a word in relationship to other words in the query to impact the ranking of documents.
Example: if 'foo' is present twice as often as 'bar' in the collection as a whole, a search for 'foo bar' will weight documents with 'bar' more heavily (i.e., higher rank) than those with 'foo'.
The impact is greatest when OR'ing words in a query rather than AND'ing them (which is the default).
Also added Rank discussion to the FAQ.
- Updates to the example scripts
Updated PhraseHighlight.pm as suggested by Bill Schell for an optimization when all words in a document are highlighted.
Updated search.cgi and PhraseHighlight.pm to use the internal stemmers via the SWISH::API module as suggested by Jonas Wolf.
- Leak when using C library
David Windmueller found a memory leak when calling multiple searches on a swish handle. The problem was swish loading the pre-sorted property index on every search, even after the table had been loaded into memory.
- swish.cgi now kills swish-e on timeout
The example script swish.cgi uses an alarm (on platforms that support alarm) to abort processing after some number of seconds, but it was not killing the child process, swish-e. Bill Schell submitted a patch to kill the child when the alarm triggers.
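The same pattern, killing the child when the time limit expires, can be sketched as follows. Note that this sketch uses a wait timeout rather than alarm(), and the slow command is a stand-in for a long-running search, not swish-e itself. The function name `run_with_timeout` is hypothetical.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run command; kill it and return None if it exceeds the time limit."""
    child = subprocess.Popen(command)
    try:
        return child.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        child.kill()  # without this the child would keep running
        child.wait()  # reap the process so it does not linger as a zombie
        return None


# A stand-in for a slow search process.
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(slow, 0.5))  # None
```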
- The template search.tt was renamed to swish.tt
The template was renamed because it's used by swish.cgi, not by search.cgi, which was confusing.
- Updates to search.cgi
The example script search.cgi was updated to work better with mod_perl and to use external template files and style sheets.
- New MS Word Filter
James Job provided the SWISH::Filter::Doc2html filter that uses the wvWare (http://wvware.sourceforge.net/) program for filtering MS Word documents. If both catdoc and wvWare are installed then wvWare will be used.
wvWare is reported to do a good job at converting MS Word docs to HTML. In a few tests it did work well, but in other cases it failed to generate correct output. It was also much, much slower than catdoc. I tested with wvWare 0.7.3 on Debian Linux. Testing with both is recommended.
- Change in way symbolic links are followed
John-Marc Chandonia pointed out that if a symlink is skipped by FileRules, then the actual file/directory is marked as "already seen" and cannot be indexed by other links or directly.
Now, files and directories are not marked "already seen" until after passing FileRules (i.e., after a file is actually indexed or a directory is processed).
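The new ordering can be sketched as below. This is a simplified model, not the indexer's code: `walk`, the `passes_rules` callback (a stand-in for FileRules), and the sample paths are all hypothetical.

```python
def walk(entries, passes_rules):
    """Yield paths to index, marking targets seen only after they pass rules.

    entries: (path, real_target) pairs in traversal order.
    passes_rules: stand-in for FileRules; True if the path may be indexed.
    """
    seen = set()
    for path, target in entries:
        if not passes_rules(path):
            continue  # skipped paths do not mark the target as seen
        if target in seen:
            continue  # already indexed through another path
        seen.add(target)
        yield path


entries = [
    ("/docs/skip-me-link", "/data/report.html"),  # symlink excluded by rules
    ("/data/report.html", "/data/report.html"),   # the real file
    ("/docs/alias", "/data/report.html"),         # second link to same file
]
indexed = list(walk(entries, lambda p: "skip-me" not in p))
print(indexed)  # ['/data/report.html']
```

Under the old behaviour the first entry would have marked the target as seen, and the real file would never have been indexed.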
- Could not set SwishSetSort() more than once
David Windmueller found a problem when trying to set the sort order more than once on an existing search object. Memory was not correctly reset after clearing the previous sort values.
- Access MetaNames and PropertyNames from API
Patch provided by Jamie Herre to access the MetaNames and PropertyNames via the C API and to test via the testlib program. SWISH::API was also updated to access this data.
- SwishResultPropertyULong() bug fixed
David Windmueller reported that SwishResultPropertyULong() was returning ULONG_MAX on all calls. This was fixed.
- Null written to wrong location in file.c
Bill Schell with the help of valgrind found a null written past the end of a buffer in file.c in the code that supports the old parsers.