
    Msb: Nearest neighbor searching in metric spaces

------------------------------------------------------------------------
Version 0.11 ("very preliminary release")
Jan 2003
------------------------------------------------------------------------


      Introduction

Msb is a collection of subroutines and drivers for nearest neighbor
searching in metric spaces.  It is written in ANSI C, and also compiles
under C++.  The routines are given as input a /distance function /
dist(size_t, size_t), which satisfies the metric space conditions that

      dist(a,a)=0,
      dist(a,b)=dist(b,a);
      dist(a,c) <= dist(a,b) + dist(b,c), (the/triangle/ inequality)

Here dist is assumed to give the distance between points: dist(1,2) is
the distance between "point number 1" and "point number 2". Some
examples of metric spaces include: points with the Euclidean distance
measure or other /L_p / measures, strings under hamming or edit
distance, and bit vectors under hamming distance. The drivers given here
give examples of these measures.

The data structures are a means of potentially speeding up the /nearest
neighbor searching /problem:

    Given a set /S/ of points (also called /sites/) in a metric space,
    build a data structure so that given a query point /q/, the nearest
    site to /q/ can be found quickly.

The /sb(S)/ data structure given here can be built for arbitrary metric
spaces. Measurements <white_paper.pdf> for several cases have shown that
it can be effective, and is, at least, does no harm.

The data structure needs typically a (small) constant amount of
additional space per site. The data structures may well bog down and
give no improvement over brute force for high dimensional points, or
some sets of strings. However, this is not predictable, since a given
set of sites may have a lot of structure, even if it lies in a very
high-dimensional ambient space. So  the data structure may be worth a
try. There is also available a quadtree-like data structure, coded by
Arya and Mount, <http://www.cs.umd.edu/%7Emount/ANN/> for /L_p / spaces,
which is faster in that setting.

The SB library also includes functions for fixed radius queries, where
all sites closer than some given distance are desired, and /k/'th
nearest queries, where the closest /k/ sites are desired, for some input
integer /k/.  The code also internally supports /inverse nearest
neighbor / queries, although that capability is not currently supported
in the interface.


      License

The SB library is provided under the Gnu General Public License,and the
usual disclaimers apply:

/****************************************************************************
* Copyright (C)2002 Lucent Technologies                                     *
* (clarkson@research.bell-labs.com http://cm.bell-labs.com/who/clarkson)    *
*                                                                           *
* This program is free software; you can redistribute it and/or             *
* modify it under the terms of the GNU General Public License               *
* as published by the Free Software Foundation; either                      *
* version 2 of the License, or (at your option) any later version.          *
*                                                                           *
* This program is distributed in the hope that it will be useful,           *
* but WITHOUT ANY WARRANTY; without even the implied warranty of            *
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the             *
* GNU General Public License for more details.                              *
*                                                                           *
* You should have received a copy of the GNU General Public License         *
* along with this program; if not, write to the Free Software               *
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA,                 *
* or visit http://www.gnu.org/licenses/gpl.html                             *
*****************************************************************************/


      Source Code, tar'd, gzipped
      <http://cm.bell-labs.com/who/clarkson/Msb/Msb_0.1.tar.gz>


      Source Code, zipfile
      <http://cm.bell-labs.com/who/clarkson/Msb/Msb_0.1.zip>


      "Slides" from a recent talk
      <http://cm.bell-labs.com/who/clarkson/Msb/t/v.xml>

(Warning: the above "slides" will only be legible in a browser with
native support for mathml and svg: that is, currently only a special
build of Mozilla.  Sorry)


      Installation

The .zip file can be downloaded and extracted using, for example,
winzip. To extract the tar file Msb_0.1.tar.gz, run gunzip on it, and
tar to create the directory Msb_0.1, and cd to that directory:

      gunzip Msb_0.11.tar.gz
      tar xf Msb_0.11.tar
      cd Msb_0.11

*Building using make.  *Now adjust the macros in Make_config to look in
the appropriate directories, and use the appropriate compiler, random
number generator, and library creation and use. The code has been
compiled under ANSI C and C++ on SGI's and under Visual Studio on
Windows.  Finally, type

    make library

to create the library, and

    make drives

to produce the drivers for points under the Euclidean norm, strings
under hamming distance, strings under edit distance, and bitvectors
under hamming distance.

*Building using Visual Studio 7.  *The folder 'win' contains the
solution sb.sln, which contains projects to build the library and driver.


      Using the library

The declarations for the functions for creating and using /sb(S)/ are
given in include/sb.h. So

    #include <sb.h>

must be added to programs calling these functions, where the directory
containing that file must be in the include path when compiling the
program.

To load these functions into the resulting program, on most systems the
linker should be told of the library libSB.a with the option -lSB, and
using the -L option to tell the linker where that library is.


      Using the drivers

The command

    make drives

given in the Msb_0.11 directory, will produce drivers and distance
functions for euclidean data in euclid, strings under hamming distance
in hamming, strings under edit distance in string, and bitvectors in
(you guessed it) bitvector. Some corresponding data sets are included in
those directories also.  The drivers take data from stdin, and print
test results on stdout.  They are included here mainly to show various
applications of the library, and the user will need to examine and
change the source code to adapt them to other needs.

The executable test_sb can be make'd in the euclid subdirectory, and
will generate a variety of distributions of random sites, build the data
structure, and then test the data structure on those sites.


      A Variant Data Structure

Also included is the "MM" data structure, built with the build_M
function, searched with the search_M function, and with driver drive_M.
This data structure does not require the triangle inequality, but does
need some additional points, that are used as representatives of typical
query points.  Most users would probably prefer the /sb /data structure.


        Author

    Ken Clarkson <http://cm.bell-labs.com/who/clarkson/>
    Room 2C-455
    Bell Laboratories
    600 Mountain Ave
    Murray Hill, NJ
    07974
    clarkson@bell-labs.com <mailto:clarkson@bell-labs.com>

/Last Modified/
/Jan 2003/

