One of the nice features of Mac OS X is Apple's Spotlight.
It makes it easy to find documents because it supports full
text search and is aware of different file types. In the open
source world, there are many search tools for Linux, but they all
fail in different ways. Some of them are slow. Others
don't support full text search and rely on inotify.
Linux solutions
With inotify, the Linux kernel can notify a program that a file
has changed by path name. In the BSD community, we have
kqueue, which reports changes per open file descriptor. Ideally, one would
create a system daemon that can monitor changes in files and update
the index on the fly. This is planned for a future version of
msearch(1). A flaw with most BSD approaches is that it's easy
to hit the kern.maxfiles limit as one has to have many directories
and files open to detect changes. kqueue approaches tend to
work with UFS and UFS2 file systems only. Someone using ZFS
or FAT32 would not get change notifications unless polling was used. Most modern
Linux systems use gamin or FAM to monitor file changes.
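As an illustration of the polling fallback mentioned above, here is a minimal sketch that avoids kqueue and inotify entirely. It compares two snapshots of a directory tree, so it holds no files open and does not depend on the file system. The function names snapshot and diff_snapshots are hypothetical and are not part of msearch.

```python
import os


def snapshot(root):
    """Map every file path under root to its (mtime_ns, size) pair."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # the file vanished between the walk and the stat
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(old, new):
    """Return (added, removed, modified) sets of paths between two snapshots."""
    old_paths, new_paths = set(old), set(new)
    added = new_paths - old_paths
    removed = old_paths - new_paths
    modified = {p for p in old_paths & new_paths if old[p] != new[p]}
    return added, removed, modified
```

A caller would take one snapshot, wait, take another, and pass only the changed paths to the indexer. The cost is a full tree walk on every pass, which is why a notification interface is preferable where one is available.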
Many of the Linux solutions are under the GPL license. They were
not designed for BSD. I've started down the path of solving
this problem. The first iteration of my work is called
msearch. msearch(1) is a command line tool to search for
files on the computer either matching elements of the path or by
using the full text search feature.
Indexing
All text files on the computer can be indexed by msearch.
It uses libmagic to determine the mime type of the file.
This allows it to skip files that are empty, binary, or
otherwise useless to the search tool.
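The filtering step can be approximated without libmagic. The sketch below is a simplified stand-in, not the libmagic check msearch performs: it treats a file as indexable when it is non-empty, contains no NUL byte in its first few kilobytes, and decodes as UTF-8. The name looks_indexable and the 4096-byte sample size are assumptions made for the example.

```python
import codecs


def looks_indexable(path, sample_size=4096):
    """Heuristic: True if the file appears to be non-empty UTF-8 text."""
    try:
        with open(path, "rb") as fh:
            sample = fh.read(sample_size)
    except OSError:
        return False  # unreadable or missing files are skipped
    if not sample:
        return False  # empty file
    if b"\x00" in sample:
        return False  # a NUL byte strongly suggests binary data
    # An incremental decoder tolerates a multi-byte character that was cut
    # off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A real MIME type lookup recognizes far more formats than this heuristic does. The heuristic only shows where the skip decision sits in the indexing loop.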
msearch(1) uses two index files generated by a program called
msearch.index. /var/db/msearch.db is a sqlite database
containing path information, owner, group, and file size at the
time of indexing. /var/db/msearch_full.db contains a sqlite 3 FTS4
full text index of the text files on the computer. It makes
use of zlib to compress the text data. On my computer,
approximately 350,000 files were indexed and 84,000 were considered
text files indexable by the full text engine. Prior to adding
compression, the database used 850MB of space. After compression,
the file uses 413MB. Another compression algorithm might save
additional space at the expense of indexing
performance.
The current version of msearch relies on a periodic script
similar to locate(1). It is run weekly and must be turned on
with weekly_msearch_enable="YES" in periodic.conf. I would
like to replace this process with a daemon that handles search
requests and indexing. Apple's search features work in this
manner.
Graphical Search
Most of the logic for msearch(1) was placed in a shared library,
libmsearch, which can be used to create a graphical search tool.
I envision a Sherlock-like search tool for the initial
release and possibly an integrated solution if MidnightBSD ever
gets its own window manager.
Security
There are several possible issues with generating an index of
all files. If the index is readable by any user, it could
allow one to open the sqlite file and read the contents of
sensitive files. For this reason, I've limited the indexer so
that it cannot run as the root user. Files must be readable
by nobody (if using the periodic script) to become part of the
index.
There is also the possibility of SQL injection. The
database files aren't writable by normal users and the indexer uses
prepared statements. As the searching functionality is
currently using a custom built search string, this could result in
undesired behavior. It's also not recommended to do a search
as the root user. sqlite does have the ability to load
extensions, and this feature is used to compress and rank full text
data. The extension loading is turned off right after the
database is created to avoid problems from users.
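The difference between a custom built search string and a prepared statement can be sketched as follows. This is an illustration only: the files table and the search_paths function are hypothetical, and the real msearch code is written against the sqlite C API rather than Python. The user's term is bound as a parameter, so quotes in it cannot change the structure of the query.

```python
import sqlite3


def escape_like(term):
    """Escape LIKE wildcards so the term is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_paths(conn, term):
    """Return paths containing term, using a bound parameter."""
    pattern = "%" + escape_like(term) + "%"
    rows = conn.execute(
        "SELECT path FROM files WHERE path LIKE ? ESCAPE '\\' ORDER BY path",
        (pattern,),
    )
    return [row[0] for row in rows]
```

With string concatenation, a term such as x' OR '1'='1 would become part of the SQL text. With a bound parameter it is only ever compared against path names.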
Future directions
I have a large list of features to add to msearch(1). I
plan to add filtering based on file size, user id, group id,
created and modified times. I've considered adding a network search
feature in combination with the plans for the search daemon and
indexing in near real time with file monitoring. In order
for this to work efficiently, a new kernel interface would need to
be created or kqueue would need to be modified.
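The planned filters on file size, user id, group id, and modification time could be expressed as optional conditions over the path database. The sketch below shows one possible shape and is not a design commitment: the files table, its column names, and the find_files function are all hypothetical.

```python
import sqlite3


def find_files(conn, min_size=None, max_size=None, uid=None, gid=None,
               modified_after=None):
    """Return paths matching every filter that was supplied."""
    clauses, params = [], []
    if min_size is not None:
        clauses.append("size >= ?")
        params.append(min_size)
    if max_size is not None:
        clauses.append("size <= ?")
        params.append(max_size)
    if uid is not None:
        clauses.append("uid = ?")
        params.append(uid)
    if gid is not None:
        clauses.append("gid = ?")
        params.append(gid)
    if modified_after is not None:
        clauses.append("mtime > ?")
        params.append(modified_after)
    # Only fixed clause text is joined into the SQL; values stay bound.
    sql = "SELECT path FROM files"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY path"
    return [row[0] for row in conn.execute(sql, params)]
```

For example, find_files(conn, uid=1001, min_size=1024) would list files owned by user 1001 that are at least 1 KB in size.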
I don't intend for this tool to replace locate(1), find(1) or
similar search functions, but merely allow users to have an
additional option with full text.
Performance
Full text searches are quite fast. Simple queries such as
searching for Linux are done in seconds. A search against
path names takes longer than locate(1), but is still respectable.
locate(1) uses a path compression technique to keep the database
small and was optimized for low resources. msearch(1) takes
advantage of the convenience of sqlite 3 and the modern performance
of PCs.
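The path compression used by locate(1) is commonly described as front coding, also called incremental encoding. Each entry in a sorted list stores only the number of leading characters it shares with the previous entry, plus the remaining suffix. The sketch below is a simplified illustration of that idea and does not reproduce the actual locate database format, which applies further encoding on top.

```python
import os


def front_encode(sorted_paths):
    """Encode sorted paths as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for path in sorted_paths:
        # commonprefix compares character by character, not by component
        shared = len(os.path.commonprefix([prev, path]))
        encoded.append((shared, path[shared:]))
        prev = path
    return encoded


def front_decode(encoded):
    """Rebuild the original path list from (shared, suffix) pairs."""
    paths = []
    prev = ""
    for shared, suffix in encoded:
        path = prev[:shared] + suffix
        paths.append(path)
        prev = path
    return paths
```

Because neighboring paths in a sorted list usually share long prefixes, most entries shrink to a small number and a short suffix. The trade-off is that the list must be decoded sequentially, whereas a sqlite table can be queried through an index.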