Swatantra Kumar: Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

Solr is a layer of code on top of Lucene that transforms Lucene into a search platform for building search applications. Solr was created by Yonik Seeley while at CNET and contributed to Apache by CNET. Solr provides the following capabilities:

1. Web service: Solr exposes Lucene over HTTP, allowing programs written in any language to invoke it
2. XML-based schema for managing indexed fields and their characteristics
3. System administration tools for configuration, data loading, index replication, statistics, logging and cache management
4. Large scale distributed search
5. Fixed/paid result list placement
6. Faceting -- the dynamic clustering of items or search results into categories that lets users drill into search results (or even skip searching entirely) by any value in any field, as seen on popular ecommerce sites such as Amazon
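
As a rough illustration of the web service and faceting points above, here is a minimal Python sketch that builds a faceted query URL and reads facet counts out of a JSON response. The base URL, the "category" field and the sample response are assumptions made up for this example; the parameter names (q, rows, wt, facet, facet.field) are standard Solr request parameters, and the flat value/count list is how the JSON response writer lays out facet fields by default.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL: host, port and path depend on the deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_query(text, facet_field, rows=10):
    """Return a select URL asking for JSON results faceted on one field."""
    params = [
        ("q", text),
        ("rows", rows),
        ("wt", "json"),               # response writer: JSON instead of XML
        ("facet", "true"),
        ("facet.field", facet_field),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def facet_counts(response_text, facet_field):
    """Convert the flat [value, count, value, count, ...] list to a dict."""
    body = json.loads(response_text)
    flat = body["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))


# Illustrative response, hand-written for this example (not real data).
SAMPLE_RESPONSE = """
{"response": {"numFound": 17, "start": 0, "docs": []},
 "facet_counts": {"facet_queries": {},
                  "facet_fields": {"category": ["books", 12, "music", 5]}}}
"""

if __name__ == "__main__":
    print(build_facet_query("solr search", "category"))
    print(facet_counts(SAMPLE_RESPONSE, "category"))
```

Sending the URL with any HTTP client returns the response that facet_counts parses; the sketch stops short of the network call so that it runs without a Solr server.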

Most users building Lucene-based search applications will find they can do so more quickly if they start with Solr, since it contains many of the capabilities needed to turn a core search capability into a full-fledged search application. Many of the more recent large Lucene-based installations use Solr, including AOL, Comcast Interactive Media and Netflix, and of course CNET. However, as in any open layered environment, users can still choose to work directly with the underlying Lucene library, perhaps to manipulate or exploit lower-level Lucene capabilities.

Feature List of Solr

1) Faceted search
2) Full-text search
3) Hit highlighting
4) Dynamic clustering
5) Sorting
6) Filtering
7) Spell checking
8) Elevation
9) Boosting at index and query time
10) "Did you mean" spell checking
11) Finding Documents that are "More like this"
12) Overriding search results based on editorial input (also known as paid placement)
13) Term
14) Term Frequency
15) Position (based on analysis)
16) Offset (character based)
17) IDF – Inverse Document Frequency
18) CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
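
To make items 13 to 17 a little more concrete, the following Python sketch computes term frequency, positions and character offsets for one piece of text, plus an inverse document frequency. It is a toy model, not Solr code: the whitespace-and-lowercase "analysis" is an assumption chosen for brevity, and the idf formula is the one used by Lucene's classic TF-IDF similarity as far as I know, so treat it as illustrative.

```python
import math


def term_stats(text):
    """Return per-term frequency, positions and character offsets.

    Analysis here is just "split on whitespace, then lowercase", which is
    far simpler than a real Solr analyzer chain.
    """
    stats = {}
    cursor = 0
    for position, token in enumerate(text.split()):
        start = text.index(token, cursor)   # character offset of this token
        end = start + len(token)
        cursor = end
        entry = stats.setdefault(
            token.lower(), {"tf": 0, "positions": [], "offsets": []}
        )
        entry["tf"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((start, end))
    return stats


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    print(term_stats("Solr uses Lucene Solr")["solr"])
    print(idf(doc_freq=1, num_docs=1000))
```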

Query

1 HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
2 Sort by any number of fields
3 Advanced DisMax query parser for high relevancy results from user-entered queries
4 Highlighted context snippets
5 Faceted Searching based on unique field values and explicit queries
6 Spelling suggestions for user queries
7 More Like This suggestions for given document
8 Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
9 Function Query - influence the score by a function of a field's numeric value or ordinal
10 Date Math - specify dates relative to "NOW" in queries and updates
11 Performance Optimizations
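
Several of the query features above are just request parameters, so a short Python sketch can show how they combine. The field names (title, body, popularity, published) and the base URL are hypothetical; the parameter names (defType, qf, bf, fq, sort, hl, hl.fl, wt) and the NOW-7DAYS date math syntax are standard Solr, but check them against the version you run.

```python
from urllib.parse import urlencode

# Hypothetical base URL: host, port and path depend on the deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_dismax_params(user_query):
    """Request parameters for a DisMax query over hypothetical fields."""
    return {
        "defType": "dismax",                   # use the DisMax query parser
        "q": user_query,                       # raw user-entered text
        "qf": "title^2.0 body",                # query fields, title boosted
        "bf": "ord(popularity)",               # function query boost
        "fq": "published:[NOW-7DAYS TO NOW]",  # date math in a filter query
        "sort": "score desc, published desc",  # sort by more than one field
        "hl": "true",                          # highlighted snippets
        "hl.fl": "body",
        "wt": "json",
    }


def build_dismax_url(user_query):
    """Return the full select URL with URL-encoded parameters."""
    return SOLR_SELECT_URL + "?" + urlencode(build_dismax_params(user_query))


if __name__ == "__main__":
    print(build_dismax_url("open source search"))
```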

Cache in Solr

Solr caches are associated with an Index Searcher — a particular 'view' of the index that doesn't change. So as long as that Index Searcher is being used, any items in the cache will be valid and available for reuse. Caching in Solr is unlike ordinary caches in that Solr cached objects will not expire after a certain period of time; rather, cached objects will be valid as long as the Index Searcher is valid.

The current Index Searcher serves requests. When a new searcher is opened, it is auto-warmed while the current one continues to serve external requests: its caches may be prepopulated, or "autowarmed", using data from the caches of the current searcher. Once the new searcher is ready, it is registered as the current searcher and handles all new search requests. The old searcher is closed after all the requests it was servicing have finished.

There are currently two cache implementations — solr.search.LRUCache (LRU = Least Recently Used in memory), and solr.search.FastLRUCache.
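
The idea of an LRU cache that lives as long as its searcher, and is autowarmed from the previous searcher's cache, can be sketched in a few lines of Python. This is a toy model and not Solr's implementation: the class, the function names and the regenerate callback are made up for illustration, loosely mirroring what Solr calls autowarmCount and a cache regenerator.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache bound to one searcher's index view."""

    def __init__(self, size):
        self.size = size
        self._items = OrderedDict()   # least recently used first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.size:
            self._items.popitem(last=False)  # evict least recently used

    def keys_most_recent_first(self):
        return list(reversed(self._items))


def autowarm(old_cache, new_cache, regenerate, autowarm_count):
    """Prepopulate new_cache from the most recently used keys of old_cache.

    Values are recomputed with regenerate(key) instead of being copied,
    because the new searcher sees a different view of the index.
    """
    keys = old_cache.keys_most_recent_first()[:autowarm_count]
    for key in reversed(keys):        # oldest first, so recency is preserved
        new_cache.put(key, regenerate(key))


if __name__ == "__main__":
    old = LRUCache(size=3)
    for query in ["a", "b", "c"]:
        old.put(query, "old results for " + query)
    old.get("a")                      # "a" becomes most recently used
    new = LRUCache(size=3)
    autowarm(old, new, lambda q: "new results for " + q, autowarm_count=2)
    print(new.keys_most_recent_first())
```

Note that nothing here expires on a timer: entries leave the cache only through LRU eviction, or when the whole cache is discarded along with its searcher.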

Admin Interface

1 Comprehensive statistics on cache utilization, updates, and queries
2 Interactive schema browser that includes index statistics
3 Replication monitoring
4 Full logging control
5 Text analysis debugger, showing result of every stage in an analyzer
6 Web Query Interface with debugging output
   o parsed query output
   o Lucene explain() document score detailing
   o explain score for documents outside of the requested range, to debug why a given document wasn't ranked higher

To summarize, Solr is not meant to be a replacement for your RDBMS. Rather, Solr should be used to build the search service. Solr does a good job of searching and finding relevant items for a query. In truth, all search engines can and should be tuned, and Solr is no exception.

Friday, October 15, 2010
