luceneconsulting

San Francisco +1 415 287 0121
Lucene is a free, open-source information retrieval library written in Java and supported by the Apache Software Foundation.

Lucene is suitable for any application that requires full-text indexing and search, and is a popular choice for consumer and business (SaaS) web applications, single-site search, and enterprise search. Combined with suitable text extractors, it can index text from a range of content types including HTML, PDF, Microsoft Word and Excel, XML, and OpenDocument format.

Working with search and indexing technology requires skills quite different from those used in typical web application and database development. Problems can arise in fundamental areas like search quality, performance, and system integration, and in low-level aspects such as character encoding and data cleaning.

An outside Lucene consultant can get your search project started quickly, and put in place a plan to avoid pitfalls and blind alleys. Lucene consulting can be focused on design and architecture, implementation, performance tune-ups or Lucene training.

See the list below for typical Lucene consulting topics and add-on components.

Lucene consulting is provided by Nicholas Haddock, Ph.D., who has over 15 years of experience in search technology, text mining, and natural language processing in the context of real-world product development. Projects can be extended to include members of his team, who take on performance tuning, implementation, and system integration assignments.

We have extensive experience with a range of Lucene-related libraries, platforms and languages:

  • Lucene and related libraries: Lucene, Solr, Nutch, PyLucene
  • Languages: Java, Python, C++, .NET, PHP
  • OS platforms: Linux, Windows, Mac OS X
  • Databases: MySQL, PostgreSQL, MS SQL Server, SQLite

Read more about the Lucene consulting team at Atomic Intelligence.

Content
Defining the content to be indexed. Content extraction, conversion, storage and updating, duplicate detection. Web crawling or file content access.
Indexing
Defining the index fields and field types. Using multiple indexes, overlapping or distinct. Indexing and search performance. Running indexing and search as services available via HTTP, XML-RPC, etc.
Search quality
Many search capabilities fail to get adopted by end-users simply because the search results are poor, or poorly presented. A number of simple tweaks or enhancements can improve search quality, including duplicate detection, search relevance scoring, boosting, and use of built-in or custom sorting functions.
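As one illustration of the duplicate-detection tweak, the following minimal Python sketch collapses search hits whose text is identical after case and whitespace normalization, keeping the highest-scoring copy. The `Hit` tuple, the `fingerprint` helper, and the sample data are hypothetical and not part of Lucene; a real deployment would more likely compute fingerprints at indexing time.

```python
import hashlib
import re
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit: a document id, its relevance score, and its text."""
    doc_id: str
    score: float
    text: str


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def collapse_duplicates(hits: Iterable[Hit]) -> List[Hit]:
    """Return hits in descending score order, keeping one hit per fingerprint."""
    seen = set()
    unique = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        key = fingerprint(hit.text)
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


# Hypothetical sample data: "a" and "b" differ only in case and spacing.
hits = [
    Hit("a", 0.9, "Lucene in  Action"),
    Hit("b", 0.7, "lucene in action"),
    Hit("c", 0.5, "Introduction to Information Retrieval"),
]
deduped = collapse_duplicates(hits)
print([h.doc_id for h in deduped])
```

Exact-match fingerprints only catch copies that differ in case or spacing; near-duplicates with small wording changes would need a fuzzier technique such as shingling.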
Hardware requirements
Use of memory, disk size, disk speed, CPU, and distributed servers, and the relative performance improvements and trade-offs.
Database integration
Database integration is one of the major challenges when working with any pure full-text search platform. Two questions in particular arise: which information should be stored in a relational database and which in Lucene indexes, and how can queries that depend on both index- and database-resident data be processed efficiently?
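One common way to handle queries that depend on both stores is to let the index return ranked document ids and then fetch and filter the corresponding rows in the database. The sketch below shows that pattern using Python's built-in `sqlite3` module. The `products` table, its columns, and the `index_hits` list (standing in for results returned by a Lucene search) are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the application's relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, in_stock INTEGER)"
)
conn.executemany(
    "INSERT INTO products (id, name, in_stock) VALUES (?, ?, ?)",
    [(1, "Red kettle", 1), (2, "Blue kettle", 0), (3, "Green teapot", 1)],
)

# Hypothetical ranked (id, score) pairs, as a full-text index might return them.
index_hits = [(2, 0.91), (1, 0.85), (3, 0.40)]


def join_hits_with_rows(conn, hits):
    """Fetch in-stock rows for the hit ids, preserving the index's ranking."""
    if not hits:
        return []
    ids = [doc_id for doc_id, _ in hits]
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name FROM products WHERE in_stock = 1 AND id IN ({placeholders})",
        ids,
    ).fetchall()
    names = dict(rows)  # id -> name for rows that passed the database filter
    return [(doc_id, names[doc_id], score) for doc_id, score in hits if doc_id in names]


results = join_hits_with_rows(conn, index_hits)
print(results)
```

Because the database filter runs after the search, the index may need to return more hits than will finally be displayed. Copying frequently filtered fields into the index is an alternative, at the cost of keeping the two stores in sync.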
Search-based user interfaces
How to blend search with site navigation. How to make sure search results are relevant and consistent with the rest of the application. Appropriate use of advanced search forms or intelligent query interpretation.
Lucene alternatives
Advice on alternative search solutions, both free and fee-based.

  • Text classification
  • Text clustering
  • Parsing and information extraction from free text
  • Data scraping and extraction from the web
  • User interfaces for search and management of unstructured content
  • Search-based analytics


Foundations of Statistical Natural Language Processing, by Christopher D. Manning and Hinrich Schütze (1999)
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008)
Lucene in Action, by Otis Gospodnetic and Erik Hatcher (2004)
Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, by Gerard Salton (1988)

© Copyright 2009 Atomic Intelligence