Sunday, July 1, 2012

Apache Lucene: A Smart Guide to Index and Search Text

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform ones.
Lucene is a powerful and efficient way to add customized full-text search to an application.
Let's explore it. The example below is largely self-explanatory.
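
Before the Lucene code, it may help to picture what "indexing" means. The following is a minimal sketch of an inverted index in plain Java, assuming simple lower-casing and splitting on non-alphanumeric characters. It is not Lucene's implementation, and the class name TinyInvertedIndex and its methods are hypothetical names chosen for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: a toy inverted index, not Lucene's data structure.
class TinyInvertedIndex {

    // Maps each lower-cased term to the sorted set of document names that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lower-cases the text, splits it on anything that is not a letter or digit,
    // and records the document name under every resulting term.
    void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Returns the names of the documents containing the term, ignoring case.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "Maruti\nhonda city");
        index.add("b.txt", "maruti\npolo");
        System.out.println(index.search("Maruti")); // both documents contain this term
        System.out.println(index.search("city"));   // only a.txt contains this term
    }
}
```

Looking a term up in the map is what makes searching fast: the documents are scanned once at indexing time rather than on every query.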

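Step 1: Create a Java project in Eclipse and add the Lucene core jar to the build path of
the project. The code below uses the Lucene 2.x API (for example TopDocCollector and
FSDirectory.getDirectory, which were removed in Lucene 3.0), so use a 2.x lucene-core jar.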

Step 2: Create a class LuceneIndexnSearch as shown below:
package com.test;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

@SuppressWarnings("deprecation")
public class LuceneIndexnSearch {

 public static final String SOURCE_FILE = "sourceFileToIndex";
 public static final String INDEX_DIR = "indexDir";

 public static final String FIELD_PATH = "path";
 public static final String FIELD_CONTENTS = "contents";

 

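 // Builds (or rebuilds) the index in INDEX_DIR from every file in SOURCE_FILE.
 public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
  File dir = new File(SOURCE_FILE);
  File[] files = dir.listFiles();
  if (files == null) {
   // listFiles() returns null when the path is missing or is not a directory
   throw new IOException("Cannot list files in " + dir.getAbsolutePath());
  }

  Analyzer analyzer = new StandardAnalyzer();
  boolean recreateIndexIfExists = true;
  IndexWriter indexWriter = new IndexWriter(INDEX_DIR, analyzer, recreateIndexIfExists);
  for (File file : files) {
   Document document = new Document();

   // Store the path as a single untokenized term so it can be shown in the results
   String path = file.getCanonicalPath();
   document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

   // Index the file contents (tokenized by the analyzer, not stored)
   Reader reader = new FileReader(file);
   document.add(new Field(FIELD_CONTENTS, reader));

   indexWriter.addDocument(document);
  }
  // Merge the index segments, then flush and release the write lock
  indexWriter.optimize();
  indexWriter.close();
 }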

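 // Parses searchString against the "contents" field and prints every matching document.
 public static void searchIndex(String searchString) throws IOException, ParseException {
  System.out.println("Searching for '" + searchString + "'");
  Directory directory = FSDirectory.getDirectory(INDEX_DIR);
  IndexReader indexReader = IndexReader.open(directory);
  IndexSearcher indexSearcher = new IndexSearcher(indexReader);

  // Use the same analyzer as at index time so query terms are normalized the same way
  Analyzer analyzer = new StandardAnalyzer();
  QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
  Query query = queryParser.parse(searchString);

  // First pass: a small collector is enough to learn the total number of matching documents
  TopDocCollector collector = new TopDocCollector(5);
  indexSearcher.search(query, collector);
  int numTotalHits = collector.getTotalHits();

  // Second pass: collect every matching document
  collector = new TopDocCollector(numTotalHits);
  indexSearcher.search(query, collector);
  ScoreDoc[] hits = collector.topDocs().scoreDocs;

  for (ScoreDoc sd : hits) {
   int docId = sd.doc;
   Document document = indexSearcher.doc(docId);
   // Note: sd.doc is Lucene's internal document ID, not a count of matches in the file;
   // the relevance score of the hit is available as sd.score
   System.out.println("Number of matches(Hits) in the document "+document.get(FIELD_PATH)+" of the given string "+searchString+" is "+sd.doc);
  }

  // Release the index files
  indexSearcher.close();
  indexReader.close();
 }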

}
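
The searchIndex method above relies on a collector that keeps only the best N hits while still counting every match, which is why it can ask for the total after a first pass limited to 5. The following is a rough sketch of that idea in plain Java using a bounded min-heap. It is an illustration under my own assumptions, not Lucene's code; TopHitsSketch, Hit, collect and topHits are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a "top N" collector; not Lucene's TopDocCollector.
class TopHitsSketch {

    // One matching document: its ID and its relevance score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE =
            Comparator.comparingDouble((Hit h) -> h.score);

    private final int limit;
    private int totalHits;
    // Min-heap: the weakest retained hit sits at the head and is evicted first.
    private final PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE);

    TopHitsSketch(int limit) {
        this.limit = limit;
    }

    // Called once per matching document: always counted, retained only if good enough.
    void collect(int doc, float score) {
        totalHits++;
        heap.add(new Hit(doc, score));
        if (heap.size() > limit) {
            heap.poll(); // drop the lowest-scoring hit
        }
    }

    // Total number of matches seen, including those that were not retained.
    int getTotalHits() {
        return totalHits;
    }

    // The retained hits, best score first.
    List<Hit> topHits() {
        List<Hit> sorted = new ArrayList<>(heap);
        sorted.sort(BY_SCORE.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        TopHitsSketch collector = new TopHitsSketch(2);
        collector.collect(0, 0.3f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.5f);
        System.out.println("total=" + collector.getTotalHits()
                + ", kept=" + collector.topHits().size());
    }
}
```

Because the heap never grows beyond the limit, memory stays bounded no matter how many documents match, while the total count remains exact.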
Step 3: Create two folders: one to hold the files to index and one to hold the index itself (sourceFileToIndex and indexDir in the code above).
In the source folder I have copied two documents (text files). The first is vehicleOwnedByABC.txt, with the following content:

Maruti
mahindra
vento
honda city
honda accord
hyundai



The other file is vehicleOwnedByDEF.txt, with the following content:

swaraj
renault
polo
nissan
maruti
hyundai
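
If you would rather create the folders and sample files from code than by hand, something like the following sketch could be used. It relies only on java.nio.file from the standard library; the class name SampleDataSetup and the baseDir parameter are hypothetical, and it assumes the folder names sourceFileToIndex and indexDir used by the constants in Step 2.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: creates the two folders and the two sample text files.
class SampleDataSetup {

    // Creates sourceFileToIndex and indexDir under baseDir and writes the sample files.
    static void create(Path baseDir) throws IOException {
        Path source = baseDir.resolve("sourceFileToIndex");
        Files.createDirectories(source);
        Files.createDirectories(baseDir.resolve("indexDir"));

        // One vehicle name per line, as in the listings above
        Files.write(source.resolve("vehicleOwnedByABC.txt"),
                Arrays.asList("Maruti", "mahindra", "vento",
                        "honda city", "honda accord", "hyundai"),
                StandardCharsets.UTF_8);
        Files.write(source.resolve("vehicleOwnedByDEF.txt"),
                Arrays.asList("swaraj", "renault", "polo",
                        "nissan", "maruti", "hyundai"),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumes the program is started from the project root
        create(Paths.get("."));
    }
}
```

Run it once from the project root before running the main class in the next step.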


Step 4: Create your main class to test the indexing and searching as follows:
package com.test;

public class Main {
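 public static void main(String[] args) {
  LuceneIndexnSearch lins = new LuceneIndexnSearch();
  try {
   // Build the index first, then run a few sample queries against it
   lins.createIndex();
   // searchIndex is static, so it is called on the class rather than the instance
   LuceneIndexnSearch.searchIndex("Hyundai");
   LuceneIndexnSearch.searchIndex("Maruti");
   LuceneIndexnSearch.searchIndex("Mahindra");
   LuceneIndexnSearch.searchIndex("Honda city");
   LuceneIndexnSearch.searchIndex("Honda accord");
  } catch (Exception e) {
   e.printStackTrace();
  }
 }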

}

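Step 5: Here is the output at the console. Note that the number at the end of each line is the value of sd.doc, Lucene's internal document ID, not a count of matches in the file: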
Searching for 'Hyundai'
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByDEF.txt of the given string Hyundai is 1
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByABC.txt of the given string Hyundai is 0
Searching for 'Maruti'
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByDEF.txt of the given string Maruti is 1
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByABC.txt of the given string Maruti is 0
Searching for 'Mahindra'
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByABC.txt of the given string Mahindra is 0
Searching for 'Honda city'
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByABC.txt of the given string Honda city is 0
Searching for 'Honda accord'
Number of matches(Hits) in the document F:\SpringExamples\LuceneExample\sourceFileToIndex\vehicleOwnedByABC.txt of the given string Honda accord is 0
