Full Text Database Indexing with dtSearch
Written by Ian Elliot   
Wednesday, 30 November 2011

Creating a custom DataSource

The basic idea is very simple - you have to create a class that inherits from DataSource. You have to override a few of the DataSource methods to provide the data to the search engine.

You can provide the data to the search engine via DocText, DocStream, DocBytes or DocIsFile. The difference is that DocText is a simple string, whereas the other three provide binary data that is treated as if it were a file of a specified format.

There are only two methods you have to implement - GetNextDoc and Rewind.

The GetNextDoc method has to get the next "document", be it a row in a database table or a file downloaded by any means you want to use, and present it to the indexing engine via one of the properties listed above. It returns true if it has supplied a document and false if it hasn't, for example because the sequence is exhausted.

The Rewind method simply resets the document sequence so that the next GetNextDoc returns the first document in the sequence. It too returns true or false to indicate success or failure.

There are some other properties that you have to set to make everything work well but these are the basic core set. Let's see how it all works.

Rather than write an example that uses ADO, LINQ or some other data protocol, it is simpler to read some files from disk. This shows how everything works, and you can modify it to work with any other protocol. In fact the index to be constructed is the same as in the first example.

First we need to define our own DataSource class:

public class myDataSource : DataSource
{

Usually this would be in another file in the project but when experimenting you can include it within the form's source file. Also, to keep things simple, let's not bother writing a constructor and dispense with error checking. This is not the way you would do it in anything other than an example that has been stripped down to the bare minimum.

We need to override two methods, GetNextDoc and Rewind. The Rewind method has to reset the data import, so this is also the place to write the initialization code:

public override bool Rewind()
{
files = Directory.GetFiles(@"C:\Users\name\Documents");
currentFile = 0;
return true;
}

We are using standard .NET I/O classes to work with the file system. You need to add:

using System.IO;

and declare the two private variables:

private string[] files;
private int currentFile;

We now have a list of file names in the string array files. Notice that we really do need to check that this operation worked and return false if it didn't. In a more realistic application the Rewind might well only reset the position in the data and you would probably need to write a separate initialization method to be used internally by the DataSource class.
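The check mentioned above could be sketched as follows. This is only an illustration: FileListing and TryGetFiles are hypothetical names, not part of dtSearch, and the set of exceptions treated as "failure" is an assumption you may want to widen or narrow.

```csharp
using System;
using System.IO;

public static class FileListing
{
    // Hypothetical helper (not a dtSearch API): lists the files in a folder
    // and reports failure with a bool instead of letting an exception escape.
    public static bool TryGetFiles(string folder, out string[] files)
    {
        try
        {
            files = Directory.GetFiles(folder);
            return true;
        }
        catch (Exception ex) when (ex is IOException
                                || ex is UnauthorizedAccessException
                                || ex is ArgumentException)
        {
            // Missing folder, no permission or malformed path:
            // hand back an empty list and signal failure.
            files = new string[0];
            return false;
        }
    }
}
```

With a helper like this, Rewind could set currentFile to zero and return whatever TryGetFiles returns, so that a missing or unreadable folder is reported to the caller as false.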

The GetNextDoc method could return the next file in the list in a number of different ways - as a file, as a stream or as an array of bytes. We could even read the file in and extract any text it might contain and present this as a string. In this case let's read the file into a byte array and present this to the indexing engine:

public override bool GetNextDoc()
{

First we should check that we haven't reached the end of the list of files:

if (currentFile >= files.Length) 
return false;

As long as there is a file to process we can process it. First we set DocName to the name of the file. Notice that DocName is one of the inherited properties:

DocName = files[currentFile];
currentFile++;

Next we set the inherited date and time stamp properties, using the last write time as the modified date:

DocCreatedDate = File.GetCreationTime(DocName);
DocModifiedDate = File.GetLastWriteTime(DocName);

We also have to set DocIsFile to false to stop the indexing engine reading the file in from disk on its own - yes, we could get it to do all of the work but this wouldn't illustrate how to get raw data to it.

DocIsFile = false;

As we have decided to handle the data input ourselves we next have to read the data into a byte array. We also have to check that the file actually has some data to read:

FileStream reader = File.OpenRead(DocName);
if (reader.Length > 0)
{
byte[] fileData = new byte[reader.Length];
reader.Read(fileData, 0, (int)reader.Length);

At this point we have the entire content of the file stored in the fileData array. However the file data has to be presented in DocBytes and we also have to set HaveDocBytes to true to indicate to the indexing engine that it has to read and process DocBytes:

DocBytes = fileData;
HaveDocBytes = true;
}
reader.Close(); // release the file handle

We can now finish the method and the class:

 return true;
}
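The file-reading step can be made a little more defensive. The sketch below is not dtSearch code and FileBytes.ReadAll is a hypothetical name. It disposes the stream even if an exception is thrown, and it loops because Stream.Read is allowed to return fewer bytes than were requested.

```csharp
using System.IO;

public static class FileBytes
{
    // Hypothetical helper (not a dtSearch API): returns the whole content
    // of a file as a byte array; an empty file gives an empty array.
    public static byte[] ReadAll(string path)
    {
        // The using block closes the stream on every exit path.
        using (FileStream reader = File.OpenRead(path))
        {
            byte[] data = new byte[reader.Length];
            int offset = 0;
            while (offset < data.Length)
            {
                // Read may deliver only part of what was asked for.
                int read = reader.Read(data, offset, data.Length - offset);
                if (read == 0)
                    throw new EndOfStreamException("File ended before all bytes were read.");
                offset += read;
            }
            return data;
        }
    }
}
```

The standard File.ReadAllBytes method does the equivalent in a single call; the loop is spelled out here only to show what reading into DocBytes involves.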

 

The entire class is surprisingly short:

public class myDataSource : DataSource
{
    private string[] files;
    private int currentFile;

    public override bool GetNextDoc()
    {
        // No more files to supply
        if (currentFile >= files.Length) return false;
        DocName = files[currentFile];
        currentFile++;
        DocCreatedDate = File.GetCreationTime(DocName);
        DocModifiedDate = File.GetLastWriteTime(DocName);
        DocIsFile = false;

        FileStream reader = File.OpenRead(DocName);
        if (reader.Length > 0)
        {
            byte[] fileData = new byte[reader.Length];
            reader.Read(fileData, 0, (int)reader.Length);
            DocBytes = fileData;
            HaveDocBytes = true;
        }
        reader.Close(); // release the file handle
        return true;
    }

    public override bool Rewind()
    {
        files = Directory.GetFiles(@"C:\Users\name\Documents");
        currentFile = 0;
        return true;
    }
}

Using the custom DataSource

Now we have the custom DataSource we can make use of it. Setting up the index creation is much the same as before - create IndexJob, set index path and action properties:

IndexJob indexJob = new IndexJob();
indexJob.IndexPath = @"C:\Users\name\AppData\Local\dtSearch\test2";
indexJob.ActionCreate = true;
indexJob.ActionAdd = true;

Next we create an instance of the custom DataSource:

myDataSource dataSource1 = new myDataSource();

Finally we can tell the IndexJob to use the data source and execute the job:

indexJob.DataSourceToIndex = dataSource1;
bool result = indexJob.Execute();

The indexing engine performs a rewind to make sure everything is initialized before it begins.

If you try this out you will discover that the contents of the index are the same as before. The program achieves the same result but it does so in a very different way. Now you can take the same DataSource class and customize it to provide documents or raw text from any source you care to use - ODBC, ADO.NET, LINQ, raw SQL, XML, RSS or any of the many web APIs.
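To give an idea of what such a customization might look like, here is a rough sketch that serves rows of text rather than files. So that it compiles on its own it does not inherit from dtSearch's DataSource. SketchSource is a stand-in that copies only the two methods and the DocName and DocText properties named in this article, and RowSource, SketchDriver and the in-memory rows are all hypothetical. The driver loop imitates the call order described above (a rewind, then GetNextDoc until it returns false), and the real engine may well do more than this.

```csharp
using System.Collections.Generic;

// Stand-in for the members the article names. This is NOT dtSearch's
// DataSource; a real class would inherit from that instead.
public abstract class SketchSource
{
    public string DocName { get; protected set; }
    public string DocText { get; protected set; }
    public abstract bool Rewind();
    public abstract bool GetNextDoc();
}

// Serves in-memory rows (name, text) as plain-text documents.
public class RowSource : SketchSource
{
    private readonly IList<KeyValuePair<string, string>> rows;
    private int current;

    public RowSource(IList<KeyValuePair<string, string>> rows)
    {
        this.rows = rows;
    }

    public override bool Rewind()
    {
        current = 0; // the next GetNextDoc starts again at the first row
        return true;
    }

    public override bool GetNextDoc()
    {
        if (current >= rows.Count) return false; // no more rows
        DocName = rows[current].Key;
        DocText = rows[current].Value;
        current++;
        return true;
    }
}

public static class SketchDriver
{
    // Imitates the caller: one Rewind, then GetNextDoc until it returns false.
    public static List<string> Collect(SketchSource source)
    {
        var seen = new List<string>();
        if (!source.Rewind()) return seen;
        while (source.GetNextDoc())
        {
            seen.Add(source.DocName + ": " + source.DocText);
        }
        return seen;
    }
}
```

In a real data source the rows would come from a query or a feed, and the class would set the inherited dtSearch properties instead of the stand-in ones.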

Other I Programmer articles on dtSearch

Getting started with dtSearch

Threading and dtSearch

Hit Highlighting with dtSearch

 




Last Updated ( Tuesday, 06 December 2011 )