SwishD Clustering for swish-e

The SwishD cluster system is an application that allows swish-e to
scale out to multiple machines, so the number of indexes (or
collections) can grow almost without limit. By scaling out to multiple
swishd nodes, each index can remain small as the number of documents
increases. This is typically measured in millions of documents/files.


Download the latest source via Git:

git clone git://solar1.net/swishd.git


Architecture Overview

A client makes a TCP connection to cluster_mgr (default port 5500) and
sends it a query in XML format. cluster_mgr in turn connects to each
swishd node listed in the configuration file (TCP port 5000) and submits
the search query to each node for the collection specified in the client
query. Each swishd node runs the search against the appropriate index
and returns its results to cluster_mgr. cluster_mgr then assembles the
results, sorts them by rank, and returns them to the client in XML
format.
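
The fan-out and merge step described above can be sketched roughly as follows. This is an illustration in Python, not the cluster_mgr source: the function name scatter_gather, the query_node callback (which stands in for the real TCP exchange with a swishd node), and the "rank" key on each hit are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Node = Tuple[str, int]       # (host, port); the text gives TCP 5000 for swishd
Result = Dict[str, object]   # one hit; the "rank" key is an assumed field name


def scatter_gather(
    nodes: List[Node],
    query_node: Callable[[Node, str], List[Result]],
    query_xml: str,
) -> List[Result]:
    """Send the same query to every node, then merge hits by descending rank."""
    if not nodes:
        return []
    # Query all nodes concurrently; pool.map preserves the order of `nodes`.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = list(pool.map(lambda n: query_node(n, query_xml), nodes))
    # Flatten the per-node lists into one list of hits.
    merged = [hit for hits in per_node for hit in hits]
    # sorted() is stable, so hits with equal rank keep their node order.
    return sorted(merged, key=lambda hit: hit["rank"], reverse=True)
```

Passing the node query as a callback keeps the merge logic separate from the network code, so it can be exercised with a stub before any swishd node is running.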
 



Search Query Format

The swishd nodes can house several indexes, which can be grouped into
"collections". For example, there can be one document collection for
sports and another for legal documents. You may want to search for the
phrase "Jason Giambi" and get news about his legal cases, but not
necessarily news about games he has played. To do this, you would
specify the collection for your legal documents in the search query.



The client sends the original query in XML format. An example of the format is as follows:



sports
legal
Jason Giambi

This would instruct the swishd nodes to search both the legal and the
sports collections for any documents containing the query phrase.
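
A client could assemble such a query along these lines. This is only a sketch: the element names used here ("search", "collection", "query") are hypothetical placeholders chosen for illustration, since the actual tag names expected by cluster_mgr are not shown above, and build_query is not part of swishd.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_query(collections: Iterable[str], phrase: str) -> bytes:
    """Serialize a search request for one or more collections."""
    # NOTE: "search", "collection" and "query" are assumed tag names,
    # not the documented swishd schema.
    root = ET.Element("search")
    for name in collections:
        ET.SubElement(root, "collection").text = name
    ET.SubElement(root, "query").text = phrase
    return ET.tostring(root, encoding="utf-8")


# Search both collections for the phrase from the example above.
payload = build_query(["sports", "legal"], "Jason Giambi")
```

Building the document with ElementTree rather than string concatenation means characters such as "&" or "<" in the search phrase are escaped automatically.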


Results Format

Cluster_mgr will return the final results to the client in XML format. An example of the format is as follows:


/documents/LEGAL/7b0000003fbcda.xml
1000
92003
(null)
/index/legal_1.idx
2004-05-06 00:42:01 EDT
1
59971

  • Path : The absolute path to the document.

  • Size : Size in bytes of the document.

  • Title : Title of the document (if applicable).

  • Index : The index that contained the information about the document.

  • Modified : The time stamp (mtime) of when the document was last modified.

  • Record : Not used.

  • File : Not used.
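
On the client side, the fields listed above could be read into a small record type, roughly as below. The element names ("result", "path", "size", "title", "index", "modified") are assumptions derived from the field list, not a confirmed schema, and treating "(null)" as a missing title is a guess based on the sample values.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    path: str             # absolute path to the document
    size: int             # size in bytes
    title: Optional[str]  # None when no title is available
    index: str            # index that contained the document
    modified: str         # mtime as sent by the server, left unparsed


def parse_results(xml_text: str) -> List[Hit]:
    """Turn a results document into a list of Hit records."""
    # NOTE: all element names below are assumed for illustration.
    root = ET.fromstring(xml_text)
    hits = []
    for node in root.iter("result"):
        title = node.findtext("title")
        hits.append(Hit(
            path=node.findtext("path", default=""),
            # A missing or empty size element falls back to 0.
            size=int(node.findtext("size") or 0),
            title=None if title in (None, "", "(null)") else title,
            index=node.findtext("index", default=""),
            modified=node.findtext("modified", default=""),
        ))
    return hits
```

The Record and File fields are left out because the list above marks them as unused.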
