
Corporate Search: Can We Just Get Google?


Article Written By: arch141

Google won the search engine battle of the 1990s mainly due to two factors. First, it uses the anchor text of web links to improve the descriptions of the documents the links point to, including documents that Google has not even visited, which significantly increases its index coverage. Second, and more importantly, Google uses a document "quality" component to compute the scores of search results. This means that Google orders the documents it finds not only by how well they match the query, but also by their quality.

The effect is that the user sees higher quality documents first. How important is this? It was not obvious in the 1990s that search engines should do this. They did not, and most of the pre-Google search engines are now extinct.

The method used by Google for quality estimation (the PageRank algorithm) is based on the graph of web links. Put simply, the more web links point to a web page, the higher the estimated quality (rank) of that page. As we see, the methods that make Google work so well depend on the availability and nature of web links. These methods don't work well for internal websites, because intranets lack the statistical information that PageRank and similar algorithms require to estimate document quality.
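To make the link-based idea concrete, here is a minimal sketch of a PageRank-style computation on a tiny, made-up link graph. It is an illustration of the general technique, not Google's implementation; the page names, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page rank from a dict mapping each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: three pages all link to "popular", which links back to each.
web = {
    "a": ["popular"],
    "b": ["popular"],
    "c": ["popular"],
    "popular": ["a", "b", "c"],
}
ranks = pagerank(web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph `best_page` is the page with the most incoming links. On an intranet where most pages have only one or two incoming links, the ranks come out nearly flat, which is the problem described above.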

While global web links can be used to estimate the popularity of a page that is viewed externally, intranet links usually reflect the site structure. For corporate sites, multiple links to the same page create redundancies that make the website harder to support, and web developers tend to avoid creating them when possible. As one example, the CSIRO ATNF website has a home page for a software package known as MIRIAD. This page has 26 external web links pointing to it (significantly more than most other ATNF pages), but only two internal links. Based on a count of intranet links, it would not be detected as an important page.

Augmenting document descriptions using anchor texts is also difficult when there are very few links. In the example above, Google has 26 text fragments to augment the description of the MIRIAD page. An intranet search engine has only two.
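As a rough illustration of anchor-text augmentation, the sketch below gathers the anchor texts of all links that point to a page and appends them to that page's own text before indexing. The function names, URLs and anchor texts are hypothetical and are not taken from Google, Nutch or Arch.

```python
from collections import defaultdict


def collect_anchor_texts(links):
    """Group anchor texts by target URL.

    links is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    anchors = defaultdict(list)
    for _source, target, text in links:
        text = text.strip()
        if text:
            anchors[target].append(text)
    return dict(anchors)


def augmented_description(page_text, anchor_texts):
    """Append incoming anchor texts to the page's own text for indexing."""
    return " ".join([page_text, *anchor_texts])


# Hypothetical crawl data: two links point at the same page.
crawled_links = [
    ("https://example.org/software/", "https://example.org/miriad/", "MIRIAD data reduction package"),
    ("https://example.org/docs/", "https://example.org/miriad/", "radio interferometry software"),
    ("https://example.org/docs/", "https://example.org/other/", "  "),
]
anchors = collect_anchor_texts(crawled_links)
description = augmented_description(
    "MIRIAD home page", anchors.get("https://example.org/miriad/", [])
)
```

With only two incoming links, the description gains only two short fragments, which shows how little an intranet engine has to work with compared to a global one.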

If the methods that Google uses on the global Web do not work well for intranets, is there anything that works better? Google Appliance (a corporate search solution) advertises a new feature, the Self-Learning Scorer: a method for improving relevance judgements by learning from user clicks on search results. Conceptually, this is not new. Arkadi Kosmynin suggested this approach in an IEEE Communications Magazine article back in 1997, for use in the global environment rather than on intranets. The problem is that it may take a long time for intranet search engines to "learn", since they receive far fewer hits than a global search engine. Fortunately, intranets have a much better source for estimating document quality: web server logs.

Normally, web servers maintain log files where they record details of every request that they process. These logs are instrumental for various tasks, such as analysing web site use patterns and problems. It is not hard to use these logs to count requests to each document for a certain time interval and estimate relative document quality, assigning higher quality scores to documents that are retrieved more often. This is simple, logical and reliable.
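The counting step described above can be sketched in a few lines. This is a minimal illustration, not Arch's actual module: it assumes the logs are in the Apache Common Log Format, counts only successful GET and HEAD requests, and uses logarithmic scaling to turn counts into scores between 0 and 1. The scaling choice and the sample log lines are assumptions made for the example; the caller is expected to pass in only the lines from the time interval of interest.

```python
import math
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line, e.g.
# 127.0.0.1 - - [10/Oct/2010:13:55:36 +1000] "GET /miriad/ HTTP/1.0" 200 2326
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) ')


def count_requests(log_lines):
    """Count successful requests per path, ignoring query strings."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group(2) == "200":
            path = match.group(1).split("?", 1)[0]
            counts[path] += 1
    return counts


def quality_scores(counts):
    """Scale request counts to (0, 1]; the most requested path scores 1.0."""
    if not counts:
        return {}
    top = math.log1p(max(counts.values()))
    return {path: math.log1p(n) / top for path, n in counts.items()}


# Made-up sample log lines for illustration only.
sample_log = [
    '10.0.0.1 - - [06/Oct/2010:09:00:01 +1000] "GET /miriad/ HTTP/1.1" 200 5120',
    '10.0.0.2 - - [06/Oct/2010:09:00:05 +1000] "GET /miriad/?ref=home HTTP/1.1" 200 5120',
    '10.0.0.3 - - [06/Oct/2010:09:01:12 +1000] "GET /miriad/ HTTP/1.1" 200 5120',
    '10.0.0.1 - - [06/Oct/2010:09:02:30 +1000] "GET /other.html HTTP/1.1" 200 1024',
    '10.0.0.4 - - [06/Oct/2010:09:03:00 +1000] "GET /missing.html HTTP/1.1" 404 209',
]
counts = count_requests(sample_log)
scores = quality_scores(counts)
```

A real deployment would probably also filter out requests from crawlers and monitoring tools, which would otherwise inflate the counts.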

This method is used in Arch, an open source search engine based on Apache Nutch. Arch extends Nutch by adding features critical for corporate environments, such as authentication and document level security. Most importantly, Arch replaces the Nutch document quality computation module with its own, which uses web logs to estimate relative document quality.

Results? We asked a few people who know our site content well (approx. 70,000 pages) to evaluate the performance of Arch versus Google (the global engine, not the Google Appliance) in a series of blind tests on the public part of the site, and versus Nutch on the whole site. It was the first time that we had seen people happy with an intranet search engine. One of the testers said,

"I've found so much good stuff that I did not even know existed. I was tempted to spend hours reading it instead of continuing testing."

This effect is easy to explain. Arch gives a higher rank to documents that people use more often, which helps searchers discover popular resources transparently. This also makes Arch a very effective tool for people new to the site.

We measured the "precision at top ten" documents: the number of correct hits returned by each engine in the top ten documents. Users had to mark correct hits, but did not know which engine returned which hits. Arch performed on average as well as Google and about 30-40% better than Nutch in our tests. The tuning and evaluation software that was used to do the tests is released with Arch. We encourage others to use the test software to evaluate search engines against each other and publish results. Hopefully, if enough tests are done and results published, this will make the intranet search engine market more transparent.
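For readers who want to reproduce this kind of measurement, the sketch below shows one way to compute precision at top ten from a ranked result list and a set of hits marked correct by a tester. It is not the evaluation software released with Arch; the function name and the document identifiers are made up, and the result is expressed as a fraction of ten rather than a raw count.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top k results that the tester marked as correct."""
    top = ranked_results[:k]
    hits = sum(1 for doc in top if doc in relevant)
    return hits / k


# Hypothetical blind test: two engines answer the same query and the tester
# marks correct documents without knowing which engine returned which.
marked_correct = {"d1", "d3", "d5", "d7"}
engine_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
engine_b = ["d2", "d4", "d1", "d6", "d8", "d10", "d12", "d14", "d16", "d18"]

score_a = precision_at_k(engine_a, marked_correct)  # 4 correct hits out of 10
score_b = precision_at_k(engine_b, marked_correct)  # 1 correct hit out of 10
```

Averaging these scores over many queries and testers gives a per-engine figure that can be compared in the way described above.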

What's next? Arch is available to the public free of charge and the source code is included. If you are happy with your intranet search engine, please let us know what you are using. We would love to know. If you are not happy with your current search engine, ask your IT guys to try Arch. It will make a difference!


About the Author

Arkadi Kosmynin and Jessica Chapman work at CSIRO Astronomy and Space Science. Arkadi develops web-based applications. Jessica is an astronomer and a senior manager. They both contributed to Arch.



This Article Has Been Published on Wed, 6 Oct 2010 and Read 128 Times

