You're a good looking blog, what's your owner's name?: Distributed Search Engines

The recent controversy about Google restricting the search results shown to people in China has highlighted that search engines are very influential, and subject to political and commercial pressures. A centralised search engine also means that everyone has to go through a single point, which is vulnerable to attack or system failure.

This blog entry suggests how a search engine could be distributed across many machines. Each user's machine would index a small part of the web, and the results would be shared. In some ways this is similar to recent trends in downloading music. People wishing to download music may previously have used a service such as Napster, which centrally maintained a list of songs. This was closed down due to its widespread use for infringing copyright, and was replaced by "peer to peer" systems, which are proving much harder to close down, both legally and technically.

None of the technology involved is likely to be difficult to implement. There are three components: an agent that creates a local search index, a service that responds to requests for searches, and a search results viewer that searches the Internet and displays the results.

Creating a local search index

You could run an agent on your machine. This will examine the pages you visit and index them. It will follow the links on these pages and index the linked pages as well. The depth of spidering will be recorded, so pages that you visit are considered most important, and pages several clicks away less so.
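As a rough sketch of how the agent might record spidering depth, the Python below walks outwards from the pages you visited. The `links` dictionary stands in for fetching and parsing real pages, and the names `crawl_depths`, `importance` and `max_depth`, as well as the 1/(1+depth) weighting, are assumptions made for illustration only.

```python
from collections import deque


def crawl_depths(visited_pages, links, max_depth=2):
    """Breadth-first walk starting from the pages the user visited.

    visited_pages: URLs the user actually opened (depth 0).
    links: dict mapping a URL to the URLs it links to. This replaces
           real page fetching and is an assumption of the sketch.
    Returns {url: depth}, where depth is the fewest clicks needed to
    reach the page from any visited page.
    """
    depths = {url: 0 for url in visited_pages}
    queue = deque(visited_pages)
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # do not spider any further from this page
        for target in links.get(url, []):
            if target not in depths:
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths


def importance(depth):
    # Hypothetical weighting: visited pages score 1.0, one click away 0.5.
    return 1.0 / (1 + depth)
```

Calling `crawl_depths(["a"], {"a": ["b"], "b": ["c"], "c": ["d"]})` would index "a", "b" and "c" but stop before "d", because "d" is three clicks away.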

The number of pages that will be indexed will depend on available bandwidth, processor resources and disk space. When disk space runs out, older pages are removed. Indexing pages should be a background task that does not perceptibly affect computer performance.
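One way the "remove older pages when space runs out" rule might look is sketched below. The `PageStore` class and its `max_bytes` budget are hypothetical names, and the store keeps pages in memory rather than on disk to keep the example short.

```python
from collections import OrderedDict


class PageStore:
    """Keeps indexed page text within a size budget, oldest pages first out."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.pages = OrderedDict()  # url -> text, oldest entry first

    def add(self, url, text):
        # Re-indexing a page replaces the old copy and makes it the newest.
        if url in self.pages:
            self.used -= len(self.pages.pop(url).encode("utf-8"))
        self.pages[url] = text
        self.used += len(text.encode("utf-8"))
        # Evict the oldest pages until the store fits the budget again.
        # A single page larger than the budget is evicted as well.
        while self.used > self.max_bytes and self.pages:
            _, old_text = self.pages.popitem(last=False)
            self.used -= len(old_text.encode("utf-8"))
```

With a budget of 10 bytes, adding three 5-byte pages in turn would leave only the two most recent ones in the store.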

Responding to requests for searches

You will need to run a service on your machine which will listen on a port. Someone can connect to the service and specify a search term. If they do, the service returns:

a) Summaries of the top 10 pages in your index that match the search term.
b) A list of your search partners.

When someone connects to your machine, their IP address will be logged and recorded in your “search partners” list. This will mean you can search their machine in future.
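The request handling described above could be sketched as a plain function, leaving out the network listener itself. Everything named here (`handle_search`, the `pages` dictionary, counting occurrences of the term as the score, and an 80-character summary) is an assumption for illustration, not a fixed design.

```python
def handle_search(pages, partners, requester_ip, term, limit=10):
    """Answer one search request.

    pages: {url: text} for the local index (hypothetical structure).
    partners: set of IP address strings; the requester is added to it.
    Returns the top matching pages and the partner list as it stood
    before this request.
    """
    term = term.lower()
    scored = []
    for url, text in pages.items():
        # Assumed scoring: how many times the term appears in the page.
        count = text.lower().count(term) if term else 0
        if count:
            scored.append((count, url, text[:80]))
    # Highest score first; ties are broken by URL so the order is stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    response = {
        "results": [
            {"url": url, "summary": summary, "score": count}
            for count, url, summary in scored[:limit]
        ],
        "partners": sorted(partners),
    }
    # Remember the requester so their machine can be searched in future.
    partners.add(requester_ip)
    return response
```

A real service would wrap this in something that accepts connections on a port and serialises the response, for example as JSON.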

Search Results Viewer

When you want to search, you will run a desktop application, which displays a list of search results.

The application will first look on your own machine for pages matching your search criteria. It will then ask the machines in your list of “search partners” whether they have any pages matching your criteria. As pages are returned, they are added to the list displayed in the application, and the list is reordered to place the best matches (those that most accurately match your search words and those that are returned by lots of people) first.
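The reordering step could be sketched as follows. This assumes each machine's reply is a list of dictionaries with `url` and `score` keys, and it ranks by the number of machines returning a page before the match score. That ordering is only one possible choice, and `merge_results` is a hypothetical name.

```python
def merge_results(responses):
    """Combine result lists from several machines into one ranked URL list.

    responses: one list per machine, each item a dict with "url" and
               "score" keys (an assumed format).
    """
    best_score = {}  # url -> highest score any machine reported
    votes = {}       # url -> number of machines that returned it
    for results in responses:
        seen = set()
        for result in results:
            url = result["url"]
            best_score[url] = max(best_score.get(url, 0), result["score"])
            if url not in seen:  # count each machine once per URL
                votes[url] = votes.get(url, 0) + 1
                seen.add(url)
    # Most widely returned first, then best match, then URL for stability.
    return sorted(best_score, key=lambda u: (-votes[u], -best_score[u], u))
```

Because the function works on whatever replies have arrived so far, the viewer could call it again each time another machine answers.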

It then recursively asks the machines in your search partners' own lists of search partners, until it deems that enough results have been returned or it has been trying for long enough. If a partner of a search partner returns a result, that machine is added to your list of search partners.
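A minimal sketch of this widening search is shown below. It uses a queue instead of literal recursion, and the `query` callable stands in for the real network call; its signature, along with `max_results` and `timeout_s`, is assumed for the example.

```python
import time
from collections import deque


def distributed_search(term, partners, query, max_results=50, timeout_s=5.0):
    """Ask partners, then partners of partners, until enough results arrive.

    partners: set of machine addresses; machines that return results
              are added to it.
    query: callable (address, term) -> (results, their_partners).
           This is a placeholder for the real network request.
    """
    deadline = time.monotonic() + timeout_s
    queue = deque(partners)
    asked = set()
    results = []
    while queue and len(results) < max_results and time.monotonic() < deadline:
        address = queue.popleft()
        if address in asked:
            continue  # never ask the same machine twice
        asked.add(address)
        try:
            found, their_partners = query(address, term)
        except OSError:
            continue  # machine unreachable, move on to the next one
        results.extend(found)
        if found:
            partners.add(address)  # useful machine, remember it
        queue.extend(p for p in their_partners if p not in asked)
    return results
```

The two stopping conditions in the loop correspond to "enough results have been returned" and "trying for long enough".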

Conclusion

Because this approach indexes pages that people actually visit, search results are more likely to be relevant. It will be more difficult for political or commercial pressures to influence search results, which is especially important as the Internet becomes more widely available in countries with less than perfect records on free speech. The positions of sites will differ between searches, depending on which search partners could be contacted. This will mean people see a wider range of web sites, making it easier for smaller sites to get noticed.

Related Links

Hundred Dollar Laptop


Sunday, January 29
