Search engines are programs that search documents for specified keywords and return a list of the documents where the keywords were found. A search engine is really a general class of programs; however, the term is often used to specifically describe systems like Google, Bing and Yahoo! Search that enable users to search for documents on the World Wide Web.
Web Search Engines
Typically, Web search engines work by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query.
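As a rough illustration of what an indexer does, the sketch below builds a simple inverted index, a mapping from each word to the documents that contain it. This is a toy model, not any engine's actual code; real indexers also handle stemming, stop words, and scoring.

```python
# Minimal inverted-index sketch: maps each word to the set of
# document IDs in which it appears.
from collections import defaultdict

def build_index(documents):
    """documents: dict of doc_id -> text. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "page1": "travel guides for budget travel",
    "page2": "budget cooking at home",
}
index = build_index(docs)
# A query for "budget" is answered from the index,
# not by rescanning every page.
print(sorted(index["budget"]))  # ['page1', 'page2']
```

Because lookups hit this precomputed structure rather than the documents themselves, queries stay fast even as the collection grows.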
HOW DO WEB SEARCH ENGINES WORK?
Search engines are the key to finding specific information on the vast expanse of the World Wide Web. Without sophisticated search engines, it would be virtually impossible to locate anything on the Web without knowing a specific URL. But do you know how search engines work? And do you know what makes some search engines more effective than others?
When people use the term search engine in relation to the Web, they are usually referring to the actual search forms that search through databases of HTML documents, initially gathered by a robot.
There are basically three types of search engines: those that are powered by robots (called crawlers, ants or spiders), those that are powered by human submissions, and those that are a hybrid of the two.
Crawler-based search engines are those that use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site’s meta tags and also follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information to a central depository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.
Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.
In both cases, when you query a search engine to locate information, you’re actually searching through the index that the search engine has created — you are not actually searching the Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why sometimes a search on a commercial search engine, such as Yahoo! or Google, will return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn’t been updated since a Web page became invalid, the search engine treats the page as an active link even though it no longer is. It will remain that way until the index is updated.
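The dead-link behavior described above can be sketched in a few lines. The URLs and data here are invented for illustration: results come from the stored index snapshot, so a deleted page keeps appearing until a re-crawl updates the index.

```python
# Sketch: a query consults the stored index, not the live Web, so a
# page deleted since the last crawl can still appear in results.
index = {
    "travel": ["http://example.com/alive", "http://example.com/deleted"],
}
live_pages = {"http://example.com/alive"}  # pages that still exist today

def search(word):
    # Results are read from the index snapshot; no live fetch happens.
    return index.get(word, [])

results = search("travel")
dead = [url for url in results if url not in live_pages]
print(dead)  # ['http://example.com/deleted']
```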
So why will the same search on different search engines produce different results? Part of the answer to that question is because not all indices are going to be exactly the same. It depends on what the spiders find or what the humans submitted. But more important, not every search engine uses the same algorithm to search through the indices. The algorithm is what the search engines use to determine the relevance of the information in the index to what the user is searching for.
One of the elements that a search engine algorithm scans for is the frequency and location of keywords on a Web page. Those with higher frequency are typically considered more relevant. But search engine technology is becoming sophisticated in its attempt to discourage what is known as keyword stuffing, or spamdexing.
Another common element that algorithms analyze is the way that pages link to other pages in the Web. By analyzing how pages link to each other, an engine can both determine what a page is about (if the keywords of the linked pages are similar to the keywords on the original page) and whether that page is considered “important” and deserving of a boost in ranking. Just as the technology is becoming increasingly sophisticated to ignore keyword stuffing, it is also becoming more savvy to Web masters who build artificial links into their sites in order to build an artificial ranking.
Did you know?
The first tool for searching the Internet, created in 1990, was called “Archie”. It downloaded directory listings of all files located on public anonymous FTP servers, creating a searchable database of filenames. A year later “Gopher” was created. It indexed plain text documents. “Veronica” and “Jughead” came along to search Gopher’s index systems. The first actual Web search engine was developed by Matthew Gray in 1993 and was called “Wandex”.
KEY TERMS TO UNDERSTANDING WEB SEARCH ENGINES
Spider trap: A condition of dynamic Web sites in which a search engine’s spider becomes trapped in an endless loop of code
Search engine: A program that searches documents for specified keywords and returns a list of the documents where the keywords were found
Meta tag: A special HTML tag that provides information about a Web page
Deep link: A hyperlink either on a Web page or in the results of a search engine query to a page on a Web site other than the site’s home page
Robot: A program that runs automatically without human intervention
WEB SEARCH ENGINES & DIRECTORIES
According to a 2007 report by Netcraft, 108,810,358 distinct Web sites make up the World Wide Web. When you want to find out more about a specific topic, service or product, you use an Internet search engine. Today there are a number of search engines, and while they work differently, they all use Web crawlers (also called bots) that are designed to index pages on the Web and the words found on those pages. This indexing of the Web is what enables users to search for keywords or combinations of words to find information online.
Other types of search engines are called search directories. These sites index content chosen by human editors, rather than relying on automated indexing done by bots. Today most search engines offer complementary search-related products such as shopping search, news and other services that go beyond the basic keyword search function.
The following Quick Reference provides an overview of some of the more popular public Web Search Engines and Directories, including details on their history, information on how they work and tips for using each.
Bing is a new search engine from Microsoft that was launched on May 28, 2009. Microsoft calls it a “Decision Engine,” because it’s designed to return search results in a format that organizes answers to address your needs. When you search on Bing, in addition to providing relevant search results, the search engine also shows a list of related searches on the left-hand side of the search engine results page (SERP). You can also access a quick link to see recent search history. Bing uses technology from a company called Powerset, which Microsoft acquired.
Bing launched with several features that are unique in the search market. For example, when you mouse-over a Bing result a small pop-up provides additional information for that result, including a contact e-mail address if available. The main search box features suggestions as you type, and Bing’s travel search is touted as being the best on the net. Bing is expected to replace Microsoft Live Search.
BING SEARCH TIPS:
- You can search for feeds using feeds: before the query
- To search Bing without a background image use http://www.bing.com/?rb=0
- To turn the background image back on, use http://www.bing.com/?rb=1
- To change the number of search results returned per page, click “Extras” (on top-right of page) and select “Preferences”. Under Web Settings / Results you can choose 10, 15, 30 or 50 results
Today Google is the largest public Internet search engine, in terms of indexed content and number of users. Company founders Larry Page and Sergey Brin initially collaborated on a search engine called BackRub that had the capability to analyze the back links pointing to a given Web site. With financial backing, the founders launched Google in 1998. By 2000 Google was handling more than 100 million search queries a day, and by 2004 Google claimed its index had reached 4.28 billion Web pages.
Google’s search engine crawler, called the Googlebot, travels from Web page to Web page following hyperlinks. When a new page is found, Googlebot will also crawl all the hyperlinks on that page. A second bot crawls indexed pages to keep the index updated. As pages are indexed, they are also given scores based on criteria like how often words are displayed (density), link popularity, HTML code, themes, content (the text) and more. These scores are what determine where the Web page listing appears in the search results.
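The link-following behavior described above amounts to a graph traversal. The sketch below models it as a breadth-first crawl over an in-memory link graph; a real crawler fetches pages over HTTP and extracts links from the HTML, while `link_graph` here is a stand-in for those extracted hyperlinks.

```python
# Illustrative breadth-first crawl over an in-memory link graph.
from collections import deque

link_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}

def crawl(start):
    seen, queue = {start}, deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)            # "index" the fetched page
        for link in link_graph.get(page, []):
            if link not in seen:      # visit each discovered page only once
                seen.add(link)
                queue.append(link)
    return order

print(crawl("A"))  # ['A', 'B', 'C', 'D']
```

The `seen` set is what keeps the crawler from looping forever on cyclic links (note that C links back to A), which is exactly the failure mode the “spider trap” key term describes.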
Google Search Tips:
- You can search for a phrase by using quotations [“like this”] or with a hyphen between words [like-this].
- You can search by a date range by using two dots between the years [2004..2007].
- When searching with a question mark [?] at the end of your phrase, you will see sponsored Google Answer links, as well as definitions if available.
- Google searches are not case sensitive.
- By default Google will return results which include all of your search terms.
- Google automatically searches for variations of your term, with variants of the term shown in yellow highlight.
- Google lets you enter up to 32 words per search query.
Founded in 1994 by David Filo and Jerry Yang, Yahoo was initially a directory of Web sites categorized by human editors, a directory for which it is still well known today. Over time Yahoo began acquiring search companies and combining their technologies. In 2004 Yahoo acquired the Overture pay-per-click service (which had bought AltaVista and AlltheWeb), as well as the Inktomi search database and others. These combined search technologies and tools make Yahoo what it is today, with Yahoo’s famous directory now secondary to its main search engine.
The Yahoo Search index is made up of billions of Web pages, which are populated by a Web crawler. When Yahoo crawls pages it takes several factors into consideration: the search terms included in the page’s Title and Description tags, page content (the text), keyword density, inbound hyperlinks and so on. Yahoo has also started using page rank technologies and takes Yahoo Directory listings and paid inclusions into consideration when indexing and ranking pages. Users can submit their own pages directly to the Yahoo Search index and the Yahoo Directory, and also submit products for inclusion in Yahoo Shopping. Yahoo has also incorporated special search services such as Webmail, Local, video, images, shopping and news search products.
Yahoo Search Tips:
- By default Yahoo returns results that include all of your search terms
- To exclude words, use a minus sign: [cat -tabby] shows all results about cats with no mention of tabby.
- Yahoo search results also shows related searches, which are based on other searches by users with similar terms
- To search for a map, use map [location]
- To search for dictionary definitions use “define” [define hard drive]
- To search a single domain use site: [site:obasimvilla.com/forum DVD] would search that site for the term DVD.
Windows Live Search
Microsoft’s search engine, Windows Live Search, offers a huge improvement over MSN Search and is also integrated into Microsoft’s Live.com. When it launched on September 12, 2006, it was a new search engine built from scratch, using a new algorithmic engine integrated throughout Windows Live and MSN. Windows Live Search offers a feature-rich interface, something new in the Web search space at the time. By signing in to a personalized Live Search you can add feeds and subscribe to search results. Windows Live Search also incorporates specific searches for images, news, academic journals, RSS feeds, maps and more.
Live Search technologies attempt to overcome some elements of human error, such as spelling errors, punctuation and synonyms, and also predict user intent, with the aim of providing the best search results possible. To improve ranking in Microsoft Live Search, you can mark your front page as accessible to those with specialist settings in their browser, pay attention to keyword density, include an HTML site map and use a distinct list of keyword meta tags for each page on your site.
Windows Live Search Tips:
- Common words such as “a,” “and,” and “the” are ignored unless they’re enclosed in quotation marks.
- Category lists may appear on the top of search results – you can click a category to see only the results associated with that category.
- If using a date in your search query, type the name of the month instead of the calendar number.
- To define something, use define followed by the word [define DVD] will show definitions for DVD.
- You can enter up to 150 characters in the search box.
HOW SEARCH ENGINES RANK WEB PAGES
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and present you with ones that match your topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines don’t always get it right. Non-relevant pages make it through, and sometimes it may take a little more digging to find what you are looking for. But, by and large, search engines do an amazing job.
As WebCrawler founder Brian Pinkerton puts it, “Imagine walking up to a librarian and saying, ‘travel.’ They’re going to look at you with a blank face.”
OK — a librarian’s not really going to stare at you with a vacant expression. Instead, they’re going to ask you questions to better understand what you are looking for.
Unfortunately, search engines don’t have the ability to ask a few questions to focus your search, as a librarian can. They also can’t rely on judgment and past experience to rank web pages, in the way humans can.
So, how do crawler-based search engines go about determining relevancy, when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm. Exactly how a particular search engine’s algorithm works is a closely-kept trade secret. However, all major search engines follow the general rules below.
LOCATION, LOCATION, LOCATION…AND FREQUENCY
One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.
Remember the librarian mentioned above? They need to find books to match your request of “travel,” so it makes sense that they first look at books with travel in the title. Search engines operate the same way. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic.
Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Those with a higher frequency are often deemed more relevant than other web pages.
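The location/frequency method described over the last few paragraphs can be sketched as a toy scoring function. The weights below (a title bonus and an early-position bonus) are arbitrary illustrations, not any engine's real parameters, and the page data is invented.

```python
# Toy location/frequency scorer: term frequency in the body, plus
# bonuses when the keyword appears in the title or near the top.
def score(keyword, title, body):
    kw = keyword.lower()
    words = body.lower().split()
    s = words.count(kw)                 # frequency: how often the term appears
    if kw in title.lower().split():
        s += 5                          # location: title matches weigh heavily
    if kw in words[:10]:
        s += 2                          # location: early mention helps too
    return s

page_a = ("Travel Guide", "travel tips and travel deals for europe")
page_b = ("Cooking 101", "recipes with a brief travel anecdote at the end "
          "plus many more words that pad this page out much further")
scores = {name: score("travel", title, body)
          for name, (title, body) in {"a": page_a, "b": page_b}.items()}
print(scores)
```

Page A, which mentions “travel” in its title, early, and often, outscores page B, which mentions it only once in passing, which is the ranking behavior the text describes.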
Spice in the Recipe
Now it’s time to qualify the location/frequency method described above. All the major search engines follow it to some degree, in the same way cooks may follow a standard chili recipe. But cooks like to add their own secret ingredients. In the same way, search engines add spice to the location/frequency method. Nobody does it exactly the same, which is one reason why the same search on different search engines produces different results.
To begin with, some search engines index more web pages than others. Some search engines also index web pages more often than others. The result is that no search engine has the exact same collection of web pages to search through. That naturally produces differences, when comparing their results.
Search engines may also penalize pages or exclude them from the index, if they detect search engine “spamming.” An example is when a word is repeated hundreds of times on a page, to increase the frequency and propel the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including following up on complaints from their users.
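One simple spam check consistent with the example above is to flag pages where a single word makes up an implausibly large share of the text. The 20% threshold here is an arbitrary illustration, not a real engine's cutoff.

```python
# Sketch of a keyword-stuffing check: flag a page when one word
# accounts for more than `threshold` of all words on the page.
def looks_stuffed(text, threshold=0.2):
    words = text.lower().split()
    if not words:
        return False
    most_common = max(set(words), key=words.count)
    return words.count(most_common) / len(words) > threshold

normal = "a short page about travel destinations in western europe"
spammy = "cheap cheap cheap cheap cheap flights cheap cheap hotel cheap"
print(looks_stuffed(normal), looks_stuffed(spammy))  # False True
```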
OFF THE PAGE FACTORS
Crawler-based search engines have plenty of experience now with webmasters who constantly rewrite their web pages in an attempt to gain better rankings. Some sophisticated webmasters may even go to great lengths to “reverse engineer” the location/frequency systems used by a particular search engine. Because of this, all major search engines now also make use of “off the page” ranking criteria.
Off the page factors are those that a webmaster cannot easily influence. Chief among these is link analysis. By analyzing how pages link to each other, a search engine can both determine what a page is about and whether that page is deemed to be “important” and thus deserving of a ranking boost. In addition, sophisticated techniques are used to screen out attempts by webmasters to build “artificial” links designed to boost their rankings.
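A well-known published form of link analysis is PageRank-style power iteration, sketched below on a tiny invented link graph. The damping factor and iteration count are conventional textbook choices, not any engine's actual parameters, and real systems layer many refinements on top of this.

```python
# Simplified PageRank-style link analysis via power iteration.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Each page passes its rank evenly along its outbound links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new
    return rank

links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
# C has two inbound links (from B and D), so it ends up ranked highest.
best = max(ranks, key=ranks.get)
print(best)  # 'C'
```

The intuition matches the text: a page linked to by more (and more important) pages accumulates more rank, and that score is something a webmaster cannot change just by editing their own page.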
Another off the page factor is click through measurement. In short, this means that a search engine may watch what results someone selects for a particular search, and then eventually drop high-ranking pages that aren’t attracting clicks, while promoting lower-ranking pages that do pull in visitors. As with link analysis, systems are used to compensate for artificial links generated by eager webmasters.
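Click-through measurement can be sketched as blending a page's original relevance score with its observed click-through rate. The 50/50 weighting and the numbers below are invented for illustration; real systems are far more careful about position bias and click spam.

```python
# Sketch of click-through re-ranking: demote results that are shown
# often but rarely clicked, promote ones users actually select.
def rerank(results, impressions, clicks, weight=0.5):
    def ctr(url):
        shown = impressions.get(url, 0)
        return clicks.get(url, 0) / shown if shown else 0.0
    # Blend the original relevance score with the observed click rate.
    return sorted(results,
                  key=lambda r: (1 - weight) * r[1] + weight * ctr(r[0]),
                  reverse=True)

results = [("pageA", 0.9), ("pageB", 0.8)]   # (url, relevance score)
impressions = {"pageA": 1000, "pageB": 1000}
clicks = {"pageA": 10, "pageB": 600}         # users clearly prefer pageB
print([url for url, _ in rerank(results, impressions, clicks)])
# ['pageB', 'pageA']
```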
The Search Engine Features Chart has a section that summarizes key areas of how crawler-based search engines rank web pages. The Search Engine Placement Tips page also summarizes key tips that will help you improve the relevancy of your pages with crawler-based search engines.
Search Engine Watch members have access to the How Search Engines Work section. This section provides detailed information about how each major search engine gathers its listings, along with additional tips on enhancing your position in their results. Learn more about becoming a Search Engine Watch member and the many benefits members receive by visiting the Membership page.