Posts Tagged ‘Search’

Search Engine Optimization-Searching by Means of Search Engines

Tuesday, July 15th, 2008

This is where things start to get complicated.
Search engines are trickier than they look! You’ll discover this the first time you enter a query on C++, the programming language. At least one of the Web search engines will essentially say, “Huh?”

C++ is not a word. It’s a letter followed by two characters that might, depending on the index, be regarded merely as punctuation. Many text search engines have trouble handling input of this type. Many don’t deal too well with numbers, either. So much for “007,” “R2D2,” or “Catch-22.”

Important Note: This problem is no longer as bad as it used to be. I’m now finding relevant hits for C++ on a majority of search engine sites.

Here’s another example of a text string search engines hate: To be or not to be. Just about anyone who finished junior high school will be able to tell you where the phrase comes from and (possibly!) what it means. But some search engines choke because all the words in the phrase are stop words–i.e., unimportant words too short and too common to be considered relevant strings on which to search. However, if you enclose the query in quotation marks, forcing the search engine to find the words “to be or not to be” in that precise order, most search engines can recognize the phrase as a famous quotation from Hamlet.
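To make the stop-word problem concrete, here is a minimal sketch in Python. The stop-word list and both function names are hypothetical, made up for illustration; real engines use their own lists and far more elaborate matching.

```python
# Hypothetical stop-word list; real engines use their own, much longer lists.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "an", "of", "and", "in"}

def keyword_terms(query):
    """Return the query words that survive stop-word removal."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def phrase_match(phrase, document):
    """True if the words of `phrase` appear in `document` in that exact order."""
    words = phrase.lower().split()
    doc = document.lower().split()
    n = len(words)
    return any(doc[i:i + n] == words for i in range(len(doc) - n + 1))

# Every word is a stop word, so a plain keyword search has nothing to look for.
print(keyword_terms("To be or not to be"))

# Treated as a quoted phrase, the same words match in order.
print(phrase_match("to be or not to be", "to be or not to be that is the question"))
```

The first call leaves no searchable terms at all, which is roughly why an unquoted query can come back empty, while the phrase match keeps every word and checks only their order.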

Let’s take a less obvious example. Suppose you’re a fan of murder mysteries and you want to search the Web for the home pages of all your favorite authors in that genre. If you simply enter the words “mystery” and “writer,” most search engines will return hyperlinks to all Web documents that contain the word “mystery” or the word “writer.” This will probably include hundreds–or even thousands–of URLs, most of which will have no relevance to your search. If you enter the words as a phrase, however, you stand a better chance of getting some good hits.

However, as search technology advances, this is not as much of a problem as it was a couple of years ago. Many search engines will now automatically apply the “adjacency” operator when responding to a two-word query. This means that they will indeed look for documents in which your two words appear next to each other.
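The difference between matching either word and matching the two words side by side can be sketched like this. The three sample documents and the function names are invented for the example; this is not how any particular engine implements adjacency.

```python
def tokenize(text):
    """Lowercase the text and split it into words."""
    return text.lower().split()

def matches_any(query, document):
    """True if the document contains at least one of the query words."""
    doc_words = set(tokenize(document))
    return any(word in doc_words for word in tokenize(query))

def matches_adjacent(query, document):
    """True if the query words appear next to each other, in order."""
    q = tokenize(query)
    d = tokenize(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# Hypothetical documents standing in for Web pages.
DOCS = [
    "my favorite mystery writer has a new home page",
    "the mystery of the missing socks",
    "advice for the technical writer",
]

any_hits = [d for d in DOCS if matches_any("mystery writer", d)]
adjacent_hits = [d for d in DOCS if matches_adjacent("mystery writer", d)]
print(len(any_hits), len(adjacent_hits))  # 3 1
```

All three pages match under the either-word rule, but only the first survives once the words must be adjacent.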

If you understand how search engines organize information and run queries, you can maximize your chances of getting hits on URLs that matter.

Search Engine Optimization-Keyword Search

Tuesday, July 15th, 2008

Most search engines handle words and simple phrases.  In its simplest form, text search looks for pages with lots of occurrences of each of the words in a query, stopwords aside.  The more common a word is on a page, compared with its frequency in the overall language, the more likely that page will appear among the search results.  Hitting all the words in a query is a lot better than missing some.
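As a rough sketch of that idea, the score below compares how often a word appears on a page with how often it appears in the language overall, and scales the result by the share of query words the page actually contains. The background frequencies, the default value, and the scoring formula are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical background frequencies (share of all words in the language).
BACKGROUND_FREQ = {"the": 0.05, "mystery": 0.0001, "writer": 0.0002}
DEFAULT_FREQ = 0.0001  # assumed frequency for words missing from the table
STOP_WORDS = {"the", "a", "of", "and"}

def score(query, page_text):
    """Score a page for a query: rarer-than-usual words count for more."""
    words = page_text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return 0.0
    result = 0.0
    matched = 0
    for term in terms:
        page_freq = counts[term] / total
        if page_freq > 0:
            matched += 1
            result += page_freq / BACKGROUND_FREQ.get(term, DEFAULT_FREQ)
    # Hitting all the query words beats missing some.
    return result * matched / len(terms)

page_a = "mystery writer writes a mystery"
page_b = "a writer of the technical manuals"
print(score("mystery writer", page_a) > score("mystery writer", page_b))  # True
```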

Search engines also make some efforts to “understand” what is meant by the query words. For example, most search engines now offer optional spelling correction. And increasingly they search not just on the words and phrases actually entered, but they also use stemming to search for alternate forms of the words (e.g., speak, speaker, speaking, spoke). Teoma-based engines are also offering refinement by category, à la the now-defunct Northern Light. However, Excite-like concept search has otherwise not made a comeback yet, since the concept categories are too unstable.

When ranking results, search engines give special weight to keywords that appear:

* High up on the page
* In headings
* In BOLDFACE (at least in Inktomi)
* In the URL
* In the title (important)
* In the description
* In the ALT tags for graphics.
* In the generic keywords metatags (only for Inktomi, and only a little bit even for them)
* In the link text for inbound links.

More weight is put on the factors that the site owner would find it awkward to fake, such as inbound link text, page title (which shows up on the SERP — Search Engine Results Page), and description.
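One way to picture location-based weighting is a per-field multiplier. The weights below are invented purely for illustration (the real values are not public); they only mirror the idea that hard-to-fake fields such as inbound link text, title, and description count for more. The field names and the sample page are hypothetical as well.

```python
import re

# Hypothetical weights, chosen only for illustration; real engines keep theirs secret.
FIELD_WEIGHTS = {
    "inbound_link_text": 5.0,
    "title": 4.0,
    "description": 3.0,
    "url": 2.0,
    "headings": 2.0,
    "alt_text": 1.0,
    "body": 1.0,
}

def location_score(keyword, page):
    """Sum keyword occurrences per field, multiplied by that field's weight.

    `page` is a dict mapping a field name to the text found in that field.
    """
    keyword = keyword.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = re.findall(r"[a-z0-9]+", page.get(field, "").lower())
        total += weight * tokens.count(keyword)
    return total

page = {
    "title": "Mystery Writers Online",
    "url": "www.example.com/mystery",
    "body": "A guide to every mystery series",
}
print(location_score("mystery", page))  # 7.0
```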

Search engine optimization-Page Rank

Tuesday, July 15th, 2008

Search engine ranking algorithms are closely guarded secrets, for at least two reasons: search engine companies want to protect their methods from their competitors, and they also want to make it difficult for web site owners to manipulate their rankings.

That said, a specific page’s relevance ranking for a specific query currently depends on three factors:

* Its relevance to the words and concepts in the query
* Its overall link popularity
* Whether or not it is being penalized for excessive search engine optimization (SEO).

Examples of SEO abuse would be a lot of sites linked to each other in a circular scam, or excessive and highly ungrammatical stuffing with keywords.

The second factor, link popularity, was pioneered by Google with PageRank. Essentially, the more incoming links your page has, the better. But it is more complicated than that; indeed, PageRank is a tricky concept because it is circular. Every page on the Internet has a minimum PageRank score just for existing. A page passes along 85% of its PageRank (at least, that’s the best-known estimate, based on an early paper) to the pages it links to, divided more or less equally among its outgoing links. A page’s PageRank is the sum of the minimum value plus all the PageRank passed to it via incoming links.

Although this is circular, mathematical algorithms exist for calculating it iteratively.
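Here is a minimal sketch of such an iterative calculation, under stated assumptions: the 85% figure is the estimate quoted above, the minimum value of 0.15 is assumed to be the remaining 15%, and the three-page link graph is made up. It illustrates the circular definition and is not Google’s actual implementation.

```python
def pagerank(links, damping=0.85, minimum=0.15, iterations=50):
    """Iteratively estimate raw PageRank.

    `links` maps each page to the list of pages it links to.
    `damping` is the share of a page's rank passed along its links;
    `minimum` is the score every page gets just for existing (assumed value).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}  # arbitrary starting guess
    for _ in range(iterations):
        new_rank = {page: minimum for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # a page with no outgoing links passes nothing along
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page Web: a links to b and c, b links to c, c links to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Each pass recomputes every score from the previous pass’s scores, and the values settle after a few dozen rounds. In this toy graph, page c ends up on top because it collects rank from both a and b.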

In one final complication, what I just said applies to “raw PageRank.”   Google actually reports PageRank scores of 0 to 10 that are believed to be based on the logarithm of raw PageRank (they’re reported as whole numbers).   And the base of that logarithm is believed to be approximately 6.
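To get a feel for what a base-6 logarithmic scale would mean, the sketch below maps a raw score to a 0 to 10 value. The base of 6 is only the estimate quoted above, and expressing raw PageRank as a multiple of some reference value that maps to 0 is an assumption of this example; how Google actually normalizes the number is not public.

```python
def reported_score(raw_multiple, base=6.0, top=10):
    """Map raw PageRank to a whole-number score on a logarithmic scale.

    `raw_multiple` is raw PageRank expressed as a multiple of an assumed
    reference value that maps to a reported score of 0.
    """
    score = 0
    threshold = base
    # Each extra point requires `base` times more raw PageRank.
    while score < top and raw_multiple >= threshold:
        score += 1
        threshold *= base
    return score

for multiple in (1, 6, 36, 6 ** 10):
    print(multiple, reported_score(multiple))
```

Under these assumptions, a page would need roughly 60 million times (6 to the 10th power) the reference amount of raw PageRank to be reported as a 10.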

Anyhow, there are about 30 sites on the Web of PageRank 10, including Yahoo, Google, Microsoft, Intel, and NASA. IBM, AOL, and CNN, by way of contrast, were only at PageRank 9 as of early 2004.

Further refinements in link popularity rankings are under development.  Notably, link popularity can be made specific to a subject or category; i.e., pages can have different PageRanks for health vs. sports vs. computers vs. whatever.  Supposedly, AskJeeves/Teoma already works that way.

It is believed that Inktomi, Altavista, et al. use link popularity in their ranking algorithms, but to a much lesser extent than Google. Yahoo, owner of Inktomi, Altavista, and Alltheweb, is rolling out a new search engine, which reportedly includes a feature called Web Rank. More on how that works soon.

US GOVERNMENT SEARCH-GOOGLE

Tuesday, July 15th, 2008

<usgov.google.com>: I use FirstGov.gov as a portal for all things dealing with the US government, and FedStats.gov. Google now offers a site with links to government news and search functions.

SCHOLAR-GOOGLE

Tuesday, July 15th, 2008

<scholar.google.com>: Specialized search of just academic journals and similar publications.

GOOGLE TOOLBAR

Tuesday, July 15th, 2008

Instead of going to Google.com each time to search, it’s much faster to have Google built into your browser. Several options:

* My preferred browser for PC or Mac is Firefox <getfirefox.com>, which already has built-in Google. You should download the free Googlebar extension <googlebar.mozdev.org> or the official Google Toolbar <toolbar.google.com> (the Google Toolbar also works on PC Explorer). Faster searches; pop-up blocker; highlighting; word find (go directly to a word/phrase on a page). Be sure to get the Cool Iris extension for Firefox, which lets you preview Google results.

Seo:A bot visit

Tuesday, July 15th, 2008

You can request a Google robot visit at www.google.com/addurl. The robot will browse your site and index its contents. Expect to wait a couple of weeks before this happens. This is Google’s Web directory and, next to the Google robot, an important source of the Google search API.

More Google API Applications

Tuesday, July 15th, 2008

Staggernation.com offers three tools based on the Google API. The Google API Web Search by Host (GAWSH) lists the Web hosts of the results for a given query (www.staggernation.com/gawsh/). When you click on the triangle next to each host, you get a list of results for that host. The Google API Relation Browsing Outliner (GARBO) is a little more complicated: You enter a URL and choose whether you want pages that are related to the URL or linked to the URL (www.staggernation.com/garbo/). Click on the triangle next to a URL to get a list of pages linked or related to that particular URL. CapeMail is an e-mail search application that allows you to send an e-mail to google@capeclear.com with the text of your query in the subject line and get the first ten results for that query back. Maybe it’s not something you’d do every day, but if your cell phone does e-mail and doesn’t do Web browsing, this is a very handy address to know.

Search Within a Timeframe in google

Tuesday, July 15th, 2008

Daterange: (start date–end date). You can restrict your searches to pages that were indexed within a certain time period. Daterange: searches by when Google indexed a page, not when the page itself was created. This operator can help you ensure that results will have fresh content (by using recent dates), or you can use it to avoid a topic’s current-news blizzard and concentrate only on older results.

Daterange: is actually more useful if you go elsewhere to take advantage of it, because daterange: requires Julian dates, not standard Gregorian dates. You can find converters on the Web (such as http://aa.usno.navy.mil/data/docs/JulianDate.html), but an easier way is to do a Google daterange: search by filling in a form at www.researchbuzz.com/toolbox/goofresh.shtml or www.faganfinder.com/engines/google.shtml.

If one special syntax element is good, two must be better, right? Sometimes. Though some operators can’t be mixed (you can’t use the link: operator with anything else), many can be, quickly narrowing your results to a less overwhelming number.
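If you would rather do the Julian date conversion yourself, it takes only a few lines of Python. This is a sketch assuming daterange: takes whole-number Julian day numbers joined by a hyphen; the function names are made up for the example.

```python
from datetime import date

def julian_day(d):
    """Julian day number for a Gregorian calendar date."""
    # date.toordinal() counts days from 0001-01-01; the offset shifts
    # that count onto the Julian day scale.
    return d.toordinal() + 1721425

def daterange_query(terms, start, end):
    """Build a query string restricted to a range of indexing dates."""
    return f"{terms} daterange:{julian_day(start)}-{julian_day(end)}"

print(julian_day(date(2000, 1, 1)))  # 2451545
print(daterange_query("search engines", date(2008, 7, 1), date(2008, 7, 15)))
```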

Syntax Search Tricks

Tuesday, July 15th, 2008

Using a special syntax is a way to tell Google that you want to restrict your searches to certain elements or characteristics of Web pages. Google has a fairly complete list of its syntax elements at www.google.com/help/operators.html. Here are some advanced operators that can help narrow down your search results.

Intitle: at the beginning of a query word or phrase (intitle:"Three Blind Mice") restricts your search results to just the titles of Web pages.

Intext: does the opposite of intitle:, searching only the body text, ignoring titles, links, and so forth. Intext: is perfect when what you’re searching for might commonly appear in URLs. If you’re looking for the term HTML, for example, and you don’t want to get results such as www.mysite.com/index.html, you can enter intext:html.

Link: lets you see which pages are linking to your Web page or to another page you’re interested in.

Try using site: (which restricts results to a particular site or top-level domain) with intitle: to find certain types of pages. For example, get scholarly pages about Mark Twain by searching for intitle:"Mark Twain" site:edu. Experiment with mixing various elements; you’ll develop several strategies for finding the stuff you want more effectively. The site: command is very helpful as an alternative to the mediocre search engines built into many sites.
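If you build these queries often, a small helper can assemble them for you. The sketch below is only an illustration: the function names are made up, and the https://www.google.com/search?q= address is an assumption about where to send the query, not something taken from Google’s operator documentation.

```python
from urllib.parse import urlencode

def build_query(terms="", intitle=None, intext=None, site=None):
    """Combine plain search terms with intitle:, intext:, and site: operators."""
    parts = []
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    if intext:
        parts.append(f"intext:{intext}")
    if site:
        parts.append(f"site:{site}")
    if terms:
        parts.append(terms)
    return " ".join(parts)

def search_url(query):
    """Turn a query string into a search URL (the base address is an assumption)."""
    return "https://www.google.com/search?" + urlencode({"q": query})

query = build_query(intitle="Mark Twain", site="edu")
print(query)  # intitle:"Mark Twain" site:edu
print(search_url(query))
```

Each operator is separated from the next by a space, which is easy to drop when typing a combined query by hand.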