EarthlingsUnited Web Development Day4



EarthlingsUnited welcomes you to its web site development tutorial. This is the most comprehensive e-commerce site development tutorial you will find anywhere at any price. This one is free.


Search Engines and Traffic


Day 1: Web Development Tools
Day 2: Publishing Your Web Site
Day 3: Finding and Building PERL Scripts
Day 4: Search Engines and Traffic
Day 5: Purpose of Web Site and Server Logs
Day 6: Finding a Web Hosting Service
Day 7: Associate Programs / Micro Transaction
Day 8: Demos and Applications

Class Goals
  1. Building a clean web site (reference to deadlock.com tutorials)
  2. Submitting to search engines
  3. Using META tags
  4. SPAMMING search engines and their response
  5. Search engine wars
  6. Robots

Introduction

What we pleasantly refer to as search engines fall into two distinct categories: directories and search indexes. Directories use humans to evaluate a web site and place it in a particular directory. Search indexes are automated and rely on their internal web spiders / robots / crawlers to access your pages and place them in a database. Waiting for these web spiders to arrive without any action on your part will lead to a web site with no traffic. You must take action to list your site on a search engine.

All search engines let you submit your URL manually. You can usually find the form by looking for links with key words such as "submit site", "add url", "add your site" or some derivative. It's usually on the home page or in the help area. Once you submit a URL, the search engine spiders go to work.

Submit all your pages, not just your home page. Some search engines limit you to 25 submissions per day; exceed the limit and they will conclude you are spamming and bar you from the directory.
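
One way to stay under a daily submission limit is to split your page list into batches ahead of time. The following is only a sketch: the daily_batches routine and the example.com URLs are made up for illustration, and the limit of 25 is the figure quoted above, which will differ from engine to engine.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: split a list of page URLs into daily batches.
# $limit is the most pages you are willing to submit in one day.
sub daily_batches {
    my ($limit, @urls) = @_;
    my @batches;
    while (@urls) {
        # splice removes up to $limit URLs from the front of the list
        push @batches, [ splice(@urls, 0, $limit) ];
    }
    return @batches;
}

# Sixty placeholder pages on a made-up site.
my @pages   = map { "http://www.example.com/page$_.html" } 1 .. 60;
my @batches = daily_batches(25, @pages);

my $day = 1;
foreach my $batch (@batches) {
    printf "Day %d: submit %d pages\n", $day++, scalar @$batch;
}
```

With 60 pages and a limit of 25, this plans three days of submissions: 25, 25 and 10 pages.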

Building a clean Web Site

If you haven't done so yet, download and read the 1998 CassBeth E-Commerce Findings.

CassBeth
ElizAndra
FredsSpot
OuterSpaceShopper
GiftsToHumanity
MartianShopper
EarthlingsUnited (not released yet)

Probably the best tutorial on the Internet on building a clean web site and promoting it is from a young gentleman in England. This is required reading if you hope to make a go of a web site. It's fast and filled with important information.

DeadLock Search Engines
DeadLock Site Content
DeadLock The Art of Site Promotion

IMHO the bottom line for a clean web site is based on many principles that were established by a breakthrough in technical publications about 40 years ago. President Kennedy, concerned about the vast amount of paper that was being generated by and for the government and industry, issued a challenge to develop a new method of presenting information. Hughes Aircraft in Fullerton, California came up with a revolutionary approach called STOP, or Sequential Thematic Organization Of Publications. The publications folks at Hughes merged concepts that were developed with the making of the movie "Gone With The Wind" and the engineering discipline within the company. They specifically established a process that included storyboards and thematic organization, which were coupled with the realities of attempting to communicate highly complex technical solutions to the lay person. This process allowed an organization with hundreds of technical specialists from many companies across the country to pull together and produce proposals and studies that today represent systems which form our national infrastructure. Some very fundamental principles of STOP are:

STOP was way ahead of its time and used by certain companies attempting to communicate some of the most complex government systems ever developed. It was used mostly by very large system engineering companies tasked with creating things that no normal entities could visualize let alone describe to the average person. These products needed to be digested by many folks such as congressional staff, traditional media, the people of the USA, and major decision makers across the country.

I think STOP is very appropriate to the web and the Internet presentation of information. The additional elements beyond STOP should include:

Which Search Engines

Although there are thousands of search engines and directories on the Internet, only a handful matter. This set of engines has changed in the past 3 years and will probably continue to change.

USA Search Engines
altavista
aol
excite
hotbot
infoseek
lycos
northernlight
planetsearch
webcrawler
whatyouseek p
galaxy
goto
looksmart
magellan
netguide
planetsearch
snap!
webcrawler
yahoo

USA URL Submission
altavista
aol
excite
hotbot
infoseek
lycos
northernlight
planetsearch
webcrawler
goto
galaxy tradewave
infospace
snap
yahoo shopping
looksmart
800go

Int'l Search Engines
infomak
intersearch (.de) p
intersearch (.au) p
altavista (.au)
webwombat (must be .au .nz)
anzwers (.au .nz)
voila (.fr)
euroferret

Int'l URL Submission
infomak
intersearch de
intersearch au
altavista yellowpages
webwombat
anzwers
voila
euroferret

These sites are an excellent source of information on search engines (links, mergers, approaches, tips, etc):

SearchEngineWatch.com
Spider Food
VerySimple.com This is a great PERL script that lets you submit to multiple search engines at once.

http://www.jafsoft.com/misc/guides.html

Search Engine Wars

There was trouble in search engine land a few years ago. It is getting worse, not better, as we enter 2001.

A few years ago the big issue revolved around free trade. For example, all Australian engines rejected submissions from everyone except Australians (.au). European engines rejected submissions from the USA. Everyone would reject submissions from free communities such as geocities. AOL would not spider you unless you were an AOL customer. If you had traffic, they would eventually spider you in 9-12 months. Yahoo's search engine behaves like a random number generator so your submissions are almost useless. Additionally, it is almost impossible to get listed in a Yahoo directory or get spidered by Yahoo unless you pay cash or are part of a group that pays cash.

However, there were shining lights a few years ago. Infoseek and Altavista were very clean operations and generated a large amount of traffic for many sites. They were followed by others such as Excite, Lycos, Northern Light, etc.

The current issue in the USA is $$$ driven: the search engines are trying to get fees for listing your URL. We think this will destroy the Internet as we know it today. This year, 2001, will be a very important year in this area. Our experiment in the fourth quarter of 2000 showed that only 2 engines were "clean" and made the submissions available for search: Altavista and Google, followed shortly by Lycos. Altavista made links available in 2 weeks, Google made links available in 3-4 weeks, and Lycos made links available 10-12 weeks later. The greatest traffic comes from Google.

In the past Infoseek was our cleanest engine. They made links instantly available for searching, once submitted. They were also our #1 engine that provided visitors. Now that they are part of Disney we have no traffic from go.com, even though our old site is still in their database. The following is a list of search engines and directories.

Meta Tags

Meta tags are used by many search engines to properly spider your site. It is critical that you use meta tags. The following is an example use of meta tags:

<HTML>
<HEAD>
  <META NAME="description" CONTENT="We welcome you to our shop featuring aps cameras, batteries, blank media, calculators, camcorders, cameras 35mm, cameras, cassette recorders, cassette walkman, cd players, cd recorders, cd writers, cell phone, computer memory, computer monitors, computer projectors, computer speakers, copiers, cpu upgrades, digital cameras, disk drives, dvd players, electronic labeling, fax machines, flat panel monitors, game boy, gps, graphics cards, grundig radios, guide camcorders, guide digital cameras, guide monitors, guide pdas, guide speakers, handheld pc, hint books, home audio, home office phones, home theater, indexa, ink, inkjet cartridges, inkjet printers, input devices, laptop accessories, laser printers, mac, memory, micro cassettes, minidisc, modems, mouse pads, mp3 players, multifunction devices, multimedia, networking, nintendo 64, office phones, paper, pda organizers, pdas, phillips magnavox, portable stereos, projectors, radar detectors, radio scanners, rca, registered memory, routers hubs, scanners, sega, shredders, sony playstation, sony, speakers, subwoofers, toner drums, top, translators dictionaries, tvs, usb, vcrs, and many other great items.">
  <META NAME="keywords" CONTENT="aps-cameras, batteries, blank-media, calculators, camcorders, cameras-35mm, cameras, cassette-recorders, cassette-walkman, cd-players, cd-recorders, cd-writers, cell-phone, computer-memory, computer-monitors, computer-projectors, computer-speakers, copiers, cpu-upgrades, digital-cameras, disk-drives, dvd-players, electronic-labeling, fax-machines, flat-panel-monitors, game-boy, gps, graphics-cards, grundig-radios, guide-camcorders, guide-digital-cameras, guide-monitors, guide-pdas, guide-speakers, handheld-pc, hint-books, home-audio, home-office-phones, home-theater, indexa, ink, inkjet-cartridges, inkjet-printers, input-devices, laptop-accessories, laser-printers, mac, memory, micro-cassettes, minidisc, modems, mouse-pads, mp3-players, multifunction-devices, multimedia, networking, nintendo-64, office-phones, paper, pda-organizers, pdas, phillips-magnavox, portable-stereos, projectors, radar-detectors, radio-scanners, rca, registered-memory, routers-hubs, scanners, sega, shredders, sony-playstation, sony, speakers, subwoofers, toner-drums, top, translators-dictionaries, tvs, usb, vcrs, ">
  <META NAME="revisit-after" CONTENT="20 days">
  <TITLE>CassBeth Electronics</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" LINK="#009040" VLINK="#FF0000">

Remember, it's OK to misspell words. It may actually improve your results. Engines are case insensitive, and "camcorder" is a match to "camcorders" in regular expression land, so use plurals. Most search engines have help links that describe how they process meta tags. Visit and read those links.
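
Typing long description and keywords tags by hand gets tedious, so here is a minimal sketch of generating them from a category list. The category list, the escape_attr and meta_tags routines, and the shop name are all made up for illustration. The last two lines assume an engine that does plain substring matching, as described above; real engines may match differently.

```perl
#!perl
use strict;
use warnings;

# Hypothetical category list; a real shop would read this from its catalog.
my @categories = ('digital cameras', 'dvd players', 'mp3 players');

# Escape the characters that would break out of a quoted attribute value.
sub escape_attr {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the two tags in the style of the example above:
# spaces in the description, hyphens in the keywords.
sub meta_tags {
    my ($shop, @items) = @_;
    my $description = "We welcome you to $shop featuring " . join(', ', @items) . '.';
    my $keywords    = join(', ', map { my $k = $_; $k =~ tr/ /-/; $k } @items);
    return
        '<META NAME="description" CONTENT="' . escape_attr($description) . "\">\n" .
        '<META NAME="keywords" CONTENT="'    . escape_attr($keywords)    . "\">\n";
}

print meta_tags('our shop', @categories);

# Why plurals help under substring matching: the query "camcorder" is found
# inside the keyword "camcorders", but not the other way round.
print "plural keyword matches singular query\n" if 'camcorders' =~ /camcorder/i;
print "singular keyword misses plural query\n"  if 'camcorder'  !~ /camcorders/i;
```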

SearchEngineWatch -  How To Use HTML Meta Tags
SearchEngineWatch - Meta Tag Law Suits

Robots Spiders Crawlers

Robots fall into 2 categories: the good guys and the bad guys. The good guys travel the Internet looking for content that is then made available to search engines. Bad robots harvest things such as e-mail addresses and personal information, or hunt for potential copyright and trademark infringements.

http://www.jafsoft.com/misc/opinion/webbots.html

Good robots obey the rules and access your robots.txt file before they enter your site. Although most search engines say that they will eventually walk down your site tree and find all your pages, our experience shows that is not the case. Bad spiders ignore your robots.txt file. So the robots.txt file is almost a moot point, but it's good practice to have one in any case. It goes in the root directory of your web site, so that it is served as /robots.txt.

Evil Spider

Example Spiders with IndigoPerl on your PC

spider (your local PC, once set up)
spider-here (your local PC, once set up)

Robots.txt File

# /robots.txt file for http://www.cassbeth.com/
# mail webmaster@cassbeth.com for constructive criticism

User-agent: *
Disallow: /cgi-bin/
Disallow: /surveys/
Disallow: /temp/
Disallow: /weblog/
Disallow: /logs/
Disallow: /postcard/card.cgi
Disallow: /postcard/cards
Disallow: /postcard/pictures
Disallow: /ssi/
Disallow: /ads/
Disallow: /buildpages/
Disallow: /idra/

User-agent: emailsiphon
Disallow: / 
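
To see how a well-behaved robot might read a file like this, here is a simplified sketch. The parse_robots and allowed routines and the robot name "goodbot" are made up for illustration. It only handles User-agent and Disallow lines and matches robot names exactly, which is stricter than real robots tend to be. If you have libwww-perl installed, its WWW::RobotRules module is the more complete choice.

```perl
#!perl
use strict;
use warnings;

# Parse robots.txt text into a hash: lowercased robot name => list of
# disallowed path prefixes. Only User-agent and Disallow are handled.
sub parse_robots {
    my ($text) = @_;
    my (%rules, @agents, $in_rules);
    foreach my $line (split /\r?\n/, $text) {
        $line =~ s/#.*//;            # strip comments
        $line =~ s/^\s+|\s+$//g;     # trim whitespace
        next unless $line =~ /^([\w-]+)\s*:\s*(.*)$/;
        my ($field, $value) = (lc $1, $2);
        if ($field eq 'user-agent') {
            # a User-agent line that follows rules starts a new record
            @agents   = () if $in_rules;
            $in_rules = 0;
            my $name = lc $value;
            push @agents, $name;
            $rules{$name} ||= [];
        }
        elsif ($field eq 'disallow') {
            $in_rules = 1;
            next if $value eq '';    # an empty Disallow allows everything
            push @{ $rules{$_} }, $value foreach @agents;
        }
    }
    return \%rules;
}

# True if $agent may fetch $path. A robot uses its own record if there
# is one, otherwise the "*" record.
sub allowed {
    my ($rules, $agent, $path) = @_;
    my $name = lc $agent;
    my $list = $rules->{$name} || $rules->{'*'} || [];
    foreach my $prefix (@$list) {
        return 0 if index($path, $prefix) == 0;   # path starts with prefix
    }
    return 1;
}

my $robots = <<'END';
User-agent: *
Disallow: /cgi-bin/
Disallow: /logs/

User-agent: emailsiphon
Disallow: /
END

my $rules = parse_robots($robots);
print allowed($rules, 'goodbot',     '/index.html')    ? "ok\n" : "blocked\n";
print allowed($rules, 'goodbot',     '/cgi-bin/x.cgi') ? "ok\n" : "blocked\n";
print allowed($rules, 'emailsiphon', '/index.html')    ? "ok\n" : "blocked\n";
```

Keep in mind that this is what a polite robot does voluntarily. Nothing in the file stops a bad robot from ignoring it.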

SPAMMING Search Engines

Never try to spam a search engine in hopes of getting a better listing. They will detect it using automated detection techniques and bar you from their database. Spamming examples include:


Your Very Own Spiders

These spiders run on your local PC IndigoPerl Web server.  


Spider

#!perl

print "Content-type: text/html\n\n";
print "<html><head><title>Spider</title></head>\n";
print "<body><center><h1>Spider</h1></center>\n";

&parse_form;

&control_menu;

&spider;

print "</body>\n</html>\n";


####################
sub parse_form {

 if ($ENV{'REQUEST_METHOD'} eq 'GET') {
      # Split the name-value pairs
      @pairs = split(/&/, $ENV{'QUERY_STRING'});
   }
   elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
      # Get the input
      read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
 
      # Split the name-value pairs
      @pairs = split(/&/, $buffer);
   }
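   else {
      # No error() routine is defined in this script, so report the problem here
      print "<P>Unsupported request method</P>\n";
      @pairs = ();
   }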


 foreach $pair (@pairs){
        ($name,$value) = split(/=/,$pair);

      $name =~ tr/+/ /;
      $name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

      $value =~ tr/+/ /;
      $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

      # If they try to include server side includes, erase them, so they
      # arent a security risk if the html gets returned.  Another 
      # security hole plugged up.

      $value =~ s/<!--(.|\n)*-->//g;

        $FORM{$name} = $value;
 }

$urls = $FORM{'urls'};
$files = $FORM{'files'};


@urls = split(/\s+/,$urls);
$urls = "";
foreach $url (@urls) {$urls .= "$url\n";}

@files = split(/\s+/,$files);
$files = "";
foreach $file (@files) {$files .= "$file\n";}

#print "<PRE>$urls</PRE><P>";
#print "<PRE>$files</PRE><P>";

}


####################
sub control_menu{

print "<FORM ACTION='spider.cgi' METHOD='GET'>\n";
print "<PRE>

URLs                                                                                Files
<TEXTAREA NAME='urls' WRAP=VIRTUAL ROWS='20' COLS='80'>$urls</TEXTAREA> <TEXTAREA NAME='files' WRAP=VIRTUAL ROWS='20' COLS='10'>$files</TEXTAREA>

</PRE>\n";
print "<CENTER><INPUT TYPE=submit VALUE='submit'>  <INPUT TYPE='reset' VALUE='clear'></CENTER></FORM>\n";

}


####################
sub spider {

use LWP::UserAgent;
$ua = new LWP::UserAgent;
$ua->agent("$0/0.1 " . $ua->agent);
# $ua->agent("Mozilla/8.0") # pretend we are very capable browser

$ai1 = $#urls;
$ai2 = $#files;
print "$ai1 - $ai2<BR>";

for ($i = 0; $i <= $ai1; ++$i) {

   print "url $urls[$i] ";
   $req = new HTTP::Request 'GET' => $urls[$i];
   $req->header('Accept' => 'text/html');

   # send request
   $res = $ua->request($req);


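   # check the outcome before creating the output file,
   # so a failed fetch does not leave an empty file behind
   if ($res->is_success) {
      if (open(DATA, '>', "$files[$i].html")) {
         print DATA $res->content;
         close(DATA);
         print "$files[$i] saved<BR>";
      } else {
         print "Error: cannot write $files[$i].html: $!<BR>\n";
      }
   } else {
      print "Error: " . $res->status_line . "<BR>\n";
   }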

}

}


Spider Here
#!perl

print "Content-type: text/html\n\n";
print "<html><head><title>Spider</title></head>\n";
print "<body><center><h1>Spider</h1></center>\n";

&parse_form;

&control_menu;

&spider;

print "</body>\n</html>\n";


####################
sub parse_form {

 if ($ENV{'REQUEST_METHOD'} eq 'GET') {
      # Split the name-value pairs
      @pairs = split(/&/, $ENV{'QUERY_STRING'});
   }
   elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
      # Get the input
      read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
 
      # Split the name-value pairs
      @pairs = split(/&/, $buffer);
   }
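   else {
      # No error() routine is defined in this script, so report the problem here
      print "<P>Unsupported request method</P>\n";
      @pairs = ();
   }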


 foreach $pair (@pairs){
        ($name,$value) = split(/=/,$pair);

      $name =~ tr/+/ /;
      $name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

      $value =~ tr/+/ /;
      $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

      # If they try to include server side includes, erase them, so they
      # arent a security risk if the html gets returned.  Another 
      # security hole plugged up.

      $value =~ s/<!--(.|\n)*-->//g;

        $FORM{$name} = $value;
 }

$urls = $FORM{'urls'};
$files = $FORM{'files'};


@urls = split(/\s+/,$urls);
$urls = "";
foreach $url (@urls) {$urls .= "$url\n";}

@files = split(/\s+/,$files);
$files = "";
foreach $file (@files) {$files .= "$file\n";}

#print "<PRE>$urls</PRE><P>";
#print "<PRE>$files</PRE><P>";

}


####################
sub control_menu{

print "<FORM ACTION='spider-here.cgi' METHOD='POST'>\n";
print "<PRE>

URLs                                                                                Files
<TEXTAREA NAME='urls' WRAP=VIRTUAL ROWS='20' COLS='80'>$urls</TEXTAREA> <TEXTAREA NAME='files' WRAP=VIRTUAL ROWS='20' COLS='10'>$files</TEXTAREA>

</PRE>\n";
print "<CENTER><INPUT TYPE=submit VALUE='submit'>  <INPUT TYPE='reset' VALUE='clear'></CENTER></FORM>\n";

}


####################
sub spider {

use LWP::UserAgent;
$ua = new LWP::UserAgent;
$ua->agent("0.1 " . $ua->agent);
#$ua->agent("$0/0.1 " . $ua->agent);
# $ua->agent("Mozilla/8.0") # pretend we are very capable browser

$ai1 = $#urls;
$ai2 = $#files;
print "$ai1 - $ai2<BR>";

for ($i = 0; $i <= $ai1; ++$i) {

   print "url $urls[$i] ";
   $req = new HTTP::Request 'GET' => $urls[$i];
   $req->header('Accept' => 'text/html');

   # send request
   $res = $ua->request($req);

   #open(DATA, ">$files[$i].html");
   # check the outcome
   if ($res->is_success) {
      $page = $res->content;
      # cool mod that brings in the images
      # (?!http:) skips SRC values that are already absolute; [^"]* stops at the closing quote
      $page =~ s/SRC="(?!http:)\/?([^"]*)"/SRC="$urls[$i]\/$1"/ig;
      print $page;
   } else {
      print "Error: " . $res->status_line . "\n";
   }
   #close(DATA);

}

}
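
The image fix-up in Spider Here can be tried on its own, without fetching anything. The sketch below is one way to do it. The absolutize_src routine and the example.com URLs are made up for illustration, and it treats the fetched URL as a directory, which is only right for pages fetched from a site's top level. For proper relative URL resolution the URI module (URI->new_abs) is the safer tool.

```perl
#!perl
use strict;
use warnings;

# Rewrite relative SRC attributes so they point back at the original site.
# $base is the URL the page was fetched from.
sub absolutize_src {
    my ($page, $base) = @_;
    $base =~ s{/+$}{};    # avoid a double slash when joining
    # (?!https?:) is a negative lookahead: values that already start
    # with http: or https: are left alone
    $page =~ s{SRC="(?!https?:)/?([^"]*)"}{SRC="$base/$1"}ig;
    return $page;
}

my $html = '<IMG SRC="images/logo.gif"> <IMG SRC="http://other.example/a.gif">';
print absolutize_src($html, 'http://www.example.com/'), "\n";
```

The first image is rewritten to http://www.example.com/images/logo.gif and the second, already absolute, is left untouched. A character class such as [^http:] would not do this job: it matches any single character other than h, t, p or a colon, which is why a lookahead is used here.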



Copyright © 2000 All Rights Reserved