EarthlingsUnited welcomes you to this web site development tutorial. It is the most comprehensive e-commerce site development tutorial you will find anywhere at any price, and this one is free.
Day 1: Web Development Tools
Class Goals
What we pleasantly refer to as search engines fall into two distinct categories: directories and search indexes. Directories use humans to evaluate a web site and place it in a particular directory. Search indexes are automated and rely on their internal web spiders / robots / crawlers to access your pages and place them in a database. Waiting for these web spiders to arrive without any action on your part will leave you with a web site that has no traffic. You must take action to list your site on a search engine.
All search engines let you submit your URL manually. You can usually find the submission form by looking for links with key words such as "submit site", "add url", "add your site" or some derivative; it's usually on the home page or in the help area. Once you submit a URL, the search engine spiders go to work.
Submit all your pages, not just your home page. Be aware, though, that some search engines limit you to 25 submissions per day; exceed that and they will conclude you are spamming and bar you from the directory.
If you haven't done so yet, download and read the 1998 CassBeth E-Commerce Findings.
CassBeth
ElizAndra
FredsSpot
OuterSpaceShopper
GiftsToHumanity
MartianShopper
EarthlingsUnited (not released yet)
Probably the best tutorial on the Internet on building a clean web site and promoting it comes from a young gentleman in England. This is required reading if you hope to make a go of a web site. It's a fast read and filled with important information.
DeadLock Search Engines
DeadLock Site Content
DeadLock The Art of Site Promotion
IMHO the bottom line for a clean web site is based on many principles that were established by a breakthrough in technical publications about 40 years ago. President Kennedy, concerned about the vast amount of paper being generated by and for the government and industry, issued a challenge to develop a new method of presenting information. Hughes Aircraft in Fullerton, California came up with a revolutionary approach called STOP, or Sequential Thematic Organization of Publications. The publications folks at Hughes merged concepts developed in the making of the movie "Gone With The Wind" with the engineering discipline within the company. They specifically established a process that included storyboards and thematic organization, coupled with the realities of attempting to communicate highly complex technical solutions to the lay person. This process allowed an organization with hundreds of technical specialists from many companies across the country to pull together and produce proposals and studies that today represent systems which form our national infrastructure. Some very fundamental principles of STOP are:
STOP was way ahead of its time and was used by certain companies attempting to communicate some of the most complex government systems ever developed. It was used mostly by very large systems engineering companies tasked with creating things that no normal entity could visualize, let alone describe to the average person. These products needed to be digested by many folks such as congressional staff, the traditional media, the people of the USA, and major decision makers across the country.
I think STOP is very appropriate to the web and the Internet presentation of information. The additional elements beyond STOP should include:
Although there are thousands of search engines and directories on the Internet, only a handful matter. This set of engines has changed in the past 3 years and will probably continue to change.
USA Search Engines: altavista, aol, excite, hotbot, infoseek, lycos, galaxy
USA URL Submission: altavista, aol, excite, hotbot, infoseek, lycos, goto
Int'l Search Engines: infomak, intersearch (.de), intersearch (.au), altavista (.au), webwombat (must be .au or .nz)
Int'l URL Submission: infomak, intersearch (.de), intersearch (.au), altavista, yellowpages
These sites are an excellent source of information on search engines (links, mergers, approaches, tips, etc):
SearchEngineWatch.com
Spider Food
VerySimple.com
This is a great Perl script that lets you submit to multiple search engines at once (a minimal sketch of the same idea follows below).
http://www.jafsoft.com/misc/guides.html
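If you want to roll your own, the same idea can be sketched in a few lines of Perl using LWP, the same library the spider scripts further down this page use. The engine addresses and the "url" parameter name below are placeholders, not real submission endpoints; check each engine's "add url" page for the real form address and parameter name before using anything like this.

#!perl
# Sketch only: submit one page to several engines' add-url forms.
# The endpoint URLs and the 'url' parameter name are assumptions;
# substitute the real addresses from each engine's "add url" page.
use LWP::UserAgent;
use HTTP::Request;
use URI::Escape;

$my_url = "http://www.cassbeth.com/";                  # page to submit
@add_url_forms = ("http://engine-one.example/addurl",  # hypothetical endpoints
                  "http://engine-two.example/addurl");

$ua = new LWP::UserAgent;
foreach $engine (@add_url_forms) {
    $req = new HTTP::Request 'GET' => "$engine?url=" . uri_escape($my_url);
    $res = $ua->request($req);
    if ($res->is_success) { print "submitted to $engine\n"; }
    else { print "failed at $engine: " . $res->status_line . "\n"; }
}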
There was trouble in search engine land a few years ago. It is getting worse not better as we enter 2001.
A few years ago the big issue revolved around free trade. For example, all Australian engines rejected submissions from everyone except Australians (.au), European engines rejected submissions from the USA, and everyone rejected submissions from free communities such as GeoCities. AOL would not spider you unless you were an AOL customer; if you had traffic, they would eventually spider you in 9-12 months. Yahoo's search engine behaves like a random number generator, so your submissions are almost useless. Additionally, it is almost impossible to get listed in a Yahoo directory or get spidered by Yahoo unless you pay cash or are part of a group that pays cash.
However, there were shining lights a few years ago. Infoseek and Altavista were very clean operations and generated a large amount of traffic for many sites. They were followed by others such as Excite, Lycos, Northern Light, etc.
The current issue in the USA is $$$ driven: the search engines are trying to collect fees for listing your URL. We think this will destroy the Internet as we know it today, and 2001 will be a very important year in this area. Our experiment in the 4th quarter of 2000 showed that only two engines were "clean" and made the submissions available for search: Altavista and Google, followed shortly by Lycos. Altavista made links available in 2 weeks, Google in 3-4 weeks, and Lycos 10-12 weeks later. The greatest traffic came from Google.
In the past, Infoseek was our cleanest engine: once you submitted, they made links instantly available for searching. They were also our #1 engine for providing visitors. Now that they are part of Disney we get no traffic from go.com, even though our old site is still in their database. The following is a list of search engines and directories.
Meta tags are used by many search engines to properly spider your site. It is critical that you use meta tags. The following is an example use of meta tags:
<HTML>
<HEAD>
<META NAME="description" CONTENT="We welcome you to our shop featuring aps cameras, batteries, blank media, calculators, camcorders, cameras 35mm, cameras, cassette recorders, cassette walkman, cd players, cd recorders, cd writers, cell phone, computer memory, computer monitors, computer projectors, computer speakers, copiers, cpu upgrades, digital cameras, disk drives, dvd players, electronic labeling, fax machines, flat panel monitors, game boy, gps, graphics cards, grundig radios, guide camcorders, guide digital cameras, guide monitors, guide pdas, guide speakers, handheld pc, hint books, home audio, home office phones, home theater, indexa, ink, inkjet cartridges, inkjet printers, input devices, laptop accessories, laser printers, mac, memory, micro cassettes, minidisc, modems, mouse pads, mp3 players, multifunction devices, multimedia, networking, nintendo 64, office phones, paper, pda organizers, pdas, phillips magnavox, portable stereos, projectors, radar detectors, radio scanners, rca, registered memory, routers hubs, scanners, sega, shredders, sony playstation, sony, speakers, subwoofers, toner drums, top, translators dictionaries, tvs, usb, vcrs, and many other great items.">
<META NAME="keywords" CONTENT="aps-cameras, batteries, blank-media, calculators, camcorders, cameras-35mm, cameras, cassette-recorders, cassette-walkman, cd-players, cd-recorders, cd-writers, cell-phone, computer-memory, computer-monitors, computer-projectors, computer-speakers, copiers, cpu-upgrades, digital-cameras, disk-drives, dvd-players, electronic-labeling, fax-machines, flat-panel-monitors, game-boy, gps, graphics-cards, grundig-radios, guide-camcorders, guide-digital-cameras, guide-monitors, guide-pdas, guide-speakers, handheld-pc, hint-books, home-audio, home-office-phones, home-theater, indexa, ink, inkjet-cartridges, inkjet-printers, input-devices, laptop-accessories, laser-printers, mac, memory, micro-cassettes, minidisc, modems, mouse-pads, mp3-players, multifunction-devices, multimedia, networking, nintendo-64, office-phones, paper, pda-organizers, pdas, phillips-magnavox, portable-stereos, projectors, radar-detectors, radio-scanners, rca, registered-memory, routers-hubs, scanners, sega, shredders, sony-playstation, sony, speakers, subwoofers, toner-drums, top, translators-dictionaries, tvs, usb, vcrs, ">
<META NAME="revisit-after" CONTENT="20 days">
<TITLE>CassBeth Electronics</TITLE>
</HEAD>
<BODY BGCOLOR=FFFFFF LINK=009040 VLINK=FF0000>
Remember, it's OK to misspell words; it may actually improve your results. Engines are case insensitive, and "camcorder" is a match to "camcorders" in regular expression land, so use plurals. Most search engines have help links that describe how they process meta tags. Visit and read those links.
SearchEngineWatch - How To Use HTML Meta Tags
SearchEngineWatch - Meta Tag Lawsuits
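Because those description and keywords strings get long, it can help to generate the META tags from a product list with a short script instead of typing them by hand. The following is only a sketch; the @products list is made up for illustration and would come from your own catalog.

#!perl
# Sketch: build description and keywords META tags from a product list.
# The @products list here is a made-up example.
@products = ("digital cameras", "dvd players", "mp3 players", "scanners");

$description = "We welcome you to our shop featuring "
             . join(", ", @products)
             . ", and many other great items.";

# keywords use hyphens instead of spaces, matching the example above
@hyphenated = @products;
foreach $word (@hyphenated) { $word =~ s/ /-/g; }
$keywords = join(", ", @hyphenated);

print "<META NAME=\"description\" CONTENT=\"$description\">\n";
print "<META NAME=\"keywords\" CONTENT=\"$keywords\">\n";
print "<META NAME=\"revisit-after\" CONTENT=\"20 days\">\n";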
Robots fall into two categories: the good guys and the bad guys. The good guys travel the Internet looking for content that is then made available to search engines. Bad robots look for things such as e-mail addresses, personal information, and potential copyright and trademark infringements.
http://www.jafsoft.com/misc/opinion/webbots.html
Good robots obey the rules and access your robots.txt file before they enter your site. Although most search engines say that they will eventually crawl down your site and find all your pages, our experience shows that is not the case. Bad spiders ignore your robots.txt file, so the robots.txt file is almost a moot point, but it's good practice to have one in any case. It goes in your home directory.
Example Spiders with IndigoPerl on your PC
spider (your local PC, once set up)
spider-here (your local PC, once set up)
Robots.txt File
# /robots.txt file for http://www.cassbeth.com/
# mail webmaster@cassbeth.com for constructive criticism

User-agent: *
Disallow: /cgi-bin/
Disallow: /surveys/
Disallow: /temp/
Disallow: /weblog/
Disallow: /logs/
Disallow: /postcard/card.cgi
Disallow: /postcard/cards
Disallow: /postcard/pictures
Disallow: /ssi/
Disallow: /ads/
Disallow: /buildpages/
Disallow: /idra/

User-agent: emailsiphon
Disallow: /
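If you write your own spider in Perl, you can make it one of the good guys with LWP::RobotUA, which ships with libwww-perl and fetches a site's robots.txt before requesting pages, refusing anything that file disallows. A minimal sketch follows; the agent name, contact address, and target URL are just examples.

#!perl
# Sketch of a well-behaved robot: LWP::RobotUA honors robots.txt.
use LWP::RobotUA;
use HTTP::Request;

# agent name and contact e-mail are examples; use your own
$ua = LWP::RobotUA->new('example-spider/0.1', 'webmaster@example.com');
$ua->delay(1);    # wait at least 1 minute between requests to a site

$req = new HTTP::Request 'GET' => 'http://www.cassbeth.com/';
$res = $ua->request($req);
if ($res->is_success) {
    print "fetched " . length($res->content) . " bytes\n";
} else {
    # pages blocked by robots.txt come back as "403 Forbidden by robots.txt"
    print "not fetched: " . $res->status_line . "\n";
}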
SPAMMING Search Engines
Never try to spam a search engine in hopes of getting a better listing. They will detect it with automated techniques and bar you from their database. Spamming examples include:
repeating the same word many times in a background color or within a comment
repeating a word more than 3 times in the meta tags (a small checker sketch follows this list)
submitting multiple pages with the same content
submitting too many URLs in a 24 hour period
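A quick way to stay on the right side of the repetition rule is to count how often each word shows up in your keywords META tag before you submit. This is only a sketch; the sample $keywords string is made up, so paste in your real keywords content.

#!perl
# Sketch: count keyword repetition in a META keywords string.
# The $keywords string below is a made-up sample.
$keywords = "digital-cameras, cameras, cameras, scanners, cameras, cameras";

%count = ();
foreach $word (split(/\s*,\s*/, $keywords)) {
    next unless $word;          # skip empty entries from trailing commas
    $count{lc $word}++;
}
foreach $word (sort keys %count) {
    print "$word appears $count{$word} time(s)";
    print "  <-- more than 3, trim it" if $count{$word} > 3;
    print "\n";
}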
The following spider scripts run under the IndigoPerl Web server on your local PC.
Spider
#!perl
# spider.cgi - fetch each URL listed in the left text area and save it
# locally under the matching file name from the right text area (.html added).
print "Content-type: text/html\n\n";
print "<html><head><title>Spider</title></head>\n";
print "<body><center><h1>Spider</h1></center>\n";

&parse_form;
&control_menu;
&spider;

print "</body>\n</html>\n";

####################
sub parse_form {
    if ($ENV{'REQUEST_METHOD'} eq 'GET') {
        # Split the name-value pairs
        @pairs = split(/&/, $ENV{'QUERY_STRING'});
    } elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
        # Get the input
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
        # Split the name-value pairs
        @pairs = split(/&/, $buffer);
    } else {
        &error('request_method');
    }
    foreach $pair (@pairs) {
        ($name, $value) = split(/=/, $pair);
        $name =~ tr/+/ /;
        $name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        # If they try to include server side includes, erase them, so they
        # aren't a security risk if the html gets returned. Another
        # security hole plugged up.
        $value =~ s/<!--(.|\n)*-->//g;
        $FORM{$name} = $value;
    }
    # Turn the whitespace-separated form input into parallel lists.
    $urls = $FORM{'urls'};
    $files = $FORM{'files'};
    @urls = split(/\s+/, $urls);
    $urls = "";
    foreach $url (@urls) { $urls .= "$url\n"; }
    @files = split(/\s+/, $files);
    $files = "";
    foreach $file (@files) { $files .= "$file\n"; }
    #print "<PRE>$urls</PRE><P>";
    #print "<PRE>$files</PRE><P>";
}

####################
sub control_menu {
    # Two side-by-side text areas: URLs to fetch and the file names to save them under.
    print "<FORM ACTION='spider.cgi' METHOD='GET'>\n";
    print "<PRE>\nURLs                                    Files\n";
    print "<TEXTAREA NAME='urls' WRAP=VIRTUAL ROWS='20' COLS='80'>$urls</TEXTAREA> ";
    print "<TEXTAREA NAME='files' WRAP=VIRTUAL ROWS='20' COLS='10'>$files</TEXTAREA>\n</PRE>\n";
    print "<CENTER><INPUT TYPE=submit VALUE='submit'> <INPUT TYPE='reset' VALUE='clear'></CENTER></FORM>\n";
}

####################
sub spider {
    use LWP::UserAgent;
    $ua = new LWP::UserAgent;
    $ua->agent("$0/0.1 " . $ua->agent);
    # $ua->agent("Mozilla/8.0");   # pretend we are a very capable browser
    $ai1 = $#urls;
    $ai2 = $#files;
    print "$ai1 - $ai2<BR>";
    for ($i = 0; $i <= $ai1; ++$i) {
        print "url $urls[$i] ";
        $req = new HTTP::Request 'GET' => $urls[$i];
        $req->header('Accept' => 'text/html');
        # send request
        $res = $ua->request($req);
        open(DATA, ">$files[$i].html");
        # check the outcome
        if ($res->is_success) {
            print DATA $res->content;
            print "$files[$i] saved<BR>";
        } else {
            print "Error: " . $res->status_line . "\n";
        }
        close(DATA);
    }
}
Spider-Here

#!perl
# spider-here.cgi - same form as spider.cgi, but instead of saving each
# fetched page to disk it prints the page straight back to the browser,
# rewriting relative image SRC attributes so they point at the original site.
print "Content-type: text/html\n\n";
print "<html><head><title>Spider</title></head>\n";
print "<body><center><h1>Spider</h1></center>\n";

&parse_form;
&control_menu;
&spider;

print "</body>\n</html>\n";

####################
sub parse_form {
    if ($ENV{'REQUEST_METHOD'} eq 'GET') {
        # Split the name-value pairs
        @pairs = split(/&/, $ENV{'QUERY_STRING'});
    } elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
        # Get the input
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
        # Split the name-value pairs
        @pairs = split(/&/, $buffer);
    } else {
        &error('request_method');
    }
    foreach $pair (@pairs) {
        ($name, $value) = split(/=/, $pair);
        $name =~ tr/+/ /;
        $name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        # If they try to include server side includes, erase them, so they
        # aren't a security risk if the html gets returned. Another
        # security hole plugged up.
        $value =~ s/<!--(.|\n)*-->//g;
        $FORM{$name} = $value;
    }
    $urls = $FORM{'urls'};
    $files = $FORM{'files'};
    @urls = split(/\s+/, $urls);
    $urls = "";
    foreach $url (@urls) { $urls .= "$url\n"; }
    @files = split(/\s+/, $files);
    $files = "";
    foreach $file (@files) { $files .= "$file\n"; }
    #print "<PRE>$urls</PRE><P>";
    #print "<PRE>$files</PRE><P>";
}

####################
sub control_menu {
    print "<FORM ACTION='spider-here.cgi' METHOD='POST'>\n";
    print "<PRE>\nURLs                                    Files\n";
    print "<TEXTAREA NAME='urls' WRAP=VIRTUAL ROWS='20' COLS='80'>$urls</TEXTAREA> ";
    print "<TEXTAREA NAME='files' WRAP=VIRTUAL ROWS='20' COLS='10'>$files</TEXTAREA>\n</PRE>\n";
    print "<CENTER><INPUT TYPE=submit VALUE='submit'> <INPUT TYPE='reset' VALUE='clear'></CENTER></FORM>\n";
}

####################
sub spider {
    use LWP::UserAgent;
    $ua = new LWP::UserAgent;
    $ua->agent("0.1 " . $ua->agent);
    #$ua->agent("$0/0.1 " . $ua->agent);
    # $ua->agent("Mozilla/8.0");   # pretend we are a very capable browser
    $ai1 = $#urls;
    $ai2 = $#files;
    print "$ai1 - $ai2<BR>";
    for ($i = 0; $i <= $ai1; ++$i) {
        print "url $urls[$i] ";
        $req = new HTTP::Request 'GET' => $urls[$i];
        $req->header('Accept' => 'text/html');
        # send request
        $res = $ua->request($req);
        #open(DATA, ">$files[$i].html");
        # check the outcome
        if ($res->is_success) {
            $page = $res->content;
            # cool mod that brings in the images
            $page =~ s/SRC=\"([^http:].*\")/SRC=\"$urls[$i]\/$1>/isg;
            print $page;
        } else {
            print "Error: " . $res->status_line . "\n";
        }
        #close(DATA);
    }
}
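The two scripts differ only in what they do with each fetched page: spider.cgi writes the page to the matching file name with a .html extension and reports it as saved, while spider-here.cgi skips the file and prints the page straight back to your browser, attempting to rewrite relative image SRC attributes so they resolve against the original site.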