
Good Search Engine Design

How to get things right when building a Search Engine

It has been observed that search engines often make the same mistakes as each other, and could do much better.

In the early days, Google was the best, but more recently Google has fallen out of favour because the quality of the search results has become so bad. Finally, Google search quality had fallen so badly that it had to be written about! People are calling for Google to be Replaced! In looking for alternatives to Google, it has become noticeable that some search engines have original ideas of their own, but many seem to be copycats of Google's mistakes. It's bizarre, as anyone with a grain of sense should be able to see that some of these things were daft all along.

One of the most progressive places seems to be Blekko, which still makes some mistakes, but at least makes some attempts to avoid being Google and to avoid making Google's mistakes. Also, Duck Duck Go is a search engine which might as well be "Don't Do Google", and has an astonishing "we're really not Google" style privacy policy! Good searches, but the opposite privacy policy to Google whose policy seems to be "Sod all of you little people and let's mimic Facebook".

As a general guide to reviewing search engines and their performance, here are a few Good Practice guidelines on how to run a search engine properly. These are things which you should try to get right when building a new search engine:

* Relevant results: It's important to list search results that are relevant to the search. This is in contrast to listing irrelevant rubbish. It's a surprisingly common mistake.

* Avoid webspam: It's a known fact that a small minority of websites are cheats. Some of them do very bad things. These can be reduced or even eliminated from search results, and importantly without mucking-up well tried and trusted websites which have been getting things right for years!

* If some website is the definitive place for some particular search item, it's important that it appears at Number 1. Case in point: Zyra is www.zyra.org.uk which should appear first on a search for "zyra". If it doesn't, the search engine is seriously at fault. The site has been around since 2000 and has over 9000 pages. Note: When Google became so rubbishy that it started failing to get this right, I felt the need to create this page!

* Being honest with people's page names. Some search engines don't respect webmasters' naming of pages. This is a problem, but it's easy to get right. The webmaster's page title is usually between <title> and </title> and there should be only one title per page. That title is the title of the page. There is no need to have Google Page-Name Mangling. Other search engines please note: Don't copy Google's mistakes. Bing at one time used to mangle page titles, but they have since cleaned up their act. Well done to Bing!
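Respecting the webmaster's title can be sketched in a few lines. This is a minimal illustration using Python's standard html.parser module; the names TitleParser and page_title are made up for this example, and a real indexer would also have to cope with character encodings and badly broken markup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first title counts; any later ones are ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html):
    """Return the page's own title with whitespace tidied, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None
```

The point of the sketch is that the listing shows whatever page_title returns, and nothing is substituted from headings, anchor text, or guesswork.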

* No Auto-suggest. When someone is typing in a search query in the box, it's no fun if the search engine tries to preempt what they're typing by putting up misleading and distracting guesses based on what the hoi polloi have put in previously. (You may notice some of us have to avert our eyes from the screen while putting in a search to avoid being distracted by the misleading stuff coming up. It should not be necessary to do this. Search engines could easily avoid having auto-suggest, or at least make it optional!).

Note: Google, to their credit, have at least allowed people to opt-out of the ridiculous distracting auto-suggest. To get this, instead of putting www.google.com , put www.google.com/webhp?complete=0&hl=en

* No Autocomplete. Try to avoid being tempted to do dumbed-down searching by making inane assumptions about people's search queries. If someone searches for something, that's what they should get results for. It is no good making naff assumptions that they mean something different. That kind of nanny-state style behaviour produces a search engine for idiots.

There is a proper autocomplete in the Linux command-line. But this is opt-in. If the person types half of a command they can opt to have the Linux machine complete it by pressing TAB, and then the machine will try to fill in the rest of the command/filename. (If the user doesn't press TAB, then no auto-complete).

* No Auto-correct. Auto-correct is auto-dimwit-mode. Try to avoid this.

If you feel sorry for people who can't spell because they are undereducated, OK, make suggestions for corrections, but if you suggest something like "Don't you mean <whatever>?", the person should be able to say "Yes, you're right! Thanks!" or "No, don't be silly. I meant what I put!".
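The difference between suggesting and auto-correcting can be shown in a small sketch. It assumes a hypothetical word list and uses difflib from the Python standard library to find a near match; the function names suggest and resolve, and the accept callback standing in for the person's yes/no answer, are illustrative only.

```python
import difflib


def suggest(query, vocabulary):
    """Return a possible correction for the query, or None.

    This only ever offers a suggestion; it never changes the query.
    """
    if query in vocabulary:
        return None
    close = difflib.get_close_matches(query, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else None


def resolve(query, vocabulary, accept):
    """Search for the suggestion only if the person explicitly accepts it.

    accept is a callback taking the suggested text and returning True
    ("Yes, you're right!") or False ("No, I meant what I put!").
    """
    suggestion = suggest(query, vocabulary)
    if suggestion is not None and accept(suggestion):
        return suggestion
    return query
```

Whatever the person typed is what gets searched unless they say otherwise, which is the whole point.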

* No regional assumptions. The Internet is global. It's no good pandering to China for example by having a China-specific search system based on the tyrannical government's views on censorship. Such things are very bad and could lead to governments routinely censoring the Internet. It is extraordinarily bad manners for the primary website of a search engine to pry into what IP address the customer is using and then guess where they are situated and then redirect to some local discriminatory version of the search. It's a dangerous precedent and should be avoided. In earlier times, Altavista was Criticised and later reformed. They made regional assumptions, which I said at the time were a very bad idea. Google also makes this mistake, but you can put in local variants for areas other than the one you are in. For example, www.google.co.cr is Google Search in Costa Rica, and has Costa Rica specific preferences. To get Google no-regional-discrimination results, put /ncr on the end of the URL. So, for a Google non-regional search, do www.google.com/ncr , and for a Google search with no auto-suggest and no regional discriminatory policies it's www.google.com/webhp?complete=0&hl=en

In any of these searches, the default should be GLOBAL. It is The Internet, or to put it another way, The International Net!

* Aim high. It's worth considering that although the Internet is for everyone, the clever people are the leading edge. If you have searching that is optimised for the clever people to be able to do lots of clever things in searches, it means other people will aspire to their ability.

In contrast, if you aim low, (having a primary customer-base of low intelligence people), it will appeal only slightly better to unintelligent people, but it will be a disaster for the clever people. Search engines should be precision instruments, and if you blunt the options, the search engine becomes increasingly irrelevant.

Make a search engine geek-friendly, and you have a great many clever people recommending you to other people. Search experts should be able to put in some complex parameters and get the refined results they are looking for. If it doesn't work first time, they'll try various different things until they get it right.

* Search results should be based on Content, not popularity! Popularity is a secondary thing.

* Advanced Search. The period around 2005-2006 was the best period, when Google had a proper "Advanced Search". In those days, Google was still a decent search engine. Advanced Search, on whichever search engine, should allow multiple wildcards, search phrases, negative search options, and optional 100 or more results. Search Within A Search is also a good feature to include.

Such advanced searching is an absolute requirement for finding some things. For example, the Game of GO is typically difficult to find by bog-standard search because the name is GO which is the same combination of letters as a common word, and other words in the game are such things as "stones" and "atari". In a sequence of advanced searches, someone can find the Game of Go by successively eliminating commonplace inappropriate result sets. For example, the final successful search would be the equivalent of 'search for "go" and "atari" but without any reference to any of the known models of Atari computers (specified), but with the words "game", "stones", "black", and "white"'.

Some of the early period Google search options from when Google was still any good were such things as "[term] site:www.etc" which meant "show me all the results for [term] on the website www.etc". Also, "link:[site]" which would show a variety of pages which had links on them linking to [site].
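The kind of advanced search described above (required words, negative words, and a site restriction) can be sketched as a simple filter. This is only an illustration under stated assumptions: the page layout (a dict with "url" and "text"), the function names words and matches, and the example excluded terms are all invented here, and a real engine would work from an index rather than scanning page text.

```python
import re
from urllib.parse import urlparse


def words(text):
    """Split text into a set of lower-case words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(page, required=(), excluded=(), site=None):
    """Return True if the page satisfies the advanced search.

    page is assumed to be a dict with "url" and "text" keys.
    """
    # Site restriction, in the spirit of "[term] site:www.etc".
    if site is not None and urlparse(page["url"]).hostname != site:
        return False
    found = words(page["text"])
    # Every required word must be present.
    if any(term.lower() not in found for term in required):
        return False
    # No excluded word may be present.
    return not any(term.lower() in found for term in excluded)


# Hypothetical pages for the Game of Go example.
pages = [
    {"url": "http://www.example.org/go.html",
     "text": "The game of Go: black and white stones, and atari means "
             "a stone can be captured."},
    {"url": "http://www.example.com/retro.html",
     "text": "Atari 2600 game: go and collect the black and white stones."},
]

go_pages = [
    p["url"] for p in pages
    if matches(p,
               required=["go", "atari", "game", "stones", "black", "white"],
               excluded=["2600", "st"])  # stand-ins for the specified models
]
```

Both example pages contain every required word, so it is the negative terms which remove the unwanted one and leave only the page about the board game in go_pages.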

* Fair treatment of large eclectic websites. This is something Google has fallen foul of in 2012. Admittedly some websites are about a specific subject, but there are also some websites about all kinds of different things in the same site. Wikipedia is a good example of this, but there are various other sites which have quality write-ups about diverse subjects. It is no good assuming that all websites must be about specific narrow fields, as many quality websites are about wide varieties of different things. If the equivalent prejudice were applied to TV channels, Google would list The Horror Channel and The Gardening Channel and The Jewellery Channel and The Fishing Channel, but would probably ban the BBC because they show programmes about a wide range of subjects. This would of course be ridiculous, and it's obvious in the analogy of TV channels, so let's see fair play on websites as well!

Often a web PAGE is about a specific thing, for example The Problems to do with Gas Appliances during an Electricity Power Cut! But the website may have loads and loads of pages about all kinds of things many of which are about entirely different things.

* Including everyone. When building a search engine, the obvious thing to do is to get a copy of the entire Internet by having a bot follow all links from all known websites and put a local copy on the hard disc drive array/cluster. This is obvious, but oddly there are some search engines that don't do this, and so they end up with a severely restricted set of pages.
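The "follow all links from all known websites" idea is, at its simplest, a breadth-first crawl. The sketch below assumes a fetch function supplied by the caller which returns a page's HTML or None; the names LinkParser and crawl and the limit parameter are made up for illustration, and a real crawler would also need robots.txt handling, politeness delays, and far more storage than a Python dict.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seeds, fetch, limit=1000):
    """Breadth-first crawl from the seed URLs, keeping a local copy of each page.

    fetch(url) must return the page's HTML as a string, or None on failure.
    Stops once `limit` pages have been stored.
    """
    store = {}
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(store) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            target = urldefrag(urljoin(url, href)).url
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return store
```

Because fetch is passed in, the sketch can be tried against a small dictionary of fake pages before it is ever pointed at the real Internet.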

* Allowing people to "add a site". When someone sets up their own website (see get your own website), they would like it to be included in your search. Therefore it seems a good idea to allow people to "enter your website address here" or "submit your site" and then that's another website to explore and evaluate.

It's quite natural that if a webmaster discovers your search engine doesn't include their website, the next thing they are going to do is to look through your search engine website starting at the front page and try to find an easy way to "include me". You don't need their personal details and it's often rude to ask. Just ask for a web address. If it's bogus, you can soon discover that fact and not have to do much.
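An "add a site" box needs nothing more than the address itself. As a rough sketch, assuming the function name normalise_submission and these particular checks (neither comes from any real search engine), a submitted address could be tidied up and obviously bogus entries thrown out before the crawler ever visits:

```python
from urllib.parse import urlparse


def normalise_submission(address):
    """Return a crawlable URL for a submitted address, or None if clearly bogus.

    No personal details are involved: the web address is the only input.
    """
    address = address.strip()
    if not address:
        return None
    # People often type "www.example.org" without a scheme.
    if "://" not in address:
        address = "http://" + address
    try:
        parts = urlparse(address)
        host = parts.hostname
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not host:
        return None
    if "." not in host or any(ch.isspace() for ch in host):
        return None
    return parts.geturl()
```

Anything which passes simply joins the queue of sites to explore and evaluate; whether the site really exists is discovered when the crawler tries to fetch it.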

* Allowing people to "contact us". It may seem a detail, but systems that can't be contacted are a problem. A company with no complaints procedure is a company which can expect terrorism. The attackers didn't start off being terrorists; they became terrorists when they found they had no other way of putting in complaints! Also see How to Complain

Examples: Most search engines that are rivals to Google have "contact us" e-mail links. Some only have contact forms, but they still have methods of people being able to write to them. Also, I have found most will reply. Google (2012) seems to have no way of contacting them or putting in a complaint. That is very bad. If you know any answers, please write in. A few years earlier, Yahoo got a terrible review here because of it being impossible to contact them to complain about a serious abuse issue. However, when Yahoo opened a London office and got a phone number, the complaint was diplomatically resolved. As a result, there's a much better Yahoo Review here. Well Done to Yahoo!

* Avoid Facebook! Note: The privacy-phishing anti-social networking site Facebook is no friend of freedom. It's important to rid the Internet of it. The correct approach of a search engine is to ban Facebook completely, or at the very least have a "boycott Facebook" tickbox option. Facebook can be considered extremely detrimental to the Internet, so you're setting a good example by banishing it.

* Avoid search engine cheats. There are practical ways of defeating bad practices of SEO "search engine obfuscation" while at the same time allowing reasonable SEO "search engine optimisation". The important thing is being open and up-front about it. A long time ago, when people invented Law, they realised quite quickly it was severely open to abuse, so they introduced the principle of fair trial and convictions being "beyond reasonable doubt". In contrast, there are some search engines that have arbitrary justice tried in internal kangaroo courts and punishments doled out without the accused having any chance to prove their innocence or even get to hear that they have been convicted. As well as being a problem that It's Not Fair, there's also the problem that the mere presence of such a "suss-law" fake justice system produces a climate of suspicion and fear, like in the time of the Witchfinder. The induced negativity can result in animosity from people.

* Having a decent privacy policy. Being fair to people rather than becoming the new Spanish Inquisition Stasi hellbent on spying on your every move.

* Webmaster Guidelines: It's important that sets of website design guidelines are not false. I have noticed the 2012 Google webmaster guidelines are quite at odds with the behaviour of Google Search. It's hypocrisy. They say one thing about their policy, but behave quite differently.

* Long term versus Short term: For news, up-to-the-minute material should have priority. For everything else, the longer-lasting a web page is, the more likely it is to be good. This is again something which Google has got wrong in 2012. It would make more sense to prioritise websites with proper deep-linking policy and with "evergreen" status. There are things I've explained years ago, and they are still true. Where I've made mistakes I have corrected them and the pages have become refined versions. For example, pi is still what it was. Geostationary orbit is still where it was, and apostrophes, they're still the way they were. Having a "freshness layer" is absurd for many things, the exception being news. Then again, if you start ignoring history, history may come back to bite you.

* Avoid government interference. A new search engine should be set up in a country that doesn't have an overbearing government that's at war with just about everyone. As surely as a person can choose which country to live in, a new search engine company can choose to set up somewhere there's freedom from various political problems. It's also good to be in a tax haven, but a search engine also needs a good Internet connection. One of the best ways to solve some of these issues is to have distributed systems, and have the data acquisition going on in an ideal location for that, the data processing going on somewhere else, and the websites for people to visit to do searches being in a distributed network of local centres. Crucially, the HQ needs to be in an independent regime, one of the FREE places in the world, i.e. not Germany just before the Second World War, not the USA during the stupid "war on terrorism and everything else" bad patch, not Communist China, etc.

* Avoid being bought-out. This is an important consideration. People build up trust in the company, but if the company is put on the stockmarket, it sells-out that trust. It is possible to preserve trustworthiness and gain higher levels of trust by making the company impossible to buy out. This may seem silly, making a company unsaleable, but the best profit is from the ongoing making of money, not from being able to cash the company in for scrap value. Some things can't be bought out. Notably, Linux can't be bought out by Microsoft or any other software monopolistic campaign. This makes Linux remain trusted by the expanding following of folks.

With a search engine it's especially important to avoid being bought out. The customer-base invests a vast amount of their own resources in the place, and it's a breach of trust selling them out.

So, if at all possible, it's worth setting up a creative-commons open-source style arrangement so people can be absolutely sure the company can't be taken over or bought out.

* Avoid being floated on the Stock Exchange. Besides the fact that many stockmarket flotations are a disaster, there's also the problem that a company which becomes a public limited company can have its stock sold at any time and is therefore subject to market fluctuations. In the case of Internet companies this is especially volatile. Also, although private companies can be run with some sympathy and ethics, corporate companies tend to behave like psychopaths. It's something to do with the change of priorities in the management system. Somehow the money becomes more important than the principles which have made the company what it is to start with.

Initial stockmarket flotations may or may not be a success, but it's what happens after that which matters. Note: Iceland Frozen Food went through its "bad patch" during the years it was a public corporation on the stockmarket. However Iceland recovered after it bought back its freedom and became a private company again!

If I had shares in Google (July 2012) I would be dumping them!

Incidentally, you can TEST how good a search engine is. There are some pages which are the only page about that thing, and therefore that's what should come up top on a search for that. For example, my page about Economy7 Gas is, as far as I know, the only page which explains why there's no such thing as Economy7 Gas! So, if a search engine is any good, on a search for "Economy7 Gas", that's what should come up, not miscellaneous artificially hyped pages trying to sell you electricity or gas.
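Such a test is easy to automate. In the sketch below, search stands for any function which takes a query and returns a ranked list of result addresses; that interface, the function names, and the address given for the Economy7 Gas page are assumptions for illustration (the real address of that page is not stated here, so a placeholder is used).

```python
def top_result_is(search, query, expected_url):
    """Return True if expected_url is the first result for the query."""
    results = search(query)
    return bool(results) and results[0] == expected_url


def failed_checks(search, checks):
    """Return the queries whose definitive page did not come up top.

    checks is a list of (query, expected_url) pairs.
    """
    return [query for query, url in checks
            if not top_result_is(search, query, url)]


# Known-answer checks: each query has one definitive page.
CHECKS = [
    ("zyra", "www.zyra.org.uk"),
    # Placeholder address, not the page's real location.
    ("Economy7 Gas", "http://example.org/economy7-gas-placeholder"),
]
```

An empty list from failed_checks means the search engine got every known-answer query right; anything in the list is a query where the definitive page was pushed out of first place.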

Picture of some large-scale computers - reused from Kroll Ontrack the data recovery company.