Jan 28, 2008
Because there is Google. Really?
When you run a random search, Yahoo does not perform as badly as many people think. However, the Yahoo brand itself does not mean search to most of its audience. Why do people search on Yahoo? Because most of them are users of Yahoo Mail, News, or Finance, and they use Yahoo search for convenience.
How many people know http://search.yahoo.com? It is sad, because Yahoo is using Civic to compete with Toyota. How can a sub-brand win such a critical battle?
Yahoo got into this awkward situation because the new internet population changed the way it uses the Web, while Yahoo remained unchanged for too long. When I say “change” I mean revolutionary, or at least dramatic, change!
Yahoo missed the deals to acquire eBay and Google.
Yahoo bought GeoCities, which was sort of Web 1.5, but never turned it into Web 2.0.
Now, what Yahoo needs is the vision and execution to start something different that changes the whole game, or it will slowly slide deeper into the trap.
Jan 23, 2008
Search comes naturally when one is facing oceans of information.
Google dominates internet search for the following reasons:
- A well-recognized internet search service featuring:
  - Good coverage of most web content
  - Good understanding of user intent and highly relevant results
  - Freshness: discovery of new content and fast indexing
  - High availability and fast serving speed
  - A relatively stable and steadily improving presentation of results
- Close tracking of search users' behavior to deliver relevant online ads
- Profit sharing with other publishers and sites
In this document, I am going to present another approach to delivering a good search experience.
Search has too many different aspects. That is why it is so hard to start a really functioning search service for the internet, or even for a large corporate site.
As we all know, a search service includes the following components:
- a crawler and document repository
- a content analyzer, to extract link information and other metadata from the crawled documents
- an indexer, to build a keyword index for the documents
- search engine clusters, to serve the indexed data
- a proxy layer (present in all modern search engines) for load balancing, caching and aggregation
- a ranking module, to order the matching results
- a frontend, usually instrumented with tracking code to monitor end-user behavior, which can potentially feed back into the content system
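To make the indexer, serving, and ranking components a little more concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: the `documents` dictionary stands in for the crawler's repository, ranking is a plain term-frequency sum, and the names `tokenize`, `build_index`, and `search` are made up for this example. A real engine is far more involved.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Indexer: map each keyword to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Serve a query: keep documents that contain every query term,
    then rank them by the summed term frequency (a toy ranking module)."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    scored = [(sum(p[doc_id] for p in postings), doc_id) for doc_id in matching]
    # Highest score first; ties are broken by document id for stable output.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical stand-in for the crawled document repository.
documents = {
    "doc1": "Yahoo search and Yahoo mail",
    "doc2": "Google search engine",
    "doc3": "search engine index and search ranking",
}
index = build_index(documents)
print(search(index, "search engine"))
```

The crawler, content analyzer, proxy layer, and tracking frontend are left out on purpose; each of them would wrap around this core.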
As of now, the major challenges of building such a service lie in the following aspects:
- scalability
- metadata extraction from unstructured data
- understanding user intent, and ranking
- spam filtering
- content updating
I will try to discuss these issues in this series :-)
References:
1. http://en.wikipedia.org/wiki/Index_(search_engine)
Jan 23, 2008
You have a file with the following format:
type-1 2008/01/12 20:04:40 xxx yyy
type-2 2008/01/11 10:05:20 aaa bbb
type-2 2008/01/01 11:15:10 ccc ddd
You want to count, for each type, how many rows have a timestamp later than 2008/01/10 00:00:00:
BEGIN {
    # Cutoff timestamp. mktime() is a gawk extension.
    t0 = mktime("2008 01 10 00 00 00");
    ta1 = 0;
    ta2 = 0;
}
{
    # With the default field separator, $2 is the date and $3 is the time.
    split($2, ymd, "/");
    split($3, hms, ":");
    t = mktime(sprintf("%s %s %s %s %s %s", ymd[1], ymd[2], ymd[3], hms[1], hms[2], hms[3]));
    # mktime() returns -1 when the timestamp cannot be parsed.
    if (t > 0 && t > t0) {
        if ($1 == "type-1") { ta1 = ta1 + 1 }
        if ($1 == "type-2") { ta2 = ta2 + 1 }
    }
}
END {
    print "type-1", ta1, "type-2", ta2;
}
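As a cross-check, the same counting task can be sketched in Python. This is a minimal sketch, assuming whitespace-separated fields with the date in the second field and the time in the third; `count_after`, `CUTOFF`, and `sample` are names made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Midnight at the start of 2008/01/10; rows strictly later than this are counted.
CUTOFF = datetime(2008, 1, 10)

def count_after(lines, cutoff=CUTOFF):
    """Count, per type (first field), the rows whose timestamp is later than cutoff."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not enough fields to hold a type, a date and a time
        try:
            stamp = datetime.strptime(fields[1] + " " + fields[2], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # skip rows whose timestamp cannot be parsed
        if stamp > cutoff:
            counts[fields[0]] += 1
    return counts

# The three sample rows from the format description.
sample = [
    "type-1 2008/01/12 20:04:40 xxx yyy",
    "type-2 2008/01/11 10:05:20 aaa bbb",
    "type-2 2008/01/01 11:15:10 ccc ddd",
]
print(dict(count_after(sample)))
```

Unlike the awk script, this version does not hard-code the type names, so it keeps working if more types show up in the file.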
Jan 03, 2008
The following command prints the sum of column 1, the sum of the column 1 values below 10, and the ratio of the two :-)
awk -F "\t" 'BEGIN { t1=0; t2=0 } { if ($1<10) t2+=$1; t1+=$1 } END { print t1, t2, (t1 ? t2/t1 : 0) }' filename
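For comparison, here is a rough Python sketch of the same column arithmetic. It assumes tab-separated input and, unlike awk, it skips rows whose first column is not a number instead of treating them as 0; `column_stats` and its parameters are hypothetical names chosen for this example.

```python
def column_stats(lines, sep="\t", threshold=10):
    """Return (total, small, ratio) for the first column:
    total = sum of all values, small = sum of the values below threshold,
    ratio = small / total, or None when total is 0."""
    total = 0.0
    small = 0.0
    for line in lines:
        first = line.rstrip("\n").split(sep)[0]
        try:
            value = float(first)
        except ValueError:
            continue  # skip headers and other non-numeric rows
        total += value
        if value < threshold:
            small += value
    ratio = small / total if total else None
    return total, small, ratio

# Three made-up rows: 3 and 5 are below the threshold, 12 is not.
rows = ["3\ta", "12\tb", "5\tc"]
print(column_stats(rows))
```

To run it on a real file, pass the open file object, for example `column_stats(open("filename"))`.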
Jan 03, 2008
To use LWP through a SOCKS proxy, you need LWP::Protocol::http::SocksChain.
There are a number of such protocol packages, but installing any one of them usually requires several other prerequisites.
Here is a process I have tested. If you have superuser privileges, you can ignore every mention of "/path/to/install".
0. $ sudo -s
0.1 setenv PERL5LIB /path/to/install/lib/site_perl
1. INSTALL OPENSSL
1.1 http://www.openssl.org/source/,
download http://www.openssl.org/source/openssl-0.9.8g.tar.gz
1.2 Unzip openssl-0.9.8g.tar.gz
1.3 Read INSTALL
1.4 $ ./config
1.5 $ make
1.6 $ make test
1.7 $ make install
2. INSTALL Net::SSLeay
2.1 Download it from http://search.cpan.org/~flora/Net-SSLeay-1.32/,
download http://search.cpan.org/CPAN/authors/id/F/FL/FLORA/Net-SSLeay-1.32.tar.gz
2.2 Unzip Net-SSLeay-1.32.tar.gz
2.3 $ perl Makefile.PL PREFIX=/path/to/install
2.4 $ make install # the Makefile links with -lz; on certain systems you can remove that flag
2.5 $ cd examples
2.6 $ get_page.pl www.cryptsoft.com 443 /
3. INSTALL IO::Socket::SSL
3.1 Download it from http://search.cpan.org/dist/IO-Socket-SSL/,
download http://search.cpan.org/CPAN/authors/id/S/SU/SULLR/IO-Socket-SSL-1.12.tar.gz
3.2 Unzip IO-Socket-SSL-1.12.tar.gz
3.3 $ perl Makefile.PL PREFIX=/path/to/install
3.4 $ make
3.5 $ make test
3.6 $ make install
4. INSTALL Net::SC
4.1 Download it from http://search.cpan.org/~gosha/Net-SC-1.20/,
download http://search.cpan.org/CPAN/authors/id/G/GO/GOSHA/Net-SC-1.20.tar.gz
4.2 Unzip Net-SC-1.20.tar.gz
4.3 $ perl Makefile.PL PREFIX=/path/to/install
4.4 $ make
4.5 $ make test
4.6 $ make install
5. INSTALL LWP::Protocol::http::SocksChain
5.1 Download it from http://search.cpan.org/~gosha/LWP-Protocol-http-SocksChain-1.4/,
download http://search.cpan.org/CPAN/authors/id/G/GO/GOSHA/LWP-Protocol-http-SocksChain-1.4.tar.gz
5.2 Unzip LWP-Protocol-http-SocksChain-1.4.tar.gz
5.3 $ perl Makefile.PL PREFIX=/path/to/install
5.4 $ make
5.5 $ make test
5.6 $ make install
Now you can use LWPGet, a small script that fetches the URL given as its first argument through the SOCKS proxy:
#!/usr/bin/perl -w
require 5.8.0;
use strict;
use lib "/path/to/install/lib/site_perl/5.8.7";
use LWP::Simple;
use LWP::UserAgent;
use LWP::Protocol::http::SocksChain;

# Route all http requests through the SOCKS chain.
LWP::Protocol::implementor( http => 'LWP::Protocol::http::SocksChain' );
@LWP::Protocol::http::SocksChain::EXTRA_SOCK_OPTS = (
    Chain_Len       => 1,
    Debug           => 0,
    Random_Chain    => 1,
    Chain_File_Data => [
        'ip_of_socks_proxy:port:::5',
    ],
    Auto_Save       => 0,
    Restore_Type    => 0 );

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');

my $Url = $ARGV[0] or die "Usage: $0 URL\n";
my $response = $ua->get($Url);
if ($response->is_success) {
    my $page = $response->content;
    print $page, "\n";
} else {
    print STDERR "Failed to get $Url: ", $response->status_line, "\n";
}