<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>answersy.com Blog &#187; Design Doc</title>
	<atom:link href="http://answersy.com/zchen/index.php/category/it-related/design-doc/feed/" rel="self" type="application/rss+xml" />
	<link>http://answersy.com/zchen</link>
	<description>Got questions? Got answers!</description>
	<lastBuildDate>Thu, 20 Oct 2011 04:28:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Data Warehouse (1)</title>
		<link>http://answersy.com/zchen/2008/04/09/data-warehouse-1/</link>
		<comments>http://answersy.com/zchen/2008/04/09/data-warehouse-1/#comments</comments>
		<pubDate>Wed, 09 Apr 2008 21:45:18 +0000</pubDate>
		<dc:creator>zchen</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Design Doc]]></category>
		<category><![CDATA[IT Related]]></category>
		<category><![CDATA[Random Thoughts]]></category>

		<guid isPermaLink="false">http://answersy.com/zchen/2008/04/09/data-warehouse-1/</guid>
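		<description><![CDATA[Infrastructure architecture: the infrastructure should be easy to use and sufficient for the job. The key considerations are total data volume, inbound and outbound traffic, and growth rate. In other words: how much data the warehouse will ultimately hold, what input/output load it will have to bear, and how the total volume and the I/O load will change over time. A common choice is an off-the-shelf Oracle RAC configured for OLAP; one can also assemble a system from a free DBMS plus middleware, or even build the whole storage system in-house, for example with Hadoop. Clearly, Oracle is faster to set up but relatively expensive; the other options require a considerable investment of staff, but they let the team master the underlying technology and offer more flexibility. Raw data cleansing: mainly filtering noise, tagging, and filling in missing parts. Data import: distribution, storage, indexing. Data aggregation: computing statistics according to business or analytical needs. Data warehouse design. Identify the business process: one must have a deep understanding of the specific business activity itself. Identify the elements and dimensions of the business process: understand the people, objects, events, and activities in the system and the logical relationships among them. Decide how to describe each element, event, and activity with data parameters. A good data warehouse should have a complete set of underlying dimension designs, and the applications built on top should reuse these basic dimension definitions as much as possible. Determine the grain of the business process: at what macro or micro level the business process is described. Determine the quantitative facts: for example, sales revenue is the measure for retail.]]></description>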
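			<content:encoded><![CDATA[<p><strong>Infrastructure architecture</strong></p>
<p>The infrastructure should be easy to use and sufficient for the job. The key considerations are total <strong>data volume</strong>, <strong>inbound and outbound traffic</strong>, and <strong>growth rate</strong>. In other words: how much data the warehouse will ultimately hold, what input/output load it will have to bear, and how the total volume and the I/O load will change over time. A common choice is an off-the-shelf Oracle RAC configured for OLAP; one can also assemble a system from a free DBMS plus middleware, or even build the whole storage system in-house, for example with Hadoop. Clearly, Oracle is faster to set up but relatively expensive; the other options require a considerable investment of staff, but they let the team master the underlying technology and offer more flexibility.</p>
<ul>
<li><strong>Raw data cleansing</strong>: mainly filtering noise, tagging, and filling in missing parts.</li>
<li><strong>Data import</strong>: distribution, storage, and indexing.</li>
<li><strong>Data aggregation</strong>: computing statistics according to business or analytical needs.</li>
</ul>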
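<p>The sizing questions above (how much data, and how fast it grows) can be made concrete with a small projection. This is only a rough sketch: it assumes compound monthly growth, and the function names and units (terabytes, months) are chosen here for illustration, not taken from any particular product.</p>

```python
def project_capacity(initial_tb, monthly_growth_rate, months):
    """Return the projected volume (TB) at the end of each month, compounding monthly."""
    volumes = []
    volume = initial_tb
    for _ in range(months):
        volume *= 1.0 + monthly_growth_rate
        volumes.append(volume)
    return volumes


def months_until_full(initial_tb, monthly_growth_rate, capacity_tb):
    """Return the number of whole months until the volume first exceeds capacity_tb."""
    if initial_tb > capacity_tb:
        return 0
    if monthly_growth_rate <= 0:
        return None  # with zero or negative growth the warehouse never fills
    volume, months = initial_tb, 0
    while volume <= capacity_tb:
        volume *= 1.0 + monthly_growth_rate
        months += 1
    return months
```

<p>The same kind of projection can be run for inbound and outbound traffic, which often grow at a different rate than the stored volume.</p>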
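<p><strong>Data warehouse design</strong></p>
<ul>
<li><strong>Identify the business process</strong>: one must have a deep understanding of the specific business activity itself.</li>
<li><strong>Identify the elements and dimensions of the business process</strong>: understand the people, objects, events, and activities in the system and the logical relationships among them. Decide how to describe each element, event, and activity with data parameters. A good data warehouse should have a complete set of <strong>underlying dimension designs</strong>, and the applications built on top should <strong>reuse</strong> these basic dimension definitions <strong>as much as possible</strong>.</li>
<li><strong>Determine the grain of the business process</strong>: at what macro or micro level the business process is described.</li>
<li><strong>Determine the quantitative facts</strong>: for example, sales revenue is the measure for retail.</li>
</ul>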
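<p>To illustrate how shared dimensions, a chosen grain, and a quantitative fact fit together, here is a minimal in-memory sketch. The tables, field names, and numbers are all hypothetical; a real warehouse would hold them in a DBMS and aggregate in SQL.</p>

```python
from collections import defaultdict

# Hypothetical dimension tables: shared definitions that every report reuses.
STORES = {1: {"city": "Seattle"}, 2: {"city": "Portland"}}
PRODUCTS = {10: {"category": "audio"}, 11: {"category": "video"}}

# Hypothetical fact table at the grain "one row per store, product, and day".
# The quantitative fact (measure) is revenue.
SALES = [
    {"store_id": 1, "product_id": 10, "day": "2008-04-01", "revenue": 120.0},
    {"store_id": 1, "product_id": 11, "day": "2008-04-01", "revenue": 80.0},
    {"store_id": 2, "product_id": 10, "day": "2008-04-01", "revenue": 50.0},
    {"store_id": 2, "product_id": 10, "day": "2008-04-02", "revenue": 70.0},
]


def revenue_by(facts, key_fn):
    """Sum the revenue measure over whatever dimension attribute key_fn selects."""
    totals = defaultdict(float)
    for row in facts:
        totals[key_fn(row)] += row["revenue"]
    return dict(totals)


# Two different reports reuse the same fact rows and dimension definitions.
by_city = revenue_by(SALES, lambda r: STORES[r["store_id"]]["city"])
by_category = revenue_by(SALES, lambda r: PRODUCTS[r["product_id"]]["category"])
```

<p>Because the facts are stored at a fine grain, coarser reports (by city, by category) can be derived from them, while the reverse is not possible.</p>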
]]></content:encoded>
			<wfw:commentRss>http://answersy.com/zchen/2008/04/09/data-warehouse-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Realtime Personalized Search (5)</title>
		<link>http://answersy.com/zchen/2008/04/03/on-realtime-personalized-search-5/</link>
		<comments>http://answersy.com/zchen/2008/04/03/on-realtime-personalized-search-5/#comments</comments>
		<pubDate>Thu, 03 Apr 2008 17:28:17 +0000</pubDate>
		<dc:creator>zchen</dc:creator>
				<category><![CDATA[Design Doc]]></category>
		<category><![CDATA[IT Related]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://answersy.com/zchen/2008/04/03/on-realtime-personalized-search-x/</guid>
		<description><![CDATA[Let's talk about "real time." When we say real time, we mean being able to serve new content within a few minutes, or even less, of its becoming available. In any database, search and data manipulation conflict with each other. If one optimizes for search, then insertion, updating, and deletion will suffer, and vice versa. [...]]]></description>
			<content:encoded><![CDATA[<p>Let's talk about "<strong>real time</strong>." When we say real time, we mean being able to serve new content within a few minutes, or even less, of its becoming available.</p>
<p>In any database, search and data manipulation conflict with each other. If one optimizes for search, then insertion, updating, and deletion will suffer, and vice versa. Unless the dataset is relatively <strong>small</strong>, it is very difficult to achieve good performance on both.</p>
<p>General web search serves billions of documents, so it takes a long time to fully build the index; as a result, it is not feasible to serve "new" content very quickly. To maintain a certain level of freshness, major web search engines often adopt a <strong>separate pipeline</strong> to handle a small set of frequently updated content. This <strong>fastlane</strong> has far fewer documents and can be rebuilt several times a day. When a search is conducted, a proxy blends results from the main index with those from the fastlane.</p>
<p>However, for some transaction-oriented web sites, rebuilding the index several times a day is still not acceptable. One might have heard the saying, "today's disk is yesterday's tape, today's RAM is yesterday's disk." We should really take advantage of the large memory of new machines. If we can move the fastlane into memory and stop worrying about mechanical disk access, we have a better chance of achieving "real time" serving.</p>
<p>The key point is <strong>divide and conquer</strong>: handle different freshness requirements with a number of pipelines of different priorities. We need a good algorithm to do the "divide" and a good strategy to merge. Meanwhile, we want to take advantage of new hardware as well!</p>
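<p>As a rough illustration of an in-memory fastlane and the blending step, here is a minimal sketch. The class and function names are hypothetical, the "main index" is just another instance of the same toy index, and ordering the blended results by timestamp is an assumption made for this example, not a claim about how any real engine merges.</p>

```python
from collections import defaultdict


class InMemoryIndex:
    """Tiny inverted index kept entirely in RAM (a stand-in for the fastlane).

    Sketch only: re-adding a document does not remove its old postings.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> (timestamp, text)

    def add(self, doc_id, text, timestamp):
        self.docs[doc_id] = (timestamp, text)
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return (doc_id, timestamp) for documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [(doc_id, self.docs[doc_id][0]) for doc_id in hits]


def blend(main_hits, fast_hits, limit=10):
    """Merge hits from both pipelines; the fastlane copy wins on duplicates; newest first."""
    merged = dict(main_hits)
    merged.update(dict(fast_hits))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:limit]
```

<p>New documents only touch the small in-memory index, so they become searchable immediately, while the large main index can keep its slower rebuild cycle.</p>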
]]></content:encoded>
			<wfw:commentRss>http://answersy.com/zchen/2008/04/03/on-realtime-personalized-search-5/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Realtime Personalized Search (4) &#8211; Ranking</title>
		<link>http://answersy.com/zchen/2008/02/07/on-realtime-personalized-search-4-ranking/</link>
		<comments>http://answersy.com/zchen/2008/02/07/on-realtime-personalized-search-4-ranking/#comments</comments>
		<pubDate>Thu, 07 Feb 2008 22:42:03 +0000</pubDate>
		<dc:creator>zchen</dc:creator>
				<category><![CDATA[Design Doc]]></category>
		<category><![CDATA[IT Related]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://answersy.com/zchen/2008/02/07/on-realtime-personalized-search-4-ranking/</guid>
		<description><![CDATA[Everybody agrees that relevance is vital to information retrieval and to any search engine. Hence, I would like to talk a little more about ranking here. First of all, as an old Chinese saying goes, “even a brilliant housewife cannot prepare a wonderful dinner without rice.” Content quality is absolutely a precondition for search relevance. The whole infrastructure [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Everybody agrees that relevance is vital to information retrieval and to any search engine. Hence, I would like to talk a little more about ranking here.</p>
<p class="MsoNormal">First of all, as an old Chinese saying goes, “even a brilliant housewife cannot prepare a wonderful dinner without rice.” <strong>Content quality</strong> is absolutely a precondition for search relevance. <em>The whole infrastructure must guarantee that statistically important, high-quality documents are arranged so that they will be selected as the “rice” for the housewife to cook. </em>:-)</p>
<p class="MsoNormal">Second, <em>the housewife actually runs a <strong>team</strong> to do the job. </em>A ranking function is a set of logic and models. (See figure 1 below.)</p>
<p class="MsoNormal"><img alt="ranking.gif" id="image152" src="http://answersy.com/zchen/wp-content/uploads/2008/02/ranking.gif" /><br />
Figure 1: Query -> Logic -> Ranking Module 1, Ranking Module 2, ..., Ranking Module N -> merge.</p>
<p class="MsoNormal">If the set of all possible queries is denoted as the <strong>input space<em> Q</em></strong>, we usually need to slice <strong><em>Q</em></strong> into smaller sub-regions and handle them separately. Many have dreamed of having one single unified ranking module to handle all queries, but sooner or later this hits a bottleneck. The idea is to have <em>a number of individual ranking modules, each of which performs well in a sub-region of <strong>Q</strong></em>. The branching logic can be based on length, language, intention, or other classifications of the queries or user profiles.</p>
<p class="MsoNormal">Third, we need <strong>merging logic </strong>to combine the outputs of different ranking modules. Sometimes we trust one ranking module so much that the others won’t even be called. Sometimes we might want to run more than one ranking module in parallel and blend their results. In general, the merging logic should be based on the following heuristic: <em>if ranking module i performs better than module j in a sub-domain of <strong>Q</strong>, then for a given query q in this sub-domain, i’s output should have a higher weight than j’s.</em> <em>This is to trust the experts’ domain knowledge</em>. The final weights also need to be tuned, either manually or by some statistical modeling procedure.</p>
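<p class="MsoNormal">One simple way to realize this heuristic is a weighted blend of module scores. The sketch below is only an assumption about how the merge could look: the module names are hypothetical, scores are assumed to be normalized to [0, 1], and the weights stand for however much each module is trusted in the query's sub-domain.</p>

```python
from collections import defaultdict


def merge_rankings(module_scores, weights):
    """Blend per-module scores into one ranking.

    module_scores: {module_name: {doc_id: score in [0, 1]}}
    weights: {module_name: trust in that module for this query's sub-domain}
    A module with weight 0 (or no weight at all) contributes nothing.
    """
    total_weight = sum(weights.get(name, 0.0) for name in module_scores)
    if total_weight == 0:
        return []
    blended = defaultdict(float)
    for name, scores in module_scores.items():
        share = weights.get(name, 0.0) / total_weight
        for doc_id, score in scores.items():
            blended[doc_id] += share * score
    # Highest blended score first; ties are broken by doc id for determinism.
    return sorted(blended.items(), key=lambda item: (-item[1], item[0]))
```

<p class="MsoNormal">Giving a single module all of the weight reproduces the case where one module is trusted so much that the others are effectively ignored.</p>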
<p class="MsoNormal">To summarize, <strong>content quality + committee of special ranking modules = solution</strong>.</p>
]]></content:encoded>
			<wfw:commentRss>http://answersy.com/zchen/2008/02/07/on-realtime-personalized-search-4-ranking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Realtime Personalized Search (3)</title>
		<link>http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-3/</link>
		<comments>http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-3/#comments</comments>
		<pubDate>Sat, 02 Feb 2008 02:09:21 +0000</pubDate>
		<dc:creator>zchen</dc:creator>
				<category><![CDATA[Design Doc]]></category>
		<category><![CDATA[IT Related]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-3/</guid>
		<description><![CDATA[A lot of web services talk about Personalization. As we may know, matching content to the audience's intention is the key value of any publishing site, especially web search. Usually, web services use aggregated features to improve relevance. For example, Google's famous PageRank: if a lot of pages link to this one, it is important! Another [...]]]></description>
			<content:encoded><![CDATA[<p>A lot of web services talk about <strong>Personalization</strong>.</p>
<p>As we may know, matching content to the audience's intention is the key value of any publishing site, especially web search.</p>
<p>Usually, web services use <strong>aggregated </strong>features to improve relevance. For example, Google's famous PageRank:  if a lot of pages link to this one, it is important! Another example would be recommended posts on a BBS site or Blog RSS engagement service like feedburner.com. Usually, they are based on popularity: if a lot of people click on this, it is good; if a lot of people subscribe to this, it is cool!<br />
In general, this aggregation approach will achieve "much-better-than-random-pick" results. However, for a given individual, this is NOT good enough: a lot of people like the iPod, but I prefer the Zune. Hence, when searching for "<em>mp3 player</em>," I might want to see more Zune results ;-)</p>
<p>The difficulty of personalization lies in two major factors: (1) the huge number of possible combinations of different profiles, and (2) how to obtain and store users' profiles.<br />
So, how do we address this issue?</p>
<p>I have to say, there is no out-of-the-box solution out there yet. But the more you address this, the better positioned you are for future competition! Hence, <strong>if you want to provide a serious web service to a general audience, you should think about personalization from the beginning of the system design</strong>.</p>
<p>Here are some preliminary thoughts:</p>
<p>(1) keep track of all the links a user clicks on your site</p>
<p>(2) have all the links classified based on some criteria<br />
(3) aggregate over user, link class, and time frame, and keep the summarized data in the user's cookie</p>
<p>(4) deliver contents using the summarized data as parameters</p>
<p>For example, if a search engine finds a user (cookie) clicks <em><strong>commercial links</strong></em> <em><strong>very often</strong></em> <strong><em>recently</em></strong>, it should really start to push more ads to him. If a user clicks sports links very often, showing sport-related ads will achieve better results.</p>
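<p>Steps (3) and (4) could be sketched roughly as follows. Everything here is an illustrative assumption: the click log format, the one-week window, the 0.5 threshold, and the function names are made up for this example, and the summary is returned as a plain dictionary rather than actually written to a cookie.</p>

```python
from collections import Counter


def summarize_clicks(clicks, now, window_seconds=7 * 24 * 3600):
    """Aggregate one user's clicks into per-class shares over a recent time window.

    clicks: iterable of (timestamp, link_class) pairs, e.g. (1207000000, "sports").
    Returns {link_class: fraction of recent clicks}, small enough to keep in a cookie.
    """
    recent = Counter(cls for ts, cls in clicks if now - ts <= window_seconds)
    total = sum(recent.values())
    if total == 0:
        return {}
    return {cls: count / total for cls, count in recent.items()}


def pick_ad_class(profile, default="general", threshold=0.5):
    """Use the summary as a delivery parameter: target the dominant class if it is strong enough."""
    if not profile:
        return default
    cls, share = max(profile.items(), key=lambda item: item[1])
    return cls if share >= threshold else default
```

<p>Keeping only the aggregated shares, rather than the raw click history, keeps the stored profile small and sidesteps part of the storage problem mentioned above.</p>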
<p>Remember, <em><strong>Personalization is not about being perfect, it is about achieving competitive advantage</strong></em>!</p>
]]></content:encoded>
			<wfw:commentRss>http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>On Realtime Personalized Search (2)</title>
		<link>http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-2/</link>
		<comments>http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-2/#comments</comments>
		<pubDate>Sat, 02 Feb 2008 00:31:32 +0000</pubDate>
		<dc:creator>zchen</dc:creator>
				<category><![CDATA[Design Doc]]></category>
		<category><![CDATA[IT Related]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-2/</guid>
		<description><![CDATA[Let's consider ranking. What is ranking? In information retrieval, ranking is to order "matched documents." We need to define matching first. If a user queries "mp3," can "ipod" in a document be considered a match? In general, I think so. But usually, most search engines will be very conservative in doing so. Besides whether query terms [...]]]></description>
			<content:encoded><![CDATA[<p>Let's consider <strong>ranking</strong>.</p>
<p>What is ranking? In information retrieval, ranking is to <strong>order </strong>"<strong><em>matched documents</em></strong>."</p>
<p>We need to define <strong>matching</strong> first. If a user queries "<em>mp3</em>," can "<em>ipod</em>" in a document be considered a match? In general, I think so. But usually, most search engines will be very conservative in doing so.</p>
<p>Besides whether query terms match the document or not, we also need to consider <strong>where the matches occur</strong>. They can occur in the title, header, or body section, and sometimes in anchor text or other meta information such as the "url" or "tag" sections. Of course, matching <strong>proximity </strong>is another important consideration. These are all standard information retrieval issues.</p>
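<p>Both ideas, where a match occurs and how close the matched terms are, can be sketched in a few lines. The field names follow the ones listed above, but the weights are invented for illustration, query terms are assumed to be lowercase already, and tokenization is plain whitespace splitting.</p>

```python
# Assumed weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "header": 2.0, "anchor": 2.0, "url": 1.5, "body": 1.0}


def field_score(query_terms, doc_fields):
    """Sum the weight of every (term, field) pair where the term occurs in the field."""
    score = 0.0
    for field, text in doc_fields.items():
        tokens = set(text.lower().split())
        for term in query_terms:
            if term in tokens:
                score += FIELD_WEIGHTS.get(field, 0.0)
    return score


def min_span(query_terms, text):
    """Length in tokens of the smallest window containing all query terms, or None."""
    tokens = text.lower().split()
    needed = set(query_terms)
    best = None
    for start in range(len(tokens)):
        if tokens[start] not in needed:
            continue
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in needed:
                seen.add(tokens[end])
            if seen == needed:
                span = end - start + 1
                if best is None or span < best:
                    best = span
                break
    return best
```

<p>A smaller span means the query terms sit closer together, which a ranking function could reward on top of the field score.</p>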
<p>In my opinion, understanding query intention is the key to improving ranking. There is NO "general ranking algorithm" for all types of queries, and it does not make sense to try to build one. As a result, a good ranking function should combine heuristic logic with data-trained models. Namely, we can slice the data into categories and train a special ranking model for each category, while at the same time developing algorithms that decide which ranking modules to pick for a given query.</p>
<p>For example, for an <em>English</em> query, if the URL string or host name matches the query terms, that can be a very strong indicator of a "good match"; this may not apply at all to Asian languages. So we need a language detection module before applying the ranking modules.</p>
<p>Query -> Language Detection -> English Logic, Chinese Logic ...<br />
The key is how many manually maintained logic forks we can afford to keep so that the data-trained ranking module at each leaf performs well enough. We also need to balance the <strong>cost of maintaining a list of such modules</strong> against the performance gained from specialization.</p>
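<p>The fork above could look roughly like the following sketch. The language detector is deliberately crude (it only checks for CJK Unified Ideographs), and both ranking functions are hypothetical placeholders for data-trained modules; the document fields "url" and "text" are assumptions of this example.</p>

```python
def detect_language(query):
    """Very rough detector: any CJK Unified Ideograph means 'zh', otherwise 'en'."""
    for ch in query:
        if "\u4e00" <= ch <= "\u9fff":
            return "zh"
    return "en"


def rank_english(query, docs):
    """Hypothetical English logic: boost documents whose URL contains a query term."""
    terms = query.lower().split()
    return sorted(docs, key=lambda d: -sum(t in d["url"].lower() for t in terms))


def rank_chinese(query, docs):
    """Hypothetical Chinese logic: ignore the URL and count query characters in the text."""
    return sorted(docs, key=lambda d: -sum(d["text"].count(ch) for ch in query))


# One manually maintained fork per language; each leaf is a specialized ranker.
RANKERS = {"en": rank_english, "zh": rank_chinese}


def rank(query, docs):
    """Query -> language detection -> language-specific ranking logic."""
    return RANKERS[detect_language(query)](query, docs)
```

<p>Every entry added to the dispatch table is another module to train and maintain, which is exactly the cost that has to be weighed against the gain from specialization.</p>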
]]></content:encoded>
			<wfw:commentRss>http://answersy.com/zchen/2008/02/02/on-realtime-personalized-search-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Realtime Personalized Search (1)</title>
		<link>http://answersy.com/zchen/2008/01/23/on-realtime-search-1/</link>
		<comments>http://answersy.com/zchen/2008/01/23/on-realtime-search-1/#comments</comments>
		<pubDate>Wed, 23 Jan 2008 18:55:43 +0000</pubDate>
		<dc:creator>zchen</dc:creator>
				<category><![CDATA[Design Doc]]></category>
		<category><![CDATA[IT Related]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://answersy.com/zchen/2008/01/23/on-realtime-search-1/</guid>
		<description><![CDATA[Search comes naturally when one is facing oceans of information. Google dominates internet search for the following reasons: a well-recognized internet search service featuring good coverage of most web content, good understanding of user intention and high query-result relevance, freshness (discovery of new content and fast indexing), high availability, and fast serving [...]]]></description>
			<content:encoded><![CDATA[<p>Search comes naturally when one is facing oceans of information.<br />
Google dominates internet search for the following reasons:</p>
<ol>
<li>A well-recognized internet search service featuring:
<ol>
<li>Good <strong>coverage</strong> of most web content</li>
<li>Good understanding of user intention and high query-result <strong>relevance</strong></li>
<li><strong>Freshness</strong>: discovery of new content and fast indexing</li>
<li>High <strong>availability</strong> and fast serving speed</li>
<li>Relatively stable and steadily improving <strong>presentation</strong> of the results</li>
</ol>
</li>
<li>Keeping close track of search users' behavior to deliver relevant online ads</li>
<li>Sharing profit with other publishers and sites</li>
</ol>
<p>In this document, I am going to present another approach to delivering a good search experience.</p>
<p>There are too many different aspects to search. That is why it is so hard to start up a truly functioning search service on the internet, or even for a large corporate site.</p>
<p>As we all know, a search service includes the following components:</p>
<ol>
<li>a <strong>crawler </strong>and document repository</li>
<li>some <strong>content analyzer</strong>, to extract linkage information and other meta data from the crawled documents</li>
<li>an <strong>indexer </strong>to build keyword index for the documents</li>
<li>the <strong>search engine </strong>clusters to serve the indexed data</li>
<li>a <strong>proxy </strong>layer, which modern search engines always have, to do load balancing, caching, and aggregation</li>
<li>a <strong>ranking </strong>module to order the matching results</li>
<li>a frontend, which usually instruments some <strong>tracking </strong>code to monitor end-user behavior that can potentially feed back into the content system</li>
</ol>
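<p>As one concrete example from the list above, the content analyzer's job of extracting linkage information can be sketched with Python's standard html.parser module. The class name and the returned dictionary layout are hypothetical; a real analyzer would also resolve relative links and extract far more metadata.</p>

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the page title text and every href found in anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def analyze(html):
    """Return the title and outgoing links of one crawled document."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}
```

<p>The extracted links feed the crawler's frontier and link-based ranking signals, while the title and other fields go to the indexer.</p>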
<p>As of now, the major challenges of building such a service lie in the following aspects:</p>
<ol>
<li><strong>scalability</strong></li>
<li><strong>meta information extraction</strong> from unstructured data</li>
<li>user intention understanding, <strong>ranking</strong></li>
<li><strong>spam</strong> filtering</li>
<li>content <strong>updating</strong></li>
</ol>
<p>I will try to discuss these issues in this series :-)<br />
References:<br />
1. <a href="http://en.wikipedia.org/wiki/Index_(search_engine)">http://en.wikipedia.org/wiki/Index_(search_engine)</a></p>
]]></content:encoded>
			<wfw:commentRss>http://answersy.com/zchen/2008/01/23/on-realtime-search-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>


