JonsenElizee

Software Developing Blog

"An idea is fragile . It can be killed by a scornful smile or a yawn .It can be mound down by irony and scared to death by a cold look."
"Most cultures throughout human history have not liked creative individuals .They ignore them or kill them.It is a very efficient way of stopping creativity."

------Advertising boss Charles Browe and Howard Gardner ,professor at Harvard

   :: 首页 :: 新随笔 ::  ::  :: 管理 ::
Get the data number that is not a duplicated one in 0.25 billion data set. the data in this big data set is of type int. And the limitation is that
you can only use 600M memory at most. the time complexity should be O(n).

For the data type is int, so, the range of data is [0, 2^32).
Here is my idea.

/* using two 256M array to record the existing state is the best way. */
int duplicated(int all[], int size)
{
    bit st1[256M] 
= {0}, st2[256M] = {0};
    
int cnt = 0;
    
for(dat : all)// for each data in all-data set
    {
        
if (dat < 2^31) {  
            
if(st1[dat] == 0) st1[dat] = 1;
            
else if(st2[dat] == 0) { st2[dat] = 1; cnt++; }
        }
    }

    reset st1 and st2;
    
for(dat : all)// for each data in all-data set
    {
        
if (dat >= 2^31) {
            dat 
-= 2^31;
            
if(st1[dat] == 0) st1[dat] = 1;
            
else if(st2[dat] == 0) { st2[dat] = 1; cnt++; }
        }
    }
    
return cnt;
}

/* the time complexity is not a constant, O(1), it's O(2 * size/2), it means O(n). */
/* space complexity is O(1), a constant size depending on your processing division.*/
/* if we need to process one billion data with type int, this algorithm can also */
/* work well. because the precondition: 0 <= all[i] < 2^32. */
/* Whitout real implementation and full test, if any bug, any reply will be */
/* appreciated. */

posted on 2010-10-11 09:42 JonsenElizee 阅读(1702) 评论(0)  编辑 收藏 引用

只有注册用户登录后才能发表评论。
网站导航: 博客园   IT新闻   BlogJava   博问   Chat2DB   管理


By JonsenElizee