lxyfirst


     Abstract: keytool is the key and certificate management tool that ships with Java. Its data is stored in a keystore file, i.e. a .jks file. 1. Create an RSA key pair (public and private key) and store it in the keystore file: ... Read the full post
posted @ 2011-04-15 14:53 star Views (8571) | Comments (1)


From http://www.usenix.org/events/osdi10/tech/full_papers/Geambasu.pdf
There are many distributed key-value storage systems, but Comet is a distributed "active" key-value store. Its main features:
1. It layers a callback mechanism on top of an ordinary key-value store: when a key-value object is stored or fetched, the corresponding handler function is invoked, so application logic can run inside the store. The callbacks implemented so far are onGet, onPut, onUpdate and onTimer.
2. Handler functions are written in Lua; Comet embeds a stripped-down Lua interpreter with many restrictions, forming a safe sandbox in which handler code runs.
See the paper for other details.
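
The idea can be pictured with a small C++ sketch of a key-value store that fires callbacks on access. This is only a conceptual illustration; Comet's real handlers are Lua functions executed in its sandboxed interpreter, and all names below are made up:

#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// Conceptual "active" key-value store: user-supplied handlers run whenever
// objects are stored or fetched. Names are illustrative, not Comet's API.
struct Handlers {
    std::function<void(const std::string&)> on_get;
    std::function<void(const std::string&, const std::string&)> on_put;
};

class ActiveKV {
public:
    void set_handlers(Handlers h) { handlers_ = std::move(h); }

    void put(const std::string& key, const std::string& value) {
        if (handlers_.on_put) handlers_.on_put(key, value);  // hook before storing
        data_[key] = value;
    }

    std::optional<std::string> get(const std::string& key) {
        if (handlers_.on_get) handlers_.on_get(key);         // hook on access
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> data_;
    Handlers handlers_;
};

int main() {
    ActiveKV kv;
    kv.set_handlers({[](const std::string& k) { std::cout << "onGet " << k << "\n"; },
                     [](const std::string& k, const std::string&) { std::cout << "onPut " << k << "\n"; }});
    kv.put("user:1", "alice");
    kv.get("user:1");
}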




posted @ 2011-03-30 15:56 star Views (375) | Comments (0)


http://highscalability.com/numbers-everyone-should-know

Numbers Everyone Should Know

Google AppEngine Numbers

This group of numbers is from Brett Slatkin in Building Scalable Web Apps with Google App Engine.

Writes are expensive!

  • Datastore is transactional: writes require disk access
  • Disk access means disk seeks
  • Rule of thumb: 10ms for a disk seek
  • Simple math: 1s / 10ms = 100 seeks/sec maximum
  • Depends on:
    * The size and shape of your data
    * Doing work in batches (batch puts and gets)

    Reads are cheap!

  • Reads do not need to be transactional, just consistent
  • Data is read from disk once, then it's easily cached
  • All subsequent reads come straight from memory
  • Rule of thumb: 250usec for 1MB of data from memory
  • Simple math: 1s / 250usec = 4GB/sec maximum
    * For a 1MB entity, that's 4000 fetches/sec

    Numbers Miscellaneous

    This group of numbers is from a presentation Jeff Dean gave at an Engineering All-Hands Meeting at Google.

  • L1 cache reference 0.5 ns
  • Branch mispredict 5 ns
  • L2 cache reference 7 ns
  • Mutex lock/unlock 100 ns
  • Main memory reference 100 ns
  • Compress 1K bytes with Zippy 10,000 ns
  • Send 2K bytes over 1 Gbps network 20,000 ns
  • Read 1 MB sequentially from memory 250,000 ns
  • Round trip within same datacenter 500,000 ns
  • Disk seek 10,000,000 ns
  • Read 1 MB sequentially from network 10,000,000 ns
  • Read 1 MB sequentially from disk 30,000,000 ns
  • Send packet CA->Netherlands->CA 150,000,000 ns

    The Lessons

  • Writes are 40 times more expensive than reads.
  • Global shared data is expensive. This is a fundamental limitation of distributed systems. The lock contention in shared heavily written objects kills performance as transactions become serialized and slow.
  • Architect for scaling writes.
  • Optimize for low write contention.
  • Optimize wide. Make writes as parallel as you can.

    The Techniques

    Keep in mind these are from a Google AppEngine perspective, but the ideas are generally applicable.

    Sharded Counters

    We always seem to want to keep count of things. But BigTable doesn't keep a count of entities because it's a key-value store. It's very good at getting data by keys; it's not interested in how many you have. So the job of keeping counts is shifted to you.

    The naive counter implementation is to lock-read-increment-write. This is fine if there is a low number of writes. But if there are frequent updates there's high contention. Given that the number of writes that can be made per second is so limited, a high write load serializes and slows down the whole process.

    The solution is to shard counters. This means:
  • Create N counters in parallel.
  • For each item counted, pick a shard at random and increment it transactionally.
  • To get the real current count sum up all the sharded counters.
  • Contention is reduced by 1/N. Writes have been optimized because they have been spread over the different shards. A bottleneck around shared state has been removed.

    This approach seems counter-intuitive because we are used to a counter being a single incrementable variable. Reads are cheap so we replace having a single easily read counter with having to make multiple reads to recover the actual count. Frequently updated shared variables are expensive so we shard and parallelize those writes.
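
As a rough in-memory analogy, here is what sharding a hot counter looks like in C++. This is only an illustrative sketch of the pattern, not GAE's datastore API, and all names are invented:

#include <atomic>
#include <cstdint>
#include <random>
#include <vector>

// Spread increments over N independent shards to avoid contention on a
// single hot counter; sum the shards when the total is needed.
class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t num_shards) : shards_(num_shards) {}

    void increment() {
        // Pick a shard at random so concurrent writers rarely collide.
        thread_local std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<std::size_t> pick(0, shards_.size() - 1);
        shards_[pick(rng)].fetch_add(1, std::memory_order_relaxed);
    }

    std::int64_t value() const {
        // Reads are cheap: sum all shards to recover the real count.
        std::int64_t total = 0;
        for (const auto& shard : shards_) total += shard.load(std::memory_order_relaxed);
        return total;
    }

private:
    std::vector<std::atomic<std::int64_t>> shards_;
};

The trade-off is the same as in the text: increments land on different shards and stay cheap, while reading the total costs N loads.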

    With a centralized database letting the database be the source of sequence numbers is doable. But to scale writes you need to partition and once you partition it becomes difficult to keep any shared state like counters. You might argue that so common a feature should be provided by GAE and I would agree 100 percent, but it's the ideas that count (pun intended).
    Paging Through Comments

    How can comments be stored such that they can be paged through in roughly the order they were entered?

    Under a high write load situation this is a surprisingly hard question to answer. Obviously what you want is just a counter. As a comment is made you get a sequence number and that's the order comments are displayed. But as we saw in the last section shared state like a single counter won't scale in high write environments.

    A sharded counter won't work in this situation either because summing the sharded counters isn't transactional. There's no way to guarantee each comment will get back the sequence number it was allocated, so we could have duplicates.

    Searches in BigTable return data in alphabetical order. So what is needed for a key is something unique and alphabetical so when searching through comments you can go forward and backward using only keys.

    A lot of paging algorithms use counts. Give me records 1-20, 21-30, etc. SQL makes this easy, but it doesn't work for BigTable. BigTable knows how to get things by keys so you must make keys that return data in the proper order.

    In the grand old tradition of making unique keys we just keep appending stuff until it becomes unique. The suggested key for GAE is: time stamp + user ID + user comment ID.

    Ordering by date is obvious. The good thing is getting a time stamp is a local decision, it doesn't rely on writes and is scalable. The problem is timestamps are not unique, especially with a lot of users.

    So we can add the user name to the key to distinguish it from all other comments made at the same time. We already have the user name so this too is a cheap call.

    Theoretically even time stamps for a single user aren't sufficient. What we need then is a sequence number for each user's comments.

    And this is where the GAE solution turns into something totally unexpected. Our goal is to remove write contention so we want to parallelize writes. And we have a lot of available storage, so we don't have to worry about that.

    With these forces in mind, the idea is to create a counter per user. When a user adds a comment it's added to a user's comment list and a sequence number is allocated. Comments are added in a transactional context on a per user basis using Entity Groups. So each comment add is guaranteed to be unique because updates in an Entity Group are serialized.

    The resulting key is guaranteed unique and sorts properly in alphabetical order. When paging, a query is made across entity groups using the ID index. The results will be in the correct order. Paging is a matter of getting the previous and next keys in the query for the current page. These keys can then be used to move through the index.
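
A minimal sketch of assembling such a key (plain C++ string formatting; the zero-padded field widths and separator are assumptions for illustration, not GAE's actual key encoding):

#include <cstdint>
#include <cstdio>
#include <string>

// Build a comment key that sorts correctly as a string:
// zero-padded timestamp, then user ID, then a per-user sequence number.
// Field widths and layout are illustrative; user_id is assumed short.
std::string make_comment_key(std::int64_t timestamp_usec,
                             const std::string& user_id,
                             std::int64_t user_comment_seq) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "%020lld|%s|%010lld",
                  static_cast<long long>(timestamp_usec),
                  user_id.c_str(),
                  static_cast<long long>(user_comment_seq));
    return buf;
}

// Example: make_comment_key(1300000000000000, "alice", 42)
// -> "00001300000000000000|alice|0000000042"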

    I certainly would have never thought of this approach. The idea of keeping per user comment indexes is out there. But it cleverly follows the rules of scaling in a distributed system. Writes and reads are done in parallel and that's the goal. Write contention is removed.

    posted @ 2011-03-24 14:01 star Views (395) | Comments (0)

    In multi-threaded programs developed on Linux, debugging and monitoring individual threads has always been awkward because there was no precise way to identify each thread; now there is a solution.
    Since kernel 2.6.9, Linux's prctl call supports the PR_SET_NAME option for setting the process name. Linux threads are LWPs (light-weight processes), so this call can also be used to set a thread's name.
    The API is defined as follows:
    int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);

    PR_SET_NAME (since Linux 2.6.9)
    Set the process name for the calling process, using the value in the location pointed to by (char *) arg2. The name can be up to 16 bytes long, and should be null-terminated if it contains fewer bytes.

    PR_GET_NAME (since Linux 2.6.11)
    Return the process name for the calling process, in the buffer pointed to by (char *) arg2. The buffer should allow space for up to 16 bytes; the returned string will be null-terminated if it is shorter than that.


    A simple implementation:

    #include <stdarg.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    /* Format a name (truncated to 16 bytes including the terminating NUL)
     * and apply it to the calling thread via PR_SET_NAME. */
    int set_thread_title(const char* fmt, ...)
    {
        char title[16] = {0};
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(title, sizeof(title), fmt, ap);
        va_end(ap);

        return prctl(PR_SET_NAME, title);
    }
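
A minimal usage sketch (the worker function and std::thread wrapper here are hypothetical, just to show where the call goes):

#include <thread>

// Hypothetical worker: each thread names itself on startup so it can be
// told apart in per-thread views of ps and top.
void worker_main(int worker_id)
{
    set_thread_title("worker-%d", worker_id);
    // ... do the actual work ...
}

int main()
{
    std::thread t(worker_main, 1);
    t.join();
    return 0;
}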

    Now that threads can be given names, how do we see them?
    ps -eL -o pid,user,lwp,comm
    top -



    posted @ 2011-03-07 16:11 star Views (7758) | Comments (2)

    bitcask is a key-value storage system whose defining trait is keeping index data in memory while storing the actual data on disk.
    1. All keys live in memory, organized in a hash map for fast lookup; for each key the map also holds a file pointer to the corresponding data on disk, so the data can be located directly.
    2. Disk data is written append-only, playing to the disk's strength at sequential access; every update is written to the data file and the index is updated at the same time.
    3. Reads locate data directly through the index; bitcask relies on the file system's cache and does not implement a cache of its own.
    4. Because updates are written to a new location, stale data at old locations is periodically cleaned up and merged to reduce the disk space used.
    5. Concurrency control for reads and writes uses vector clocks.
    6. The in-memory index is also flushed to separate index files, so a restart does not have to rebuild the entire index.

    http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-log-structured-hash-table-for-fast-keyvalue.html
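
A rough C++ sketch of the in-memory index described above (field names are illustrative, not Bitcask's actual structures or on-disk format):

#include <cstdint>
#include <string>
#include <unordered_map>

// Bitcask-style in-memory index: each key maps to the exact location of
// its latest value on disk, so a read needs at most one seek.
struct IndexEntry {
    std::uint32_t file_id;      // which append-only data file
    std::uint64_t value_offset; // byte offset of the value within that file
    std::uint32_t value_size;   // length of the value in bytes
    std::uint64_t timestamp;    // used when merging/compacting old files
};

using KeyDir = std::unordered_map<std::string, IndexEntry>;

// On put: append the record to the active data file, then point the KeyDir
// entry at the newly written location. On get: look up the KeyDir and read
// value_size bytes at value_offset in file_id.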

    posted @ 2011-02-16 19:23 star Views (827) | Comments (0)

    Varnish's author, Poul-Henning Kamp, is a FreeBSD kernel developer, and he drew on kernel principles and mechanisms when designing varnish. Some of his design notes, summarized:
    1. Modern operating systems have elaborate optimizations for memory management and disk I/O to improve overall system performance, and user-space programs need to be aware of, and cooperate with, these mechanisms. Take squid as an example: it implements its own object caching and eviction policy, which largely duplicates what the operating system already does (hot objects are cached in memory, cold objects are flushed to disk to free memory). In some situations this duplication conflicts with the OS and the expected benefit never materializes. If an object cached in squid's memory has not been accessed for a while and squid has not yet flushed it to disk, the OS may swap that cold object out under memory pressure without squid knowing, so squid still believes the object is in memory. When squid's eviction policy later decides to flush the object to disk, the OS must first page it back in from swap just so that squid can write it out again. The performance cost of this round trip is plain to see.
    Comment: this example cuts both ways; if an application's in-memory objects are being swapped out, the system is already short of memory and the effectiveness of any in-memory cache is greatly reduced.

    2. A cache with persistence has to rebuild its cache from the persisted data. There are generally two approaches: read from disk on demand, which saves memory but is inefficient because accesses are random and random disk reads are slow (suited to low-traffic, small machines with a large cached data set); or build a complete index from disk up front, which greatly improves access efficiency.
    Unlike a database, a persistent cache has modest reliability requirements and does not need strict crash recovery. Varnish takes the second approach and improves reliability through layered protection: the top layer is kept reliable with A/B writes, while the lower layer's concrete data carries no reliability guarantee.
    http://www.varnish-cache.org/trac/wiki/ArchitectNotes
    posted @ 2011-01-28 11:52 star Views (473) | Comments (0)

    An introduction to the Kafka messaging middleware

    Purpose and use cases

    Kafka is LinkedIn's distributed messaging system. Its design emphasizes high throughput, and it is used for friend-activity feeds, relevance statistics, ranking statistics, rate limiting, batch processing and similar systems.

    The traditional offline-analysis approach is to record data in log files and then process and analyze it centrally in batches. That does not suit activity-stream data with strong real-time requirements; conversely, most message middleware can handle messages/data that need low latency, but is weak at persisting a large backlog of unprocessed messages/data sitting in its queues.

     

    Design principles

             Persistent messages

             High throughput

             The consumer decides message state

             Every role in the system runs as a distributed cluster

    Consumers have the notion of logical groups: each consumer process belongs to a consumer group, and each message is delivered to one consumer process in every consumer group that subscribes to it.

    LinkedIn uses multiple consumer groups, each containing several consumer processes with the same responsibility.

    Deployment architecture

    http://sna-projects.com/kafka/images/tracking_high_level.png

    Message persistence and caching

    Kafka persists messages in disk files. How fast disk files can be read and written depends on how they are used: random writes are far slower than sequential writes, and a modern OS will use the page cache to merge disk writes whenever reclaiming that memory does not hurt performance much, so there is little point in the user process adding another cache on top. Kafka's reads and writes are all sequential, appending to files.

     

    To reduce memory copies, Kafka uses sendfile to send data, and it improves performance by batching messages together.
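
For reference, a minimal sketch of the sendfile-based zero-copy send this refers to (C++/Linux; the file and socket descriptors are assumed to be opened elsewhere by the caller):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Send a whole file to an already-connected socket without copying the data
// through user space: the kernel moves pages straight from the page cache
// to the socket.
bool send_file_zero_copy(int sock_fd, int file_fd) {
    struct stat st{};
    if (fstat(file_fd, &st) != 0) return false;

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0) return false;  // error or no progress; real code would retry on EINTR
    }
    return true;
}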

     

    Kafka does not store the state of each individual message; instead it keeps per-client state keyed by (consumer, topic, partition), which greatly reduces the burden of maintaining per-message state.

     

    On the question of push vs. pull delivery, Kafka uses pull, because push creates uncertainty when clients differ in processing capacity, traffic and so on.

     

    Load balancing

    Load balancing between producers and brokers is done in hardware; brokers and consumers both run as clusters, with ZooKeeper coordinating changes and membership management.

     

     

    posted @ 2011-01-25 15:56 star Views (2095) | Comments (0)

    http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html
    /proc/{pid}/ holds all the data related to a running process, which can be used to analyze its resource consumption and runtime behavior.

    1. /proc/{pid}/stat
    Process runtime statistics:
    awk '{print $1,$2,$3,$14,$15,$20,$22,$23,$24}' stat
    PID, COMM, STATE, UTIME (CPU ticks in user mode), STIME (CPU ticks in kernel mode), THREADS, START_TIME, VSIZE (virtual memory size), RSS (resident memory pages)
    2. /proc/{pid}/status
    Contains most of the data in stat, in a more readable form.
    3. /proc/{pid}/task/
    Per-thread runtime information.
    4. /proc/{pid}/fd/
    File descriptors opened by the process.
    5. /proc/{pid}/io
    I/O statistics for the process.
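
The same fields can be pulled out programmatically. A small C++ sketch reading /proc/self/stat (field numbers follow proc(5); comm is wrapped in parentheses and may contain spaces, so we split after the last ')'):

#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Read /proc/self/stat and print the fields the awk one-liner above selects.
int main() {
    std::ifstream in("/proc/self/stat");
    std::string line;
    std::getline(in, line);

    const std::size_t open = line.find('(');
    const std::size_t close = line.rfind(')');
    const std::string pid = line.substr(0, line.find(' '));
    const std::string comm = line.substr(open + 1, close - open - 1);

    // Remaining tokens start at field 3 (state), so field N sits at index N-3.
    std::istringstream rest(line.substr(close + 2));
    const std::vector<std::string> f{std::istream_iterator<std::string>(rest),
                                     std::istream_iterator<std::string>()};

    std::cout << "pid=" << pid << " comm=" << comm
              << " state="   << f[0]    // field 3
              << " utime="   << f[11]   // field 14
              << " stime="   << f[12]   // field 15
              << " threads=" << f[17]   // field 20
              << " start="   << f[19]   // field 22
              << " vsize="   << f[20]   // field 23
              << " rss="     << f[21]   // field 24
              << "\n";
}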


    posted @ 2011-01-05 15:31 star Views (225) | Comments (0)

    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.ip_local_port_range = 1024 65000

    net.ipv4.route.max_size = 4096000
    net.core.somaxconn = 8192
    net.ipv4.tcp_synack_retries = 1
    net.ipv4.tcp_syn_retries = 1
    net.ipv4.netfilter.ip_conntrack_max = 2621400
    net.core.rmem_max = 20000000

    ulimit -n 40960
    ulimit -c unlimited

    Just a marker for now; to be filled out more completely later.

    posted @ 2010-11-17 10:27 star Views (117) | Comments (0)

    redis periodically flushes data to storage based on how much the data has changed and how much time has elapsed, which is effectively taking a checkpoint.
    It uses the copy-on-write behaviour of the fork system call to obtain a copy of memory, guaranteeing a consistent view while the data is flushed.
    However, if the data changes heavily while the flush is in progress, a large amount of copy-on-write can occur, changing the system's memory-copy load.
    The logic:
    1. The main process calls fork.
    2. The child process closes the listening fd and starts flushing data to storage.
    3. The main process adjusts its policy to reduce changes to in-memory data.
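
A bare-bones sketch of this fork-and-snapshot pattern (C++/POSIX; the in-memory table and output format are placeholders, not redis's actual RDB code):

#include <cstdio>
#include <map>
#include <string>
#include <unistd.h>

// Placeholder in-memory dataset; redis's real structures are more elaborate.
std::map<std::string, std::string> g_data;

// Fork a child that writes a consistent snapshot: copy-on-write means the
// child sees the data exactly as it was at fork() time, while the parent
// keeps serving requests (ideally limiting writes to reduce COW copies).
void snapshot_to_disk(const char* path) {
    pid_t pid = fork();
    if (pid == 0) {                       // child: dump and exit
        std::FILE* f = std::fopen(path, "w");
        if (f) {
            for (const auto& kv : g_data)
                std::fprintf(f, "%s\t%s\n", kv.first.c_str(), kv.second.c_str());
            std::fclose(f);
        }
        _exit(0);
    }
    // parent: continue serving; reap the child later (e.g. with waitpid).
}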

    This strategy does not guarantee data durability: there is no write-ahead log, so data may be lost in abnormal situations.
    redis therefore added an append-only log file to keep data safe, but writing the log on every update makes the file grow quickly, so redis compacts this log file in the background using a mechanism similar to the data flush described above.

    Note: databases today generally rely on a write-ahead log for durability, but even that log is not flushed in real time; it is written to a buffer and flushed to the file when triggered.


    posted @ 2010-08-21 10:37 star Views (895) | Comments (1)
