ICBC.pdf: finished
Reading the “NoSQL Database”
Reasons for using NoSQL
1. Avoidance of Unneeded Complexity
2. High Throughput
3. Horizontal Scalability and Running on Commodity Hardware
4. Avoidance of Expensive Object-Relational Mapping
5. Complexity and Cost of Setting up Database Clusters
6. Compromising Reliability for Better Performance
7. The Current “One Size Fits All” Database Thinking Was and Is Wrong
8. The Myth of Effortless Distribution and Partitioning of Centralized Data Models
9. Movements in Programming Languages and Development Frameworks
10. Requirements of Cloud Computing
11. The RDBMS plus Caching-Layer Pattern/Workaround vs. Systems Built from Scratch with Scalability in Mind
12. Yesterday’s vs. Today’s Needs
Nosqldbs.pdf, page 19
Reading cudaArticle-05
A multiprocessor takes four clock cycles to issue one memory instruction for a "warp".
Accessing local or global memory incurs an additional 400 to 600 clock cycles of memory latency.
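To put the two figures side by side: if one issue takes 4 cycles and a memory access adds 400 to 600 cycles of latency, then roughly 100 to 150 other issues fit into the time one access is outstanding. A minimal C sketch of that arithmetic follows; the function name is hypothetical, and the assumption that issue slots can be filled back to back with other work is mine, not a figure from the article.

```c
/* How many back-to-back instruction issues fit inside one memory
   latency window. Both arguments are in clock cycles. Assumes the
   issue slots can be filled without gaps (an idealization). */
static int issues_to_cover_latency(int latency_cycles, int issue_cycles)
{
    return latency_cycles / issue_cycles;
}
```

With the numbers above, issues_to_cover_latency(400, 4) gives 100 and issues_to_cover_latency(600, 4) gives 150.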
CUDA Memory
Registers:
- The fastest form of memory on the multiprocessor.
- Only accessible by the thread.
- Has the lifetime of the thread.
Shared memory:
- Can be as fast as a register when there are no bank conflicts or when all threads read from the same address.
- Accessible by any thread of the block from which it was created.
- Has the lifetime of the block.
Global memory:
- Potentially 150x slower than register or shared memory; watch out for uncoalesced reads and writes, which will be discussed in the next column.
- Accessible from either the host or the device.
- Has the lifetime of the application.
Local memory:
- A potential performance gotcha: it resides in global memory and can be 150x slower than register or shared memory.
- Only accessible by the thread.
- Has the lifetime of the thread.
// includes, system
#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit
#include <assert.h>
// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);
// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;
    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    // Block until all threads in the block have written
    // their data to shared mem
    __syncthreads();
    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
// Program main
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)
    // pointer for device memory
    int *d_b, *d_a;
    // define grid and block size
    int numThreadsPerBlock = 256;
    // Compute number of blocks needed based on array size
    // and desired block size
    int numBlocks = dimA / numThreadsPerBlock;
    // Part 1 of 2: Compute the number of bytes of shared memory needed
    // This is used in the kernel invocation below
    int sharedMemSize = numThreadsPerBlock * sizeof(int);
    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );
    // Initialize input array on host
    for (int i = 0; i < dimA; ++i) {
        h_a[i] = i;
    }
    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );
    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );
    // block until the device has completed
    cudaDeviceSynchronize();
    // check if kernel execution generated an error
    // Check for any CUDA errors
    checkCUDAError("kernel invocation");
    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );
    // Check for any CUDA errors
    checkCUDAError("memcpy");
    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++) {
        assert(h_a[i] == dimA - 1 - i );
    }
    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    // free host memory
    free(h_a);
    // If the program makes it this far,
    // then the results are correct and
    // there are no run-time errors. Good work!
    return 0;
}
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
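The index arithmetic of reverseArrayBlock can be checked on the host without a GPU. The following is a minimal C sketch and not part of the article: reverse_array_block_host and MAX_BLOCK_DIM are hypothetical names, the outer loop stands in for the grid of blocks, the inner loops stand in for the threads of one block, and the gap between the two inner loops plays the role of the barrier.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCK_DIM 1024  /* hypothetical upper bound for the scratch buffer */

/* Host-side model of the kernel: each block of block_dim elements is
   reversed through a scratch buffer (the "shared memory") and written
   to the mirrored block position. n must be a multiple of block_dim. */
static void reverse_array_block_host(int *out, const int *in,
                                     size_t n, size_t block_dim)
{
    int s_data[MAX_BLOCK_DIM];
    size_t grid_dim;

    assert(block_dim > 0 && block_dim <= MAX_BLOCK_DIM);
    assert(n % block_dim == 0);
    grid_dim = n / block_dim;

    for (size_t block = 0; block < grid_dim; ++block) {
        size_t in_offset = block_dim * block;
        size_t out_offset = block_dim * (grid_dim - 1 - block);

        /* Phase 1: every "thread" t stores its element reversed. */
        for (size_t t = 0; t < block_dim; ++t)
            s_data[block_dim - 1 - t] = in[in_offset + t];

        /* All writes are done here; on the device this is the barrier. */

        /* Phase 2: write forward, but to the mirrored block offset. */
        for (size_t t = 0; t < block_dim; ++t)
            out[out_offset + t] = s_data[t];
    }
}
```

Calling it on an 8-element array 0..7 with block_dim 4 fills the output with 7..0, the same condition the assert loop in main checks for the full array.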
Finished reading cudaArticle 06
Reading the Berkeley view on cloud computing
Page 10: classes of utility computing
Reading Makefile.pdf
List the macros defined by default (Makefile)
Using: make -p
$@ name of the target
$? list of dependencies that are more recent than the target
$^ all dependencies, whether or not they are more recent than the target
$+ same as $^, but keeps duplicate names
$< the first dependency
Reading the Berkeley view on cloud computing
Page 19, Obstacle Number 5: Performance Unpredictability
Finished reading the Berkeley view on cloud computing
Coding the motion project
Visual Studio 2005 reports a stack overflow error:
“Unhandled exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.”
'motion.exe': Unloaded 'C:\WINDOWS\WinSxS\x86_Microsoft.VC80.CRT_1fc8b3b9a1e18e3b_8.0.50727.4053_x-ww_e6967989\msvcr80.dll'
'motion.exe': Unloaded 'C:\WINDOWS\system32\psapi.dll'
'motion.exe': Unloaded 'C:\WINDOWS\system32\shimeng.dll'
First-chance exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.
Unhandled exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.
The program '[2388] motion.exe: Native' has exited with code 0 (0x0).
Problem: using a huge object
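A common cause of 0xC00000FD is a large object declared as a local variable, since the default stack reserved by the Visual C++ linker is 1 MB. Assuming that is what happened here (the log does not say how the object was declared), one way out is to allocate it on the heap. A minimal C sketch; MotionBuffer, its dimensions, and both function names are hypothetical and chosen only for illustration.

```c
#include <stdlib.h>

#define FRAME_COUNT 4096  /* hypothetical dimensions */
#define FRAME_SIZE  4096

/* 16 MB in total: declaring this as a local variable would not fit
   in a 1 MB stack. */
typedef struct {
    unsigned char pixels[FRAME_COUNT][FRAME_SIZE];
} MotionBuffer;

/* Allocates a zero-initialized buffer on the heap.
   Returns NULL if the allocation fails. */
static MotionBuffer *motion_buffer_create(void)
{
    return (MotionBuffer *)calloc(1, sizeof(MotionBuffer));
}

/* Releases a buffer obtained from motion_buffer_create. */
static void motion_buffer_destroy(MotionBuffer *buf)
{
    free(buf);
}
```

The caller keeps only a pointer on the stack, checks it against NULL, and calls motion_buffer_destroy when done. Raising the stack size in the linker settings is an alternative, but it only moves the limit.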
Coding CSE332 project 2
Adding other data-counter implementations