Latency Tolerance

The speed of microprocessors has increased by more than a factor of ten per decade, but the speed of commodity memories (DRAMs) has only doubled, i.e., access time is halved. Therefore, the latency of memory access measured in processor clock cycles grows by roughly a factor of six every ten years. Multiprocessors intensify the problem. In bus-based systems, the establishment of a high-bandwidth bus between the processor and the memory tends to increase the latency of obtaining data from the memory. When the memory is physically distributed, the latency of the network and the network interface is added to that of accessing the local memory on the node. Latency usually grows with the size of the machine, as more nodes imply more communication relative to computation, more hops in the network for general communication, and likely more contention. The main goal of hardware design is to reduce the latency of data access while maintaining high, scalable bandwidth.

Overview of Latency Tolerance

How latency tolerance is handled is best understood by looking at the resources in the machine and how they are utilized. From the processor's point of view, the communication architecture from one node to another can be viewed as a pipeline. The stages of the pipeline include the network interfaces at the source and destination, as well as the network links and switches along the way. There are also stages in the communication assist, the local memory/cache system, and the main processor, depending on how the architecture manages communication. The utilization problem in the baseline communication structure is that either the processor or the communication architecture is busy at a given time, and in the communication pipeline only one stage is busy at a time as the single word being transmitted makes its way from source to destination. The aim of latency tolerance is to overlap the use of these resources as much as possible.

Latency Tolerance in Explicit Message Passing

The actual transfer of data in message passing is typically sender-initiated, using a send operation. A receive operation does not in itself cause data to be communicated, but rather copies data from an incoming buffer into the application address space. Receiver-initiated communication is done by issuing a request message to the process that is the source of the data, which then sends the data back via another send. A synchronous send operation has a communication latency equal to the time it takes to communicate all the data in the message to the destination, plus the time for receive processing and the time for an acknowledgment to be returned. The latency of a synchronous receive operation is its processing overhead, which includes copying the data into the application, plus any additional latency if the data has not yet arrived. We would like to hide these latencies, including overheads if possible, at both ends.
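One common way to obtain this overlap in explicit message passing is to use non-blocking sends and receives, so that the transfer proceeds while the processor keeps computing. The following is a minimal sketch using MPI; the buffer sizes, the tag value, and the compute_locally() routine are illustrative assumptions, not taken from the original text:

```c
#include <mpi.h>

static void compute_locally(void)
{
    /* application work that does not touch the message buffers */
}

/* Overlap communication with computation using non-blocking MPI calls. */
void exchange_and_compute(double *send_buf, double *recv_buf, int n, int peer)
{
    MPI_Request reqs[2];

    /* Post the receive and the send, then return immediately. */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[1]);

    compute_locally();   /* useful work proceeds while the transfer is in flight */

    /* Block only when the communicated data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

The overhead of posting the operations remains on the processor, but the transmission and receive-side latency are hidden behind the local computation.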
Latency Tolerance in a Shared Address Space

The baseline communication is through reads and writes in a shared address space; for convenience, it is called read-write communication. Receiver-initiated communication is done with read operations that result in data from another processor's memory or cache being accessed. If there is no caching of shared data, sender-initiated communication may be done through writes to data that are allocated in remote memories. With cache coherence, the effect of writes is more complex: whether a write leads to sender- or receiver-initiated communication depends on the cache coherence protocol. Whether receiver-initiated or sender-initiated, the communication in a hardware-supported read-write shared address space is naturally fine-grained, which makes latency tolerance very important.

Block Data Transfer in a Shared Address Space

In a shared address space, either in hardware or software, the coalescing of data and the initiation of block transfers can be done explicitly in the user program or transparently by the system. Explicit block transfers are initiated by executing a command similar to a send in the user program. The send command is interpreted by the communication assist, which transfers the data in a pipelined manner from the source node to the destination. At the destination, the communication assist pulls the data words in from the network interface and stores them in the specified locations. There are two prime differences from send-receive message passing, both of which arise from the fact that the sending process can directly specify the program data structures where the data is to be placed at the destination, since these locations are in the shared address space.

Proceeding Past Long-latency Events in a Shared Address Space

If a memory operation is made non-blocking, the processor can proceed past it to other instructions. For writes, this is usually quite simple to implement: the write is put in a write buffer, and the processor goes on while the buffer takes care of issuing the write to the memory system and tracking its completion as required. The difference is that, unlike a write, a read is generally followed very soon by an instruction that needs the value returned by the read.

Pre-communication in a Shared Address Space

Pre-communication is a technique that has already been widely adopted in commercial microprocessors, and its importance is likely to increase in the future. A prefetch instruction does not replace the actual read of the data item, and the prefetch instruction itself must be non-blocking if it is to achieve its goal of hiding latency through overlap. In this case, as shared data is not cached, the prefetched data is brought into a special hardware structure called a prefetch buffer. When the word is actually read into a register in the next iteration, it is read from the head of the prefetch buffer rather than from memory. If the latency to hide were much bigger than the time to compute a single loop iteration, we would prefetch several iterations ahead, and there would potentially be several words in the prefetch buffer at a time.

Multithreading in a Shared Address Space
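On machines that do cache shared data, the same idea is commonly exposed to software as prefetch instructions. The following is a minimal sketch in C using the GCC/Clang builtin; the prefetch distance of 8 iterations is an arbitrary illustrative choice, and a real value would be tuned to the memory latency and the cost of the loop body:

```c
/* Software prefetching: request a[i + DIST] while working on a[i],
 * so the memory access overlaps with computation on earlier elements. */
#define DIST 8   /* illustrative prefetch distance, in loop iterations */

double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        sum += a[i];   /* the actual read still happens here */
    }
    return sum;
}
```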

Cache Coherence & Synchronization

In this chapter, we will discuss the cache coherence protocols used to cope with multicache inconsistency problems.

The Cache Coherence Problem

In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. For example, the cache and the main memory may have inconsistent copies of the same object. As multiple processors operate in parallel, and multiple caches may independently hold different copies of the same memory block, a cache coherence problem arises. Cache coherence schemes help to avoid this problem by maintaining a uniform state for each cached block of data.

Let X be an element of shared data which has been referenced by two processors, P1 and P2. In the beginning, the three copies of X are consistent. If processor P1 writes a new value X1 into its cache, then under a write-through policy the same copy is written immediately into the shared memory. In this case, inconsistency occurs between P2's cached copy and the main memory. When a write-back policy is used, the main memory will be updated only when the modified data in the cache is replaced or invalidated. In general, there are three sources of the inconsistency problem −

Sharing of writable data
Process migration
I/O activity

Snoopy Bus Protocols

Snoopy protocols achieve data consistency between the cache memories and the shared memory through a bus-based memory system. Write-invalidate and write-update policies are used for maintaining cache consistency. In this case, we have three processors P1, P2, and P3 having a consistent copy of data element 'X' in their local cache memory and in the shared memory (Figure-a). Processor P1 writes X1 in its cache memory using the write-invalidate protocol, so all other copies are invalidated via the bus; they are denoted by 'I' (Figure-b). Invalidated blocks are stale, i.e., they should not be used. The write-update protocol instead updates all the cache copies via the bus. Using a write-back cache, the memory copy is also updated (Figure-c).

Cache Events and Actions

The following events and actions occur on the execution of memory-access and invalidation commands −

Read-miss − When a processor wants to read a block that is not in its cache, a read-miss occurs. This initiates a bus-read operation. If no dirty copy exists, then the main memory, which has a consistent copy, supplies a copy to the requesting cache. If a dirty copy exists in a remote cache, that cache will inhibit the main memory and send a copy to the requesting cache. In both cases, the cache copy enters the valid state after a read-miss.

Write-hit − If the copy is in the dirty or reserved state, the write is done locally and the new state is dirty. If the copy is in the valid state, a write-invalidate command is broadcast to all caches, invalidating their copies; the shared memory is written through, and the resulting state is reserved after this first write.

Write-miss − If a processor fails to find the block in its local cache on a write, the copy must come either from the main memory or from a remote cache holding a dirty block. This is done by sending a read-invalidate command, which invalidates all other cache copies. The local copy is then updated and left in the dirty state.

Read-hit − A read-hit is always performed in the local cache memory without causing a state transition or using the snoopy bus for invalidation.
Block replacement − When a copy is dirty, it must be written back to the main memory by the block replacement method. When the copy is in the valid, reserved, or invalid state, no write-back is needed on replacement.

Directory-Based Protocols

When a multistage network is used to build a large multiprocessor with hundreds of processors, the snoopy cache protocols need to be modified to suit the network capabilities. Broadcasting being very expensive in a multistage network, the consistency commands are sent only to those caches that keep a copy of the block. This is the reason for the development of directory-based protocols for network-connected multiprocessors. In a directory-based protocol system, the data to be shared are placed in a common directory that maintains coherence among the caches. Here, the directory acts as a filter where the processors ask permission to load an entry from the primary memory to their cache memory. If an entry is changed, the directory either updates it or invalidates the other caches holding that entry.

Hardware Synchronization Mechanisms

Synchronization is a special form of communication in which, instead of data, control information is exchanged between communicating processes residing in the same or different processors. Multiprocessor systems use hardware mechanisms to implement low-level synchronization operations. Most multiprocessors have hardware support for atomic operations such as memory read, write, or read-modify-write operations, which are used to implement synchronization primitives. Other than atomic memory operations, some inter-processor interrupts are also used for synchronization purposes.

Cache Coherency in Shared Memory Machines

Maintaining cache coherency is a problem in a multiprocessor system when the processors contain local cache memory. Data inconsistency between different caches easily occurs in such a system. The major concern areas are −

Sharing of writable data
Process migration
I/O activity

Sharing of writable data − When two processors (P1 and P2) have the same data element (X) in their local caches and one process (P1) writes to it, then, as the caches are write-through, the main memory is also updated. Now when P2 tries to read data element (X), it reads a stale value, because the copy in the cache of P2 has become outdated.

Process migration − In the first stage, the cache of P1 holds data element X, whereas P2 holds nothing. A process on P2 first writes X and then migrates to P1. After migration, the process starts reading data element X, but processor P1 holds an outdated copy, so the process reads a stale value. Similarly, if a process on P1 writes X and then migrates to P2, it will read an outdated copy of X after migration.
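The cache events and actions listed above define a small per-block state machine over the four states named in the text (invalid, valid, reserved, dirty). The following is a minimal sketch in C; the bus operations are reduced to no-op stubs purely for illustration, and this is not a complete protocol implementation:

```c
/* Per-block states of the write-invalidate protocol described above. */
typedef enum { INVALID, VALID, RESERVED, DIRTY } BlockState;

/* Stubs: a real cache controller would snoop and drive the shared bus. */
static void bus_read(void)             { /* fetch block from memory or a dirty remote cache */ }
static void bus_write_invalidate(void) { /* write word through and invalidate other copies   */ }
static void bus_read_invalidate(void)  { /* fetch block and invalidate other copies          */ }
static void write_back(void)           { /* flush the modified block to main memory          */ }

/* Processor-side read of a block. */
BlockState on_processor_read(BlockState s)
{
    if (s == INVALID) {          /* read-miss */
        bus_read();
        return VALID;
    }
    return s;                    /* read-hit: no state change, no bus traffic */
}

/* Processor-side write to a block. */
BlockState on_processor_write(BlockState s)
{
    switch (s) {
    case DIRTY:
    case RESERVED:
        return DIRTY;            /* write-hit on an exclusive copy: purely local */
    case VALID:
        bus_write_invalidate();  /* first write: write through, invalidate others */
        return RESERVED;
    case INVALID:
    default:
        bus_read_invalidate();   /* write-miss: obtain block, invalidate others */
        return DIRTY;
    }
}

/* Block replacement: only a dirty copy needs to be written back. */
void on_replacement(BlockState s)
{
    if (s == DIRTY)
        write_back();
}
```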

Parallel Computer Models

Parallel processing has been developed as an effective technology in modern computers to meet the demand for higher performance, lower cost and accurate results in real-life applications. Concurrent events are common in today's computers due to the practice of multiprogramming, multiprocessing, or multicomputing. Modern computers have powerful and extensive software packages. To analyze the development of computer performance, we first have to understand the basic development of hardware and software.

Computer Development Milestones − There are two major stages in the development of computers: machines built from mechanical or electromechanical parts, and modern computers, which evolved after the introduction of electronic components. High-mobility electrons in electronic computers replaced the moving parts of mechanical computers, and electric signals, which travel at almost the speed of light, replaced mechanical gears and levers for information transmission.

Elements of Modern Computers − A modern computer system consists of computer hardware, instruction sets, application programs, system software and a user interface. Computing problems are categorized as numerical computing, logical reasoning, and transaction processing. Some complex problems may need a combination of all three processing modes.

Evolution of Computer Architecture − In the last four decades, computer architecture has gone through revolutionary changes. We started with the Von Neumann architecture and now we have multicomputers and multiprocessors.

Performance of a computer system − The performance of a computer system depends both on machine capability and program behavior. Machine capability can be improved with better hardware technology, advanced architectural features and efficient resource management. Program behavior is unpredictable, as it depends on the application and run-time conditions.

Multiprocessors and Multicomputers

In this section, we will discuss two types of parallel computers −

Multiprocessors
Multicomputers

Shared-Memory Multiprocessors

The three most common shared-memory multiprocessor models are −

Uniform Memory Access (UMA) − In this model, all the processors share the physical memory uniformly. All the processors have equal access time to all the memory words. Each processor may have a private cache memory. The same rule is followed for peripheral devices. When all the processors have equal access to all the peripheral devices, the system is called a symmetric multiprocessor. When only one or a few processors can access the peripheral devices, the system is called an asymmetric multiprocessor.

Non-Uniform Memory Access (NUMA) − In the NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the shared memory is physically distributed among all the processors as local memories. The collection of all local memories forms a global address space which can be accessed by all the processors.

Cache Only Memory Architecture (COMA) − The COMA model is a special case of the NUMA model. Here, all the distributed main memories are converted to cache memories.

Distributed-Memory Multicomputers − A distributed-memory multicomputer system consists of multiple computers, known as nodes, interconnected by a message-passing network. Each node acts as an autonomous computer having a processor, a local memory and sometimes I/O devices. In this case, all local memories are private and are accessible only to the local processors.
This is why such machines are also called no-remote-memory-access (NORMA) machines.

Multivector and SIMD Computers

In this section, we will discuss supercomputers and parallel processors for vector processing and data parallelism.

Vector Supercomputers − In a vector computer, a vector processor is attached to the scalar processor as an optional feature. The host computer first loads the program and data into the main memory. Then the scalar control unit decodes all the instructions. If the decoded instructions are scalar operations or program operations, the scalar processor executes those operations using scalar functional pipelines. If the decoded instructions are vector operations, they are sent to the vector control unit.

SIMD Supercomputers − In SIMD computers, N processors are connected to a control unit and all the processors have their individual memory units. All the processors are connected by an interconnection network.

PRAM and VLSI Models

The ideal model gives a suitable framework for developing parallel algorithms without considering the physical constraints or implementation details. The models can be used to obtain theoretical performance bounds on parallel computers or to evaluate VLSI complexity in terms of chip area and operational time before the chip is fabricated.

Parallel Random-Access Machines − Sheperdson and Sturgis (1963) modeled conventional uniprocessor computers as random-access machines (RAM). Fortune and Wyllie (1978) developed a parallel random-access-machine (PRAM) model for modeling an idealized parallel computer with zero memory-access overhead and synchronization. An N-processor PRAM has a shared memory unit, which can be centralized or distributed among the processors. The processors operate on a synchronized read-memory, write-memory and compute cycle, so these models specify how concurrent read and write operations are handled. The possible memory update operations are −

Exclusive read (ER) − In each cycle, only one processor is allowed to read from any given memory location.

Exclusive write (EW) − At most one processor is allowed to write into a memory location at a time.

Concurrent read (CR) − Multiple processors may read the same information from the same memory location in the same cycle.

Concurrent write (CW) − Simultaneous write operations to the same memory location are allowed; to avoid write conflicts, some arbitration policy is set up.

VLSI Complexity Model

Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale switching networks. Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of storage (memory) space available in that chip. We can calculate the space complexity of an algorithm by the chip area (A) of the VLSI implementation of that algorithm. If T is the time (latency) needed to execute the algorithm, then A·T gives an upper bound on the total number of bits processed through the chip (or I/O). For certain computations, there exists a lower bound f(s) such that

$$A \cdot T^2 \geq O(f(s))$$

where A is the chip area and T is the execution time.

Architectural Development Tracks
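As a concrete illustration of this bound (a classical result from VLSI complexity theory, not stated in the original text): for multiplying two n-bit integers, any chip design must satisfy, for some constant c > 0,

$$A \cdot T^{2} \geq c \cdot n^{2}$$

so, for a fixed problem size, halving the multiplication time requires roughly quadrupling the chip area.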

Introduction

In the last 50 years, there have been huge developments in the performance and capability of computer systems. This has been possible with the help of Very Large Scale Integration (VLSI) technology. VLSI technology allows a large number of components to be accommodated on a single chip and clock rates to increase. Therefore, more operations can be performed at a time, in parallel. Parallel processing is also associated with data locality and data communication. Parallel computer architecture is the method of organizing all the resources to maximize performance and programmability within the limits given by technology and cost at any instance of time.

Why Parallel Architecture?

Parallel computer architecture adds a new dimension to the development of computer systems by using more and more processors. In principle, the performance achieved by utilizing a large number of processors is higher than the performance of a single processor at a given point in time.

Application Trends

With the advancement of hardware capacity, the demand for well-performing applications also increased, which in turn placed a demand on the development of computer architecture. Before the microprocessor era, high-performing computer systems were obtained by exotic circuit technology and machine organization, which made them expensive. Now, high-performing computer systems are obtained by using multiple processors, and the most important and demanding applications are written as parallel programs. Thus, for higher performance both parallel architectures and parallel applications need to be developed. To increase the performance of an application, speedup is the key factor to be considered. Speedup on p processors is defined as −

$$Speedup(p\ \text{processors}) \equiv \frac{Performance(p\ \text{processors})}{Performance(1\ \text{processor})}$$

For a single fixed problem,

$$Performance\ \text{of a computer system} = \frac{1}{\text{Time needed to complete the problem}}$$

$$Speedup_{\text{fixed problem}}(p\ \text{processors}) = \frac{Time(1\ \text{processor})}{Time(p\ \text{processors})}$$
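As a simple worked example (the numbers are illustrative, not from the original text): if a fixed problem takes 100 seconds on one processor and 25 seconds on eight processors, then

$$Speedup_{\text{fixed problem}}(8\ \text{processors}) = \frac{100\ \text{s}}{25\ \text{s}} = 4$$

i.e., the eight processors run at only 50% efficiency, since the ideal speedup would be 8.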
Scientific and Engineering Computing

Parallel architecture has become indispensable in scientific computing (physics, chemistry, biology, astronomy, etc.) and in engineering applications (reservoir modeling, airflow analysis, combustion efficiency, etc.). In almost all applications, there is a huge demand for visualization of the computational output, resulting in a demand for parallel computing to increase computational speed.

Commercial Computing

In commercial computing (video, graphics, databases, OLTP, etc.), high-speed computers are also needed to process huge amounts of data within a specified time. Desktops use multithreaded programs that are almost like parallel programs. This in turn demands the development of parallel architecture.

Technology Trends

With the development of technology and architecture, there is a strong demand for the development of high-performing applications. Experiments show that parallel computers can work much faster than the most highly developed single processor. Moreover, parallel computers can be developed within the limits of technology and cost. The primary technology used here is VLSI technology; nowadays, more and more transistors, gates and circuits can be fitted in the same area. With the reduction of the basic VLSI feature size, the clock rate improves in proportion to it, while the number of transistors grows as the square. The use of many transistors at once (parallelism) can therefore be expected to perform much better than increasing the clock rate. Technology trends suggest that the basic single-chip building block will provide increasingly large capacity, so the possibility of placing multiple processors on a single chip increases.

Architectural Trends

Development in technology decides what is feasible; architecture converts the potential of the technology into performance and capability. Parallelism and locality are two methods by which larger volumes of resources and more transistors enhance performance. However, these two methods compete for the same resources. When multiple operations are executed in parallel, the number of cycles needed to execute the program is reduced. However, resources are needed to support each of the concurrent activities. Resources are also needed to allocate local storage. The best performance is achieved by an intermediate action plan that uses resources to exploit a degree of parallelism and a degree of locality.

Generally, the history of computer architecture has been divided into four generations having the following basic technologies −

Vacuum tubes
Transistors
Integrated circuits
VLSI

Until 1985, the period was dominated by growth in bit-level parallelism: 4-bit microprocessors were followed by 8-bit, 16-bit, and so on. To reduce the number of cycles needed to perform a full 32-bit operation, the width of the data path was doubled; later on, 64-bit operations were introduced. The growth in instruction-level parallelism dominated the mid-80s to mid-90s. The RISC approach showed that it was simple to pipeline the steps of instruction processing so that, on average, an instruction is executed in almost every cycle. Growth in compiler technology has made instruction pipelines more productive.

In the mid-80s, microprocessor-based computers consisted of −

An integer processing unit
A floating-point unit
A cache controller
SRAMs for the cache data
Tag storage

As chip capacity increased, all these components were merged into a single chip. Thus, a single chip consisted of separate hardware for integer arithmetic, floating-point operations, memory operations and branch operations. Beyond pipelining individual instructions, such a processor fetches multiple instructions at a time and sends them in parallel to different functional units whenever possible. This type of instruction-level parallelism is called superscalar execution.


Multiprocessors and Multicomputers

We will discuss multiprocessors and multicomputers in this chapter.

Multiprocessor System Interconnects

Parallel processing needs efficient system interconnects for fast communication among the input/output and peripheral devices, multiprocessors and shared memory.

Hierarchical Bus Systems

A hierarchical bus system consists of a hierarchy of buses connecting various systems and sub-systems/components in a computer. Each bus is made up of a number of signal, control, and power lines. Different buses, such as local buses, backplane buses and I/O buses, are used to perform different interconnection functions. Local buses are the buses implemented on printed-circuit boards. A backplane bus is a printed circuit on which many connectors are used to plug in functional boards. Buses which connect input/output devices to a computer system are known as I/O buses.

Crossbar Switch and Multiport Memory

Switched networks give dynamic interconnections between the inputs and outputs. Small or medium-size systems mostly use crossbar networks. Multistage networks can be expanded to larger systems if the increased latency problem can be solved. Both crossbar switch and multiport memory organizations are single-stage networks. Though a single-stage network is cheaper to build, multiple passes may be needed to establish certain connections. A multistage network has more than one stage of switch boxes. These networks should be able to connect any input to any output.

Multistage and Combining Networks

Multistage networks, or multistage interconnection networks, are a class of high-speed computer networks mainly composed of processing elements on one end of the network and memory elements on the other end, connected by switching elements. These networks are used to build larger multiprocessor systems. Examples include the Omega network, the Butterfly network and many more.

Multicomputers

Multicomputers are distributed-memory MIMD architectures. The following diagram shows a conceptual model of a multicomputer. Multicomputers are message-passing machines which apply the packet-switching method to exchange data. Here, each processor has a private memory, but no global address space, as a processor can access only its own local memory. So, communication is not transparent: programmers have to explicitly put communication primitives in their code.

Having no globally accessible memory is a drawback of multicomputers. This can be addressed by using the following two schemes −

Virtual Shared Memory (VSM)
Shared Virtual Memory (SVM)

In these schemes, the application programmer assumes a big shared memory which is globally addressable. If required, the memory references made by applications are translated into the message-passing paradigm.

Virtual Shared Memory (VSM) − VSM is a hardware implementation. The virtual memory system of the operating system is transparently implemented on top of VSM, so the operating system thinks it is running on a machine with shared memory.

Shared Virtual Memory (SVM) − SVM is a software implementation at the operating-system level with hardware support from the Memory Management Unit (MMU) of the processor. Here, the unit of sharing is an operating-system memory page. If a processor addresses a particular memory location, the MMU determines whether the memory page associated with the access is in the local memory or not. If the page is not in memory, in a normal computer system it is swapped in from the disk by the operating system; in SVM, however, the operating system fetches the page from the remote node which owns that particular page.
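A highly simplified sketch of that SVM page-fault path follows. The helper functions (page_owner, fetch_page_from, map_page_local, my_node_id) and the fixed page size are hypothetical names introduced purely to illustrate the control flow described above; a real SVM runtime would also handle ownership changes and write permissions:

```c
#include <stdint.h>

#define PAGE_SIZE 4096u   /* illustrative page size */

/* Hypothetical helpers a real SVM runtime would provide. */
extern int  page_owner(uintptr_t page);                 /* node that currently owns the page  */
extern void fetch_page_from(int node, uintptr_t page);  /* pull the page over the network     */
extern void map_page_local(uintptr_t page);             /* install it in the local page table */
extern int  my_node_id(void);

/* Called when the MMU faults on an address that is not mapped locally. */
void svm_page_fault(uintptr_t fault_addr)
{
    uintptr_t page = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    int owner = page_owner(page);

    if (owner != my_node_id())
        fetch_page_from(owner, page);   /* SVM: fetch from the owning node, not from disk */

    map_page_local(page);               /* the faulting access is then retried and succeeds */
}
```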
Three Generations of Multicomputers

In this section, we will discuss three generations of multicomputers.

Design Choices in the Past − While selecting a processor technology, a multicomputer designer chooses low-cost, medium-grain processors as building blocks. The majority of parallel computers are built with standard off-the-shelf microprocessors. Distributed memory was chosen for multicomputers rather than shared memory, which would limit scalability; each processor has its own local memory unit. For the interconnection scheme, multicomputers use message-passing, point-to-point direct networks rather than address-switching networks. For the control strategy, multicomputer designers choose asynchronous MIMD, MPMD, and SPMD operation. Caltech's Cosmic Cube (Seitz, 1983) was the first of the first-generation multicomputers.

Present and Future Development − The next generation of computers evolved from medium-grain to fine-grain multicomputers using a globally shared virtual memory. Second-generation multicomputers are still in use at present, but using better processors like the i386, i860, etc., second-generation computers developed a lot. Third-generation computers are the next-generation machines, in which VLSI-implemented nodes will be used. Each node may have a 14-MIPS processor, 20-Mbytes/s routing channels and 16 Kbytes of RAM integrated on a single chip.

The Intel Paragon System − Previously, homogeneous nodes were used to make hypercube multicomputers, as all the functions were given to the host. This limited the I/O bandwidth, so these computers could not be used to solve large-scale problems efficiently or with high throughput. The Intel Paragon System was designed to overcome this difficulty. It turned the multicomputer into an application server with multiuser access in a network environment.

Message Passing Mechanisms

Message-passing mechanisms in a multicomputer network need special hardware and software support. In this section, we will discuss some schemes.

Message-Routing Schemes − In a multicomputer with the store-and-forward routing scheme, packets are the smallest unit of information transmission. In wormhole-routed networks, packets are further divided into flits. Packet length is determined by the routing scheme and network implementation, whereas flit length is affected by the network size. In store-and-forward routing, packets are the basic unit of information transmission; each node uses a packet buffer, and a packet is transmitted from a source node to a destination node through a sequence of intermediate nodes. Latency is directly proportional to the distance between the source and the destination. In wormhole routing, the transmission from the source node to the destination node is done through a sequence of routers. All the flits of the same packet are transmitted in an inseparable sequence in a pipelined fashion; in this case, only the header flit knows where the packet is going.

Deadlock and Virtual Channels


Convergence of Parallel Architectures

Parallel machines have been developed with several distinct architectures. In this section, we will discuss different parallel computer architectures and the nature of their convergence.

Communication Architecture

Parallel architecture enhances the conventional concepts of computer architecture with communication architecture. Computer architecture defines critical abstractions (like the user-system boundary and the hardware-software boundary) and organizational structure, whereas communication architecture defines the basic communication and synchronization operations. It also addresses the organizational structure.

The programming model is the top layer, and applications are written in a programming model. Parallel programming models include −

Shared address space
Message passing
Data parallel programming

Shared address programming is just like using a bulletin board, where one can communicate with one or many individuals by posting information at a particular location, which is shared by all other individuals. Individual activity is coordinated by noting who is doing what task. Message passing is like a telephone call or letters, where a specific receiver receives information from a specific sender. Data parallel programming is an organized form of cooperation: several individuals perform an action on separate elements of a data set concurrently and share information globally.

Shared Memory

Shared-memory multiprocessors are one of the most important classes of parallel machines. They give better throughput on multiprogramming workloads and support parallel programs. In this case, the computer system allows a processor and a set of I/O controllers to access a collection of memory modules through some hardware interconnection. The memory capacity is increased by adding memory modules, and I/O capacity is increased by adding devices to an I/O controller or by adding additional I/O controllers. Processing capacity can be increased by waiting for a faster processor to become available or by adding more processors.

All the resources are organized around a central memory bus. Through the bus access mechanism, any processor can access any physical address in the system. As all the processors are equidistant from all the memory locations, the access time, or latency, is the same for every processor on a memory location. This is called a symmetric multiprocessor.

Message-Passing Architecture

Message-passing architecture is also an important class of parallel machines. It provides communication among processors as explicit I/O operations; the communication is combined at the I/O level instead of at the memory system. In a message-passing architecture, user communication is executed using operating system or library calls that perform many lower-level actions, including the actual communication operation. As a result, there is a distance between the programming model and the communication operations at the physical hardware level.

Send and receive are the most common user-level communication operations in a message-passing system. Send specifies a local data buffer (which is to be transmitted) and a receiving remote processor. Receive specifies a sending process and a local data buffer in which the transmitted data will be placed. In the send operation, an identifier or tag is attached to the message, and the receiving operation specifies a matching rule, such as a specific tag from a specific processor or any tag from any processor. The combination of a send and a matching receive completes a memory-to-memory copy. Each end specifies its local data address and a pairwise synchronization event.
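A minimal sketch of such a matched send/receive pair using MPI follows; the tag value, buffer length and the two-rank split are illustrative assumptions, not from the original text:

```c
#include <mpi.h>

/* A matched send/receive pair: the receive names the expected sender and tag,
 * and their combination completes a memory-to-memory copy between ranks. */
void exchange(int rank, double *buf, int n)
{
    const int TAG = 42;   /* illustrative message tag */

    if (rank == 0) {
        /* Send: local buffer, length, destination rank, tag. */
        MPI_Send(buf, n, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: matches only messages from rank 0 carrying TAG;
         * MPI_ANY_SOURCE / MPI_ANY_TAG would relax the matching rule. */
        MPI_Recv(buf, n, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```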
Convergence

Development of the hardware and software has faded the clear boundary between the shared-memory and message-passing camps. Message passing and a shared address space represent two distinct programming models; each gives a transparent paradigm for sharing, synchronization and communication. However, the basic machine structures have converged towards a common organization.

Data Parallel Processing

Another important class of parallel machines is variously called processor arrays, data parallel architectures, or single-instruction-multiple-data (SIMD) machines. The main feature of the programming model is that operations can be executed in parallel on each element of a large regular data structure (like an array or matrix). Data parallel programming languages are usually implemented by viewing the local address spaces of a group of processes, one per processor, as forming an explicit global space. As all the processors communicate together and there is a global view of all the operations, either a shared address space or message passing can be used underneath.

Fundamental Design Issues

Development of the programming model alone cannot increase the efficiency of the computer, nor can the development of hardware alone do it. However, development in computer architecture can make the difference in the performance of the computer. We can understand the design problem by focusing on how programs use a machine and which basic technologies are provided. In this section, we will discuss the communication abstraction and the basic requirements of the programming model.

Communication Abstraction − Communication abstraction is the main interface between the programming model and the system implementation. It is like the instruction set, in that it provides a platform so that the same program can run correctly on many implementations. Operations at this level must be simple. Communication abstraction is like a contract between the hardware and the software, which allows each the flexibility to improve without affecting the other's work.

Programming Model Requirements − A parallel program has one or more threads operating on data. A parallel programming model defines what data the threads can name, which operations can be performed on the named data, and which order is followed by the operations. To confirm that the dependencies between the programs are enforced, a parallel program must coordinate the activity of its threads.
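A minimal sketch in C of what such naming and coordination look like in practice, using POSIX threads; the array size, the two-thread split and the slice structure are illustrative choices, not part of the original text:

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

/* Each thread "names" and operates on its own slice of the shared data. */
struct slice { int begin, end; double partial; };

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->partial = 0.0;
    for (int i = s->begin; i < s->end; i++)
        s->partial += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) data[i] = 1.0;

    struct slice s[2] = { {0, N / 2, 0.0}, {N / 2, N, 0.0} };
    pthread_t t[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, sum_slice, &s[i]);

    /* Coordination point: the joins enforce the dependency between the
     * threads' partial results and the final combination step. */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("sum = %f\n", s[0].partial + s[1].partial);
    return 0;
}
```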

Interconnection Network Design

An interconnection network in a parallel machine transfers information from any source node to any desired destination node. This task should be completed with as little latency as possible, it should allow a large number of such transfers to take place concurrently, and it should be inexpensive compared to the cost of the rest of the machine. The network is composed of links and switches, which help to send the information from the source node to the destination node. A network is specified by its topology, routing algorithm, switching strategy, and flow control mechanism.

Organizational Structure

Interconnection networks are composed of the following three basic components −

Links − A link is a cable of one or more optical fibers or electrical wires with a connector at each end, attached to a switch or network interface port. Through it, an analog signal is transmitted from one end and received at the other to recover the original digital information stream.

Switches − A switch is composed of a set of input and output ports, an internal "crossbar" connecting all inputs to all outputs, internal buffering, and control logic to effect the input-output connection at each point in time. Generally, the number of input ports is equal to the number of output ports.

Network Interfaces − The network interface behaves quite differently from switch nodes and may be connected via special links. The network interface formats the packets and constructs the routing and control information. It may have input and output buffering, compared to a switch. It may perform end-to-end error checking and flow control. Hence, its cost is influenced by its processing complexity, storage capacity, and number of ports.

Interconnection Network

Interconnection networks are composed of switching elements. Topology is the pattern used to connect the individual switches to other elements, like processors, memories and other switches. A network allows the exchange of data between processors in the parallel system.

Direct connection networks − Direct networks have point-to-point connections between neighboring nodes. These networks are static, which means that the point-to-point connections are fixed. Some examples of direct networks are rings, meshes and cubes.

Indirect connection networks − Indirect networks have no fixed neighbors. The communication topology can be changed dynamically based on the application demands. Indirect networks can be subdivided into three parts: bus networks, multistage networks and crossbar switches.

Bus networks − A bus network is composed of a number of bit lines onto which a number of resources are attached. When buses use the same physical lines for data and addresses, the data and the address lines are time-multiplexed. When there are multiple bus masters attached to the bus, an arbiter is required.

Multistage networks − A multistage network consists of multiple stages of switches. It is composed of a×b switches which are connected using a particular interstage connection (ISC) pattern. Small 2×2 switch elements are a common choice for many multistage networks. The number of stages determines the delay of the network. By choosing different interstage connection patterns, various types of multistage networks can be created.

Crossbar switches − A crossbar switch contains a matrix of simple switch elements that can switch on and off to create or break a connection. Turning on a switch element in the matrix makes a connection between a processor and a memory. Crossbar switches are non-blocking, that is, all communication permutations can be performed without blocking.
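To make the cost trade-off between these two indirect networks concrete (standard element counts, not stated in the original text): an n×n crossbar requires

$$n^{2}\ \text{crosspoints},$$

while a multistage network built from 2×2 switches needs about $\log_2 n$ stages of $n/2$ switches each, i.e., on the order of $n \log_2 n$ switch elements, at the price of additional per-stage delay. For example, connecting 64 processors to 64 memories takes $64^2 = 4096$ crosspoints in a crossbar, but only 6 stages of 32 two-by-two switches (192 elements) in an Omega-style multistage network.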
Evaluating Design Trade-offs in Network Topology

If the main concern is the routing distance, then the dimension has to be maximized and a hypercube made; in store-and-forward routing, this assumes that the degree of the switch and the number of links are not a significant cost factor. If, instead, the number of links or the switch degree is the main cost, the dimension has to be minimized and a mesh built. For the worst-case traffic pattern of each network, it is preferable to have high-dimensional networks, where all the paths are short. For patterns in which each node communicates with only one or two nearby neighbors, it is preferable to have low-dimensional networks, since only a few of the dimensions are actually used.

Routing

The routing algorithm of a network determines which of the possible paths from source to destination are used as routes and how the route followed by each particular packet is determined. Dimension-order routing limits the set of legal paths so that there is exactly one route from each source to each destination: the one obtained by first traveling the correct distance in the high-order dimension, then the next dimension, and so on.

Routing Mechanisms − Arithmetic, source-based port select, and table look-up are three mechanisms that high-speed switches use to determine the output channel from information in the packet header. All of these mechanisms are simpler than the kind of general routing computations implemented in traditional LAN and WAN routers. In parallel computer networks, the switch needs to make a routing decision for all its inputs in every cycle, so the mechanism needs to be simple and fast.

Deterministic Routing − A routing algorithm is deterministic if the route taken by a message is determined exclusively by its source and destination, and not by other traffic in the network. If a routing algorithm selects only shortest paths toward the destination, it is minimal; otherwise it is non-minimal.

Deadlock Freedom − Deadlock can occur in various situations. When two nodes attempt to send data to each other and each begins sending before either receives, a 'head-on' deadlock may occur. Another case of deadlock occurs when multiple messages compete for resources within the network. The basic technique for proving that a network is deadlock-free is to identify the dependencies that can occur between channels as messages move through the network and to show that there are no cycles in the overall channel dependency graph; hence there is no traffic pattern that can lead to deadlock.
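A minimal sketch of dimension-order (X-then-Y) routing on a 2D mesh, as described above; the coordinate encoding and port names are illustrative assumptions:

```c
/* Output port chosen by a 2D-mesh switch under dimension-order routing:
 * correct the X (high-order) dimension first, then Y, then deliver locally. */
typedef enum { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_LOCAL } Port;

Port route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_XPLUS;
    if (dst_x < cur_x) return PORT_XMINUS;
    if (dst_y > cur_y) return PORT_YPLUS;
    if (dst_y < cur_y) return PORT_YMINUS;
    return PORT_LOCAL;   /* packet has arrived at its destination */
}
```

Because every packet corrects the dimensions in the same fixed order, there is exactly one legal path per source-destination pair, which keeps the channel dependency graph easy to reason about.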

Discussion

Parallel computer architecture is the method of organizing all the resources to maximize performance and programmability within the limits given by technology and cost at any instance of time. It adds a new dimension to the development of computer systems by using more and more processors. This tutorial covers the basics of parallel computer architecture, discussing the various concepts and terminologies associated with the topic.