Friday, January 7, 2011

Alternate Multicore Architectures

    The following designs are possible alternatives or successors to my existing hypercube or array processor designs.

D1: DataFlow Processor
   Data flows in a single direction in a pipeline or tree fashion, where each core performs a single operation on the dataset before passing it along to the next node(s).
D2: Data Parallel Processor
   This architecture is fine-grained and usually assigns a single core to each data point for SIMD-oriented user software.  A toy software analogue of both styles is sketched below.
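    As a rough illustration only, here is a toy analogue of the two styles in Java; the stage operations and data are placeholders, not part of either design.

// D1 modelled as a fixed pipeline of stages, each applying one operation;
// D2 modelled as the same operation applied to every data point independently
// (conceptually one core per data point).
import java.util.Arrays;

public class AltArchitectures {

    // A single operation performed by one "core".
    interface Op { int apply(int x); }

    // D1: dataflow - the dataset passes through each stage (core) in order.
    static int[] dataflow(int[] data, Op[] stages) {
        int[] out = data.clone();
        for (Op stage : stages) {
            for (int i = 0; i < out.length; i++) {
                out[i] = stage.apply(out[i]);
            }
        }
        return out;
    }

    // D2: data parallel - every data point gets the same operation.
    static int[] dataParallel(int[] data, Op op) {
        int[] out = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = op.apply(data[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        Op inc = new Op() { public int apply(int x) { return x + 1; } };
        Op dbl = new Op() { public int apply(int x) { return x * 2; } };
        int[] data = {1, 2, 3, 4};
        System.out.println(Arrays.toString(dataflow(data, new Op[] {inc, dbl})));
        System.out.println(Arrays.toString(dataParallel(data, dbl)));
    }
}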

Thursday, November 4, 2010

Kilocore SIMD Multiprocessor Array based on the Parallax Propeller 8-core microcontroller

Purpose:
    This blog will follow the construction of a prototype SIMD multiprocessor array of simple 32-bit processors arranged in a mesh architecture.
Experiments leading up to this prototype were detailed here.
The initial hardware configuration will be on breadboards as I get the connection topology, modularization and grid/host software worked out.
    I will be using the 160 MIPS Parallax Propeller P8X32 DIP (8-core/8-thread) microcontroller as the mesh PU.
  

I may use the superior 1600 MIPS XMOS XS1-G4 (4-core / 32-thread), but XMOS does not currently ship a DIP version of their surface-mount chip like Parallax Inc. does, which makes prototyping difficult. I could, however, use the G4 as the host bridge between the processor array and the PC, but not until I find a replacement for .binary loading of the mesh via the host - something the Propeller excels at.

Requirements:
    Our goal is to design a software/hardware combination that results in a grid/mesh of equal processing units (PU) controlled by a single host controller that is accessible from a host PC.
    Hardware modules are defined as follows...
  • M1: Host PC
  • M2: Host Controller
  • M3: PU Grid
  • M4: Grid Monitor Display (optional)
    Software Modules are defined as follows...
  • S1: Host PC Interface (Java)
    S1.1: Serial bidirectional connector (javax.comm)
    S1.2: HTTP unidirectional connector (java.net)
    S1.3: Persistence connector (org.eclipse.persistence.jpa)
  • S2: Host Controller (SPIN/Assembly)
    S2.1: Serial bidirectional connector
    S2.2: LED display driver
    S2.3: Grid Clock Generator (Assembly)
    S2.4: Grid Parallel Loader    
    S2.5: SIPO Grid Input Register (74hc595 out)
    S2.6: PISO Grid Output Register (74hc165/597 in)
  • S3: Grid PU (SPIN/Assembly)
Constraints:

  • C1: Power: Power consumption under 5A - I am currently using a 15W bench supply.  (In the production prototype I may use a 500W supply that has a 25W 3.3V rail, but I will need to load the 12V and 5V rails.)
  • C2: Grid Bootstrap: SIMD bootstrap model for the PU (processing unit) grid, i.e. 0..1 EEPROMs in total.

Analysis:
    S1.1: Serial bidirectional connector (javax.comm)
    In the past I developed direct port drivers for the PPT and SER ports using Visual Studio 6 because I could not get the Sun Comm API to work outside of Linux. However, I came across a page by Rick Proctor for the Lego RCX Brick at http://dn.codegear.com/article/31915 and at http://llk.media.mit.edu/projects/cricket/doc/serial.shtml which explain how to set up and implement the SerialPortEventListener interface.
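    As a sketch of how S1.1 could be wired up with this API (the port name, 115200 baud rate and the simple byte dump in serialEvent are assumptions for illustration):

// S1.1 sketch: open a serial port via javax.comm and receive bytes through the
// SerialPortEventListener callback. Requires the javax.comm driver to be installed.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.comm.CommPortIdentifier;
import javax.comm.SerialPort;
import javax.comm.SerialPortEvent;
import javax.comm.SerialPortEventListener;

public class HostSerialConnector implements SerialPortEventListener {

    private SerialPort port;
    private InputStream in;

    public void connect(String portName) throws Exception {
        CommPortIdentifier id = CommPortIdentifier.getPortIdentifier(portName);
        port = (SerialPort) id.open("HostSerialConnector", 2000);   // 2s open timeout
        port.setSerialPortParams(115200, SerialPort.DATABITS_8,
                SerialPort.STOPBITS_1, SerialPort.PARITY_NONE);
        in = port.getInputStream();
        port.addEventListener(this);          // register for DATA_AVAILABLE callbacks
        port.notifyOnDataAvailable(true);
    }

    // Outgoing side of the link (e.g. for pushing the grid image to the host controller).
    public OutputStream getOutputStream() throws IOException {
        return port.getOutputStream();
    }

    // Called on the driver's event thread whenever bytes arrive from the host controller.
    public void serialEvent(SerialPortEvent event) {
        if (event.getEventType() != SerialPortEvent.DATA_AVAILABLE) {
            return;
        }
        try {
            while (in.available() > 0) {
                System.out.printf("rx: 0x%02x%n", in.read());   // placeholder handling
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}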

    S2.4: Grid Parallel Loader: 
    In the past I used the technique originally posted on the Parallax Propeller Forum by users [godzich/Christian, pems] in 2008.  This involved connecting up to 12 Propellers to a single EEPROM and taking advantage of I2C bus mastering: each Propeller, once loaded, resets the next one in the serial sequence.  Each chip requires 1.3 seconds to boot and we are limited by parasitic capacitance to around 12 chips off a single EEPROM.  Therefore I started running into trouble with an 80-chip SIMD grid, where I required 10 EEPROMs for the entire grid - a programming headache.
    Use of the PropellerLoader by Chip Gracey was not really feasible without some elaborate tri-state bus mastering logic or the use of 160 pins to load all the chips in parallel.  However, there was a recent post by [clock loop] that expanded on Chip's loader by setting up all the PU grid chips to listen on the RX line while only one of them replies on the TX line.  Essentially the host programs one of the grid chips while the others act as listeners and get programmed in parallel, as long as we account for worst-case timing.
    This latest approach to bootstrapping, where the Grid controller loads all the Grid PU chips, requires that the SIMD grid SPIN/Assembly code be written to a bytecode .binary file using the Propeller Tool IDE command [Run | Compile Current | View Info (F8) | Save Binary File].
    The issue of determining whether the entire grid was loaded successfully is still solved by having the chips respond to the host using the PISO output grid register - which is read by the host after grid programming.
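    On the PC side of this flow, a minimal sketch of pushing the compiled .binary image down to the host controller over the serial connector is shown below; the 4-byte length-prefix framing and the class name are illustrative assumptions, not a defined protocol.

// Read the .binary produced by the Propeller Tool and stream it to the host
// controller (e.g. via HostSerialConnector.getOutputStream()), prefixed by a
// 4-byte big-endian length so the controller knows how many bytes to relay.
import java.io.File;
import java.io.FileInputStream;
import java.io.OutputStream;

public class GridImageSender {

    public static void send(File binaryFile, OutputStream serialOut) throws Exception {
        byte[] image = new byte[(int) binaryFile.length()];
        FileInputStream in = new FileInputStream(binaryFile);
        try {
            int off = 0;
            while (off < image.length) {
                off += in.read(image, off, image.length - off);
            }
        } finally {
            in.close();
        }
        int len = image.length;                       // assumed framing: length header
        serialOut.write(new byte[] {
                (byte) (len >>> 24), (byte) (len >>> 16),
                (byte) (len >>> 8),  (byte) len });
        serialOut.write(image);
        serialOut.flush();
    }
}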

    Topology:
    We will initially be implementing a 2-dimensional mesh network that may be toroidal.  Although a hypercube architecture would be more computationally efficient, with O(log(n)) depth and the ability to simulate tree and mesh architectures itself, the initial program space is local, so we do not need arbitrary communication between distant nodes.  One of the main reasons we are not implementing a hypercube routing network at this time is that it would require 1-3 of the processors on the 8-core chip for external and internal routing.  We would also only be implementing a hypercube of clusters of 4 cores, because a router node for each core would not be efficient.  The use of a router-less design gives us fine 1:1 granular control over the network.


    The cost of a hypercube network is also O(n log(n)) where a mesh network is O(n), and the processor count must fall on power-of-2 boundaries (I would therefore need to implement 64 or 128 chips, for example - not 80).
    Some statistics...
    The number of wires for a d-dimensional hypercube (counting two lines per link) follows the recurrence lines(d-cube) = 2 x lines((d-1)-cube) + 2^d, giving 384 lines for a 6-dimensional hypercube.  The number of lines for the mesh would be 64 x 24 = 1536 (these figures exclude the on-chip internal software connections); since each line is shared between two chips, we would actually have 64 x 12 = 768 distinct lines.  Therefore for small quantities of processing units the cost is actually cheaper for hypercubes.  But if we could somehow power up 1024 chips - which I think unlikely due to my inexact handling of power, capacitance, inductance and resistance factors - we would be requiring a 10-dimensional hypercube of 4-cog clusters.  The number of wires would be 49152 for 4096 cogs in a hypercube vs. 1024 x 24 = 24576 for an 8-core, 8192-cog mesh - so the mesh needs around 1/4 the lines per processing unit.
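    A quick sanity check of these figures, using only the recurrence and pin counts quoted above:

// Back-of-the-envelope wire counts: hypercube lines via the recurrence
// lines(d) = 2 * lines(d-1) + 2^d, mesh lines as chips * I/O pins per chip.
public class WireCount {

    static long hypercubeLines(int d) {
        long lines = 0;
        for (int i = 1; i <= d; i++) {
            lines = 2 * lines + (1L << i);   // 2 * lines(i-1) + 2^i
        }
        return lines;
    }

    static long meshLines(int chips, int pinsPerChip) {
        return (long) chips * pinsPerChip;
    }

    public static void main(String[] args) {
        System.out.println("6-cube:          " + hypercubeLines(6));     // 384
        System.out.println("64-chip mesh:    " + meshLines(64, 24));     // 1536
        System.out.println("4096-node cube:  " + hypercubeLines(12));    // 49152 (4096 nodes = 12 dimensions)
        System.out.println("1024-chip mesh:  " + meshLines(1024, 24));   // 24576
    }
}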

Design:

Software Modules (UML static diagram):


Software Simulation
    In this section we detail a software abstraction of our actual hardware implementation so we can verify the logic and design of the entire system while it is in use.
    The following UML class diagram shows a view of the simulation model implemented as a standard JEE6/JPA2 persistence unit.
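    As a rough illustration of what one entity in that persistence unit could look like, here is a minimal sketch; the class name, fields and neighbour-combining rule are assumptions for illustration, not the actual model.

// One candidate entity for the simulation model: a processing unit with its
// mesh position, state word and direct (router-less) neighbours.
import java.util.ArrayList;
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToMany;

@Entity
public class ProcessingUnit {

    @Id
    @GeneratedValue
    private Long id;

    // Position of this PU in the rectilinear mesh.
    private int gridX;
    private int gridY;

    // 32-bit state word latched on each simulated mesh clock.
    private long state;

    // Direct mesh neighbours (1:1 wiring, no routing nodes).
    @ManyToMany
    private List<ProcessingUnit> neighbours = new ArrayList<ProcessingUnit>();

    // Advance one simulated clock: combine the neighbours' outputs into the
    // next state (the XOR rule is a placeholder for the real SIMD kernel).
    public void step() {
        long next = state;
        for (ProcessingUnit n : neighbours) {
            next ^= n.state;
        }
        state = next;
    }

    public List<ProcessingUnit> getNeighbours() { return neighbours; }
    public long getState() { return state; }
    public void setState(long state) { this.state = state; }
}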


2 chip - 16 core breadboard initial wiring prototype (fritzing.org)

Implementation:
In-use Rectilinear Mesh (no routing)


Pinout Mesh Array chip:
                    +-----+--+-----+
    in0  N6 --> p0  |1    +--+   40| p31 <-- Host RX
    in1  N7 --> p1  |2           39| p30 N/C --> (1 chip TX)
    in2  NE --> p2  |3           38| p29 SDA --> N/C
    in3  E0 --> p3  |4           37| p28 SCL <-- N/C
    in4  E2 --> p4  |5           36| p27 --> DONE/LED
    in5  E4 --> p5  |6           35| p26 <-- C2 (DATA)
    in6  E6 --> p6  |7           34| p25 <-- C1
    in7  SE --> p7  |8           33| p24 <-- C0
                vss |9     BB    32| VDD
                boe |10  n-grid  31|  XO <-- N/C
    RES_bkst--> res |11   mesh   30|  XI <-- Host clk
                VDD |12    (8)   29| vss
    in8  S1 -->  p8 |13          28| p23 --> 165 Y7
    in9  S0 -->  p9 |14          27| p22 --> 165 Y6
    in10 SW --> p10 |15          26| p21 --> 165 Y5
    in11 W7 --> p11 |16          25| p20 --> 165 Y4
    in12 W5 --> p12 |17          24| p19 --> 165 Y3
    in13 W3 --> p13 |18          23| p18 --> 165 Y2
    in14 W1 --> p14 |19          22| p17 --> 165 Y1
    in15 NW --> p15 |20          21| p16 --> 165 Y0
                    +--------------+
Pinout Host chip:
                    +-----+--+-----+
       RES0 <-- p0  |1    +--+   40| p31 N/C --> (r)cog_in_all
  RDY_STATE <-- p1  |2           39| p30 N/C --> (r)cog_out_all
       165S <-- p2  |3           38| p29 SDA --> EEPROM
       165C <-- p3  |4           37| p28 SCL <-- EEPROM
       165D --> p4  |5           36| p27 --> c2
     595D_S <-- p5  |6           35| p26 --> c1
     595D_R <-- p6  |7           34| p25 --> c0
     595D_A <-- p7  |8           33| p24 
                vss |9     BB    32| VDD
                boe |10  n-grid  31|  XO
            --> res |11   host   30|  XI
                VDD |12    (8)   29| vss
            <-- p8  |13          28| p23 
            <-- p9  |14          27| p22 
            <-- p10 |15          26| p21 
            <-- p11 |16          25| p20 
 MESH_CLOCK <-- p12 |17          24| p19 
 MESH_RESET <-- p13 |18          23| p18 -->
    MESH_RX <-- p14 |19          22| p17 -->
    MESH_TX <-- p15 |20          21| p16 --> LED0
                    +--------------+
Deprecated 3-Hypercube (with routers)



Testing:

Simulation in Software using Java JEE:   Instead of using VHDL/Verilog we will simulate our SIMD devices in software using Java as the computing substrate along with JPA to persist our model and simulation runs.
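    A hedged sketch of the kind of timing harness that could produce figures like those below: the same entity-creation/traversal loop run with persistence optionally enabled. The persistence-unit name "gridSimulation", the command-line flag, and the use of the illustrative ProcessingUnit entity from the Design section are all assumptions.

// Time N iterations of entity creation/traversal, with or without JPA persistence,
// so the two cases can be compared directly.
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class SimulationBenchmark {

    public static void main(String[] args) {
        boolean persist = args.length > 0 && args[0].equals("--persist");
        EntityManagerFactory emf = persist
                ? Persistence.createEntityManagerFactory("gridSimulation")
                : null;

        int iterations = 65536;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            ProcessingUnit pu = new ProcessingUnit();   // in-memory entity creation
            pu.setState(i);
            pu.step();                                  // traversal/compute work
            if (persist) {
                EntityManager em = emf.createEntityManager();
                em.getTransaction().begin();
                em.persist(pu);
                em.getTransaction().commit();
                em.close();
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Total time: %f sec @ %f iter/sec%n",
                seconds, iterations / seconds);
    }
}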

Performance Results:
Without JPA persistence (in memory Entity creation/traversal only)
[  11   111  1            1111  1] iter: 65536 time: 12319 ns
Total time: 2.699895754 sec @ 24273.52978458738 iter/sec
With JPA persistence (Derby 10.5.3.0 on the same server)
[  11   111  1            1111  1] iter: 65536 time: 13232403 ns
Total time: 967.985705124 sec @ 67.703479145495 iter/sec
From these results we are able to remove the object instantiation overhead from the test and concentrate on persistence times.


References: