KiloCore - About Kilo core class custom processors or machines. Michael O'Brien<br />
<br />
2011-01-07: Alternate Multicore Architectures<br />
The following designs are possible alternatives or successors to my existing hypercube or array processor designs.<br />
<br />
D1: DataFlow Processor<br />
The data flows in a single direction in a pipeline or tree fashion where each core performs a single process on the dataset which is passed along to the next node(s).<br />
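As a software sketch only, the D1 idea can be modeled as a chain of single-operation stages, each standing in for one core; the stage operations below are hypothetical placeholders, not the actual grid software.<br />

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class DataFlowSketch {
    // Each stage models one core applying a single process to the dataset;
    // the output flows in one direction to the next node in the pipeline.
    static int[] run(int[] data, List<UnaryOperator<int[]>> stages) {
        for (UnaryOperator<int[]> stage : stages) {
            data = stage.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        int[] out = run(new int[]{1, 2, 3},
            List.of(
                d -> { for (int i = 0; i < d.length; i++) d[i] *= 2; return d; }, // core 1: scale
                d -> { for (int i = 0; i < d.length; i++) d[i] += 1; return d; }  // core 2: offset
            ));
        System.out.println(java.util.Arrays.toString(out)); // [3, 5, 7]
    }
}
```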
D2: Data Parallel Processor <br />
This architecture is fine-grained and usually assigns a single core to each data point for SIMD-oriented user software.<br />
<br />
2010-11-04: Kilocore SIMD Multiprocessor Array based on the Parallax Propeller 8-core microcontroller<br />
<strong>Purpose:</strong><br />
This blog will follow the construction of a prototype <a href="http://en.wikipedia.org/wiki/SIMD">SIMD</a> multiprocessor array of simple 32-bit processors arranged in a mesh architecture.<br />
Details about experiments leading up to this prototype were <a href="http://www.objectivej.com/hardware/propcluster/index.html">detailed here</a>.<br />
The initial hardware configuration will be on breadboards as I get the connection topology, modularization and grid/host software worked out.<br />
I will be using the 160 MIPS <a href="http://www.parallax.com/propeller/">Parallax Propeller P8X32</a> DIP (8-core/8-thread) microcontroller as the mesh PU.<br />
<br />
<br />
I may use the superior 1600 MIPS <a href="http://www.xmos.com/products/development-kits/xc-1a-development-kit">XMOS XS1-G4</a> (4-core / 32-thread), but XMOS does not currently ship a DIP version of their surface-mount chip like Parallax Inc. does - making prototypes difficult to implement. I could however use the G4 as the host bridge between the processor array and the PC, but not until I get a replacement for .binary loading of the mesh via the host - which the Propeller excels at.<br />
<br />
<strong>Requirements:</strong><br />
Our goal is to design a software/hardware combination that results in a grid/mesh of equal processing units (PU) controlled by a single host controller that is accessible from a host PC.<br />
Hardware modules are defined as follows...<br />
<ul><li>M1: Host PC</li>
<li>M2: Host Controller</li>
<li>M3: PU Grid</li>
<li>M4: Grid Monitor Display (optional)</li>
</ul> Software Modules are defined as follows...<br />
<ul><li>S1: Host PC Interface (<strong>Java</strong>)<br />
<u>S1.1:</u> Serial bidirectional connector (<strong>javax.comm</strong>)<br />
<u>S1.2:</u> HTTP unidirectional connector (<strong>java.net</strong>)<br />
<u>S1.3:</u> Persistence connector (<strong>org.eclipse.persistence.jpa</strong>)</li>
<li>S2: Host Controller (<strong>SPIN/Assembly</strong>)<br />
S2.1: Serial bidirectional connector<br />
S2.2: LED display driver<br />
S2.3: Grid Clock Generator (Assembly)<br />
S2.4: Grid Parallel Loader <br />
S2.5: SIPO Grid Input Register (74hc595 out)<br />
S2.6: PISO Grid Output Register (74hc165/597 in)</li>
<li>S3: Grid PU (<strong>SPIN/Assembly</strong>)</li>
</ul><strong>Constraints:</strong><br />
<br />
<ul><li><u>C1: Power:</u> Power consumption under 5 A - I am currently using a 15 W bench supply. (In the production prototype I may use a 500 W supply that has a 25 W 3.3 V rail, but I will need to load the 12 V and 5 V rails.)</li>
<li><u>C2: Grid Bootstrap:</u> SIMD bootstrap model for the PU (processing unit) grid, or 0..1 EEPROM in total.</li>
</ul><div><br />
</div><u>Analysis:</u><br />
<u> S1.1:</u> Serial bidirectional connector (<strong>javax.comm</strong>)<br />
In the past I developed direct port drivers for the PPT and SER port using Visual Studio 6; I could not get the Sun Comm API to work outside of Linux. However, I came across a page by Rick Proctor for the Lego RCX Brick at <a href="http://dn.codegear.com/article/31915">http://dn.codegear.com/article/31915</a> and at <a href="http://llk.media.mit.edu/projects/cricket/doc/serial.shtml">http://llk.media.mit.edu/projects/cricket/doc/serial.shtml</a>, which explain how to set up and implement the SerialPortEventListener interface.<br />
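The listener pattern boils down to: register for data-available events, then drain the input stream inside the callback. Below is a sketch of just that read loop, abstracted over a plain InputStream so it can run without javax.comm or hardware; the newline framing of grid responses is an assumption for illustration.<br />

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class GridSerialReader {
    // Accumulates bytes across calls, mimicking a DATA_AVAILABLE handler
    // that may receive a message in fragments.
    private final StringBuilder buffer = new StringBuilder();

    // Drains available bytes; returns a complete newline-terminated
    // message, or null if none has arrived yet (framing is assumed).
    public String onDataAvailable(InputStream in) throws IOException {
        int b;
        while (in.available() > 0 && (b = in.read()) != -1) {
            if (b == '\n') {
                String msg = buffer.toString();
                buffer.setLength(0);
                return msg;
            }
            buffer.append((char) b);
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        GridSerialReader r = new GridSerialReader();
        InputStream fake = new ByteArrayInputStream("GRID OK\n".getBytes());
        System.out.println(r.onDataAvailable(fake)); // GRID OK
    }
}
```

In the real connector the same loop would sit inside serialEvent(), with the stream obtained from the opened SerialPort.<br />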
<br />
<u> S2.4: Grid Parallel Loader: </u><br />
In the past I used the technique originally posted on the Parallax Propeller Forum by users [<a href="http://forums.parallaxinc.com/forums/default.aspx?f=25&m=301878&p=1">godzich/Christian, pems</a>] in 2008. This involved connecting up to 12 propellers to a single EEPROM and taking advantage of I2C bus mastering by having each previously loaded propeller reset the next one in serial sequence. Each chip requires 1.3 seconds to boot, and we are limited by parasitic capacitance to around 12 chips off a single EEPROM. I therefore started running into trouble with an 80-chip SIMD grid - where I required 10 EEPROMs for the entire grid - a programming headache.<br />
Use of the <a href="http://obex.parallax.com/objects/61/">PropellerLoader</a> by Chip Gracey was not really feasible without some elaborate tri-state bus mastering logic or the use of 160 pins to load all the chips in parallel. However, there was a recent post by <a href="http://forums.parallax.com/showthread.php?t=124343&page=2">[clock loop</a>] that expanded on Chip's loader by having the whole PU grid listen on the RX port while only one of the grid chips replies on the TX port. Essentially the host programs one of the grid chips, with the others acting as listeners and getting programmed in parallel, as long as we account for worst-case timing.<br />
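Using the 1.3-second per-chip boot figure above, the two bootstrap schemes can be compared with a trivial timing model; this is a simplification that ignores host-side overhead.<br />

```java
public class BootTimeModel {
    // Daisy-chain scheme: each chip boots only after its predecessor
    // resets it, so boot times add up across the whole grid.
    static double daisyChainSeconds(int chips, double bootSec) {
        return chips * bootSec;
    }

    // Parallel broadcast scheme: all listeners are programmed at once,
    // so the grid loads in roughly one worst-case boot interval.
    static double parallelSeconds(double bootSec) {
        return bootSec;
    }

    public static void main(String[] args) {
        // Total serial boot time for the 80-chip grid vs one broadcast load.
        System.out.println(daisyChainSeconds(80, 1.3));
        System.out.println(parallelSeconds(1.3));
    }
}
```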
This latest approach to bootstrapping - using the grid controller to load all the grid PU chips - requires that the SIMD grid SPIN/Assembly code be written to a bytecode <strong>.binary</strong> file using the PT IDE command [Run | Compile Current | View Info (F8) | Save Binary File].<br />
The issue of determining whether the entire grid was loaded successfully is still solved by having the chips respond to the host using the PISO output grid register - which is read by the host after grid programming.<br />
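The host-side status check reduces to plain bit logic: one DONE bit per chip is shifted in through the PISO register, and any chip whose bit is low failed the load. The bit ordering and word width here are assumptions for illustration.<br />

```java
import java.util.ArrayList;
import java.util.List;

public class GridStatusCheck {
    // Returns indices of chips whose DONE bit (bit i of the shifted-in
    // status word) is 0, i.e. chips that failed to acknowledge the load.
    static List<Integer> failedChips(long statusWord, int chipCount) {
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < chipCount; i++) {
            if (((statusWord >> i) & 1L) == 0) {
                failed.add(i);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // 8-chip example: chip 3 never raised its DONE line.
        long status = 0b11110111;
        System.out.println(failedChips(status, 8)); // [3]
    }
}
```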
<br />
<u> Topology:</u><br />
We will initially be implementing a 2-dimensional mesh network that may be toroidal. Although a hypercube architecture would be more computationally efficient, with <strong>O(log(n))</strong> depth and the ability to simulate tree and mesh architectures itself, the initial program space is local, so we do not need arbitrary communication between distant nodes. One of the main reasons we are not implementing a hypercube routing network at this time is that it would require 1-3 of the processors on the 8-core chip for external and internal routing. We would also only be implementing a hypercube of clusters of 4 cores, because a router node for each core would not be efficient. The router-less design gives us fine-grained 1:1 control over the network.<br />
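For the router-less toroidal mesh, neighbor addressing is simple modular arithmetic on each chip's (row, col) position; a minimal sketch, assuming an 8 x 8 grid of 64 chips:<br />

```java
public class ToroidalMesh {
    final int rows, cols;

    ToroidalMesh(int rows, int cols) { this.rows = rows; this.cols = cols; }

    // North/South/East/West neighbors of a node id, wrapping at the
    // edges - the wrap-around is what makes the mesh toroidal.
    int[] neighbors(int id) {
        int r = id / cols, c = id % cols;
        return new int[] {
            ((r + rows - 1) % rows) * cols + c,  // N
            ((r + 1) % rows) * cols + c,         // S
            r * cols + (c + 1) % cols,           // E
            r * cols + (c + cols - 1) % cols     // W
        };
    }

    public static void main(String[] args) {
        ToroidalMesh mesh = new ToroidalMesh(8, 8);
        int[] n = mesh.neighbors(0); // corner node wraps to the far edges
        System.out.println(java.util.Arrays.toString(n)); // [56, 8, 1, 7]
    }
}
```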
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXfAySxquQSGMLsruS_pxq9k3jyU81dl6BjMED1hypCvOUwXcKoTD8ZoAfLToXkVOms_Tc8D7K65QVyJdoxYnKPbbf7Ts7Ik4C0rfKjuBN_L66V7vkk-1z-VdQZTag7chluHJpVWqem08/s1600/propCAS_16core_module_ext_connect_block_v20100907.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXfAySxquQSGMLsruS_pxq9k3jyU81dl6BjMED1hypCvOUwXcKoTD8ZoAfLToXkVOms_Tc8D7K65QVyJdoxYnKPbbf7Ts7Ik4C0rfKjuBN_L66V7vkk-1z-VdQZTag7chluHJpVWqem08/s400/propCAS_16core_module_ext_connect_block_v20100907.jpg" width="352" /></a></div><br />
The cost of a hypercube network is also <strong>O(n log(n))</strong> where a mesh network is <strong>O(n)</strong>, and the processor count must fall on power-of-2 boundaries (I would therefore need to implement 64 or 128 chips, for example - not 80). <br />
Some statistics...<br />
The number of lines for a 6-dimensional hypercube would be <em>lines(d) = 2 · lines(d-1) + 2^d</em> (with <em>lines(1) = 2</em>, counting each link as a bidirectional pair) = 384 lines, whereas the number of lines for the mesh would be 64 x 24 = 1536 lines - (<strong><em>these counts exclude the on-chip internal software connections; counting each shared mesh line only once gives 64 x 12 = 768</em></strong>). Therefore for small quantities of processing units the wiring cost is actually cheaper for hypercubes. But if we could somehow power up 1024 chips - which I think unlikely due to my inexact treatment of power, capacitance, inductance and resistance factors - we would require a 10-dimensional hypercube of 4-cog clusters. The number of wires would be <em>49152</em> for 4096 cogs in a hypercube vs. 1024 x 24 = 24576 for an 8-core, 8192-cog mesh - so the mesh uses around 1/4 the lines per processing unit.<br />
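The recurrence can be checked directly: <em>lines(d) = 2 · lines(d-1) + 2^d</em> with <em>lines(1) = 2</em> gives 384 lines for the 6-cube, against the mesh's full pin budget per chip:<br />

```java
public class WireCount {
    // Hypercube line count: a d-cube is two (d-1)-cubes plus 2^(d-1)
    // connecting links, counted here as bidirectional line pairs.
    static int hypercubeLines(int d) {
        return d == 1 ? 2 : 2 * hypercubeLines(d - 1) + (1 << d);
    }

    // Mesh line count: every chip contributes its full pin budget
    // (16 inputs + 8 outputs = 24 pins in the pinout below).
    static int meshLines(int chips, int pinsPerChip) {
        return chips * pinsPerChip;
    }

    public static void main(String[] args) {
        System.out.println(hypercubeLines(6));   // 384
        System.out.println(meshLines(64, 24));   // 1536
        System.out.println(hypercubeLines(12));  // 49152
        System.out.println(meshLines(1024, 24)); // 24576
    }
}
```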
<br />
<strong>Design:</strong><br />
<br />
<u>Software Modules (UML static diagram):</u><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXWSAhUbCpb7F0IAKpuKzwE39q9tVdUYt46L030EPYEpiUcFMeNeVuL3oZA1w2BbLVPe82KVsDt7qkpxrxFUJf0weRXenY71dj7PudwmhARUKyyVMyYIpnoC8qoHWocXhERs90-Kw3-pM/s1600/pac_uml_v20101111.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXWSAhUbCpb7F0IAKpuKzwE39q9tVdUYt46L030EPYEpiUcFMeNeVuL3oZA1w2BbLVPe82KVsDt7qkpxrxFUJf0weRXenY71dj7PudwmhARUKyyVMyYIpnoC8qoHWocXhERs90-Kw3-pM/s640/pac_uml_v20101111.jpg" width="459" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><br />
<u>Software Simulation</u><br />
In this section we detail a software abstraction of our actual hardware implementation so we can verify the logic and design of the entire system while it is in use.<br />
The following UML class diagram shows a view of the simulation model implemented as a standard JEE6/JPA2 persistence unit.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB2t4LZiBN55wifKntqdVOYZof2RCOKsgIQyFo4kGArWxWrqKCiMi7GgYxQFANzLa0B86xm0KK5jYisT7bifb1yZBoW3YlPEYmzKPkFfHpbGEv0LvCTwZF057nUAkyfsxIvIKyP1XPhu4/s1600/dataparallel_uml_model_v20101116.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB2t4LZiBN55wifKntqdVOYZof2RCOKsgIQyFo4kGArWxWrqKCiMi7GgYxQFANzLa0B86xm0KK5jYisT7bifb1yZBoW3YlPEYmzKPkFfHpbGEv0LvCTwZF057nUAkyfsxIvIKyP1XPhu4/s640/dataparallel_uml_model_v20101116.jpg" width="480" /></a></div><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitJlR_1TuKlYxUmqpeCQ5DLBouLhmym8g0b_HIE2-YuA73JGTMY8BMlXiLZ38EEnYVpQUT98jHFJKPwGDvC18q7lbt7w5mGzFZTwEfJc2GTn8GaK065pMbFB77OUSBst-N1tWiwnswGO0/s1600/pcas_24chip_192core_prototype_20100720.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitJlR_1TuKlYxUmqpeCQ5DLBouLhmym8g0b_HIE2-YuA73JGTMY8BMlXiLZ38EEnYVpQUT98jHFJKPwGDvC18q7lbt7w5mGzFZTwEfJc2GTn8GaK065pMbFB77OUSBst-N1tWiwnswGO0/s400/pcas_24chip_192core_prototype_20100720.JPG" width="353" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2eFGjacTcxBkVHihYusy2nmnU2qNBLN2zLbn80-G82qeIBjzmMgBCL1ThpM6-L-UvP7e2F4RTTqlNY3SS1cwOtDePMvWpZtN_b2D5HT2nBj-h8VixxGE0LJ4tyYJqEPzQZufIgEwlhCc/s1600/prop8_mesh_v01_bb.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2eFGjacTcxBkVHihYusy2nmnU2qNBLN2zLbn80-G82qeIBjzmMgBCL1ThpM6-L-UvP7e2F4RTTqlNY3SS1cwOtDePMvWpZtN_b2D5HT2nBj-h8VixxGE0LJ4tyYJqEPzQZufIgEwlhCc/s640/prop8_mesh_v01_bb.jpg" width="640" /></a></div>2 chip - 16 core breadboard initial wiring prototype (fritzing.org)<br />
<br />
<strong>Implementation:</strong><br />
In-use Rectilinear Mesh (no routing)<br />
<br />
<br />
Pinout Mesh Array chip:<br />
<pre> +-----+--+-----+
in0 N6 --> p0 |1 +--+ 40| p31 <-- Host RX
in1 N7 --> p1 |2 39| p30 N/C --> (1 chip TX)
in2 NE --> p2 |3 38| p29 SDA --> N/C
in3 E0 --> p3 |4 37| p28 SCL <-- N/C
in4 E2 --> p4 |5 36| p27 --> DONE/LED
in5 E4 --> p5 |6 35| p26 <-- C2 (DATA)
in6 E6 --> p6 |7 34| p25 <-- C1
in7 SE --> p7 |8 33| p24 <-- C0
vss |9 BB 32| VDD
boe |10 n-grid 31| XO <-- N/C
RES_bkst--> res |11 mesh 30| XI <-- Host clk
VDD |12 (8) 29| vss
in8 S1 --> p8 |13 28| p23 --> 165 Y7
in9 S0 --> p9 |14 27| p22 --> 165 Y6
in10 SW --> p10 |15 26| p21 --> 165 Y5
in11 W7 --> p11 |16 25| p20 --> 165 Y4
in12 W5 --> p12 |17 24| p19 --> 165 Y3
in13 W3 --> p13 |18 23| p18 --> 165 Y2
in14 W1 --> p14 |19 22| p17 --> 165 Y1
in15 NW --> p15 |20 21| p16 --> 165 Y0
+--------------+
</pre>Pinout Host chip:<br />
<pre> +-----+--+-----+
RES0 <-- p0 |1 +--+ 40| p31 N/C --> (r)cog_in_all
RDY_STATE <-- p1 |2 39| p30 N/C --> (r)cog_out_all
165S <-- p2 |3 38| p29 SDA --> EEPROM
165C <-- p3 |4 37| p28 SCL <-- EEPROM
165D --> p4 |5 36| p27 --> c2
595D_S <-- p5 |6 35| p26 --> c1
595D_R <-- p6 |7 34| p25 --> c0
595D_A <-- p7 |8 33| p24
vss |9 BB 32| VDD
boe |10 n-grid 31| XO
--> res |11 host 30| XI
VDD |12 (8) 29| vss
<-- p8 |13 28| p23
<-- p9 |14 27| p22
<-- p10 |15 26| p21
<-- p11 |16 25| p20
MESH_CLOCK <-- p12 |17 24| p19
MESH_RESET <-- p13 |18 23| p18 -->
MESH_RX <-- p14 |19 22| p17 -->
MESH_TX <-- p15 |20 21| p16 --> LED0
+--------------+
</pre>Deprecated 3-Hypercube (with routers)<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2z7Eo21dxc10E6pfxZW4n1Gv-QCm1XQUULIf-7lXW_f6FF7suCxN32asmU4nfKnivUKgzFF05xwg8mJ0h1P_QcPIiz1S-QEhlidbfz8j3DqMMRhCi_CHItUqkz8Ve_ZQHCdItoE7nrfU/s1600/prop_3hypercube.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="563" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2z7Eo21dxc10E6pfxZW4n1Gv-QCm1XQUULIf-7lXW_f6FF7suCxN32asmU4nfKnivUKgzFF05xwg8mJ0h1P_QcPIiz1S-QEhlidbfz8j3DqMMRhCi_CHItUqkz8Ve_ZQHCdItoE7nrfU/s640/prop_3hypercube.jpg" width="640" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpILQXSWouBFedtDvdKaarBOHcrj0WCEmGDVAePz8rMGSO12noLfjy23tCe70sjXorqGsNvj8d0wzn20-DSV-fu2IC69al5Ux5QIY-_5QlTx50pRemyCLYzwz9aPUl169p8qeh0Y-A14U/s1600/IMG_7978_propCAS_64cog_proto_20100907c.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" px="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpILQXSWouBFedtDvdKaarBOHcrj0WCEmGDVAePz8rMGSO12noLfjy23tCe70sjXorqGsNvj8d0wzn20-DSV-fu2IC69al5Ux5QIY-_5QlTx50pRemyCLYzwz9aPUl169p8qeh0Y-A14U/s400/IMG_7978_propCAS_64cog_proto_20100907c.JPG" width="300" /></a></div><br />
<br />
<strong>Testing:</strong><br />
<br />
<strong>Simulation in Software using Java JEE:</strong> Instead of using VHDL/Verilog we will simulate our SIMD devices in software using Java as the computing substrate along with JPA to persist our model and simulation runs.<br />
<br />
<u>Performance Results:</u><br />
Without JPA persistence (in memory Entity creation/traversal only)<br />
[ 11 111 1 1111 1] iter: 65536 time: 12319 ns<br />
Total time: 2.699895754 sec @ 24273.52978458738 iter/sec<br />
With JPA persistence (Derby 10.5.3.0 on the same server)<br />
[ 11 111 1 1111 1] iter: 65536 time: 13232403 ns<br />
Total time: 967.985705124 sec @ 67.703479145495 iter/sec<br />
From these results we are able to remove the object instantiation overhead from the test and concentrate on persistence times.<br />
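The throughput figures follow directly from iterations / total time; a quick cross-check of the two runs, with the numbers copied from the results above:<br />

```java
public class ThroughputCheck {
    static double iterPerSec(long iterations, double totalSeconds) {
        return iterations / totalSeconds;
    }

    public static void main(String[] args) {
        // In-memory run: 65536 iterations in 2.699895754 s
        System.out.printf("%.1f iter/sec%n", iterPerSec(65536, 2.699895754));
        // JPA/Derby run: 65536 iterations in 967.985705124 s
        System.out.printf("%.1f iter/sec%n", iterPerSec(65536, 967.985705124));
    }
}
```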
<br />
<br />
<strong>References:</strong>