Monday, 31 March 2014

Reading Image File using HDL

Storing an image file on an FPGA involves two main steps:

  1. Converting the image file to the coefficient file format (COE format)
  2. Storing the COE image file in BRAM (Block RAM IP core)








1. Converting image to COE format:

In this tutorial we will convert a JPEG image to COE format using MATLAB code. The same code can also be used for other file formats such as BMP, GIF, and TIF. Place the image to be converted in the working directory and run the program given below. When run, the program asks for the location of the image to be converted, and the converted image file is then stored in the same working directory. Copy this .COE file and do not alter it. (Make sure that the image file is of size 240x160.)
















































This format is a raw image format which stores the pixel values in a one-dimensional array. The first two lines of the file form a header: the first line indicates the radix of the data (binary, decimal or hex) and the second line marks the beginning of the pixel values.


2. Storing COE file in BRAM IP core

Create a new project in Xilinx ISE, and in the New Source wizard click on IP (CORE Generator & Architecture Wizard). Name this file "mymemory" and click Next.


Select Block Memory Generator v2.8 in the RAMs & ROMs submenu under Memories & Storage Elements. Select Next, then Finish.



The .coe file produced by the conversion step will contain 240×160 = 38,400 bytes. Following the procedure described, we can use the Core Generator to create a read-only block memory component "mymemory" that is 8 bits wide with a depth of 38400, which is enough to hold the 240x160 image.
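To make the 8 x 38400 organisation concrete, here is a minimal behavioural sketch of such a ROM. It is only a simulation stand-in, not the generated "mymemory" core: the module name, the port names, the file "image.hex" (one 8-bit hex pixel per line) and the row-major pixel order are all assumptions made for illustration.

```verilog
// Behavioural stand-in for an 8-bit wide, 38400-deep image ROM.
// Assumptions (not from the generated core): port names, the file
// "image.hex" and row-major pixel ordering.
module image_rom #(
    parameter WIDTH  = 240,
    parameter HEIGHT = 160
)(
    input  wire       clk,
    input  wire [7:0] col,    // 0 .. WIDTH-1
    input  wire [7:0] row,    // 0 .. HEIGHT-1
    output reg  [7:0] pixel   // registered read, like a block RAM
);
    localparam DEPTH = WIDTH * HEIGHT;   // 240 * 160 = 38400

    reg [7:0] mem [0:DEPTH-1];

    // Load one hex byte per line from a hypothetical file
    initial $readmemh("image.hex", mem);

    // 16 address bits are enough because 38400 < 65536
    wire [15:0] addr = row * WIDTH + col;

    always @(posedge clk)
        pixel <= mem[addr];
endmodule
```

Because the read is registered, the pixel appears one clock after the row and column are applied.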





















Click on Load Init File and click on Browse. Place the generated COE file in the location the browser points to, which is the "ipcore_dir" folder of your project.





















Once the file is loaded, you have the option to view the contents of your COE file by clicking on "Show". Verify that the width and depth of your image file match the values you provided for the IP core. Once verified, click Next and then Generate. It might take some time for the BRAM IP to be generated, after which it will appear as a module in your project directory. After the BRAM IP and the COE file have been created successfully, your image is ready to be operated on. You can apply all kinds of image processing algorithms such as smoothing, edge detection, noise reduction, etc. Try all these steps and check if you got it right. Queries are welcome. Thank you.





Sunday, 23 March 2014

Frequency dividing circuit with minimum hardware





       

Most FPGA boards these days come with high-frequency oscillators on the order of 50 MHz/100 MHz, while the circuits we have to drive from the FPGA work at lower clock rates. Clock division is therefore needed for such applications. Though Xilinx provides DCM IP cores for clock division, they are board specific and appear as a black box. With the counters explained below you can customize your code and synthesize it as per your requirements, independent of the board you use.


Divide by 2 counter :


It is one of the simplest and most used counters, and requires only one D flip-flop. We first design a D-FF and connect the Q-bar output to the D input to get division by 2. Waveform-based simulation can be used to verify your design. You can notice that the o/p has a 50% duty cycle too.
Given below are the code and the RTL schematic created using the PlanAhead tool. You can verify the code against the RTL.
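A minimal sketch of the divide by 2 idea, assuming a synchronous active-high reset; the module and port names are illustrative and may differ from the original listing:

```verilog
// Positive-edge D flip-flop with synchronous, active-high reset
module dff (
    input  wire clk,
    input  wire rst,
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        if (rst) q <= 1'b0;
        else     q <= d;
    end
endmodule

// Divide by 2: feed the inverted output (Q-bar) back to D,
// so the output toggles once per input clock period.
module div2 (
    input  wire clk,
    input  wire rst,
    output wire clk_div2
);
    dff u_dff (.clk(clk), .rst(rst), .d(~clk_div2), .q(clk_div2));
endmodule
```

The output changes on every rising edge of clk, so it stays high for one full input period and low for the next, which is where the 50% duty cycle comes from.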





RTL/CODE:















Simulation results:








Divide by 3 counter :


For the divide by 3 counter we have to design two modules:
1. mod 3 counter
2. D-FF

The D-FF code has to be modified in this case, as the FF has to capture D on the negative edge of the clock.

With a slight modification to the D-FF code we can create a negative-edge D-FF. We then use an OR gate to get the final o/p. The RTL gives the clear design for the divide by 3 counter.
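One common way to put these pieces together is sketched below, written behaviourally in a single module instead of separate counter and D-FF modules. The names and the synchronous reset are assumptions, so treat it as an illustration and not as the original code.

```verilog
// Divide by 3 with a 50% duty cycle:
// mod 3 counter + negative-edge flip-flop + OR gate.
module div3 (
    input  wire clk,
    input  wire rst,
    output wire clk_div3
);
    reg [1:0] count;   // mod 3 counter: 0, 1, 2, 0, ...
    reg       q_neg;   // count[1] delayed by half a clock period

    // mod 3 counter on the rising edge
    always @(posedge clk) begin
        if (rst)                count <= 2'd0;
        else if (count == 2'd2) count <= 2'd0;
        else                    count <= count + 2'd1;
    end

    // negative-edge D-FF samples the counter MSB
    always @(negedge clk) begin
        if (rst) q_neg <= 1'b0;
        else     q_neg <= count[1];
    end

    // count[1] is high for 1 clock out of 3; OR-ing it with the
    // half-cycle delayed copy stretches the pulse to 1.5 clocks.
    assign clk_div3 = count[1] | q_neg;
endmodule
```

The output is high for 1.5 input periods out of every 3, which gives the 50% duty cycle.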





RTL/CODE:














simulation results :









Divide by 4 counter :


If you notice, the design for divide by 4 is similar to the divide by 2: here we have two divide by 2 counters in cascade. This makes our job easy, as we don't have to write new code for the div4 counter. All we have to do is instantiate the div2 counter twice and connect the two instances in cascade as shown. There is no need for any other logic gates as in the case of the div3 counter. Refer to the code and RTL for the div4 counter shown below. Individual resets are used for the D-FFs and they must be applied in order. First the reset of D-FF X1 (rst1) must be driven high and then low. Next, the reset of D-FF X2 must be driven high and then low.
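A minimal sketch of the cascade, assuming each stage has a synchronous active-high reset. The names div2_stage, rst2 and clk_div4 are illustrative; x1, x2 and rst1 follow the text.

```verilog
// One divide by 2 stage: toggles on every rising clock edge
module div2_stage (
    input  wire clk,
    input  wire rst,
    output reg  q
);
    always @(posedge clk) begin
        if (rst) q <= 1'b0;
        else     q <= ~q;
    end
endmodule

// Divide by 4: two divide by 2 stages in cascade,
// the second stage is clocked by the output of the first.
module div4 (
    input  wire clk,
    input  wire rst1,      // reset for stage X1
    input  wire rst2,      // reset for stage X2
    output wire clk_div4
);
    wire clk_div2;

    div2_stage x1 (.clk(clk),      .rst(rst1), .q(clk_div2));
    div2_stage x2 (.clk(clk_div2), .rst(rst2), .q(clk_div4));
endmodule
```

With synchronous resets the order matters: X2 only sees a clock edge once X1 has been reset and released, so X1 has to be running before rst2 can take effect.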





RTL/CODE:




simulation results :








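Once the design for divide by 4 is complete, you can extend the same concept to larger divisions such as 8 and 12. The only modification required is the number of D-FF stages. For a divide by 8 counter you need 3 D-FFs in cascade, since each cascaded divide by 2 stage doubles the division ratio (2, 4, 8, 16, ...). A divide by 12 counter needs 4 D-FFs, but a plain cascade of four divide by 2 stages would divide by 16, so the divide by 12 design has to include a divide by 3 stage (for example a mod 3 counter followed by two divide by 2 stages). Below are the designs for the 8 and 12 counters. Try to design them with the same code and instantiate the modules accordingly.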























Divide by 6 counter :

It is not very complex: just 3 D-FFs in cascade give us a divide by 6 counter. The coding is simpler than for the divide by 3 counter. Try to code it yourself; if you get stuck, you can refer to my code below with the RTL schematic.
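One way to get a divide by 6 from three cascaded D-FFs is a Johnson (twisted ring) counter, where the inverted output of the last FF is fed back to the first. This is a sketch of that approach under assumed names and a synchronous reset, and it may differ from the original RTL.

```verilog
// Divide by 6 using a 3-stage Johnson counter.
// State sequence: 000 -> 001 -> 011 -> 111 -> 110 -> 100 -> 000
module div6 (
    input  wire clk,
    input  wire rst,
    output wire clk_div6
);
    reg [2:0] q;

    always @(posedge clk) begin
        if (rst) q <= 3'b000;
        else     q <= {q[1:0], ~q[2]};  // shift left, inverted feedback
    end

    // q[2] is high for 3 of the 6 states, so the duty cycle is 50%
    assign clk_div6 = q[2];
endmodule
```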












RTL/CODE:



















Simulation results:





We will add the remaining counters as and when the designs are ready. Thank you. Keep reading.

Tuesday, 28 January 2014

Full Adder using generate statement

                 








The generate statement in Verilog comes in handy when we have to instantiate a sub-circuit multiple times. Consider, for example, a 32 bit adder built from 32 full adder circuits: we would have to instantiate the FA module 32 times to design it. This is where the "generate" statement is of great help. Here we will design a 32 bit adder using full adders and use the generate statement to instantiate this module 32 times as per our design.
                    The loop index used inside a generate loop is declared with the "genvar" keyword. It exists only while the design is elaborated and is not a signal in the hardware.

Given below is the code for the 32 bit adder using the generate statement:
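A minimal sketch of such an adder, assuming a ripple-carry arrangement; the module, port and block names are illustrative:

```verilog
// 1 bit full adder
module full_adder (
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire sum,
    output wire cout
);
    assign sum  = a ^ b ^ cin;
    assign cout = (a & b) | (b & cin) | (a & cin);
endmodule

// 32 bit ripple-carry adder built with a generate loop
module adder32 (
    input  wire [31:0] a,
    input  wire [31:0] b,
    input  wire        cin,
    output wire [31:0] sum,
    output wire        cout
);
    wire [32:0] carry;          // carry[i] is the carry into bit i
    assign carry[0] = cin;
    assign cout     = carry[32];

    genvar i;                   // loop index, exists only at elaboration
    generate
        for (i = 0; i < 32; i = i + 1) begin : fa_stage
            full_adder fa (
                .a   (a[i]),
                .b   (b[i]),
                .cin (carry[i]),
                .sum (sum[i]),
                .cout(carry[i+1])
            );
        end
    endgenerate
endmodule
```

The named block fa_stage gives each instance a hierarchical name (fa_stage[0].fa to fa_stage[31].fa), which makes the 32 instances easy to find in the elaborated RTL.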


Code:




Test fixture:


Simulation results:












RTL




                         















Once you have simulated and synthesized the code, elaborate the RTL by double-clicking on the FA module, and you will notice that exactly 32 FA modules have been generated.


Monday, 27 January 2014

Design and implementation of 16 Bit Vedic Arithmetic Unit

        Hello guys, I have recently worked on Vedic multipliers and have referred to a few papers to implement them. I want to make this project open to everyone so that you can build your own Vedic multipliers and compare the results. Previously I have written about 2x2 bit Vedic multipliers, which you can refer back to. We will start by designing a 2x2 multiplier and work up to a 16x16 multiplier. Once we are done with this, we will proceed to build a MAC unit. A complete module which has the 16x16 multiplier/MAC/ADD-SUB will be our end design.


2X2 multiplier:
Design:
The figure illustrates the steps to multiply two 2 bit numbers (design detail). Converting the above figure to its hardware equivalent, we have 4 AND gates which act as 1 bit multipliers and two half adders which add the partial products to get the final product. Here is the hardware detail of the multiplier:


 Here "a" and "b" are the two numbers to be multiplied and "q" is the product. With this design we are now ready to code this in Verilog using AND gates and HAs (half adders). To make the design more modular, we write the code for the HA first and then instantiate it to get the final product.
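A minimal sketch of the 2x2 multiplier described above, using "a", "b" and "q" from the text; the module and instance names are illustrative:

```verilog
// Half adder
module half_adder (
    input  wire a,
    input  wire b,
    output wire sum,
    output wire carry
);
    assign sum   = a ^ b;
    assign carry = a & b;
endmodule

// 2x2 Vedic multiplier: four AND gates and two half adders
module vedic_2x2 (
    input  wire [1:0] a,
    input  wire [1:0] b,
    output wire [3:0] q
);
    wire c1;   // carry out of the first half adder

    // vertical product of the LSBs
    assign q[0] = a[0] & b[0];

    // crosswise products added together
    half_adder ha1 (.a(a[1] & b[0]), .b(a[0] & b[1]), .sum(q[1]), .carry(c1));

    // vertical product of the MSBs plus the carry
    half_adder ha2 (.a(a[1] & b[1]), .b(c1), .sum(q[2]), .carry(q[3]));
endmodule
```

For example, a = 3 and b = 3 give q[0] = 1, q[1] = 0 with c1 = 1, q[2] = 0 and q[3] = 1, i.e. q = 1001 (9).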








Code:


4X4 multiplier:
Design:
Using 4 such 2x2 multipliers and 3 adders we can build a 4x4 bit multiplier as shown in the design. It only requires proper instantiation of the 2x2 multipliers and adders. We first have to write code for the 4 bit and 6 bit adders. The choice of adders is yours. If you want better performance, you can replace these normal adders with CSAs or compressors. For a simpler design we have used the "+" operator, which is supported by the XST synthesis tool and by default selects a low-hardware adder. This architecture follows a Wallace tree arrangement, which reduces the addition levels from 3 to just 2 stages as shown. The arrangement of the adders and the addition is explained in the figure shown below:
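A minimal sketch of this arrangement, using the "+" operator for the one 4 bit and two 6 bit additions. The signal names and the compact 2x2 block are illustrative and may differ from the original listing.

```verilog
// Compact 2x2 multiplier (four ANDs, two half adders written as equations)
module vedic_2x2 (
    input  wire [1:0] a,
    input  wire [1:0] b,
    output wire [3:0] q
);
    wire p00 = a[0] & b[0];
    wire p10 = a[1] & b[0];
    wire p01 = a[0] & b[1];
    wire p11 = a[1] & b[1];
    wire c1  = p10 & p01;
    assign q = {p11 & c1, p11 ^ c1, p10 ^ p01, p00};
endmodule

// 4x4 multiplier from four 2x2 multipliers and three adders
module vedic_4x4 (
    input  wire [3:0] a,
    input  wire [3:0] b,
    output wire [7:0] p
);
    wire [3:0] q0, q1, q2, q3;

    vedic_2x2 m0 (.a(a[1:0]), .b(b[1:0]), .q(q0));  // low  x low
    vedic_2x2 m1 (.a(a[3:2]), .b(b[1:0]), .q(q1));  // high x low
    vedic_2x2 m2 (.a(a[1:0]), .b(b[3:2]), .q(q2));  // low  x high
    vedic_2x2 m3 (.a(a[3:2]), .b(b[3:2]), .q(q3));  // high x high

    // Stage 1: two additions that can run in parallel
    wire [3:0] s1 = q1 + {2'b00, q0[3:2]};          // 4 bit adder
    wire [5:0] s2 = {2'b00, q2} + {q3, 2'b00};      // 6 bit adder

    // Stage 2: final addition
    wire [5:0] s3 = {2'b00, s1} + s2;               // 6 bit adder

    // The two LSBs of q0 pass straight through
    assign p = {s3, q0[1:0]};
endmodule
```

As a check, a = b = 15 gives q0 = q1 = q2 = q3 = 9, so s1 = 11, s2 = 45, s3 = 56 and p = 56 x 4 + 1 = 225.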

Code:




8X8 multiplier:
Design:
Similar to the previous design of the 4x4 multiplier, we need 4 such 4x4 multipliers to develop an 8x8 multiplier. Here we first need to design 8 bit and 12 bit adders; with proper instantiation of the modules and connections as shown in the figure, we get an 8x8 bit multiplier. At this point it is necessary to verify the RTL and check that the hardware is as per your design. The PlanAhead tool by Xilinx gives a better view of the hardware design with its design elaborate option (I will explain this in my next posts). Refer to the addition tree diagram to understand the process for the 8x8 multiplier:


Code:


16x16 multiplier:
Design:


Follow the same steps as for the previous multipliers to develop the 16x16 multiplier. Refer to the adder tree diagram below:

code:



Here is the test bench for 16x16 vedic multipliers :



Simulation results:





RTL:





















MAC design:

To begin with the MAC design, we first have to design an accumulator which adds two numbers: one is the output of the previous stage and the other is the output from the multiplier module. The figure below shows the implementation design for the MAC.













It can be seen from the block diagram that the accumulator module has one data input (we have designed this module to be synchronous, so the clock is the second input). A few more control signals are required: a clear signal (clr) to clear the ACC unit and an enable signal (en) to initiate the process of accumulation. We replace the MUL unit shown in the diagram above with our 16x16 multiplier module. Here is the code for the MAC unit:
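A minimal sketch of such a MAC unit, using the clr and en signals from the text. The "*" operator is only a stand-in for the 16x16 Vedic multiplier instance, and the 32 bit accumulator width and the other names are assumptions.

```verilog
// Multiply-accumulate sketch: acc <= acc + a * b
module mac16 (
    input  wire        clk,
    input  wire        clr,   // synchronous clear of the accumulator
    input  wire        en,    // accumulate when high
    input  wire [15:0] a,
    input  wire [15:0] b,
    output reg  [31:0] acc
);
    // Stand-in for the 16x16 Vedic multiplier module;
    // replace with your own multiplier instance.
    wire [31:0] product = a * b;

    always @(posedge clk) begin
        if (clr)
            acc <= 32'd0;
        else if (en)
            acc <= acc + product;   // wraps around on overflow
    end
endmodule
```

If many large products have to be accumulated without wrapping, the accumulator can be made wider than 32 bits.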

code:


Simulation results :


This module can be developed further into an ALU by designing your own adder/subtractor and making the ALU the top module. Please let me know if there is anything you did not understand. We are happy to help. Thank you.










Note: Replace the modules named "add_N_bit" with an N bit adder. You can use your own adder in place of this module, such as a CSA or CLA. If speed is not a major concern for your design, use the "+" operator to create the adder modules.

RTL:




















Reference Papers:


Paper 1

Paper 2


contact: verilogblog@gmail.com

Wednesday, 8 January 2014

Abstractions in VLSI

       Abstraction in VLSI can be defined as the amount of information an entity hides within it. Consider a simple analogy with the universe. We can say that the universe is huge (more hidden information), that it has planets, one of which is our Earth (some details are revealed), and that Earth is the 5th largest planet (more details)... I hope you get the logic. The more detailed a description is, the less information it hides, and hence we say that it is at a lower level of abstraction. The more information it hides, the higher the abstraction level. Let us get from planets to VLSI now :-) Consider your computer, for example (system level): it is at a high level of abstraction. At the next level we have boards (motherboard, CD drive, disks, etc.), and next we have the chip level (CPU etc.). Inside those chips (now we are talking about VLSI abstraction) we have four levels of abstraction, namely:


  1. Register level
  2. Gate level(logic gates, mux, decoder etc)
  3. circuit level (transistor)
  4. layout level (geometry )
The figure shows the abstraction levels in decreasing order:


Designs can be expressed / viewed in one of three possible domains
  1. Behavioral Domain (Behavioral  View- using boolean expressions )
  2. Structural/Component Domain   (Structural  View-connection of modules).
  3. Physical Domain     (Physical  View-layout).
              A design modeled in a given domain can be represented at several levels of abstraction (detail). A circuit can be represented at three levels as shown in the figures below.
.....................................................................................
At the architecture level, this is done using the available resources, such as the "+" operator and predefined functions like "fetch", "decode", etc.



......................................................................................
This is done using logic gates as shown, or logic functions.



......................................................................................


Here each device appears as a geometric shape, and only its size and dimensions matter.

........................................................................................







                 A designer can view a design at the behavioral, structural, or physical level. Let's say we have a half adder design for which we need an "and" gate and a "xor" gate. We first design these gates using a behavioral description. Once the gates are defined, the HA can be designed by instantiating the gates to produce the sum and carry outputs of the adder. At the physical level each gate is described with its exact dimensions (layout) and its connections to the other gates.
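The first two views of this half adder example can be sketched in Verilog as follows (module and port names are illustrative). The gates are described behaviorally, and the half adder is described structurally by connecting them.

```verilog
// Behavioral view: each gate is described by its Boolean function
module and_gate (input wire a, input wire b, output wire y);
    assign y = a & b;
endmodule

module xor_gate (input wire a, input wire b, output wire y);
    assign y = a ^ b;
endmodule

// Structural view: the half adder is a connection of gate instances
module half_adder_structural (
    input  wire a,
    input  wire b,
    output wire sum,
    output wire carry
);
    xor_gate u_xor (.a(a), .b(b), .y(sum));
    and_gate u_and (.a(a), .b(b), .y(carry));
endmodule
```

The physical view has no Verilog equivalent; it is produced later by the layout tools.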

The figure below shows the abstractions and the corresponding views.



Now let us make a simpler study of the different levels of abstraction and the design views. The figure below will help you understand it better.



System level:
  • Behavioral : It is a written specification of the end product, with a focus only on the end functions, power consumption and area
  • Structural: Modules required for the design are identified
  • Physical: Physical partitioning and board size are defined


Chip level:
  • Behavioral : behavior is explained using graphical charts and sometimes algorithms
  • Structural: Chips in the design and their connection is shown
  • Physical: size of pcb, etc, clusters(strongly connected components )

Register level:
  • Behavioral : Data flow from the i/p port to the o/p port  using registers and combinational blocks
  • Structural: Components like ALU,MUX etc and their connection is shown
  • Physical: Floor planning , standard cell 

Gate level:
  • Behavioral : Equations are used to define a function
  • Structural: Components are gates like and, or, nand etc. and their connections are shown
  • Physical: Module and cell plan

Transistor level:
  • Behavioral : Tx element equation 
  • Structural: components are transistor , resistors , capacitors etc
  • Physical: Mask geometry 
 Daniel Gajski and Robert Kuhn developed a model called the "Y" model in 1983, which was refined by Donald Thomas in 1985. Along the three axes of this model are the 3 design views, with the abstraction levels marked along each axis as shown in the figure below.
A designer can start with one perspective and may later switch to another in the chart. There is no hard rule to stick to one view on the "Y" model.


Points to remember :





  • Behavior: This domain describes the temporal and functional behavior of a system.
  • Structure: A system is assembled from subsystems. Here the different subsystems and their interconnection to each other is contemplated for each level of abstraction.
  • Geometry: Important in this domain are the geometric properties of the system and its subsystems. So there is information about the size, the shape and the physical placement. Here are the restrictions about what can be implemented, e.g. with respect to the length of connections.

  • Architectural: A system’s requirements and its basic concepts for meeting the requirements are specified here.
  • Algorithmic: The “how” aspect of a solution is refined. Functional descriptions about how the different subsystems interact, etc. are included.
  • Functional block or register-transfer: Detailed descriptions of what is going on, from what register over which line to where a data is transferred, is the contents of this level.
  • Logic: The single logic cell is in the focus here, but not limited to AND, OR gates, also Flip-Flops and the interconnections are specified.
  • Circuit: This is the actual hardware level. The transistor with its electric characteristics is used to describe the system. Information from this level printed on silicon results in the chip.

Thursday, 17 October 2013

Verilog code square root of a number using IP core

Hello friends, after a long gap I am writing a new post, as I was kept busy with some other work.
I received about 12 mails asking for Verilog code to find the square root of a number, so I thought of writing this small post on finding the sqrt of a number.

   Here we will use the IP core from the Xilinx toolbox, assuming that this module is just a part of your design and not the main project.
Create a new Verilog module and name it "sqrt" or any other name which will help you identify the module easily.

Follow these steps

  1. Create a new Verilog project.
  2. Right-click on the module created and click on New Source.
  3. Select IP (CORE Generator & Architecture Wizard) and give a name to the core, e.g. sqrt.
  4. Go to Math Functions > Square Root > CORDIC 4.0, select the core and click Next.
  5. Select the Square Root option, set the pipelining mode to Maximum and click Next.
  6. Select data format > Unsigned Integer (you can use floating if you require a floating point sqrt).
  7. Set the round mode to Truncate; this gives the integer part of the square root (rounded down). Click Next.
  8. Click on Generate.
These snapshots may help you follow the IP core generation process:














After completing the IP core generation process, declare the i/p and o/p ports in your main module, or you can just copy and paste this piece of code:

// copy from here
module sqrt(
x_in,
x_out,
clk
    );
 
input [15:0]x_in;
output [8:0]x_out;
input clk;
// PASTE YOUR INSTANCE OF IP CORE HERE
endmodule 
// end of copy

The next step is to paste the IP core instance into your main module, to connect the main module ports to the IP core. You can get the IP instance from the file with the .veo extension. The instance can also be obtained without having to find the .veo file: just switch from simulation mode to implementation mode and click on the IP core. In the process window you will find the instance code as shown below. Copy the instance and paste it into the main module, immediately after the port declarations.






















After completing this process your final code must look like this :

//
module sqrt(
x_in,
x_out,
clk
    );

input [15:0]x_in;
output [8:0]x_out;
input clk;

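// instance copied from the IP core template
// (sqare_root is the name given to the core here; use the name of your own core)
sqare_root YourInstanceName (
.x_in(x_in), // input [15 : 0] x_in
.x_out(x_out), // output [8 : 0] x_out
.clk(clk)); // input clk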

endmodule
//

Well, you are done. We are left with only the functional verification, which we can do with simulation.
Try testing your module with different values and check if it is working fine.

Here are the simulation results which I got for a few values; the IP works perfectly fine.



You can try i/p values which are not perfect squares, and you will notice that the IP truncates the square root, i.e. it rounds down to the next lower whole number.
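To check the truncated results in simulation, a small reference model of the integer square root can be handy. The sketch below is an illustrative helper and not part of the Xilinx core: it computes the rounded-down square root of a 16 bit number, returned as 9 bits to match the x_out width.

```verilog
// Simulation-only reference model: floor(sqrt(x)) for a 16 bit input
module isqrt_ref;

    function [8:0] isqrt;
        input [15:0] x;
        reg   [7:0]  r;     // result built up from MSB to LSB
        reg   [7:0]  t;     // trial value with one more bit set
        integer      i;
        begin
            r = 8'd0;
            for (i = 7; i >= 0; i = i - 1) begin
                t = r | (8'd1 << i);
                if (t * t <= x)     // keep the bit if the square still fits
                    r = t;
            end
            isqrt = {1'b0, r};
        end
    endfunction

    initial begin
        $display("isqrt(16) = %0d", isqrt(16'd16));  // perfect square
        $display("isqrt(17) = %0d", isqrt(16'd17));  // truncated result
    end
endmodule
```

Both calls should print 4, the second one because the square root of 17 is rounded down.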

I hope you like the post, though it was quite brief :p. Do let me know if there is anything you did not understand.
 Have a great day :-)

Tuesday, 1 October 2013

Linear Feed Back Shift Registers Using Verilog

Hi, here is yet another post about VLSI testing. In the last post we discussed the testing of sequential circuits with the help of scan cells. Let's assume the inputs are some 100 bits long. In such a situation it is a nightmare to enter the inputs to the circuit under test manually, and it is not practical either. Various test pattern generators have been proposed to drive the inputs of the Circuit Under Test (CUT); they produce a random pattern on every clock cycle and remove the burden of manually applying inputs to the CUT. The figure below shows the general scheme to test any circuit.













       

     The Control Unit is responsible for coordinating the operation of the testing circuit. When the MUX select signal is HIGH (1) the circuit is in TEST mode, otherwise it is in normal mode. In test mode the input to the CUT comes from the Test Pattern Generator, which applies the test vectors to the CUT. The output responses of the CUT are compared with the fault-free responses to declare the CUT faulty or fault free. In this post we will discuss the Test Generators (TG); the remaining blocks will be explained in my next post.
      The choice of TG is an important criterion for ensuring high fault coverage for the CUT and for checking whether the circuit is working or not. Pattern generators like the LFSR (Linear Feed Back Shift Register) can produce random patterns with low hardware requirements and are the preferred choice for testing. The LFSR is categorized under pseudo-random test pattern generators, which produce a new pseudo-random pattern for every clock cycle applied. The figure below shows the general structure of an LFSR.




 
     It consists of D-FFs connected in cascade as shown, with the same clock applied to all the FFs so that they act as a shift register. The only change is that the input to the first FF (D3 in the figure) is the XOR of the o/p from FFs 0 and 3 (from the figure). This XOR operation introduces a new bit into the shift register. When we take the outputs of these FFs, they form a random-looking pattern. This is the general structure of a 4 bit LFSR. The inputs to the XOR are called the taps, so from the figure above the taps are FFs 0 and 3. There is no fixed rule for which FF outputs feed the XOR in order to produce a random pattern, but the pattern has to be of maximum length. By maximum length we mean that the pattern repeats itself only after 2^N - 1 clock cycles for an N bit LFSR (the all-zero state is excluded, because an XOR-feedback LFSR would stay stuck in it). In our example, a maximum length LFSR repeats after 15 (2^4 - 1) clock cycles, so the 16th pattern is the same as the first. For a small LFSR like the present one (4 bit) it is easy to identify the taps which produce a maximum length output, but just imagine how we would identify the taps if the number of bits were 10. Obviously we can't go by the BRUTE FORCE method of trying all possible combinations to identify the taps which produce a maximum length sequence. The figure below shows the maximum length sequence produced by a 4 bit LFSR.











   

     You can notice that the pattern repeats: the 16th pattern is the same as the first, i.e. the sequence has a period of 15 (2^4 - 1). Tap identification is the major criterion for producing a sequence like this, which repeats only after 2^N - 1 clock cycles. In practice the inputs for a CUT are rarely more than 128 bits or so. Xilinx has documented the taps to be used for LFSRs of up to 165 bits. This reduces the task of coding an LFSR to just using DFFs and XOR gates with the taps given in the Xilinx documentation. With these basics we can now proceed to design an LFSR for the TG used in testing.

Design :
Components Required for Design : 
  • D-Flip Flops
  • XOR Gates
To illustrate the concept of the LFSR and the maximum length sequence we will design a 4 bit LFSR. The taps according to the Xilinx document to produce a 4 bit maximum length sequence are 4 & 3 (i.e. the inputs to the XOR gate are the outputs of FF number 4 and 3). The figure below shows the RTL of the 4 bit LFSR.

                                                                        RTL





















   
    You can notice that the inputs to the XOR gate are the o/p of DFFs 4 and 3, and the output of this XOR gate is fed as input to the first FF. The figure below shows the simulation results of the 4 bit LFSR, which produces random patterns that start over at the 16th clock cycle, i.e. the sequence has a period of 15.

                                                                 SIMULATION 



















                                                           




CODE

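module lfsr_N_1(
    d,
    q,
    rst,
    clk
);
parameter N = 3; // set N to one less than the number of FFs in your design
input clk;
input d;
input rst;
output [0:N] q;
reg [0:N] q;

always @(posedge clk)
begin
    if (rst && d)
        // seed: load a non-zero state (only the last FF, q[N], is set)
        q <= 1'b1;
    else
        // shift by one position; the XOR of the last two FFs (taps 4 and 3
        // for N=3) is fed into the first FF. Change the taps here for your design.
        q <= {q[N] ^ q[N-1], q[0:N-1]};
end
endmodule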


Advantages:
  • Low hardware 
  • Maximum length sequence can be produced 
  • Used for BIST
If you want to code an N bit LFSR, where N can be any number from 3 to 165, all you need to do is declare a parameter N and write your code for the LFSR with the taps from the Xilinx document. The links to the Xilinx document and the references are given below.
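To check that a chosen set of taps really gives a maximum length sequence, you can count the clocks until the seed pattern comes back. Below is a minimal simulation-only sketch for the 4 bit case with taps 4 and 3; the testbench name, the seed and the clock period are assumptions made for illustration.

```verilog
// Simulation-only check of the 4 bit LFSR period (taps 4 and 3)
module lfsr4_period_tb;
    reg       clk    = 1'b0;
    reg [0:3] q      = 4'b0001;   // non-zero seed
    integer   cycles = 0;

    always #5 clk = ~clk;         // free-running clock

    always @(posedge clk) begin
        // same shift and feedback as in the LFSR code above
        q      <= {q[3] ^ q[2], q[0:2]};
        cycles <= cycles + 1;

        // stop when the seed pattern is seen again
        if (cycles > 0 && q == 4'b0001) begin
            $display("sequence repeats after %0d clocks", cycles);
            $finish;
        end
    end
endmodule
```

For these taps the simulation should report 15 clocks, i.e. 2^4 - 1. The same testbench can be adapted to other widths by changing the register size, the seed and the tap positions.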



References:


Note: The code above works only for 4 bits. To make it work for any given N bits, set N and change the tap inputs. The register is declared as [0:N] (big endian bit ordering); you may change it to little endian ordering ([N:0]).