Friday 27 June 2014

Debug Data

Things are not as easy as they look. After plenty of debugging and testing on hardware I was finally able to get the debug code working. I haven't implemented the part which outputs the size of the encoded data, because it is a 3 byte value and I didn't have the foresight to write generalised code. The other debug values are 1 byte each.
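A minimal sketch of how a 3 byte value could go out over a link that sends one byte at a time. This is not the firmware code; the function names and the byte order (most significant byte first) are assumptions made for illustration.

```python
def split_into_bytes(value, num_bytes=3):
    """Split an unsigned integer into num_bytes bytes, most significant first.

    The byte order is an assumption; the receiver only has to agree with it.
    """
    if not 0 <= value < (1 << (8 * num_bytes)):
        raise ValueError("value does not fit in %d bytes" % num_bytes)
    return [(value >> (8 * i)) & 0xFF for i in reversed(range(num_bytes))]


def join_bytes(parts):
    """Rebuild the integer on the receiving side (same byte order)."""
    value = 0
    for b in parts:
        value = (value << 8) | b
    return value
```

With num_bytes=1 the same helper covers the 1 byte debug values, which is the kind of generalisation missing at the moment.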

So this is what I get from the UART. By the way, I am using the test image and an encoding quality of 50%.

  • Input Frame Rate: 60 Hz. This is what it is supposed to be.
  • Output Frame Rate: ranges from 15 to 21 fps. So let's take the average value to be 18.
  • Time taken to write a frame into DDR2 RAM: 16-17 ms. This is what I approximated.
  • Time taken for the encoder to compress: 23 ms. Again I was close.
  • No. of frames dropped: 2
So there are 60 frames in a second. For every frame read, 2 frames are dropped, which means only 20 frames are being processed per second.

Writing + processing time = 40 ms => 25 fps. At 20 fps each frame takes 50 ms, so I *guess* we are losing 10 milliseconds in sending the data (will check this).
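The numbers above can be sanity-checked with a small sketch. The first function is the plain "write then compress" estimate. The second one is only a hypothesis, not something measured: it assumes a new capture can only begin at the next start of frame from the source, so the time per processed frame is rounded up to a whole number of source frame periods. The function names are made up for this example.

```python
import math


def serial_fps(write_ms, compress_ms):
    """Frame rate if a frame is written and then compressed, back to back."""
    return 1000.0 / (write_ms + compress_ms)


def frame_locked_fps(source_hz, write_ms, compress_ms):
    """Frame rate if each capture must wait for the next start of frame.

    Returns (fps, source frames dropped per processed frame).
    Assumption: nothing is captured while a frame is being processed.
    """
    period_ms = 1000.0 / source_hz
    busy_ms = write_ms + compress_ms
    periods = math.ceil(busy_ms / period_ms)  # whole source frames per capture
    return source_hz / periods, periods - 1


print(serial_fps(17, 23))            # back-to-back estimate
print(frame_locked_fps(60, 17, 23))  # estimate locked to the 60 Hz source
```

Under that assumption the 40 ms of work spans three 16.7 ms source frames, which would account for both the 2 dropped frames and the 20 fps figure without any extra time lost in sending.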

So I guess we know what our bottleneck is: while an image is being written into DDR2 it is not processed, and while an image is being processed nothing is written into DDR2.


 

Wednesday 25 June 2014

Daily Snippet

  • Got the UART working: The UART of the Atlys without the vizzini driver enumerates as ttyACMx. Rohit was able to use it to send data but I was getting garbage values. It turns out that this way you can only send data from the FPGA, not to the FPGA. My UART test code echoes back the byte it receives, and since the FPGA never received anything I got garbage values. This wasted a good 2 hours of my life.
  • Shenki's UART driver is working: To install vizzini I first have to remove cdc-acm, which I had not done, so it did not work initially. After reading the README carefully I was able to get it working. The UART now enumerates as ttyUSB0.
  • Tested my code: My code was working in part, so I made some corrections. But yesterday there was a power cut even in the evening, so I couldn't do much work. I will be travelling to my parents' place today and will work from there till things become normal here.
Sorry I missed the deadline. Will complete it ASAP.

Tuesday 24 June 2014

Daily Snippet

  • Coded debug firmware, almost done. Pushed to my gsoc branch.
  • Design Documentation
  • Testing UART: worked for Rohit, not working for me (grrr)

Sunday 22 June 2014

I think I have some idea why the frame rate is slow. In my previous blog I explained how the state machine of the image buffer works. Basically, a frame is stored first and then read into the encoder for compression. From simulation data, a 1024*768 frame takes approx 25 ms (@ 78 MHz) to compress. But if the frame rate of the HDMI source is 60 Hz, then the minimum time taken to write a frame into the DDR2 RAM is 1/60 s, i.e. 17 ms (assuming it takes no time to transfer data into DDR2 RAM, which of course is not true). So it takes 42 ms to store and compress a frame, which gives a frame rate of about 23-24 fps (1000/42). There is no pipelining between frames.

In simulation, the encoder is fed with new data whenever its buffer is not full. But in the actual firmware, data needs to be fetched from DDR2 RAM, and the access time, even though small, is not negligible. Also the raw image data is buffered before being written into RAM to prevent loss of data, so it takes far more than 17 ms to write the whole frame into RAM. So my guess is that the frame rate of the DDR2 + encoder system is around 20 fps.

Also, I ran the test bench for different encoding qualities. It turns out that for 100% encoding quality the compression ratio is around 5-6, whereas for 50% encoding quality the compression ratio is as high as 20-25.

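So the output size of a frame at high encoding quality is:
1024*768*24/5/1024/1024 = 3.6 (the 24 is bits per pixel, so this is 3.6 Mbit, or about 0.45 MB)
Now the bandwidth of the Cypress FX2 is 40 MBps. Dividing 40 by 3.6 gives 11.11 fps for the high quality case, which would explain why changing the frequency did not change the frame rate (I guess). That division mixes bits and bytes though, so it only holds if the usable throughput is really about 40 Mbit/s. With 40 megabytes per second the limit would be 40/0.45, about 88 fps, and the FX2 would not be the bottleneck.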
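A small sketch of the same calculation that keeps bits and bytes apart, since the 24 in the formula is bits per pixel. The function names are made up for this example, and whether the 40 figure for the FX2 means megabytes or megabits per second is an assumption that needs checking against a measurement.

```python
MIB = 1024 * 1024


def compressed_frame_bits(width, height, bits_per_pixel, compression_ratio):
    """Size of one compressed frame in bits."""
    return width * height * bits_per_pixel / compression_ratio


def max_fps(link_bytes_per_s, frame_bits):
    """Frame rate the link could carry if it moved nothing but frame data."""
    return link_bytes_per_s / (frame_bits / 8)


bits = compressed_frame_bits(1024, 768, 24, 5)
print(bits / MIB)      # frame size in Mbit
print(bits / 8 / MIB)  # frame size in MB
print(max_fps(40 * MIB, bits))      # if the link moves 40 MB per second
print(max_fps(40 * MIB / 8, bits))  # if the link moves 40 Mbit per second
```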

In the low quality case the compression is high, so the FX2 bandwidth does not limit the frame rate; the firmware does. The observed fps is around 18.5-19, which is close to the roughly 20 fps I guessed from the calculation above.

Friends, Romans, Mentors and countrymen please comment.

Disclaimer: I have used a lot of hand-waving calculation. So if I have taken too many liberties please comment on it. I will try and give rigorous maths.

Friday 20 June 2014

Today I planned to complete the simulation of HDMI2USB, but it turns out that the calibration_done signal is not yet high (it has been 6 hours now). I tried changing the tb and using some advanced options like Joelw suggested, but nothing seems to work. So I am either doing something terribly wrong or it is supposed to take a lot of time. So finally I decided to use pen and paper and try to understand the code. It took me a lot of time because VHDL is concurrent by nature and I had not fully understood the working of the Xilinx MCB, but I guess I have done it correctly. It turns out that the image buffer works fine.

The DDR2 read and write state machines are pretty complex because memory controllers in general are complex. From what I could comprehend, there are three state machines in the image buffer of HDMI2USB: one reads from the RAM, the second writes into the RAM, and the third controls the read and write state machines.

The third state machine looks something like this:
1) Wait for start of frame
2) If start of frame is detected, start writing the frame onto RAM until end of frame is detected (wr_img=1)
3) Once end of frame is detected, send start command to JPEG encoder and wait for "Jpeg is busy" signal(wr_img=0)
4) If "Jpeg is busy" detected, start reading from RAM till the entire frame has been read(rd_img=1)
5) Wait for done signal from encoder.(rd_img=0)
6) Go back to step (1)
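The six steps above can be sketched as a small state machine model. This is a behavioural sketch in Python, not the VHDL from the project; the state names and input names are made up for illustration, and only wr_img and rd_img come from the description.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_SOF = auto()     # step 1
    WRITE_FRAME = auto()  # step 2
    START_JPEG = auto()   # step 3
    READ_FRAME = auto()   # step 4
    WAIT_DONE = auto()    # step 5


def step(state, sof=False, eof=False, jpeg_busy=False,
         frame_read=False, jpeg_done=False):
    """Advance the controller once and return (next_state, wr_img, rd_img)."""
    if state is State.WAIT_SOF and sof:
        state = State.WRITE_FRAME
    elif state is State.WRITE_FRAME and eof:
        state = State.START_JPEG
    elif state is State.START_JPEG and jpeg_busy:
        state = State.READ_FRAME
    elif state is State.READ_FRAME and frame_read:
        state = State.WAIT_DONE
    elif state is State.WAIT_DONE and jpeg_done:
        state = State.WAIT_SOF  # step 6
    wr_img = 1 if state is State.WRITE_FRAME else 0
    rd_img = 1 if state is State.READ_FRAME else 0
    return state, wr_img, rd_img
```

Because wr_img and rd_img are derived from the state, they can never be 1 at the same time in this model, which is the "no pipelining between frames" behaviour.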

I don't think read and write can be pipelined, as DDR2 RAM does not allow simultaneous read and write.

Here the only optimisation I can see is that instead of waiting for done after completing reading of frame (step 4), the state machine can wait for start of next frame and start reading.

To understand the read and the write state machines, I dug into the data sheet of MIG.
Read state machine looks something like this:

1) RESET: Wait for calibration to be done.
2) read_cmd: if rd_img=1. Put read command and address into the command fifo.
3) Wait for read data fifo of RAM to fill up. (64 words)
4) Once full, send the data into JPEG buffer if the Jpeg buffer is not full.
5) If 64 words are read goto step (2)
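A rough model of the read steps above: one read command per 64 word burst, and words are forwarded to the JPEG buffer only while it has room. The function names, the one word per pixel figure in the usage line, and the buffer capacity are assumptions for illustration, not values from the firmware.

```python
from collections import deque

BURST_WORDS = 64  # words per read burst, as in step 3


def read_commands(frame_words, burst_words=BURST_WORDS):
    """Return the (address, length) read commands needed to fetch one frame."""
    commands = []
    address = 0
    while address < frame_words:
        length = min(burst_words, frame_words - address)
        commands.append((address, length))
        address += length
    return commands


def forward_burst(read_fifo, jpeg_buffer, jpeg_capacity):
    """Move words from the read data FIFO to the JPEG buffer (step 4).

    Stops when the FIFO is empty or the JPEG buffer is full; returns the
    number of words moved.
    """
    moved = 0
    while read_fifo and len(jpeg_buffer) < jpeg_capacity:
        jpeg_buffer.append(read_fifo.popleft())
        moved += 1
    return moved


# Number of bursts for a 1024x768 frame, assuming one word per pixel.
print(len(read_commands(1024 * 768)))
```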

Write state machine:
1) Wait for calibration
2) If wr_img=1 and there is something to be written fill the write data fifo of RAM(64 words)
3) Once full, push write command and address to command fifo
4) Wait for write to complete.
5) Once done goto 2

The raw rgb data from image selector is first buffered using fifos and then sent to RAM. This helps prevent loss of data but adds to the latency.

Everything seems to be legit. The only optimisation I can see is that instead of one read port, two can be used to pipeline read cycles. So when one read data fifo is completely read and there is still space in the JPEG buffer, data from the second read data fifo can be used. But since this is a DDR2 RAM operating at 325 MHz, the read time from RAM to fifo should not be large, so using two ports won't change much.

Also, an inherent problem with the jpeg algorithm design is that 8 lines are required to start encoding. Since processing of frame is not pipelined as seen above and after processing of a frame the system resets, for a resolution of 1024x768, in every frame there is a stall of 1024*8 cycles.

Tomorrow (I mean today) I will try to test the bandwidth of USB. This article says the maximum throughput is 40 MBps. For 1024x768 frames at 30 fps, the minimum bandwidth (assuming a compression ratio of 10) should be (1024*768*24*30/10)/1024/1024 = 54 MBps, which is more than 40. Am I missing something? (One thing to check: the 24 is bits per pixel, so the result is really 54 Mbit/s, about 6.75 MB/s.)


Thursday 19 June 2014

Today I completed the coding part for the simulation of HDMI2USB. The main aim was to check whether the read and write pattern used in DDR2 RAM is optimal. For this I removed all the parts which do not affect the read and write performance of DDR2 RAM, like EDID, USB etc. The only problem is that the simulation takes a lot of time. It takes up to 4 hrs for the calibration of DDR2 RAM. I am currently simulating it. I will wake up and check the waveforms.

Wednesday 18 June 2014

I'm Back!

Finally I am back from my trip to the US. The competition didn't go the way I wanted it to. Our cansat did not send any telemetry. It also landed in a sunflower field, which meant we could not recover the cansat and the onboard memory which had all the telemetry data. But it was a good learning experience, plus I got a chance to meet people from all over the world, which was great.
I will resume working on the project. I will continue my work on simulating the HDMI2USB firmware. Hopefully, I will be able to come up with something fruitful by the end of this week.

Friday 6 June 2014

Used Joelw's code on his blog to test the DDR2 ram. After initial hiccups I was finally able to do it. Also I underclocked the DDR2 Ram to check if there is a change in frame rate. When clock was reduced by half there was no change and when I reduced it further mplayer showed a timeout error.
@shenki: I have added a new commit with the xsvf files. Try them on your board and please tell me the frame rate you observed.

Thursday 5 June 2014

Daily Snippet

-Studied ug416 and ug388 to understand XILINX MIG
-Studied working of DDR2 Ram
-Ran the example design to understand the working further. It took me a lot of time to get the clocks working at the correct rate, as the PLL values had been changed manually, which I didn't notice.

The read FSM of DDR2 looks something like this^. So 1 burst read gives 64 bytes. The JPEG encoder needs 1280*8*3 bytes to start. That means it takes a lot of cycles to fill the JPEG buffer. I will try underclocking the DDR2 RAM today and see whether it affects the frame rate.
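To put a number on "a lot", here is a small sketch that counts the read bursts needed before the encoder can start, using the 64 bytes per burst and 8 lines of 3 byte pixels from the paragraph above. The function name is made up for this example.

```python
def bursts_to_prime_encoder(width, lines=8, bytes_per_pixel=3, burst_bytes=64):
    """Return (bytes needed, read bursts needed) before encoding can start."""
    needed = width * lines * bytes_per_pixel
    bursts = -(-needed // burst_bytes)  # ceiling division
    return needed, bursts


print(bursts_to_prime_encoder(1280))  # 1280 pixel wide frame
print(bursts_to_prime_encoder(1024))  # 1024 pixel wide frame
```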

Also I will start coding the test bench.

Wednesday 4 June 2014

Daily Snippet

As I am planning to write a test bench to simulate HDMI2USB, I spent most of the day trying to understand the TMDS protocol, as I had not encountered it earlier. I was finally able to write code which generates the TMDS signal for given RGB values.
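For reference, here is a sketch of the 8 bit to 10 bit TMDS video data encoding that HDMI and DVI use, applied to one colour channel at a time. It is written from memory of the algorithm in the DVI 1.0 specification and is not the test bench code from the project, so it should be checked against the spec before being relied on. Control symbols (sync periods) are not covered, and the function names are made up for this example.

```python
def tmds_encode(byte, cnt):
    """Encode one 8 bit value. cnt is the running disparity.

    Returns (10 bit symbol, new cnt).
    """
    d = [(byte >> i) & 1 for i in range(8)]
    ones = sum(d)
    # Stage 1: XOR or XNOR chain, chosen to minimise transitions.
    use_xnor = ones > 4 or (ones == 4 and d[0] == 0)
    q_m = [d[0]]
    for i in range(1, 8):
        bit = q_m[i - 1] ^ d[i]
        q_m.append(bit ^ 1 if use_xnor else bit)
    q_m8 = 0 if use_xnor else 1
    # Stage 2: optionally invert the data bits to keep the DC balance.
    n1 = sum(q_m)
    n0 = 8 - n1
    if cnt == 0 or n1 == n0:
        invert = q_m8 == 0
        cnt += (n0 - n1) if q_m8 == 0 else (n1 - n0)
    elif (cnt > 0 and n1 > n0) or (cnt < 0 and n0 > n1):
        invert = True
        cnt += 2 * q_m8 + (n0 - n1)
    else:
        invert = False
        cnt += -2 * (1 - q_m8) + (n1 - n0)
    data = [b ^ 1 for b in q_m] if invert else q_m
    bits = data + [q_m8, 1 if invert else 0]  # bit 9 flags the inversion
    return sum(b << i for i, b in enumerate(bits)), cnt


def tmds_decode(symbol):
    """Recover the 8 bit value from a 10 bit data symbol."""
    q = [(symbol >> i) & 1 for i in range(10)]
    data = [b ^ 1 for b in q[:8]] if q[9] else q[:8]
    d = [data[0]]
    for i in range(1, 8):
        bit = data[i] ^ data[i - 1]
        d.append(bit if q[8] else bit ^ 1)
    return sum(b << i for i, b in enumerate(d))
```

Each of the R, G and B bytes of a pixel would go through its own encoder with its own running cnt, one per TMDS data channel.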

Work for Today: Understand the working of DDR2 RAM and simulate it successfully. Then I will be ready to simulate the whole design.

Tuesday 3 June 2014

UART Documentation

The Atlys board has a UART-USB bridge which can be used for UART communication with other devices. A UART is useful as a debug channel, so I have made a simple one.

Features:
1.Variable Data Bits: 7, 8, or 9 data bits and 1 or 2 stop bits
2.Parity generation and checking: odd, even, or none.
3.One transmit and one receive data buffer.
4.Received data and status can optionally be read from a single register
5.Built-in Baud Rate Generator.
6.Variable Baudrates. Use case: 19,200
7. Data is received in frame.

Architecture
UART Transmitter
 It has 3 major components:
1.FIFO Buffer
2.Baudrate Generator
3.Transmitter interface circuit

UART Receiver:
 It has 3 major components:
1.Baudrate Generator
2.Receiver interface circuit
3.FIFO Buffer 

Primary Inputs/Outputs
Inputs
1.clk,reset
2.WIRE[7:0] W_DATA - DATA INPUT TO TRANSMITTER FIFO
3.WIRE Wr_UART- WHEN SET TO HIGH W_DATA IS WRITTEN INTO TRANSMITTER FIFO 
4.WIRE Rx - UART RECEIVER LINE
5.WIRE RD_UART- READS  THE DATA FROM FIFO
Outputs
1.WIRE TX_FULL- HIGH WHEN TX FIFO IS FULL
2.WIRE TX- UART TRANSMITTER LINE
3.WIRE RX_EMPTY- HIGH WHEN RX FIFO IS EMPTY
4.WIRE R_DATA- OUTPUT FROM RX FIFO

FSM for Receiver


As the communication is asynchronous, data is oversampled. Each bit is oversampled 16 times. Oversampling is done using a mod m counter.
As the circuit is oversampling, I am running UART at 50 MHz which is generated using PLL.
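A short sketch of the arithmetic behind the mod m counter: with a 50 MHz clock and 16 samples per bit, the counter limit is clock/(16*baud rate), rounded to the nearest integer. The function names are made up for this example.

```python
def baud_divisor(clock_hz, baud_rate, oversample=16):
    """Counter limit m so that one tick is produced every m clock cycles."""
    return round(clock_hz / (oversample * baud_rate))


def divisor_bits(divisor):
    """Bits needed for a counter that counts from 0 to divisor - 1."""
    return max(1, (divisor - 1).bit_length())


def actual_baud(clock_hz, divisor, oversample=16):
    """Baud rate actually produced once the divisor has been rounded."""
    return clock_hz / (oversample * divisor)


def mod_m_ticks(m, cycles):
    """Clock cycle indices at which a mod m counter wraps and emits a tick."""
    return [c for c in range(cycles) if c % m == m - 1]


for baud in (9600, 19200):
    m = baud_divisor(50_000_000, baud)
    print(baud, m, divisor_bits(m), actual_baud(50_000_000, m))
```

For 19200 baud the rounding error is well under 1%, which is comfortably within what a UART tolerates.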

How To Use?

  • Add UART files to your design file.
  • Add UART_clock.xco. You may have to regenerate the core depending on your design.
  • Set parameters of UART_main. (The following are default parameters)
    • Data_Bits=8                  // No. of data bits
    • StopBit_ticks = 16        // No. of ticks for stop bits. 16/24/32 for 1/1.5/2 bits
    • DIVSIOR = 326           // Use it to set baud rate. Divisor = 50,000,000/(16*BaudRate), so 326 corresponds to 9600 baud
    • DVSR_BIT=9              // No. of bits of Divisor
    • FIFO_Add_Bit= 2        // No. of address bits of FIFO
  • To transmit data, drive w_data and strobe the wr_uart signal.
    • Data will not be written into fifo if tx_full is high
  • To dequeue data from receiver fifo, strobe the rd_uart signal. 
    • Data in receiver is invalid if rx_empty is high
  • Add the following lines to your ucf file.
  • Download the exar driver from here and install it. 
    • NOTE: It turns out that the Linux drivers are outdated. Shenki has made changes to them which can be found here, but I was unable to get them running. I have tested the code on Windows.
  • Use hyperterminal or gtkterm to monitor/send data.
  • Enjoy!
  • You can get the UART files from the UART folder in this link:

Monday 2 June 2014

Daily Snippet

  • Wrote code to test the UART on hardware
  • Wasted a lot of time trying to get the driver running. Shenki changed the source code to make it compatible with latest kernel but my terminal hangs when I install the drivers.
  • Finally tested the UART using Windows. Working!
  • Tested the code for different baud rates. Working for 19200.

Tomorrow:
  • Integrate the code with HDMI2USB and test it.
  • If things work, write documentation and give to HDMI2USB community to test.

Sunday 1 June 2014

Weekend Review

Work done in last week:
  • Investigated the effect of encoding quality.
  • Investigated the effect of post jpeg stages
  • Changed FSM of JPEG core
  • Underclocking of Jpeg core
  • Studied Chroma Subsampling
  • Wrote code of UART (alpha stage)
Work to be done this week:
  • Complete UART implementation with documentation on how to use it so that it can be used by other developers.(2-3 days)
  •  Then try one of these: 
    • Investigate performance of pre-jpeg blocks.
    • Write a test bench to simulate the HDMI2USB firmware.(Will take a week or two).
    • Implement chroma subsampling 4:2:0.