Friday, 30 May 2014

Snippet

Today:
[Code] UART done with testbench

Tomorrow:

Test it on hardware

Thursday, 29 May 2014

FSM Changed, FPS Did NOT!

I changed the FSM which controls the JPEG encoder yesterday. I initially went in the wrong direction because I had misunderstood the OPB protocol. When I finally understood the problem, the implementation was only a few lines. Earlier, the quantisation tables were written every time a new frame was loaded; now they are written only when the encoding quality changes, which saves 1024 clock cycles per frame as long as the quality stays the same. It took me several attempts to get it working. But the FPS didn't change. It will probably make a difference when the fps is very high, so that the clock cycles saved per frame become significant.
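To see why the saving is invisible right now, a quick back-of-the-envelope check (the 78.125 MHz clock and ~12 fps are figures from my earlier posts; the rest is arithmetic):

```python
# Rough estimate: how much do 1024 saved cycles matter per frame?
clk_hz = 78.125e6   # HDMI2USB fabric clock
fps = 12            # roughly what guvcview reports
saved = 1024        # table-write cycles skipped per frame

cycles_per_frame = clk_hz / fps
fraction_saved = saved / cycles_per_frame
print(f"{cycles_per_frame:.0f} cycles/frame, "
      f"{fraction_saved * 100:.4f}% saved")
```

At ~6.5 million cycles per frame, 1024 cycles is about 0.016%, so no visible fps change is exactly what you would expect.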

Also, I tried underclocking. I changed the clock divider of the JPEG clock (freq = 625 MHz / divider) from 7 to 11. FPS did not change for 7, 8 and 9, but I got a timeout error for 10 and 11, which I guess means the FPS was too low. I honestly don't know what to make of this.
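For reference, the divider settings translate to these clock frequencies (note that divider 8 gives exactly the 78.125 MHz that the rest of the fabric runs at):

```python
# Resulting JPEG clock for each divider setting: freq = 625 MHz / divider
freqs = {div: 625 / div for div in range(7, 12)}
for div, mhz in freqs.items():
    print(f"divider {div}: {mhz:.1f} MHz")
```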

Also there is a bug. I have no idea what causes it. Here is the video.
bug video
In the video, things worked after two attempts, but it can take more.

Today I was supposed to start changing the chroma subsampling of the encoder, but I guess adding UART functionality would be better as it will help me and other developers debug easily. So I will spend the day trying to get the UART working on the board.

Snippet:
Work done- Changed FSM, played with underclocking, studied chroma subsampling
Work to be done today-
- Implement UART on board


Wednesday, 28 May 2014

Yesterday I probably worked the longest (10 hrs maybe). I was trying to check whether the reason for the slow fps is the blocks after the JPEG core. To check this, I removed the signals which the USB controller gave to the JPEG core. I also wrote code which counts the done signal of the JPEG core and outputs the range of the count using LEDs (e.g. for a count less than 10, LED0 would light up). It turns out that the count of the done signal was in the same range as the output fps. I tried different combinations, like different encoding qualities and allowing some signals of the USB controller to control the JPEG core while ignoring others, and the result was more or less the same. Meanwhile the code threw a timing error which took me 2 hrs to debug.
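The LED readout works roughly like this (the bin width and LED count here are illustrative, not the exact values in my VHDL):

```python
# Sketch of the LED readout: count done pulses over one second,
# then light one LED per count range.
def led_for_count(done_count, bin_width=10, num_leds=4):
    """Map a done-pulse count to an LED index (last LED saturates)."""
    return min(done_count // bin_width, num_leds - 1)

print(led_for_count(7))    # count < 10 -> LED0
print(led_for_count(12))   # 10..19 -> LED1
```

Since the done signal fires once per compressed frame, the lit LED directly brackets the frames-per-second coming out of the core.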

Also, I did some reading on quantisation tables for JPEG. It turns out that quantisation should be low for the luminance table (quality ~85%) and higher for chrominance (around 50%). But this is a general rule and cannot be used for all applications.

Today I plan to change the JPEG top module and see whether it improves fps. I have a few ideas.

Tuesday, 27 May 2014

Finally the FPS changes!

In my previous post I mentioned how it appeared that the JPEG encoder was working at 100% encoding quality. It turns out to be true. I hardcoded the core to run at 50% and at 100% quality, and there was a clear change in fps. At 100%, the fps was the same as the normal HDMI2USB frame rate, and at 50% the frame rate jumps to around 20 fps.
Now, 100% quality encoding is not good. Why? In the quantisation step, each sample of an 8x8 block is divided by a number which depends on the encoding quality and on the frequency component of the sample. The human eye is less sensitive to high frequencies, so the quantisation process preserves low frequencies and suppresses high ones. At 100% quality all components are preserved, so the output file is large, but to the human eye it looks about as good as, say, 75% quality.
Plus, at 100% quality the clock cycles spent on quantisation are wasted, as we are basically dividing by 1.
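For illustration, this is how libjpeg (the IJG reference implementation) maps a quality setting onto the base quantisation tables; mkjpeg may use a different scheme, but the Q=100 degenerate case is the same idea:

```python
def scaled_quant(base, quality):
    """Scale a base quantisation table entry the way IJG libjpeg does.
    (mkjpeg may differ -- this just illustrates why Q=100 collapses
    every divisor to 1.)"""
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    q = (base * scale + 50) // 100
    return min(max(q, 1), 255)

print(scaled_quant(16, 100))  # luminance DC base entry 16 -> divides by 1
print(scaled_quant(16, 50))   # -> divides by 16
```

At quality 100 the scale factor is 0, so every table entry clamps to 1: all the divider hardware runs, nothing is actually quantised.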

(^100% - 11.50 fps)
(^50% - 19 fps)

I will look up some material on chrominance and luminance tables and see what is a good trade off between quality and frame rate.


Monday, 26 May 2014

Today I tried to change the encoding quality manually. The reason is that if the JPEG core is running at 100% quality, then the quantisation step of JPEG encoding is useless. For that I changed the code of the JPEG top module and used switches on the board to select the encoding quality. Unfortunately, after flashing the firmware, mplayer and guvcview could not play the HDMI input. Things are going slowly because it takes around 15 mins on my system to generate the bitstream. I will keep trying to get things working tomorrow. If things don't work, I will reach out for help.

1 Week Down!

In the past week I stuck to my goal of improving the fmax of the JPEG core. I tried to run it at 105 MHz (because clock dividers should be integers) but timing constraints failed. I tried a bunch of stuff but didn't succeed. Then I finally ran the JPEG core at 90 MHz. All timing constraints were met, but there was no improvement in speed, which was a bummer. Anyway, during the weekend I tried an optimisation but was not able to run it successfully on the hardware. The JPEG top module has the following FSM


As you can see, after every frame is compressed, the luminance and chrominance tables are written again. Plus, because of the OPB interface, there is at least a one-cycle wait for an ack, so 512*2 = 1024 cycles are wasted every time a frame is compressed. So I was trying to change the FSM so that the table-writing step is bypassed if there is no command to change the encoding quality. Unfortunately my changes are not working on hardware yet. I am planning to spend some more time on it. Also, I plan to check the efficiency of HDMI2USB minus the Cypress firmware, so that I can test whether or not the Cypress is slowing down HDMI2USB. Finally, I am planning to start modifying the code for 4:2:0 subsampling. This will take some time, as I need to rewrite the FSM of the JPEG core; probably around 2 weeks. Hope this week gives me some positive results.
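The change can be sketched as a toy model (the function and state names are mine, not mkjpeg's):

```python
# Toy model of the FSM change: write the quantisation tables only
# when the quality setting actually changed since the last frame.
def encode_frames(qualities):
    """Return how many 1024-cycle table writes a sequence of per-frame
    quality settings costs, with the bypass in place."""
    writes = 0
    last_quality = None
    for q in qualities:
        if q != last_quality:   # bypass: skip the table write otherwise
            writes += 1
            last_quality = q
        # ... encode the frame ...
    return writes

print(encode_frames([85, 85, 85, 50, 50]))  # 2 writes instead of 5
```

With a fixed quality, the old FSM paid 1024 cycles on every frame; the new one pays it once.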

Friday, 23 May 2014

As it turns out, the JPEG encoder works at 78 MHz. As I didn't have an oscilloscope, I wrote VHDL code to check the frequency of the PLL using an LED. And yes, it turned out to be greater than 70 MHz. So my theory that the core is running slow because the clock is slow turns out to be false. I did run SmartXplorer to see if I could remove the timing failure, but it didn't help. Further analysis revealed that I can safely ignore the path that was failing the timing constraint, as it was the PLL MUX select signal, which is not important. I used a TIG in the UCF file to remove it. I changed the PLL parameters to make the encoder run at 100 MHz, and strangely ISE did not throw any timing violation, but things didn't work on hardware. I will try once again over the weekend. I used guvcview to check the frame rate of the video streaming and it comes out to 12 fps on average. I will study the other parts of HDMI2USB and try to find the bottleneck.
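The LED trick is just clock division plus eyeballing: wire the clock to an N-bit counter and toggle an LED on wrap-around. With an illustrative 26-bit counter (not necessarily the width I used), the blink is slow enough to time by eye:

```python
# Estimate an on-chip clock without a scope: an N-bit counter divides
# the clock; the LED toggles on each wrap, so one full blink = 2 wraps.
def blink_hz(clk_hz, counter_bits):
    return clk_hz / (2 ** (counter_bits + 1))

print(f"{blink_hz(78.125e6, 26):.2f} Hz")  # ~0.58 Hz, easy to count
print(f"{blink_hz(70e6, 26):.2f} Hz")
```

Counting blinks over a minute distinguishes 70 MHz from 78 MHz comfortably, which is all I needed here.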

Wednesday, 21 May 2014

Clocking of HDMI2USB

Today I spent almost my whole day trying to understand the clocking of HDMI2USB. Well, it turned out to be a tedious deal because the firmware is in VHDL spanning multiple files, and it has a memory interface generated using the MIG tool, which is very poorly documented. Thanks to a blog post by Joelw (http://www.joelw.id.au/FPGA/XilinxMIGTutorial) things got slightly easier. Basically, the oscillator output (100 MHz) is connected to the memory interface, which has a PLL, and this PLL generates 6 clocks. The first 2 clocks are used by the MCB and the rest are for the user fabric. The PLL parameters used in HDMI2USB are the same as given in the blog. So except for the image buffer, which is the DDR2 RAM, the rest of the blocks of HDMI2USB run on a 78.125 MHz clock.
But something that caught my attention was this line from the blog

wire c3_clk0; // 32 MHz clock generated by PLL. Actually, this should be 78 MHz! Investigate this.

c3_clk0 is the clock driving the rest of the HDMI2USB blocks. If it is really running at 32 MHz, then we actually know our problem.

HDMI2USB is currently working at 15 fps for 720p. The JPEG encoder in simulation works at 20 fps for 1080p images with a 100 MHz clock, so it should manage roughly 45-50 fps for 720p images at 100 MHz. And if the JPEG encoder is actually being run from a 32 MHz clock, the frame rate would come out around 15-16 fps.
So if the PLL output is really 32 MHz as written by Joelw, then we know what our problem is. If the PLL output is 78 MHz, then I can't say what the problem is; it might be that the problem is not with the JPEG core but something else. Anyway, I have pinged Joelw about this and am waiting for his opinion. Tomorrow I will try investigating this. Let's see how it turns up.
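A quick sanity check of those numbers, assuming fps scales linearly with clock frequency and inversely with pixel count (a simplification: it ignores stalls and interface overhead):

```python
# fps ~ clock / pixel_count, anchored on the simulated 1080p result.
base_fps, base_clk, base_pixels = 20, 100e6, 1920 * 1080

pixels_720p = 1280 * 720
fps_720p_100mhz = base_fps * base_pixels / pixels_720p
fps_720p_32mhz = fps_720p_100mhz * 32e6 / base_clk

print(f"720p @ 100 MHz: ~{fps_720p_100mhz:.0f} fps")
print(f"720p @ 32 MHz:  ~{fps_720p_32mhz:.0f} fps")
```

The linear estimate gives ~45 fps and ~14 fps, suspiciously close to the observed 15 fps, which is what makes the 32 MHz theory attractive.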

Tuesday, 20 May 2014

Board Working!

Finally got the board working. I had initially used Adept and fxload to flash the firmware, but it turns out there seems to be a problem with fxload. So I used libfpgalink instead. I followed mithro's wiki on it, and after some effort things started to work.

(Let's ignore the toothbrush)
Now I am planning to try to improve the operating frequency of the JPEG encoder. Simulations show that at 100 MHz, 20 fps can be achieved for 1080p images. I am planning to run SmartXplorer on the design and see which mapping and PAR strategy is best. Also, I will try removing the timing violations, as ISE slows down other parts of the design to meet the timing requirements. So let's see how things work out.

Saturday, 17 May 2014

Failed to get the board running!

I came back to college. A friend of mine was kind enough to lend me his HD monitor for the summer, as he will be interning in Japan. I usually do my FPGA stuff on Windows, but as everyone in the org uses Linux, I installed ISE on Ubuntu. I received my board last week and was planning to get it running by the end of this week. Unfortunately I couldn't. I was able to successfully flash the FPGA with the HDMI2USB firmware using the Adept tool and generate the test pattern on the monitor. Then I flashed the Cypress FX2 chip with the HDMI2USB firmware using the latest version of fxload. But when I tried the dmesg command, the output did not show HDMI2USB as a USB device.
I posted the problem on IRC and am waiting for some help.

Test image 

Tuesday, 13 May 2014

Work done part-2

New day!  New blog post!

This post will describe the last idea in my head, which I did mention in my proposal but which has a serious con. It suddenly seems an attractive option, as the limited BRAM memory seems to be a problem.

The idea is basically to divide the picture into two parts, process them simultaneously, send both to the computer, and then stitch them back together.

(Taken from my proposal)

This has two advantages:
1) Currently one buffer of 16 lines is being used. To implement this I need to split the buffer in half, so no extra input buffer is required, but an extra output buffer will be needed in case the right half of the image gets processed first. That buffer will obviously be smaller, as it stores a compressed image. Of course, some extra pipelining buffers will be added due to the extra JPEG core, but the input buffer size, the main contributor to memory utilisation, remains the same.
2) Since each JPEG encoder has half the data to process, processing will be faster.

The biggest problem with this is that HDMI2USB will output the two half-frames one after the other, which will then be decoded by the UVC driver and have to be stitched back together. Honestly, I don't have much idea about UVC drivers or how much extra time the "stitching" process will take (Joel, if you can give me some pointers on this it would be great), but one thing is for sure: HDMI2USB WILL NO LONGER BE A PLUG AND PLAY DEVICE. This is why Jahanzeb was not impressed with this idea. Also, whether this will be suitable for streaming, how many extra resources will be used, etc. can only be accurately stated once it has been implemented.

Currently I am reviewing the code of the JPEG encoder. If I find something interesting I will blog about it.



Monday, 12 May 2014

Work Done so far!

Honestly I am not a blogging person, so this kind of seems unnatural to me. Anyway, I will be using this blog post to write about the stuff I have done and what I am planning to do.

Some time back I ran a test bench on the mkjpeg core, which was actually a modified version of the test bench provided with mkjpeg.
It turns out that in simulation a 1080p HD picture takes around 50 ms (simulation time) to process, which works out to 20 fps if the clock has a frequency of 100 MHz. In fact, I contacted one of the developers of the mkjpeg core and he had this to say:
"I managed to run the ip at 125mhz on a cyclone 4 grade C6.
Because i modified the ip for 4:2:0 compression, it managed  to
compress 50mpix/s : 1600x1200 images at 25fps.
I planned to run it at 150Mhz on a arriaV grade c4. At this speed, I
am sure that you can compress 1080p at 30fps with only one ip core."
The mkjpeg core currently does 4:2:2 subsampling, which is acceptable in industry, but making it 4:2:0 will definitely improve fps with some reduction in quality. The point is: if the core can be run at a faster rate, half the job is done.
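Cross-checking the developer's numbers in megapixels per second (simple arithmetic on the figures he quotes):

```python
# Throughput in megapixels/second for the quoted operating points.
def mpix(w, h, fps):
    return w * h * fps / 1e6

print(f"1600x1200 @ 25 fps: {mpix(1600, 1200, 25):.0f} Mpix/s")  # 48
print(f"1080p @ 30 fps:     {mpix(1920, 1080, 30):.1f} Mpix/s")
```

His "50 Mpix/s at 125 MHz" matches 1600x1200 at 25 fps (48 Mpix/s), and 1080p30 needs ~62 Mpix/s, which is why he expects it to need the 150 MHz target.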

So my first step towards optimisation will be trying to improve the maximum frequency of operation.
How will I go about it?

Well, the easiest thing to do is to try SmartXplorer, a Xilinx tool which tries different placement and routing strategies to meet the timing requirements. Other things I am planning: manually placement-locking the PLL of the JPEG core and tightening the timing constraints to make the PAR tool work harder. Changing the fanout constraint can also help.
The slowest block of the mkjpeg core is the Huffman encoder. I will try to add a pipeline stage somewhere in the block to reduce the combinational delay, but from what I got from reading the VHDL code, there is already a lot of pipelining in the block, so it will be tough.


Last week was spent mostly going through the mkjpeg core line by line, and believe me, it is really tough. Reading someone else's code is always tough, but the fact that the core is written in VHDL makes it painful. VHDL is based on Ada and has poor readability. Plus, the fact that the language is concurrent makes it extremely tough: lots of things happen at the same time. But from whatever I did read and managed to understand, my respect for the author of the core has increased manifold.
The core uses every trick in the book: ping-pong buffers to pipeline each block and FIFOs for dataflow pipelining. So the optimisations I thought I could make seem to have been done already.

Also, I investigated why reading from the line buffer stalls.
Well, there are two kinds of stall:
1) When FIFOs are full. The only FIFO that fills up is the one that stores DCT values. I haven't investigated this further, as these stalls last only a few cycles.
2) The second stall is a compulsory stall. The core processes 16x8 blocks but takes in data in 8x8 blocks.
First, it reads the left 8x8 block pixel by pixel and stores it in a RAM called FRAM1. As it reads the pixels, it also sends the data for processing of the Y1 component of the colour space. This takes 64 cycles.

Then the right 8x8 block is read and stored into FRAM1. While it is being read, processing of Y2 is done. This takes another 64 cycles.

Now the data for the Cr and Cb components is sent for processing. There are 128 samples, but since 4:2:2 sampling is used, only 64 samples are sent for Cr and Cb processing respectively. This takes 128 cycles, and during these 128 cycles no data is read from the line buffer, hence the stall.
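The schedule above adds up like this:

```python
# Cycle accounting for one 16x8 macroblock, as I read the code
# (4:2:2, so Cb and Cr each contribute 64 of the 128 chroma samples):
read_y1 = 64    # left 8x8 read into FRAM1, Y1 fed to DCT in parallel
read_y2 = 64    # right 8x8 read, Y2 fed in parallel
chroma = 128    # Cb + Cr fed from FRAM1; line buffer sits idle

total = read_y1 + read_y2 + chroma
stall = chroma  # no line-buffer reads during chroma processing
print(f"{total} cycles per 16x8 block, {stall} stalled "
      f"({100 * stall / total:.0f}%)")
```

So the line buffer is idle for half of every macroblock, which is the headroom the next idea tries to exploit.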

I plan to use these idle 128 cycles to add some extra processing blocks. While the Cr and Cb values are being fed to the DCT block, the next 16x8 block could be read and processed by another DCT block, which is then passed on to a new set of zig-zag and quantiser blocks. But the catch is that I can't use two run-length encoders, and hence the speed of the RLE is critical. So reading will stop only when the pipeline FIFOs are full.

Another idea is a simple brute-force technique: use two cores instead of one. But the problem is this:

Specific Feature Utilization:
 Number of Block RAM/FIFO:               77  out of    116    66%  
    Number using Block RAM only:         77
 Number of BUFG/BUFGCTRLs:               13  out of     16    81%  
 Number of DSP48A1s:                     21  out of     58    36%  
 Number of PLL_ADVs:                      4  out of      4   100%  

Already 66% of the BRAM/FIFO resources have been used. The synthesis report of the JPEG core alone looks like this regarding RAM utilisation:

Specific Feature Utilization:
 Number of Block RAM/FIFO:               65  out of    116    56%  
    Number using Block RAM only:         65
 Number of BUFG/BUFGCTRLs:                2  out of     16    12%  
 Number of DSP48A1s:                     10  out of     58    17%  
The reason for the high utilisation is the highly pipelined nature of the JPEG core. Each block has FIFOs for storing intermediate values, in addition to ping-pong memories for pipelining each step of encoding. Also, a line buffer is used which stores 16 lines of the image and is the biggest consumer of BRAMs.
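The arithmetic against a second core, using the numbers from the two reports above:

```python
# Would a second JPEG core fit? Compare free BRAMs against what one
# core uses on its own.
total_bram = 116
used_full_design = 77   # whole HDMI2USB design
used_jpeg_core = 65     # JPEG core alone

free = total_bram - used_full_design
print(f"free BRAMs: {free}, a second core needs ~{used_jpeg_core}")
print("fits" if free >= used_jpeg_core else "does not fit")
```

With only 39 BRAMs free and a core needing ~65, duplicating the whole core is off the table without shrinking its buffers first.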


Can the algorithms used in the blocks be improved?
I have completely read the code of the DCT, zig-zag and quantiser blocks. It seems the answer is no. Why?
Well, the DCT core does not use multipliers but a ROM-based look-up table, which makes it super efficient. When the pipeline is full, 8x8 blocks are processed in 64 cycles! The quantiser and zig-zag blocks also take 64 cycles when the pipeline is full. This is pretty good and very difficult to beat. Well done, Mr Krepa!

I haven't finished reading the RLE and Huffman encoding blocks, hence I can't comment on them right now.

Okay, now I am getting bored, so I will add some more stuff tomorrow. Honestly, I am in a sticky situation even before the coding period has started. I guess I should spend more time praying than coding now! :D