One possible solution example

 

Warning! The given solution may not be the best one, but it serves as an example…

 

Coefficients: c1 & c9 = 0.125; c2 & c8 = 0.25; c3 & c7 = -0.75; c4 & c6 = 1.25; and c5 = 1.0.

 

Equation transformation (di==data_in):

data_out := (1/8) * (di-0+di-8) + (1/4) * (di-1+di-7) - (3/4) * (di-2+di-6) + (5/4) * (di-3+di-5) + di-4
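The transformation works because the coefficients are symmetric (c1 = c9, c2 = c8, and so on), so each pair of samples can be added first and scaled once. Below is a minimal Python sketch of that check using exact fractions. The function names `fir_direct` and `fir_folded`, the list layout (`di[k]` standing for di-k), and the sample values are illustrative assumptions, not part of the original solution.

```python
from fractions import Fraction as F

# Coefficients c1..c9 as listed above.
COEFFS = [F(1, 8), F(1, 4), F(-3, 4), F(5, 4), F(1),
          F(5, 4), F(-3, 4), F(1, 4), F(1, 8)]

def fir_direct(di):
    """Direct form: one multiplication per coefficient."""
    return sum(c * x for c, x in zip(COEFFS, di))

def fir_folded(di):
    """Folded form: symmetric samples are added before scaling."""
    return (F(1, 8) * (di[0] + di[8])
            + F(1, 4) * (di[1] + di[7])
            - F(3, 4) * (di[2] + di[6])
            + F(5, 4) * (di[3] + di[5])
            + di[4])

samples = [3, -7, 12, 40, 100, 41, 11, -6, 2]  # arbitrary test window
assert fir_direct(samples) == fir_folded(samples)
print(fir_folded(samples))
```

The folded form needs four scalings instead of eight, which is what makes the later shift-and-add version compact.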

 

Equation after substitution (division-> shifting):

data_out := ( (di-0+di-8) >> 3 ) + ( (di-1+di-7) >> 2 ) - ( (di-2+di-6) - ( (di-2+di-6) >> 2 ) ) + (di-3+di-5) + ( (di-3+di-5) >> 2 ) + di-4
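The substitution writes 3/4 as 1 - 1/4 and the coefficient 1.25 as 1 + 1/4, so every division becomes a right shift by 2 or 3 bits. The Python sketch below compares the shifted form with the exact one. It assumes integer samples in an 8-bit signed range and a flooring right shift (which is how Python's `>>` behaves on ints). Truncation in real hardware depends on the chosen data types, so the error bound is only illustrative. The function names are hypothetical.

```python
import random
from fractions import Fraction as F

def fir_exact(di):
    """Folded equation with exact fractional coefficients; di[k] is di-k."""
    return (F(1, 8) * (di[0] + di[8]) + F(1, 4) * (di[1] + di[7])
            - F(3, 4) * (di[2] + di[6]) + F(5, 4) * (di[3] + di[5]) + di[4])

def fir_shift(di):
    """Same equation with divisions replaced by right shifts."""
    s0, s1 = di[0] + di[8], di[1] + di[7]
    s2, s3 = di[2] + di[6], di[3] + di[5]
    return ((s0 >> 3) + (s1 >> 2)
            - (s2 - (s2 >> 2))      # 3/4 = 1 - 1/4
            + s3 + (s3 >> 2)        # 5/4 = 1 + 1/4
            + di[4])

rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    window = [rng.randint(-128, 127) for _ in range(9)]
    err = fir_exact(window) - fir_shift(window)
    # Each shift floors, dropping at most 7/8 (>>3) or 3/4 (>>2),
    # so the total loss is at most 7/8 + 3 * 3/4 = 25/8.
    assert 0 <= err <= F(25, 8)
```

When all pair sums are multiples of 8, the two forms agree exactly; otherwise the shifted result is slightly smaller than the exact value.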

 

In total there are 8 additions and 2 subtractions, scheduled over 4 steps. Each functional unit always saves its result in the same register (reg1, reg2, …, reg4), and within a step every right-hand side reads the register values written in the previous steps. Multiplexers at the adder/subtractor inputs are not optimized…

 

Step | add #1                   | add #2                   | add #3                   | sub #4
1    | reg1 := di-1 + di-7      | reg2 := di-2 + di-6      | reg3 := di-3 + di-5      | -
2    | reg1 := di-0 + di-8      | reg2 := di-4 + (reg1>>2) | reg3 := reg3 + (reg3>>2) | reg4 := reg2 - (reg2>>2)
3    | reg1 := (reg1>>3) + reg2 | -                        | -                        | reg4 := reg3 - reg4
4    | reg1 := reg1 + reg4      | -                        | -                        | -
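The schedule can be sanity-checked in software before writing any RTL. The Python sketch below steps through the four rows and compares the final reg1 with the shifted equation evaluated in one go. It is only a behavioural model under stated assumptions: unbounded Python integers stand in for the registers, `di[k]` stands for di-k, and the function names are hypothetical. Tuple assignment is used so that all registers in a step update at once from the previous values, as clocked registers would.

```python
def fir_shift(di):
    """Reference: the shifted equation evaluated directly."""
    s0, s1 = di[0] + di[8], di[1] + di[7]
    s2, s3 = di[2] + di[6], di[3] + di[5]
    return ((s0 >> 3) + (s1 >> 2) - (s2 - (s2 >> 2))
            + s3 + (s3 >> 2) + di[4])

def fir_scheduled(di):
    """Evaluate the 4-step schedule; each step updates registers in parallel."""
    # Step 1: three additions, subtractor idle
    reg1, reg2, reg3 = di[1] + di[7], di[2] + di[6], di[3] + di[5]
    # Step 2: right-hand sides read the step-1 values of reg1, reg2, reg3
    reg1, reg2, reg3, reg4 = (di[0] + di[8],
                              di[4] + (reg1 >> 2),
                              reg3 + (reg3 >> 2),
                              reg2 - (reg2 >> 2))
    # Step 3: add #1 and sub #4 are active
    reg1, reg4 = (reg1 >> 3) + reg2, reg3 - reg4
    # Step 4: final accumulation, result ends up in reg1
    reg1 = reg1 + reg4
    return reg1

window = [8, 4, 4, 4, 1, 4, 4, 4, 8]  # arbitrary test window
assert fir_scheduled(window) == fir_shift(window)
```

Because both functions apply the same shifts to the same partial sums, they should agree for any integer input, not only for the window shown.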

 

The RTL code can be found here. Synthesis results: area 3248 (1181+2067), delay 19.99.

To restrict the area during synthesis, use the command:

set_max_area [value]