A decision maker observes a discrete-time system which moves between states {s1,s2,s3,s4} according to the following transition probability matrix:
p =
    0.3  0.4  0.2  0.1
    0.2  0.3  0.5  0.0
    0.1  0.0  0.8  0.1
    0.4  0.0  0.0  0.6
At each decision epoch, the decision maker may either leave the system and receive a reward of R = 20 units, or remain in the system and receive a reward of r(si) units if the system occupies state si. If the decision maker remains in the system, the state at the next decision epoch is determined by p. Assume a discount rate of 0.9 and that r(si) = i. Find the optimal policy, i.e. the one that maximizes the expected total discounted reward. (If you solve it by computer, please attach the code.)
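
One possible way to attack this numerically is value iteration on the optimality equation V(s) = max{ R, r(s) + gamma * sum_j p(s, j) V(j) }. The sketch below is only an illustration, not a reference solution. It assumes that the "discount rate of 0.9" is meant as the per-step discount factor gamma = 0.9, and that leaving ends the process with the one-time reward R. The names remain_values, value_iteration, GAMMA, tol and max_iter are made up for this example.

```python
# Value iteration for the leave-or-remain problem (illustrative sketch).
P = [
    [0.3, 0.4, 0.2, 0.1],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.0, 0.8, 0.1],
    [0.4, 0.0, 0.0, 0.6],
]
r = [1.0, 2.0, 3.0, 4.0]  # r(s_i) = i
R = 20.0                  # one-time reward for leaving
GAMMA = 0.9               # ASSUMPTION: 0.9 is the discount factor, not an interest rate


def remain_values(V):
    """Value of remaining one more step in each state, given continuation values V."""
    n = len(V)
    return [r[i] + GAMMA * sum(P[i][j] * V[j] for j in range(n)) for i in range(n)]


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate V <- max(R, r + GAMMA * P V) until successive iterates differ by < tol."""
    V = [R] * len(r)  # start from "leave immediately" in every state
    for _ in range(max_iter):
        V_new = [max(R, q) for q in remain_values(V)]
        converged = max(abs(a - b) for a, b in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final values; ties are resolved as "leave".
    policy = ["remain" if q > R else "leave" for q in remain_values(V)]
    return V, policy


V, policy = value_iteration()
for i, (v, a) in enumerate(zip(V, policy), start=1):
    print(f"s{i}: V = {v:.4f}, action = {a}")
```

Running the script prints the approximate optimal value and the chosen action for each state. If 0.9 is instead meant as an interest rate, GAMMA would have to be set to 1 / (1 + 0.9), which can change the resulting policy.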