
Market-Making & Liquidity-Taking Agents Leverage Independent Policies & Reward-driven Strategies

Too Long; Didn't Read

This section focuses on RL-based Market-Making (MM) and Liquidity-Taking (LT) agents in a simulated market. By eliminating replayed market activity, each agent uses independent policies and adapts through rewards tied to PnL, liquidity, and execution. Realistic connectivity enhances the simulation's fidelity and insights.


This is Part 4 of an 11-part series based on the research paper “Reinforcement Learning In Agent-based Market Simulation: Unveiling Realistic Stylized Facts And Behavior”. Use the table of links below to navigate to the other parts.

Part 1: Abstract & Introduction

Part 2: Important Concepts

Part 3: System Description

Part 4: Agents & Simulation Details

Part 5: Experiment Design

Part 6: Continual Learning

Part 7: Experiment Results

Part 8: Market and Agent Responsiveness to External Events

Part 9: Conclusion & References

Part 10: Additional Simulation Results

Part 11: Simulation Configuration

3.2 Agents

Our formulation of the MM and LT agents is based on the one provided in [6]. That paper set out to simulate a dealer market rather than a CDA stock market. In it, quotes are provided by multiple RL agents called liquidity providers (dealers) and by an Electronic Communication Network (ECN) agent. In the practical implementation, the ECN is a replay of market quotes from a particular day. The other class of RL agents is the liquidity takers, each of which has an investment goal defined by its own reward function. The focus of that paper is the interaction between the two types of agents, specifically how the agents cooperate in the presence of an external public order flow.


In our simulation, we remove the ECN. This allows us to escape the one-reality problem inherent in financial simulators: when the simulator replays market activity, there is a single predetermined price path. Asset prices may diverge slightly, but they always come back to the real history. In the original work, all agents use a shared policy that they learn collaboratively. We believe this approach works and is relevant in their case precisely because of the fixed-flow ECN. In our case, a shared policy creates behavior that is highly correlated across agents, even though each agent has its own set of hyper-parameters and its own reward structure. This is why, in our implementation, each agent has its own policy function.


Since our system does not have a stream of orders guiding the evolution of asset prices, we have to establish the realism of the simulated market activity. Only in a realistic market can we try to understand the RL agents’ behavior. We present the specific formulations for both MM and LT agents in the following subsections.


3.2.1 Market-Making (MM) Agent


Observation Space. Each MM agent observes the mid-prices for the past 5 time steps, the prices and quantities of the top 5 levels of the LOB, its liquidity provision percentage, its current inventory, and its buying power. The liquidity provision percentage for MM agent i is measured by



Action Space. At each time step, the MM agent places two limit orders, one on each side of the LOB. The policy function of the MM agent generates three values that collectively determine both the price and the size of these limit orders:


• A percentage of the buying power, which is used to determine the size of the limit orders,


• The symmetric price tweak ϵs,


• The asymmetric price tweak ϵa.


The symmetric and asymmetric price tweaks determine the prices of both the bid and the ask orders (see Figure 1). The prices of the bid order and the ask order follow the formula



where s is the spread between the best bid and ask prices. The symmetric price tweak controls the distance between the ask price and the bid price. The asymmetric price tweak shifts the average of the bid and ask prices up or down (see Figure 1). We follow the original formulations in [6].
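The roles of the two tweaks can be illustrated with a minimal sketch. The exact expressions are those of [6]; the parameterization below is only an assumption chosen to match the description above (ϵs widens or narrows the quote pair around its center, ϵa shifts that center), and the function names, the scaling by s, and the rule that converts a fraction of buying power into a share count are all hypothetical.

```python
from typing import Tuple


def quote_prices(mid: float, spread: float, eps_s: float, eps_a: float) -> Tuple[float, float]:
    """Return (bid, ask) under an ASSUMED parameterization, not the paper's formula.

    eps_s scales the distance between bid and ask (symmetric tweak).
    eps_a shifts the center of the two quotes (asymmetric tweak).
    """
    half_width = 0.5 * spread * (1.0 + eps_s)  # assumed: width grows with eps_s
    center = mid + eps_a * spread              # assumed: center shifts with eps_a
    return center - half_width, center + half_width


def order_size(buying_power: float, fraction: float, price: float) -> int:
    """Assumed sizing rule: whole shares affordable with a fraction of buying power."""
    return int((fraction * buying_power) // price)


bid, ask = quote_prices(mid=100.0, spread=0.2, eps_s=0.5, eps_a=-0.25)
size = order_size(buying_power=10_000.0, fraction=0.25, price=100.0)
```

Under this assumed form, the distance ask − bid equals s(1 + ϵs) and the average of the two quotes equals the mid-price plus ϵa·s, so each tweak affects exactly one of the two quantities named in the text.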


Reward. The reward function for MM agents follows



Figure 1: A demonstration of the formulation of market makers’ action.



The reward structure in (2) includes both the PnL and the amount of liquidity provided by the agent. The first term in the reward function provides an incentive to increase the PnL while penalizing PnL fluctuations that result from outstanding inventory and price oscillations. The second term aims to minimize the discrepancy between the actual liquidity provision percentage and the target liquidity provision percentage.
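To make the two-term structure concrete, here is a minimal sketch of a reward with the same shape. It is not equation (2): the functional form, the use of absolute values, and the parameter names `risk_aversion` and `liquidity_weight` are assumptions made purely for illustration.

```python
def mm_reward(
    pnl_change: float,
    inventory_pnl_change: float,
    liquidity_pct: float,
    target_pct: float,
    risk_aversion: float = 0.5,     # hypothetical hyper-parameter
    liquidity_weight: float = 1.0,  # hypothetical hyper-parameter
) -> float:
    """Illustrative two-term reward; an ASSUMED form, not the paper's equation (2)."""
    # First term: reward PnL, penalize fluctuations caused by held inventory.
    pnl_term = pnl_change - risk_aversion * abs(inventory_pnl_change)
    # Second term: penalize deviation from the target liquidity provision percentage.
    liquidity_term = -liquidity_weight * abs(liquidity_pct - target_pct)
    return pnl_term + liquidity_term
```

With this shape, an agent that hits its liquidity target exactly is rewarded on risk-adjusted PnL alone, while any gap between actual and target liquidity provision lowers the reward regardless of PnL.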


3.2.2 Liquidity Taking (LT) Agent


Observation Space. LT agents have the same observation space as the MM agents, except that it does not include the liquidity provision percentage.
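The two observation spaces can be sketched as a single container in which the liquidity provision percentage is optional. The field names, the ordering of the flattened vector, and the reading of "top 5 levels" as 5 levels on each side of the book are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

HISTORY = 5  # past mid-prices observed
DEPTH = 5    # LOB levels observed (assumed: per side)


@dataclass
class Observation:
    mid_prices: List[float]   # last HISTORY mid-prices
    bid_prices: List[float]   # top DEPTH bid prices
    bid_sizes: List[float]    # top DEPTH bid quantities
    ask_prices: List[float]   # top DEPTH ask prices
    ask_sizes: List[float]    # top DEPTH ask quantities
    inventory: float
    buying_power: float
    liquidity_pct: Optional[float] = None  # set for MM agents, None for LT agents

    def to_vector(self) -> List[float]:
        """Flatten into the feature vector fed to the agent's policy."""
        vec = (
            self.mid_prices
            + self.bid_prices + self.bid_sizes
            + self.ask_prices + self.ask_sizes
            + [self.inventory, self.buying_power]
        )
        if self.liquidity_pct is not None:
            vec = vec + [self.liquidity_pct]
        return vec
```

Under these assumptions an LT observation has 27 features and an MM observation has 28, the extra one being the liquidity provision percentage.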


Action Space. At each step, an LT agent can choose to place a buy (bid) order or a sell (ask) order, or to do nothing and skip the step. Once the side is chosen, the agent sends a market order with a fixed order size.
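This three-way discrete action can be sketched as follows. The enum values, the `ORDER_SIZE` constant, and the tuple used to represent an order are hypothetical; the source only states that the size is fixed.

```python
from enum import IntEnum
from typing import Optional, Tuple


class LTAction(IntEnum):
    SKIP = 0
    BUY = 1
    SELL = 2


ORDER_SIZE = 100  # hypothetical fixed size; the actual value is not given here


def to_market_order(action: int, size: int = ORDER_SIZE) -> Optional[Tuple[str, int]]:
    """Map a discrete policy output to a (side, size) market order, or None to skip."""
    act = LTAction(action)
    if act is LTAction.SKIP:
        return None
    side = "BUY" if act is LTAction.BUY else "SELL"
    return (side, size)
```

Because the size is fixed, the policy only has to learn when to trade and in which direction, which keeps the LT action space much smaller than the MM one.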


Reward. The reward function for LT agents follows



3.3 Simulation Details

We initialize the MM and LT agents with a random amount of buying power and assets. All hyper-parameters of the MM and LT agents are randomly sampled. When a simulation starts, each agent is launched in its own thread.


At the beginning of each time step, all MM and LT agents observe the system and send their orders to it (limit orders for MM agents, market orders for LT agents). Each agent collects experience and stores it in its own dataset. Each agent is trained independently once it has collected a certain amount of data. We train the agents using Proximal Policy Optimization (PPO) [20].
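The independent collect-then-train pattern described above can be sketched with one thread per agent. This is a toy illustration, not the SHIFT-based implementation: the `Agent` class, the random action choice, the zero reward, and the `batch_size` trigger are placeholders, and the PPO update is replaced by a counter.

```python
import random
import threading
from typing import List, Tuple


class Agent:
    """Toy agent with its own policy state, experience buffer, and training trigger."""

    def __init__(self, name: str, batch_size: int, seed: int) -> None:
        self.name = name
        self.batch_size = batch_size      # hypothetical: samples needed before training
        self.rng = random.Random(seed)    # stands in for the agent's own policy
        self.buffer: List[Tuple[float, int, float]] = []
        self.updates = 0

    def act(self, observation: float) -> int:
        return self.rng.choice([0, 1, 2])  # placeholder policy

    def train(self) -> None:
        # Placeholder for a PPO update on this agent's own data only.
        self.updates += 1
        self.buffer.clear()

    def run(self, steps: int) -> None:
        for t in range(steps):
            observation = float(t)         # placeholder market observation
            action = self.act(observation)
            reward = 0.0                   # placeholder reward
            self.buffer.append((observation, action, reward))
            if len(self.buffer) >= self.batch_size:
                self.train()


def run_simulation(n_agents: int = 3, steps: int = 10, batch_size: int = 4) -> List[Agent]:
    agents = [Agent(f"agent-{i}", batch_size, seed=i) for i in range(n_agents)]
    threads = [threading.Thread(target=a.run, args=(steps,)) for a in agents]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return agents
```

Each agent touches only its own buffer and counter, so no locking is needed in this sketch; in a real system the shared object would be the exchange itself, which the agents reach over the network.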


The simulation is implemented within the SHIFT system [22], a real-time high-frequency trading platform. The SHIFT system allows clients to connect through the FIX protocol across the Internet, which realistically replicates the connectivity in modern exchanges. Additionally, random latency due to network communication introduces an element of true randomness to the simulation, enhancing its realism.


Authors:

(1) Zhiyuan Yao, Stevens Institute of Technology, Hoboken, New Jersey, USA (zyao9@stevens.edu);

(2) Zheng Li, Stevens Institute of Technology, Hoboken, New Jersey, USA (zli149@stevens.edu);

(3) Matthew Thomas, Stevens Institute of Technology, Hoboken, New Jersey, USA (mthomas3@stevens.edu);

(4) Ionut Florescu, Stevens Institute of Technology, Hoboken, New Jersey, USA (ifloresc@stevens.edu).


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.