Standards and Conventions
Data Normalization
Normalization simplifies data by standardizing formats, making it easier to work with. However, the process can also introduce errors of its own. This article explores these issues, the trade-offs involved, and the reasoning behind Blockhouse's schema design.
1. On Orderbook: Price Impact Regression
Orderbook data processing transforms raw order book snapshots into a structured format that supports meaningful market analyses such as price impact regression.
Feature Engineering
The data is augmented with calculated metrics such as order flow imbalance (OFI), mid-prices, and market depth. This involves measuring buying and selling pressure, taking the midpoint between the highest bid and lowest ask prices, and calculating the size of orders resting at different levels of the book.
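A minimal sketch of this step, assuming level-1 snapshots in a pandas DataFrame with hypothetical columns bid_price, bid_size, ask_price, and ask_size (the actual schema, and the exact OFI definition used, may differ):

```python
import numpy as np
import pandas as pd


def add_orderbook_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add mid-price, top-of-book depth, and level-1 order flow imbalance."""
    out = df.copy()

    # Mid-price: midpoint between the best bid and the best ask.
    out["mid_price"] = (out["bid_price"] + out["ask_price"]) / 2

    # Market depth at the top of the book: total size resting at level 1.
    out["depth"] = out["bid_size"] + out["ask_size"]

    # Level-1 OFI: signed size changes at the best bid and ask between
    # consecutive snapshots (one common definition, used here for illustration).
    prev = out.shift(1)
    bid_flow = (
        np.where(out["bid_price"] >= prev["bid_price"], out["bid_size"], 0)
        - np.where(out["bid_price"] <= prev["bid_price"], prev["bid_size"], 0)
    )
    ask_flow = (
        np.where(out["ask_price"] <= prev["ask_price"], out["ask_size"], 0)
        - np.where(out["ask_price"] >= prev["ask_price"], prev["ask_size"], 0)
    )
    out["ofi"] = bid_flow - ask_flow

    return out
```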
Resampling
A key normalization step involves resampling the data to a fixed time interval (e.g. 1s, 5s, or 10s). This further standardizes the data, smoothing out noise from high-frequency updates and ensuring the dataset has uniform spacing.
For instance, data points within each interval are grouped together, with the latest price, the sum of order flow, and the average market depth captured to represent that time period.
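Continuing the sketch above, and assuming the snapshot DataFrame (here called snapshots, a hypothetical name) is indexed by timestamp, the bucketing might look like this; the 1-second interval and aggregation choices are illustrative:

```python
# Resample to a fixed 1-second grid: the latest mid-price, the summed
# order flow, and the average depth represent each interval.
bars = add_orderbook_features(snapshots).resample("1s").agg(
    {"mid_price": "last", "ofi": "sum", "depth": "mean"}
)

# Keep the grid uniform when an interval contains no updates:
# carry the last known price forward and treat missing flow as zero.
bars["mid_price"] = bars["mid_price"].ffill()
bars["ofi"] = bars["ofi"].fillna(0)
```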
Regression Feature Creation
After the data is normalized, additional regression-friendly features are created. This involves calculating forward returns to estimate future price movements and generating lagged versions of the normalized OFI to capture short-term trends. For example, the data is shifted forward to represent future returns and backward to include past observations, creating a comprehensive set of predictors.
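A sketch of these features under the same assumptions, using an illustrative 10-bar return horizon and 5 OFI lags (the actual horizon and lag count are not specified here):

```python
HORIZON = 10  # forward-return horizon in resampled bars (illustrative)
N_LAGS = 5    # number of OFI lags to include (illustrative)

# Forward return: percentage change in mid-price over the next HORIZON bars,
# obtained by shifting the price series so each row sees its own future.
bars["fwd_return"] = bars["mid_price"].shift(-HORIZON) / bars["mid_price"] - 1

# Lagged OFI: past order-flow observations aligned to the current row.
for lag in range(1, N_LAGS + 1):
    bars[f"ofi_lag_{lag}"] = bars["ofi"].shift(lag)

# Drop rows at the edges of the sample where features are incomplete.
regression_data = bars.dropna()
```

A price impact regression can then fit fwd_return against the contemporaneous and lagged OFI columns.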