I worked as a backend engineer on the notifications team, which was responsible for delivering every kind of notification sent by Razorpay. This included SMS and email messages sent directly to end users (OTPs, invoices, etc.), along with payment confirmation webhooks sent to merchants' servers. It was one of the most critical teams at Razorpay, and as a result we had aggressive non-functional requirement (NFR) goals, such as five-nines availability.
- I implemented asynchronous DB writes of our notification attempts using Kafka streams. This allowed far greater horizontal scaling of worker pods: what was previously capped at ~190 pods could now scale to 1000 or more, depending on load. Controlling the rate of ingestion into the DB ensured that it did not go down under unexpectedly high load, which improved the reliability of our system.
- I implemented quality-of-service-based flow control for webhooks, driven by each merchant server's response time, with Apache Pinot as the data source. After each webhook attempt, we streamed the merchant's response time details to Pinot. Our web pods then queried Pinot for merchants whose response time exceeded a predefined threshold and routed those merchants' webhooks to a lower-priority queue. This reduced p99 webhook delivery time by over 50%.
- I helped deliver the webhooks flow control and rate limiting feature, driving it end to end from tech solutioning through execution. I ran numerous performance tests and consulted with various stakeholders. This protected our service from DDoS attacks.
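
As a rough illustration of the response-time-based routing described above, here is a minimal Python sketch. It is not the production code: the threshold value, the queue names, the function names, and the in-memory mapping that stands in for the result of a Pinot query are all assumptions made for the example.

```python
from typing import AbstractSet, Mapping, Set

# Hypothetical threshold; the actual predefined value is not stated above.
SLOW_THRESHOLD_MS = 2000.0

# Hypothetical queue names, for illustration only.
DEFAULT_QUEUE = "webhooks-default"
LOW_PRIORITY_QUEUE = "webhooks-low-priority"


def slow_merchants(
    response_ms: Mapping[str, float],
    threshold_ms: float = SLOW_THRESHOLD_MS,
) -> Set[str]:
    """Return the merchants whose response time exceeds the threshold.

    `response_ms` maps merchant ID to a recent response time in
    milliseconds. In this sketch it stands in for the rows that a
    query against Pinot would return.
    """
    return {
        merchant_id
        for merchant_id, millis in response_ms.items()
        if millis > threshold_ms
    }


def choose_queue(merchant_id: str, slow: AbstractSet[str]) -> str:
    """Route slow merchants to the low-priority queue, all others to the default."""
    return LOW_PRIORITY_QUEUE if merchant_id in slow else DEFAULT_QUEUE


if __name__ == "__main__":
    stats = {"merchant_fast": 120.0, "merchant_slow": 3500.0}
    slow = slow_merchants(stats)
    for merchant in stats:
        print(merchant, choose_queue(merchant, slow))
```

Keeping the slow-merchant lookup separate from the routing decision means the set can be refreshed periodically instead of querying the data source on every webhook, although how often the real system refreshed it is not something this sketch tries to capture.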