Example Software Architecture Documentation with arc42 and the C4 model

Software Architecture Documentation with arc42, C4 model and Documentation as Code

1. Introduction and Goals

The purpose of this document is to describe the architecture of a highly scalable image-sharing platform that allows users to upload, store, and share images with others across the globe. The platform needs to handle millions of concurrent users while ensuring high availability, performance, and security.

This document provides a high-level overview of the platform’s architecture, its main components, and the underlying technologies used to meet business and technical requirements. Additionally, it outlines key design decisions that influence the system’s scalability, reliability, and maintainability.

This document is intended for developers, architects, DevOps engineers, and other stakeholders involved in the design, implementation, and maintenance of the platform.

1.1. Goals

The platform should provide a responsive and intuitive user interface that works seamlessly across different devices (web, mobile, etc.).
The architecture should be designed with cloud-native principles to optimize operational costs, using autoscaling and serverless services where appropriate.
The platform should be built with future feature expansion in mind, allowing for easy addition of new services (e.g., image editing, tagging, or machine learning features like image recognition).

1.2. Requirements Overview

RQ1: When a user registers, they provide:

Mandatory: First name, Last name, Email address, Password, Profile Image
Optional: age, location, interests etc …

RQ2: Users can share only images

Formats: png, jpeg
Sizes: 320x240, 1024x768

RQ3: Unidirectional relationship

User A follows User B; User B may or may not follow User A.
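
The unidirectional relationship in RQ3 can be pictured as a directed graph in which an edge from A to B means "A follows B". The sketch below is illustrative only: it uses Python for brevity even though Con3 calls for .NET, and `FollowGraph` and its method names are hypothetical, not part of the platform.

```python
from collections import defaultdict


class FollowGraph:
    """Directed follow relationships: an edge A -> B means A follows B."""

    def __init__(self) -> None:
        self._following: dict[str, set[str]] = defaultdict(set)
        self._followers: dict[str, set[str]] = defaultdict(set)

    def follow(self, follower: str, followee: str) -> None:
        # Only the A -> B edge is stored; B -> A requires a separate call.
        self._following[follower].add(followee)
        self._followers[followee].add(follower)

    def unfollow(self, follower: str, followee: str) -> None:
        self._following[follower].discard(followee)
        self._followers[followee].discard(follower)

    def is_following(self, follower: str, followee: str) -> bool:
        return followee in self._following[follower]

    def followers_of(self, user: str) -> set[str]:
        return set(self._followers[user])
```

Keeping both directions indexed makes "who follows B" as cheap to answer as "whom does A follow", which is the lookup a timeline fan-out would need.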

RQ4: A user can:

Post a new image
Search for other users by first name, last name, or username
View other users’ public info and shared images
Follow/unfollow other users
See a personalized timeline/newsfeed of the latest images posted by people they follow
- The images must appear in chronological order

Out of Scope

Support for additional formats such as text or video is out of scope for now, but it should be easy to add in the future.
A user can:
- Add reactions, comments, shares, etc.

1.3. Quality Goals

ID	Quality Category	Quality	Description
QGS1	Scalability	User visits	Billions of active users (~1-2 billion). Hundreds of millions of visits per day (100-500 million).
QGS3	Scalability	Dynamic Scaling	The system should scale out when user activity increases, without performance degradation, and scale in when there is no activity.
QGS4	Scalability	Geographic Distribution	The application should be accessible with minimal latency in the European regions through data centers or edge servers.
QGR1	Reliability	Minimizes downtime	The system must be designed to ensure minimal downtime, providing a 99.9% availability SLA.
QGP1	Performance	Low Latency	Response time: get images and search user < 500 ms at the 99th percentile; download of a 2 MB image ~2000 ms; timeline/newsfeed load time < 1000 ms at the 99th percentile. Out of scope: uploads and downloads of images must be fast and efficient, with minimal latency, even under heavy loads.
QGM1	Maintainability	Modular Design	The system should be easy to modify and extend, allowing multiple independent teams to work on different modules simultaneously without conflicts. It supports isolated updates and bug fixes. Changes such as bug fixes, feature enhancements, or performance improvements should be implemented quickly with minimal impact on other components.
QGM3	Maintainability	Teams Scalability	Enables multiple teams to work on the system simultaneously without dependencies that cause delays. The architecture should support decoupled microservices, allowing teams to deploy and update their services independently.
QGCE1	Cost Efficiency	Storage Optimization	Reduces storage space usage to minimize costs.
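
To make QGR1 and QGP1 concrete, the following sketch turns an availability SLA into a downtime budget and checks a latency target against a percentile of measured samples. It is a minimal illustration in Python, not part of the platform; the function names, the 30-day period, and the nearest-rank percentile method are assumptions. Under those assumptions, 99.9% availability leaves roughly 43 minutes of downtime per 30 days.

```python
import math


def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_goal(samples_ms: list[float], limit_ms: float, pct: float = 99) -> bool:
    """True if the chosen percentile of the samples is below the limit."""
    return percentile(samples_ms, pct) < limit_ms
```

A load test could feed measured response times into `meets_latency_goal(samples, 500)` for the image and search endpoints and `meets_latency_goal(samples, 1000)` for the timeline.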

1.4. Stakeholders

Role/Name	Contact	Expectations
Legal and Compliance Team	LegalAndComplianceReam@gmail.com	Ensure that the platform complies with global data protection regulations such as GDPR, CCPA, and HIPAA.
Marketing Team Director	marketing@gmail.com	Ensure the platform’s content is optimized for search engine discoverability (SEO), allowing users to find images through organic search traffic.

2. Architecture Constraints

ID	Constraint
Con1	The platform must comply with data protection laws (e.g., GDPR, CCPA) and intellectual property regulations. This includes implementing necessary data privacy measures, consent mechanisms, and content moderation practices to avoid legal liabilities.
Con2	The solution must adhere to budgetary constraints for development, deployment, and ongoing operational costs.
Con3	Use .NET technologies
Con4	Use Google Authentication
Con5	Use Azure Cloud

3. System Scope and Context

3.1. Business Context

Legend

Actor/System	Description
User	Users can upload images to the platform, view and browse images uploaded by themselves and other users, and manage their profiles, including security settings such as password management and other preferences. They may follow other users to receive updates or view their image feed.
Follower	Followers are notified about new images, posts, or activities (e.g., stories, comments) from the users they follow. Followers can browse and view images and other media from their followed users in their feed. Followers can follow or unfollow users based on their preferences.
Content Manager	Monitors and corrects false positives or errors from automated systems (e.g., AI-based content filtering). Reviews images, comments, and other user-generated content for violations of the platform’s guidelines (e.g., inappropriate, offensive, or illegal material). Responds to user reports of harmful content by investigating flagged images or accounts.
Image sharing system	Lets users discover new posts, accounts, and trends on the Explore page based on interests and previous interactions. Enables users to upload images and stores them in a scalable and durable storage solution.
Azure storage	Acts as the primary storage for uploaded images, allowing users to save and retrieve them quickly. Works with Azure CDN to cache images globally for faster access and reduced latency.
Google auth system	Allows users to log in to the application using their Google account credentials. Provides secure tokens (JWT or OAuth tokens) for authenticated sessions and manages the lifecycle of these tokens (e.g., refresh, revoke).

3.2. Technical Context

Legend

Actor/System	Description
User	Images: formats JPEG, PNG; max size 2 MB per image; max count 5; max resolution 1080x1080. User information: phone number, age, avatar …
Image sharing system	A highly scalable microservices architecture that utilizes eventual consistency to ensure availability and fault tolerance. It is exposed to external services via an API gateway, with access restricted exclusively to the Europe region.
Azure storage	Stores images as blob files in multiple resolutions, with a minimal redundancy configuration. Image resolutions: thumbnails (grid view or previews): 161 x 161 pixels (1:1 aspect ratio), max size under 100 KB; cover images (IGTV, collections, etc.): 420 x 654 pixels (roughly 1:1.55 aspect ratio), max size around 100-300 KB. Other types will be provided by designers.
Google auth system	Integrates using OAuth 2.0 with the implicit flow.
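
The upload limits listed for the User actor can be expressed as a small validation step. The sketch below is a hedged illustration in Python, not the platform's implementation: `ImageMeta` and `validate_upload` are hypothetical names, "2 MB" is assumed to mean 2 * 1024 * 1024 bytes, and "Max Count: 5" is assumed to apply per upload.

```python
from dataclasses import dataclass

ALLOWED_FORMATS = {"jpeg", "png"}
MAX_BYTES = 2 * 1024 * 1024      # "2 MB", assuming 1 MB = 1024 * 1024 bytes
MAX_IMAGES = 5                   # "Max Count: 5", assumed to apply per upload
MAX_WIDTH = MAX_HEIGHT = 1080    # "Max Resolution: 1080x1080"


@dataclass(frozen=True)
class ImageMeta:
    """Hypothetical metadata extracted from an uploaded file."""
    fmt: str
    size_bytes: int
    width: int
    height: int


def validate_upload(images: list[ImageMeta]) -> list[str]:
    """Return the list of violations; an empty list means the upload is accepted."""
    errors: list[str] = []
    if len(images) > MAX_IMAGES:
        errors.append(f"too many images: {len(images)} > {MAX_IMAGES}")
    for index, image in enumerate(images):
        if image.fmt.lower() not in ALLOWED_FORMATS:
            errors.append(f"image {index}: unsupported format {image.fmt!r}")
        if image.size_bytes > MAX_BYTES:
            errors.append(f"image {index}: {image.size_bytes} bytes exceeds {MAX_BYTES}")
        if image.width > MAX_WIDTH or image.height > MAX_HEIGHT:
            errors.append(
                f"image {index}: {image.width}x{image.height} exceeds {MAX_WIDTH}x{MAX_HEIGHT}"
            )
    return errors
```

Returning every violation at once, rather than failing on the first, lets a client show the user all problems with an upload in a single response.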

4. Solution Strategy

Quality Goal	Quality goal category	Solution approach	Details
QGM1, QGM2, QGM3	Maintainability, Scalability	The system will adopt a microservices architecture to meet the requirement of handling 1 billion users, as defined by the quality goals. This architecture allows each microservice to scale independently, ensuring efficient load management. The structure of microservices also supports parallel development by multiple teams, enabling faster and more flexible system evolution.	Microservices architecture design
QGM1	Maintainability	Each microservice will follow Clean Architecture principles to maintain a clear separation of concerns between business logic and external systems (e.g., databases, user interfaces). This approach promotes high maintainability, testability, and flexibility. By isolating core business rules from technical details, the system will remain adaptable to future changes, allowing easy updates and enhancements.	Clean architecture
QGS1, QGS2	Maintainability, Scalability	REST APIs for microservice interface design. REST is based on some constraints and principles that promote simplicity, scalability, and statelessness in the design. The client-server design pattern enforces the separation of concerns, which helps the client and the server components evolve independently.	What is REST
QGC1	Compatibility	API versioning and backward compatibility: API versioning was implemented to ensure that newer versions of services can be introduced without breaking existing clients. This supports smooth transitions and ensures system stability during upgrades.	api versioning
QGS3, QGR2	Reliability, Scalability	An event-driven architecture was chosen to decouple services, improve system responsiveness, and allow services to scale independently. Asynchronous communication via messaging queues supports high availability and scalability.	What is REST
QGS3, QGR2	Fault Tolerance, Reliability	Materialized views for data replication: materialized views were chosen to create copies of data in different microservice databases. This approach increases fault tolerance by ensuring that each microservice has its own local copy of the data, reducing dependency on other services and improving reliability.	Materialized view
QGS1, QGS3	Scalability	NoSQL databases with data partitioning, sharding, and replication were selected to handle large volumes of data and high load efficiently. Sharding ensures that data is distributed across multiple servers, allowing horizontal scaling as the dataset grows. Replication ensures that copies of data are stored across multiple nodes, improving fault tolerance and system reliability. Replication guarantees data availability even if a node fails, while sharding helps with load balancing by isolating data into smaller, manageable chunks.	Materialized view
QGR1, QGR2	Reliability, Availability	We chose to implement system health monitoring with automated alerting to ensure real-time tracking of system performance, failures, and anomalies. This enables proactive issue resolution, minimizes downtime, and improves overall reliability. Tools like Azure Monitor or Application Insights will be used to set up alerts and notifications based on defined thresholds.	Monitor azure resource
QGR3	Fault Tolerance	An active-active availability strategy was chosen to ensure that multiple instances of the system are running concurrently across 2 regions. This reduces downtime and improves fault tolerance, as traffic can be rerouted instantly to other active nodes in the event of a failure. It also helps with load balancing and improves response times by directing users to the nearest active instance.	active-active availability strategy
QGCE1	Cost Efficiency	To reduce storage space usage, images are resized before storage, retaining only necessary dimensions and compressing files to improve performance. This strategy minimizes storage costs and optimizes retrieval speeds, especially for high-resolution or large images that may not be necessary in full resolution for all use cases.
QGCE1	Cost Efficiency	Images are stored in a single copy, as they are not considered mission-critical. This strategy reduces storage costs and complexity while ensuring that the system can still operate effectively if images are lost or corrupted. Since the loss of images will not affect core system functionality, this approach reduces redundancy overhead, making it more cost-efficient.	storage-redundancy
QGS1	Performance, Availability, Scalability, Cost Efficiency	Azure Front Door CDN caching was chosen to offload the delivery of image content to edge servers located closer to users.	TODO
QGSEC1	Security	Azure Front Door was chosen as the API gateway to act as a single entry point for all incoming requests, hiding the internal complexities of the system. The API gateway abstracts the details of microservices and routes requests to the appropriate backend services. It also provides centralized management for security, authentication, rate limiting, logging, and monitoring. This reduces the need for client applications to directly communicate with multiple services and simplifies the architecture by providing a uniform interface for users.	gateway
QGC1	Maintainability	External configuration storage was chosen to centralize the management of system configuration settings across environments. Configuration settings can be easily updated without needing to redeploy services. This allows dynamic configuration changes without downtime, enables more secure management of sensitive information (e.g., API keys, database connections), and simplifies maintenance and updates.	azure-app-configuration
QGC1	Maintainability, Compatibility	Feature toggles (also known as feature flags) were chosen to enable or disable features dynamically without deploying new code. This allows for gradual rollouts, A/B testing, and safe experimentation in production environments. By decoupling feature releases from deployment cycles, it also minimizes the risk of introducing new features or changes that could negatively impact the user experience. This strategy enables greater flexibility in managing features and reduces the complexity of large-scale system changes.	Concept feature management
QGM3	Maintainability	CI/CD pipelines were selected to automate deployment and testing, allowing faster and more reliable releases. This strategy reduces manual errors and ensures continuous integration of new features, improving overall system agility.	Concept feature management
QGM2	Maintainability	A comprehensive testing strategy was selected to ensure system robustness, quality, and stability across all stages of development. This includes unit tests, integration tests, load tests, and acceptance tests. By automating testing as part of the CI/CD pipeline, we ensure that code is consistently validated before deployment. Load and performance testing help simulate real-world scenarios, while automated acceptance tests ensure that new features meet business requirements. This proactive approach improves the reliability and maintainability of the system.	4-different-types-of-testing integration-tests
QGS1	Scalability	Load balancing was implemented to evenly distribute traffic across multiple instances of the service, ensuring that no single instance becomes a bottleneck. This improves system performance, minimizes downtime, and ensures high availability by dynamically routing traffic to healthy instances. It also allows the system to scale horizontally by adding new instances as needed. Load balancing is crucial for handling increased load and providing a seamless user experience under varying traffic conditions.	TODO
QGS1, QGR1	Scalability, Fault Tolerance	Service discovery was chosen to enable dynamic detection and registration of services within the system. This allows services to communicate with each other without hard-coded addresses or manual configuration. It ensures that services can automatically locate and interact with the correct endpoints, even as instances are added or removed. Service discovery is critical in a microservice architecture where services may scale up or down, and it simplifies the management of complex, dynamic environments.	TODO
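
The sharding and replication approach in the table can be illustrated with a minimal hash-based placement function. This Python sketch only demonstrates the idea and is not the platform's implementation: `shard_for` and `replicas_for` are hypothetical names, and the choice of SHA-256 with modulo placement is an assumption. A managed NoSQL service would normally handle partition placement itself.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a partition key (e.g. a user Id) to a shard index using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def replicas_for(key: str, shard_count: int, replication_factor: int) -> list[int]:
    """Primary shard followed by the next shards in ring order as replicas."""
    primary = shard_for(key, shard_count)
    copies = min(replication_factor, shard_count)
    return [(primary + offset) % shard_count for offset in range(copies)]
```

Plain modulo placement moves most keys whenever the shard count changes; schemes such as consistent hashing reduce that movement and would be worth evaluating if shards are added frequently.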

5. Building Block View

Maintain an overview of your source code by making its structure understandable through abstraction.

5.1. High level design

Legend

5.2. Microservices

5.2.1. Timelines API

The Timelines API microservice is responsible for managing user timelines in the image-sharing system. It retrieves posts from users that the specified user follows, ensuring that they are ordered by recency.

Interface(s)

Locations

Fulfilled Requirements

Users can read their timeline with recent updates from users they follow.
Users can set the number of posts in their timeline, up to a maximum of 50.

Quality Requirements

Aim for responses within 200 milliseconds.
Support 1000 requests per second (RPS) during peak times.
The Timelines API must remain operational even if other parts of the system are temporarily unavailable.

Risks

Risk of slow response times or degraded performance during peak usage periods.

Components

Updating Timeline postCreated event consumer

Processes updates to user timelines in response to newly created posts.

Followers timeline updater

Reads the user’s followers list and inserts the recent post into each follower’s timeline, ensuring posts are ordered by recency.
Removes outdated posts from the followers’ timelines.
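
The updater's behaviour can be sketched as a fan-out on write. The Python below is a hedged illustration, not the component's actual code: `Post`, `insert_post`, and `fan_out` are hypothetical names, timelines are held in a plain dictionary instead of the Timelines database, and "outdated" is assumed to mean anything beyond the 50 most recent posts.

```python
from dataclasses import dataclass

MAX_TIMELINE_POSTS = 50  # assumption: reuse the 50-post timeline limit as the stored cap


@dataclass(frozen=True)
class Post:
    post_id: str
    author_id: str
    created_at: float  # hypothetical field: Unix timestamp of creation


def insert_post(timeline: list[Post], post: Post, limit: int = MAX_TIMELINE_POSTS) -> list[Post]:
    """Return a new timeline, newest first, trimmed to `limit` posts."""
    # Drop any earlier copy of the same post so redelivered events are harmless.
    merged = [p for p in timeline if p.post_id != post.post_id] + [post]
    merged.sort(key=lambda p: p.created_at, reverse=True)
    return merged[:limit]


def fan_out(timelines: dict[str, list[Post]], follower_ids: list[str], post: Post) -> None:
    """Insert the post into every follower's timeline (fan-out on write)."""
    for follower_id in follower_ids:
        timelines[follower_id] = insert_post(timelines.get(follower_id, []), post)
```

Doing the work at write time keeps reads to a single key lookup, at the cost of one write per follower for every new post.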

Users client

Responsible for interacting with the Users API microservice to retrieve user data:
Fetching the list of followers for a given user.
Retrieving the user type (regular user or influencer).
UsersClient is implemented as an abstraction and has a mock version that is used for microservice integration testing.
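
A minimal sketch of such an abstraction with an in-memory mock is shown below. It is written in Python for illustration only; the method names, the `"regular"` and `"influencer"` type strings, and the `followers_to_update` helper are assumptions rather than the service's real contract. The helper also assumes that influencer posts are not fanned out to followers, since they are kept in a separate repository.

```python
from typing import Protocol


class UsersClient(Protocol):
    """Abstraction over the Users API; method names are hypothetical."""

    def get_followers(self, user_id: str) -> list[str]: ...

    def get_user_type(self, user_id: str) -> str: ...


class MockUsersClient:
    """In-memory stand-in for integration tests; makes no network calls."""

    def __init__(self, followers: dict[str, list[str]], influencers: set[str]) -> None:
        self._followers = followers
        self._influencers = influencers

    def get_followers(self, user_id: str) -> list[str]:
        return list(self._followers.get(user_id, []))

    def get_user_type(self, user_id: str) -> str:
        return "influencer" if user_id in self._influencers else "regular"


def followers_to_update(client: UsersClient, author_id: str) -> list[str]:
    """Followers whose timelines receive the new post; none for influencers (assumption)."""
    if client.get_user_type(author_id) == "influencer":
        return []
    return client.get_followers(author_id)
```

Because the calling code depends only on the `UsersClient` shape, tests can pass `MockUsersClient` while production wiring supplies an HTTP-backed implementation.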

Timelines repository

Responsible for managing and accessing timeline data.
Saving and retrieving followers timeline by user Id.

Influencers posts repository

Responsible for storing and managing influencer posts within the timeline service.

Timeline endpoint

Handles GET timeline HTTP requests.

Timeline query

Queries and returns the user’s precomputed timeline by user Id.
If the user follows any influencers, queries and returns the user’s timeline by aggregating it with the influencers’ posts.
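
The aggregation step can be sketched as a merge of lists that are each already sorted newest first. This Python example is an assumption-laden illustration, not the query's real implementation: posts are reduced to `(created_at, post_id)` tuples, and `merge_timelines` and its `limit` default of 50 are hypothetical.

```python
import heapq
from itertools import islice

PostRef = tuple[float, str]  # (created_at, post_id); a simplified stand-in for a post


def merge_timelines(
    user_timeline: list[PostRef],
    influencer_feeds: list[list[PostRef]],
    limit: int = 50,
) -> list[PostRef]:
    """Merge newest-first lists into one newest-first list of at most `limit` posts."""
    merged = heapq.merge(
        user_timeline,
        *influencer_feeds,
        key=lambda post: post[0],
        reverse=True,  # every input must already be sorted newest first
    )
    return list(islice(merged, limit))
```

Since `heapq.merge` is lazy, only the first `limit` posts are ever pulled from the inputs, regardless of how long the influencer feeds are.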

Timelines database

The Key/Value NoSQL Timelines Database is responsible for efficiently storing and retrieving timeline data in the image-sharing system.

Timelines table

Stores each user’s timeline under a unique key, with the associated value holding the posts in chronological order, allowing for quick retrieval of timelines.

Specification	Details
Table Name	Timelines
Key	User Id. Data format: GUID as string.
Value	Array of posts, each represented as an object. Data format: JSON. Max size: 20 MB

Influencers posts table

Manages and stores a copy of the posts created by influencers.

Specification	Details
Table Name	InfluencersPosts
Key	Influencer Id. Data format: GUID as string.
Value	Array of posts, each represented as an object. Data format: JSON. Max size: 20 MB

Runtime views

Handling influencer followers timeline

The Posts API microservice dispatches a post created event. The event is consumed by the post created event consumer component.

5.2.2. Users API

The Users API microservice is responsible for managing all user-related functionality within the system. Its primary purpose is to handle operations such as user registration, authentication, profile management, and account settings. It also manages user data, including personal information, preferences, and security features like password updates and account recovery.

Interface(s)

REST API Messages

Locations

users-api details
users-api repository
users-api CI/CD
users-api database
users-api releases

Quality/Performance Characteristics

Performance

Aim for responses within 200 milliseconds.
Support several thousand requests per second (RPS) during peak times.
Minimize network latency with efficient API calls and lightweight data formats.

Scalability

Design to add more instances as user demand increases.
Distribute requests evenly across multiple API instances.

Reliability

Target 99.9% uptime with redundancy and failover strategies.
Ensure limited functionality in case of partial failures.

Fulfilled Requirements

Users can create accounts by providing necessary details (username, email, password).
Users can log in using their credentials, receiving an access token for subsequent requests.
Users can view and update their profile information (e.g., username, bio).
All sensitive user data is encrypted in transit and at rest.
The API implements rate limiting to prevent abuse and ensure fair usage.

Risks

Ensuring compliance with data protection regulations (e.g., GDPR) regarding user data handling and storage.
Risks of unauthorized access or data breaches if security measures are not adequately implemented.
Reliance on third-party authentication providers could lead to outages or service interruptions.
Complexity in the registration or authentication process may lead to user drop-off.

6. Runtime View

6.1. Sign-In

Legend

7. Deployment View

7.1. Development environment

Legend

7.1.1. Motivation

The development environment serves as a critical space for developers and QA engineers to validate features before deployment to production. This environment allows for comprehensive testing and debugging, enabling the team to identify and address issues early in the development cycle. By simulating a production-like setup, the development environment ensures that new features, updates, and configurations are thoroughly vetted, reducing the risk of errors and unexpected behavior in the live system.

Additionally, this environment promotes collaborative testing and iterative improvement, fostering a controlled, stable setting for finalizing code quality and functionality prior to production release.

7.1.2. Quality and/or Performance Features

The environment does not utilize Azure Front Door since it does not handle user traffic.
Azure Kubernetes Service (AKS) is configured with only 2 nodes, as traffic demands are minimal in this environment.
It supports deploying all microservices under separate namespaces, enabling the creation of isolated feature environments for testing individual features or changes.
Resources are optimized to reduce costs, given that the environment’s primary purpose is internal testing rather than production-scale performance.
Monitoring and logging are set up to capture development-level performance and error data, aiding in debugging without the need for high-traffic resilience.

7.2. Production environment

Legend

7.2.1. Motivation

The production environment is designed to provide a stable, secure, and high-performing infrastructure for serving live users. It is configured to handle real-world traffic, ensuring high availability and scalability to meet varying demands. By mirroring the final deployment setup, the production environment allows the system to operate at its best, delivering a seamless user experience. Robust security measures, such as firewalls, encryption, and access controls, are implemented to protect sensitive data and maintain compliance with industry standards. Additionally, advanced monitoring and alerting systems are set up to detect and respond to issues in real time, minimizing downtime and ensuring business continuity. This environment represents the ultimate stage in the deployment pipeline, where thoroughly tested features and updates are made available to end users, requiring the highest standards of reliability, performance, and resilience.

7.2.2. Quality and/or Performance Features

The production environment is configured for maximum uptime, utilizing redundancy and failover mechanisms to ensure service continuity, even in case of partial system failures.
Auto-scaling features are enabled to handle fluctuations in user demand, allowing the system to maintain performance during peak loads without manual intervention.
Strict security protocols, such as encryption, access control, and network isolation, are in place to protect sensitive data and meet industry compliance standards.
The environment is optimized for low-latency responses through global load balancing (e.g., Azure Front Door) and geographic replication, reducing response times for users across different regions.
Real-time monitoring and alerting are established to track performance metrics, identify potential issues, and trigger alerts to the operations team for rapid incident response.
Fault tolerance is implemented through distributed systems and backup mechanisms, ensuring the system can withstand hardware failures, network issues, and other disruptions without affecting user experience.
Services provide strong or eventual consistency depending on their requirements, balancing availability against the consistency needs of the different microservices in the system.

8. Cross-cutting Concepts

8.1. Logging

In the microservice architecture for our application, logging is implemented as a cross-cutting concern to provide consistent, centralized, and structured logging across all services. Using ASP.NET Core as the framework and Azure Application Insights (App Insights) as the logging and monitoring solution, our approach to logging supports traceability, debugging, and performance monitoring across the system.

8.1.1. Description

Centralized Logging: Each microservice uses ASP.NET Core’s built-in logging capabilities, which are configured to send structured log data to Azure App Insights. This setup ensures that logs are collected centrally, making it easier to track interactions and dependencies across services.
Correlation IDs: Each request across microservices is tagged with a unique Correlation ID. This ID is propagated throughout all API calls, message queues, and external service integrations, allowing developers and operators to trace the lifecycle of requests across multiple services and identify bottlenecks or errors in the request flow.
Log Levels: Different log levels (e.g., Information, Warning, Error, Critical) are used to classify log messages based on severity. Critical errors and warnings are set to trigger alerts in App Insights, providing immediate notification to the support team if any issues arise.
Performance Metrics and Insights: In addition to application logs, App Insights collects performance metrics such as request durations, failure rates, and dependencies. These metrics help monitor the health of the microservices and identify potential performance issues or resource bottlenecks.
Security and Privacy Compliance: Logging is configured to avoid sensitive data exposure. Personally identifiable information (PII) is redacted or excluded from logs, ensuring compliance with privacy regulations and standards.
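
The correlation ID handling described above could look roughly like the following ASP.NET Core middleware. This is a sketch under assumptions: the header name X-Correlation-ID and the scope property name CorrelationId are illustrative, the App Insights registration is omitted, and whether scope properties appear in App Insights depends on how the logging provider is configured.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

const string HeaderName = "X-Correlation-ID"; // assumed header name

app.Use(async (context, next) =>
{
    // Reuse the caller's ID when present, otherwise start a new one.
    var incoming = context.Request.Headers[HeaderName].ToString();
    var correlationId = string.IsNullOrWhiteSpace(incoming)
        ? Guid.NewGuid().ToString()
        : incoming;

    // Echo the ID so clients and downstream callers can log it too.
    context.Response.Headers[HeaderName] = correlationId;

    // Every log entry written while handling this request carries the ID.
    var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
    using (logger.BeginScope(new Dictionary<string, object> { ["CorrelationId"] = correlationId }))
    {
        await next(context);
    }
});

app.MapGet("/ping", () => "pong");
app.Run();
```

Outgoing HTTP calls and published messages would need to copy the same ID into their own headers for the trace to span services.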

8.2. Message Handling with MassTransit and Azure Service Bus

In our microservices architecture, message handling is treated as a cross-cutting concern to facilitate reliable and decoupled communication between services. By implementing the MassTransit library with Azure Service Bus, we enable multiple consumers to process the same message type independently, enhancing modularity and reducing service coupling.

8.2.1. Description

Multiple Consumers for the Same Event: MassTransit allows for multiple consumers to subscribe to a single event type. For example, when a PostCreatedEvent is published, both the AuditConsumer and NotificationConsumer can independently handle the message, ensuring that different parts of the system can respond to the same event in their own ways.
Decoupled Communication: This approach fosters loose coupling between services. Each consumer operates independently and can be modified or replaced without impacting other consumers or the message publisher.
Scalability and Reliability: Utilizing Azure Service Bus provides a robust messaging infrastructure that handles message queuing, delivery guarantees, and scaling, which ensures that the system remains responsive even under high load.

8.2.2. Code Example: Consumer Registration

In the following code snippet, we register two consumers—AuditConsumer and NotificationConsumer—to listen for PostCreatedEvent messages using MassTransit with Azure Service Bus.

using MassTransit;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    services.AddMassTransit(x =>
    {
        // Register each consumer for handling PostCreatedEvent
        x.AddConsumer<AuditConsumer>();
        x.AddConsumer<NotificationConsumer>();

        // Configure the MassTransit bus to use Azure Service Bus
        x.UsingAzureServiceBus((context, cfg) =>
        {
            // Placeholder: load the real connection string from configuration, not from source code
            cfg.Host("your-azure-service-bus-connection-string");

            // Publish PostCreatedEvent to a topic with an explicit name
            cfg.Message<PostCreatedEvent>(configTopology =>
            {
                configTopology.SetEntityName("post-created-topic");
            });

            // One subscription per consumer on the PostCreatedEvent topic,
            // so each consumer receives its own copy of every message
            cfg.SubscriptionEndpoint<PostCreatedEvent>("audit-consumer-subscription", e =>
            {
                e.ConfigureConsumer<AuditConsumer>(context);
            });

            cfg.SubscriptionEndpoint<PostCreatedEvent>("notification-consumer-subscription", e =>
            {
                e.ConfigureConsumer<NotificationConsumer>(context);
            });
        });
    });

    // Needed on MassTransit v7; v8 and later start the bus automatically, so omit this call there
    services.AddMassTransitHostedService();
}
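
For completeness, a consumer registered this way might be implemented as sketched below. The fields of PostCreatedEvent and the body of the Consume method are assumptions made for illustration; only the IConsumer<T> interface and ConsumeContext<T> come from MassTransit.

```csharp
using System;
using System.Threading.Tasks;
using MassTransit;
using Microsoft.Extensions.Logging;

// Hypothetical message contract; the real PostCreatedEvent fields are not shown in this document.
public record PostCreatedEvent
{
    public Guid PostId { get; init; }
    public Guid AuthorId { get; init; }
    public DateTimeOffset CreatedAt { get; init; }
}

public class AuditConsumer : IConsumer<PostCreatedEvent>
{
    private readonly ILogger<AuditConsumer> _logger;

    public AuditConsumer(ILogger<AuditConsumer> logger) => _logger = logger;

    // MassTransit calls this once for every message delivered to the consumer's subscription.
    public Task Consume(ConsumeContext<PostCreatedEvent> context)
    {
        _logger.LogInformation(
            "Audit: post {PostId} created by {AuthorId}",
            context.Message.PostId,
            context.Message.AuthorId);
        return Task.CompletedTask;
    }
}
```

NotificationConsumer would follow the same pattern with its own handling logic, and neither consumer needs to know that the other exists.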

9. Architecture Decisions

9.1. ADR 0001: Handling influencer followers timeline

9.1.1. Status

Proposed: 2024-02-21

Accepted: 2024-10-10

9.1.2. Context

Influencers may have millions of followers, and updating timelines immediately after each post could trigger updates for millions of users, risking overload of the timelines service.

9.1.3. Decision

To mitigate this, a separate materialized view table for influencer posts was created. Given the limited number of influencers, this table remains relatively small. When an influencer publishes a new post, it is added to the influencers posts table, with entries ordered in descending order, similar to the main timelines table. During a user timeline query, the timeline query component retrieves the user’s list of followed influencers. If influencers are followed, posts from the influencers posts table are aggregated with the user’s timeline posts, creating a unified, chronological feed.

The runtime view of this scenario can be found in the section Handling influencer followers timeline.

9.1.4. Consequences

The Timelines API microservice now maintains a copy of influencer posts, requiring additional logic to ensure data remains synchronized. The user timeline query is a high-load operation, so caching the influencer list is necessary to minimize dependency on the Users API microservice. However, if the Users API service is down, the influencer list cache may become outdated, resulting in an out-of-date timeline for the user.
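
The caching trade-off described above can be illustrated with a small cache that refreshes entries after a time-to-live and falls back to the stale entry when the Users API call fails. The class name FollowedInfluencersCache, the delegate-based fetch, and the choice of HttpRequestException as the failure signal are assumptions, not the actual implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public sealed class FollowedInfluencersCache
{
    private readonly ConcurrentDictionary<Guid, (IReadOnlyList<Guid> Ids, DateTimeOffset FetchedAt)> _entries = new();
    private readonly Func<Guid, Task<IReadOnlyList<Guid>>> _fetchFromUsersApi;
    private readonly TimeSpan _timeToLive;

    public FollowedInfluencersCache(
        Func<Guid, Task<IReadOnlyList<Guid>>> fetchFromUsersApi, TimeSpan timeToLive)
    {
        _fetchFromUsersApi = fetchFromUsersApi;
        _timeToLive = timeToLive;
    }

    public async Task<IReadOnlyList<Guid>> GetAsync(Guid userId)
    {
        var hasEntry = _entries.TryGetValue(userId, out var entry);
        if (hasEntry && DateTimeOffset.UtcNow - entry.FetchedAt < _timeToLive)
            return entry.Ids; // fresh enough: no call to the Users API

        try
        {
            var ids = await _fetchFromUsersApi(userId);
            _entries[userId] = (ids, DateTimeOffset.UtcNow);
            return ids;
        }
        catch (HttpRequestException) when (hasEntry)
        {
            // Users API unavailable: serve the stale list rather than fail the timeline query.
            return entry.Ids;
        }
    }
}
```

The fallback branch is exactly the situation the ADR warns about: the timeline stays available, but it may be built from an outdated influencer list until the Users API recovers.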

10. Quality Requirements

ID	Quality Category	Quality	Description
QGS2	Scalability	Users activity	Each user uploads ~1 image/day. Each image is ~2 MB. Data processing volume: ~1 PB/day.
QGR2	Reliability	Fault Tolerance	Maintains service continuity despite failures. In case of a failure, the system should redistribute load seamlessly, avoiding disruptions.
QGP1	Performance	Low Latency	Response time: get images and search user < 500 ms at the 99th percentile; download of a 2 MB image ~2000 ms; timeline/newsfeed load time < 1000 ms at the 99th percentile. Out of scope. Uploads and downloads of images must be fast and efficient, with minimal latency, even under heavy loads.
QGP2	Performance	High Throughput	The system should support up to 10,000 concurrent requests per second, especially during peak hours.
QGSEC1	Security	Access Control	Limits access to functionality to authorized users only. Only authenticated and authorized users can access specific features such as posting content and following other users. Unauthorized users can view a limited timeline.
QGS2	Security	Data Encryption	Protects sensitive data both in transit and at rest. User credentials and personal information should be encrypted to ensure privacy and comply with data regulations.
QGP1	Portability	User Interfaces	Supports only a web interface.
QGM2	Maintainability	Automated Testing	Validates system functionality with each deployment. Automated integration and regression tests should run for each deployment to reduce the risk of introducing errors or breaking functionality.
QGM3	Maintainability	Teams Scalability	Enables multiple teams to work on the system simultaneously without dependencies that cause delays. The architecture should support decoupled microservices, allowing teams to deploy and update their services independently.
QGC1	Compatibility	Backward Compatibility	Supports multiple API versions and maintains backward compatibility with existing clients. The system should allow existing clients to function without breaking when new API versions are introduced.

10.1. Quality Scenarios

11. Risks

Risk ID	Risk Description	Mitigation Strategy
R1	High load from influencers causing performance degradation due to large-scale timeline updates.	Implement a separate materialized view for influencer posts to reduce load on the main timeline. Use caching and batch processing.
R2	Dependency on external APIs, leading to potential service disruptions if APIs are down.	Use local caching and circuit breakers to handle temporary outages and minimize the impact on users.
R3	Cost overruns due to high storage and compute needs as the system scales.	Optimize storage through data compression, archiving, and implement auto-scaling to manage resources dynamically.
R4	Inconsistent data updates due to eventual consistency in microservices.	Use idempotent operations and reconcile data periodically to ensure consistency across services.
R5	Risk of unauthorized access and data breaches.	Enforce strong access controls, data encryption, and multi-factor authentication for enhanced security.
R6	Outdated data in influencer timelines due to cache staleness.	Set appropriate cache expiration times and implement cache invalidation strategies to keep data fresh.

12. Technical Debts

Debt ID	Technical Debt Description	Impact and Remediation Strategy
TD1	Lack of centralized logging system for monitoring and troubleshooting.	Makes debugging across services difficult. Remediate by implementing a centralized logging solution such as Azure Application Insights or ELK Stack.
TD2	Incomplete API versioning, leading to backward compatibility issues.	Causes disruptions for clients using older API versions. Remediate by adopting consistent API versioning and supporting multiple versions.
TD3	Insufficient automated tests for microservices.	Leads to increased risk of errors during deployment. Remediate by developing automated test suites covering unit, integration, and end-to-end tests.
TD4	Hard-coded configurations across services.	Reduces flexibility in deployment and configuration management. Remediate by using a centralized configuration management tool, like Azure App Configuration.
TD5	Inconsistent error handling and retry mechanisms in services.	Leads to unpredictable behavior during failures. Remediate by standardizing error handling and retry policies across services.
TD6	Limited resilience testing for failure scenarios (e.g., network partitions).	Increases the risk of unexpected downtime. Remediate by conducting regular chaos engineering exercises to test service resilience.
TD7	No clear deprecation policy for obsolete services or APIs.	Results in bloated codebase and confusion among developers. Remediate by establishing a deprecation policy with clear timelines for phasing out outdated services or versions.

13. Glossary

Term	Definition
API Gateway	A server that acts as an entry point for clients, managing requests to multiple backend services in a microservices architecture.
Active-Active Strategy	A high availability approach where multiple regions or instances are active simultaneously, ensuring seamless failover and load balancing.
Microservices	An architectural style that structures an application as a collection of loosely coupled services, each responsible for a specific business capability.
API Versioning	The practice of managing different versions of an API to maintain compatibility and support for clients using older versions.
Eventual Consistency	A consistency model in distributed systems where updates are not immediately reflected across all nodes but will eventually converge to the same state.
High Availability	A system design approach that ensures minimal downtime and continuous operation, often achieved through redundancy and failover mechanisms.
CI/CD Pipeline	A set of automated processes for continuous integration and continuous deployment, enabling rapid and reliable software delivery.
Cache	A temporary data storage layer that stores frequently accessed data to reduce retrieval times and improve performance.
Chaos Engineering	The practice of testing a system’s resilience by intentionally introducing failures to observe how it responds and identify areas for improvement.
Centralized Logging	A logging approach where logs from various services are collected and stored in a central location for monitoring and troubleshooting.
CRUD Operations	Basic data operations: Create, Read, Update, and Delete, commonly used in database management.
Fault Tolerance	The ability of a system to continue operating properly in the event of a failure of some of its components.
Load Balancer	A component that distributes incoming network traffic across multiple servers to ensure availability and reliability.
Namespace	A logical grouping used to organize resources, often used in Kubernetes to isolate applications or environments.
Scalability	The capability of a system to handle increased load by adding resources, such as compute power or storage.
Service Bus	A messaging infrastructure that allows applications to communicate with each other in a decoupled way, commonly used in distributed systems.
SLA (Service Level Agreement)	A commitment between a service provider and a client that defines the expected level of service performance, availability, and support.
