Scaling software systems

Published on 29 January 2024

Scale is a magical word. Where 'start' implies courage, 'scale' implies success. A start-up is cute, but a scale-up is where the grown-ups work. Whatever you think of these sentiments, they do not help you meet your scalability challenges. We need a different approach.

A scaling software system is part of a larger endeavor. It doesn't have to be a VC-funded 'scale-up'. It can be a long-planned replacement system for the tax authorities or a testing platform for primary and secondary schools. The activity of scaling is the same.

The first thing to realize about scaling a software system is that you need the right people for the job. You don't need a large team to build a sizeable software artifact. However, a small team with a heavy workload does need to be experienced to work through the different phases.

That said, several recurring problems will prevent you from scaling your software systems:

Interruptions kill violently
Performance degradation kills slowly
Technical debt incapacitates

Let's break these three down.

About the author:

Jurg van Vliet has diverse work experience spanning various roles and industries. He currently serves as Aknostic's CEO and as a member of the Advisory Board at Bayer CropScience. Prior to this, Jurg founded 30MHz. He has written two books about AWS.

1. Interruptions

Interruptions can be big or small, and either way they can have a significant impact on the stability and growth of a software system. They come in various forms. They can be very visible, such as service outages during critical times (e.g., exam week for an educational platform), or hidden, such as undetected bugs that surface after deployment.

The key point is that interruptions always cause pain and disruption to the development team and potentially to users. These disruptions can lead to frustration, decreased user satisfaction, and even revenue loss if users turn to alternative solutions. Sooner or later this pain will surface, more often than not with some fireworks.

Let’s dive into a real-life example.

One of our clients serves 25 million European students and teachers in primary and secondary education with digital learning and education platforms.

During regular periods, these learning products and services see a fairly steady level of user activity, with students accessing course materials, taking quizzes, and engaging with the content at a relatively consistent rate.

However, during exam week, or whenever everyone uses the applications at the same time (e.g. when school starts), the demand on the educational platform suddenly spikes. Many students and teachers log in simultaneously for teaching, learning, or school administration.

This surge in user activity can lead to several types of interruptions:

Server Overload: The increased number of users accessing the platform can overload the servers and infrastructure, so the platform becomes slow or unresponsive.
Downtime: In extreme cases, the high demand during exam week may lead to server crashes or service outages.
Bugs and Errors: The sudden increase in traffic can expose previously undetected software bugs or performance bottlenecks.
User Frustration: Interruptions like slow response times or system failures can lead to user frustration, anxiety, and a negative perception of the educational platform.

To address such interruptions and make sure the educational tool can scale, review the scalability, flexibility and cost-efficiency of your digital learning platforms. Moving to an AWS cloud environment and using Kubernetes container orchestration can drastically improve scalability. By continually optimising the cloud infrastructure and facilitating new innovations and applications, you gain the flexibility to respond rapidly to new market opportunities.
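To make the scaling step more concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The function name, parameter names and replica bounds are illustrative assumptions, and the real autoscaler also handles pod readiness, missing metrics and stabilization windows, which this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50, tolerance=0.1):
    """Simplified sketch of the documented HPA scaling formula.

    All names and default bounds here are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    # Ratio of observed load to the load we want per replica.
    ratio = current_metric / target_metric

    # Within the tolerance band, keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Multiply before dividing so whole-number inputs stay exact.
    wanted = math.ceil(current_replicas * current_metric / target_metric)

    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, wanted))
```

For instance, 4 replicas averaging 90% CPU against a 60% target gives ceil(4 × 90 / 60) = 6 replicas, while 62% against the same target stays at 4 because the ratio falls within the tolerance.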

2. Performance degradation

Performance degradation refers to the gradual or sudden decline in the performance of a software system. The first sign is usually slow response times (the old-fashioned page load taking longer and longer) or timeouts. When a system's performance deteriorates, the immediate effect is that users get frustrated and go do (and/or buy) something else, resulting in potential revenue loss. Performance problems are often a sign that the system's infrastructure or architecture may not be able to handle increased loads or changing demands.

Let’s delve into performance degradation in more detail.

Most digital learning solutions are designed to handle a certain number of users and course enrollments efficiently during regular periods. However, as these products and services grow, the demand on them increases.

Let’s explore the following scenario to illustrate the concept of performance degradation:

Increased User Enrollment: Over time, more educational institutions sign up for the digital learning platform. This results in a significant increase in the number of tools and students using the platform. This growth in user enrollment is in itself a positive sign.
Performance Decline: As the user base and course offerings expand, the platform may start experiencing performance degradation. Users notice that page load times have increased, making it slower to access course materials, videos, and quizzes. Initially, this might be a minor annoyance, but it can become a more significant issue as the degradation continues.
Timeouts and Delays: Performance issues can manifest as timeouts when trying to submit assignments or access resource-intensive multimedia content, for example during exam week. Students and teachers may encounter delays in accessing course materials, which negatively impacts the user experience.
User Frustration: Students, teachers, and administrators begin to express frustration over the sluggishness of the platform. Slow response times and delays affect the efficiency of online learning, and students may struggle to complete assignments or participate effectively in live online classes, as was the case during COVID, for example.
Impact on Engagement: As performance issues persist, some students may disengage from the platform, affecting their learning experience. Teachers and administrators may also find it challenging to deliver engaging online classes, leading to a decline in the quality of education.
Risk of Churn: If the performance degradation continues unchecked, educational institutions may consider switching to alternative digital learning platforms that offer a smoother and more responsive experience. This poses a risk of customer churn, where institutions abandon the platform in favor of competitors.

In this scenario, performance degradation poses a significant challenge to the scalability of the digital learning platform. This is why the platform's infrastructure and architecture need to be adequately prepared to handle the rapid growth in user demand and course enrollments.

To address this issue, the platform's development and operations teams must proactively monitor performance metrics, identify bottlenecks, and implement continuous optimizations to ensure that the system can efficiently scale to meet the needs of its expanding user base.
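As a minimal sketch of what proactively monitoring a performance metric could look like, the snippet below compares the 95th percentile response time of a recent window against a baseline window and flags degradation when it has grown by more than a chosen margin. The function names, the nearest-rank percentile method, the 95th percentile and the 20% threshold are all assumptions made for illustration; in practice a monitoring system would supply the samples and the alerting.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_degraded(baseline_ms, recent_ms, pct=95, allowed_increase=0.2):
    """True when the recent percentile exceeds the baseline by the margin.

    The 20% default margin is a hypothetical threshold, not a recommendation.
    """
    baseline = percentile(baseline_ms, pct)
    recent = percentile(recent_ms, pct)
    return recent > baseline * (1 + allowed_increase)


# Response times in milliseconds for two sampling windows (made-up data).
baseline_window = [100] * 19 + [200]
recent_window = [100] * 18 + [400, 500]

if is_degraded(baseline_window, recent_window):
    print("p95 response time has degraded; look for the bottleneck")
```

A tail percentile is used here because a small share of slow requests is often the first symptom: in the made-up data above, the median of the recent window is still 100 ms even though the slowest requests have become much slower.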

3. Technical debt

Technical debt refers to the accumulated work that needs to be done in the future due to shortcuts or suboptimal decisions made during the development process. Technical debt can be subtle and may not be immediately apparent. You introduce technical debt when the software architecture does not adapt to changing requirements or circumstances. Over time, technical debt impedes developer productivity and increases the cost of maintaining and enhancing the system. It is often caused by a lack of experience or a lack of agency within the development team, where important architectural decisions are postponed or overlooked.

Let's consider a hypothetical example of how technical debt can accumulate over time:

Imagine an online course management system used by universities to deliver courses to students. This system has been in operation for several years and has grown to support thousands of students and teachers.

Initial Development: The online course management system was initially developed with a tight deadline to meet the university's need for a digital platform to deliver courses. To launch quickly, the development team made certain trade-offs, such as using a monolithic architecture instead of a more modular one.
Rapid Feature Requests: As the system gained popularity, the university received requests from teachers and students for various features. These requests included real-time chat for students, grading options, and multimedia course content. To satisfy these demands, the development team often implemented new features without a comprehensive architecture review.
Code Complexity: Over time, the codebase became increasingly complex. Multiple layers of code were added to accommodate new features, resulting in a tangled web of dependencies. Code quality standards, such as clean coding practices and code documentation, were sometimes sacrificed in favor of rapid development.
Lack of Automated Testing: Due to time constraints, the system lacked comprehensive automated testing. Manual testing was often relied upon, leading to the introduction of occasional bugs and inconsistencies when new features were added or changes were made to existing functionality.
Performance Issues: As more courses and users joined the platform, performance issues began to surface. Slow page load times, occasional downtime during peak usage, and database bottlenecks became common problems.
Technical Debt Realization: The university's IT leadership recognized that the accumulated technical debt was hindering the platform's scalability and maintainability. Developers were spending more time addressing issues and less time on innovative feature development.
Refactoring and Debt Reduction: To address the technical debt, the university allocated resources to refactor the codebase. This involved breaking down the monolithic architecture into more modular components, improving code quality, and implementing automated testing practices. The goal was to enhance system reliability and developer productivity.
Long-Term Benefits: While refactoring and reducing technical debt required an initial investment of time and resources, it led to long-term benefits. The system became more stable, scalable, and easier to maintain. Developer productivity improved as they spent less time firefighting issues and more time delivering valuable features.

In this example, the online course management system initially prioritized rapid development to meet immediate needs. However, the accumulation of technical debt began to hinder the platform's performance and maintainability. Recognizing the importance of addressing technical debt, the university took proactive steps to refactor and improve the system, ultimately ensuring its continued success and usability for both teachers and students.
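The automated testing step in this example can be sketched as a small regression test that pins down existing behavior before any refactoring starts. The `weighted_grade` function and its rounding rule are hypothetical, invented here to stand in for a piece of grading logic in such a system; only Python's standard `unittest` module is used.

```python
import unittest


def weighted_grade(scores, weights):
    """Hypothetical grading rule: weighted average rounded to one decimal."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must add up to a positive number")
    weighted_sum = sum(s * w for s, w in zip(scores, weights))
    return round(weighted_sum / total_weight, 1)


class WeightedGradeTest(unittest.TestCase):
    """Pins current behavior so a refactoring cannot change it unnoticed."""

    def test_equal_weights(self):
        self.assertEqual(weighted_grade([6, 8], [1, 1]), 7.0)

    def test_exam_counts_double(self):
        self.assertEqual(weighted_grade([6, 9], [1, 2]), 8.0)

    def test_mismatched_input_is_rejected(self):
        with self.assertRaises(ValueError):
            weighted_grade([6, 8], [1])
```

Saved in a file, these tests can be run with `python -m unittest`. With tests like these in place, breaking a monolith into modules becomes less risky, because a change in behavior shows up as a failing test rather than as a bug report during peak usage.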

Wrapping it up

To successfully scale a software system, it's crucial to address these recurring problems and to build a foundation that helps prevent them from occurring. This often involves having an experienced team that can proactively manage interruptions, monitor and optimize system performance, and actively manage technical debt. Usually, you have a customer base that directly or indirectly pays the bills, and any of the consequences illustrated above is too expensive for your organisation.

At a certain point in a software system's life, it's advisable to start building a Developer Platform. A Developer Platform delegates the operational aspects of a software infrastructure to Cloud Engineers. Together with the software engineers, they build a platform that helps ensure resilience, reliability, and the smooth operation of the software, reducing the impact of these recurring problems and enabling further scalability.