Friday, December 11, 2020

How we improved the security posture in a small area of a large Dutch bank

I am a programmer, but last year among other things I put on the hat of "security architect" for our area (about a dozen teams). Here I will describe some steps we took to improve our security posture in the area.

Security is a business enabler. I don’t think that I need to spend lines arguing about that. But security is also hard. First of all, there are a lot of things we need to do right. Our software needs to be secure, the environment we run in, the libraries we use, how we deliver software, audit trails, compliance to regulations, the list can go on and on. We can do most of the things right, but if we do not do all of them we are vulnerable. And if our vulnerabilities are exploited, nobody is going to care about what we did right.

Next, is awareness and know-how. This is a problem for companies big and small. Security is a very wide topic, it has many different sides. The bigger the company, more items come into scope in terms of security. A smaller company probably has less resources to dedicate for security. In any case it takes effort to keep track of all the processes, regulations and policies. If we know them, they are often open to interpretation, and when interpreted we need to know what applies to us. And then, even if teams are aware, in many cases they do not know where or how to start.

And the matter of starting brings the topic of priorities. We have to do all the security work while staying competitive, innovative and driving down costs. It is difficult to balance and security usually lacks priority. The lack of priority is often due to misguided views. Views like, “we are secure”, “it will never happen”, “you built it to be secure right?” and “someone from security takes care of this”.

These are things that ring true for most organizations and some of them apply to us as well. So in the beginning of the year, we asked ourselves: how to achieve great compliance and naturally improve our security posture, acknowledging difficulties as the ones mentioned above? Our initiative, compliance shots! The idea is simple, each compliance shot is composed of:

  • Explanation of what we had to do in 2-3 sentences

  • A reference to the policy that brings the given requirement

  • A tutorial or guide on how to do it

  • A related story, to make it visible, introduce it to POs and keep track of the progress

We did a lot of these shots, and it worked remarkably well! I cannot get into details of course, but what I can do, is give a couple of closing remarks. You need allies in your quest to an improved security posture, because it is certain you will meet resistance. One or two people buying into your vision can help move the others. Also, good potential allies are people who are going to look good with the results of the initiative. Find them! Next advice is to have empathy and try to understand the reasons behind the resistance. In our case, some POs were glad that they had others letting them know of all the (security) things they have to do with explanations and tutorials. It makes their work easier. Others felt frustrated because new things were added, competing for attention with the features they want to deliver or the pressing deadlines they have to meet. You have to understand where they are coming from and only then you can motivate them to come your way.

But this is just the start. There are more initiatives in motion and they may be inspiration for another post.

Saturday, November 28, 2020

A tale of two migrations

 Within an enterprise, there are services (systems really) which are widely popular, offer just what you need and are easy to use. There are also systems, which for years the organization tries to decommission but they have so many applications depending on them, so many strings attached, it seems impossible. Often, it's the same system, at different points in time.

Recently while exploring a legacy application in order to design its Cloud native replacement, we identified a connection to such a system. We will refer to this system as the SAK (aka Swiss Army Knife). We wanted to do our part and remove one more string. The SAK’s service we consume acts in essence as a proxy for a database. After investigation, we found out that our application is the only one using the specific data (and thus the service). For the data, imagine a contact list (is not really an a contact list), which facilitates the main business offering of the application. I know I am vague, but I have to. The data in question make the main functionality easier but their lack does not make it impossible. You could still make calls without your contact list, but it would be a pain. Some clients use the application daily and some might not use it for months. 

I want to bring our focus back on the essence of what we try to do: Introduce our users to a new version of our application.  The new app has a quite different UI, runs on a different environment and has altered behavior in its interaction with the users. The services which the old application consumes, are replaced in the new one. 

What are the requirements we extracted from business?

  • Controlled and granular migration of users to the new version of the application

  • Maintain the availability as described in our SLA during migration

  • Data for the customer (those stored currently by SAK) must not be lost or corrupted. Lost updates by users during migration must be minimized, as it can affect the outcome of the main business function performed by users

  • Easy to assign and manage clients in migration slices

  • Handle possible new clients during migration

  • Clear exit strategy when migration is completed

It's a simple use case really, but a good opportunity to keep up with my writing chops. So let’s go!

We decided to view and reason about this problem as two migrations. User migration, where users will be introduced to the new application in slices and user data migration, moving their data to a database maintained by the new application. These are two very coupled migrations of course, but it helps our analysis and planning if we break it into two migrations.

User migration

For the user migration we decided to develop a small app, called here “Migration Router”. The idea is simple. A routing app stands in front of our applications (old and new) and redirects to the landing page of either one of them based on the user.

To make the redirect decision, we would need state of the user migration status. The Migration Router, will use a database for that.

Slices of users to be migrated can be defined based on business rules. Queries on user attributes would result sets of users. In simple terms, business factors easily define user slices.

The Migration Router database is loaded with existing clients at a given time. Any new client from that point, will not be present in the database and Migration Router will direct these users to the new version of the app. Rest (known to db) clients are directed based on their migration status, here represented as a boolean. The initial status for all clients in the database is “not migrated”.

The Migration Router application exposes an endpoint to programmatically update migration status of users.

What we like about this approach: it can easily be shared and re-used by other applications. Written once, the application can be deployed in multiple spaces with no code change. All you need to change is the environment properties (urls for new and old application) and the database it connects to (with the user Ids and migration status).

Let’s see a sequence diagram to make the above more clear.

Users’ data migration

Now to migration of user data. The dataset consists of around 80K records, about 500 bytes per record. Maximum records belonging to a single customer is about 5K and the median is around 100 records. We (our team) cannot get access to the SAK’s database, neither receive a snapshot from them, for ...reasons. We can either get an export with the data or access the data by calling their service.

We evaluated options for migrating data on-the-fly vs planned data migration and the level of dependency we want to have with the team that maintains SAK. We decided that our new application will not have a connection to SAK (even temporarily), we will avoid dependency from the SAK team (i.e., will call their service for migrating data) and finally, when a migrated user interacts with the application, their data will be already migrated to the database of the new application. 

This will be done in the following steps:

  • Migrate data from SAK to our new app's database, by calling SAK's service

  • When data successfully copied to the new app database, migration router app called to update user migration status

  • Initiating the migration of user data per slice, done from a pipeline, planned on out of business hours (e.g., late at night)

  • Next time the user tries to access the old app, migration router will navigate them to the new app and all their user data are in its database already

Depicted in the sequence diagram below:

What about lost updates?

We are lucky in our case. Our users access the application exclusively during business hours and the dataset we have to migrate is relatively small. So our approach does not entail a considerable risk of lost updates. But let's do a brief thought exercise. What if there was a real possibility of users altering their data while we are migrating them? How would we solve it?

We would need our database to catch up with the changes in SAK's persistence. As we said before we cannot solve this on middleware level, we have to solve it on application level. So, after we migrated the data and the user, how do we catch up? We need to capture data changes. Again, we find ourselves dependent on SAK's implementation. Some kind of log of changes could help so we could apply them in our database but is not available. However, there is a 'lastUpdated' field, so in principle we could retrieve the data again from SAK, compare with data on our side and identify inconsistencies. Of course, if  our data were heavy on writes 24/7, after the user is migrated, the data in the new application (and its db) would change before catching up and issues would arise. The problem would be much more difficult to solve. Thankfully, it isn't.

We actually did such an exercise within the context of our use case before deciding that the risk is low and we do not want to implement such controls against it. In agreement with business, we accept the risk and in the highly unlikely scenario that a lost update happens, it is still known in SAK and we could intervene to resolve conflicts in data.

Exit strategy

When all users in the Migration Router database are marked as migrated, the data migration code can be removed, the Migration Router will be removed and the calling apps will point to the new application directly. After some time and being confident about the migration, data from SAK can be removed and we will notify them that we do not use the service.

Nice and simple. Do you have better alternatives or other suggestions? Would love to hear!