Friday, December 11, 2020

How we improved the security posture in a small area of a large Dutch bank

I am a programmer, but last year, among other things, I put on the hat of "security architect" for our area (about a dozen teams). Here I will describe some steps we took to improve the area's security posture.

Security is a business enabler. I don’t think that I need to spend lines arguing about that. But security is also hard. First of all, there are a lot of things we need to do right. Our software needs to be secure, and so do the environment we run in, the libraries we use and the way we deliver software; add audit trails and regulatory compliance, and the list goes on. We can do most of these things right, but if we do not do all of them, we are vulnerable. And if our vulnerabilities are exploited, nobody is going to care about what we did right.

Next is awareness and know-how. This is a problem for companies big and small. Security is a very wide topic with many different sides. The bigger the company, the more items come into scope in terms of security; a smaller company probably has fewer resources to dedicate to security. In any case, it takes effort to keep track of all the processes, regulations and policies. Even when we know them, they are often open to interpretation, and once interpreted we need to determine what applies to us. And then, even if teams are aware, in many cases they do not know where or how to start.

And the matter of starting brings up the topic of priorities. We have to do all the security work while staying competitive, innovative and driving down costs. It is difficult to balance, and security usually lacks priority. The lack of priority is often due to misguided views - views like “we are secure”, “it will never happen”, “you built it to be secure, right?” and “someone from security takes care of this”.

These are things that ring true for most organizations, and some of them apply to us as well. So at the beginning of the year, we asked ourselves: how do we achieve great compliance and naturally improve our security posture, acknowledging difficulties such as the ones mentioned above? Our initiative: compliance shots! The idea is simple; each compliance shot is composed of:

  • An explanation of what we had to do, in 2-3 sentences

  • A reference to the policy that introduces the requirement

  • A tutorial or guide on how to do it

  • A related story, to make the work visible, introduce it to the POs and keep track of progress

We did a lot of these shots, and it worked remarkably well! I cannot get into details of course, but what I can do is give a couple of closing remarks. You need allies in your quest for an improved security posture, because it is certain you will meet resistance. One or two people buying into your vision can help move the others. Good potential allies are people who will look good because of the results of the initiative. Find them! The next piece of advice is to have empathy and try to understand the reasons behind the resistance. In our case, some POs were glad that someone was letting them know of all the (security) things they had to do, with explanations and tutorials; it made their work easier. Others felt frustrated because new things were added, competing for attention with the features they wanted to deliver or the pressing deadlines they had to meet. You have to understand where they are coming from, and only then can you motivate them to come your way.

But this is just the start. There are more initiatives in motion, and they may be the inspiration for another post.

Saturday, November 28, 2020

A tale of two migrations

Within an enterprise, there are services (systems really) which are widely popular, offer just what you need and are easy to use. There are also systems which the organization has been trying to decommission for years, but they have so many applications depending on them, so many strings attached, that it seems impossible. Often, it's the same system at different points in time.

Recently, while exploring a legacy application in order to design its cloud-native replacement, we identified a connection to such a system. We will refer to this system as the SAK (aka Swiss Army Knife). We wanted to do our part and remove one more string. The SAK’s service we consume acts in essence as a proxy for a database. After investigation, we found out that our application is the only one using the specific data (and thus the service). For the data, imagine a contact list (it is not really a contact list), which facilitates the main business offering of the application. I know I am vague, but I have to be. The data in question make the main functionality easier, but their absence does not make it impossible: you could still make calls without your contact list, it would just be a pain. Some clients use the application daily and some might not use it for months.

I want to bring our focus back to the essence of what we are trying to do: introduce our users to a new version of our application. The new app has quite a different UI, runs in a different environment and behaves differently in its interactions with users. The services which the old application consumes are replaced in the new one.

What are the requirements we extracted from business?

  • Controlled and granular migration of users to the new version of the application

  • Maintain the availability as described in our SLA during migration

  • Data for the customer (those currently stored by SAK) must not be lost or corrupted. Lost updates by users during migration must be minimized, as they can affect the outcome of the main business function performed by users

  • Easy to assign and manage clients in migration slices

  • Handle possible new clients during migration

  • Clear exit strategy when migration is completed

It's a simple use case really, but a good opportunity to keep up with my writing chops. So let’s go!

We decided to view and reason about this problem as two migrations: user migration, where users will be introduced to the new application in slices, and user data migration, moving their data to a database maintained by the new application. These are, of course, two tightly coupled migrations, but it helps our analysis and planning to break the problem in two.

User migration

For the user migration we decided to develop a small app, called here the “Migration Router”. The idea is simple: a routing app stands in front of our applications (old and new) and redirects to the landing page of either one based on the user.

To make the redirect decision, we need to keep state on each user's migration status. The Migration Router will use a database for that.

Slices of users to be migrated can be defined based on business rules: queries on user attributes yield sets of users. In simple terms, business factors easily define user slices.

The Migration Router database is loaded with the clients existing at a given time. Any new client from that point on will not be present in the database, and the Migration Router will direct these users to the new version of the app. The rest (clients known to the database) are directed based on their migration status, here represented as a boolean. The initial status for all clients in the database is “not migrated”.

The Migration Router application exposes an endpoint to programmatically update the migration status of users.
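The routing decision can be sketched in a few lines; the URLs and the in-memory stand-in for the Migration Router database are illustrative, not our actual implementation:

```python
# Migration Router decision logic (illustrative sketch).
# A user absent from the table is a new client and goes to the new app;
# known users are routed based on their migration status boolean.

OLD_APP_URL = "https://old-app.example.com"   # assumed environment property
NEW_APP_URL = "https://new-app.example.com"   # assumed environment property

# Stand-in for the Migration Router database: user id -> migrated?
migration_status = {"user-1": False, "user-2": True}

def route(user_id: str) -> str:
    migrated = migration_status.get(user_id)
    if migrated is None:
        # New client, not present at load time: straight to the new app.
        return NEW_APP_URL
    return NEW_APP_URL if migrated else OLD_APP_URL

def mark_migrated(user_id: str) -> None:
    """Backs the endpoint that updates a user's migration status."""
    migration_status[user_id] = True
```

Because the decision is a single lookup, the router stays trivially reusable: swap the two URLs and the database connection and it serves another application unchanged.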

What we like about this approach: it can easily be shared and re-used by other applications. Written once, the application can be deployed in multiple spaces with no code change. All you need to change are the environment properties (URLs for the new and old application) and the database it connects to (with the user IDs and migration status).

Let’s see a sequence diagram to make the above more clear.

Users’ data migration

Now to the migration of user data. The dataset consists of around 80K records, about 500 bytes per record. The maximum number of records belonging to a single customer is about 5K, and the median is around 100 records. We (our team) cannot get access to SAK's database, nor receive a snapshot from them, for ...reasons. We can either get an export of the data or access the data by calling their service.

We evaluated options for migrating data on the fly versus a planned data migration, and the level of dependency we want to have on the team that maintains SAK. We decided that our new application will not have a connection to SAK (not even temporarily), that we will avoid depending on the SAK team (i.e., we will call their service ourselves to migrate the data) and, finally, that by the time a migrated user interacts with the application, their data will already be migrated to the database of the new application.

This will be done in the following steps:

  • Migrate data from SAK to our new app's database by calling SAK's service

  • When the data has been successfully copied to the new app's database, call the Migration Router app to update the user's migration status

  • Initiate the migration of user data per slice from a pipeline, planned outside business hours (e.g., late at night)

  • The next time the user tries to access the old app, the Migration Router will redirect them to the new app, with all their user data already in its database
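The steps above can be sketched as a per-slice job. The `sak`, `new_app_db` and `router` objects are hypothetical stand-ins for the SAK service client, the new application's database and the Migration Router's status endpoint, not real APIs:

```python
# Per-slice data migration job, run from a pipeline outside business hours.
# A user's status is flipped only after their data is safely copied, so a
# failure leaves that user on the old app with their data intact in SAK.

def migrate_slice(user_ids, sak, new_app_db, router):
    migrated, failed = [], []
    for user_id in user_ids:
        try:
            records = sak.fetch_records(user_id)   # call SAK's service
            new_app_db.save(user_id, records)      # copy into the new app's db
            router.mark_migrated(user_id)          # flip status only on success
            migrated.append(user_id)
        except Exception:
            failed.append(user_id)                 # user keeps using the old app
    return migrated, failed
```

Reporting the `failed` list back to the pipeline lets us retry those users in a later run without blocking the rest of the slice.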

Depicted in the sequence diagram below:

What about lost updates?

We are lucky in our case: our users access the application exclusively during business hours, and the dataset we have to migrate is relatively small, so our approach does not entail a considerable risk of lost updates. But let's do a brief thought exercise. What if there were a real possibility of users altering their data while we are migrating them? How would we solve it?

We would need our database to catch up with the changes in SAK's persistence. As we said before, we cannot solve this at the middleware level; we have to solve it at the application level. So, after we have migrated the data and the user, how do we catch up? We need to capture data changes. Again, we find ourselves dependent on SAK's implementation. Some kind of change log would help, so we could apply the changes to our database, but one is not available. However, there is a 'lastUpdated' field, so in principle we could retrieve the data again from SAK, compare it with the data on our side and identify inconsistencies. Of course, if our data were heavy on writes 24/7, the data in the new application (and its db) would change after the user was migrated but before we caught up, and issues would arise. The problem would be much more difficult to solve. Thankfully, it isn't.
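As a sketch of that thought exercise, reconciliation based on the 'lastUpdated' field could look like this; the record shape (dicts keyed by an 'id' field) is an assumption for illustration:

```python
# Detect records that changed in SAK after we migrated a user, by
# comparing 'lastUpdated' timestamps (ISO 8601 strings compare correctly
# as plain strings). Records newer in SAK, or created there after the
# migration, are flagged for conflict resolution.

def find_stale_records(our_records, sak_records):
    ours = {r["id"]: r for r in our_records}
    stale = []
    for record in sak_records:
        mine = ours.get(record["id"])
        if mine is None or record["lastUpdated"] > mine["lastUpdated"]:
            stale.append(record)  # updated or created in SAK after migration
    return stale
```

This only detects inconsistencies; deciding which side wins would still be a manual or business-rule-driven step, which is exactly why we were glad to avoid it.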

We actually did such an exercise within the context of our use case before deciding that the risk is low and that we do not want to implement controls against it. In agreement with business, we accept the risk; in the highly unlikely scenario that a lost update happens, the update is still known in SAK and we could intervene to resolve conflicts in the data.

Exit strategy

When all users in the Migration Router database are marked as migrated, the data migration code can be removed, the Migration Router will be removed and the calling apps will point directly to the new application. After some time, once we are confident about the migration, the data can be removed from SAK and we will notify its team that we no longer use the service.

Nice and simple. Do you have better alternatives or other suggestions? Would love to hear!

Wednesday, August 7, 2019

Automation pipelines as a security enabler

Let’s consider automation pipelines from a security perspective. Pipelines can be a security enabler. Secure code on a developer’s machine can end up as insecure code running in production, especially when there is manual intervention in the process. Automation pipelines can mitigate that risk. We must ensure that code can be promoted to production only via the pipeline; in doing so, we greatly minimize the attack surface.

However, a pipeline that can be compromised does not provide much security assurance. Marc van Lint, a colleague and automation expert, preaches that our pipelines are as important as our application code. This is especially true when it comes to security. A lot of concepts from DevOps are already in place to help us with this. A pipeline should be defined as code, persisted in a secured repo and versioned. That way, we have good visibility of all the steps our code takes to reach production, only authorised users can make changes to it (least-privilege principle) and any changes can be traced to specific individuals.

After ensuring our pipeline is secure, we have to make it useful. On a very high level, a pipeline is mainly about testing and promoting to different environments. A good approach when it comes to pipelines is to do the cheap and quick things first. There are a lot of tools at our disposal that we can use out of the box. Static application security testing (SAST) and dynamic application security testing (DAST) are must-haves in our pipelines. Then there are compliance tests, for example for PCI DSS. OWASP maintains a list of security testing tools[1]; evaluate it and pick those you need. Then, of course, there are the tests we define ourselves: unit tests, integration tests, regression tests, performance tests, failure tests. The list is long. Tests are a confidence builder. With each test we can be more certain that our code is secure. And, also important, we generate supporting evidence (more on this later).

The above tools are used on code already in our repo or on the application running in one of our environments. But we can start earlier. Tools like Talisman[2], a pre-push hook for Git, can catch suspicious files such as authorisation tokens and private keys before we even commit. And while we are on the topic: keeping secrets is a challenging topic, and the processes in place are often insecure. This is a good place for our pipelines to help. Generating certificates, following password policies and keeping secrets secret should be automated. This ensures security by default. Pipelines again to the rescue!

Now about promoting through environments. The binaries are built once and they are used for all stages (test, acceptance, production). The exact same mechanism is used for release to each environment. Every commit gets built, tested and released to production, right? Not so fast. We want to be quick with the release, true, but we cannot do it at the expense of security. For promoting, there are a few security principles coming into play, among them separation of duties and asset classification. I will start with the latter because it might not be that obvious. In my view, our test environment (and our code in it) can be classified as a lower-value asset compared to the production environment. A vulnerability in the test environment is not as big a threat as a vulnerability in production. Don’t get mad at me yet, I will explain. As stated before, we want our automation pipeline to be fast, with as little intervention as possible, while staying secure. A commit cannot reach production just because it passed all tests; it is possible for backdoors or vulnerabilities to slip through. There might be processes necessary in your company before releasing to production. So it could be fine for a change to move to the test environment without a review of the changes, but there needs to be a code review and documented approval for going to acceptance. Promoting from acceptance to production cannot be done just by approval of developers; approval from product owners and managers, and change requests, must be part of it and automated. Thus the separation of duties. All this, of course, is observable in our pipeline and documented through the process.
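The gating per environment described above can be expressed as a simple policy check; the environment names and approval types here are illustrative, not a real pipeline API:

```python
# Promotion gates per environment: test needs no review, acceptance needs
# a code review, production additionally needs product-owner approval and
# a change request. Promotion is allowed only when every required approval
# for the target environment has been collected.

REQUIRED_APPROVALS = {
    "test":       set(),
    "acceptance": {"code_review"},
    "production": {"code_review", "po_approval", "change_request"},
}

def may_promote(target_env: str, approvals: set) -> bool:
    return REQUIRED_APPROVALS[target_env] <= approvals
```

Keeping the policy in one declarative table makes the separation of duties auditable: the pipeline logs which approvals were present at each promotion.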

So after putting all this effort into securing our pipeline, embedding security tests in it and using it to secure our release processes, what is next? We want proof! Any artefact running in production should be accompanied by supporting evidence for the security measures taken. This can be very helpful for audits. Thankfully, with our pipelines set up as described, that is an easy win, provided that the tasks in the pipeline have comprehensive logging associated with them.

We ensured that the pipeline is the only way to get code to production. The pipeline is observable and versioned. The code is tested for functionality, vulnerabilities, compliance, performance and failure, and reports of these tests are generated. The four-eyes principle is applied, with appropriate approvals, before promoting to different environments. All of this is documented, grouped, persisted and ready for audit, for each artefact currently or previously in production. Pipelines are neither the beginning nor the end of our security journey, but they can and should be a very important asset.


Monday, January 14, 2019

Devs will just dev! The Cloud Foundry promise

“Every company is a technology company,” said Peter Sondergaard, and evidence of this is all around us. But it was not always so easy to become a technology company; the entry barriers were high. Besides developing their business propositions, companies had to develop, maintain and operate the platform on top of which their businesses (i.e. applications) run. Until Cloud options and "X as a Service” models became available. It started mainly with Infrastructure as a Service (IaaS) offerings, but like everything, it keeps evolving.

The rise of DevOps culture, automated pipelines, container technologies and microservices all contributed to an improved situation. And all of these are still evolving and getting increasingly popular. But businesses still have to deal with things outside the development of their specific business propositions. There is still operational load to carry, and that load now seems to be moving into the hands of developers. Cloud Foundry helps eliminate this operational load, and the need to build platforms and utility components that have no relation to your business propositions. Cloud Foundry makes it possible to develop only what contributes to your bottom line, and it takes care of the rest. It allows developers to just develop!

Cloud Foundry (CF) is an open source platform for hosting cloud-native applications. When running on CF you only have to manage your applications and data, while CF takes care of the rest. At the same time it allows you to choose your underlying infrastructure, be it AWS, OpenStack or your own. CF does not limit the kinds of applications you can run on it. It supports many languages, such as Java, Go, NodeJS and Ruby, but most importantly, since it is open source, it allows the community to develop and provide what might be missing. The Cloud Foundry platform itself has many implementations, and certified providers include Pivotal, IBM, Atos and SAP. As a user, you gain from the abstraction: you interact with all these platforms in the same way, and if you know how to run your applications in one of them, you can do it in all of them. The above can be summarized as Cloud Foundry being a language-agnostic, multi-vendor, multi-cloud environment for running cloud-native applications.

Let’s see what you get from Cloud Foundry, before showing how it delivers:

  • Always on: it ensures that your system has the resources you specify, 24/7
  • Easy elasticity: scale up, down, in and out. Fast and possibly without downtime
  • Language agnostic: run your apps in the languages and frameworks that suit you
  • Out-of-the-box functionality: common needs of any application are provided ready to use, so you do not have to build them yourself
  • Distributed tracing: analyzing logs and tracing requests across distributed applications can be difficult, but CF helps with that too
  • Guards against cascading failure: the failure of one component will not bring down other parts of your system
  • No vendor lock-in: multi-vendor, multi-cloud, you can switch where you run your applications at any time, with no changes to your application code
  • Enhanced security: the ways CF makes your applications more secure are discussed later
  • You only need to care about your code! No need to manage the platform, just describe what you want and CF takes care of it for you

You can easily see how Cloud Foundry delivers the above by examining the platform itself. Each component has a specific purpose. 

Every Cloud Foundry provider has their own UI for interacting with an instance, but all CF implementations share the same API. The Cloud Foundry Command Line Interface (CLI) is a command-line tool that allows you to communicate with the CF instance in which you want to run your apps (the target). Underneath, the CLI makes a series of REST calls to your target. Through the CLI you can connect to the platform and instruct Cloud Foundry to deploy your application, scale it, bind it to services and more.

To show how easy it is to deploy and manage an application via the CLI and Cloud Foundry, let’s say that we want to deploy a Java app called myapp, to run on 2 instances, with 1G memory per node. This is simply done by:

cf push myapp -b java_buildpack -i 2 -m 1G -p ./myapp.jar --random-route

That’s it! Our application will run with the specifications instructed, will be accessible on a random auto-generated endpoint, and CF will make sure that it stays in this desired state at all times.

The Cloud Controller is your interface with Cloud Foundry. It provides the API that listens to the requests made by the CLI and kick-starts the processes for deploying, running, scaling, monitoring and managing your applications. The Cloud Controller is in charge of keeping track of the desired state of your applications, meaning what was specified for each app when it was “pushed”. It keeps this metadata in the Cloud Controller Database. It also stores the uploaded application (the jar in the above example) in the Blob Store.

The Router is the first component (after the load balancer) that receives incoming requests to a given CF instance, and it directs each request either to the Cloud Controller or to the application listening on the endpoint specified in the request. Applications come and go, their number of instances changes, and so do the URIs (called routes) on which they are “listening”. The Router takes care to direct each call to the appropriate receiver while all these events are taking place. It achieves that using a routing table which maps routes to applications; the routes are emitted via the NATS message bus. An application can be bound to zero or more routes, and a route can be mapped to zero or more applications. This gives us great flexibility, for example to easily make a new release with no downtime. Additionally, when handling requests, the Router adds headers that enable distributed tracing. This allows us to follow a logical request within the platform and all the internal services called in order to handle it.

Diego takes care of the lifecycle and health of your applications using a set of subcomponents, each with a specific purpose (see picture below). Its Garden component creates containers for the apps as instructed and keeps track of their actual state. Diego contains one or more virtual machines called Cells, and the containers for the applications run in them. Knowing the actual state of your applications and cooperating with the Cloud Controller (which knows the desired state), it ensures that the health of the applications is as expected. The Bulletin Board System (BBS) monitors the real-time state of your applications and periodically compares it with the desired state (received via nsync). If for some reason one of the instances of your application crashes, the Diego Brain discards it and fires up another container with your application automatically. The containers are made up of the operating system, information about the system environment and the application droplet (explained in the next section). When a new instance of an application is about to be created, an auction is held among the Cell Reps to decide which cell will host the container. Diego opts for deploying instances of the same application across different cells to increase resiliency and, if enabled, also across multiple availability zones. Always on, no need for late-night calls to Ops.

Cloud Foundry components.
Chances are that your applications need runtime dependencies to execute, such as a JRE. When staging your applications, these external dependencies are added to them so your application can work. They are provided by Buildpacks. There are buildpacks for different languages, and multiple buildpacks per language. When pushing an application you can specify the buildpack you want to use; otherwise Cloud Foundry will try to identify the appropriate buildpack. The application along with the buildpack generates a droplet, a binary executable that runs inside a container in a Diego cell. The droplet is also stored in the Blob Store.

Applications generate logs, and so do the components of Cloud Foundry (e.g., Cloud Controller, Diego, Router). A lot of metrics are generated throughout an application's lifecycle, and they are crucial for operations. The Loggregator, as the name suggests, is in charge of aggregating all the logs and metrics from your applications (as long as you direct them to standard out and standard error). The logs are made available in real time via the CLI (temporarily), but they can also be streamed to external tools that persist, index and search them, such as Splunk. It is important to note that logs can be lost in the Loggregator, and this is by design: logs can be dropped if the Loggregator is in danger of becoming a bottleneck.

Most applications have common needs, such as persisting data. Such needs can be covered in CF with the use of services. Services can be databases, message buses or specific applications you built to do something needed in your system only. Services should be considered resources which can be bound to applications. Cloud Foundry maintains a list of readily available services, which can be found in its marketplace; these are called managed services. When you want to make a service available in the marketplace, you need to implement the Service Broker API; all services are accessible through this standardized interface. When an application requests to be bound to a service, the service broker is responsible for creating the service instance and passing the credentials to the application as environment variables.
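Concretely, an application reads those bound-service credentials from the VCAP_SERVICES environment variable, which holds a JSON document keyed by service label. A minimal sketch (the "my-db" label is made up for illustration):

```python
import json
import os

# Cloud Foundry injects bound-service credentials into the VCAP_SERVICES
# environment variable as a JSON document: each service label maps to a
# list of bound instances, each carrying a "credentials" object.

def credentials_for(label: str):
    vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    instances = vcap.get(label, [])
    return instances[0]["credentials"] if instances else None
```

Because the credentials arrive through the environment, the same droplet runs unchanged against different service instances in different spaces.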

A special kind of service is the so-called route service. These are services bound to routes rather than applications. Their main usage is for cross-cutting concerns, for example checking certain headers and performing some validations before a call reaches the application. We could create an application that does that, create a route service out of it, and attach it to the routes of applications we select. Then, when a request is received by the Router, if the route is bound to a route service, the request first goes through the route service, then back to the Router and eventually to the app.

One can guess that there is a lot going on in terms of security. A detailed description is out of scope, but we should briefly mention the highlights. Cloud Foundry is a platform used by many tenants, so there is a need for authentication and authorization of users. This way, there can be segregation between orgs (a logical separation of tenants in a CF instance) and spaces (a logical division of parts of orgs), and users can only perform actions permitted by their assigned roles. Cloud Foundry's User Account and Authentication (UAA) takes care of that. It is an OAuth2 provider for the platform, but UAA can also be used as a service by your applications. Besides that, other features enhance your security when running in CF. Your applications sit behind the load balancer and the Router, which decreases the network surface exposed to potential attacks. Also, the ease of “throwing away” and replacing the containers in which your apps run makes applying security patches simple and fast. Containers are isolated from each other and all traffic to your application is encrypted. Standardized buildpacks can put you at ease about what your application is running on.

It should be clear that Cloud Foundry is a strong enabler for developers. It takes care of operational load and makes sure that our systems just work. It allows developers to focus on what matters most: the applications, or better, the business proposition to customers.

Friday, July 6, 2018

The value of deliberate logging

Your logs tell a story - or at least they should. It is safe to assume that all software applications have some type of logging.

By logging, in this context, I mean messages generated in response to events occurring in an application, from its deployment until its undeployment. These messages are usually transported to a different system for consumption. Their purpose is to inform about what is happening in the application; they are not part of the application’s functionality.

In many applications, logging is executed as an afterthought: something we do but do not think much about. Moreover, there are a lot of (valid) concerns regarding the dangers that can arise from excessive logging and the “clutter” it causes in the codebase. These will not be addressed in this article. I am interested in how to perform deliberate logging, because I believe that logging, when done properly, can bring our application to the next level. So, why do we log?

We log to communicate useful (and often actionable) information to an interested party. It can be to ourselves when trying to debug why the application does not behave as expected. Maybe we log users’ behavior in order to understand them better and help us build a better experience for them. Or we might log because it is mandatory due to regulations. 

I will attempt a categorization and will group logs under: 
  • User behavior: data concerning the journey of users while using our application. We use them so we can improve their experience
  • Debug: information which helps us in times of trouble, when the application does not behave as expected 
  • Performance: metrics that help us understand how our application components perform under varying load and reveal areas we need to improve 
  • Regulatory compliance: some applications are required to retain certain logs for auditing, assist in non-repudiation controls, etc. 
  • Security: logs that help us establish baselines, recognize attacks in real time and respond 
  • Business: data state and processes taking place due to users’ activity and use of the application. This type of data is the most immediate to our application’s purpose 
  • System: I greatly generalize here to include all logs related to things like the operating system, databases, application lifecycle, its maintainers, etc. These provide information about the environment our application runs in 
The categorization above aims to convey that not all logs are the same. The way we handle them should not be the same either. We could evaluate and characterize our logs based on the following properties: 
  • Criticality: How important is the message? It would not be a big deal to drop some logs related to user behavior but we cannot drop logs that are required for auditing 
  • Frequency & size: How often are messages generated? And of what size? Some events happen more often than others and the log payload can differ in size. We have to handle each in an appropriate way, for example to prevent bottlenecks 
A picture is worth a thousand words so evaluating the log categories based on the above properties (I am not being precise here) could look something like this:
I argue that the differences in their nature could justify each log category to be handled differently during the logging lifecycle.

The logging lifecycle is pretty straightforward: recorded logs are transported in order to be persisted, indexed and analyzed, and possibly (automated) reactions take place. The type of logs (criticality, frequency) will factor into our design of the transport phase, so that logging enhances our applications without hampering their main functionality. The way the logs are recorded will play a role in the effectiveness and efficiency of analysis and reactions. So how should we write logs? And what should be in them to provide context and tell a story?

The how is simple. We want logs to be easy to understand by both humans and machines, and we have a perfect format for that: JSON. All data should be entered as clear key-value pairs. Next, what should we log? As mentioned, we need to tell a story, give context and remove all the “fat” from it. In order to tell a story, every log should provide:
  • Who: service reference id, application name and version 
  • Does what: log/event type (e.g., security) and subtype (e.g., unauthorized request) 
  • On behalf of whom: some form of user or system identification 
  • For what reason: specific business functionality (request) the system works on 
  • When and for how long: timestamp 
  • From where: source id 
  • On what: target id 
  • As response to what: parent process or request 
This way, the event logged is described in context, with all the necessary information, in a format that can be easily analyzed by both humans and machines.
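A hypothetical entry following this structure could look like the fragment below (the field names and values are purely illustrative, not a prescribed schema):

```json
{
  "timestamp": "2020-12-11T09:32:17.482Z",
  "application": "payments-api",
  "version": "2.4.1",
  "type": "security",
  "subtype": "unauthorized-request",
  "user": "svc-account-billing",
  "request": "POST /v1/transfers",
  "source": "10.12.3.44",
  "target": "transfer-service",
  "parentRequestId": "c0a8e1b2-5f4d-4d0a-9c3e-8e1f2a3b4c5d",
  "durationMs": 12
}
```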

For the sake of completeness I must mention what not to log. For this, I simply refer to OWASP, as they do a much better job at it than I could [1]. So your logs must not contain: 
  • Application source code 
  • Session identification values (consider replacing with a hashed value if needed to track session specific events) 
  • Access tokens 
  • Sensitive personal data and some forms of personally identifiable information (PII) e.g. health, government identifiers, vulnerable people 
  • Authentication passwords 
  • Database connection strings 
  • Encryption keys and other master secrets 
  • Bank account or payment card holder data 
  • Data of a higher security classification than the logging system is allowed to store 
  • Commercially-sensitive information 
  • Information it is illegal to collect in the relevant jurisdictions 
  • Information a user has opted out of collection, or not consented to e.g. use of do not track, or where consent to collect has expired 
And one last thing on what not to log: record only information that is of interest to someone. Remove the fat. For example, is anyone in your team interested in which thread the process ran on? If yes, great - log it. If not, then it shouldn’t be logged.

It all sounds peachy, but in practice there are challenges. I believe that in most applications logging is performed suboptimally. Even if a company has clear guidelines, conventions and strategies on how to log, I would not expect them to be followed by everyone. Developers come and go. Logging is not “respected” much in general. So the reality is that each developer throws some logs here and there in their own idiom, at the log level they see fit. Depending on the skills of each developer, crucial information about an event may be missing while unnecessary information clutters the logs. All these heterogeneously phrased logs are recorded and persisted somewhere, and later great efforts take place to parse, index, analyze and make use of them. We can and should do better.

An interesting idea and a relatively low cost investment would be to create a small framework on top of your preferred logging library which provides a “logging facade” for the developers. All these items discussed in the article regarding what you should (or shouldn’t) log, the format and the different handling for different categories of logs would be abstracted and handled by the framework. An intuitive and easy to use interface would be available for the developers. Deliberate logging practices would be applied and your applications and operations would be better off. How to do that? It would definitely make a good topic for another article.
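To make the idea concrete, here is a minimal sketch of what such a facade could look like. The class and method names (`EventLog`, `security`, `with`, `toJson`) are illustrative inventions, not an existing library, and the JSON rendering is deliberately naive; a real implementation would delegate to the team's logging library and escape values properly.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a logging facade: callers describe the event in structured
// terms and the facade enforces the agreed JSON key-value format.
public final class EventLog {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    private EventLog(String type, String subtype) {
        fields.put("timestamp", Instant.now().toString());
        fields.put("type", type);
        fields.put("subtype", subtype);
    }

    // One factory per log category (security, business, system, ...).
    public static EventLog security(String subtype) {
        return new EventLog("security", subtype);
    }

    public EventLog with(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    // Naive JSON rendering, enough for the sketch (no value escaping).
    public String toJson() {
        return fields.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }
}
```

A developer would then write something like `EventLog.security("unauthorized-request").with("user", "svc-billing").toJson()` and never worry about field names, ordering or format.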

References and inspiration:
[1] OWASP Logging Cheat Sheet

Friday, December 15, 2017

How to deal with exceptions

I recently had a discussion with a friend, who is a relatively junior but very smart software developer. She asked me about exception handling. The questions were pointing towards a tips-and-tricks kind of path, and there is definitely a list of them. But I am a believer in context and in the motivation behind the way we write software, so I decided to write down my thoughts on exceptions from such a perspective.

Exceptions in programming (using Java as a stage for our story) are used to notify us that a problem occurred during the execution of our code. Exceptions are a special category of classes. What makes them special is that they extend the Exception class, which in turn extends the Throwable class. Being implementations of Throwable allows us to "throw" them when necessary. So, how can an exception happen? Instances of exception classes are thrown either by the JVM or from a section of code using the throw statement. That is the how, but why?

I am sure that most of us cringe when we see exceptions occur, but they are a tool to our benefit. Before the inception of exceptions, special values or error codes were returned to let us know that an operation did not succeed. Forgetting (or being unaware) to check for such error codes, could lead to unpredictable behavior in our applications. So yay for exceptions!

Two things come to mind as I write the above. Exceptions are a bad event, because when one is created we know a problem occurred. Exceptions are also a helpful construct, because they give us valuable information about what went wrong and allow us to behave in the proper way for each situation.

Trying to distil the essence of this design issue: a method/request is triggered to do something but it might fail; how do we best notify the caller that it failed? How do we communicate information about what happened? How do we help the client decide what to do next? The problem with using exceptions is that we “give up”, and not just that; we do it in an “explosive” way, and the clients/callers of our services have to handle the mess.

So my first advice when it comes to exceptions: since they are a bad event, try to avoid them. In the sections of software under your control, choose designs that make it difficult for errors to happen. You can use features of your language that support this behavior. Probably the most common exception in Java is the NullPointerException, and Optionals can help us avoid it. Let’s say we want to retrieve an employee with a specified id:

public Optional<Employee> tryGetEmployee(String employeeId) {
    return Optional.ofNullable(employeeService.getEmployee(employeeId));
}

So much better now. But besides the features of our language, we can design our code in a way that makes it difficult for errors to occur. If we consider a method, which can only receive positive integers as an input, we can set our code up, so that it is extremely unlikely for clients to mistakenly pass invalid input. First we create a PositiveInteger class:
public class PositiveInteger {
  private Integer integerValue;

  public PositiveInteger(Integer inputValue) {
     if(inputValue <= 0) {
        throw new IllegalArgumentException("PositiveInteger instances can only be created out of positive integers");
     }
     this.integerValue = inputValue;
  }

  public Integer getIntegerValue() {
     return integerValue;
  }
}

Then for a method that can only use positive integer as an input:
public void setNumberOfWinners(PositiveInteger numberOfWinners) { … }

These are of course simple examples, and I did argue that the heart of the issue is that occasionally problems do occur and then we have to inform clients about what happened. So let’s say we retrieve a list of employees from an external back-end system and things can go wrong. How do we handle this?

We can set our response object to GetEmployeesResponse, which would look something like this:
public class GetEmployeesResponse {
  private Ok ok;
  private Error error;

  class Ok {
    private List<Employee> employeeList;
  }

  class Error {
    private String errorMessage;
  }
}

But let’s be realists, you do not have control on every part of your codebase and you are not going to change everything either. Exceptions do and will happen, so let’s start with brief background information on them. 

As mentioned before, the Exception class extends the Throwable class, and all exceptions are subclasses of the Exception class. Exceptions can be categorized into checked and unchecked exceptions. That simply means that some exceptions, the checked ones, require us to specify at compile time how the application will behave in case the exception occurs. The unchecked exceptions do not mandate compile-time handling from us. To create such an exception you extend the RuntimeException class, which is a direct subclass of Exception. An old and common guideline when it comes to checked vs unchecked is that runtime exceptions signal situations which the application usually cannot anticipate or recover from, while checked exceptions represent situations that a well-written application should anticipate and recover from.

Well, I am an advocate of using only runtime exceptions. And if I use a library that has a method with a checked exception, I create a wrapper method that turns it into a runtime one. Why not checked exceptions then? Uncle Bob, in his “Clean Code” book, argues that they break the Open/Closed principle, since a change in a signature with a new throws declaration can ripple through many levels of the program calling the method.
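Such a wrapper is a few lines. A sketch (the `FileContents` class is illustrative) using the JDK's `Files.readString`, which throws the checked `IOException`:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class FileContents {
    // Wraps the checked IOException into a runtime exception, so callers
    // are not forced into try/catch blocks or throws declarations.
    public static String read(Path path) {
        try {
            return Files.readString(path);
        } catch (IOException ex) {
            throw new UncheckedIOException("Could not read " + path, ex);
        }
    }
}
```

Callers use `FileContents.read(path)` like any other method; if the file is unreadable they get an `UncheckedIOException` carrying the original cause.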

Now, checked or unchecked, since exceptions are a construct to give us insight into what went wrong, they should be as specific and as informative as possible. So try to use the standard exceptions; others will understand what happened more easily. When seeing a NullPointerException, the reason is clear to anyone. If you make your own exceptions, make them sensible and specific. For example, a ValidationException lets me know a certain validation failed, while an AgeValidationException points me to the specific validation that failed. Being specific allows us both to diagnose more easily what happened and to specify a different behavior based on the type of exception. That is the reason why you should always catch the most specific exception first! Here comes another common piece of advice: do not catch on “Exception”. It is valid advice, which I occasionally do not follow. At the boundaries of my API (say, the endpoints of my REST service) I always have generic catch-Exception clauses. I do not want any surprises, and something that I did not manage to predict or guard against in my code could potentially reveal things to the outside world. 
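A small sketch of both points together: specific catches first, and a generic catch only at the outermost boundary so internals never leak (the class and the string responses are illustrative, not a real framework):

```java
public final class BoundaryExample {
    // Illustrative endpoint-style method: the specific, expected failure
    // is caught first; the final catch-all guards the API boundary.
    public static String handleRequest(String input) {
        try {
            if (input == null) {
                throw new IllegalArgumentException("input is required");
            }
            return "processed:" + input.trim();
        } catch (IllegalArgumentException ex) {
            // Specific and expected: safe to report the reason to the caller.
            return "bad-request:" + ex.getMessage();
        } catch (Exception ex) {
            // Boundary catch-all: log internally, reveal nothing outside.
            return "internal-error";
        }
    }
}
```

Note the ordering is not just style: the Java compiler rejects a catch block that is unreachable because a broader type was caught before it.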

Be descriptive, but also provide exceptions according to the level of abstraction. Consider creating a hierarchy of exceptions that provide semantic information at different abstraction levels. If an exception is thrown from the lower levels of our program, such as a database-related exception, it does not have to surface its details to the caller of our API. Catch the exception and throw a more abstract one that simply informs callers that their attempted operation failed. This might seem to go against the common approach of “catch only when you can handle”, but it does not; in this case our “handling” is the throwing of a new exception. When doing so, make the whole history of the exception available from throw to throw, by passing the original exception to the constructor of the new exception.
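A sketch of translating a low-level failure into a domain-level exception while preserving the chain (the class names are illustrative, and `IllegalStateException` stands in for a real infrastructure exception):

```java
// Domain-level exception exposed to callers of the service layer.
class EmployeeLookupException extends RuntimeException {
    EmployeeLookupException(String message, Throwable cause) {
        super(message, cause);
    }
}

public final class EmployeeService {
    public static String findEmployee(String id) {
        try {
            return queryDatabase(id);
        } catch (IllegalStateException ex) {
            // Translate to the caller's abstraction level, but keep the
            // original as the cause so the full history survives in logs.
            throw new EmployeeLookupException("Could not load employee " + id, ex);
        }
    }

    // Stand-in for a lower layer failing with an infrastructure exception.
    private static String queryDatabase(String id) {
        throw new IllegalStateException("connection pool exhausted");
    }
}
```

The caller sees only "could not load employee", while whoever reads the stack trace still gets the exhausted connection pool as the root cause.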

The word “handle” has been used many times. What does it mean? An exception is considered handled when it gets “caught” in our familiar catch clause. When an exception is thrown, the search for a handler starts in the code where it happened; if none is found, it moves to the calling context of the enclosing method, and so on, until an exception handler is found or the program terminates. 

One idea I like, from Uncle Bob again, is that try-catch-finally blocks define a scope within the program. Besides the lexical scope, we should think of the conceptual scope and treat the try block as a transaction. What should we do if something goes wrong? How do we make sure we leave our program in a valid state? Do not ignore exceptions! I am guessing many hours of programmer unhappiness have been caused by silently swallowed exceptions. The catch and finally blocks are the place to do your cleaning up. Make sure you wait until you have all the information to handle the exception properly. This ties into the throw early, catch late principle: we throw early so that we do not perform operations we would have to revert because of the exception, and we catch late in order to have all the information to handle the exception correctly. And by the way, when you catch exceptions, only log when you resolve them; otherwise a single exception event will cause clutter in your logs. Finally, for exception handling I personally prefer to create an error-handling service that I can use in different parts of my code to take the appropriate actions regarding logging, rethrowing, cleaning up resources, etc. It centralizes my error-handling behavior, avoids code repetition and helps me keep a high-level perspective of how errors are handled in the application.
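Treating the try block as a transaction can be as simple as guaranteeing the rollback in a finally block. A minimal sketch (the "transfer" and the audit trail are illustrative stand-ins for real resources):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public final class TransferExample {
    // Illustrative "transaction": whatever happens inside the try block,
    // the finally block guarantees we leave the program in a valid state.
    public static boolean transfer(Deque<String> auditTrail, boolean failMidway) {
        boolean committed = false;
        auditTrail.push("start");
        try {
            if (failMidway) {
                throw new IllegalStateException("network error");
            }
            auditTrail.push("commit");
            committed = true;
            return true;
        } catch (IllegalStateException ex) {
            return false; // the failure is reported, never swallowed silently
        } finally {
            if (!committed) {
                auditTrail.push("rollback"); // cleanup always runs
            }
        }
    }
}
```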

So now that we have enough context, paradoxes, rules and their exceptions, we could summarise:
  • Try to avoid exceptions. Use the language features and proper design in order to achieve it
  • Use runtime exceptions, wrap methods with checked exceptions and turn them into runtime
  • Try to use standard exceptions
  • Make your exceptions specific and descriptive
  • Catch the most specific exception first
  • Do not catch on Exception
  • But catch on Exception on the boundaries of your api. Have complete control on what comes out to the world
  • Create a hierarchy of exceptions that matches the layers and functionalities of your application
  • Throw exceptions at the proper abstraction level. Catch an exception and throw a higher level one as you move from layer to layer
  • Pass the complete history of exceptions when rethrowing by providing the exception in the constructor of the new one
  • Think of the try-catch-finally block as a transaction. Make sure you leave your program in a valid state when something goes wrong
  • Catch exception when you can handle it
  • Never have empty catch clauses
  • Log an exception when you handle it
  • Have a global exception handling service and have a strategy on how you handle errors

That was it! Go on and be exceptional!

Sunday, November 5, 2017

In encryption we trust! A tutorial

Many people view encryption as a complicated subject, something difficult to understand. And certain aspects of its implementation can be, but everyone can understand how it works on a higher level.

This is what I want to do with this article. Explain in simple terms how it works and then play around with some code.

Yes, in encryption we trust. What do I mean by trust? We trust that our messages are read only by authorized parties (confidentiality), that they are not altered during transmission (integrity) and that they were indeed sent by those we believe sent them (authentication).

Wikipedia provides a good definition for encryption: “is the process of encoding a message or information in such a way that only authorized parties can access it”.

So encryption is turning our message, with the use of a key and a cipher, into an incomprehensible one (the ciphertext) which can only be turned back to the original by authorized parties.

There are two types of encryption schemes, symmetric and asymmetric key encryption.

In symmetric encryption the same key is used for encrypting and decrypting the message. Those whom we wish to access the message must have the key, but no one else; otherwise our messages are compromised.

Asymmetric key encryption is my interest here. Asymmetric key schemes, use two keys, a private and a public. These pairs of keys are special. They are special because they are generated using a category of algorithms called asymmetric algorithms. The actual algorithms are out of scope for this discussion, but later in the tutorial we will use RSA.

What you need to know now, is that these keys have the following properties. A message encrypted with the:
  1. public key can be decrypted only using the private key
  2. private key can be decrypted only using the public key

Seems simple enough, right? So how is it used in practice? Let’s consider two friends, Alice and Bob. They have their own pairs of public and private keys and they want privacy in their chats. Each of them openly provides their public key but takes good care to hide their private key.

When Alice wants to send a message only to be read from Bob, she uses Bob’s public key to encrypt the message. Then Bob and only him, can decrypt the message using his private key. That’s it.

That explains the use of the first property, but what about the second? There seems to be no reason to encrypt using our private key. Well, there is. How do we know that Alice was the one who sent the message? If we can decrypt the message using Alice’s public key, we can be sure that Alice’s private key was used for the encryption, so it was indeed sent by Alice. Simply put:

The public key is used so people can send things only to you and the private key is used to prove your identity.
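This is straightforward to demonstrate with the JDK alone. A minimal sketch (not part of the tutorial project) using java.security.Signature, which applies exactly this sign-with-private, verify-with-public idea:

```java
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

public final class SignatureDemo {
    // Bob signs with his PRIVATE key; the signature proves the origin.
    public static byte[] sign(byte[] message, PrivateKey key) throws Exception {
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(key);
        signer.update(message);
        return signer.sign();
    }

    // Anyone holding Bob's PUBLIC key can check the signature.
    public static boolean verify(byte[] message, byte[] signature, PublicKey key) throws Exception {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(key);
        verifier.update(message);
        return verifier.verify(signature);
    }
}
```

If the message is tampered with, or the signature was produced by a different private key, verification simply returns false.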

So we can have confidentiality using the public key and authenticity using the private. What about integrity? To achieve this, we use cryptographic hashing. A good cryptographic hash takes an input message and generates a message digest with the following properties:
  1. The message digest is easy to generate
  2. It is extremely difficult to calculate which input provided the hash
  3. It is extremely unlikely that two different inputs/messages would generate the same hash value

If we want to be sure that the message received was not compromised in transit, the hash value is sent along with the encrypted message. On the receiving end we hash the decrypted message with the same algorithm and compare, to make sure the hashes are an exact match. If they are, we can be confident that the message was not altered.
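The integrity check is only a few lines with the JDK's MessageDigest. A sketch using SHA-256 (the tutorial project further down uses SHA3-512 via Bouncy Castle instead; the class name here is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class IntegrityCheck {
    // The sender hashes the message and ships the digest alongside it.
    public static String digest(String message) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] hash = sha256.digest(message.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // The receiver re-hashes what arrived and compares with the digest sent.
    public static boolean isIntact(String receivedMessage, String sentDigest) throws Exception {
        return digest(receivedMessage).equals(sentDigest);
    }
}
```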

These hashes, or message digests, have other uses as well. You see, sometimes Bob makes promises and then denies he ever did. We want to keep him in check. In fancy terms this is called non-repudiation, and it prevents parties from being able to deny sending a message. A well-known application of this is digital signatures.

Before we move and have some fun with code, let me mention a couple more things.

  1. Asymmetric key algorithms actually consist of two algorithms for different functionalities. One is, of course, for key generation, and the other is for function evaluation. Function evaluation means taking an input (i.e., the message) and a key, and producing an encrypted or decrypted message, depending on the input it got. So function evaluation is how messages are encrypted and decrypted using the public/private keys.
  2. Maybe you have already wondered: how do we know that a public key is actually related to Bob or Alice? What if it is someone pretending to be them? There is a standard that can help us with that: X.509, which defines the format for public key certificates. These certificates are provided by Certification Authorities and usually contain:
    1. Subject, detailed description of the party (e.g. Alice)
    2. Validity range, for how long the certificate is valid
    3. Public key, which help us send encrypted messages to the party
    4. Certificate authority, the issuer of the certificate
  3. Hashing and encrypting are different things. An encrypted message is intended to eventually be turned back into the original message; it should not be possible to turn a hashed message back into the original.

Now let’s use a tutorial to help all this sink in. We will allow three individuals, Alice, Bob and Paul, to communicate with Confidentiality, Integrity and Authentication (referred to from here on as CIA). The complete code is available on GitHub.
The project has a couple of dependencies: Lombok, Apache Commons Codec, Bouncy Castle and JUnit. You can find the full pom.xml in the project on GitHub.
We will start with the EncryptedMessage class, which will carry all the information we need to ensure CIA. The message will contain the actual encrypted payload for confidentiality, a hash of the message to ensure integrity, and identification of the sender, raw and encrypted, for authentication. We also provide a method to compromise the message payload, so we can test the validation against the digest (more on that later).
package com.tasosmartidis.tutorial.encryption.domain;

import lombok.AllArgsConstructor;
import lombok.EqualsAndHashCode;
import lombok.Getter;

@AllArgsConstructor
@EqualsAndHashCode
@Getter
public class EncryptedMessage {
    private String encryptedMessagePayload;
    private String senderId;
    private String encryptedSenderId;
    private String messageDigest;

    public void compromiseEncryptedMessagePayload(String message) {
        this.encryptedMessagePayload = message;
    }

    @Override
    public String toString() {
        return encryptedMessagePayload;
    }
}

Now let’s get to the encryption part. We will create a base encryptor class independent of the actual asymmetric algorithm and key length. It will create keys and cipher, have methods for encrypting and decrypting text as well as providing access to the keys. It looks something like this:
package com.tasosmartidis.tutorial.encryption.encryptor;

import com.tasosmartidis.tutorial.encryption.domain.EncryptorProperties;
import com.tasosmartidis.tutorial.encryption.exception.DecryptionException;
import com.tasosmartidis.tutorial.encryption.exception.EncryptionException;
import com.tasosmartidis.tutorial.encryption.exception.EncryptorInitializationException;
import com.tasosmartidis.tutorial.encryption.exception.UnauthorizedForDecryptionException;
import org.apache.commons.codec.binary.Base64;

import javax.crypto.BadPaddingException;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.NoSuchPaddingException;
import java.nio.charset.StandardCharsets;
import java.security.InvalidKeyException;
import java.security.Key;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.NoSuchAlgorithmException;
import java.security.PrivateKey;
import java.security.PublicKey;

public class BaseAsymmetricEncryptor {
    private final KeyPairGenerator keyPairGenerator;
    private final KeyPair keyPair;
    private final Cipher cipher;
    private final EncryptorProperties encryptorProperties;

    protected BaseAsymmetricEncryptor(EncryptorProperties encryptorProperties) {
        this.encryptorProperties = encryptorProperties;
        this.keyPairGenerator = generateKeyPair();
        this.keyPair = this.keyPairGenerator.generateKeyPair();
        this.cipher = createCipher(encryptorProperties);
    }

    protected PrivateKey getPrivateKey() {
        return this.keyPair.getPrivate();
    }

    public PublicKey getPublicKey() {
        return this.keyPair.getPublic();
    }

    protected String encryptText(String textToEncrypt, Key key) {
        try {
            this.cipher.init(Cipher.ENCRYPT_MODE, key);
            return Base64.encodeBase64String(cipher.doFinal(textToEncrypt.getBytes(StandardCharsets.UTF_8)));
        } catch (InvalidKeyException | BadPaddingException | IllegalBlockSizeException ex) {
            throw new EncryptionException("Encryption of message failed", ex);
        }
    }

    protected String decryptText(String textToDecrypt, Key key) {
        try {
            this.cipher.init(Cipher.DECRYPT_MODE, key);
            return new String(cipher.doFinal(Base64.decodeBase64(textToDecrypt)), StandardCharsets.UTF_8);
        } catch (InvalidKeyException | BadPaddingException ex) {
            throw new UnauthorizedForDecryptionException("Not authorized to decrypt message", ex);
        } catch (IllegalBlockSizeException ex) {
            throw new DecryptionException("Decryption of message failed", ex);
        }
    }

    private Cipher createCipher(EncryptorProperties encryptorProperties) {
        try {
            return Cipher.getInstance(encryptorProperties.getAsymmetricAlgorithm());
        } catch (NoSuchAlgorithmException | NoSuchPaddingException ex) {
            throw new EncryptorInitializationException("Creation of cipher failed", ex);
        }
    }

    private KeyPairGenerator generateKeyPair() {
        try {
            return KeyPairGenerator.getInstance(this.encryptorProperties.getAsymmetricAlgorithm());
        } catch (NoSuchAlgorithmException ex) {
            throw new EncryptorInitializationException("Creation of encryption keypair failed", ex);
        }
    }
}


There are a lot of exceptions we need to handle to implement our functionality, but since we are not going to do anything with them when they happen, we wrap them in semantically meaningful runtime exceptions. I am not going to show the exception classes here, since they simply have a constructor; you can check them out in the project on GitHub, under the com.tasosmartidis.tutorial.encryption.exception package.
You will see their actual use in different parts of the code. The constructor of the BaseAsymmetricEncryptor takes an EncryptorProperties instance as an argument.
package com.tasosmartidis.tutorial.encryption.domain;

import lombok.AllArgsConstructor;

@AllArgsConstructor
public class EncryptorProperties {
    private final AsymmetricAlgorithm asymmetricAlgorithm;
    private final int keyLength;

    public String getAsymmetricAlgorithm() {
        return asymmetricAlgorithm.toString();
    }

    public int getKeyLength() {
        return keyLength;
    }
}

We will create an RSA based encryptor implementation. The code should speak for itself:
package com.tasosmartidis.tutorial.encryption.encryptor;

import com.tasosmartidis.tutorial.encryption.domain.AsymmetricAlgorithm;
import com.tasosmartidis.tutorial.encryption.domain.EncryptedMessage;
import com.tasosmartidis.tutorial.encryption.domain.EncryptorProperties;
import org.bouncycastle.jcajce.provider.digest.SHA3;
import org.bouncycastle.util.encoders.Hex;

import java.security.PublicKey;

public class RsaEncryptor extends BaseAsymmetricEncryptor {
    private static final int KEY_LENGTH = 2048;

    public RsaEncryptor() {
        super(new EncryptorProperties(AsymmetricAlgorithm.RSA, KEY_LENGTH));
    }

    public String encryptMessageForPublicKeyOwner(String message, PublicKey key) {
        return super.encryptText(message, key);
    }

    public String encryptMessageWithPrivateKey(String message) {
        return super.encryptText(message, super.getPrivateKey());
    }

    public String decryptReceivedMessage(EncryptedMessage message) {
        return super.decryptText(message.getEncryptedMessagePayload(), super.getPrivateKey());
    }

    public String decryptMessageFromOwnerOfPublicKey(String message, PublicKey publicKey) {
        return super.decryptText(message, publicKey);
    }

    public String hashMessage(String message) {
        SHA3.DigestSHA3 digestSHA3 = new SHA3.Digest512();
        byte[] messageDigest = digestSHA3.digest(message.getBytes());
        return Hex.toHexString(messageDigest);
    }
}

For our demo we will need actors, people who will exchange messages with each other. Each person will have a unique identity, a name and a list of trusted contacts they communicate with.
package com.tasosmartidis.tutorial.encryption.demo;

import com.tasosmartidis.tutorial.encryption.domain.EncryptedMessage;
import com.tasosmartidis.tutorial.encryption.message.RsaMessenger;
import lombok.EqualsAndHashCode;

import java.security.PublicKey;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

@EqualsAndHashCode
public class Person {
    private final String id;
    private final String name;
    private final Set<Person> trustedContacts;
    private final RsaMessenger rsaMessenger;

    public Person(String name) {
        this.id = UUID.randomUUID().toString();
        this.name = name;
        this.trustedContacts = new HashSet<>();
        this.rsaMessenger = new RsaMessenger(this.trustedContacts, this.id);
    }

    public PublicKey getPublicKey() {
        return this.rsaMessenger.getPublicKey();
    }

    public String getName() {
        return name;
    }

    public String getId() {
        return id;
    }

    public void addTrustedContact(Person newContact) {
        if(trustedContacts.contains(newContact)) {
            return;
        }
        trustedContacts.add(newContact);
    }

    public EncryptedMessage sendEncryptedMessageToPerson(String message, Person person) {
        return this.rsaMessenger.encryptMessageForPerson(message, person);
    }

    public void readEncryptedMessage(EncryptedMessage encryptedMessage) {
        this.rsaMessenger.readEncryptedMessage(encryptedMessage);
    }
}


Next, let’s create an RsaMessenger class which will allow people to send encrypted messages using the RsaEncryptor. When sending an encrypted message, we will provide all the necessary information to guarantee confidentiality, integrity and authentication. When reading, we will decrypt the message, try to verify that it was sent by a trusted contact, and ensure that the message has not been compromised or altered.
package com.tasosmartidis.tutorial.encryption.message;

import com.tasosmartidis.tutorial.encryption.demo.Person;
import com.tasosmartidis.tutorial.encryption.domain.EncryptedMessage;
import com.tasosmartidis.tutorial.encryption.encryptor.RsaEncryptor;
import com.tasosmartidis.tutorial.encryption.exception.PayloadAndDigestMismatchException;

import java.security.PublicKey;
import java.util.Optional;
import java.util.Set;

public class RsaMessenger {

    private final RsaEncryptor encryptionHandler;
    private final Set<Person> trustedContacts;
    private final String personId;

    public RsaMessenger(Set<Person> trustedContacts, String personId) {
        this.encryptionHandler = new RsaEncryptor();
        this.trustedContacts = trustedContacts;
        this.personId = personId;
    }

    public PublicKey getPublicKey() {
        return this.encryptionHandler.getPublicKey();
    }

    public EncryptedMessage encryptMessageForPerson(String message, Person person) {
        String encryptedMessage = this.encryptionHandler.encryptMessageForPublicKeyOwner(message, person.getPublicKey());
        String myEncryptedId = this.encryptionHandler.encryptMessageWithPrivateKey(this.personId);
        String hashedMessage = this.encryptionHandler.hashMessage(message);
        return new EncryptedMessage(encryptedMessage, this.personId, myEncryptedId, hashedMessage);
    }

    public void readEncryptedMessage(EncryptedMessage message) {
        String decryptedMessage = this.encryptionHandler.decryptReceivedMessage(message);
        Optional<Person> sender = tryIdentifyMessageSender(message.getSenderId());

        if(!decryptedMessageHashIsValid(decryptedMessage, message.getMessageDigest())) {
            throw new PayloadAndDigestMismatchException(
                    "Message digest sent does not match the one generated from the received message");
        }

        if(sender.isPresent() && senderSignatureIsValid(sender.get(), message.getEncryptedSenderId())) {
            System.out.println(sender.get().getName() + " sent message: " + decryptedMessage);
        } else {
            System.out.println("Unknown source sent message: " + decryptedMessage);
        }
    }

    private boolean senderSignatureIsValid(Person sender, String encryptedSenderId) {
        return rawSenderIdMatchesDecryptedSenderId(sender, encryptedSenderId);
    }

    private boolean rawSenderIdMatchesDecryptedSenderId(Person sender, String encryptedSenderId) {
        return sender.getId().equals(
                this.encryptionHandler.decryptMessageFromOwnerOfPublicKey(encryptedSenderId, sender.getPublicKey()));
    }

    private Optional<Person> tryIdentifyMessageSender(String id) {
        return this.trustedContacts.stream()
                .filter(contact -> contact.getId().equals(id))
                .findFirst();
    }

    private boolean decryptedMessageHashIsValid(String decryptedMessage, String hashedMessage) {
        String decryptedMessageHashed = this.encryptionHandler.hashMessage(decryptedMessage);
        return decryptedMessageHashed.equals(hashedMessage);
    }
}

Alright! It’s demo time!

We will create some tests to make sure everything works as expected. The scenarios we want to test are:
  1. When Alice (a trusted contact of Bob) sends an encrypted message to him, Bob can decrypt it, know it is from Alice, and verify that the payload was not altered.
  2. The same message from Alice to Bob is not available for Paul to decrypt, and an UnauthorizedForDecryptionException will be thrown.
  3. When Paul (not known to Bob) sends an encrypted message, Bob will be able to read it but not know who sent it.
  4. Finally, when we compromise the payload of the encrypted message, the validation against its message digest will recognise it and throw an exception.
package com.tasosmartidis.tutorial.encryption;

import com.tasosmartidis.tutorial.encryption.demo.Person;
import com.tasosmartidis.tutorial.encryption.domain.EncryptedMessage;
import com.tasosmartidis.tutorial.encryption.exception.PayloadAndDigestMismatchException;
import com.tasosmartidis.tutorial.encryption.exception.UnauthorizedForDecryptionException;
import org.junit.Before;
import org.junit.Test;

public class DemoTest {

    private static final String ALICE_MESSAGE_TO_BOB = "Hello Bob";
    private static final String PAULS_MESSAGE_TO_BOB = "Hey there Bob";
    private final Person bob = new Person("Bob");
    private final Person alice = new Person("Alice");
    private final Person paul = new Person("Paul");
    private EncryptedMessage alicesEncryptedMessageToBob;
    private EncryptedMessage paulsEncryptedMessageToBob;

    @Before
    public void setup() {
        bob.addTrustedContact(alice); // Bob knows Alice, but not Paul
        alicesEncryptedMessageToBob = alice.sendEncryptedMessageToPerson(ALICE_MESSAGE_TO_BOB, bob);
        paulsEncryptedMessageToBob = paul.sendEncryptedMessageToPerson(PAULS_MESSAGE_TO_BOB, bob);
    }

    @Test
    public void testBobCanReadAlicesMessage() {
        bob.readEncryptedMessage(alicesEncryptedMessageToBob);
    }

    @Test(expected = UnauthorizedForDecryptionException.class)
    public void testPaulCannotReadAlicesMessageToBob() {
        paul.readEncryptedMessage(alicesEncryptedMessageToBob);
    }

    @Test
    public void testBobCanReadPaulsMessage() {
        bob.readEncryptedMessage(paulsEncryptedMessageToBob);
    }

    @Test(expected = PayloadAndDigestMismatchException.class)
    public void testChangedMessageIdentifiedAndRejected() {
        EncryptedMessage slightlyDifferentMessage = alice.sendEncryptedMessageToPerson(ALICE_MESSAGE_TO_BOB + " ", bob);
        alicesEncryptedMessageToBob.compromiseEncryptedMessagePayload(slightlyDifferentMessage.toString());
        bob.readEncryptedMessage(alicesEncryptedMessageToBob);
    }
}


Running the tests shows Alice’s message attributed to her, Paul’s message read from an unknown source, and the two negative scenarios throwing their expected exceptions.

That was it! Thanks for reading, and again, you can find the code on GitHub.