Tuesday, April 16, 2019

Data design for web applications

There are many choices of database technologies to base your application on; so how do you make sense of the options? Are you making a single user application or a service that can have thousands of users?
For web applications that have many users, the logic in the application builds more than enough complexity; so its important to simplify what tools you use around the app. To this end it is good practice to make safe and sometimes boring architectural decisions when it comes to your database.

In this post we:

  1. Understand what types of data are used in an application.
  2. What database and caching technologies are best fits for the different types of data.
  3. Go through some guidelines to create a solid datamodel foundation for an application that can be scaled for more users later.

Data

Firstly, you need to understand all data isn't the same, and needs to be stored according to how it is used.What sort of data are you saving? How long does it have to live, how fresh does it have to be? Let this dictate your strategy for what system you are using. 
Web applications can sometime read a little bit of data from many places, while more analytical data app can have queries that can take a long time. We get into analyzing queries for performance in a different post since its important to understand how that works no matter the type of application the data is used for.
Data analysis applications have a large amounts of data, but less concurrent users than web applications. The ingestion of new data into a data warehouse is handled by pipelines that are somewhat deterministic in nature; you control when and how it happens. With a widely available web application the data is updated whenever the users are using the app, and that doesn't follow a consistent timeline.

Its important to know how 'hot' the data is; is it being updated constantly in small amounts, or is it updated at once with a lot of data to be analyzed later?

So, classify your data along the lines of:
  • Data is short living and is used often.
  • ...variations...
  • Data is long living and used rarely.
There is a big range in-between so its important to understand what data you are dealing with.
In the context of this post we are talking about using data for web applications that have many users.

Temporal/Dynamic Data

Data like session tokens are considered temporal or 'ephemeral'. This type of data has a short lifetime, and needs to be accessed quickly and frequently. Session tokens, permissions, feature flags are examples of this. They are created and used while the app is running and disposed of after the app has stopped or a users request is finished.

Pattern: Use a dedicated caching app (redis, memcached, etc) to store this type of data. Memcached and redis are great options. I have used/abused redis for many use cases and its been fast and correct no matter what the job I gave it.


For web applications, there will be assets used by the pages, like style sheets, javascript and images. For these types of artifacts they can be loaded by the webserver each time, but they don't change between requests,

Pattern: for static assets, push the persistence of them towards the user. Its good practice to cache these types of artifacts in front of the application with a reverse proxy or push them out to a CDN. This is easier than it sounds and has a tremendous effect on the user experience.

For reverse proxies, try using nginx or varnish as a quick and easy way to cache heavy artifacts up front; so all your application cpu cycles are solely spent on application requests and not serving the same file time and again.

Persistent data

When you model the application you are building, you will end up with some first class objects that need to have their data representation stored for long periods of time. Like a user object. Instead of the user creating their account each time they want to use your site, they create it once and access it many times afterwards.
Some types of data don't belong in the database at all. Logging is a great example, this is data the application produces and may seem like a good candidate to keep in the database to query for usage; but this is much better handled by logging systems. You are probably looking for a report that can take a long time to complete, and that would compete for resource time that is better left to service user requests.

What kind of database?

Like programming languages, its the understanding of the context of the problem that gives you the most insight into what to use. Options are many, saving files on disk, using various relational dbs, using new document oriented dbs.

Why not save to files?

Maybe you have made applications on your desktop and have saved data to a file. Why wont that work on a web application? Its because with web applications you have many users using your application at once, instead of just one user at a time. To support many users at once, web applications are multi-threaded; so you have many threads trying to access, update and delete items from that single file. This is called 'resource contention' and you would have to write the code around the file operations to be thread-safe.
Exercise: Try this out with a unit test that starts many thread that operate on the same file and you will see why quite quickly

Why not use the session store?

When building a web application you always have access to the session store, and it may be tempting to store data in the session and then resolve the final results in the database when the users request is finished. This is very problematic as the data is changing with many users making requests at the same time, so the validity of the data gets lots if the transactions are done whenever the request or long-living session is used. Its just more to keep in sync and not worth the extra complexity.

What does all this look like?

For a basic web app, the application uses its own memory and a database.



Pattern: Separate the storing of data in more appropriate technologies




Databases...

There are different types of database technologies. This post largely covered relational databases, but there are different types of databases you can use, depending on your needs. Document dbs like mongo, and couchbase have great feature sets, but can be tricky to figure out how to use properly. Do you  have the time and budget needed figure them out? If you are building an app that does the standard CRUD operations, its probably a good idea to pick a tried and tested RDBMS and then let the usage of the application help you make decisions about what to do next. Different tools are just better for different tasks. Aerospike looks awesome in this space, with more coming up all the time.


All the multi-threaded access and ability to model your entities as data objects are in modern databases; so use them!


For Relational Data I have used Postgres with the most success, but MySql will give you good results as well. Oracle apparently has a database; but its rarely worth the money, or time lost talking to their salespeople.


NoSql

Lately there has been a trend to use non-relational databases, but for many cases this is a shiny toy; you can run relational databases for many thousands of users, and the technology is very stable. New databases like mongo or couch are document oriented, and may seem like a good idea, but keep in mind the added complexity you are taking on.

Cloud

Internet companies like google (spanner) and amazon (dynamodb) have created db systems for their own operations. Its fun to think of your project needing this technology; but let the usage of your application dictate when to move to these types of solutions.

Keep in mind you can run mysql or postgres in the cloud, with AWS, Azure or GCP, and they allow you to provision machines and monitor how the db uses the machine and responds to requests.
Tip: If running on AWS and you are using mysql, switch to Aurora, its a drop in replacement and will save you when running the instance hours required to service your application.

One database technology I have found to be very useful for basic webapps is the cloud datastore in Google App Engine. This scales automatically on usage and is very cheap to run in the small.

Design Steps for a datamodel

There are lots of possibilities and different solutions are useful for different problems. Here we are talking about application SQL databases that support many users and the data about them and all the data being used by the application itself to satisfy the intended use case.

SQL

Everything in stored is a table with rows and columns. Use SQL to query these tables.The relational model is a flat, 2 dimensional view of data; everything is in a table made up of rows and columns. You can do many things to represent the data to the outside, but it still resides in tables.
Learn SQL so you can look into your database without the application running with the db admin console.

Pattern: Basic steps for creating a datamodel
  1. Identify the entities and attributes
    1. usually the nouns in the spec about the data
  2. Determine the attributes types
    1. the types of the fields for the data
  3. Identify the relationships between the entities
    1. Patterns of common usage appear here
  4. Add system fields 
    1. id primary key, timestamps.
  5. Normalize enough
    1. but not too much
  6. Index fields that are searched by the application queries
    1. measure usage to see what to index

Entities

You can think of the database for the application as the foundation of a house. If its built to last, your house can be large. If its brittle, the house will fall down.
Take time when discovering relations between entities and use Normalization techniques to create a solid foundation for your application.

Saving Objects
Depending on your database and language of choice, there are tools available to map the objects you have in your app to the actual data persistence mechanism. Transactions are atomic, so make sure to prepare everything that is needed before executing the transaction. Have the steps repeatable and reversible so the transaction can be 'rolled back'.

You could be tempted to construct SQL strings in your application logic, but this approach becomes complex very quickly. Look for Object Relational Mapping tools and use them instead of rolling your own.

Identify and create relationships 
Try to flesh this out with diagrams and models before throwing foreign field keys in the tables. Many-to-many relationships need to be broken out into another table of only primary keys.
Understand referential integrity will save you in your application not having to validate these and adding complexity that can be dealt with in the database.

Normalize the entities - just keeping redundant data in it's own spot. Watch for doing too much here as performance can suffer with the resulting joins needed in the queries. Normalizing the data makes sure there is only one instance of that data, this can then be represented in multiple views.

De-normalize for performance 
3rd order normalization is probably enough. Too much normalization will put complexity in on the queries to join results and could cause performance problems at scale; but that scale needs to be very large for this to be a problem.

Pattern: Data needs to be indexed properly for efficient queries. It's like books in a library, the new books aren't just lumped into a pile; they are labels and placed in the proper location.
When indexing a database, take care to understand what parts need indexing and why. Indexing is necessary, as any lookups would be slow otherwise.

Migrations

When the application gets developed over time, the database will change. Columns and indexes will be added, and some types changed. How to we change the structure of the database as we go along? This is commonly referred to as Migrations. Migrations for relational databases are sql statements that alter a column or a table for changes in the application. Be very pragmatic and disciplined about migrations.
Pattern: Version the migration the same as the software version, then its easy at a glance to see what migrations are needed for any given release.
Many platforms have built-in migration features. Django for python has a good way of keeping this organized. JOOQ for java is very slick in keeping migrations aligned.

Wrap it up!

Designing your database will take your application far if you take some time to get it correct. You will learn a lot about the application and how accurate the requirements are by taking some time on the design of your datamodel.
Understanding the relationships between entities, what entities are used more often, and how much 'work' the database can do for you instead of writing that logic in your application code.

Use the tools for the job. storing data in its appropriate place will reduce complexity in the application and allow it to grow.




No comments:

Post a Comment