SQL vs NoSQL
- Queries
- Aggregation
- CloudSQL
- Average age of people in each city
- Average city age on a map (joins)
- DataStore
- Average age of people in each city (MapReduce)
- Average city age on a map
- CloudSQL
- Transactions
- Consistency
- Scalability
- Management
- Schema
- Scoreboard
- References
- Tags
- TODO
- Outstanding questions
Queries
CloudSQL
Queries are important because they are the way that you access your data. If you don’t have a powerful query language, you can’t get the data that you want and you can’t get it quickly.
DataStore
The DataStore actually supports a wide variety of queries that cover most use cases. But if you want to query anything and everything, or aggregate across all of your data, you really have to use CloudSQL.
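For example, a minimal sketch using the Python db API that appears later in these notes (the Person kind and its properties are made up for illustration):

from google.appengine.ext import db

class Person(db.Model):
    name = db.StringProperty()
    age = db.IntegerProperty()
    city_id = db.IntegerProperty()

# Property filters, ordering, and ancestor queries are supported...
people_in_city = Person.all().filter('city_id =', 3).order('age')

# ...but there is no AVG()/GROUP BY: you fetch entities and aggregate in your
# own code, or use MapReduce as described in the Aggregation section below.
ages = [p.age for p in people_in_city]
average_age = sum(ages) / float(len(ages)) if ages else None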
Aggregation
CloudSQL
Let's say that you want to compute the average age of people living in each city.
SELECT people.city_id, AVG(people.age)
FROM people
GROUP BY people.city_id;
Let's say that you want to put this average city age information on a map. Well, you need joins to do that. You probably have a table for people and a table for cities, and the cities table contains the latitude and longitude.
SELECT AVG(people.age), cities.name, cities.latitude, cities.longitude
FROM people, cities
WHERE people.city_id = cities.city_id
GROUP BY people.city_id;
DataStore
Let's say that you want to compute the average age of people living in each city.
MapReduce
There is a powerful framework called MapReduce.
Computing this requires mapping over all of your data, and MapReduce is a very powerful framework for doing that because it computes the result in parallel. It follows a basic scatter-gather algorithm. Take this sample data:
city=3,age=5
city=1,age=2
city=3,age=7
city=4,age=9
city=4,age=9
city=1,age=3
city=4,age=8
city=2,age=3
We map each record to a key-value pair; in this case, the key is the cityID and the value is the age.
3,5
1,2
3,7
4,9
4,9
1,3
4,8
2,3
Then, we shuffle to group by cityID
1 - 2,3
2 - 3
3 - 5,7
4 - 9,9,8
And then reduce
1 - 5/2=2.5
2 - 3/1=3
3 - 12/2=6
4 - 26/3=8.66
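A minimal sketch of the same scatter-gather computation in plain Python (the App Engine MapReduce library would wire the map and reduce steps into its own pipeline; this just mirrors the worked example above):

from collections import defaultdict

records = [(3, 5), (1, 2), (3, 7), (4, 9), (4, 9), (1, 3), (4, 8), (2, 3)]

# Map: emit (city_id, age) key-value pairs.
pairs = [(city_id, age) for city_id, age in records]

# Shuffle: group the ages by city_id.
grouped = defaultdict(list)
for city_id, age in pairs:
    grouped[city_id].append(age)

# Reduce: average the ages for each city.
averages = dict((city_id, sum(ages) / float(len(ages)))
                for city_id, ages in grouped.items())
# averages == {1: 2.5, 2: 3.0, 3: 6.0, 4: 8.666...}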
If you want to keep the result set up to date as the entities in your database change, you can do that using something called a Materialized View. You track changes to your entities, store them in a separate entity, and then asynchronously fan those changes in and apply them to your result set. This way, the results are always up to date.
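A rough sketch of that idea, assuming hypothetical CityStats (the materialized result) and AgeChange (the change log) kinds; a real implementation would run the fan-in asynchronously, for example from a task queue:

from google.appengine.ext import db

class CityStats(db.Model):
    # Materialized result, keyed by city_id: enough to recompute the average.
    total_age = db.IntegerProperty(default=0)
    person_count = db.IntegerProperty(default=0)

class AgeChange(db.Model):
    # Change log entry written whenever a person is added, removed, or updated.
    city_id = db.IntegerProperty()
    age_delta = db.IntegerProperty()
    count_delta = db.IntegerProperty()

def fan_in(city_id):
    # Apply the pending changes for one city to the materialized view.
    changes = AgeChange.all().filter('city_id =', city_id).fetch(100)
    stats = CityStats.get_or_insert(str(city_id))
    for change in changes:
        stats.total_age += change.age_delta
        stats.person_count += change.count_delta
    stats.put()
    db.delete(changes)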
Let's say that you want to put this average city age information on a map. Well, you need joins to do that. You probably have a table for people and a table for cities, and the cities table contains the latitude and longitude.
In DataStore, it is not as easy as it was in CloudSQL.
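There is no JOIN, so after the aggregation you look up the city details yourself, e.g. (a sketch assuming a City kind keyed by city_id, made up for illustration):

from google.appengine.ext import db

class City(db.Model):
    name = db.StringProperty()
    latitude = db.FloatProperty()
    longitude = db.FloatProperty()

def averages_on_map(averages):
    # 'averages' is the {city_id: average_age} result from the MapReduce above.
    cities = City.get_by_key_name([str(city_id) for city_id in averages])
    return [(c.name, c.latitude, c.longitude, averages[int(c.key().name())])
            for c in cities if c is not None]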
Transactions
Transactions are important because they ensure that you have atomically made changes to your database. You don’t want your machine to crash in the middle and partially apply some changes.
DataStore
DataStore does support transactions. You can do transactions on a single row, but it could be said that that is not a real transaction. DataStore supports transactions across rows using something called Entity Groups. These are groupings of entities under a single transaction log, and they are great at providing ACID semantics at scale: all of these entity groups can have transactions occurring simultaneously, and you can have any number of entity groups in your application.
                  ┌─────────┐
                  │   EG    │
┌───────────────┐ ├─────────┤
│ Entity Groups │ │ Entity  │
└───────────────┘ ├─────────┤
                  │ Entity  │
                  └─────────┘
- Groupings of entities under a single transaction log
- Many entity groups = scalable ACID semantics
For example, if you have a game, and you have a player entity, and then you have entities for items in that player’s inventory, as long as you structure it in such a way that the items in the player’s inventory are in the same entity group as the player, you can act upon these transactionally. This is very important, because you never want a player to use an item and have the item still be in their inventory afterwards or try to use an item and have the effect not work.
┌───────────────┐
│ Player │ id, Health, Gold
└───────────────┘
┌───────────────┐
│ Potion │ Health
└───────────────┘
┌───────────────┐
│ Sword │ Power, Durability
└───────────────┘
Multi-row transactions
For example, if a player wants to drink a potion, we have the player as the root entity and the potion as a child entity. They are in the same entity group, so we can easily act upon them transactionally. It is as easy as decorating the function/method with db.transactional, which makes everything in that function/method happen atomically. You get the player, you get the potion from his inventory, you transfer the health of the potion to the player, you remove the potion from the player's inventory, and you put the player back. It all happens atomically.
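A minimal sketch of that in the Python db API (property names follow the boxes above; creating the potion with parent=player is what puts it in the player's entity group):

from google.appengine.ext import db

class Player(db.Model):
    health = db.IntegerProperty()
    gold = db.IntegerProperty()

class Potion(db.Model):
    health = db.IntegerProperty()

@db.transactional
def drink_potion(player_key, potion_key):
    # Both entities are in the same entity group, so everything in this
    # function happens atomically.
    player = db.get(player_key)
    potion = db.get(potion_key)
    player.health += potion.health   # transfer the potion's health to the player
    db.delete(potion)                # remove the potion from the inventory
    player.put()                     # put the player back

For the potion to be a child entity, it is created with the player as its parent, e.g. Potion(parent=player, health=20).put().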
What happens when you want to move a potion atomically from one player to another?
DataStore also supports cross-entity group transactions. So, if you have two players, and one player wants to sell a potion to the other, you can do so simply by setting the XG flag to true. Now, in the same scenario, you load the buyer, you load the seller, you load the potion from the seller's inventory, you transfer the money from the buyer to the seller, you store the potion in the buyer's inventory, you remove the potion from the seller's inventory, and then you save both the buyer and the seller. It all happens atomically.
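Roughly, continuing the same sketch (xg=True is what enables the cross-entity-group transaction; note that an entity cannot change parents, so "moving" the potion means copying it into the buyer's group and deleting the original):

@db.transactional(xg=True)
def sell_potion(buyer_key, seller_key, price=25):
    buyer = db.get(buyer_key)
    seller = db.get(seller_key)
    # Find a potion in the seller's entity group.
    potion = Potion.all().ancestor(seller_key).get()
    buyer.gold -= price
    seller.gold += price
    Potion(parent=buyer_key, health=potion.health).put()  # store it in the buyer's inventory
    db.delete(potion)                                     # remove it from the seller's
    db.put([buyer, seller])                               # save both, atomically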
Are there any limitations to cross entity group transactions?
Yes. We have to do something called two-phase commit to make sure that we commit to all the transaction logs atomically, and that doesn't scale very well with the number of transaction logs involved. So currently there is a limit of five entity groups per cross-entity group transaction, which is more than enough for most use cases.
CloudSQL
Lots of NoSQL databases don’t even support transactions.
What good are DataStore transactions if they are broken by cross-data-center replication? We all know that DataStore is built on top of Bigtable, and Bigtable has this weird, out-of-order, eventually consistent replication that nobody really understands. Actually, DataStore uses Megastore replication.
But there are those other use cases where you want to transact over your entire data set, and in CloudSQL you can do that. Let's say that you want to give gold away to your friends. Again, you start your transaction, you run your queries, and then you commit.
Transactions in MySQL
START TRANSACTION;
SELECT gold from players where id = 1;
SELECT COUNT(*) from friends WHERE player_id = 1;
UPDATE players SET gold = <amount to give away> WHERE id = 1;
UPDATE players, friends SET players.gold = players.gold + 25
WHERE friends.player_id = 1 AND players.id = friends.friend_id;
COMMIT;
In CloudSQL there is no limit on the number of rows, or anything like entity groups, that can be involved in a transaction.
In CloudSQL, you can do the same thing but you don’t have to define those relationships in advance.
Sell a potion to another player
START TRANSACTION;
SELECT gold from players where id in (1,2);
SELECT COUNT(*) from inventory WHERE player_id = 1 AND type = 'potion';
UPDATE players SET gold = gold + 25 WHERE id = 1;
UPDATE players SET gold = gold - 25 WHERE id = 2;
UPDATE inventory SET player_id = 2 WHERE player_id = 1 AND type = 'potion' LIMIT 1;
COMMIT;
All you need is START TRANSACTION, your queries, and then COMMIT. It is as simple as that.
Consistency
DataStore
- Megastore replication
- Entity groups
- Parallel transaction logs
- Parallel replication
- No master - but this is not a disadvantage. If the application's queries use transactions, the data is written to and read from a replica that has the most up-to-date data.
- Strong within an entity group
- Get
- Ancestor Query
- Eventual across entity groups
- Global Queries
Speaking of replication, MySQL traditionally uses a single master to guarantee strong consistency but asynchronously replicates changes to a slave. If there are a lot of changes queued up on the master and the master crashes, you lose that data; if there is a data center outage, developers lose their data. CloudSQL, in contrast, uses synchronous replication.
What latency should applications expect, and how does that compare to the DataStore? They are pretty comparable: somewhere between 50 and 100 milliseconds. Although, since CloudSQL uses a single master, it can commit a whole batch of inserts at one time, so the write bandwidth is much larger.
CloudSQL
Master + Synchronous replication
┌────────┐
│ Client │
└────────┘
    │
┌───────────────────────────┐
│ DataCenter A              │
│---------------------------│
│ MySQL    ┌────────┐       │
│          │ Master │       │
│          └────────┘       │
└───────────────────────────┘
        │                │
┌──────────────────┐  ┌──────────────────┐
│ DataCenter B     │  │ DataCenter C     │
│------------------│  │------------------│
│    ┌───────┐     │  │    ┌───────┐     │
│    │ Slave │     │  │    │ Slave │     │
│    └───────┘     │  │    └───────┘     │
└──────────────────┘  └──────────────────┘
A client sends some data to the MySQL server. Before responding to the client, we synchronously replicate the data to the other data centers, and then we respond to the client. What this means is that if we lose the machine that is running the MySQL server, or even if we lose the entirety of DataCenter A, we can quickly restart the MySQL server in a different DataCenter without any data loss.
Scalability
CloudSQL
Let's look at some examples from within Google of how CloudSQL is used.
Example 1:
Google Time Keeper
- Used by Google AdWords' sales and support teams
- Tracks time spent on
- Chat support
- Email support
- Campaign optimization
This is a large organization within Google and they use CloudSQL for their day-to-day jobs and it works really well for them. They use this data to optimize their own workflow.
Example 2:
Google Org Chart
- Tracks 30k+ employees
- 10-100 QPS
This keeps track of data about employees, their relationships to each other, and what they are working on. To give you an idea of the kind of load it can handle, picture this: we have these company all-hands meetings, so all 30,000 employees are listening to upper management, and upper management reminds everyone to go onto the org chart application and update what they are working on. Everybody simultaneously opens their laptop and goes to this website: tens of thousands of employees hammering on it all of a sudden. We get tens to hundreds of QPS on the backend, and CloudSQL handles it just fine.
CloudSQL works very well for these sorts of large corporate environments.
DataStore
Say you are building a hugely popular mobile application. We are talking about thousands and thousands of QPS, millions and millions of users, and billions of ruffled feathers. With DataStore, there are no headaches and no provisioning. It just scales to your use case and it just works.
Let's see how.
DataStore on Megastore on Bigtable on …
- All the best features of each layer
┌───────────┐
│ DataStore │
├───────────┤
│ Megastore │
├───────────┤
│ Bigtable  │
├───────────┤
│ GFS v2    │
└───────────┘
DataStore is built on top of Google infrastructure. And each one of these layers adds a key component to the DataStore scalability.
GFS v2 (Google File System)
- Huge Capacity
- Durable
GFS allows your application to get as large as it needs to get.
Bigtable
On top of that, we have Bigtable. Bigtable automatically splits your data based on load and balances it across the machines that are available. So say your traffic changes and all of a sudden you have a spike of writes in one part of your data. What Bigtable will do is take that one shard, or tablet, split it into two pieces, and then load those pieces on different machines.
On top of that is Megastore.
Megastore
- Works at scale
- See 2011 talk “More 9s Please: Under The Covers of the High Replication DataStore”
- 9’s are important at scale
- Not reliant on a single datacenter
- Handles local issues
- Handles catastrophic failures
It is a truly distributed database system: it spans multiple data centers and multiple geographic regions. That is the level at which it operates.
At scale, the reliability of the DataStore is hugely important, because even small local issues can cause outages for many, many users. Megastore handles this by automatically failing over to a different data center and reading the data from there. And if you are using entity groups, you are guaranteed to always have strong consistency, because it makes sure that whatever replica you are reading from is up to date. It also handles catastrophic failures: if one or more data centers suddenly go offline, whether they fall into the ocean or a power outage happens nearby, those types of failures are still hidden from your users.
Management
Benefits of the cloud: no software patches to worry about, no hard drives to replace, no systems to purchase. All of that applies equally to DataStore and to CloudSQL.
CloudSQL
It is very easy to get started with CloudSQL.

The very first thing to do is create an App Engine application. Go to the App Engine website, fill in the form for creating an application (pick an Application Identifier and an Application Title), and create it. For a newly created app, there isn't going to be much data in the dashboard yet.

The next step is the APIs console. If you have used the Maps API or the Translate API, you probably have this set up already. Set up billing: go through the billing flow and enter credit card info.

After that, go back to the main page, open the CloudSQL tab, and create a new instance: pick an instance name and a size (the size controls how much CPU and RAM is allocated to the MySQL process), and authorize the instance to be accessed by the Application Identifier. After a few seconds, the MySQL instance is provisioned and you will see a dashboard for it. There will already be a little bit of storage usage, because MySQL needs to format some of its data files.

Now we can start using the CloudSQL instance. There is a SQL prompt built into the web UI that is handy for simple queries. First, create a database:

CREATE DATABASE test;

Then create a table:

CREATE TABLE t1 (c1 INT, c2 VARCHAR(256));

We can keep using this prompt to populate and query data. If we need development or staging instances, we just repeat those last few steps and everything is provisioned for us.
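From an App Engine Python app, the instance was reachable through a DB-API style interface; something along these lines (the instance and database names are placeholders, and this API has changed since the talk, so treat it as a sketch):

from google.appengine.api import rdbms

conn = rdbms.connect(instance='my-project:my-instance', database='test')
cursor = conn.cursor()
cursor.execute('INSERT INTO t1 (c1, c2) VALUES (%s, %s)', (1, 'hello'))
conn.commit()
cursor.execute('SELECT c1, c2 FROM t1')
for row in cursor.fetchall():
    print row
conn.close()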
DataStore
When you create the app, the DataStore is ready right then to accept writes from your application. There is nothing to provision and nothing to configure; you just start writing data. If you want to use different tables (in the DataStore they are called "kinds"), you just define those kinds in your code. You don't have to tell the DataStore about them ahead of time; you just start putting data. If you want isolation, you can use namespaces for multi-tenancy or to isolate a development instance, or you can even use an entirely different app to completely isolate your staging instance from everything else.
- No configuration needed
- Just start writing data
- Entity 'kinds' instead of tables
- Namespaces for multi-tenancy/isolation
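For example, switching to a separate namespace before writing (a small sketch; the kind and namespace names are arbitrary):

from google.appengine.api import namespace_manager
from google.appengine.ext import db

class Listing(db.Model):
    # A new kind: no table creation needed, just define the class and put().
    title = db.StringProperty()

namespace_manager.set_namespace('staging')   # isolate this data from production
Listing(title='test listing').put()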
Schema
Schema defines what your data looks like. What are the data types. What are the relationships between the data.
CloudSQL
In CloudSQL, the schema is strictly enforced, which means you have to create the table before you can start working with your data. Some people think of this strict enforcement as a benefit: it means you don't have typos in your code where you write to some non-existent column and then, when you read from the column you are supposed to read from, there is no data there.
CREATE TABLE Player (name VARCHAR(256), health int);
Let's see an example of how to do a schema change. Going back to our previous example of a player with a name and some integer amount of health, we are going to want to add magic to this game.
ALTER TABLE Player ADD COLUMN magic int;
We do have to be careful with these ALTER TABLE statements, because they can lock up the table for the duration of the change.
It’s only easy to change the schema on the surface. The application usually has to be made aware of any schema changes in order to operate properly. Adding a column in a SQL database can adversely affect applications that tend to do “SELECT *”, while renaming or removing a column can adversely affect applications that assumed the existence of that column. SQL databases make schema change “easy” in the sense that there’s a SQL command that allows you to add, drop and rename columns. The schema management requirements up the stack still need to be thought through and implemented properly.
Traditionally, a zero-downtime schema change involves creating a complete replica of your data so that you can operate in parallel. This allows you to update the schema in the replica and then gradually migrate your application over to it.
Maintaining multiple instances of a database is expensive — even for enterprises. Doubling your database costs just to prevent downtime during schema changes is a tough pill to swallow.
Schema updates on legacy RDBMS get even more complicated if you’ve sharded your database to scale it horizontally. In that case, you’ve created an ad-hoc distributed database, so building a replica of it is far from straightforward. The complexity of a distributed relational database makes it difficult to apply schema changes while keeping data consistent.
SQL Schema Change
- ALTER TABLE
- Locks the table - the reason this happens is that MySQL packs the row data tightly, so one row is right adjacent to the next. When we add that extra column, there is no room for the new field in that tightly packed space, so MySQL needs to copy everything to a new location. For the duration of that copy, the table is locked.
- Copies entire table
Online Schema Change
There are some tricks we can play to minimize this lock time or even hide it entirely. This is called an Online Schema Change. We have our old table and a new table, and we do a background copy of the data from the old table to the new table. While that background copy is going on, we don't want to miss any changes happening to the old table, so we set up a trigger on the old table so that any changes that come through get propagated to the new table. Once everything is copied, we do an atomic rename and it just works. If you want to see how that works, there is a company called Percona with a tool called pt-online-schema-change that works with MySQL and makes this very, very easy.
SQL Schema Change
- Online Schema Change
- Write to new and old table
- Bulk copy
- Rename new table
NoSQL and “schemaless” databases
NoSQL databases such as MongoDB offer a “schemaless” database engine. Schemaless databases solve the schema problem by changing how data is stored. Instead of tables and rows, data is stored as JSON-like documents.
This approach increases flexibility at the expense of data consistency, because documents in a table or collection aren’t required to contain the same data fields. The lack of an enforced schema in the database can also negatively impact performance in some use cases.
And while not having to worry about database schema might sound great, in most cases, it just moves the work of enforcing data rules into your application.
Most applications require some or all of the data they use to be structured. If your database isn’t enforcing that structure, then you need to enforce it in your application, to ensure only valid data is sent to the database.
So, while NoSQL databases have other advantages, the term “schemaless” can be a bit misleading. Your database might be schemaless, but often that just means your team will need to build schema-like guardrails into your application. And of course, when you want to change those guardrails, it’ll be on your team to figure out how to do those updates without impacting the user experience.
For some applications, it makes sense to sacrifice the schema-enforced data consistency of a traditional relational database for the flexibility of NoSQL. Sometimes you need to store some unstructured data! Or, you may simply have an application where eventual data consistency is good enough.
But many applications – particularly those with heavy transactional workloads – rely on data consistency. Orders, logistics, money transfers – these are workloads that many applications have in some form or another where eventual consistency often isn’t good enough.
DataStore
In DataStore, schema changes are actually magical. Well, they are not magical; you do have to do something. Schema enforcement happens (or can be enabled) in your code; the DataStore itself does not enforce the schema for you. What this means is that if you want to add that magic field to Player, all you have to do is change your code to add the field to the entity definition, and it's there. You can set a default value and just start using the new field. If you need to backfill previously stored entities, say to do some complicated calculation of the initial magic every character should have, you can do that using the powerful MapReduce framework.
DataStore Schema Change
- Update code
- Optionally write MapReduce to backfill
class Player(db.Model):
    name = db.StringProperty()
    health = db.IntegerProperty()

Add the new magic field to the model:

    magic = db.IntegerProperty(default=0)
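And a backfill sketch using the App Engine MapReduce library (the mapper function below would be registered in mapreduce.yaml; the initial-magic rule is made up for illustration):

from mapreduce import operation as op

def give_initial_magic(player):
    # Called once for each existing Player entity.
    player.magic = player.health // 10   # made-up rule for the initial magic
    yield op.db.Put(player)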
Bottom line
Typically, schema evolution is managed by the database engine, the application, or an intervening API layer. As for how "easy" it is, that depends a lot on the application layers above and how they are affected by the schema change.
Scoreboard
| | DataStore | CloudSQL |
|---|---|---|
| Queries | Y | Y+ |
| Transactions | Y | Y+ |
| Consistency | Y | Y+ |
| Scalability | Y+ | Y |
| Management | Y+ | Y |
| Schema | Y+ | Y |
Maybe, there is room for both of these products in the world.
Where is the DataStore probably a better fit than CloudSQL?
File sharing applications are popular nowadays. If we wanted to build one, first we need to come up with a good name, e.g. DropRectangle.net.
DropRectangle.net SQL
┌────────────┐  ┌────────────┐  ┌────────────┐
│   Users    │  │   Files    │  │    ACL     │
├────────────┤  ├────────────┤  ├────────────┤
│ user_id    │  │ file_id    │  │ file_id    │
│ name       │  │ owner_id   │  │ user_id    │
│            │  │ name       │  │ permission │
└────────────┘  └────────────┘  └────────────┘
If you were to use CloudSQL to store this data, this is probably how you would structure your schema: a table for users, a table for files, and a table for access control. With this schema, you can run queries like "show me all of the files I have access to", or atomically transfer ownership of a file from one owner to another. This works great until your site gets popular and you have lots and lots of users and lots and lots of files, and the data no longer fits on a single machine. At that point, you shard your data, and the natural way to shard would be by user. Unfortunately, we have this operation of transferring ownership between users, and if you shard by user, you don't know whether the two users are on the same shard. If they are not, it gets hard to atomically move that file ownership between shards. This is where the DataStore will probably do better than CloudSQL.
You structure the user as the root entity, with files beneath that and access control entries underneath those. With global queries, you can easily find all the files you have access to, and if you want to atomically transfer a file between users, you can use a cross-entity group transaction.
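A sketch of that structure in the Python db API (kind and property names mirror the SQL schema above):

from google.appengine.ext import db

class User(db.Model):
    # Root entity: one entity group per user.
    name = db.StringProperty()

class File(db.Model):
    # Created with parent=<User>, so it lives in the owner's entity group.
    name = db.StringProperty()

class Acl(db.Model):
    # Created with parent=<File>; user_id is indexed, so the global query
    # "show me all the files I have access to" is Acl.all().filter('user_id =', me).
    user_id = db.StringProperty()
    permission = db.StringProperty()

@db.transactional(xg=True)
def transfer_ownership(file_key, new_owner_key):
    # An entity cannot change parents, so copy the file under the new owner
    # and delete the original in one cross-entity-group transaction.
    old_file = db.get(file_key)
    File(parent=new_owner_key, name=old_file.name).put()
    db.delete(old_file)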
What are some use cases for CloudSQL?
Full support for off-the-shelf software:
- Frameworks - there is an entire ecosystem of frameworks built to work with relational databases, and it doesn't always make sense to modify these solutions or roll your own.
- Hibernate
- JDO/JPA
- Spring
- Django
- WordPress
- Standards-based existing applications
Are there use cases where these two products could work together?
We have this product manager on our team. He sends a lot of emails about the stuff that he is selling. It would be great if he had some sort of web application where he could post things for sale or list things for sale, and people could search for what they want to buy.
┌────────────----───┐ ┌────────────------───┐
│ CloudSQL │ │ DataStore │
│ (active listings) │ │ (archived listings) │
│ │ │ (disk) │
└─────────────----──┘ └─────────────------──┘
What he could do is use CloudSQL to store all of his active listings, so that he gets all the speed of in-memory operations and the in-memory performance of a single machine. When a listing expires or is sold, he can archive it to the DataStore, where it is always available: he can still query against it and still use it.

One of the big benefits of putting the active listings in CloudSQL is that he gets the powerful query language, all of those aggregations, and lots of flexibility, so he can run queries like "show me the average price of a sofa in San Francisco". CloudSQL works best when your entire data set fits into memory, so it doesn't have to page to disk or do any heavy lifting, and the active set of listings is relatively small compared to all the listings over time. So it really makes a lot of sense to keep them in CloudSQL.

Storing the archived listings in the DataStore also makes sense: when you have schema changes, you want to apply them to the data you are actually working with, the stuff in CloudSQL. For all of those archived listings, you don't really want to apply the schema changes and do the backfill, and with the flexible schema of the DataStore you can get that to work as well.
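One way that hand-off might look (a sketch; the listings table, its columns, and the ArchivedListing kind are all illustrative, and conn is a DB-API connection to the CloudSQL instance):

from google.appengine.ext import db

class ArchivedListing(db.Model):
    title = db.StringProperty()
    price = db.FloatProperty()
    city = db.StringProperty()

def archive_listing(conn, listing_id):
    # Copy the expired/sold listing out of CloudSQL into the DataStore...
    cursor = conn.cursor()
    cursor.execute('SELECT title, price, city FROM listings WHERE id = %s',
                   (listing_id,))
    title, price, city = cursor.fetchone()
    ArchivedListing(title=title, price=float(price), city=city).put()
    # ...and remove it from the active set so the working set stays in memory.
    cursor.execute('DELETE FROM listings WHERE id = %s', (listing_id,))
    conn.commit()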
References
Google I/O 2012 - SQL vs NoSQL: Battle of the Backends
Tags
TODO
Outstanding questions
Can the schema of a NoSQL table be updated later? If so, can the data be migrated into the new table?