AWS – Choosing an AWS Database


AWS offers several database solutions.

A number of questions need to be considered when choosing an AWS database solution:

 

Questions to Consider

 

What is your workload: read-heavy, write-heavy, or balanced?

What throughput volume do you need? Will it change, does it need to scale, will it fluctuate up and down, and is it predictable or not?

How much data volume do you need to store, and for how long? Will it grow? What is the average object size? How do you need to access it?

What data durability do you require? What is the "source of truth" for the data?

Are there compliance requirements?

What are the latency requirements? How many concurrent users?

What is the data model, and how will you query the data? Will there be joins? Is the data structured or semi-structured?

Do you need a strong schema or a flexible one? Are there reporting needs? Searching? RDBMS or NoSQL?

What about license costs? Is a cloud-native database such as Aurora possible or not?

 

Overview of Database Types on AWS

 

RDBMS (SQL, OLTP): RDS or Aurora; especially good for joins.

NoSQL databases: DynamoDB (JSON documents), ElastiCache (key/value pairs), or Neptune (graphs); no joins and no SQL.

 

Object stores: S3 for big objects, Glacier for backups and archives. These may not seem like databases, but they work like one.

 

Data warehouse solutions, e.g. for SQL analytics/BI: Redshift (OLAP) and Athena.

 

Search solutions: e.g. ElasticSearch (JSON), for free-text, unstructured searches.

 

Graph solutions: Neptune, which represents relationships between data.

 

 

Overviews of AWS DB Solutions

 

RDS Overview

 

It is a managed database at the PostgreSQL/MySQL/Oracle/SQL Server level.

 

You must, however, provision an EC2 instance type and an EBS volume type of sufficient size.

 

It supports read replicas and Multi-AZ.
Security is via IAM and security groups, KMS, and SSL in transit.
Backups, snapshots, and point-in-time restores are all possible.

 

Managed and scheduled maintenance.

 

Monitoring is available via CloudWatch.

 

Use cases include: storing relational datasets (RDBMS/OLTP) and performing SQL queries; transactional inserts, updates, and deletes are possible.
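As a minimal sketch of what that provisioning looks like (assuming boto3 with AWS credentials configured; the identifier, credentials, and sizing below are hypothetical placeholders, not recommendations):

```python
import boto3

rds = boto3.client("rds")

# Create a Multi-AZ PostgreSQL instance; the instance class and storage
# correspond to the EC2/EBS choices mentioned above.
rds.create_db_instance(
    DBInstanceIdentifier="demo-postgres",   # placeholder name
    Engine="postgres",
    DBInstanceClass="db.t3.micro",          # the underlying EC2 instance type
    AllocatedStorage=20,                    # EBS volume size in GiB
    MultiAZ=True,                           # standby in a second AZ for failover
    MasterUsername="demo_admin",
    MasterUserPassword="change-me",         # use Secrets Manager in real setups
)
```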

 

RDS for Solutions Architects: considerations include these "5 pillars":

 

operations
security
reliability
performance
cost

 

 

Operations: small downtime when failover happens and when maintenance happens; scaling read replicas or EC2 instances and restoring from EBS require manual intervention and application changes.

 

 

Security: AWS is responsible for OS security, but we are responsible for setting up KMS, security groups, and IAM policies, authorizing users in the database, and using SSL.

 

 

Reliability: the Multi-AZ feature makes RDS very reliable, with good failover in failure situations.

 

Performance: depends on the EC2 instance type and EBS volume type; you can add read replicas, storage auto scaling is possible, and manual scaling of instances is also possible.

 

Cost: pay per hour, based on the provisioned number and type of EC2 instances and on EBS usage.

 

 

Aurora Overview

 

Compatible API for PostgreSQL and MySQL

 

Data is held in 6 replicas across 3 AZs.
It has auto-healing capability.
Read replicas are Multi-AZ and auto-scaling.
Read replicas can be global.

 

An Aurora database can be global, for DR or latency purposes.
Storage auto-scales from 10 GB to 128 TB.

 

You define the EC2 instance type for Aurora instances.

 

It has the same security/monitoring/maintenance as RDS, but also offers:

 

Aurora Serverless for unpredictable/intermittent workloads
Aurora Multi-Master for continuous-write failover

 

Use cases: the same as for RDS, but with less maintenance, more flexibility, and higher performance.

 

Operations: fewer operations; auto-scaling storage.

 

Security: AWS handles the OS as usual, but we are responsible for KMS, security groups, IAM policies, user authorization, and SSL.

 

Reliability: Multi-AZ and highly available, with Serverless and Multi-Master options.

 

Performance: 5x the performance, plus a maximum of 15 read replicas (vs only 5 for RDS).

 

Cost: pay per hour according to EC2 and storage usage; can be lower cost than e.g. Oracle.
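A hedged boto3 sketch of provisioning Aurora (placeholder names and credentials as before): the cluster is the 6-way replicated storage layer, and compute instances are attached to it with an instance class you choose.

```python
import boto3

rds = boto3.client("rds")

# The cluster is the shared storage layer, replicated 6 ways across 3 AZs.
rds.create_db_cluster(
    DBClusterIdentifier="demo-aurora",
    Engine="aurora-mysql",
    MasterUsername="demo_admin",
    MasterUserPassword="change-me",
)

# Attach a compute instance (the writer) to the cluster.
rds.create_db_instance(
    DBInstanceIdentifier="demo-aurora-writer",
    DBClusterIdentifier="demo-aurora",
    Engine="aurora-mysql",
    DBInstanceClass="db.r5.large",
)
```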

 

 

ElastiCache Overview

 

It is really just a cache, not a database.

 

It is a managed Redis or Memcached; similar to RDS, but for caches.

 

It is an in-memory data store with very low latency.
You must provision an EC2 instance type to use it.

 

It supports Redis clustering and Multi-AZ, plus read replicas using sharding.

 

Security is via security groups, KMS, and Redis AUTH; note there is NO IAM authentication (IAM policies apply only to AWS API-level access).

 

Backups, snapshots, and point-in-time restores.
Managed and scheduled maintenance.
Monitoring via CloudWatch.

 

Use cases: key/value store, frequent reads with fewer writes, caching results of database queries, and storing session data for websites. You cannot use SQL; the latter is important to be aware of.

 

You retrieve the data only by key; you can't query on other fields.
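A minimal sketch using the redis-py client against a hypothetical ElastiCache Redis endpoint (the endpoint and AUTH token are placeholders), illustrating session storage and the key-only access pattern:

```python
import json
import redis  # redis-py client

# Placeholder ElastiCache Redis endpoint; in-transit encryption and
# Redis AUTH are assumed to be enabled on the cluster.
r = redis.Redis(
    host="demo-cache.xxxxxx.use1.cache.amazonaws.com",
    port=6379,
    ssl=True,
    password="demo-auth-token",
)

# Store website session data under a key, with a one-hour TTL.
r.setex("session:user42", 3600, json.dumps({"cart": ["item1"], "theme": "dark"}))

# Retrieval is by key only; there is no way to query on the "theme" field.
session = json.loads(r.get("session:user42"))
```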

 

 

Operations: same as RDS.
Security: as usual; IAM policies for the AWS API, Redis AUTH for users, and SSL.

 

Reliability: clustering and Multi-AZ.

 

Performance: in-memory, so extremely fast; read replicas for sharding; very efficient.

 

Cost: similar to RDS; pricing is based on EC2 and storage usage.

 

 

DynamoDB Overview

 

Proprietary to AWS.
A managed NoSQL database.
Serverless, with provisioned, auto-scaling, and on-demand capacity modes.

 

It can replace ElastiCache as a key-value store, e.g. for storing session data.

 

Performance is slower than an in-memory cache such as ElastiCache, though.

 

Highly available and Multi-AZ by default; reads and writes are decoupled; DAX is available as a read cache.

 

Two options for reads: eventually consistent or strongly consistent.

 

Security, authorization, and authentication are all done via IAM.

 

DynamoDB Streams integrate with Lambda.

 

Backup/restore and the Global Tables feature; the latter requires you to enable Streams.

 

Monitoring via CloudWatch.

 

Important (an exam question!): you can only query on the primary key, sort key, or indexes.
So you cannot query on "any" attribute, only the above.
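A short boto3 sketch against a hypothetical table (partition key "user_id", sort key "created_at", both assumed) showing that queries must be expressed against the key schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table with partition key "user_id" and sort key "created_at".
table = boto3.resource("dynamodb").Table("demo-sessions")

# Queries must target the key schema (or a secondary index); filtering on
# arbitrary attributes would require a scan or a new index instead.
response = table.query(
    KeyConditionExpression=(
        Key("user_id").eq("user42") & Key("created_at").begins_with("2024-")
    )
)
items = response["Items"]
```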

 

Use cases: serverless application development and small documents (hundreds of KB); it can act as a distributed serverless cache. It does not have a SQL query language available, but transaction capability is now built in.

 

 

 

S3 Overview

 

It acts as a simple key/value store for objects.

 

Great for big objects up to 5 TB; not so good for small objects.

 

Serverless; scales infinitely; strong consistency for every operation.

 

Tiers for migrating data: S3 Standard, S3 IA, S3 One Zone-IA, and Glacier for backups.

 

Features include versioning, encryption, and CRR (cross-region replication).

 

Security: IAM, bucket policies, ACLs.

Encryption: SSE-S3, SSE-KMS, SSE-C, client-side encryption, SSL in transit.

 

Use cases: static files, a key-value store for big files, and website hosting.
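A minimal boto3 sketch of S3 used as a key/value store; the bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# The key is the full path of the object in the bucket; names are placeholders.
s3.put_object(
    Bucket="demo-bucket",
    Key="reports/2024/summary.txt",
    Body=b"quarterly summary...",
)

# Retrieval is by exact key, just like a key-value store.
obj = s3.get_object(Bucket="demo-bucket", Key="reports/2024/summary.txt")
data = obj["Body"].read()
```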

 

Operations: no operations necessary!
Security: IAM, bucket policies, ACLs, and encryption, set up correctly.

 

Reliability: extremely high; durability is also extremely good; Multi-AZ and CRR.

 

Performance: scales easily; very high read/write throughput; multipart for uploads.

 

Cost: pay only for storage used (no need to provision in advance), plus network costs for transfer/retrieval and a charge per number of requests.

 

 

Athena Overview

 

A fully serverless query service with SQL capability.
Used to query data in S3.
Pay per query.
Outputs results back to S3.
Secured via IAM.

 

Use cases: one-time SQL queries, serverless queries on S3, and log analytics.
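A hedged boto3 sketch of a serverless Athena query over S3 data; the database, table, and results bucket are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Run a one-off SQL query over data already sitting in S3.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status;",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},
)

# Results are written to the S3 output location; poll for completion.
status = athena.get_query_execution(QueryExecutionId=response["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```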

 

 

Operations: no ops needed; it is serverless.

 

Security: via S3 bucket policies and IAM.

 

Reliability: managed, uses the Presto engine, and is highly available.
Performance: queries scale based on data size.

 

Cost: pay per query / per TB of data scanned; serverless.

 

 

Redshift Overview

 

Redshift is a great data warehouse system.

 

Based on PostgreSQL, but not used for OLTP.

 

Instead, it is used for OLAP (online analytical processing): analytics and data warehousing.

 

10x better performance than other data warehouses; scales to PBs of data.

 

Columnar storage of data rather than row-based storage.

 

It is MPP: it uses a massively parallel query execution engine, which makes it extremely fast.

Pay as you go according to the number of instances provisioned.

It has a SQL interface for performing queries.

 

Data is loaded from S3, DynamoDB, DMS, and other databases.
From 1 to 128 nodes are possible, with up to 128 TB of space per node!

 

The leader node is used for query planning and results aggregation.
Compute nodes perform the queries and send results to the leader node.

 

Redshift Spectrum: performs queries directly against S3, with no need to load the data into the Redshift cluster.

 

Backup & restore; security via VPC, IAM, and KMS; monitoring.

 

Redshift Enhanced VPC Routing: COPY/UNLOAD traffic goes through the VPC, which avoids the public internet.

 

Redshift Snapshots and DR

 

Redshift has no Multi-AZ mode.

 

Snapshots are point-in-time (PIT) backups of a cluster, stored internally in S3.

 

They are incremental: only changes are saved, which makes them fast.

You can restore a snapshot to a new cluster.

Automated snapshots: every 8 hours, every 5 GB, or according to a schedule, with a set retention period.

 

Manual snapshots: the snapshot is retained until you delete it.

 

A neat feature: you can configure Redshift to automatically copy snapshots (automated or manual) to another AWS Region.
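A minimal boto3 sketch of enabling that cross-region copy for automated snapshots; the cluster name and destination region are placeholders:

```python
import boto3

redshift = boto3.client("redshift")

# Automatically copy this cluster's automated snapshots to another
# region for DR purposes.
redshift.enable_snapshot_copy(
    ClusterIdentifier="demo-cluster",
    DestinationRegion="eu-west-1",
    RetentionPeriod=7,  # days to keep the copies in the destination region
)
```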

 

 

Loading data into Redshift:

There are three possible ways:

 

1. Use Kinesis Data Firehose, which writes the data to an S3 bucket and then loads it into the Redshift cluster automatically via an S3 copy.

 

2. Use the COPY command manually, without Kinesis:
either from an S3 bucket via the internet, i.e. without using Enhanced VPC Routing to the Redshift cluster,
or through the VPC rather than the internet, using Enhanced VPC Routing.

 

 

3. From an EC2 instance to the Redshift cluster using a JDBC driver.
In this case it is much better to write the data in batches rather than one row at a time. (A sketch of option 2's COPY route follows below.)
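As an illustration of option 2, here is a hedged psycopg2 sketch issuing a COPY from S3; the endpoint, credentials, table, bucket, and IAM role ARN are all placeholders:

```python
import psycopg2

# Connect to the cluster endpoint; all connection values are placeholders.
conn = psycopg2.connect(
    host="demo-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="change-me",
)

# COPY loads the S3 files in parallel across the compute nodes, which is
# far faster than issuing row-by-row INSERTs over JDBC/ODBC.
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY sales
        FROM 's3://demo-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/DemoRedshiftRole'
        FORMAT AS CSV;
    """)
```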

 

 

Redshift Spectrum

 

You must already have a Redshift cluster operational.

 

Spectrum is a way to query data that is already in S3 without having to load it into Redshift

 

You submit the query, which is then farmed out to thousands of Redshift Spectrum nodes for processing.
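A hedged sketch (same psycopg2 placeholder conventions as the COPY example above) of registering an external schema from the Glue Data Catalog and then querying the S3 data in place:

```python
import psycopg2

conn = psycopg2.connect(
    host="demo-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="change-me",
)

with conn, conn.cursor() as cur:
    # One-time setup: expose a Glue Data Catalog database as an external schema.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG
        DATABASE 'weblogs'
        IAM_ROLE 'arn:aws:iam::123456789012:role/DemoRedshiftRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)
    # The scan below runs on Spectrum nodes against S3; nothing is loaded.
    cur.execute("SELECT status, COUNT(*) FROM spectrum.access_logs GROUP BY status;")
    print(cur.fetchall())
```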

 

 

Operations: similar to RDS.
Security: uses IAM, VPC, KMS, and SSL, as for RDS.
Reliability: auto-healing; cross-region snapshot copies are possible.
Performance: 10x the performance of other data warehousing systems; uses compression.
Cost: pay per node provisioned, at about 10% of the cost of other DW systems.
Vs Athena: faster at querying, and supports joins and aggregations thanks to indexes.

 

Redshift = analytics / BI / data warehouse.

 

 

Glue Overview

 

Glue is a managed ETL (extract, transform, and load) service.

 

It is used to prepare and transform data for analytics.
It is a fully serverless service.

 

Fetch data from an S3 bucket or RDS and send it to Glue, which does the extracting and transforming and then loads it into the Redshift data warehouse.

 

Glue Data Catalog: a catalog of all the datasets you have in AWS, i.e. metadata about your data.

 

You deploy a Glue Data Crawler to navigate S3, RDS, and DynamoDB; it then writes the metadata to the Glue Data Catalog, as in the sketch below.
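A minimal boto3 sketch of deploying and starting such a crawler; the crawler name, IAM role, catalog database, and S3 path are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Crawler name, IAM role, catalog database, and S3 path are placeholders.
glue.create_crawler(
    Name="demo-logs-crawler",
    Role="arn:aws:iam::123456789012:role/DemoGlueRole",
    DatabaseName="weblogs",  # target database in the Glue Data Catalog
    Targets={"S3Targets": [{"Path": "s3://demo-bucket/logs/"}]},
)

# Running the crawler infers schemas and writes table metadata to the catalog.
glue.start_crawler(Name="demo-logs-crawler")
```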

 

The catalog can then be used by Glue Jobs for ETL.

 

Data discovery -> Athena, Redshift Spectrum, or EMR for analytics.

 

That is all you need to know about Glue for the exam.

 

Finally,

 

Neptune Overview

 

Tip: if you hear graphs mentioned in the exam, this refers to Neptune!

 

Neptune is a fully managed graph database.

 

It is used for highly relational data, e.g. social networking: users are friends with users, reply to comments on a user's post, and like other comments.

 

Also knowledge graphs, e.g. Wikipedia: pages link to other wiki pages, and there are lots of these links.
This is graph data, giving a massive graph.
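A hedged sketch using the gremlinpython client against a hypothetical Neptune endpoint; the vertex and edge labels (user, friend, posted, likes) are assumptions for illustration, not a fixed schema:

```python
from gremlin_python.driver import client

# Placeholder Neptune endpoint; Neptune speaks Gremlin over secure WebSockets.
gremlin = client.Client(
    "wss://demo-neptune.cluster-xxxxxx.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)

# Count the likes on posts made by Mike's friends.
query = (
    "g.V().has('user', 'name', 'Mike')"
    ".both('friend').out('posted').inE('likes').count()"
)
print(gremlin.submit(query).all().result())

gremlin.close()
```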

 

Highly available across 3 AZs, with up to 15 read replicas available.

 

Point-in-time recovery; continuous backup to S3.

 

Support for KMS encryption, HTTPS, and SSL.

 

Operations: similar to RDS.
Security: IAM, VPC, KMS, SSL (similar to RDS), plus IAM authentication.
Reliability: Multi-AZ and clustering.
Performance: best suited for graphs; clustering to improve performance.

 

Cost: similar to RDS; pay per provisioned node.

 

Remember: Neptune = graph database.

 

 

AWS OpenSearch

 

OpenSearch is the successor to AWS ElasticSearch.

 

E.g. DynamoDB only allows search by primary key or index, whereas with OpenSearch you can search ANY field.
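A minimal sketch with the opensearch-py client (the domain endpoint, index, and field names are placeholders) showing a fuzzy full-text match on an arbitrary field:

```python
from opensearchpy import OpenSearch

# Placeholder domain endpoint; authentication details are omitted for brevity.
client = OpenSearch(
    hosts=[{"host": "search-demo.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Fuzzy full-text match on an arbitrary field, something DynamoDB cannot do.
response = client.search(
    index="gamers",
    body={"query": {"match": {"name": {"query": "mik", "fuzziness": "AUTO"}}}},
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["name"])
```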

 

It is often used as a complement to other databases.

 

It also has uses in big data applications.
You can provision a cluster of instances.
Built-in integrations: various, including Kinesis Data Firehose, AWS IoT, and CloudWatch Logs.

 

 

It comes with a visualization dashboard.

 

Operations: similar to RDS.

 

Security: via Cognito, IAM, KMS encryption, SSL, and VPC.
Reliability: Multi-AZ and clustering.
Performance: based on the open-source ElasticSearch project; petabyte scale.
Cost: pay per provisioned node.

 

All you need to remember: it is used to search and index data.

 

Question 1:
Which database helps you store relational datasets, with SQL language compatibility and the capability of processing transactions such as insert, update, and delete?

 

RDS

 

Question 2:
Which AWS service provides you with caching capability that is compatible with Redis API?

 

ElastiCache

Good job!
Amazon ElastiCache is a fully managed in-memory data store, compatible with Redis or Memcached.

 

 

Question 3:
You want to migrate an on-premises MongoDB NoSQL database to AWS. You don’t want to manage any database servers, so you want to use a managed NoSQL database, preferably Serverless, that provides you with high availability, durability, and reliability. Which database should you choose?

 

Amazon DynamoDB

 

Good job!

Amazon DynamoDB is a key-value, document, NoSQL database.

 

 

Question 4:
You are looking to perform Online Transaction Processing (OLTP). You would like to use a database that has built-in auto-scaling capabilities and provides you with the maximum number of replicas for its underlying storage. What AWS service do you recommend?

 

Amazon Aurora

 

Good job!

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database. It features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 128TB per database instance. It delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across 3 AZs.

 

 

 

Question 5:
As a Solutions Architect, a startup company has asked you for help as they are working on an architecture for a social media website where users can be friends with each other and like each other's posts. The company plans on performing some complicated queries such as "What are the number of likes on the posts that have been posted by the friends of Mike?". Which database do you recommend?

 

Amazon Neptune

 

Good job!
Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets.

 

 

Question 6:
You have a set of files, 100MB each, that you want to store in a reliable and durable key-value store. Which AWS service do you recommend?

 

Amazon S3

 

Good job!
Amazon S3 is indeed a key-value store! (where the key is the full path of the object in the bucket)

 

Question 7:
You would like to have a database that is efficient at performing analytical queries on large sets of columnar data. You would like to connect to this Data Warehouse using a reporting and dashboard tool such as Amazon QuickSight. Which AWS technology do you recommend?

 

Amazon Redshift

 

Good job!
Amazon Redshift

 

 

Question 8:
You have a lot of log files stored in an S3 bucket on which you want to perform a quick analysis, if possible serverless, to filter the logs and find users that attempted to make an unauthorized action. Which AWS service allows you to do so?

 

Amazon Athena

 

Good job!
Amazon Athena is an interactive serverless query service that makes it easy to analyze data in S3 buckets using Standard SQL.

 

 

Question 9:
As a Solutions Architect, you have been instructed to prepare a disaster recovery plan for a Redshift cluster. What should you do?

 

Enable automated snapshots, then configure your Redshift cluster to auto-copy the snapshots to another AWS Region.

 

Good job!

 

 

Question 10:
Which feature in Redshift forces all COPY and UNLOAD traffic moving between your cluster and data repositories through your VPCs?

 

 

Enhanced VPC Routing

 

Good job!

 

Question 11:
You are running a gaming website that is using DynamoDB as its data store. Users have been asking for a search feature to find other gamers by name, with partial matches if possible. Which AWS technology do you recommend to implement this feature?

 

ElasticSearch

 

Good job!
Anytime you see “search”, think ElasticSearch.

 

 

Question 12:
An AWS service allows you to create, run, and monitor ETL (extract, transform, and load) jobs in a few clicks.

 

AWS Glue

 

Good job!
AWS Glue is a serverless data-preparation service for extract, transform, and load (ETL) operations.

 

 

 

 

 

 

 

 

 

 

 

 

 

 
