AWS – Choosing an AWS Database
AWS offers several db solutions.
A number of questions need to be considered when choosing an AWS DB solution:
Questions to Consider
What is your need – read heavy, write heavy, balanced?
Will this change, does it need to scale, will it fluctuate up and down?, is it predictable or not?
How much data volume to store – and for how long
will it grow?
what is the average object size?
how do you need to access it?
what data durability do you require? what is the “source of truth” for the data?
are there compliance requirements?
what latency requirements – how many concurrent users
what is the data model – how will you query the data, will there be data joins? structured, semi-structured?
strong schema, flexible, reporting needs? searching? RBDMS or NoSQL?
what about license costs – cloud native db such as Aurora possible or not?
Overview of Database Types on AWS
RDBMS such as sql, oltp, this means: RDS, or Aurora, esp good for joins
NoSQL DBs such as DynamoDB – json, elasticache – key/value pairs or Nepture – good for graphs, but no joins and no sql
Object Stores: S3 for big objects, Glacier for backup and archives – may not seem like a DB but it works like one
Data Warehouse solutions eg SQL Analytics/BI, Redshift (OLAP), Athena
Search solutions: eg ElasticSearch (json), for free text unstructureed searches
Graph solutions: Neptune – this displays relationships between data
Overviews of AWS DB Solutions
its a managed db on the postgresql/myswl/Oracle/SQL level
you must however an ec2 instance and ebs vol type and sufficient size
it supports read replicas and multi-AZ
security is via iam and security groups, kms, and ssl in transit
backup, snapshot and point in time restores all possible
managed and scheduled maintanance
monitoring available via cloudwatch
use cases include:
storing relational datasets rdbms/oltp performing sql queries, transactional inserts, update, delete is possible
rds for solutions architect, considerations include these “5 pillars”:
operations_ small downtimes when failover happens, when maintenance happens, when scaling read replicas, ec2 instances, and restoring from ebs, this requires manual intervention, and when application changes
security: aws is responsible for os security, but we are responsible for setting up kms, security groups, iam policies, authorizing users in db and using ssl
reliability: the multi-az feature makes rds v reliable, good for failover in failure situations
performance: dependent on ec2 instance type, ebs vol type, can add read replicas, storage autoscaling is possible, and manual scaling of instances is also possible
costs: is pay per hour based on provisioned number and type of ec2 instances and ebs usage
Compatible API for PostgreSQL and MySQL
Data is held in 6 replicas across 3 AZs
It has auto-healing capability
Multi-AZ Auto Scaling Read Replicas
Read Replicas can be global
Aurora DB can be global for DR or latency purposes
auto Scaling of storage from 10GB to 128 TB
Define EC2 instance type for Aurora instances
same security/monitoring/maintenance as for RDS but also has
Aurora Serverless for unpredictable/intermittent workloads
Aurora Multi-Master for continuous writes failover
use cases: same as for RDS but with less maintenance/more flexibility and higher performance
Operations: less operations, auto scaling storage
security: AWS as per usual, but we are responsible for kms, security groups, iam policy, user auth and ssl
reliability: multi AZ, high availability, serverless and multimaster options
performance 5x performance, plus max 15 read replicas (vs only 5 for rds)
cost: pay per hour acc to ec2 and storage usage, can be lower cost than eg oracle
it is really just a cache not a database
is a managed Redis or Memcached – similar to RDS but for caches
its an in-memory data store with very low latency
you must provision an ec2 instance type to use it
supports redis clustering and multi-az, plus read replicas using sharding
security is via iam, security groups, kms, and redis auth – NO IAM
backup, snapshot, point in time restores
managed and scheduled maintenance
monitoring via cloudwatch
use case: key/value store, frequent reads, less writes, cache results for db queries, storing of session data for websites, but cannot use sql – latter is important to be aware of.
you retrieve the data only by key value, you can’t query on other fields
operations: same as rds
security: usual can use iam policies,for users Redis Auth, and ssl
reliability: clustering and multi AZ
performance: in memory so extremely fast, read replicas for sharding, very efficient
cost: similar to rds pricing based on ec2 and storage usage
proprietary to AWS
a managed NoSQL DB
serverless, provisioned, auto-scaling, on-demand capacity
can replace elasticache as a key-value store, eg for storing session data
performance is slower than eg rds
highly available, multi-AZ by default, read–writes decoupled, DAW available for read cache
2 options for reads: eventually consistent or strongly consistent
security, authorization-authentication all done via iam
dynamodb streams integrate with lambda
backup-restore and global table feature – if you enable streams
monitoring via cloudwatch
but you can only query on primary key, sort key or indexes – exam q!
so you cannot query on “any” attribute – only the above.
serverless apps development, small docs 100s kb, distrib serverless cache, doesn’t have sql query language available, has transaction capability now built in
acts as a simple key/value store for objects
great for big objects up to 5TB, not so good for small objects
serverless, scales infinitely, strong consistency for every operation
tiers for migrating data: s3 standard, s3 IA, s3 one-zone IA, Glacier for backups
features include: versioning, encryption, CRR – cross-region replication
security: IAM, bucket policies, ACL
encryption: SSE-S3, SSE-KMS, SSE-C, client side encryption, SSL in transit
use case: static files, key-value stores for big files and website hosting
operations: no operations necessary!
security: IAM, bucket policies, ACL, encryption set up correctly,
reliability: extremely high, durability also extremely good, multi-AZ and CRR
performance: scales easily, very high read/write, multipart for uploads
cost: only pay for storage used, no need to provision in advance, plus network costs for transfer/retrieval, plus charge per number of requests
fully serverless database with SQL capability
used to query data in S3
pay per query
outputs results back to S3
is secured via IAM
use cases: one-time sql queries, serverless queries on S3, log analytics
Operations: no ops needed! is serverless
security: via S3 using bucket policies, IAM
reliability: is managed, uses Presto engine, highly available
performance: queries scale based on data size
cost: pay per query/per TB of data scanned, serverless
is a great data warehouse system
based on PostgreSQL but not used for oltp
instead, its an OLAP – online analytical processing — analytics and data warehousing
10x better performance than other data warehouses, scales to PBs of data
columnar storage of data rather than row-based
it is MPP – uses a massively parallel query execution engine which makes it extremely fast
pay as you go acc to no of instances provisioned
has an SQL interface for performing the queries.
data is loaded from S3, DynamoDB, DMS and other DBs,
1 to 128 nodes possible with up to 128 TB of space per node!
leader node used for query planning and results aggregation
compute node : performs the queries and sends results to leader node
Redshift Spectrum: performs queries directly against S3 with no load to load the data into the redshift cluster
Backup & Restore, SecurityVPC, IAM, KMS, monitoring
Redshift Enhanced VPC Routing: Copy / Unload goes through VPC, this avoids public internet
Redshift Snapshots and DR
has no multi-AZ node
the snapshots are p-i-t- point in time backups of a cluster, stored internally in S3
they are incremental — only changes are saved, makes it fast
can restore to a new cluster
automated snapshots: every 8 hrs every 5gb or according to a schedule, with a set retention period
manual snapshots: snapshot is retained until you delete it
you can configure Redshift to auto copy snapshots to a cluster of another region
either manually or automatically
loading data into redshift:
there are three possible ways:
1. use kinesis data firehouse loads data into redshift cluster via an s3 copy automatically to an S3 bucket
2. using the copy command manually without Kinesis
from S3 bucket via internet – ie without using enhanced vpc routing to redshift cluster
or via vpc not via internet, using enhanced vpc routing
3. from ec2 instance to redshift cluster using jdbc driver
in this case it is much better to write the data in batches rather than all at once.
must already have a redshift cluster operational
Spectrum is a way to query data that is already in S3 without having to load it into Redshift
you submit the query which is submitted to thousands of Redshift spectrum nodes for processing
Operations: similar to rds
security: uses iam, vpc, kms, ssl as for rds
reliability: auto healing, cross region snapshop copies possible
performance: 10x performance of other data warehousing systems, uses compression
cost: pay per node provisioned about 10% of cost of other dw systems
vs athena: is faster at querying, does joins, aggregations thanks to indexes
redshift = analytics/BI/Data Warehouse
is a managed extract transform and load ETL service
used to prepare and transform data for analytics
fully serverless service
fetch data from s3 bucket or rds, send to glue which does extracting, transforming and loading to redshift data warehouse
glue data catalog: catalog of all the datasets you have in aws – ie metadata info about your data
you deploy glue data crawler to navigate s3, rds, dynamo DB, it then writes the metadata to the glue data catalog.
this can then be used b Glue Jobs eTL
data discovery -> Athena, Redshift Spectrum, EMR for analytics
all you need to know about Glue for exam
if you hear graphs mentioned in exam, this refers to neptune!
is a fully managed graph database
high relationship data
social networking, eg users friends with users, replied to comment on post of user and likes other comments
knowledge graphs eg for wikipedia – links to other wiki pages, lots of these links
this is graph data – giving a massive graph
highly available across 3 AZs with up to 15 read replicas available
point in time recovery, continuous backup to S3
support for kms encryption, https, and ssl
operations: similar to rds
security: iam, vpc, kms, ssl – similar to rds, plus iam authentication
reliability: multi-AZ and clustering
performance: best suited for graphs, clustering to improve performance
cost: similar to rds, pay per provisioned node
remember neptune = graphs database
– the successor to AWS ElasticSearch
eg dynamodb only allows search by primary key or index,
whereas with OpenSearch you can search ANY field
often used as a complement to other db
also has usage for big data apps
can provision a cluster of instances
built-in integrations: various – kinesis data firehose, aws IoT, CloudWatch Logs
comes with visualization dashboard
operations: similar to rds
security: is via Cognito, IAM, KMS encryption, SSL and VPC
reliability: multi-AZ and clustering
performance: based on elasticsearch project – open source, pbyte scale
cost: pay per provisioned node
all you need to remember:
used to search and index data
Which database helps you store relational datasets, with SQL language compatibility and the capability of processing transactions such as insert, update, and delete?
Which AWS service provides you with caching capability that is compatible with Redis API?
Amazon ElastiCache is a fully managed in-memory data store, compatible with Redis or Memcached.
You want to migrate an on-premises MongoDB NoSQL database to AWS. You don’t want to manage any database servers, so you want to use a managed NoSQL database, preferably Serverless, that provides you with high availability, durability, and reliability. Which database should you choose?
Amazon DynamoDB is a key-value, document, NoSQL database.
You are looking to perform Online Transaction Processing (OLTP). You would like to use a database that has built-in auto-scaling capabilities and provides you with the maximum number of replicas for its underlying storage. What AWS service do you recommend?
Amazon Aurora is a MySQL and PostgreSQL-compatible relational database. It features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 128TB per database instance. It delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across 3 AZs.
As a Solutions Architect, a startup company asked you for help as they are working on an architecture for a social media website where users can be friends with each other, and like each other’s posts. The company plan on performing some complicated queries such as “What are the number of likes on the posts that have been posted by the friends of Mike?”. Which database do you recommend?
Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets.
You have a set of files, 100MB each, that you want to store in a reliable and durable key-value store. Which AWS service do you recommend?
Amazon S3 is indeed a key-value store! (where the key is the full path of the object in the bucket)
You would like to have a database that is efficient at performing analytical queries on large sets of columnar data. You would like to connect to this Data Warehouse using a reporting and dashboard tool such as Amazon QuickSight. Which AWS technology do you recommend?
You have a lot of log files stored in an S3 bucket that you want to perform a quick analysis, if possible Serverless, to filter the logs and find users that attempted to make an unauthorized action. Which AWS service allows you to do so?
Amazon Athena is an interactive serverless query service that makes it easy to analyze data in S3 buckets using Standard SQL.
As a Solutions Architect, you have been instructed you to prepare a disaster recovery plan for a Redshift cluster. What should you do?
enable automated snapshops, then configure your redshift cluster to autocopy the snapshots to another aws region
Which feature in Redshift forces all COPY and UNLOAD traffic moving between your cluster and data repositories through your VPCs?
Enhanced VPC Routing
You are running a gaming website that is using DynamoDB as its data store. Users have been asking for a search feature to find other gamers by name, with partial matches if possible. Which AWS technology do you recommend to implement this feature?
Anytime you see “search”, think ElasticSearch.
An AWS service allows you to create, run, and monitor ETL (extract, transform, and load) jobs in a few clicks.
AWS Glue is a serverless data-preparation service for extract, transform, and load (ETL) operations.