This page provides you with instructions on how to extract data from MariaDB and load it into Amazon Redshift on an ongoing basis. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)
What is MariaDB?
MariaDB is an open source database based off of MySQL. Its development is led by the original developers of MySQL after it was acquired by Oracle. MariaDB is often used as an open source alternative to a traditional MySQL database.
Pulling Data Out of MariaDB
There are several methods for extracting data from MariaDB, and the one you use will probably be dependent upon your needs (and skill set).
The most common way is simply writing queries. SELECT queries allow you to pull exactly the data you want by specifying filters, ordering, and limiting results. If you have a specific subset of data in mind or are looking to continuously monitor a subset of a specific table, SELECT queries may be a good fit.
If you’re just looking to export data in bulk, however, there may be an easier way. A handy command-line tool called mysqldump allows you to export entire tables and databases in a format you specify (i.e. delimited text, CSV, or SQL queries that would restore the database if run).
Preparing MariaDB Data for Redshift
Here’s the tricky part: for every table in your MariaDB database, you need a matching table in Redshift to receive the data. Thankfully, Redshift’s syntax is based on the syntax of Postgres, another relational database built on the SQL standard. If you run a command like SHOW CREATE TABLE, you will receive syntax that will serve as a good starting point for creating a Redshift table that can receive the data.
That said, however, it’s not a one-to-one match. You’ll need to familiarize yourself with the Redshift CREATE TABLE statement and the implications of, for example, selecting certain sort keys at creation time. Redshift is a very different beast than MariaDB and it’s important that you appreciate the performance implications that your table structure can create.
Inserting MariaDB Data into Redshift
With a table built, it may seem like the easiest way to add your exported data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.
Keeping Data Up-To-Date
So, now what? You’ve built a script that pulls data from MariaDB and loads it into Redshift, but what happens tomorrow when you have new and updated records in your MariaDB database?
Depending on how you’ve built your script, you may be forced to load your entire database into Redshift again. This might be slow and painful, or even have performance implications on your MariaDB instance.
The key is to build your script in such a way that it can also identify incremental updates to your data. If your MariaDB tables have fields like modified_at or auto-incrementing primary keys, you can build a script that can quickly identify records that are new or changed since your last update (or since the newest record you’ve copied into Redshift). You can set your script up as a cron job or continuous loop to keep pulling down new data as it appears.
Other Data Warehouse Options
Redshift is totally awesome, but sometimes you need to start smaller or optimize for different things. In this case, many people choose to get started with Postgres, which is an open source RDBMS that uses nearly identical SQL syntax to Redshift. If you’re interested in seeing the relevant steps for loading this data into Postgres, check out MariaDB to Postgres
Easier and Faster Alternatives
Here’s the deal: if you have all the skills necessary to go through this process, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.
Thankfully, products like Stitch were built to solve this problem automatically. With just a few clicks, Stitch starts extracting your MariaDB data, building a matching structure in your Amazon Redshift data warehouse, and inserting that data automatically as it changes.