StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POInformix to Postgres, continuous data replication algorithm
text
Body
copied!<p>The master server is Informix, version varies from 9.40 to the latest, database is <strong>unlogged</strong> by design that can't be changed. Slave server is the latest PostgreSQL. Master and slave are separate machines, network latency is unpredictable. Master schema is statically defined, well known and does not change, so it's only the data that needs to be replicated. In the master, there are three types of tables:</p> <ol> <li>Numeric data tables, usually one date column, one time column and 15-300 int columns keyed by 2-3 primary keys. The data is never changed, only added once in a set interval (15, 30, or 60 minutes) and deleted when the retention point is reached. Replication data set can be up to 80,000 rows but usually is in the range of hundreds. This data needs to be replicated one way, master to slave. There is about 30 tables of this type and they need to be replicated all at once and as fast as possible, typically in under one minute after new interval set has been committed to the master.</li> <li>Mixed data tables, with date, time, int, and string types, 30-100 columns, again 2-3 primary keys. This data is also never changed, added continuously and is deleted when the retention point is reached. The data set is up to 100,000 rows per hour. One way replication is needed, master to slave. There are a few tables like that, less than 5 usually.</li> <li>Mixed data tables, with int and string types, less than 10 columns, 2-3 primary keys. The data largely stays intact, with occasional additions, edits or deletions. The usual replication set size is unpredictable, but probably will be in low hundreds of rows. This data needs to be replicated both ways, as fast as possible. There are a few tables of this type, and they need to be synched independently.</li> </ol> <p>I've been looking for an existing tool that could do what I need, but it looks like there is none that is open source. I'm probably going to write one for my needs, and I'm looking for advice from DB gurus on how to approach this task.</p> <p>In my estimate, there's probably no single algorithm that would cover all the use cases so I may be in fact looking for two or three algorithms. Here's what I found so far:</p> <ol> <li>Fire trigger on master changes, record row OIDs (does Informix have them?) to temp table, dump the changed rows to a file, transfer it and load up. Question: how to buffer the trigger? The master DB is unlogged (no transactions), so trigger will fire upon each INSERT. Additional strain on the master, not good.</li> <li>Add a cron job on the slave that will pull latest date/time keys from the master, and if the data is newer, pull it. Problem: although the update interval is defined, in reality it's based on the <em>data source</em> clock (not master DB clock) which is guaranteed to vary from slave server clock. More of it, there can be several data sources, each with varying clocks, and the data needs to be replicated ASAP. The only way here that I see is to constantly poll the master from the slave, hoping that by the time the poll comes in, the data is <em>all</em> committed (no transactions, remember?). Kludgy, slow, not good.</li> <li>Add Informix as foreign data wrapper in the Postgres and run queries directly instead of bothering with replication. Pros: simplicity. Cons: Informix connector seems to be in alpha stage, and the whole approach is an unknown factor at best.</li> </ol> <p>I've been researching this topic for some time, and it seems that the core of the problem is the lack of transactions on the master side. If the master DB was logged, it would be much easier to replicate it, but without transactions the task suddenly becomes much more complicated. For one, how do I ensure that there are no dupes? Another one, how to avoid update loops in type 3 tables? Considering all that, how to make replication as fast-reacting as possible? I mean the delay between data update and sync start here, data transfer is another topic altogether.</p> <p>Any input is appreciated.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload