Data Pipeline
You're looking at the docs for the Openbridge Data Pipeline product! The Pipeline product gives non-technical users a simple, automated toolset to deliver, process and store data of all sizes in a private warehouse. Let's dive in.
- What Is A Data Pipeline?
- Loading Files with the Data Pipeline
- Compressed Files
- Encoding Your Files
- How To Deliver Data
- Error Handling
- Connection Considerations
- Reference
- FAQs
A data pipeline is a series of well designed, intuitive and cohesive components built on top of the Openbridge platform. If you are familiar with loading data into relational data stores you'll find this familiar, but simpler and more automated.
An important concept for data pipelines is understanding how your data is delivered and organized. Typically, you will want to make sure that the relationship between the file(s) being delivered aligns with how you want them stored in the database. Not only does this ensure accuracy and consistency, it is a key element of the "hands-off" automation of your data pipeline workflows.
The first step is to log in to your Data Pipeline location using the credentials you were provided and create a directory for each unique set of data you will be loading. The name of each directory is very important as it will be the name of the table in the Redshift database (or other database) where you will access the data. It should therefore be a name that is descriptive enough for the end user to understand its contents. The folder name must also meet certain requirements. These requirements are outlined below...
- Must start with a letter
- Must contain only lowercase letters
- Can contain numbers and underscores ('_'), but no spaces or special characters (e.g. '@', '#', '%')
Valid folder name:
"customer_inquiries_jan_2014"
Invalid folder name:
"Customer inquiries 1/14"
The Openbridge platform will attempt to correct issues it detects with invalid directory naming. For example, it will automatically remove spaces in names. A folder like "Customer crm file" would result in a folder called "customercrmfile".
Below is an example of a simple directory structure with unique directories for each type of data. First, we have a parent customers directory. Within customers, there are four subdirectories: address, offers, rewards and transactions. The parent and subdirectories reflect a structure that aligns with how data needs to be organized for delivery by external systems.
customers/
├── address/
├── offers/
├── rewards/
└── transactions/
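If you want to set this structure up from the command line, the sketch below shows one way to do it with the OpenSSH sftp client and the connection details covered later in this document. It is not an official Openbridge script, and it assumes public key authentication so no password prompt is needed:
# Create the example directory structure over SFTP using a batch of commands.
sftp -P 2222 -b - user@pipeline.openbridge.io <<'EOF'
mkdir customers
mkdir customers/address
mkdir customers/offers
mkdir customers/rewards
mkdir customers/transactions
bye
EOF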
With the structure in place, data can be delivered. However, it is important to understand that there needs to be some organizing principles on what data is delivered and where it is delivered. For Openbridge to properly process and route your data, each pipeline needs to have a definition that maps to the object and/or files being delivered.
The Openbridge Data Pipeline accepts any UTF-8 encoded delimited text format (e.g. comma, tab, pipe, etc.) with column headers (header row) and a valid file type extension (.csv or .txt).
File names should meet the three criteria outlined below.
- Descriptive: The file name should include the data source, data description and date/date range associated with the data in each file (e.g. socialcom_paidsocial_20140801.txt or socialcom_paidsocial_20140701_20140801.txt)
- Unique: Unique files will be stored and accessible for auditing purposes. Files posted with the same name will be overwritten, making auditing impossible (e.g. socialcom_paidsocial_20140801.txt, socialcom_paidsocial_20140901.txt, socialcom_paidsocial_20141001.txt)
- Consistent: File naming patterns should be kept consistent over time to enable automated auditing (e.g. socialcom_paidsocial_20140801.txt, socialcom_paidsocial_20140901.txt, socialcom_paidsocial_20141001.txt)
This will facilitate any required auditing/QC of data received. It is highly likely that the use of non-standard file naming conventions will cause your import pipeline to fail or result in an outcome different from what you expect.
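As a hypothetical sketch, a source system could generate names that satisfy all three criteria like this (the source and description values are examples, not values required by Openbridge):
# Build a descriptive, unique, consistent file name: <source>_<description>_<YYYYMMDD>.txt
SOURCE="socialcom"
DESCRIPTION="paidsocial"
STAMP="$(date +%Y%m%d)"
FILENAME="${SOURCE}_${DESCRIPTION}_${STAMP}.txt"
echo "${FILENAME}"   # e.g. socialcom_paidsocial_20140801.txt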
Let's walk through an example, based on the customers directory structure we defined previously, when a collection of csv and zip files have been delivered to these directories.
customers/
├── 2016-01-01_customers.csv
├── 2016-01-02_customers.csv
├── 2016-01-03_customers.csv
├── address/
│   ├── address_dma_lookup/
│   ├── address_weather_lookup/
│   ├── 2016-01-01_address.csv
│   └── 2016-01-02_address.csv
├── offers/
│   ├── offers_open/
│   └── offers_targeted/
│       ├── 2016-01-01_offers.zip
│       └── 2016-01-02_offers.zip
├── rewards/
│   ├── 2016-01-01_rewards.zip
│   ├── 2016-01-02_rewards.zip
│   ├── 2016-01-03_rewards.zip
│   ├── 2016-01-04_rewards.zip
│   └── 2016-01-05_rewards.zip
└── transactions/
    ├── 2016-01-01_transactions.csv
    ├── 2016-01-02_transactions.csv
    ├── 2016-01-03_transactions.csv
    ├── 2016-01-04_transactions.csv
    ├── 2016-01-05_transactions.csv
    └── 2016-01-06_transactions.csv
In this example, a collection of files has been delivered to the parent directory customers and to the address, offers, rewards and transactions subdirectories. In this use case, a table is generated for each directory to which a file is delivered. In the above example you will have customers, address, offers_targeted, rewards and transactions tables. The content of each table will be based on the corresponding .csv or zip files placed in each directory.
Please note that each .csv within each directory must share the same data structure/layout. For example, all the "*_transactions.csv" files located in the customers/transactions/ directory share the following structure:
"ID","DATE","CUSTOMERID","PURCHASE","STOREID"
"0122","December 10, 2015","123432100A","8.43","4985"
"0123","December 10, 2015","543421111A","2.61","3212"
When additional "*_transactions.csv" files are delivered to customers/transactions/ they will be loaded into a table named transactions that looks like this...
id | date | customerid | purchase | storeid |
---|---|---|---|---|
0122 | December 10, 2015 | 123432100A | 8.43 | 4985 |
0123 | December 10, 2015 | 543421111A | 2.61 | 3212 |
... | ... | ... | ... | ... |
The file column headers are important as they will also be the field names used in the corresponding table. As with directory names, the column headers should be descriptive enough for the end user to understand their contents. The column headers must meet the same syntax requirements. These requirements are outlined below...
- Must start with a letter
- Must contain only lowercase letters
- Can contain numbers and underscores ('_'), but no spaces or special characters (e.g. '@', '#', '%')
Valid column header:
"inquiries_to_purchase_pct"
Invalid column header:
"Inquiries to Purchase %"
If the field names in the batch file do not meet these requirements, the system will automatically perform the following transformations...
- Uppercase letters will be converted to lowercase
- Spaces will be converted to underscores ('_')
- Special characters will be dropped
If the system cannot automatically perform this transformation, data will fail to load or will load in a manner that is inconsistent with expectations.
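To see what those transformations do, here is a minimal sketch that applies the same three rules to a raw header value. It is purely illustrative and is not Openbridge's actual implementation:
# Illustrative only: apply the three header transformations described above.
normalize_header() {
  local raw="$1"
  local lowered="${raw,,}"            # uppercase letters -> lowercase (bash 4+)
  local underscored="${lowered// /_}" # spaces -> underscores
  # drop anything that is not a lowercase letter, number, or underscore
  printf '%s\n' "$underscored" | tr -cd 'a-z0-9_\n'
}

normalize_header "Customer ID"   # prints: customer_id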
The first file delivered to each directory acts as a 'blueprint' for the corresponding database table, defining the field layout (number and order of fields), field formats (text, integer, date, decimal) and expected field delimiter (tab, comma, pipe). The successful load of each subsequent file delivered to that directory is dependent on it following the layout established by the initial 'blueprint' file. Let's walk through an example to illustrate...
Let's assume that the initial file for the transactions directory looked like the sample file below, and that *_transactions.csv files following this format were delivered to the transactions directory on a daily basis for most of the year:
"ID","DATE","CUSTOMERID","PURCHASE","STOREID"
"0122","December 10, 2015","123432100A","8.43","4985"
"0123","December 10, 2015","543421111A","2.61","3212"
However, let us say that in November a change was made upstream which changed the layout of the *_transactions.csv file being delivered. A file named 2016-11-15_transactions.csv was delivered to customers/transactions/ and the file contained an additional field called LOYALTYID:
"ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
The addition of the LOYALTYID field creates a mismatch between the old structure and the new. The load process for 2016-11-15_transactions.csv (or any other file like it) into the transactions table will fail because the underlying structure is different.
If this situation arises, please contact Openbridge Support ([email protected]).
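One way to catch this before it happens is to compare the header row of each new file against the header of the original 'blueprint' file prior to delivery. Below is a hedged sketch (not an Openbridge tool) using the example file names from above:
# Compare a new file's header row against the blueprint file's header row.
blueprint_header="$(head -n 1 2016-01-01_transactions.csv)"
new_header="$(head -n 1 2016-11-15_transactions.csv)"

if [ "$blueprint_header" != "$new_header" ]; then
  echo "WARNING: header layout changed; contact [email protected] before delivering"
else
  echo "OK: header matches the blueprint file"
fi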
Openbridge supports the delivery and processing of compressed files in zip, gzip or tar.gz formats. However, since we do not know the contents of an archive prior to its arrival and our unpacking of it, custom processing is needed to ensure we handle the contents in the manner you want us to. This is especially important when a single archive contains a number of distinct files that have different data types and structures. Here are a few different use cases that arise with compressed files.
In this example we have *_rewards.zip archives that are delivered to customers/rewards/. The 2016-01-01_rewards.zip archive contains three files: 2016-01-01a_rewards.csv, 2016-01-01b_rewards.csv and 2016-01-01c_rewards.csv. The archive structure looks like this:
customers/
├── rewards/
│   ├── 2016-01-01_rewards.zip
│   │   ├── 2016-01-01a_rewards.csv
│   │   ├── 2016-01-01b_rewards.csv
│   │   └── 2016-01-01c_rewards.csv
The system will unpack 2016-01-01_rewards.zip according to the contents of the archive. This would result in three directories, one for each file.
customers/
├── rewards/
│   ├── 2016-01-01_rewards.zip
│   ├── 2016-01-01a_rewards/
│   │   └── 2016-01-01a_rewards.csv
│   ├── 2016-01-01b_rewards/
│   │   └── 2016-01-01b_rewards.csv
│   └── 2016-01-01c_rewards/
│       └── 2016-01-01c_rewards.csv
The unpacked directory is based on the name of the file. For example, a filename of 2016-01-01a_rewards.csv results in a directory named 2016-01-01a_rewards/.
In this example all *_rewards.zip archives share a common schema. This means 2016-01-01a_rewards.csv, 2016-01-01b_rewards.csv and 2016-01-01c_rewards.csv share the same exact structure:
"ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
So regardless of the unpacking directory hierarchy, everything in rewards/ shares a schema and all data will be routed to a common rewards table.
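For reference, an archive like the one above could be assembled before delivery with the standard zip tool. This is just a sketch using the example file names; the names themselves are not prescribed by Openbridge:
# Package the three rewards files into a single archive for delivery.
zip 2016-01-01_rewards.zip \
  2016-01-01a_rewards.csv \
  2016-01-01b_rewards.csv \
  2016-01-01c_rewards.csv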
In this example we have more complex archives called *_offers.zip. These archives are delivered to customers/offers/offers_targeted/.
The 2016-01-01_offers.zip archive contains three different files: 2016-01-01_source.csv, 2016-01-01_campaigns.csv and 2016-01-01_costs.csv.
The unpacked archive structure would follow the same pattern as the simple file example. Each file would be unpacked into its own directory, with the directory name taken from the file name.
customers/
├── offers/
│   ├── offers_open/
│   └── offers_targeted/
│       ├── 2016-01-01_offers.zip
│       ├── 2016-01-01_source/
│       │   └── 2016-01-01_source.csv
│       ├── 2016-01-01_campaigns/
│       │   └── 2016-01-01_campaigns.csv
│       └── 2016-01-01_costs/
│           └── 2016-01-01_costs.csv
The files 2016-01-01_source.csv, 2016-01-01_campaigns.csv and 2016-01-01_costs.csv have the following structures.
Source
"ID","DATE","CHANNELID", "CAMPAIGNID","PARTNERID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Campaigns
"ID","DATE","CAMPAIGNID","RESPONSE","ROI","OFFERID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Costs
"ID","DATE","SPEND","BUDGET","ADID","CAMPAIGNID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Unlike the earlier rewards example, these files represent three related, but different, classes of data. This means they are not stored in a single offers table as in the rewards example, due to these differences.
The table names are user defined. This allows flexibility in how you want to handle this situation. For example, you may want 2016-01-01_source.csv, 2016-01-01_campaigns.csv and 2016-01-01_costs.csv routed to offers_source, offers_campaigns and offers_costs tables.
To properly support delivery of complex compressed files, please contact Openbridge support so the proper handling and routing of compressed data can occur.
Openbridge supports UTF-8 filename character sets. It also supports ASCII, a subset of UTF-8. The system will perform any character set decoding checks as the file is being delivered and will automatically discover the encoding used.
While we may accept the delivery of a file, there is no guarantee that downstream data processing pipelines will work when data that is not UTF-8 encoded is sent. A data pipeline will automatically attempt to convert the encoding to UTF-8. However, this might not work in all cases. We suggest, if feasible, sending data that is already UTF-8 encoded. Most systems generate UTF-8 encoded data.
Not sure how to check encoding? Here is an example using the command line. To check the encoding of a file called foo.csv, in your terminal you can type the following:
For Linux you would type:
$ file -i foo.csv
For OS X it is almost the same command:
$ file -I foo.csv
The output should contain a charset= value, which indicates the character encoding for the file. In the case of our foo.csv the results are as follows:
text/plain; charset=us-ascii
Notice that charset=us-ascii says the file is ASCII. Perfect! The file is ready to be transferred.
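If the reported charset is something other than UTF-8 or ASCII, you could convert the file yourself before delivery. A minimal sketch with iconv, assuming the source encoding reported by file was ISO-8859-1:
# Convert foo.csv to UTF-8 (adjust -f to whatever charset `file -i` reported).
iconv -f ISO-8859-1 -t UTF-8 foo.csv > foo_utf8.csv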
Pipeline supports the following transfer methods: SFTP and SCP. These are simple methods for transferring files between systems. FTP is an insecure protocol and is not supported. SFTP is integrated into many graphical tools and can also be used directly from the command line.
SFTP is based on the SSH2 protocol, which encrypts all activity over a secure channel. Unlike FTP, SSH2 uses only a single TCP connection and sends multiple transfers, or "channels", over that single connection.
Openbridge currently supports public key and password authentication for both the SFTP and SCP file transfer protocols. If you need to use public key authentication, please submit a support ticket to [email protected] and we can set that up for you. We do NOT support any shell access. Also, not all SFTP commands are accepted.
Connection Details:
Host: pipeline.openbridge.io
Port: 2222 or 443
Protocol: SFTP
User: Provided separately
Password: Provided separately
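For example, connecting from the command line with the OpenSSH sftp client would look like the sketch below (the user name is a placeholder; yours is provided separately):
# Connect over port 2222, or fall back to 443 if 2222 is blocked outbound.
sftp -P 2222 [email protected]
sftp -P 443 [email protected]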
Openbridge no longer supports the use of FTP. We recognize that there are some systems that can only deliver data via FTP. For example, many of the Adobe export processes typically occur via FTP. However, it should be noted that FTP offers no encryption for the connection or for data transport. Using the FTP protocol is regarded as unsafe. We therefore require the use of SFTP or SCP connections to ensure that data is securely transferred.
If you are having connection difficulties, please make sure you can make outbound network connections. For those using SFTP, make sure outbound port 2222 or port 443 is open.
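A quick way to confirm outbound connectivity, assuming netcat (nc) is available on your system:
# Test that the SFTP ports are reachable from your network.
nc -vz pipeline.openbridge.io 2222
nc -vz pipeline.openbridge.io 443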
To help protect the integrity of the data sent to Openbridge, we do not allow delivery of certain files. For example, you are not allowed to send a file called app.exe or my.sql. This reduces the potential for introducing unwanted or malicious software threats. The following is a sample of the blocked file extensions:
ade|adp|app|ai|asa|ashx|asmx|asp|bas|bat|cdx|cer|cgi|chm|class|cmd|com|config|cpl|crt|csh|dmg|doc|docx|dll|eps|exe|fxp|ftaccess|hlp|hta|htr|htaccess|htw|html|htm|ida|idc|idq|ins|isp|its|jse|ksh|lnk|mad|maf|mag|mam|maq|mar|mas|mat|mau|mav|maw|mda|mdb|mde|mdt|mdw|mdz|msc|msh|msh1|msh1xml|msh2|msh2xml|mshxml|msi|msp|mst|ops|pdf|php|php3|php4|php5|pcd|pif|prf|prg|printer|pst|psd|rar|reg|rem|scf|scr|sct|shb|shs|shtm|shtml|soap|stm|tgz|taz|url|vb|vbe|vbs|ws|wsc|wsf|wsh|xls|xlsx|xvd
If you attempt to send a file that matches a blocked file then the transfer will not be allowed.
Hidden files are not allowed on the server. Hidden files have a dot prefix (.) in the file name. For example, .file.txt or .foo would be considered hidden files and would be blocked.
The following are examples of names using a dot prefix:
.file.txt
.file.csv
.file
..file
If you attempt to send a file that has a . prefix, the transfer will not be allowed.
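If you want to catch these cases before a transfer is attempted, a simple pre-flight check can help. This is a hedged sketch, not an Openbridge tool, and it only tests a few of the blocked extensions listed above:
# Flag hidden (dot-prefixed) files and a few blocked extensions before upload.
check_filename() {
  local name
  name="$(basename "$1")"
  case "$name" in
    .*) echo "BLOCKED: $name is a hidden (dot-prefixed) file" ;;
    *.exe|*.sql|*.php|*.pdf|*.html) echo "BLOCKED: $name has a disallowed extension" ;;
    *) echo "OK: $name" ;;
  esac
}

check_filename ".file.txt"                    # hidden file
check_filename "app.exe"                      # blocked extension
check_filename "2016-01-01_transactions.csv"  # OK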
In most cases, file transfers occur without incident. However, there are situations where an error may arise. Troubleshooting an error can be a difficult and time-consuming endeavor, especially in cases where a system may be sending thousands of files a day.
Openbridge does not have visibility into what should be sent from a source system. It only knows what was sent by the source system. For example, if there are 100 files in a source system and only 50 were sent, Openbridge is only aware of the 50 delivered files. We will not know that an additional 50 files were not delivered.
The source system should be tracking what was delivered and what was not delivered. We suggest that a manifest of files is maintained in the source system. This manifest would identify the files to be delivered and their state (success, failure, pending). The manifest procedure allows the source system to recover from certain errors, such as a failure of the network, the source/host system or the transfer process. In the event of an error, the source system would know to attempt a redelivery of any file that had not received a successful "226" code from Openbridge.
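As a sketch of the manifest idea, the source system could keep a small delimited file listing each file, its checksum and its delivery state, then re-attempt anything that never reached success. The format below is hypothetical, not something Openbridge requires:
# manifest.csv (hypothetical format): filename,md5,state
#   2016-01-01_transactions.csv,<md5>,success
#   2016-01-02_transactions.csv,<md5>,pending
while IFS=',' read -r file md5 state; do
  if [ "$state" != "success" ]; then
    echo "Re-delivering $file ..."
    # re-run your SFTP/SCP transfer for "$file" here, then update its state
  fi
done < manifest.csv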
Your client will normally be sent a response code of "226" to indicate a successful file transfer. However, there are other status codes that may be present. See below:
- A code 226 is sent if the entire file was successfully received and stored
- A code 331 is sent due to a login error. The name is okay, but you are missing a password.
- A code 425 is sent if no TCP connection was established. This might mean the server is not available or some other network error occurred. Change from PASV to PORT mode and check your firewall settings.
- A code 426 is sent if the TCP connection was established but then broken by the client or by network failure. Retry later.
- A code 451, 452, or 552 is sent if the server had trouble saving the file to disk. Retry later.
- A 553 code means the requested action was not taken because the file name sent is not allowed. Change the file name or remove spaces/special characters from the file name.
If you are stuck with an error condition, reach out to Openbridge Support ([email protected]) for help.
Openbridge uses checksum (or hash) validation to ensure the integrity of the file transfer. Openbridge uses a 128-bit MD5 checksum, which is represented by a 32-character hexadecimal number. This only applies to files once they are in our possession. We suggest that source systems also calculate an MD5 checksum in advance of file delivery to increase confidence that the integrity of the file to be delivered is met. Depending on the tools employed by a source system, the checksum process might be built-in.
Employing file integrity checks will help you cross-check that the files delivered from the source system match what was delivered to Openbridge.
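Calculating the checksum on the source system is straightforward from the command line, for example:
# MD5 checksum of a file before delivery.
# Linux:
md5sum 2016-01-01_transactions.csv
# OS X:
md5 2016-01-01_transactions.csv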
While the use of MD5 is the default, Openbridge can support other checksum commands when sending data to us:
- XCRC (requests CRC32 digest/checksum)
- XSHA/XSHA1 (requests SHA1 digest/checksum)
- XSHA256 (requests SHA256 digest/checksum)
- XSHA512 (requests SHA512 digest/checksum)
If you need assistance with any of these, please contact Openbridge Support.
Openbridge employs a DNSBL (commonly known as a 'blocklist'). This is a database that is queried in real time for the purpose of obtaining an opinion on the origin of incoming hosts. The role of a DNSBL is to assess whether a particular IP address meets the acceptance policies for inbound connections. DNSBLs are often used by email, web and other network services to determine and reject addresses known to be sources of spam, phishing and other unwanted behavior.
More information on DNS blacklists can be found here: http://en.wikipedia.org/wiki/DNSBL
Openbridge employs an anti-virus toolkit to scan for viruses, trojans, and other questionable items to prevent them from being uploaded to our system. The process is designed to detect in real-time threats present in any incoming files. This means any file sent to Openbridge will be scanned as it is streamed to us prior to allowing the file upload to be fully written to the filesystem.
Any uploaded file meeting the criteria of a threat will result in the transfer being rejected.
Openbridge employs a dynamic "ban" list that prevents the banned user or host from logging in to the server. This will occur if our system detects 4 incorrect login attempts. The ban will last for approximately 30 minutes, at which time you can attempt to log in again. If you continue to have difficulties please contact support.
Openbridge sets the maximum number of seconds a connection between our server and your client can exist after the client has successfully authenticated. This is typically 10 minutes (600 seconds). If you are idle for longer than this time allotment, the server will assume you are finished and disconnect your client. If this occurs you will simply need to reconnect.
There are plenty of FTP clients to choose from. They are super easy to use and provide a wealth of features to help you manage the process of transferring files. There are also CLI tools and programmatic methods of file transfer.
- WinSCP, aka Windows Secure Copy, is a free, open-source FTP client.
- FireFTP is a Firefox extension that integrates a powerful FTP client directly into your favorite browser.
- Cyberduck is a free, open-source FTP client for Mac OS X
- FileZilla is a free, open-source FTP client for Windows, Mac, and Linux.
- Transmit is a Mac OS X client packed to the brim with innovative features.
- SmartFTP is Windows FTP (File Transfer Protocol), FTPS, SFTP, WebDAV, S3, Google Drive, OneDrive, SSH, Terminal client.
- CuteFTP sits alongside FileZilla as the best-known names in FTP. Cost is about $40.
If you prefer the command line, then curl (https://curl.haxx.se/) is an excellent choice. curl is used in command lines or scripts to transfer data. Here are some sample code snippets showing bash functions that leverage curl:
Sending a file via FTP (note: Openbridge no longer accepts FTP connections, so this is shown for reference only)
function ftp_send_file() {
echo "STARTING: FTP testing..."
( curl -T /home/customers.csv ftp://pipeline.openbridge.io:21/customer/ -u user:password )
if [ $? -ne 0 ]; then echo "ERROR: FTP transfer test failed" && exit 1; else echo "success"; fi
}
Send file via SFTP
function sftp_send_file() {
echo "STARTING: SFTP testing...."
( curl -T /home/customers.csv -u user:password sftp://pipeline.openbridge.io:2222/customer/ -k )
if [ $? -ne 0 ]; then echo "ERROR: SFTP transfer test failed" && exit 1; else echo "success"; fi
}
You can embed file transfers into your programs too. For example, Paramiko is a great choice for Python users. What is Paramiko? Paramiko is a Python (2.6+, 3.3+) implementation of the SSHv2 protocol, providing both client and server functionality. While it leverages a Python C extension for low-level cryptography (Cryptography), Paramiko itself is a pure Python interface around SSH networking concepts.
Here is a demo of a SFTP client written with Paramiko.
If you don’t see data that you are expecting to see in one or more Redshift tables, there are a few steps you can take to diagnose the issue…
- Verify that the file(s) containing the data in question were successfully posted to the FTP location. Most FTP software includes search functionality that allows you to search the relevant folder for the filename or a portion of the filename (e.g. filename contains 'foo').
  Resolution: If the file was not posted to the FTP location, attempt to re-post (or re-request) the data and check the Redshift table to see if it has loaded. Depending on the size of the file and other files in the processing queue, this could take anywhere from a few seconds to several minutes.
- If the file was successfully posted to the FTP location, download a copy of the file to your desktop and open the file in a text editor (e.g. Notepad) or Excel to check the following issues that could result in a load failure:
  - Does the file contain data?
  - Is the file layout (i.e. columns and column order) the same as its corresponding Redshift table?
  - Is the file delimited properly (e.g. tab or comma-quote) and are the delimiters consistent with the initial load file for the table?
  - Are the values for each field formatted properly (e.g. text, number, date) and consistent with the initial load file for the table?
  Resolution: Fix the error(s) in the file, re-post the file to the FTP location with a new name (e.g. if the original file was named some_data_20150101.csv, rename the new file to something like some_data_20150101_2.csv) and check the Redshift table to see if it has been successfully loaded.
- If the file passes the above checks, please submit a support ticket by emailing [email protected] so that a support agent can assist with the diagnosis and resolution of the load error. To facilitate the diagnostic efforts, please be sure to include the following in the email:
  - Redshift table name with the missing data
  - Criteria that can be used to identify the missing row(s) in the table (e.g. table 'foo' is missing data for the date = xx/xx/xxxx)
  - Filename (and FTP path name if known) for the unloaded data
The initial load file is the first file that is posted to a particular folder. This file is what generates and defines the properties for the respective Redshift table (field names, order and formats). Subsequent files posted to a folder need to have the same specifications as the initial load file to be loaded successfully.
The Openbridge platform creates a unique id (ob_transaction_id) based on the hash of each row of data processed. If a row of data is subsequently processed with the same hash value, it will automatically be excluded from the load process for the table in question. However, if any field values differ, the hash will also be different and the data will be loaded. Let's look at an example...
Let's say on 12/11/15 you posted a file named 20151211_ad_performance.csv to a table named ad_performance:
"ADID","DATE","CLICKS","IMPRESSIONS","CAMPAIGNID"
"0123","December 10, 2015","12","120","A102B"
"4567","December 10, 2015","25","125","A904F"
The records loaded to the ad_performance table would look like this...
adid | date | clicks | impressions | campaignid | ob_transaction_id | ob_modified_date |
---|---|---|---|---|---|---|
0123 | December 10, 2015 | 12 | 120 | A102B | abcd1234 | 12/11/15 8:45:20 |
4567 | December 10, 2015 | 25 | 125 | A904F | defg5678 | 12/11/15 8:45:20 |
You will see that the system fields ob_transaction_id and ob_modified_date have been added to your table; they represent the unique id for that record and the timestamp of when the record was loaded to the table.
Then let's say the next day, 12/12/15, you posted another file named 20151212_ad_performance.csv that included ad performance data for both 12/10/15 and 12/11/15:
"ADID","DATE","CLICKS","IMPRESSIONS","CAMPAIGNID"
"0123","December 10, 2015","12","120","A102B"
"0123","December 11, 2015","18","100","A102B"
"4567","December 10, 2015","25","125","A904F"
"4567","December 11, 2015","20","180","A904F"
The Openbridge platform will analyze this file, identify the 1st and 3rd records in this file as duplicates because all field values are the same, and exclude those records from the load process into the ad_performance table. The ad_performance table will now look like this...
adid | date | clicks | impressions | campaignid | ob_transaction_id | ob_modified_date |
---|---|---|---|---|---|---|
0123 | December 10, 2015 | 12 | 120 | A102B | abcd1234 | 12/11/15 8:45:20 |
0123 | December 11, 2015 | 18 | 100 | A102B | hijk9012 | 12/12/15 10:05:10 |
4567 | December 10, 2015 | 25 | 125 | A904F | defg5678 | 12/11/15 8:45:20 |
4567 | December 11, 2015 | 20 | 180 | A904F | lmno3456 | 12/12/15 10:05:10 |
However, let's say that your data changes slightly over time or reflects lifetime values for a particular metric, and the file you posted on 12/12/15 looks like this...
"ADID","DATE","CLICKS","IMPRESSIONS","CAMPAIGNID"
"0123","December 10, 2015","12","121","A102B"
"0123","December 11, 2015","18","100","A102B"
"4567","December 10, 2015","25","125","A904F"
"4567","December 11, 2015","20","180","A904F"
In this case, the value in the impressions field for ADID 0123 on 12/10/15 changed from 120 to 121. As a result, the row of data will no longer have the same hash value as the row from the previous file and it will be loaded to the ad_performance table, which will now look like this...
adid | date | clicks | impressions | campaignid | ob_transaction_id | ob_modified_date |
---|---|---|---|---|---|---|
0123 | December 10, 2015 | 12 | 120 | A102B | abcd1234 | 12/11/15 8:45:20 |
0123 | December 10, 2015 | 12 | 121 | A102B | pqrs7890 | 12/12/15 10:05:10 |
0123 | December 11, 2015 | 18 | 100 | A102B | hijk9012 | 12/12/15 10:05:10 |
4567 | December 10, 2015 | 25 | 125 | A904F | defg5678 | 12/11/15 8:45:20 |
4567 | December 11, 2015 | 20 | 180 | A904F | lmno3456 | 12/12/15 10:05:10 |
As you can see, there are now two rows of data for ADID 0123 on 12/10/15 with different values for ob_transaction_id and ob_modified_date. Depending on how your reporting queries are structured, this could result in overstating impressions for ADID 0123 on 12/10/15.
If your data changes over time, it is best practice to only load data for a particular date one time. If you are faced with this situation you have a couple of options...
- Create a view against the ad_performance table (or modify your query logic) so that it returns the row of data with the MAX (or MIN) ob_modified_date for each ADID and date combination.
- Contact Openbridge Support ([email protected]) to create a customized view, or delete the duplicate data from the ad_performance table based on ob_modified_date.