gap intelligence has been gathering Great Freakin’ Data (GFD) for its clients for over a decade, so it should come as no surprise that the database that warehouses this information, GFDB, is vast and stores millions and millions of records. As software engineers, when we build new features for our applications, we need to be able to work with a relevant dataset in our local development environments. In the past, we could copy the data from GFDB onto our machines, but as GFDB has grown exponentially in size, this is no longer a reasonable option because it consumes too much space.
To solve this issue, I used a recent Dev Day to create a shell script that automates creating a partial copy of the data in GFDB. In my initial research for writing this script, I found that while there were plenty of resources for copying whole databases or tables, there were few for creating partial or limited copies. I hope that sharing my approach will be helpful to another developer in the future:
Copy the database structure
Create a function to copy your database schema. GFDB uses PostgreSQL, so I leveraged PostgreSQL’s pg_dump command with the --schema-only option. (See the docs for more information about pg_dump options.)
Ex)
BACKUP_SCHEMA() {
  pg_dump -U database_username -h database_host --schema-only database_name \
    | psql local_database
}
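One note: the schema restore above assumes the local target database already exists. If it doesn’t, you can create an empty one first with PostgreSQL’s createdb utility (a one-line sketch, assuming the local_database name from the example):
# create an empty local database to restore the schema into
createdb local_database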
Copy full tables (if needed)
Create a function to copy any tables in full if needed. You can also use pg_dump for this with the -t (table) and -a (data only) options to copy all the data from specified tables.
Ex)
COPY_FULL_TABLES() {
  pg_dump -U database_username -h database_host -a -t table_1_name -t table_2_name \
    database_name | psql local_database
}
Copy the partial datasets
Determine which tables you need to copy partially and write a function that will 1) query the table for the dataset you want and 2) output the results of the query to a CSV file stored in a temporary folder.
Ex)
COPY_PARTIAL_TABLE_A() {
  psql -h database_host -d database_name -U database_username \
    -c "\\COPY (SELECT * FROM table_name ORDER BY updated_at DESC LIMIT 2000) TO 'path_to_temp_folder/table_name.csv' WITH (DELIMITER ',', FORMAT CSV)"
}
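Keep in mind that \COPY runs on the client side, so the CSV file lands on your local machine; the temporary folder therefore needs to exist before this function is called (a one-line sketch using the placeholder path above):
# make sure the temp folder for the exported CSV files exists
mkdir -p path_to_temp_folder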
Import the dataset
Create a function that will import the CSV file into your local database.
Ex)
IMPORT_PARTIAL_TABLE_A() {
  psql -h localhost -d local_db_name -U local_db_user \
    -c "\\COPY table_name FROM 'path_to_temp_folder/table_name.csv' CSV"
}
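Also worth noting: \COPY ... FROM appends rows, so re-running the script against an existing local copy will duplicate data (or fail on unique constraints). If you want the import to be repeatable, one option (not part of the original script) is to clear the table first:
# optional: empty the local table so repeated runs don't duplicate rows
psql -h localhost -d local_db_name -U local_db_user -c "TRUNCATE table_name;"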
Repeat the copy and import steps for as many tables as you need
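If you have a lot of tables to copy partially, an alternative to writing a separate pair of functions for each one is a single parameterized pair that takes the table name and row limit as arguments. The sketch below is not part of the original script; it reuses the placeholder hosts and paths from the examples and assumes each table has an updated_at column:
COPY_PARTIAL_TABLE() {
  local table="$1" limit="$2"
  # export the most recently updated rows of the given table to a CSV in the temp folder
  psql -h database_host -d database_name -U database_username \
    -c "\\COPY (SELECT * FROM ${table} ORDER BY updated_at DESC LIMIT ${limit}) TO 'path_to_temp_folder/${table}.csv' WITH (DELIMITER ',', FORMAT CSV)"
}

IMPORT_PARTIAL_TABLE() {
  local table="$1"
  # load the exported CSV into the matching table in the local database
  psql -h localhost -d local_db_name -U local_db_user \
    -c "\\COPY ${table} FROM 'path_to_temp_folder/${table}.csv' CSV"
}

# example usage (hypothetical table name):
# COPY_PARTIAL_TABLE orders 2000
# IMPORT_PARTIAL_TABLE orders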
Execute these functions in your script.
Call the functions above at the end of your script file. The function that copies your database structure should be executed first.
Ex)
***partial_db_copy_file.sh***
BACKUP_SCHEMA
COPY_FULL_TABLES
COPY_PARTIAL_TABLE_A
IMPORT_PARTIAL_TABLE_A
COPY_PARTIAL_TABLE_B
IMPORT_PARTIAL_TABLE_B
COPY_PARTIAL_TABLE_C
IMPORT_PARTIAL_TABLE_C
# etc.
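Then run the script from your terminal (assuming the file name above):
bash partial_db_copy_file.sh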
Additional tip
If you’re using PostgreSQL and your database requires credentials to connect, it helps to create a .pgpass file in your home directory so that you don’t have to enter your credentials for every function that opens a database connection. The file should follow the format below:
hostname:port:database:username:password
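For example, an entry for the local database used above might look like this (hypothetical port and password):
localhost:5432:local_db_name:local_db_user:local_db_password
Note that PostgreSQL ignores the file unless it is readable only by its owner, so set its permissions with chmod 0600 ~/.pgpass.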
The final version of my Dev Day script to make a partial copy of GFDB was more complicated, with several copy and import functions, longer queries, and additional functions to create materialized views. I’ve outlined and simplified the process above in the hope that it will make the task more straightforward for another developer in the future.
For more than 17 years, gap intelligence has served manufacturers and sellers by providing world-class services monitoring, reporting, and analyzing the 4Ps: prices, promotions, placements, and products. Email us at info@gapintelligence.com or call us at 619-574-1100 to learn more.