Anonymize Dataset

Populate or anonymize a dataset for development with Faker

Developed by @JasonRikard

Updated: April 14, 2014

Two datasets come up again and again in a team’s workflow: a small sample dataset and an anonymized, production-sized dataset. Both are easy to generate with Faker. Once you have a dataset, you can specify its location in your Chef-gm environments, and it will automatically populate your developers’ local databases when their development VMs are provisioned. Glorious.

Faker

Faker is a PHP library that generates fake data for you. There are versions for other languages too, such as Ruby. I use it to quickly write scripts that I can incorporate into my workflow.

Once you begin using the library, you may run into a few nuances with string formatting and availability, so feel free to manipulate the output to fit your column definitions. Here is an example detailed in Ruby by thoughtbot.
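As a minimal sketch of fitting generated values to a column definition, the helpers below trim a string to a column's maximum length and reduce a phone number to its digits. The function names and the sample values are hypothetical and not part of Faker; the inputs stand in for whatever a Faker formatter returns.

```php
<?php
// Hypothetical helpers (not part of Faker) for fitting generated values
// to a column definition.

// Trim a value so it fits a VARCHAR($maxLength) column.
// Uses mb_substr when the mbstring extension is loaded, so multibyte
// characters are not cut in half; falls back to substr otherwise.
function fitToColumn(string $value, int $maxLength): string
{
    return function_exists('mb_substr')
        ? mb_substr($value, 0, $maxLength)
        : substr($value, 0, $maxLength);
}

// Keep only the digits of a phone number, e.g. for a digits-only column.
function digitsOnly(string $phone): string
{
    return (string) preg_replace('/\D+/', '', $phone);
}

// Stand-ins for values a Faker formatter might return.
echo fitToColumn('Maximilian Alexander Featherstonehaugh', 20), PHP_EOL;
echo digitsOnly('(555) 010-4477 x123'), PHP_EOL;
```

Wrapping each Faker call in a helper like these keeps the column rules in one place, so a schema change only touches the helper.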

Boilerplate Example

The best way to get started is to create a new project and install the library with Composer. Then you are free to write your script. I created a boilerplate called faker-this to make this step faster.

Setup

$ git clone https://github.com/Jsnrkd/faker-this
$ cd faker-this
$ composer install

Code your logic

$ nano fake.php
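The boilerplate's actual fake.php is not reproduced here. As a rough sketch of what the logic might look like, the script below overwrites each row of a table with values from Faker. It assumes a `users` table with `id`, `name`, `email`, and `phone` columns, and it assumes `<dbPrefix>` prefixes the database name with a made-up `_app` suffix. Both are illustrative guesses, not taken from faker-this, and the function names are hypothetical.

```php
<?php
// fake.php: a hypothetical sketch, not the actual faker-this script.
// Usage: php fake.php <host> <mysql_user> <password> <dbPrefix>

// Build one row of replacement values from a Faker generator.
// The column names are assumptions for illustration.
function fakeUserRow($faker): array
{
    return [
        'name'  => $faker->name(),
        'email' => $faker->safeEmail(),
        'phone' => $faker->phoneNumber(),
    ];
}

// Overwrite every row of the (hypothetical) `users` table with fake values
// and return the number of rows processed.
function anonymizeUsers(PDO $pdo, $faker): int
{
    $ids = $pdo->query('SELECT id FROM users')->fetchAll(PDO::FETCH_COLUMN);
    $update = $pdo->prepare(
        'UPDATE users SET name = :name, email = :email, phone = :phone WHERE id = :id'
    );
    foreach ($ids as $id) {
        $update->execute(fakeUserRow($faker) + ['id' => $id]);
    }
    return count($ids);
}

function runAnonymizer(array $argv): void
{
    // Autoloader created by `composer install`.
    require_once __DIR__ . '/vendor/autoload.php';
    [, $host, $user, $password, $dbPrefix] = $argv;

    // Assumption: <dbPrefix> prefixes the database name; "_app" is made up.
    $dsn = sprintf('mysql:host=%s;dbname=%s_app;charset=utf8', $host, $dbPrefix);
    $pdo = new PDO($dsn, $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);

    $count = anonymizeUsers($pdo, Faker\Factory::create());
    echo "Anonymized {$count} rows.", PHP_EOL;
}

// Only run when all four command-line arguments are supplied.
if (PHP_SAPI === 'cli' && isset($argv) && count($argv) === 5) {
    runAnonymizer($argv);
}
```

Keeping the row builder separate from the database loop makes it easy to adjust individual values to your own column definitions without touching the update logic.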

Anonymize

$ php fake.php <host> <mysql_user> <password> <dbPrefix>

Distribute

$ mysqldump -u [user] -p[password] --all-databases | gzip > fake_data.sql.gz

Import

$ gunzip < fake_data.sql.gz | mysql -u [user] -p[password]

Keep configured project for later

$ rm -rf .git
$ git init
$ git add --all
$ git commit -m 'Initial commit of faker utility.'

Push to a remote if desired.

Distributing the Dataset

Configuring Chef-gm