4 min read

Using AI for Data Analysis: A Deep Dive into Google's Gemini 1.5 Pro

Discover how Google's Gemini 1.5 Pro tackles real-world data challenges like duplicate customer records. See practical examples of AI-powered data analysis and learn when it shines and where human expertise still matters.
Using AI for Data Analysis: A Deep Dive into Google's Gemini 1.5 Pro

Google recently announced the world's longest context window in AI. A context window this long is useful, because we now can make sense of a large documents. Can AI then perform data analysis on a large dataset and derive data insights, like human does? In this blog, we will take a look at Google's latest Gemini Model 1.5 Pro and see how well it performs data analysis.

Say Goodbye to Duplicate Data Headaches

Let's start with a very common scenario - duplicate data! It clogs up databases, skews analysis, and makes it difficult to get a clear picture of your customers.
Wouldn't it be good if we can tell AI to tackle this age-old problem?
Ok, here we go. In Gemini AI Studio, let's start with a system instruction to give our AI cousin a persona, like this:

You're an assistant designed to identify potential duplicates in large dataset.
Users will paste a csv text consisting of multiple customer records and you'll respond with potential duplicate groupings. 
Explain how you arrive at the groupings.

Temperature: 0.8 (A value less than 1 is preferred, as we don't want to AI to be too creative in data analysis task!)

Specify the output we want to see

Our AI cousin is like us, they want to know what is the report format we want to see. We could either specify them in writing, or provide them with
some sample outputs (in AI, this is called few shots learning), which honestly is the simpler thing to do:

User prompt

CUST_ID First Name Middle Name Last Name DOB TFN House No Street City State Country Credit Card Number
599 Charles Jessica Wade 12/09/2004 903166839 64483 George Dale Lake Thomasview Queensland Australia 4718492351287510
599 Charles Jessica Wade 12/09/2004 903166832 4627 Casey Club Port Marcusfort New South Wales Australia 4718492351287510
599 Charles Jessica Wade 12/09/2004 903166839 64483 George Dale Lake Thomasview Queensland Australia 4718492351287510
1 Amy Bryan Fitzgerald 26/09/1973 567042755 53315 Dylan Streets Lake Tinamouth Western Australia Australia 6538822699865230
2 Erika Jesse Gibbs 2/09/1974 771803753 931 Sullivan Camp Jessemouth New South Wales Australia 5575683130284680
3 Steven Ashley Green 23/11/1984 236952690 7620 Shepherd Tunnel Jasonfort Tasmania Australia 6011406732918180
4 Amanda Tiffany Nelson 19/10/1952 547399051 705 Terri Cape Alvaradoport Victoria Australia 2575718427801270
5 Wendy Carl Hall 7/07/1952 889917142 572 Ryan Bridge East Lisaport Tasmania Australia 213130649985643
6 Patricia Linda Thompson 27/06/1976 380810740 8768 Christopher Groves New Jamesview Queensland Australia 3569631294279980
7 Linda William Fletcher 11/08/1991 176447589 854 Tracy Coves Masonport Victoria Australia 4752175679472640
8 Kelsey Jill Baker 19/09/2004 646188273 24799 Jones Alley Gibsonville Queensland Australia 4919192731457300
9 Matthew Amanda Pennington 9/12/1970 463387631 725 Black Cliff New Melissa South Australia Australia 676191496600
10 James Gina Thompson 20/08/1971 697372205 5784 Sanders Harbors Port Jaclyn Victoria Australia 3522324210776590
11 Albert Emily Lopez 3/02/1993 290025911 308 Tiffany Fields South Suefurt Victoria Australia 566677122822
12 Gina Christina White 18/07/1972 772607628 124 Farmer Ways Rubenchester New South Wales Australia 346309889995633
13 Rebecca Timothy Chapman 1/07/1974 584767009 57 David Prairie Kathleenhaven Victoria Australia 38230725252018

Model response

GrpID CUST_ID First Name Middle Name Last Name DOB TFN House No Street City State Country Credit Card Number
1 599 Charles Jessica Wade 12/09/2004 903166839 64483 George Dale Lake Thomasview Queensland Australia 4718492351287510
1 599 Charles Jessica Wade 12/09/2004 903166832 4627 Casey Club Port Marcusfort New South Wales Australia 4718492351287510
1 599 Charles Jessica Wade 12/09/2004 903166839 64483 George Dale Lake Thomasview Queensland Australia 4718492351287510

Let the magic happens

For this test, I have a python script to generate fake customer data and also performs some error functions to micmic real-life data entry errors.
These data entry errors are appended with the parent record id, to check our AI's results.

import pandas as pd
from faker import Faker
import random
from datetime import date, timedelta

fake = Faker()

# --- Functions for data generation ---
def random_dob(start_year=1950, end_year=2005):
    year = random.randint(start_year, end_year)
    month = random.randint(1, 12)
    day = random.randint(1, 28)
    return pd.to_datetime(f"{year}-{month:02}-{day:02}")

def generate_address(state_probs):
    state = random.choices(list(state_probs.keys()), weights=list(state_probs.values()))[0]
    return {
        'House No': fake.building_number(),
        'Street': fake.street_name(),
        'City': fake.city(),
        'State': state,
        'Country': 'Australia'
    }

def generate_credit_card():
    return fake.credit_card_number(card_type=None)

def generate_tfn():
    return fake.random_number(digits=9, fix_len=True)

def introduce_tfn_error(tfn):
        tfn = str(tfn) # Convert to string 
        position = random.randint(0, 8)
        new_digit = str(random.randint(0, 9))
        return int(tfn[:position] + new_digit + tfn[position + 1:]) # Convert back to int

# --- Simulation settings ---
start_year = 2013
end_year = 2023
initial_num_customers = 500
state_probabilities = {
    'Victoria': 0.3,
    'New South Wales': 0.25,
    'Queensland': 0.2,
    'South Australia': 0.1,
    'Western Australia': 0.1,
    'Tasmania': 0.05
}

# --- Data storage ---
customers = []
next_cust_id = 1

# --- Generate initial customer data ---
for _ in range(initial_num_customers):
    customer = {
        'CUST_ID': next_cust_id,
        'First Name': fake.first_name(),
        'Middle Name': fake.first_name(),
        'Last Name': fake.last_name(),
        'DOB': random_dob(),
        'TFN': generate_tfn(),  # Introduce potential TFN error
        **generate_address(state_probabilities),
        'Credit Card Number': generate_credit_card(),
        'PARENT_CUST_ID': 0
    }
    customers.append(customer)
    next_cust_id += 1

# --- Simulate changes over time ---
for year in range(start_year + 1, end_year + 1):
    for i in range(len(customers)):

        # 2% chance of moving
        if random.random() < 0.02:
            new_customer = customers[i].copy()
            OLD_ID = new_customer['CUST_ID']
            new_customer.update(generate_address(state_probabilities))
            new_customer['CUST_ID'] = next_cust_id
            new_customer['PARENT_CUST_ID'] = OLD_ID
            customers.append(new_customer)
            next_cust_id += 1

        # 1% chance of marriage (and last name change)
        if random.random() < 0.01:
            new_customer = customers[i].copy()
            OLD_ID = new_customer['CUST_ID']
            new_customer['Last Name'] = fake.last_name()
            new_customer['CUST_ID'] = next_cust_id
            new_customer['PARENT_CUST_ID'] = OLD_ID
            customers.append(new_customer)
            next_cust_id += 1

        # 2% chance of TFN errors
        if random.random() < 0.02: 
            new_customer = customers[i].copy()
            OLD_ID = new_customer['CUST_ID']
            new_tfn = introduce_tfn_error(new_customer['TFN'])
            new_customer['TFN'] = new_tfn
            new_customer['CUST_ID'] = next_cust_id
            new_customer['PARENT_CUST_ID'] = OLD_ID
            customers.append(new_customer)
            next_cust_id += 1

# --- Create DataFrame ---
df = pd.DataFrame(customers)
print(df.head())
df.to_csv('crm_data_with_tfn_errors.csv', index=False)

Below is the screenshots taken from AI Studio - all outputs are generated by the AI:
ai_studio_output.png

Pretty impressive stuff, if you asked me.

Summary

AI did not even break a sweat while churning out the analysis above. However, when I subjected it to further complex duplicate matching logic (example 2 way match, with blank TFN or DOB), it was unable to provide the correct analysis. So, not to worry my friend, AI is not taking over our job anytime soon.

  1. Google AI Studio Quick Start
  2. Google Gemini 1.5 Pro announcement