
Memory-efficient, faster way (<10 min) to extract JSON data

Updated: Jun 15, 2022


In this notebook I'll show you a solution for handling the JSON event data. The goal is a fast, memory-efficient way to load and prepare the train (or test) dataframe.

It loads and converts the selected JSON arguments into a dataframe in less than 10 minutes, and it keeps the memory usage as low as possible (the final train dataframe is ~500 MB).

In [1]:

import pandas as pd
import numpy as np
import json
import csv
import gc

from collections import OrderedDict
from tqdm.notebook import tqdm

# dtypes for pd.read_csv
# These help to reduce memory usage.
DTYPES_RAW = {
    'event_id': 'object',
    'game_session': 'object',
    'installation_id': 'object',
    'event_count': np.uint16,
    'event_code': np.uint16,
    'game_time': np.uint32,
    'type': 'category',
    'world': 'category',
    'title': 'category',  
}
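
If you want to see how much these dtypes save, here is an optional sketch that compares a 100k-row sample read with pandas' default dtypes against one read with DTYPES_RAW:

# Optional sketch: measure the memory saved by DTYPES_RAW on a sample.
path = '/kaggle/input/data-science-bowl-2019/train.csv'
cols = ['timestamp'] + list(DTYPES_RAW.keys())
plain = pd.read_csv(path, nrows=100_000, usecols=cols)
typed = pd.read_csv(path, nrows=100_000, usecols=cols, dtype=DTYPES_RAW)
print(plain.memory_usage(deep=True).sum() / 2**20,
      typed.memory_usage(deep=True).sum() / 2**20)  # sizes in MB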

In [2]:

# Extract these arguments from JSON.
# There is not enough memory to extract everything with this method.
# You should check whether it can process the private test set too
# with your selected arguments.
FIELDS = {
    # Extras from JSON
    # If you add more data, do not forget
    # to add default values below.
    'level': np.uint8,
    'round': np.uint8,
    'correct': np.int8,
    'misses': np.int8,
    
    # Nested object separated by '_'
    # for example: {'coordinates': {'x': 12, 'y': 12}}
    # 'coordinates_x': np.uint16
    # 'coordinates_y': np.uint16
}

DTYPES = OrderedDict((name, (dtype, i)) for i, (name, dtype) in enumerate(FIELDS.items()))
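
For reference, DTYPES maps each field name to a (dtype, position) pair; with the FIELDS above it is equivalent to:

# Written out by hand, DTYPES is:
# OrderedDict([('level',   (np.uint8, 0)),
#              ('round',   (np.uint8, 1)),
#              ('correct', (np.int8,  2)),
#              ('misses',  (np.int8,  3))])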

In [3]:

# This is only needed if you want to show a tqdm progress bar.
import subprocess

def file_len(fname):
    """Returns the number of lines in a file.
       @see: https://www.kaggle.com/szelee/how-to-import-a-csv-file-of-55-million-rows
    """
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
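    # wc -l counts newline characters, so the +1 compensates for a possible
    # missing final newline; an approximate total is fine here, since it is
    # only used as the tqdm progress bar total.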
    return int(result.strip().split()[0])+1

In [4]:

def flatten(dct, res, separator='_'):
    """Flatten a dictionary.
       @see: https://stackoverflow.com/a/34094630/4158850
    """
    queue = [('', dct)]

    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                if key in FIELDS:
                    res[0][DTYPES[key][1]] = v
            else:
                queue.append((key + separator, v))

    return res

def records_from_json(fh, n_rows, event_ids_to_drop):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter=',')
    skip_header = next(rows)
    
    # define dtype for more memory-efficiency.
    dtype = dict(names=list(FIELDS.keys()), formats=list(FIELDS.values()))
    defrow = np.zeros((1,), dtype=dtype)

    for event_id, game_session, timestamp, event_data, installation_id, event_count, event_code, game_time, title, typ, world in tqdm(rows, total=n_rows):
        
        # It is more memory-efficient if we don't use the train df's columns yet.
        row = defrow.copy()

        # Default values for the extracted data (required because of the copy above).
        # You could use np.nan instead (in that case the dtype should be np.float64).
        row[0][DTYPES['level'][1]] = 0
        row[0][DTYPES['round'][1]] = 0
        row[0][DTYPES['correct'][1]] = -1
        row[0][DTYPES['misses'][1]] = -1

        if event_id not in event_ids_to_drop:
            row = flatten(json.loads(event_data), row)

        yield row[0]

def from_records(path, event_ids_to_drop):
    n_rows = file_len(path)
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh, n_rows, event_ids_to_drop))
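
To make the flattening concrete, here is a small illustration with a made-up payload; only keys listed in FIELDS survive, and keys of nested dicts are joined with '_':

# Made-up payload: 'coordinates_x' / 'coordinates_y' are dropped because
# they are not (uncommented) keys in FIELDS.
dtype = dict(names=list(FIELDS.keys()), formats=list(FIELDS.values()))
demo = np.zeros((1,), dtype=dtype)
flatten({'level': 3, 'misses': 2, 'coordinates': {'x': 12, 'y': 7}}, demo)
print(demo[0])  # (3, 0, 0, 2) -> level=3, round=0, correct=0, misses=2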

Extract JSON event data


You can speed up processing significantly by not parsing rows whose JSON data contains nothing of interest. To determine which event_ids to drop, just read the specs.csv file! This trick lets you parse all relevant data in just a few minutes.

In [5]:

specs = pd.read_csv('/kaggle/input/data-science-bowl-2019/specs.csv')
specs.args = specs.args.apply(lambda x: json.loads(str(x)))
event_ids_to_drop = []

for _, spec in specs.iterrows():
    j = pd.json_normalize(spec.args)
    vals = j.loc[j['name'].isin(FIELDS.keys())]['name'].values

    if len(vals) == 0:
        event_ids_to_drop.append(spec.event_id)

# Use a set for O(1) membership tests in records_from_json.
event_ids_to_drop = set(event_ids_to_drop)
print(len(event_ids_to_drop))
149
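
For context, each parsed args value is a list of argument descriptors; the filter above only relies on their 'name' keys, schematically:

# Schematic shape of one parsed specs.args value (other keys elided):
# [{'name': 'game_time', ...}, {'name': 'round', ...}]
# An event_id is dropped when none of its argument names appear in FIELDS.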

Train

In [6]:

extras_df = from_records('/kaggle/input/data-science-bowl-2019/train.csv', event_ids_to_drop)

In [7]:

train_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv', parse_dates=['timestamp'], dtype=DTYPES_RAW, usecols=['timestamp'] + list(DTYPES_RAW.keys()))
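
extras_df was generated row-by-row from the same file, so it lines up positionally with train_df; an optional sanity check before the index-on-index merge:

# Optional: the merge below relies purely on positional alignment.
assert len(train_df) == len(extras_df)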

In [8]:

train_df = train_df.merge(extras_df, left_index=True, right_index=True)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11341042 entries, 0 to 11341041
Data columns (total 14 columns):
event_id           object
game_session       object
timestamp          datetime64[ns, UTC]
installation_id    object
event_count        uint16
event_code         uint16
game_time          uint32
title              category
type               category
world              category
level              uint8
round              uint8
correct            int8
misses             int8
dtypes: category(3), datetime64[ns, UTC](1), int8(2), object(3), uint16(2), uint32(1), uint8(2)
memory usage: 508.3+ MB

In [9]:

train_df.to_csv('train_extras.csv', index=False)
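
Note that a plain read_csv of train_extras.csv later would fall back to pandas' default dtypes; a minimal sketch, assuming you want the compact dtypes back when reloading:

# Sketch: pass the combined dtype mapping again when reloading.
DTYPES_ALL = {**DTYPES_RAW, **FIELDS}
reloaded = pd.read_csv('train_extras.csv', parse_dates=['timestamp'], dtype=DTYPES_ALL)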

In [10]:

del extras_df
del train_df
gc.collect()

%reset -f Out
Flushing output cache (0 entries)

Test

In [11]:

extras_df = from_records('/kaggle/input/data-science-bowl-2019/test.csv', event_ids_to_drop)

In [12]:

test_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/test.csv', parse_dates=['timestamp'], dtype=DTYPES_RAW, usecols=['timestamp'] + list(DTYPES_RAW.keys()))
test_df = test_df.merge(extras_df, left_index=True, right_index=True)

test_df.to_csv('test_extras.csv', index=False)

In [13]:

del extras_df
del test_df
gc.collect()

%reset -f Out
Flushing output cache (0 entries)

Thanks for reading


