Pivot seems to not respect lazy evaluation #163

@alberto-i

Hello, is this the expected behavior?

I'm running the code below, which computes the same result two ways: with a composition of groupBy, select and inflate, and with a single pivot call. Building the groupBy pipeline takes 0.235 ms, while the pivot call takes 146.8 ms, roughly 625 times longer, which suggests pivot does its work eagerly instead of deferring it. The first call to "toArray" then takes 51.27 ms on the groupBy result and 34.456 ms on the pivot result, so at that step the pivot version is about 33% faster.

The dataset is a 1.5 MB file containing 27k rows.

const dataForge = require('data-forge');
require('data-forge-fs');

let start = process.hrtime();

const elapsed_time = function(note) {
    const precision = 3; // 3 decimal places
    const [seconds, nanoseconds] = process.hrtime(start); // read the timer once so both parts come from the same sample
    const elapsed = nanoseconds / 1000000; // divide by a million to convert nanoseconds to milliseconds
    console.log(seconds + " s, " + elapsed.toFixed(precision) + " ms - " + note); // print time + message
    start = process.hrtime(); // reset the timer
}

const df = dataForge
    .readFileSync('./data.csv')
    .parseCSV({ dynamicTyping: true })
    .withIndex((row) => `${row.meeting_id}_${row.item_id}_${row.user_id}_${row.source_id}`)

elapsed_time('parsecsv')

const sintetico = df
    .groupBy((row) => `${row.meeting_id}_${row.item_id}_${row.vote}`)
    .select((group) => ({
        meeting_id: group.first().meeting_id,
        item_id: group.first().item_id,
        vote: group.first().vote,
        stock: group.deflate(row => row.stock).sum(),
    }))
    .inflate()

elapsed_time('groupBy, select, inflate')

const sinteticoPivot = df.pivot(['meeting_id', 'item_id', 'vote'], {
    stock: dataForge.Series.sum
})

elapsed_time('pivot')

const data = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray')

const data2 = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray again')

const data3 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray')

const data4 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray again')
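
For measurements like these, a lap timer built on `process.hrtime.bigint()` may be a simpler alternative, since it returns a single nanosecond value instead of a seconds/nanoseconds pair. The following is only a sketch: `makeTimer` and `lap` are hypothetical names and are not part of the script above or of data-forge.

```javascript
// Hypothetical helper: makeTimer is an illustrative name, not part of the original script.
function makeTimer() {
    let last = process.hrtime.bigint(); // current time in nanoseconds, as a BigInt

    // Logs and returns the milliseconds elapsed since the previous lap.
    return function lap(note) {
        const now = process.hrtime.bigint();
        const elapsedMs = Number(now - last) / 1e6; // nanoseconds to milliseconds
        last = now; // reset the reference point for the next lap
        console.log(`${elapsedMs.toFixed(3)} ms - ${note}`);
        return elapsedMs;
    };
}

const lap = makeTimer();
const warmup = lap('timer created');
```

Each call to `lap('some note')` would then take the place of a call to `elapsed_time('some note')`.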

These are the outputs:

0 s, 183.236 ms - parsecsv
0 s, 0.235 ms - groupBy, select, inflate
0 s, 146.789 ms - pivot
0 s, 51.270 ms - groupBy, select, inflate => toArray
0 s, 1.200 ms - groupBy, select, inflate => toArray again
0 s, 34.456 ms - pivot => toArray
0 s, 13.261 ms - pivot => toArray again
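
One possible reading of these numbers is that the groupBy pipeline is only described when it is built and does its work on the first "toArray", whereas pivot does its work as soon as it is called. The sketch below illustrates that difference with a toy aggregation. It is not data-forge's implementation: `eagerSum`, `lazySum` and the sample rows are made up for illustration.

```javascript
// Toy illustration of eager versus lazy evaluation; not data-forge code.
let rowsVisited = 0; // counts how many source rows have been read so far

// Made-up sample data: 1000 rows spread over 10 keys.
const rows = Array.from({ length: 1000 }, (_, i) => ({ key: i % 10, stock: i }));

// Eager: groups and sums immediately, at call time.
function eagerSum(source) {
    const totals = new Map();
    for (const row of source) {
        rowsVisited++;
        totals.set(row.key, (totals.get(row.key) || 0) + row.stock);
    }
    return [...totals].map(([key, stock]) => ({ key, stock }));
}

// Lazy: returns an iterable that runs the same work only when it is iterated.
function lazySum(source) {
    return {
        *[Symbol.iterator]() {
            yield* eagerSum(source);
        },
    };
}

const lazy = lazySum(rows);
const visitedAfterLazyBuild = rowsVisited; // still 0: building the pipeline reads nothing

// Taking only five results still reads every row, because a sum needs the whole group.
const firstFive = [...lazy].slice(0, 5);
const visitedAfterIteration = rowsVisited;

console.log({ visitedAfterLazyBuild, visitedAfterIteration, firstFive });
```

In this toy version the cost moves from the construction step to the first iteration, which matches the pattern in the timings above: a near-zero build time followed by a more expensive first "toArray", even with `head(5)`.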

Is this intended? Should I dig deeper to fix it and make a pull request?

Thanks,
