> import geopandas as gpd
> import pandas as pd
> from shapely.geometry import Point
> d = pd.read_csv('data/tracks/2024_01_01.csv')
> d.shape
(3690166, 4)
> list(d)
['user_id', 'timestamp', 'lat', 'lon']
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
> d2.shape, list(d2)
((3690166, 3), ['user_id', 'timestamp', 'geometry'])
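(Aside: the Point-per-row comprehension above is the slow way to build the geometry column; geopandas has a vectorized constructor for exactly this case. A minimal sketch of the same construction, same columns as above, not timed here:

    import geopandas as gpd
    # build the whole geometry array at once instead of looping over 3.7M rows in Python
    geom = gpd.points_from_xy(d['lon'], d['lat'], crs=4326)
    d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=geom)

)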
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg')
4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 26 21:10 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 26 22:03 /tmp/test.csv.gz
-rw-r--r-- 1 culebron culebron 423M Mar 26 21:58 /tmp/test.gpkg
CSV saved in 15 s, GPKG in 272 s: an 18x slowdown. I guess your dataset is country borders, isn't it? Something that 1) has few records, so it builds a small R-tree, and 2) contains linestrings/polygons whose coordinate runs can be encoded compactly, similar to the Google Polyline algorithm.
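(That encoding stores deltas between consecutive vertices as short ASCII runs, which is why dense lines shrink so well. A hedged illustration with the third-party polyline package -- my pick for the sketch, not necessarily what anyone here uses:

    import polyline
    # Google's documented example: three points collapse into one short ASCII string
    coords = [(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]
    print(polyline.encode(coords))  # '_p~iF~ps|U_ulLnnqC_mqNvxq`@'

)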
But a lot of geospatial data is just sets of points. For instance: housing stock for an entire country (a couple of million points); an address database (IIRC 20+M points); or GPS logs of multiple users, pulled from a logging database ordered by time, not assembled into tracks -- several million points per day.
For such datasets, use CSV; don't abuse indexed formats. (Unless you store the data for a long time and actually use the index for spatial search, multiple times.)
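And if a task later does need geometry, the round trip from CSV is cheap. A minimal sketch, assuming the same columns as in the session above:

    import pandas as pd
    import geopandas as gpd
    # to_csv wrote the index as the first column, so read it back the same way
    d = pd.read_csv('/tmp/test.csv.gz', index_col=0)
    gdf = gpd.GeoDataFrame(d, geometry=gpd.points_from_xy(d['lon'], d['lat']), crs=4326)
    # build the spatial index on demand, only when you actually query it
    _ = gdf.sindex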