Tricks for coercing Pandas into parquet

Tue 29 May 2018

For coercing pandas datetimes (stored as numpy datetime64[ns]), truncate them to whole seconds:

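import numpy as np

# Truncate every nanosecond-precision datetime column to whole seconds
for col in df.columns[df.dtypes == np.dtype('<M8[ns]')]:
    # alternative: df[col].apply(lambda x: x.replace(microsecond=0))
    df[col] = df[col].values.astype('datetime64[s]')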
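
To see what that loop does, here is a minimal sketch on a made-up frame. The function name `truncate_datetimes_to_seconds` and the columns `ts` and `n` are my own placeholders, and I pick the datetime columns with a plain list comprehension rather than the boolean index above, so treat it as an illustration rather than the exact recipe.

```python
import numpy as np
import pandas as pd


def truncate_datetimes_to_seconds(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every datetime64[ns] column truncated to whole seconds."""
    out = df.copy()
    ns_dtype = np.dtype("<M8[ns]")
    datetime_cols = [c for c in out.columns if out[c].dtype == ns_dtype]
    for col in datetime_cols:
        # Casting to a coarser unit drops the sub-second part
        out[col] = out[col].values.astype("datetime64[s]")
    return out


# Hypothetical example data; force nanosecond precision explicitly
stamps = pd.to_datetime(
    ["2018-05-29 10:15:30.123456", "2018-05-29 10:15:31.999999"]
).astype("datetime64[ns]")
df = pd.DataFrame({"ts": stamps, "n": [1, 2]})

clean = truncate_datetimes_to_seconds(df)
```

The sub-second part is dropped, not rounded, so `10:15:31.999999` becomes `10:15:31`.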

For coercing Python datetime objects (there may be other options for datetime.datetime; my failed attempts, which may work there, are included as comments):

# Failed attempts, kept for reference:
# pd.Timestamp(
# df.loc[:, 'date'].values.astype(np.int64)
# pd.Timestamp([0],unit='s')
# pd.Timestamp([0].timestamp(),unit='s')
# x: x.isoformat())
df['date'] = df.loc[:, 'date'].apply(lambda x: x.isoformat())
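
A small sketch of the working line, assuming the column holds `datetime.date` objects (that is my assumption, and the sample dates are made up). It also shows `pd.to_datetime` as a possible alternative when you would rather keep a real timestamp type than a string.

```python
import datetime

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"date": [datetime.date(2018, 5, 29), datetime.date(2018, 6, 1)]})

# Possible alternative: convert to a pandas datetime column instead of strings
as_timestamps = pd.to_datetime(df["date"])

# The approach from the post: each date becomes an ISO 8601 string
df["date"] = df.loc[:, "date"].apply(lambda x: x.isoformat())
```

The string version is lossless for plain dates and always serializes, at the cost of losing the date type in the parquet schema.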

For pandas timedeltas (timedelta64[ns]), store the number of days instead:

df["timedelta_days"] = df.timedelta.dt.days
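
Note that `.dt.days` keeps only the whole-day part. If the sub-day part matters, one option is to store total seconds as a float instead. A sketch with made-up data (the `timedelta_seconds` column name is my own):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"timedelta": pd.to_timedelta(["1 days 12:00:00", "3 days 00:00:30"])})

# Whole days only: the 12 hours and the 30 seconds are dropped
df["timedelta_days"] = df["timedelta"].dt.days

# Full duration as float seconds, which keeps the sub-day part
df["timedelta_seconds"] = df["timedelta"].dt.total_seconds()

# Drop the original timedelta column before writing
df = df.drop(columns=["timedelta"])
```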

For columns that mix floats and strings, stored as pandas object dtype with np.nan for nulls (writing such a column to parquet throws an error), replace the non-float values with np.nan:

df.loc[~df.col.apply(isfloat),"col"] = np.nan

You need these two inefficient helpers:

def isint(x):
    try:
        int(x)
        return True
    except (TypeError, ValueError):
        return False
def isfloat(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False
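
Putting the mask and the helper together, here is a minimal end-to-end sketch on made-up data. The final `astype(float)` is my addition, not part of the recipe above; it assumes everything left in the column can be parsed as a float, which the mask guarantees.

```python
import numpy as np
import pandas as pd


def isfloat(x) -> bool:
    """True if float(x) succeeds (note: np.nan counts as a float)."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical mixed column: floats, strings, and a null
df = pd.DataFrame({"col": [1.5, "n/a", np.nan, "2.25", 3]}, dtype=object)

# Null out everything that cannot be read as a float
df.loc[~df["col"].apply(isfloat), "col"] = np.nan

# Assumed extra step: give the column a real float dtype
df["col"] = df["col"].astype(float)
```

Numeric strings such as `"2.25"` pass `isfloat`, so they survive the mask and are only converted by the final cast.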

When things come in with bytes type, and you get a memoryview error, hit your dataframe with this:

def stringify_df(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:
        if df[col].dtype == "O":
            # Only the first value is inspected, so a leading null skips the column
            if type(df[col].values[0]) == memoryview:
                mask = df[col].notnull()
                # memoryview -> bytes -> str, dropping any non-ASCII bytes
                df.loc[mask, col] = df.loc[mask, col].apply(lambda v: v.tobytes().decode("ascii", "ignore"))
                # Round-trip through ASCII so only plain ASCII strings remain
                df.loc[mask, col] = df.loc[mask, col].apply(lambda v: v.encode("ascii", "ignore").decode("ascii", "ignore"))
    return df
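
For a single column, the same idea can be sketched more compactly. The helper name `memoryview_column_to_str` and the sample values are hypothetical, and unlike the function above this version checks every value rather than only the first one.

```python
import numpy as np
import pandas as pd


def memoryview_column_to_str(series: pd.Series) -> pd.Series:
    """Decode memoryview/bytes entries to ASCII strings; leave other values as-is."""
    def decode(value):
        if isinstance(value, memoryview):
            value = value.tobytes()
        if isinstance(value, bytes):
            # Bytes outside the ASCII range are silently dropped
            return value.decode("ascii", "ignore")
        return value
    return series.apply(decode)


# Build an object array element by element so each memoryview is stored as-is
values = np.empty(3, dtype=object)
values[0] = memoryview(b"alpha")
values[1] = None
values[2] = memoryview(b"caf\xc3\xa9")

df = pd.DataFrame({"name": values})
df["name"] = memoryview_column_to_str(df["name"])
```

Decoding with `"ignore"` loses non-ASCII characters; if the bytes are known to be UTF-8, decoding as `"utf-8"` would preserve them.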