PySpark timezone offset from ISO 8601 without UDF
Background
During my work at Stone Three upgrading some of our data pipelines with Azure Databricks, I was quite disappointed to learn that Spark 3.5.3's only timezone-capable timestamp type always stores timestamps normalized to UTC, and always displays these timestamps converted to the session-global timezone.
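To see what I mean, here is a quick sketch (the sample timestamp and the two session timezones are arbitrary examples):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# parse an ISO 8601 timestamp that carries an explicit -03:30 offset
df = spark.createDataFrame(
    [("2024-01-15T08:30:00-03:30",)], ["ts_str"]
).select(F.to_timestamp("ts_str").alias("ts"))

# the offset is gone: the instant is stored normalized to UTC, and is
# rendered in whatever the session timezone happens to be at show() time
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show()  # shows 2024-01-15 12:00:00

spark.conf.set("spark.sql.session.timeZone", "Africa/Johannesburg")
df.show()  # same instant, now shown as 2024-01-15 14:00:00
```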
Admittedly, the situation is equally bad with PostgreSQL's TIMESTAMPTZ type, which leads to designs where the actual offset has to be stored in a separate timezone offset column.
Having a timezone-aware timestamp type that could store the input timezone natively, and that would always display its full timezone information, would have been much more useful. In our case, we not only need to know the exact instant at which one of our vision sensors generated a measurement, but also what the exact local time was for that specific measurement.
To work around this frankly surprising deficiency in Spark, I demonstrate below how to calculate the timezone offset of an input ISO 8601-formatted timestamp using only native Spark functions. The idea would be to store that offset along with the timestamp of the measurement in question.
Solution
The solution makes use of the relatively new TIMESTAMP_NTZ type that was introduced in Spark 3.4.
By converting the input timestamp into timezone-unaware timestamps, one in the source timezone and one in UTC, and then subtracting the corresponding seconds since epoch, we can calculate the offset in signed minutes.
With this formulation, which stays entirely in native Spark functions, you should get much better performance than any Python UDF-based solution, since no rows have to be serialized out to a Python worker.
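A minimal sketch of the idea follows; the ts_str column name and sample rows are just for illustration, and here I strip the trailing offset with a regular expression to get the wall-clock part as a TIMESTAMP_NTZ. Note that the TIMESTAMP_NTZ-to-TIMESTAMP cast is interpreted in the session timezone, which is why the snippet pins that to UTC first:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# the TIMESTAMP_NTZ -> TIMESTAMP cast further down interprets the wall
# clock in the session timezone, so pin that to UTC for the arithmetic
spark.conf.set("spark.sql.session.timeZone", "UTC")

# illustrative sample rows; 2024-01-15 is in winter, when Newfoundland
# is at its standard offset of UTC-03:30
df = spark.createDataFrame(
    [("2024-01-15T08:30:00-03:30",), ("2024-01-15T12:00:00Z",)],
    ["ts_str"],
)

# wall-clock part as TIMESTAMP_NTZ: strip the trailing offset
# ("Z" or +-HH:MM) and parse what remains
local_ntz = F.to_timestamp_ntz(
    F.regexp_replace("ts_str", r"(Z|[+-]\d{2}:?\d{2})$", "")
)

# the full string parses into a plain TIMESTAMP, i.e. the absolute
# instant, stored normalized to UTC
instant = F.to_timestamp("ts_str")

# epoch seconds of the wall clock "as if it were UTC", minus epoch
# seconds of the actual instant, is the offset in signed minutes
offset_minutes = (
    (F.unix_timestamp(local_ntz.cast("timestamp"))
     - F.unix_timestamp(instant)) / 60
).cast("int")

df.withColumn("utc_offset_minutes", offset_minutes).show(truncate=False)
```

This should output something like:

```
+-------------------------+------------------+
|ts_str                   |utc_offset_minutes|
+-------------------------+------------------+
|2024-01-15T08:30:00-03:30|-210              |
|2024-01-15T12:00:00Z     |0                 |
+-------------------------+------------------+
```

The -210 minutes in the first row is exactly UTC-03:30.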
(Yes, UTC-03:30 is used in Newfoundland and Labrador, Canada.)
Conclusion
Spark people, please give me a timestamp type that can store its timezone natively and which always displays with its full timezone information!
Anyone else, please let me know if there’s a better way to calculate this offset.