Background

During my work at Stone Three upgrading some of our data pipelines with Azure Databricks, I was quite disappointed to learn that Spark 3.5.3’s only timezone-capable timestamp type always stores timestamps converted to UTC, and always displays them converted to the session-global timezone.

Admittedly, the situation is equally bad with PostgreSQL’s TIMESTAMPTZ type, which leads to designs where the original timezone offset must be stored in a separate column.

Having a timezone-aware timestamp type that could store the input timezone natively, and that would always display with its full timezone information, would have been much more useful. In our case, we not only need to know the exact point in time at which one of our vision sensors generated a measurement, but also what the exact local time was for that specific measurement.

To work around this frankly surprising deficiency in Spark, I demonstrate below how to calculate the timezone offset of an input ISO 8601-formatted timestamp using only native Spark functions. The idea is to store that offset alongside the timestamp of the measurement in question.

Solution

The solution makes use of the relatively new TIMESTAMP_NTZ type that was introduced in Spark 3.4.

By converting the input timestamp into two timezone-unaware (NTZ) timestamps, one keeping the source wall-clock time and one converted to UTC, and then subtracting the corresponding seconds since epoch, we can calculate the offset in signed minutes.
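The same subtraction can be sanity-checked in plain Python with only the standard library; this is just an illustration of the arithmetic, not part of the Spark pipeline:

```python
from datetime import datetime, timezone

s = "2024-10-15T02:44:07.490000-0330"

# Parse the offset-carrying string; strptime's %z accepts ±HHMM without a colon.
aware = datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z")

# "NTZ in source timezone": the wall-clock time as written, offset dropped.
ntz_source = aware.replace(tzinfo=None)

# "NTZ in UTC": the same instant's wall-clock time in UTC, offset dropped.
ntz_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)

# Subtracting the two naive timestamps recovers the offset in signed minutes.
offset_minutes = (ntz_source - ntz_utc).total_seconds() / 60
print(offset_minutes)  # -210.0
```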

With this formulation, which uses only built-in functions, you should get much better performance than with any UDF-based solution.

s = '2024-10-15T02:44:07.490000-0330'

# LHS and RHS variables shown separately here for illustration
# get ntz timestamp in -03:30 TZ
timestamp_ntz_in_source_tz = spark.sql(f"select to_timestamp_ntz('{s}')").take(1)[0][0]
display(timestamp_ntz_in_source_tz)
# get ntz timestamp for the same instant, but in UTC TZ
timestamp_ntz_in_utc = spark.sql(f"select to_timestamp_ntz(to_utc_timestamp('{s}', current_timezone()))").take(1)[0][0]
display(timestamp_ntz_in_utc)
# subtract seconds since epoch, convert to minutes
# i.e. (unix_timestamp(timestamp_ntz_in_source) - unix_timestamp(timestamp_ntz_in_utc)) / 60
# this should give you -210 minutes = -03h:30m
spark.sql(f"select (unix_timestamp(to_timestamp_ntz('{s}')) - unix_timestamp(to_timestamp_ntz(to_utc_timestamp('{s}', current_timezone())))) / 60").take(1)[0][0]

(Yes, UTC-03:30 is used in Newfoundland and Labrador, Canada.)
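To apply this per row rather than to a single literal, the same SQL expression can be built for a string column and passed to pyspark.sql.functions.expr. A small sketch, assuming a hypothetical column name raw_ts holding the ISO 8601 strings:

```python
def tz_offset_minutes_expr(col: str) -> str:
    """Build the Spark SQL expression that yields the signed offset in minutes
    for an ISO 8601 timestamp string held in the given column."""
    return (
        f"(unix_timestamp(to_timestamp_ntz({col})) - "
        f"unix_timestamp(to_timestamp_ntz(to_utc_timestamp({col}, current_timezone())))) / 60"
    )

# Hypothetical usage on a DataFrame with a string column 'raw_ts':
# from pyspark.sql import functions as F
# df = df.withColumn("tz_offset_min", F.expr(tz_offset_minutes_expr("raw_ts")))

print(tz_offset_minutes_expr("raw_ts"))
```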

Conclusion

Spark people, please give me a timestamp type that can store its timezone natively and which always displays with its full timezone information!

Anyone else, please let me know if there’s a better way to calculate this offset.