Submitted by thermalben on Mon, 05/29/2017 - 07:27
Last Monday, after four years of non-stop, unfaltering service, our wave model inexplicably stopped producing data.
Before I detail what went wrong (and how we fixed it), I need to describe our wave model system as it’s a little complex to those who don’t dabble in the technical side of meteorology and oceanography.
Swellnet runs an in-house version of NOAA’s popular WaveWatch III model. Almost every other surf forecasting website in the world runs the same wave model, but there are several points of difference between sites.
Firstly, we run the wave model in-house, which means we implement the source code on our own servers, and have control over some of its parameters - like how often it is run, and how long it takes to produce data. Every time our WaveWatch III model is run, it needs to retrieve input data from NOAA - which is essentially the 10m wind forecast from the GFS atmospheric model, for the next two weeks - and then ingest this into its system.
However, some websites choose to use NOAA’s native output from its public servers. This is not necessarily less accurate than running the model yourself (and it is a lot less costly, because you don’t have to pay for the servers, nor do you need the technical skills to run the model), but it doesn’t allow for the level of customisation that we need as a complex, global surf forecasting website.
WaveWatch III - in conjunction with a complex graphics software system - then produces the colourful WAMS you see on our website. It also produces the “Swell Train Analysis” data, a time series forecast of each swell train that is modelled to reach every one of our two-thousand-odd surf forecast locations.
However, this does not produce any information that necessarily corresponds to surf height.
Some surf forecast websites output the raw WaveWatch III data - for example, “2.4m @ 9.6 seconds”, but what does that really mean? Over time, an experienced user may become familiar with how this information correlates to surf height at their local break, but as a surf forecasting website, we needed an intuitive, graphical way to represent surf conditions between each surf region.
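To illustrate why a raw figure like “2.4m @ 9.6 seconds” is hard to read on its own, here is a minimal Python sketch using the standard textbook formula for deep-water wave energy flux, P = ρ g² H² T / (64π). It is not how Wave App works, and it assumes the quoted height is significant wave height and that the quoted period can stand in for the energy period. The constants and the function name are illustrative choices.

```python
import math

RHO = 1025.0  # sea water density in kg/m^3 (a typical value, assumed)
G = 9.81      # gravitational acceleration in m/s^2


def deep_water_wave_power(height_m, period_s):
    """Energy flux per metre of wave crest (W/m) for deep-water waves.

    Textbook formula: P = rho * g^2 * H^2 * T / (64 * pi).
    """
    return RHO * G**2 * height_m**2 * period_s / (64 * math.pi)


# The example from the text, and the same height at a longer period.
raw = deep_water_wave_power(2.4, 9.6)
long_period = deep_water_wave_power(2.4, 16.0)

print(f"2.4m @ 9.6s : {raw / 1000:.1f} kW per metre of crest")
print(f"2.4m @ 16.0s: {long_period / 1000:.1f} kW per metre of crest")
```

The same 2.4m of swell carries noticeably more energy at the longer period, which is one reason height alone says little about the surf at a given break.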
So, many years ago, we built “Wave App”, a proprietary surf model that takes the raw output from WaveWatch III and then converts the data into useful surf height approximations. Wave App is separate from WaveWatch III and runs on its own individual servers - yet another cog in the machine.
The algorithms underpinning Wave App are very complex, and we’re very pleased with the results. It’s not 100% perfect; however, in most instances any inaccuracies in its predicted surf height are related to poor quality source information from WaveWatch III, rather than our Wave App. We’re pretty confident that it’s the best surf forecasting tool anywhere in the world.
The output from Wave App - a time series “surf forecast” height on our forecast graphs, and the Swell Train Analysis - is displayed on our website. This runs on yet another series of complex web servers, which deliver the final product to you when you look at our website or apps.
So! What went wrong last week?
Well, most of our infrastructure resides in the Amazon cloud. On Saturday May 20th, we received an automated notification from Amazon that they had “detected degradation of the underlying hardware” of one of our servers, and as a result, this server “will be stopped” on Monday May 22nd.
Unfortunately, with it being a weekend (and us being a very small business), we were unable to act upon this notification in a timely manner. By the time we had a chance to investigate - Monday morning - the server had been shut down. Yes, we received only two days’ notice - all of it over a weekend.
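For readers wondering how a notice like this can be caught by a script rather than an inbox, here is a minimal Python sketch. It is not a description of Swellnet's setup. It assumes the third-party boto3 library and uses the EC2 `describe_instance_status` call, whose response lists scheduled events per instance; the function names and the instance ID in the usage note are hypothetical.

```python
def fetch_status():
    """Ask EC2 for the status of all instances, including scheduled events.

    Assumes boto3 is installed and AWS credentials are configured. The
    import lives here so the parsing below runs without either.
    Pagination (NextToken) is ignored to keep the sketch short.
    """
    import boto3

    ec2 = boto3.client("ec2")
    return ec2.describe_instance_status(IncludeAllInstances=True)


def pending_events(status_response):
    """Return (instance_id, event_code, not_before) for events still to come."""
    found = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # AWS keeps finished events visible for a while and prefixes
            # their description with "[Completed]"; skip those.
            if event.get("Description", "").startswith("[Completed]"):
                continue
            found.append(
                (status["InstanceId"], event["Code"], event.get("NotBefore"))
            )
    return found


def format_alert(instance_id, code, not_before):
    """Build a one-line message suitable for an SMS, email, or chat alert."""
    when = not_before.isoformat() if not_before else "an unspecified time"
    return f"{instance_id}: {code} scheduled for {when}"
```

Run on a schedule, something like `for e in pending_events(fetch_status()): print(format_alert(*e))` would report an event such as `instance-stop` for a hypothetical instance `i-0abc` as soon as it is scheduled, weekend or not. Where the alert is sent is left open.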
But we were confused at first.
You may remember that our surfcams also went offline last Monday morning for an hour or so. This was a coincidental, unrelated network outage - but at the time it was a red herring for me, as I presumed the weekend’s server notification from Amazon was related to the surfcam downtime.
So when we fixed the surfcams, I presumed the problem was resolved.
I was wrong.
It wasn’t until late Monday night that we realised the wave model hadn’t updated all day. So we undertook a few routine checks, thinking it was a technical glitch in the system, which happens from time to time, and made a note to have another look on Tuesday.
I was busy helping to install our new Narrowneck surfcam all day Tuesday, so it wasn't until Tuesday evening that we worked out what the problem was - and realised that we had to rebuild this server. The clock was ticking, because the wave model wasn’t running any more.
But in order to rebuild the server, we had to touch base with the developers who initially set up the wave model four years ago. They were located overseas - which made it difficult to communicate in a timely manner. And in the meantime we couldn’t get access to our own server, because they had the special access key that was used to set it up in the first place. An oversight from all of those years ago, but a scenario we didn’t envisage at the time.
So, after a few days of very slow communication with our developers, we finally had another version of the Wave Model up and running late Friday.
But it wasn’t working properly - something was tripping it up, and we couldn’t work it out. One of the main issues was that the wave model takes a couple of hours to run, so every time we thought we’d figured out the problem, we’d kick off the model again, but have to wait a few hours to see if we had fixed things.
This means we essentially only had a couple of attempts per day in trying to fix the wave model - in and around our otherwise busy schedules.
Finally, late Sunday we managed to get the model back up and running, and as you’ll see now we have returned to a complete 16-day forecast output.
Anyway, thanks for your patience... we have now put in measures to stop this from happening again in the future.
Also, there's just one small problem for the next few days: with the model being down since last Monday, it’s starting from an uninitialised sea state.
Every time the wave model runs, it uses the previous model run’s data as baseline conditions - which means it has an active ocean state to work with, so the wind forecast that drives the model forecast picks up where it left off last time.
But the model doesn’t know that right now. We’ve just pressed the green button on a planet that is completely flat (as the model assumes, anyway).
This means the short term forecast for any pre-existing swells (i.e. already in the water) will be wrong - as it doesn’t know they exist. But any new swells generated from here on will be fine.
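The effect of this cold start can be sketched with a toy Python model. It is nothing like WaveWatch III itself: it assumes a single one-dimensional track of four cells, with swell moving one cell toward the coast per time step, and all numbers are made up for illustration.

```python
def step(state, wind_input):
    """Advance one time step along a 1-D track.

    state[0] is the cell where wind generates swell; state[-1] is the cell
    next to the coast. Every packet of swell moves one cell closer, and
    whatever was in the last cell has reached the beach and leaves the grid.
    """
    return [wind_input] + state[:-1]


def run(initial_state, wind_forecast):
    """Return the swell arriving at the coast at each step of the forecast."""
    state = list(initial_state)
    at_coast = []
    for wind in wind_forecast:
        at_coast.append(state[-1])
        state = step(state, wind)
    return at_coast


# One new swell is generated at the first step; no wind input after that.
wind_forecast = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Warm start: the previous run left swell already travelling toward the coast.
warm = run([0.0, 2.0, 1.5, 0.0], wind_forecast)

# Cold start: the model begins with a completely flat ocean.
cold = run([0.0, 0.0, 0.0, 0.0], wind_forecast)

print("warm start:", warm)
print("cold start:", cold)
```

In this toy, the cold run misses the two swells that were already in the water, but both runs agree once the newly generated swell reaches the coast at the fifth step.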
These images show it clearly - peak period has almost nothing across the South Pacific.
Meanwhile, significant wave height and surface wind look almost identical, because significant wave height is essentially 100% windswell generated by surface winds, with no previously underlying groundswell.
Thanks for the update... interesting and enlightening on how you are running the forecasts. I suppose it would be interesting to see how accurate the models are over time and, if they're not, what tends to throw off the prediction.
I've got next week off and it's looking pretty flat after Sunday.
Can you fix the models to get me some swell? :)
Ha! If only it worked that way...
Wow. I compare this to my server migration this week and my hassles were minor.
Of course, look at it the other way - it could always be much worse - like grounding a global airline and pissing off hundreds of thousands of customers because of cutting corners on backup for the main server http://www.smh.com.au/world/british-airways-cancels-flights-from-london-...
Thanks for the update Ben, appreciate the transparency!
No wonder you guys are so accurate, great insight, thanks
Lots of surfers have been calling me for a surf forecast because they are devout swellnetters and cannot read any other forecast sites.
For those who are unaware of what happened with the swellnet web crash, it's a catastrophe as they wait for days thinking the forecast is still broken.
Good to see people lose the ability to adapt and check a different website for alternative information.
In the future any major hacks of surf websites could be used to score good waves. I'm still waiting for a surf site to be hacked and false information spread with the aim of beating the crowds.
Better than a fake shark fin.
I can just imagine the news headlines: "Wikileaks releases true forecast of 2.8m @ 16 seconds"
Interesting that stunet & craig were onto this quickly and dispatched to score a run of waves.
Meanwhile half the population read SN forecast and it said "Flat" so they didn't go surfing !
that's technology for you... keep up the good work guys, we all appreciate the hard work you are putting in...