
Confidently Wrong: AI isn’t Seaworthy for Cruising Sailors

We had dinner with some friends the other night and were chatting about the utility of AI in sailing and cruising. During the conversation, we touched on the vast amount of disinformation and misinformation available on the internet, and I was reminded of an article I wrote for Currents during the pandemic about which signal flag(s) were appropriate to indicate that your vessel was under quarantine. At the time, there was a considerable amount of disinformation on the internet about this topic, due to a single, out-of-date reference in a Wikipedia entry being picked up by dozens of bloggers, Facebook posters, YouTubers and the like (you can see my article here [1]). To see if things have changed, I decided to pose a question about signal flags for quarantine status to four of the leading AIs: Grok (from Elon Musk’s xAI), ChatGPT (from OpenAI), Gemini (Google) and CoPilot (Microsoft).

Why is this important? Many (if not all) search engines now provide AI-based summaries with search results by default, promoting themselves as ‘useful’ agents that offer complete answers and spare you the drudgery of reviewing the results yourself. Should you trust that summary? Can AI find the right answer? That’s what I wanted to know.

I chose this signal flag query for several reasons. First, I already knew the answer from having looked it up in the International Code of Signals, Pub. 102 [2], available from the websites of the National Geospatial-Intelligence Agency and the US Coast Guard. Second, the correct use of signal flags is pretty simple and does not change much over time. Third, the authority for contemporary signal flag use is clear, yet misinformation about the use of signal flags persists on the internet. This could easily confuse an AI that did not distinguish between official, definitive sources and people posting erroneous information. If there is a flag signal for a given meaning, it’s in the pages of the ICOS in black and white. Finally, the procedures for using the flags are described in the first part of the manual, so it should be even harder to make mistakes about their usage.

As mariners, we know that to make a specific signal by flag, we refer to the International Code of Signals (ICOS) for the correct ‘hoist’, select the flags and raise them. To decode a signal, we look it up in the same book. It’s simple, really. So, what does AI do? It scans the internet, aggregates answers from many sources, weighs them, and offers up what it thinks is the correct answer. Some of those sources it will have absorbed during training; others it may visit for the first time in response to the question asked. AI is reputed to have some ability to discern which sources are truly definitive (gold) and which are just plain wrong (dross). This is the ‘Large Language Model’ (LLM) approach to machine learning. The danger comes from a machine’s inability to discriminate between a casual but erroneous blog post using an image from a Tintin cartoon from the 1940s (The Quarantine Flag – SailProof Shop [3]) and the definitive International Code of Signals. If an LLM is not discerning, it can confuse older, incorrect or error-filled information with current, correct information. This is akin to recommending the sacrifice of animals instead of the fertilising of land as a way to ensure a bountiful crop, or promoting bloodletting as a treatment for gout.

There are a host of concerns about AI – that it will be racist, that it will discriminate by gender, that it is opaque about how it arrives at answers, and so on – but not many people seem worried that it will be just flat-out wrong in the answers it provides. After this exercise, I am indeed worried that it is simply wrong, and that when wrong, it asserts the wrong answer with great conviction.

While the initial answers I received when I queried the AIs were all wrong, you may not see the same results if you try the query now, as each of the AIs has (hopefully) learned from its interaction with me and should now offer the correct answer. AI may be more useful for broader, more common queries, but for niche, specialized queries, all four AIs I tried delivered poor results. Sailing and cruising are niche areas. The use of signal flags is a niche within a niche.

The “Conversation”

The question I asked was, “Is the Lima flag appropriate for a vessel in quarantine?” Spoiler alert: the correct answer is ‘No’. The L-for-Lima flag means ‘You should stop your vessel instantly’. The correct signal for a vessel in quarantine is the Q-for-Quebec flag (‘My vessel is “healthy” and I request free pratique’) or two Q flags (QQ) (‘I require health clearance’). These meanings are definitive in the ICOS. (The questions, answers, and my follow-ups and comments with each of the four AIs are detailed in a separate and very long posting, available here [4]. I have annotated those “conversations” with comments in parentheses and italics, and some inline comments in red.)
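
As an aside for the technically curious: decoding a hoist is a deterministic table lookup, which is exactly what makes this such a clean test question. A minimal sketch in Python makes the point (the table below holds only the three ICOS meanings quoted above; it is an illustration, not a transcription of Pub. 102):

    # A tiny excerpt of the International Code of Signals as a lookup table.
    # Only the hoists discussed in this article are included; Pub. 102 is
    # the definitive source for the full set.
    ICOS_SIGNALS = {
        "L": "You should stop your vessel instantly.",
        "Q": 'My vessel is "healthy" and I request free pratique.',
        "QQ": "I require health clearance.",
    }

    def decode(hoist: str) -> str:
        """Decode a flag hoist the way a mariner would: look it up."""
        meaning = ICOS_SIGNALS.get(hoist.upper())
        if meaning is None:
            return f"'{hoist}' is not in this abridged table; consult Pub. 102."
        return meaning

    print(decode("Q"))   # My vessel is "healthy" and I request free pratique.
    print(decode("L"))   # You should stop your vessel instantly.

There is no weighing of sources and no inference involved: the book either contains the hoist or it does not.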

GROK

I spent the most time with Grok, and found it friendly and pleasantly conversational, even though it was spectacularly wrong and unfocused in its search for answers. Grok frequently cited sources and authorities that had no relevance to the query – Chart No. 1, the American Practical Navigator (Bowditch), the International Health Regulations, SOLAS, Brown’s Signalling (1916 Edition) and the Centers for Disease Control (CDC) – none of which contain any flag signalling information at all. Grok gave multiple dead links in support of its conclusions, though it could produce working links when asked (perhaps because it had visited those links long ago, during training). When pressed, it clung to the wrong answer for an unreasonably long time and only gave up and offered correct answers when spoon-fed the definitive references (e.g., the http link and page number in the document). The answers it delivered were bloated and filled with nonsense. Fun and friendly to chat with, but not useful.

Having used up much of my limited supply of patience for playing with AIs, I was more curt in my dealings with the other three, trying to see how quickly I could drive them to the right answer.

ChatGPT

When I asked ChatGPT the same question, it told me immediately that the L-for-Lima flag was incorrect (great!) and that the Q-for-Quebec flag was correct (great – half right!). It also told me that the Lima flag could mean ‘I have a dangerous cargo’ (wrong; that’s the B-for-Bravo flag) or ‘You should stop and await instructions’ (also wrong; that’s the meaning of the X-for-X-Ray flag). You could award half-points for getting part of it right, but not full points, because it added two erroneous meanings for the Lima flag. When further queried, ChatGPT did accept that the QQ signal would also be correct.

Since ChatGPT was the only AI that had a clue about the answer, I asked again the next day and got a different answer. This time, it said absolutely that the Lima flag was correct. When pressed, it confirmed that the ICOS supported that meaning. When I asked it to show me where in the ICOS this was written, it apologized for its error (“I misspoke earlier”) and offered the correct answer. That was disappointing.

CoPilot

Microsoft’s CoPilot was also wrong out of the gate, saying that the Lima flag was correct. When pressed, it suggested that the flag’s adoption under local ordinances by two port authorities (in Alaska and Rhode Island) somehow made it widespread practice, and that mariners trained in pre-1969 practices would somehow have retained this knowledge, making the use acceptable. It quoted my Currents article as support. When I outed myself as the author and asked why it didn’t accept what I had said in the article, it responded, “It’s a thoughtful and widely referenced piece and clearly shaped how many mariners and enthusiasts understand quarantine signaling today.” CoPilot then amended its misreading of the article and offered a correct answer. The flattery was noted, if not appreciated, but its apparent inability to read and comprehend the article was worrisome.

Gemini

Google’s Gemini was also wrong in its first answer, saying Lima was correct. When asked for sources, it found and gave part of the correct answer, offering the Q signal but not the QQ. Oddly, its second response did not acknowledge that it contradicted the first.

A series of subsequent attempts with the ‘fast’ and ‘pro’ versions gave mixed results, with Gemini clinging to false citations until directly challenged and invited to look up the answer. The Google Gemini AI was the only one of the four to include a disclaimer (“AI responses may include mistakes”) on each reply it gave – which is absolutely correct!

I repeated the experiment over several days using a browser in ‘privacy’ mode, so that previous context would not be accessed with the query. Results were similar, if not identical, although Grok seemed to tighten its focus, perhaps due to a version update that happened over those days.

What I Learned

In short, AI is scary!

  1. AI is often wrong, and when wrong will be wrong with great conviction, often with false citations and assertions.
    If you use an AI, check its work by actually reading the sources it cites. Apparently, AIs learn by reading, but they are not very good at it.
  2. AI always makes a trade-off between speed and depth of research. If there is a ‘think deeper/harder/more’, ‘pro mode’ or ‘expert mode’ button, use it; it’s worth the few extra seconds, and sometimes it delivers a better answer.
  3. If you are going to use AI, use an app, not a webpage, so the tool learns from you. Over time, its answers should become better for you because it retains the context in which you ask questions (a sketch of how that context works follows this list). This seemed somewhat true, although it is not a true fix: while the AI will learn, that learning will not be generalized into the core of the AI for some time, if at all. The context you create will embody the learning for your queries more quickly.
  4. Currents is judged by the AIs to “represent well-informed, professional opinions from a reputable organization,” so you should thank the editors for all their good work!
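
For the technically curious, point 3 above reflects how chat-style AI tools work: each reply is generated from the list of messages exchanged so far, so an app that stores that list effectively ‘remembers’ you, while a fresh webpage starts from nothing. Here is a minimal sketch, assuming the OpenAI Python client; the model name and the ask() helper are illustrative choices, not a recommendation:

    from openai import OpenAI

    client = OpenAI()  # assumes an API key is set in the environment

    # The tool's "memory" of you is nothing more than this growing list.
    history = [
        {"role": "system", "content": "You answer questions about marine signal flags."},
    ]

    def ask(question: str) -> str:
        """Send the whole conversation so far, then append the reply to it."""
        history.append({"role": "user", "content": question})
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=history,     # earlier turns supply the 'context'
        )
        answer = response.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        return answer

    # The second question benefits from the first: any correction made
    # earlier stays in the history and shapes the next answer.
    print(ask("Is the Lima flag appropriate for a vessel in quarantine?"))
    print(ask("Check that against the International Code of Signals, Pub. 102."))

The ‘learning’ here lives in the context you have built up, not in the model itself – which is why clearing that context, as in my privacy-mode runs, resets the AI to square one.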

In the International Code of Signals, the flag hoist Alpha over India (AI) means ‘Vessel (indicated by position and/or name or identity signal if necessary) will have to be abandoned.’ This seems like fitting serendipity, as my assessment suggests that AI must indeed be abandoned (at least for now) for sailing and cruising topics.

*Image created using AI, then modified by Barb Peck