An AI toolkit

Published

March 29, 2023

Introduction

In recent weeks there has been rapid progress in AI, with the widespread availability of powerful Large Language Models such as OpenAI's GPT-3.5 and GPT-4. We have also seen the release of LLMs that we can run at home, such as LLaMA, Point Alpaca, fine-tuned Galactica, OpenAssistant Pythia, and Cerebras GPT.

Software development is being greatly accelerated by tools such as GitHub Copilot and OpenAI GPT. Copilot is able to complete code as the programmer types, saving a huge amount of time. GPT-4 is able to write whole programs to spec in a matter of minutes, with relatively few bugs. We are, I think, at the start of an AI renaissance. Now is the time to have fun and achieve great things, before they’ve all been done by someone else, and before the robots and AIs completely eclipse us!

I am admittedly more expert at toolkit programming than I am at AI model training. However, I think that there is a place for the software-tools programmer in modern AI, and I intend to demonstrate that in this blog post.

The Toolkit Approach

Two very expressive programming languages are the shell, and make. Most people don’t use the shell for programming, and even fewer use make to execute programs. Those people might be missing out!

The shell lets us create pipelines of processes, which can solve problems much more concisely and in a modular fashion.
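
As a rough illustration of the pipeline idea, here is a minimal Python sketch in which each stage is a small generator that reads a stream and yields a stream, so that stages compose much as `tr | sort | uniq -c | sort -rn | head` would in the shell. The names `words`, `count`, `top` and `pipeline` are made up for this example.

```python
from collections import Counter
from functools import reduce


def words(lines):
    """Split each line into lowercase words, one per output item."""
    for line in lines:
        for word in line.lower().split():
            yield word


def count(items):
    """Count duplicates, like `sort | uniq -c`; yields (n, item) pairs."""
    for item, n in Counter(items).items():
        yield n, item


def top(k):
    """Return a stage keeping the k most frequent pairs, like `sort -rn | head`."""
    def stage(pairs):
        # Sort by count descending, then by item so ties are deterministic.
        yield from sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]
    return stage


def pipeline(source, *stages):
    """Feed the source through each stage in turn, left to right."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


text = ["the cat sat on the mat", "the dog sat"]
result = list(pipeline(text, words, count, top(2)))
```

Each stage knows nothing about its neighbours, so a new stage can be dropped into the middle of the chain without touching the others.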

Make lets us set up relations between objects, then tries to build any requested object. This is particularly good for exploratory programming or for systems that might be flaky, as we can correct a part of the program, then re-run just the steps that remain rather than starting from scratch.
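
To make the "re-run just the steps that remain" idea concrete, here is a small Python sketch of make-style rebuilding based on file modification times. It is only an approximation of what make does, and the `rules` layout (target mapped to a list of dependencies plus an action function) is an assumption made for this example.

```python
import os


def is_stale(target, deps):
    """A target is stale if it is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    built = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > built for dep in deps)


def build(target, rules, log=None):
    """Build `target` make-style: dependencies first, then the target if stale.

    `rules` maps a target path to (list of dependency paths, action function).
    Files with no rule are treated as sources that must already exist.
    Returns the list of targets that were actually rebuilt, in order.
    """
    log = [] if log is None else log
    if target not in rules:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target}")
        return log
    deps, action = rules[target]
    for dep in deps:
        build(dep, rules, log)
    if is_stale(target, deps):
        action(target, deps)
        log.append(target)
    return log
```

If a step fails halfway through a long chain, the targets already built stay on disk, and the next call to `build` only runs the actions for whatever is missing or out of date.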

Both the shell and make are held back by fairly crufty and obscure syntax, but we can handle that. If a program seems better suited to Python, C, or some other language, we can use that language.

Here are two main ideas from the software toolbox approach:

  1. The Unix philosophy, as documented by Doug McIlroy in the Bell System Technical Journal in 1978:
    • Make each program do one thing well.
    • To do a new job, build afresh rather than complicate old programs by adding new “features”.
    • Expect the output of every program to become the input to another, as yet unknown, program.
  2. The Practice of Programming, as described by Brian W. Kernighan and Rob Pike:
    • Simplicity, which keeps programs short and manageable;
    • clarity, which makes sure they are easy to understand, for people as well as machines;
    • generality, which means they work well in a broad range of situations and adapt well as new situations arise; and
    • automation, which lets the machine do the work for us, freeing us from mundane tasks.

What types of entities do we deal with in the digital world?

  • Simple Media
    • Data: Structured and unstructured data, including tabular, time-series, and geospatial data.
    • Text: Written content in various formats and styles.
    • Images: Visual content in various formats, such as photos, artwork, and diagrams; including bitmap and vector formats.
    • Audio: Sound content, including music, speech, and environmental sounds.
    • Video: Moving visual content, ranging from short clips to full-length films and animations.
  • Complex Media
    • Multi-Media: Content that combines multiple formats, such as web pages, presentations, and graphic novels.
    • Interactive Media: Content that requires user interaction, such as virtual reality experiences, simulations, data visualizations, and video games.
  • Active Agents
    • Software: Operating systems, applications, and tools used to create, manage, and interact with digital content.
    • AI Models: Artificial intelligence systems that generate, analyze, and enhance digital content.
    • Humans: Users who create, consume, and interact with digital content.
  • Infrastructure
    • Computers: Devices that store, process, and display digital content.
    • Networks: Systems that enable the sharing, distribution, and communication of digital content.

How can we connect these different types of entities?

[Figure: graph connecting the simple media types. Nodes: text, data, image, audio, video. Edges: text–data, text–image, text–audio, data–image, data–audio, data–video, image–video, audio–video.]
  • data is connected to all sorts of media
    • data→media
      • presenting and elucidating data in media
      • creating media from data
    • media→data
      • measuring and analysing media
      • encoding media in files
  • text is connected to audio and images
    • text→image
      • rendering
      • image generation
    • text→audio
      • speech synthesis
      • music synthesis
    • image→text
      • OCR
      • classification
      • description
    • audio→text
      • speech recognition
      • music analysis
      • classification
  • video consists of moving images with audio
    • images+audio🡘video
    • selecting best images
    • can also have text
      • subtitles, related to audio
      • content, related to images
[Figure: graph connecting all entity types. Nodes: text, data, image, audio, video, multimedia, interactive, software, humans, models, computers, networks. Edges: text–multimedia, image–multimedia, audio–multimedia, video–multimedia, data–multimedia, multimedia–interactive, software–text, software–interactive, humans–interactive, models–interactive, models–software, models–humans, computers–software, networks–multimedia, networks–computers.]
  • multimedia = text + data + image + audio + video
    • or some of the above
    • hyperlinks
    • some interactivity
  • interactive = multimedia + software + humans
    • forms
    • apps
    • dynamic content
      • random
      • customized
  • software
    • tools
    • apps
    • web apps
    • dynamic web pages
    • services / APIs
    • free vs proprietary
  • humans create and use media
    • editors, browsers
    • spreadsheets, forms, tables
    • cameras, displays
    • microphones, speakers
    • video cameras, displays
  • models can imitate human intelligence
    • can input and output all forms of media
    • can perform processing tasks
    • general intelligence
    • can plan
    • can solve problems
    • can discover things
    • can play games
    • can interface with software
    • can drive robots, cars, drones
  • computers store media, run software and models
    • can calculate much more rapidly than a human
    • includes phones, game consoles, and other devices
    • includes servers, cloud services
  • networks connect computers
    • connect humans, models and software via computers
    • transport media, software, and models; not humans, computers or networks
    • security and privacy become major concerns

It’s useful in a software project to consider what data types and data structures may be involved, before we get into the program code and algorithms.
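
One possible data structure for these connections is a table of directed edges between media types, which a program can search for a chain of conversions. The sketch below is only one way to do it: the edge labels are taken loosely from the lists above, and the `CONVERTERS` table and `find_route` function are names invented for this example.

```python
from collections import deque

# Directed edges between media types, labelled with the kind of tool involved.
# The dictionary layout is an assumption made for this sketch.
CONVERTERS = {
    ("text", "image"): "image generation",
    ("text", "audio"): "speech synthesis",
    ("image", "text"): "OCR",
    ("audio", "text"): "speech recognition",
    ("video", "image"): "frame selection",
    ("video", "audio"): "audio track extraction",
}


def find_route(src, dst, converters=CONVERTERS):
    """Breadth-first search for the shortest chain of conversions src -> dst.

    Returns a list of (from, to, tool) steps, or None if no route exists.
    """
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for (a, b), tool in converters.items():
            if a == node and b not in seen:
                seen.add(b)
                queue.append((b, path + [(a, b, tool)]))
    return None
```

For instance, `find_route("image", "audio")` goes via text: OCR first, then speech synthesis. Adding a new tool to the toolkit is then a matter of adding one edge to the table.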

text🡘text

  • format conversion
    • pandoc
    • w3m -dump
    • html scraping
  • manipulation
    • splitting / joining
    • paginating
    • formatting
  • LLM AI Processing
    • summary
    • correct
    • re-express
    • dot points
    • analyse / respond / criticise / assess / suggestions / feedback
    • expansion
    • narrative
    • ideas
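
Splitting is often the first step before LLM processing, since a long document has to be cut into pieces that fit the model's input. Here is a minimal sketch that splits on blank lines between paragraphs. The `split_text` name and the default `max_chars` limit are assumptions, and a real tool might count tokens rather than characters.

```python
def split_text(text, max_chars=2000):
    """Split text into chunks of at most max_chars, breaking between paragraphs.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks back together with blank lines gives the original paragraphs in order, so each chunk can be summarised or corrected separately and the results reassembled.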

text🡘image

  • Text to Image:
    • Stable Diffusion &c
      • txt2img: prompts for image generation
      • img2img: image + text -> image
      • embeddings, etc.
      • controlnet
  • Image to Text:
    • CLIP Interrogation &c
      • image to prompt
      • image + words to weights
  • Text to Image: Rendering
    • e.g. PDF to Image
      • ghostscript
    • word art
      • gimp or imagemagick?
      • ???
  • Image to Text: Optical Character Recognition
    • Tesseract
      • correct using LLM
      • need to adjust black and white levels in the input images
      • pre-process to extract columns
    • Number plates, signs, book covers and spines, displays
    • Handwriting recognition
      • MNIST: Digits and Post Codes
      • ???
      • Pen Entry (easier)

image🡘image

  • enhancement, processing, and format conversion
    • imagemagick, netpbm
    • gimp
    • super-resolution
    • PIL
    • OpenCV
  • AI tools
    • segmentation
    • deoldify
    • colouring
  • Image to Image
    • Stable Diffusion &c
      • img2img: image + text -> image
      • controlnet

text🡘audio

  • Text to Audio: Speech Synthesis
    • coqui-ai TTS
      • needs help with some phonetics
    • gtts-cli
      • not open source
  • Audio to Text: Speech Recognition, Transcription
    • whisper
  • Music Notation to Audio
    • midi synthesis
  • Audio to Music Notation
    • music analysis
  • Sound Effects (under Data?)

image🡘audio ?

  • Image to Audio
    • colors to notes?
    • sign language?
  • Audio to Image
    • spectrogram
      • useful as input to CNN models / classifiers
    • waveform, more like data

audio🡘audio

  • enhancement
    • noise reduction
    • volume normalisation
    • equaliser
    • stereo / mono
    • pitch
    • speed
  • effects
  • tools
    • ffmpeg
    • sox
    • audacity
  • split / join
    • detect and crop silence / sound

Appendix 1: More Examples of Each Entity Type

Text

  • plain text, preferably utf-8 with UNIX line endings
  • markdown, for rich text
  • HTML, for the web
  • LaTeX, for math notation
  • program code, preferably Python
  • PDFs, presentation format
  • books, e-books, manuals, papers
  • articles and online content
  • news
  • text chat, email, etc
  • recipes
  • blogs, home pages
  • keyboard input
  • terminal output
  • text in an editor
  • forms

Data

  • tabular / relational data
  • spreadsheets
  • time-series data
  • spectrograms
  • personal data
  • population statistics
  • scientific data
  • organisational data

Images

  • various formats: png, jpeg, webp, etc.
  • photos: portraits, landscapes, nature
  • artworks: drawings, paintings, manga
  • scans or photos of documents
  • screenshots
  • graphs, charts, diagrams and figures
  • infographics
  • maps
  • camera input

Audio

  • various formats: flac, wav, ogg, mp3, etc.
  • speech, including synthetic
  • audio tracks from a video
  • music
  • microphone input, e.g. from the user
  • natural or environmental sounds
  • telephony / voice chat
  • podcasts and interviews
  • audio books

Video

  • various formats: webm, mkv, mp4, etc.
  • web videos: YouTube, etc
  • short films, movies, TV series
  • news and documentaries
  • promotional videos
  • security footage
  • video calls / chat / meetings
  • animation, anime, demos
  • webinars
  • camera input

Multimedia

  • web pages
  • presentations
  • graphic novels
  • social media content
  • online courses

Interactive Media

  • Virtual and augmented reality experiences
  • Interactive simulations and models (e.g., scientific simulations, architectural models)
  • Data visualizations and dashboards (e.g., interactive dashboards, heatmaps)
  • Virtual assistants, chatbots, and voice assistants
  • Educational games and quizzes
  • Interactive storytelling and narratives
  • Editors
  • flashcard decks
  • interactive resources
  • interactive fiction, adventure games
  • computer games

Software and Applications

  • Operating systems
  • Web applications
  • Mobile applications
  • Desktop applications

AI apps

  • AI-generated content and recommendations
  • Personalization and adaptive learning systems
  • Conversational interfaces and natural language processing
  • AI-driven content analysis and summarization
  • Collaborative filtering and recommender systems

What tools do we have?

Appendix 2: Some Additional Entity Types

Security and Privacy

This category addresses the protection of digital content, user information, and systems from unauthorized access, manipulation, or damage. It also considers user privacy, data confidentiality, and compliance with relevant regulations. Including this category highlights the importance of safeguarding digital assets, personal information, and trust in the digital world.

  • Firewalls
  • Cryptosystems
  • Public and private keys
  • Passwords
  • Two-factor authentication (2FA) methods
  • 2FA devices (e.g., security tokens)
  • Digital wallets
  • Virtual private networks (VPNs)
  • Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates
  • Intrusion detection/prevention systems (IDS/IPS)
  • Biometrics (e.g., fingerprint, facial recognition)
  • End-to-end encryption (E2EE)
  • Data anonymization techniques
  • Security policies and protocols
  • Antivirus and anti-malware software
  • Access control systems
  • Secure file storage and sharing (e.g., encrypted cloud storage)
  • Secure messaging applications
  • Security audits and assessments
  • Data breach detection and response tools

3D

  • 3D video
  • 3D game worlds
  • virtual reality
  • augmented reality

Other Senses

  • touch
  • smell
  • taste
  • balance
  • echo-location

Extra-Sensory

  • brain waves
  • internet
  • radio
  • GPS
  • radar, sonar, lidar

Geospatial Data

  • Geographical Information Systems (GIS) data
  • Satellite imagery
  • Geo-referenced data
  • Maps
  • Routing

Ephemeral Content

  • Social media stories (e.g., Instagram Stories, Snapchat Stories)
  • Live streaming (e.g., Twitch, YouTube Live)
  • Live event coverage (e.g., sports, concerts)

Metadata

  • Tags, keywords, and descriptors for various types of content
  • EXIF data for images
  • ID3 tags for audio files
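
As a small example of reading metadata, the older ID3v1 format stores a fixed 128-byte tag at the very end of an MP3 file, which can be parsed with a few slices. This sketch handles ID3v1 only, not the more common ID3v2, and the `read_id3v1` name is invented for illustration.

```python
def read_id3v1(data):
    """Parse an ID3v1 tag from the last 128 bytes of MP3 file contents.

    `data` is the file contents as bytes. Returns a dict of fields,
    or None if no ID3v1 tag is present.
    """
    tag = data[-128:]
    if len(tag) < 128 or tag[:3] != b"TAG":
        return None

    def field(start, length):
        # Fields are fixed width, padded with NUL bytes or spaces.
        raw = tag[start:start + length]
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title": field(3, 30),
        "artist": field(33, 30),
        "album": field(63, 30),
        "year": field(93, 4),
        "comment": field(97, 30),
        "genre": tag[127],  # numeric genre code
    }
```

Once the tags are in a dictionary like this they become ordinary data, ready to be tabulated, searched, or passed along to another tool.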

Internet of Things (IoT) Data

  • Smart home devices (e.g., thermostats, lighting systems)
  • Wearable technology (e.g., smartwatches, fitness trackers)
  • Industrial sensors and systems