Sam’s AI Blog - An AI toolkit

Introduction

In recent weeks there has been rapid progress in AI, with the widespread availability of powerful Large Language Models such as OpenAI's GPT-3.5 and GPT-4. We have also seen the release of LLMs that we can run at home, such as LLaMA, Point Alpaca, fine-tuned Galactica, OpenAssistant Pythia, and Cerebras GPT.

Software development is being greatly accelerated by tools such as GitHub Copilot and OpenAI GPT. Copilot is able to complete code as the programmer types, saving a huge amount of time. GPT-4 is able to write whole programs to spec in a matter of minutes, with relatively few bugs. We are, I think, at the start of an AI renaissance. Now is the time to have fun and achieve great things, before they’ve all been done by someone else, and before the robots and AIs completely eclipse us!

I am admittedly more expert at toolkit programming than at AI model training; however, I think there is a place for the software-tools programmer in modern AI, and I intend to demonstrate that in this blog post.

The Toolkit Approach

Two very expressive programming languages are the shell and make. Most people don’t use the shell for programming, and even fewer use make to execute programs. Those people might be missing out!

The shell lets us create pipelines of processes, which can solve problems concisely and in a modular fashion.
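
As a rough Python analogue of a shell pipeline such as `sort | uniq`, the sketch below passes text through a list of commands, feeding each command's output to the next. The helper name `run_pipeline` is hypothetical, and the example assumes the standard `sort` and `uniq` programs are installed. Unlike a real shell, it runs the stages one after another rather than concurrently.

```python
import subprocess


def run_pipeline(commands, input_text=""):
    """Feed input_text through each command in turn, like `a | b | c`."""
    data = input_text
    for command in commands:
        # check=True raises CalledProcessError if a stage fails.
        result = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


if __name__ == "__main__":
    words = "pear\napple\npear\nfig\n"
    # Similar in spirit to: printf 'pear\napple\npear\nfig\n' | sort | uniq
    print(run_pipeline([["sort"], ["uniq"]], words), end="")
```

Each stage stays a small, separate program, so swapping `uniq` for `uniq -c`, or adding another stage, only changes the list of commands.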

Make lets us set up relations between objects, then tries to build any requested object. This is particularly good for exploratory programming or for systems that might be flaky, as we can correct a part of the program, then re-run just the steps that remain rather than starting from scratch.
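
To illustrate the idea rather than make itself, here is a minimal Python sketch of the same rebuild logic: a target is rebuilt only if it is missing or older than one of its dependencies. The function names and the `rules` layout are assumptions made for this example, not anything make provides.

```python
import os


def out_of_date(target, deps):
    """A target needs rebuilding if it is missing or older than a dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in deps)


def build(target, rules, built=None):
    """Bring target up to date; return the list of targets actually rebuilt.

    rules maps each target to (dependencies, recipe), where recipe is a
    function called as recipe(target, dependencies).
    """
    if built is None:
        built = []
    if target not in rules:
        # A source file: nothing to build, but it must exist.
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target!r}")
        return built
    deps, recipe = rules[target]
    for dep in deps:
        build(dep, rules, built)
    if out_of_date(target, deps):
        recipe(target, deps)
        built.append(target)
    return built
```

If a late step fails, the earlier outputs remain on disk, so calling `build` again after fixing the problem repeats only the steps that are still missing or stale.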

Both the shell and make are held back by fairly crufty and obscure syntax, but we can handle that. If a program seems better suited to Python, C, or some other language, we can use that language.

Here are two main ideas from the software toolbox approach:

The Unix philosophy is documented by Doug McIlroy in the Bell System Technical Journal from 1978:
- Make each program do one thing well.
- To do a new job, build afresh rather than complicate old programs by adding new “features”.
- Expect the output of every program to become the input to another, as yet unknown, program.
The Practice of Programming, as described by Brian W. Kernighan and Rob Pike:
- Simplicity, which keeps programs short and manageable;
- clarity, which makes sure they are easy to understand, for people as well as machines;
- generality, which means they work well in a broad range of situations and adapt well as new situations arise; and
- automation, which lets the machine do the work for us, freeing us from mundane tasks.

What types of entities do we deal with in the digital world?

Simple Media
- Data: Structured and unstructured data, including tabular, time-series, and geospatial data.
- Text: Written content in various formats and styles.
- Images: Visual content in various formats, such as photos, artwork, and diagrams; including bitmap and vector formats.
- Audio: Sound content, including music, speech, and environmental sounds.
- Video: Moving visual content, ranging from short clips to full-length films and animations.
Complex Media
- Multi-Media: Content that combines multiple formats, such as web pages, presentations, and graphic novels.
- Interactive Media: Content that requires user interaction, such as virtual reality experiences, simulations, data visualizations, and video games.
Active Agents
- Software: Operating systems, applications, and tools used to create, manage, and interact with digital content.
- AI Models: Artificial intelligence systems that generate, analyze, and enhance digital content.
- Humans: Users who create, consume, and interact with digital content.
Infrastructure
- Computers: Devices that store, process, and display digital content.
- Networks: Systems that enable the sharing, distribution, and communication of digital content.

How can we connect these different types of entities?

data is connected to all sorts of media
- data→media
  - presenting and elucidating data in media
  - creating media from data
- media→data
  - measuring and analysing media
  - encoding media in files
text is connected to audio and images
- text→image
  - rendering
  - image generation
- text→audio
  - speech synthesis
  - music synthesis
- image→text
  - OCR
  - classification
  - description
- audio→text
  - speech recognition
  - music analysis
  - classification
video consists of moving images with audio
- images+audio🡘video
- selecting best images
- can also have text
  - subtitles, related to audio
  - content, related to images

multimedia = text + data + image + audio + video
- or some of the above
- hyperlinks
- some interactivity
interactive = multimedia + software + humans
- forms
- apps
- dynamic content
  - random
  - customized
software
- tools
- apps
- web apps
- dynamic web pages
- services / APIs
- free vs proprietary
humans create and use media
- editors, browsers
- spreadsheets, forms, tables
- cameras, displays
- microphones, speakers
- video cameras, displays
models can imitate human intelligence
- can input and output all forms of media
- can perform processing tasks
- general intelligence
- can plan
- can solve problems
- can discover things
- can play games
- can interface with software
- can drive robots, cars, drones
computers store media, run software and models
- can calculate much more rapidly than a human
- includes phones, game consoles, and other devices
- includes servers, cloud services
networks connect computers
- connect humans, models and software via computers
- transport media, software, and models; not humans, computers or networks
- security and privacy become major concerns

It’s useful in a software project to consider what data types and data structures may be involved, before we get into the program code and algorithms.
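
One way to make this concrete is to treat the media types as nodes in a graph and the conversions as edges. The sketch below records only the text, image, and audio conversions listed above and uses a breadth-first search to find a chain from one type to another. The names `CONVERSIONS` and `find_path` are illustrative assumptions.

```python
from collections import deque

# Conversions taken from the lists above; each edge names some techniques.
CONVERSIONS = {
    ("text", "image"): ["rendering", "image generation"],
    ("text", "audio"): ["speech synthesis", "music synthesis"],
    ("image", "text"): ["OCR", "classification", "description"],
    ("audio", "text"): ["speech recognition", "music analysis", "classification"],
}


def find_path(source, target, conversions=CONVERSIONS):
    """Breadth-first search for the shortest chain of conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for src, dst in conversions:
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None  # no chain of conversions connects the two types


if __name__ == "__main__":
    # There is no direct image-to-audio edge, so the path goes via text.
    print(find_path("image", "audio"))
```

With this table, reading a scanned page aloud shows up as image to text (OCR) followed by text to audio (speech synthesis).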

text🡘text

format conversion
- pandoc
- w3m -dump
- html scraping
manipulation
- splitting / joining
- paginating
- formatting
LLM AI Processing
- summary
- correct
- re-express
- dot points
- analyse / respond / criticise / assess / suggestions / feedback
- expansion
- narrative
- ideas
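
As a small example of the format-conversion step, the following sketch extracts readable text from HTML using only Python's standard library, loosely in the spirit of `w3m -dump`. It is far cruder than w3m or pandoc; the class name, the function name, and the choice of block-level tags are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    SKIP_TAGS = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "title",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML document, one line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).splitlines()
    lines = [" ".join(line.split()) for line in lines]  # collapse whitespace
    return "\n".join(line for line in lines if line)
```

The plain-text result can then be split, paginated, or passed to an LLM for any of the processing steps listed above.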

text🡘image

Text to Image:
- Stable Diffusion &c
  - txt2img: prompts for image generation
  - img2img: image + text -> image
  - embeddings, etc.
  - controlnet
Image to Text:
- Clip Interrogation &c
  - image to prompt
  - image + words to weights
Text to Image: Rendering
- e.g. PDF to Image
  - ghostscript
- word art
  - gimp or imagemagick?
  - ???
Image to Text: Optical Character Recognition
- Tesseract
  - correct using LLM
  - need to adjust black and white levels in the input images
  - pre-process to extract columns
- Number plates, signs, book covers and spines, displays
- Handwriting recognition
  - MNIST: Digits and Post Codes
  - ???
  - Pen Entry (easier)
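
The black and white level adjustment mentioned under Tesseract can be sketched on a plain list of 8-bit grey values. In practice one would apply the same mapping to a real image with a tool such as imagemagick or PIL; the function name and the threshold values in the example are assumptions chosen for illustration.

```python
def adjust_levels(pixels, black, white):
    """Stretch 8-bit grey values so that black..white maps to 0..255.

    Values at or below `black` become 0 and values at or above `white`
    become 255, which removes a grey page background and faint noise.
    """
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    scale = 255 / (white - black)
    return [min(255, max(0, round((p - black) * scale))) for p in pixels]


if __name__ == "__main__":
    row = [0, 51, 102, 153, 204, 255]
    print(adjust_levels(row, black=51, white=204))
```

Suitable thresholds depend on the scan, so they are usually found by trial on a few sample pages.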

image🡘image

enhancement, processing, and format conversion
- imagemagick, netpbm
- gimp
- super-resolution
- PIL
- OpenCV
AI tools
- segmentation
- deoldify
- colouring
Image to Image
- Stable Diffusion &c
  - img2img: image + text -> image
  - controlnet

text🡘audio

Text to Audio: Speech Synthesis
- coqui-ai TTS
  - needs help with some phonetics
- gtts-cli
  - the client is open source, but it relies on Google's proprietary online service
Audio to Text: Speech Recognition, Transcription
- whisper
Music Notation to Audio
- midi synthesis
Audio to Music Notation
- music analysis
Sound Effects (under Data?)
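
For the "Music Notation to Audio" direction, here is a toy sketch using only the standard library. It converts MIDI note numbers to frequencies with the equal-temperament formula (A4 is note 69 at 440 Hz) and writes plain sine tones to a WAV file. It is not a real MIDI synthesizer, and the function names are assumptions.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100


def note_to_frequency(note):
    """Convert a MIDI note number to Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def synthesize(notes, seconds_per_note=0.25, amplitude=0.5):
    """Return 16-bit samples for a sequence of MIDI notes as sine tones."""
    samples = []
    count = int(SAMPLE_RATE * seconds_per_note)
    for note in notes:
        freq = note_to_frequency(note)
        for n in range(count):
            value = amplitude * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            samples.append(int(value * 32767))
    return samples


def write_wav(path, samples):
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

For example, `write_wav("scale.wav", synthesize([60, 62, 64, 65, 67]))` writes the first five notes of a C major scale, and the result can be passed to tools such as sox or ffmpeg.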

image🡘audio ?

Image to Audio
- colors to notes?
- sign language?
Audio to Image
- spectrogram
  - useful as input to CNN models / classifiers
- waveform, more like data
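
A spectrogram is a grid of frequency magnitudes, one row per short frame of audio. The sketch below computes one with a naive discrete Fourier transform in pure Python. Real code would normally use an FFT library and apply a window function to each frame; the function names, frame size, and hop size are assumptions chosen to keep the example small.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the first N/2 DFT bins of one frame (naive O(N^2))."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_size=32, hop=16):
    """One row of bin magnitudes per overlapping frame of the signal."""
    return [
        dft_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]


if __name__ == "__main__":
    # A sine wave with exactly 4 cycles per 32-sample frame.
    tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
    for row in spectrogram(tone):
        print(row.index(max(row)))  # index of the strongest bin in each frame
```

Treating each row as a line of pixels gives the image that a CNN classifier would take as input.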

audio🡘audio

enhancement
- noise reduction
- volume normalisation
- equaliser
- stereo / mono
- pitch
- speed
effects
tools
- ffmpeg
- sox
- audacity
split / join
- detect and crop silence / sound
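
Two of the operations above, volume normalisation and cropping silence, are simple enough to sketch on a plain list of samples in the range -1.0 to 1.0. Tools such as sox and ffmpeg do this far more carefully; the function names and the default threshold here are assumptions.

```python
def normalise(samples, peak=1.0):
    """Scale samples so that the loudest one has magnitude `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0:
        return list(samples)  # silence stays silent
    return [s * peak / loudest for s in samples]


def trim_silence(samples, threshold=0.01):
    """Crop leading and trailing samples quieter than `threshold`."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


if __name__ == "__main__":
    clip = [0.0, 0.0, 0.5, -0.2, 0.0, 0.3, 0.0]
    print(normalise(trim_silence(clip, threshold=0.1)))
```

Quiet samples inside the clip are kept, so pauses between sounds survive; only the ends are cropped.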

Appendix 1: More Examples of Each Entity Type

Text

plain text, preferably utf-8 with UNIX line endings
markdown, for rich text
HTML, for the web
LaTeX, for math notation
program code, preferably Python
PDFs, presentation format
books, e-books, manuals, papers
articles and online content
news
text chat, email, etc
recipes
blogs, home pages
keyboard input
terminal output
text in an editor
forms

Data

tabular / relational data
spreadsheets
time-series data
spectrograms
personal data
population statistics
scientific data
organisational data

Images

various formats: png, jpeg, webp, etc.
photos: portraits, landscapes, nature
artworks: drawings, paintings, manga
scans or photos of documents
screenshots
graphs, charts, diagrams and figures
infographics
maps
camera input

Audio

various formats: flac, wav, ogg, mp3, etc.
speech, including synthetic
audio tracks from a video
music
microphone input, e.g. from the user
natural or environmental sounds
telephony / voice chat
podcasts and interviews
audio books

Video

various formats: webm, mkv, mp4, etc.
web videos: youtube, etc
short films, movies, TV series
news and documentaries
promotional videos
security footage
video calls / chat / meetings
animation, anime, demos
webinars
camera input

Multimedia

web pages
presentations
graphic novels
social media content
online courses

Interactive Media

Virtual and augmented reality experiences
Interactive simulations and models (e.g., scientific simulations, architectural models)
Data visualizations and dashboards (e.g., interactive dashboards, heatmaps)
Virtual assistants, chatbots, and voice assistants
Educational games and quizzes
Interactive storytelling and narratives
Editors
flashcard decks
interactive resources
interactive fiction, adventure games
computer games

Software and Applications

Operating systems
Web applications
Mobile applications
Desktop applications

AI apps

AI-generated content and recommendations
Personalization and adaptive learning systems
Conversational interfaces and natural language processing
AI-driven content analysis and summarization
Collaborative filtering and recommender systems

What tools do we have?

Appendix 2: Some Additional Entity Types

Security and Privacy

This category addresses the protection of digital content, user information, and systems from unauthorized access, manipulation, or damage. It also considers user privacy, data confidentiality, and compliance with relevant regulations. Including this category highlights the importance of safeguarding digital assets, personal information, and trust in the digital world.

Firewalls
Cryptosystems
Public and private keys
Passwords
Two-factor authentication (2FA) methods
2FA devices (e.g., security tokens)
Digital wallets
Virtual private networks (VPNs)
Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates
Intrusion detection/prevention systems (IDS/IPS)
Biometrics (e.g., fingerprint, facial recognition)
End-to-end encryption (E2EE)
Data anonymization techniques
Security policies and protocols
Antivirus and anti-malware software
Access control systems
Secure file storage and sharing (e.g., encrypted cloud storage)
Secure messaging applications
Security audits and assessments
Data breach detection and response tools

3D

3D video
3D game worlds
virtual reality
augmented reality

Other Senses

touch
smell
taste
balance
echo-location

Extra-Sensory

brain waves
internet
radio
GPS
radar, sonar, lidar

Geospatial Data

Geographical Information Systems (GIS) data
Satellite imagery
Geo-referenced data
Maps
Routing

Ephemeral Content

Social media stories (e.g., Instagram Stories, Snapchat Stories)
Live streaming (e.g., Twitch, YouTube Live)
Live event coverage (e.g., sports, concerts)

Metadata

Tags, keywords, and descriptors for various types of content
EXIF data for images
ID3 tags for audio files

Internet of Things (IoT) Data

Smart home devices (e.g., thermostats, lighting systems)
Wearable technology (e.g., smartwatches, fitness trackers)
Industrial sensors and systems