Python

London Python meetup May 2024

The meetup was held at Microsoft’s Reactor offices near Paddington which have a great view down the canal towards Maida Vale. Attendees got an email with a QR code to get in through the gate which all felt very high-tech.

The first talk was not particularly Python related but was an introduction to vector databases. These are having a hot moment due to the way that machine learning categorisation maps easily into flat vectors that can then be stored and compared through vector stores.

Then can then be used to complement LLMs through the Retrieval Augmented Generation (RAG) which combines the LLM’s ability to synthesis and summarise content with more conventional search index information.

It was fine as it went and helped demystify the way that RAG works but probably this langchain tutorial is just as helpful as to the practical application.

The second talk was about langchain but was from a Microsoft employee who was demonstrating how to use Bing as a agent augmentation in the Azure hosted environment. It was practical but the agent clearly spun out of control in the demo and while the output was in the right ballpark I think it illustrated the trickiness of getting these things to work reliably and to generate reliable output when the whole process is essentially random and different each run.

It was a good shop window into the hosted langchain offering but could have done more to explore the agent definition.

The final talk was by Nathan Matthews CTO of Retrace Software. Retrace allows you to capture replay logs from production and then reproduce issues in other environments. Sadly there wasn’t a demo but it is due to be released as open source soon. The talk went through some of the approaches that had been taken to get to the release. Apparently there is a “goldilocks zone” for data capture that avoids excessive log size and performance overhead. This occurs at the library interface level with a proxy capture system for C integration (and presumably all native integration). Not only is lower level capture chatty but capturing events at a higher-level of abstraction makes the replay process more robust and easier to interact with.

The idea is that you can take the replay of an issue or event in production, replay it on a controlled environment with a debugger attached to try and find out the cause of the issue without ever having to go onto a production environment. Data masking for sensitive data is promised which then means that the replay logs can have different data handling rules applied to them.

Nathan pointed out that our currently way of dealing with unusual and intermittent events in production is invest heavily in observability (which often just means shipping a lot of low-level logging to a search system). The replay approach seems to promise a much simpler approach for analysing and understand unusual behaviour in environments with access controls.

It was interesting to hear about poking into the internals of the interpreter (and the OS) as it is not often that people get a chance to do it. However the issue of what level of developer access to production is the bigger problem to solve and it would be great to see some evidence of how this works in a real environment.

Work

January 2024 month notes

Water CSS

I started giving this minimal element template a go after years of using various versions of Bootstrap. It is substantially lighter in terms of the components it offers with probably the navigation bar being the one component that I definitely miss. The basic forms and typography are proving fine for prototyping basic applications though.

Node test runner

Node now has a default test runner and testing framework. I’ve been eager to give it a go as I’ve heard that it is both fast and lightweight, avoiding the need to select and include libraries for testing, mocking and assertions. I got the chance to introduce it in a project that didn’t have any tests and I thought it was pretty good although it’s default text output felt a little unusual and the alternative dot notation might be a bit more familiar.

It’s interesting to see that the basic unit of testing is the assertion, something is shares with Go. It also doesn’t support parameterised tests which again is like Go which has a pattern of table-driven tests implemented with for loops except that Go allows more control of the dynamic test case naming.

I’d previously moved to the Ava library and I’m not sure there is a good reason not to use the built-in alternative.

Flask blueprints

In my personal projects I’ve tended to use quite a few cut and paste modules and over the years they tend to drift and get out of sync so I’ve been making a conscious effort to learn about and start adopting Flask Blueprints. Ultimately I want to try and turn these into personal module dependencies that I can update once and use in all the projects. For the moment though it is interesting how the blueprints format is pushing me to do some things like logging better (to understand what is happening in the blueprint) and also structuring the different areas of the application so that they are quite close to Django apps with various pieces of functionality now starting to be associated with a url prefix that makes it a bit easier to create middleware that is registered as part of the Blueprint rather than relying on imports and decorators.

Web components

I’ve been making a bit of progress with learning about web components. I realised that I was trying to do too much initially which is why they were proving complicated. Breaking things down a bit has helped with an initial focus on event listeners within the component. I’m also not bringing in external libraries at the moment but have got as far as breaking things up into [ESM modules](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Modules) which has mostly worked out so far.

Work

October 2023 month notes

I’ve been learning more about Postgres as I have been moving things from Dataset to Psycopg3. It is kind of ridiculous the kind of things you can do with it when strip away the homogenising translation layer of things like ORMs. Return a set of columns from your update? No problem. Upsert? Straight-forward.

However after completing a CONFLICT clause I received a message that no conflict was possible on the columns I was checking and I discovered that I had failed to add a Primary Key to the table when I created it. It probably didn’t matter to the performance of the table as it was a link table with indexes on each lookup column but I loved the way that the query parsing was able to do that level of checking on my structure.

Interestingly I had a conflict clause in my previous ORM statement I was replacing and it had never had an issue so presumably it was doing an update then insert pattern in a transaction rather than using native features. For me this shows how native solutions are often better than emulation.

Most of the apps I’ve converted to direct use of queries are feeling more responsive now (including the one I use to draft these posts) but I’m not 100% certain whether this is because of switch to lower-level SQL or because I’ve been fixing the problems in the underlying relational model that were previously being hidden from me.

We’re going to need a faster skateboard

I have been thinking a lot about the Gold-plated Donkey Cart this month. When you challenge problems with solutions you often first have a struggle to get people to admit that there is a problem and even if it is admitted then often the first response is to try and patch or amend the existing solution than consider whether the right response might be.

We have additive minds so this tendency to patch what is existing is natural but sometimes people aggressively defend the status quo, even when it is counter-productive to their overall success.

Weakly typed

I’ve had some interesting experiences with Typescript this month, most notably an issue with a duplicated package which resulted in code that has been running in production for months but which has either not been correctly typed or has been behind the intended version by maybe four major versions. Typescript is interesting amongst type-hinted languages in that it has typing files that are often supplied separately from the code itself and in some cases which exist independently of the code itself. My previous experience of Python typing for example stopped the checker at the boundaries of third-parties and therefore only applied to the code you are writing yourself.

I’m uncertain of the value of providing type files for Javascript libraries as the compile-time and runtime contexts seem totally different. I found a Javascript dependency that had a completely broken unit test file and on trying to correct it I found that it couldn’t have the behaviour that the tests were trying to verify. Again I wondered about how this code was working in production and predictably it turned out that the executed code path never included the incorrectly specified behaviour. Dynamic code can be very resilient and at the same time a time bomb waiting to happen no matter what your

I think Typescript code would be better off if it was clearer that any guarantees of correctness can only be provided for the code you have totally under your control and which is being compiled and checked by you.

Frozen in time

I’ve been thinking a lot as well about a line from this talk by Killian Valkhof where he mentions that our knowledge on how to do things often gets frozen based on how we initially learnt to do things. For developers who learnt React for frontend will be the future people who learnt to do frontend via jQuery. I’ve been looking at Web Components which I thought were pretty terrible when they first came out but now look delightfully free of complex build chains and component models.

But more fundamentally it has made me think about when I choose or reject things am I doing so based on their inherent qualities in the present moment or based on the moment in time when I first learnt and exercised those skills. For CSS for example I’m relatively old-fashioned and I have never been a fan of the CSS-in-JS idea. However I think this approach, while maybe being outside contemporary preferences, is sound. Sound CSS applies across any number of frontend component models and frameworks and the work that goes into the CSS standards is excellent where as (ironically) the limitations of Javascript frameworks to express CSS concepts means that often it is a frozen subset that is usable.

I’ve never been entirely comfortable with Docker or Kubernates though and generally prefer PaaS or “serverless” solutions. Is that because I enjoyed the Heroku developer experience and never really understood the advantages of containerisation as a result.

Technology is fashion and therefore discernment is a critical quality for developers. For most developers though it is not judgement that they manifest but a toxic self-belief in the truth of whatever milieu they entered into the industry in. As I slog through my third decade in the profession doubt is something that I feel strongly about my opinions and trying to frame my judgements in the evidence and reasoning available now seems a valuable technique.

Programming, Python

Transcribing podcasts with Google’s Speech to Text API

I don’t really listen to podcasts, even now when I have quite a long commute. I generally read faster than I can listen and prefer to read through transcripts than listen, even when the playback speed is increased. Some shows have transcripts and generally I skim read those when available to see if it would be worth listening to segments of the podcasts. But what about the podcasts without transcripts? Well Google has a handy Speech to Text API so why not turn the audio into a text file and then turn it into a HTML format I can read on the phone on the tube?

tldr; the API is pretty much the same one as generates the Youtube automatic subtitling and transcripts. It can just about create something that is understandable as a human but its translation of vernacular voices is awful. If Youtube transcripts don’t work for you then this isn’t a route worth pursuing.

Streaming pods

I’m not very familiar with Google Cloud Services, I used to do a lot of App Engine development but that way of working was phased out in favour of something a bit more enterprise friendly. I have the feeling that Google Cloud’s biggest consumers are data science and analysis teams and the control systems intersect with Google Workspace which probably makes administration easier in organisations but less so for individual developers.

So I set up a new project, enabled billing, associated the billing account with a service account, associated the service account with the project and wished I’d read the documentation to know what I should have been doing. And after all that I created a bucket to hold my target files in.

You can use the API to transcribe local audio files but only if they are less than 60 seconds long. I needed to be using the long running asynchronous invocation version of the API. I also should have realised that I need to write the transcription to a bucket too, I ended up using the input file name with “.json” attached but until I started doing that I didn’t realise that my transcription was failing to recognise my input.

Learning the API

One really nice feature Google Cloud has is the ability to run guided tutorials in your account via CloudShell. You get a step by step guide that can simply paste the relevant commands to your shell. Authorising the shell to access the various services was also easier than generating credentials locally for what I wanted to do.

Within 10 minutes I had processed my first piece of audio and had a basic Python file setup. However the test file was in quite an unusual format and the example was the synchronous version of the API.

I downloaded a copy of the Gettysburg address and switched the API version but then had my CloudShell script await the outcome of the transcoding.

Can you transcribe MP3?

The documentation said yes (given a specific version) and while the client code accepted the encoding type, I never got MP3 to work and instead I ended up using ffmpeg to create FLAC copies of my MP3 files. I might have been doing something wrong but I’m not clear what it was and the job was accepted but it was returning an empty JSON object (this is where creating files for the output is much more useful that trying to print an empty response).

FLAC worked fine and the transcript seemed pretty on the money and converting the files didn’t seem that much of a big deal. I could maybe do an automatic conversion later when the file hit the bucket if I needed to.

However after my initial small files I found that waiting for the result of the API call resulted in hitting a timeout on the execution duration within the shell. I’ve hit something like this before when running scripts over Google Drive that copied directories. I didn’t have a smart solution then (I just skipped files that already existed and re-run the jobs a lot) and I didn’t have one now.

Despite the interactive session timing out the job completed fine and the file appeared in the storage bucket. Presumably this would have been where it would have been easier to be running the script locally or on some kind of temporary VM. Or perhaps I should have been able to get the run identifier and just have checked the job using that. The whole asynchronous execution of jobs in Google Cloud is another area where what you are meant to do is unclear to me and working on this problem didn’t require me to resolve my confusion.

Real audio is bobbins

So armed with a script that had successfully rendered the Gettysburg address I switched the language code to British English, converted my first podcast file to FLAC and set the conversion running.

The output is pretty hilarious and while you can follow what was probably being said it feels like reading a phonetic version of Elizabethan English. I hadn’t listened to this particular episode (because I really don’t listen to podcasts, even when I’m experimenting on them) but I did know that the presenters are excessively Northern and therefore when I read the text “we talk Bob” I realised that it probably meant “we are talking bobbins”. Other gems: “threw” had been rendered as “flu” and “loathsome” as “lord some”. Phonentically if you know the accent you can get the sense of what was being talked about and the more mundane the speech the better the transcription was. However it was in no way an easy read.

I realised that I was probably overly ambitious going from a US thespian performing a classic of political speechwriting to colloquial Northern and London voices. So next I chose a US episode, more or less the first thing I could get an MP3 download of (loads of the shows are actually shared on services that don’t allow you access to the raw material).

This was even worse because I lacked the cultural context but even if I had, I have no idea how to interpret “what I’m doing ceiling is yucky okay so are energy low-energy hi”.

The US transcript was even worse than the British one, partly I think because the show I had chosen seems to have the presenters talking over one another or speaking back and forth very rapidly. One of them also seems to repeat himself when losing his chain of thought or wanting to emphasise something.

My next thought was to try and find a NPR style podcast with a single professional presenter but at this point I was losing interest. The technology was driving what content I was considering rather than bringing the content I wanted to engage with to a different medium.

You Tube audio

If you’ve ever switched on automatic captioning in Youtube then you’ve actually seen this API in action, the text and timestamps in the JSON output are pretty much the same as what you see in both the text transcript and the in-video captioning. My experience is that the captioning is handy in conjunction with the audio but if I was fully deaf I’m not sure I would understand much about what was going on in the video from the auto-generated captions.

Similarly here, the more you understand the podcast you want to transcribe the more legible the transcription is. For producing a readable text that would reasonably represent the content of the podcasts at a skim reading level the technology doesn’t work yet. The unnatural construction of the text means you have to quite actively read it and put together the meaning yourself.

I had a follow-up idea of using speech to text and then automated translation to be able to read podcasts in other languages but that is obviously a non-starter as the native language context is vital for understanding the transcript.

Overall then a noble failure; given certain kinds of content you can actually create pretty good text transcriptions but as a way of keeping tabs on informal, casual audio material, particularly with multiple participants this doesn’t work.

Costs

I managed to blow through a whole £7 for this experiment which actually seemed like a lot for two podcasts of less than an hour and a seven minute piece of audio. In absolute terms though it is less than proverbial avocado on toast.

Future exploration

Meeting transcription technology is meant to be pretty effective including identifying multiple participants. I haven’t personally used any and most of the services I looked at seemed aimed at business and enterprise use and didn’t seem very pay as you go. These however might be a more viable path as there is clearly a level of specialisation that is needed on top of the off-the-shelf solutions to get workable text.

August 2023 month notes

I have been doing a GraphQL course that is driven by email. I can definitely see the joy of having autocompletion on the types and fields of the API interface. GraphQL seems to have been deployed way beyond its initial use case and it will be interesting to see if its a golden hammer or genuinely works better than REST-based services outside the abstraction to frontend service. It is definitely a complete pain in the ass compared to HTTP/JSON for hobby projects as having to ship a query executor and client is just way too much effort compared to REST and more again against maybe not doing a Javascript app interface.

I quite enjoyed the course, and would recommend it, but it mostly covered creating queries so I’ll probably need to implement my own service to understand how to bind data to the query language. I will also admit that while it is meant to be quite easy to do each day I ended up falling behind and then going through half of it on the weekend.

Hashicorp’s decision to change the license on Terraform has caused a lot of anguish on my social feeds. The OpenTerraform group has already announced that they will be creating a fork and are also promising to have more maintainers than Hashicorp. To some extent the whole controversy seems like a parade of bastards and it is hard to choose anyone as being in the right but it makes most sense to use the most open execution of the platform (see also Docker and Podman).

In the past I’ve used CloudFormation and Terraform, if I was just using AWS I would probably be feeling smug with the security of my vendor lock-in but Terraform’s extensibility via its provider mechanisms meant you could control a lot of services via the same configuration language. My current work uses it inconsistently which is probably the worst of all worlds but for the most part it is the standard for configuring services and does have some automation around it’s application. Probably the biggest advantage of Terraform was to people switching clouds (like myself) as you don’t have to learn a completely new configuration process, just the differences with the provider and the format of the stanzas.

The discussion of the change made we wonder if I should look at Pulumi again as one of the least attractive things about Terraform is its bizarre status as not quite a programming language, not quite Go and not quite a declarative configuration. I also found out about Digger which is attempting to avoid having two CI infrastructures for infrastructure changes. I’ve only ever seen Atlantis used for this so I’m curious to find out more (although it is such an enterprise level thing I’m not sure I’ll do much than have an opinion for a while).

I also spent some time this month moving my hobby projects from Dataset to using basic Pyscopg. I’ve generally loved using Dataset as it hides away the details of persistence in favour of passing dictionaries around. However it is a layer over SQLAlchemy which is itself going through some major point revisions so the library in its current form is stuck with older versions of both the data interaction layer and the driver itself. I had noticed that for one of my projects queries were running quite slowly and comparing the query time direct into the database compared to that arriving through the interface it was notable that some queries were taking seconds rather than microseconds.

The new version of Psycopg comes with a reasonably elegant set of query primitives that work via context managers and also allows results to be returned in a dictionary format that is very easy to combine with NamedTuples which makes it quite easy to keep my repository code consistent with the existing application code while completely revamping the persistence layer. Currently I have replaced a lot of the inserts and selects but the partial updates are proving a bit trickier as dataset is a bit magical in the way it builds up the update code. I think my best option would be to try and create an SQL builder library or adapt something like PyPika which I’ve used in another of my projects.

One of the things that has surprised me in this effort is how much the official Python documentation does not appear in Google search results. Tutorial style content farms have started to dominate the first page of search results and you have to add a search term like “documentation” to surface it now. People have been complaining about Google’s losing battle with content farms but this is the first personal evidence I have of it. Although I always add “MDN” to my Javascript and CSS searches so maybe this is just the way of the world now, you have to know what the good sites are to find them…

Programming

London Django Meetup May 2023

Just one talk this time and it was more of a discussion of the cool things you can do with Postgres JSON fields. These are indeed very cool! Everything I wanted to do with NoSQL historically is now present in a relational database without compromise on performance or functionality, that is an amazing achievement by the Postgres team.

The one thing I did learn is that all the coercion and encoding information is held in the Django model and query logic which means you only have basic types in the column. I previously worked on a codebase that used SQLAlchemy and a custom encoder and decoder which split custom types into a string field with the Python type hint (e.g. Decimal, UUID) and the underlying value. By comparison with the Django implementation which appears to just use strings this is a leaky abstraction where the structure of the data is compromised by the type hint.

Using the Django approach would have been easier when using direct SQL on the database and followed the principle of least surprise.

The speaker was trying to make a case for performing aggregate calculations in the database but via the Django ORM query language which wasn’t entirely convincing. Perhaps if you have a small team but the resulting query language code was more complex that the underlying query and was quite linked to the Postgres implementation so it felt that maybe a view would have been a better approach unless you have very dynamic calculations that are only applied for a fixed timespan.

It was based on an experience report so it clearly worked for the implementing group but if felt like the approach strongly coupled the database, the web framework and the query language.

Python

London Django Meetup April 2023

I’m not sure whether I’ve ever been to this Meetup before but it is definitely the first since 2020. It was hosted by Kraken Energy in their offices which have a plywood style auditorium with a nice AV setup for presentations and pizza and drinks (soft and hard) for attendees.

There were two talks: one on carbon estimates for websites built using Django and Wagtail; the other about import load times when loading a Django app into a shell (or more generally expensive behaviour in Python module imports).

Sustainable or low impact computing is a topic that is slowly gaining some traction in the wider development community and in the case of the web there are some immediate quick wins in the form of content negotiation on image formats, lazy loading and caching to be had.

One key takeaway from the talk is that the end user space is the area where most savings are possible. Using large scale cloud hosting means that you are already benefiting from energy efficiencies so things like the power required for a mobile phone screen matters because the impact of inefficient choices in content delivery is multiplied by the size of your audience.

There was a mention in passing that if a web application could be split into a Functions as a Service (FaaS) deployable then, for things like Django that have admin paths and end user paths, you can scale routes independently and save on overprovisioning. If this could be done automatically in the deployment build it would be seamless from the developer’s point of view. I think you can do this via configuration in the Serverless framework. It seems an interesting avenue for making more efficient deployments but at a cost in complexity for the framework builders.

There was quite an interesting research opportunity mentioned in the talk around serverless-style databases. For sites with intermittent or cyclical usage having an “always on” database represents a potentially big saving on cost and carbon. There was mention of the service neon.tech which seems to have a free personal tier which might be perfect for hobby sites where usage is very infrequent and a spin up time would be acceptable.

The import time talk was interesting, it focused on the developer experience of the Django shell boot time (although to be honest a Python shell for any major framework has the same issues). There were some practical tips on avoiding libraries with way too much going on during the import phase but really the issue of Python code doing expensive eager activity during import has been a live one for a long time.

I attended a talk about cold starts with Python AWS Lambdas in 2019 that essentially boiled down to much of the same issues (something addressed, but not very well in this AWS documentation on imports). Little seems to have improved since and assumptions about whether a process is going to be short or long-lived ultimately boils down to the library implementer and the web/data science split in Python means that code is run in very different contexts making sharing libraries across these two use cases hard.

The core language implementation is getting faster but consensus on good practice in import time behaviour is not a conversation that seems to be happening between the major library maintainers.

The performance enhancements for core Python actually linked the two talks because getting existing code onto more efficient runtimes helps reduce compute demands across all usage.

Blogging, Python

PyCon UK 2022

This was the first in-person PyCon since the start of the pandemic. It had slightly changed format as well, now being mostly single-track (plus workshops) and not having a teacher/youth day. Overall I found the changes welcome. I’m generally a big fan of single track conferences because they simplify the choices and help concentrate the conversation amongst the attendees.

A lot of talks were by developer advocates but some of the most interesting talks came from the core language maintainers talking about the challenges in balancing the needs of different parts of the community while the least interesting were from the company sponsors who generally didn’t seem to have refined their content for a general audience.Typescript

A simple lightning talk about the need to sanitise the input of the builtin int function was illuminating about the challenges of reconciling web needs versus data science. However the general point about whether unlimited number functions are necessary or not was interesting. Similarly the challenge of when to adopt new syntax and keywords in the case of the addition of Exception Groups to the language.

There were a few talks about how the process of maintaining a language and community actually works. How conferences get listed on the Python website, how performance benchmarks are built and used and the desire to have a single place to centre conversation about the language.

There were talks about how to interface to various technologies with Python (Kafka, Telegram) and the inevitable talk about how to improve the performance of your data science code by using NumPy. There were also quite a few talks about Hypothesis; which probably still isn’t used as much as it should be in the community. The library now includes a test generator that can examine code and suggest a test suite for it which is quite a cool feature and certainly would be a boon for those who dislike test-first approaches (whether it has the same quality benefits is another question).

The other talk that had a big impact on my was the introduction to using PyScript in static websites. Python is going to start targeting WebAssembly as a platform which will help standardise the various projects trying to bring a fully functional interpreter to the browser and provide a place to pool efforts on improvements.

PyScript aims to allow Python to also script DOM elements and be an alternative scripting language in the browser. That’s fun but not massively compelling in my view (having done many compile to Javascript languages in the past I think it’s easier to just do Javascript (with an honourable exception for Typescript). Being able to run existing code and models in the browser without change maximises the value and potential of your existing Python codebases.

Cardiff remains a lovely venue for the conference and I think it is worth taking time out from the tracks to enjoy a bit of the city and the nearby Bute Park.

Keeping the batteries included

Python is well-known for being a “batteries included” language which means it comes with a rich variety of modules that work right out of the box. However this recent post about the regular expression library shows the problems that can occur from shipping such a wide variety of code as part of the language core. Maintaining a wide-ranging codebase is challenging as is keeping it aligned to a language’s release cadence.

The problems suggest that really the language should focus just on the core tools and syntax of the language with everything else being on a different release cycle. However by itself that core isn’t too useful for pragmatic coding purposes. Curation of code and having a sensible selection of libraries is a challenge. Some languages like Clojure and Elm have a controlled ecosystem of libraries that are adjacent to a small language core. However here the difference between curation and gatekeeping is fine and it feels like only languages with a large community can do it effectively.

Perhaps the answer is to have a basic implementation of core functionality in the core but to use the language documentation to suggest alternative libraries. This moves the problem to the documentation team but hopefully this is a simpler arena to both maintain and curate.

Programming

Recommendation: use explicit type names

I attended an excellent talk recently at Lambdale about teaching children to code (not just Scratch but actual computer languages) and one of the points of feedback was that the children found the use of names in type declarations and their corresponding implementations in languages such as Haskell confusing because frequently names are reused despite the two things being completely different categories of things.

The suggestion was that instead of f[A] -> B it was simpler to say f[AType] -> BType.

Since I am in the process of introducing Python type checking at work this immediately sparked some interest in me and when I returned to work I ran a quick survey with the team that revealed that they too preferred the explicit type name with the suffix Type rather than package qualifier for types that I had been using.

Instead of kitchen_types.Kettle the preference was for KettleType.

The corresponding function type annotations would therefore read approximately:

def boil(water: WaterType, kettle: KettleType) -> WaterType

So that’s the format we’re going to be adopting.

Echo One

Sequentially arranged sentences composed of words (and punctuation)

Tag Archives: Python