For my Bachelor’s thesis I trained models on a dataset of videos to predict which word the person in each video was uttering. I tried three approaches: PCA + Support Vector Machine, PCA + Hidden Markov Model, and neural networks. You can check out the project's GitHub repo.
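
To give a rough idea of what the PCA + Support Vector Machine approach looks like in practice, here is a minimal sketch. It assumes scikit-learn and uses random synthetic feature vectors in place of real video frames; the function names, the feature size, the number of PCA components, and the RBF kernel are all illustrative assumptions, not taken from the thesis code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_synthetic_clips(n_per_word=40, n_words=3, n_features=200, seed=0):
    """Hypothetical stand-in for flattened video features: one noisy cluster per word."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_words, n_features))
    X = np.vstack([c + rng.normal(size=(n_per_word, n_features)) for c in centers])
    y = np.repeat(np.arange(n_words), n_per_word)
    return X, y


def build_pca_svm(n_components=10):
    """Standardize features, reduce dimensionality with PCA, then classify with an SVM."""
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        SVC(kernel="rbf"),  # kernel choice is an assumption
    )


X, y = make_synthetic_clips()

# Simple alternating split: even rows for training, odd rows for testing.
train = np.arange(len(y)) % 2 == 0
model = build_pca_svm().fit(X[train], y[train])
accuracy = model.score(X[~train], y[~train])
print(f"held-out accuracy: {accuracy:.2f}")
```

With real data, each row of `X` would be a feature vector extracted from a clip and `y` the index of the spoken word. PCA comes first because raw pixel features are high-dimensional and an SVM trains faster and generalizes better on a compact representation.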