QA On Private Documents (RAG)

SkimLit

Food Vision Big

Predicting Coronary Heart Disease

  • Developed a question-answering system using OpenAI, Pinecone, and LangChain that retrieves information from private documents and other documents not covered by the model’s training data.

  • Implemented a retrieval-augmented generation (RAG) pipeline to process and query large text corpora, using OpenAI’s text-embedding-ada-002 embeddings and the Pinecone vector database to store and search document chunks.

  • Built a complete application capable of real-time querying on custom data, demonstrating that it can generate accurate answers from documents outside the model’s training data.

  • Replicated the NLP model from the 2017 paper "PubMed 200k RCT" for sequential sentence classification in medical abstracts, using its dataset of 200,000 labelled RCT abstracts to make literature review more efficient.

  • Developed and iterated through multiple model architectures, including TF-IDF classifiers, deep learning models with various embeddings, and multimodal models, to support faster skimming of abstracts.

  • Integrated preprocessing and modelling techniques, including spaCy for text segmentation and neural network models for sentence classification, with the aim of deploying the model in extensions for real-time literature structuring.

  • Developed the Food Vision Big model using TensorFlow, reaching 80.2% accuracy on the Food101 dataset of 101,000 images and surpassing the 2016 DeepFood CNN model.

  • Applied training optimisations including prefetching and mixed precision training, bringing model training time to approximately 20 minutes, compared with the 2-3 days reported in the DeepFood paper.

  • Utilised TensorFlow Datasets for efficient data handling, created preprocessing functions, optimised data batching, and applied feature-extraction and fine-tuning transfer learning strategies.

  • Conducted research under Dr Jixin Ma, developing a system that compares several classification models for predicting coronary heart disease from historical medical data in the UCI Machine Learning Repository.

  • Built the project with Pandas, NumPy, Matplotlib, Seaborn, and Scikit-Learn, and evaluated the classification models using ROC curves, AUC scores, confusion matrices, and classification reports.

  • Implemented supervised machine learning algorithms, including Logistic Regression, K-Nearest Neighbours, and Random Forest, achieving an accuracy of 89% through hyperparameter tuning and cross-validation.